Optimizing

    There are optimizations that can be applied based on specific use cases.

    Optimizations can apply to different properties of the running environment, be it the time tasks take to execute, the amount of memory used, or responsiveness at times of high load.

    In the book Programming Pearls, Jon Bentley presents the concept of back-of-the-envelope calculations by asking the question;

    The point of this exercise is to show that there is a limit to how much data a system can process in a timely manner. Back of the envelope calculations can be used as a means to plan for this ahead of time.

    In Celery; If a task takes 10 minutes to complete, and there are 10 new tasks coming in every minute, the queue will never be empty. This is why it’s very important that you monitor queue lengths!

    A way to do this is by using Munin. You should set up alerts, that will notify you as soon as any queue has reached an unacceptable size. This way you can take appropriate action like adding new worker nodes, or revoking unnecessary tasks.

    If you’re using RabbitMQ (AMQP) as the broker then you can install the librabbitmq module to use an optimized client written in C:

    The ‘amqp’ transport will automatically use the librabbitmq module if it’s installed, or you can also specify the transport you want directly by using the pyamqp:// or librabbitmq:// prefixes.

    Broker Connection Pools

    You can tweak the BROKER_POOL_LIMIT setting to minimize contention, and the value should be based on the number of active threads/greenthreads using broker connections.

    Queues created by Celery are persistent by default. This means that the broker will write messages to disk to ensure that the tasks will be executed even if the broker is restarted.

    But in some cases it’s fine that the message is lost, so not all tasks require durability. You can create a transient queue for these tasks to improve performance:

    1. CELERY_QUEUES = (
    2. Queue('transient', routing_key='transient',
    3. delivery_mode=1),

    The delivery_mode changes how the messages to this queue are delivered. A value of 1 means that the message will not be written to disk, and a value of 2 (default) means that the message can be written to disk.

    To direct a task to your new transient queue you can specify the queue argument (or use the setting):

    For more information see the routing guide.

    Prefetch Limits

    Prefetch is a term inherited from AMQP that is often misunderstood by users.

    The prefetch limit is a limit for the number of tasks (messages) a worker can reserve for itself. If it is zero, the worker will keep consuming messages, not respecting that there may be other available worker nodes that may be able to process them sooner [†], or that the messages may not even fit in memory.

    The workers’ default prefetch count is the setting multiplied by the number of concurrency slots[*]_ (processes/threads/greenthreads).

    However – If you have many short-running tasks, and throughput/round trip latency is important to you, this number should be large. The worker is able to process more tasks per second if the messages have already been prefetched, and is available in memory. You may have to experiment to find the best value that works for you. Values like 50 or 150 might make sense in these circumstances. Say 64, or 128.

    If you have a combination of long- and short-running tasks, the best option is to use two worker nodes that are configured separately, and route the tasks according to the run-time. (see Routing Tasks).

    When using early acknowledgement (default), a prefetch multiplier of 1 means the worker will reserve at most one extra task for every active worker process.

    When users ask if it’s possible to disable “prefetching of tasks”, often what they really want is to have a worker only reserve as many tasks as there are child processes.

    But this is not possible without enabling late acknowledgements acknowledgements; A task that has been started, will be retried if the worker crashes mid execution so the task must be (see also notes at Should I use retry or acks_late?).

    You can enable this behavior by using the following configuration options:

    1. CELERY_ACKS_LATE = True
    2. CELERYD_PREFETCH_MULTIPLIER = 1

    Prefork pool prefetch settings

    The prefork pool will asynchronously send as many tasks to the processes as it can and this means that the processes are, in effect, prefetching tasks.

    This benefits performance but it also means that tasks may be stuck waiting for long running tasks to complete:

    The worker will send tasks to the process as long as the pipe buffer is writable. The pipe buffer size varies based on the operating system: some may have a buffer as small as 64kb but on recent Linux versions the buffer size is 1MB (can only be changed system wide).

    With this option enabled the worker will only write to workers that are available for work, disabling the prefetch behavior.