We started with a single Celery worker handling everything. Eight months and three architecture changes later, here's what scaled and what we learned about queue design.
A team I worked with last year started a backend with a simple shape: Celery workers behind Redis, handling all async work — image processing, email sending, billing reconciliation, webhook deliveries. It worked beautifully for about six months. Then it didn't, and the way it broke taught us a lot about how worker queues actually scale.
[ web ] → [ Redis ] → [ Celery worker pool (3 instances) ]
Three workers, default Celery concurrency (8 threads each), one Redis instance, one queue. Tasks ran fine. Throughput at 100-200 jobs/min was painless. Median latency from enqueue to start was ~50ms.
A single slow task type took down the queue.
Around month seven, we added a "generate PDF report" task that took 15-30 seconds each. Volume was modest — maybe 50/hour. Nothing alarming. But each PDF held a worker thread for its entire duration. With 24 total threads (3 × 8) and 25 PDFs running at a peak, we'd routinely have all threads busy on PDFs while urgent tasks (password reset emails, payment webhooks) sat in the queue for minutes.
The pattern: head-of-line blocking from a slow task type. The queue wasn't full; the workers were. From outside, it looked like a queue problem; the actual issue was thread allocation.
The first response was to give different task types different queues:
[ web ] → [ Redis ]
├── default (general tasks)
├── urgent (password resets, payment webhooks)
└── slow (PDF generation, exports, reconciliation)
[ workers ]
├── 2 instances watching: urgent, default
└── 1 instance watching: slow
Workers configured to consume from specific queues:
# urgent + default workers
celery -A app worker -Q urgent,default --concurrency=8
# slow workers
celery -A app worker -Q slow --concurrency=4
The key was that no worker was watching ALL queues. The urgent worker can never get tied up on a slow PDF; it physically isn't subscribed to that queue.
Effect: urgent task latency stopped degrading during PDF spikes. P99 enqueue-to-start dropped from ~3 minutes back to ~80ms even when slow workers were saturated.
A retry storm.
Our webhook delivery task retried on failure with exponential backoff. When a downstream provider had an outage, our queue filled with retries. Celery's default retry uses countdown parameter, which means the retried task is re-enqueued with a delay. Redis handled the volume fine; what broke was that legitimate new webhooks got mixed in with thousands of retries on the same queue.
Symptom from outside: new webhook deliveries took 20+ minutes during the upstream outage, even after the upstream came back. The backlog had to clear first.
[ webhook delivery task ]
→ first attempt: queue = webhook
→ after first fail: queue = webhook-retry (with backoff)
→ after 5 retries: queue = webhook-dead (manual review)
A dedicated worker pool consumed the retry queue, separate from the live webhook queue. New webhook deliveries weren't blocked by the retry storm.
The dead letter queue is a Redis list that nothing consumes automatically. When a task hits it, an alert fires; an engineer triages. We've kept this conservative — dead letter is "I have given up; help me." The threshold for moving to dead letter is real failure, not transient.
This pattern is the queue version of circuit-breaking. The retry queue absorbs the storm without contaminating live traffic.
Slow database queries inside tasks.
A task that "just" updated a row started taking 2-5 seconds each because the table had grown and was missing an index. Default Celery worker concurrency is threaded; threads share a Python process and a database connection pool. With 8 threads on a worker and a 50-connection DB pool, we were per-worker connection-bound during slow query periods.
We saw this as queue depth growing during certain hours, but workers reporting low CPU. Workers were waiting on DB.
We split workers by characteristic:
gevent)Each got its own queue, deployment, and resource profile. One Celery codebase but multiple deployments with different --pool and --concurrency settings.
For I/O-bound work we moved to gevent-based concurrency, which lets a worker handle 32+ concurrent I/O operations on a single thread. Massive throughput improvement on the DB-update path; CPU on those workers stayed under 30%.
Redis as a Celery broker handles a remarkable amount of load before becoming the bottleneck. We were at ~3000 tasks/min before any Redis-side metric started looking concerning. The first Redis issue we hit was actually memory: Celery stores task results in Redis by default, and we'd been keeping them forever.
Fix: set result_backend_transport_options = {'result_chain_lifespan': 3600} and task_ignore_result = True for tasks that don't need their result fetched (most of them). Memory usage dropped 80%.
For very high throughput (10k+ tasks/min sustained), Redis becomes the bottleneck and you'd want to consider RabbitMQ or a streaming platform like Kafka. We never got there.
If I were starting this stack tomorrow knowing what we learned:
default, urgent, slow. The cost of having extra queues is low; the cost of finding out you needed them at month seven is high.We never moved to RabbitMQ or Kafka. Redis Celery handled our peak (~6k tasks/min) without falling over. The complexity of moving brokers wasn't justified.
We didn't adopt async task frameworks like arq or Dramatiq. Celery has a long learning curve but is well-understood by the team. Switching was tempting; the cost of training and migration outweighed the benefit.
We never built per-customer queue isolation. We've been close to needing it (one large customer's batch jobs sometimes saturate a worker pool), but for now, fairness scheduling at the worker level (Celery has it; we use it sparingly) has been enough.
The team I work with now has avoided some of our mistakes by knowing about them:
time_limit on every task.requests with timeout= always.Three queue-side metrics we always monitor:
Plus three task-side metrics:
The combination tells you whether the issue is the queue (backlog with idle workers = nothing dispatching), the workers (high utilisation, slow tasks), or the tasks themselves (failure or retry spikes).
Use Celery if you already know it. It's powerful and the Python community knows it. The shape we ended up with — multiple queues, multiple worker pools, retry/dead-letter discipline — works well, but takes effort to get right.
If you don't know Celery, look at simpler options first. arq for Redis-backed async tasks. Dramatiq if you want Celery's shape with simpler internals. The complexity of Celery is worth it once you have the scaling problems we hit; not worth it before.
Whichever framework you use, the architectural patterns transfer: separate queues for separate latency profiles, retry storms get their own lane, sized concurrency per task type. The framework matters less than the queue design.
Get the latest tutorials, guides, and insights on AI, DevOps, Cloud, and Infrastructure delivered directly to your inbox.
We cut our average CI build time from 28 minutes to 6 minutes. The changes that mattered, ranked by impact.
A field report from rolling out retrieval-augmented generation in production, including cache bugs, bad embeddings, and how we fixed them.
Explore more articles in this category
Embedding indexes degrade silently. The signals that catch drift, how often to re-embed, and the operational patterns we built after one quiet quality regression.
Streaming LLM responses is easy until the client disconnects, the model stalls, or the user cancels. The patterns that keep streaming responsive without leaking spend.
When LLMs can call tools that change real state, the design decisions that matter most are about what's gated, what's automatic, and what triggers a human checkpoint.
Evergreen posts worth revisiting.