We started with a single Celery worker handling everything. Eight months and three architecture changes later, here's what scaled and what we learned about queue design.

On this page

Python Worker Queue Scaling Patterns: Architecture Review

A team I worked with last year started a backend with a simple shape: Celery workers behind Redis, handling all async work — image processing, email sending, billing reconciliation, webhook deliveries. It worked beautifully for about six months. Then it didn't, and the way it broke taught us a lot about how worker queues actually scale.

What we started with #

code

[ web ] → [ Redis ] → [ Celery worker pool (3 instances) ]

Three workers, default Celery concurrency (8 threads each), one Redis instance, one queue. Tasks ran fine. Throughput at 100-200 jobs/min was painless. Median latency from enqueue to start was ~50ms.

What broke first #

A single slow task type took down the queue.

Around month seven, we added a "generate PDF report" task that took 15-30 seconds each. Volume was modest — maybe 50/hour. Nothing alarming. But each PDF held a worker thread for its entire duration. With 24 total threads (3 × 8) and 25 PDFs running at a peak, we'd routinely have all threads busy on PDFs while urgent tasks (password reset emails, payment webhooks) sat in the queue for minutes.

The pattern: head-of-line blocking from a slow task type. The queue wasn't full; the workers were. From outside, it looked like a queue problem; the actual issue was thread allocation.

Architecture change 1: priority queues #

The first response was to give different task types different queues:

code

[ web ] → [ Redis ]
            ├── default (general tasks)
            ├── urgent (password resets, payment webhooks)
            └── slow  (PDF generation, exports, reconciliation)

[ workers ]
  ├── 2 instances watching: urgent, default
  └── 1 instance watching: slow

Workers configured to consume from specific queues:

python.python

# urgent + default workers
celery -A app worker -Q urgent,default --concurrency=8

# slow workers
celery -A app worker -Q slow --concurrency=4

The key was that no worker was watching ALL queues. The urgent worker can never get tied up on a slow PDF; it physically isn't subscribed to that queue.

Effect: urgent task latency stopped degrading during PDF spikes. P99 enqueue-to-start dropped from ~3 minutes back to ~80ms even when slow workers were saturated.

What broke next #

A retry storm.

Our webhook delivery task retried on failure with exponential backoff. When a downstream provider had an outage, our queue filled with retries. Celery's default retry uses countdown parameter, which means the retried task is re-enqueued with a delay. Redis handled the volume fine; what broke was that legitimate new webhooks got mixed in with thousands of retries on the same queue.

Symptom from outside: new webhook deliveries took 20+ minutes during the upstream outage, even after the upstream came back. The backlog had to clear first.

Architecture change 2: separate retry queue + dead letter #

code

[ webhook delivery task ]
    → first attempt:    queue = webhook
    → after first fail: queue = webhook-retry  (with backoff)
    → after 5 retries:  queue = webhook-dead   (manual review)

A dedicated worker pool consumed the retry queue, separate from the live webhook queue. New webhook deliveries weren't blocked by the retry storm.

The dead letter queue is a Redis list that nothing consumes automatically. When a task hits it, an alert fires; an engineer triages. We've kept this conservative — dead letter is "I have given up; help me." The threshold for moving to dead letter is real failure, not transient.

This pattern is the queue version of circuit-breaking. The retry queue absorbs the storm without contaminating live traffic.

What broke next #

Slow database queries inside tasks.

A task that "just" updated a row started taking 2-5 seconds each because the table had grown and was missing an index. Default Celery worker concurrency is threaded; threads share a Python process and a database connection pool. With 8 threads on a worker and a 50-connection DB pool, we were per-worker connection-bound during slow query periods.

We saw this as queue depth growing during certain hours, but workers reporting low CPU. Workers were waiting on DB.

Architecture change 3: explicit pool sizing per task class #

We split workers by characteristic:

CPU-bound (image processing, PDF gen): low concurrency (2-4), more processes
I/O-bound (DB updates, API calls): high concurrency (16-32 threads or gevent)
Memory-heavy (data exports): low concurrency (1-2), large memory limits

Each got its own queue, deployment, and resource profile. One Celery codebase but multiple deployments with different --pool and --concurrency settings.

For I/O-bound work we moved to gevent-based concurrency, which lets a worker handle 32+ concurrent I/O operations on a single thread. Massive throughput improvement on the DB-update path; CPU on those workers stayed under 30%.

What we learned about Redis as a broker #

Redis as a Celery broker handles a remarkable amount of load before becoming the bottleneck. We were at ~3000 tasks/min before any Redis-side metric started looking concerning. The first Redis issue we hit was actually memory: Celery stores task results in Redis by default, and we'd been keeping them forever.

Fix: set result_backend_transport_options = {'result_chain_lifespan': 3600} and task_ignore_result = True for tasks that don't need their result fetched (most of them). Memory usage dropped 80%.

For very high throughput (10k+ tasks/min sustained), Redis becomes the bottleneck and you'd want to consider RabbitMQ or a streaming platform like Kafka. We never got there.

What we'd do from scratch #

If I were starting this stack tomorrow knowing what we learned:

Multiple queues from day one. At minimum: default, urgent, slow. The cost of having extra queues is low; the cost of finding out you needed them at month seven is high.
Separate workers per queue type. Not "all workers consume all queues." Type-specific worker pools with sized concurrency.
Retry queue + dead letter from the start. The patterns are well-known; baking them in initially is cheaper than retrofitting during an incident.
Result backend off by default, on per-task as needed. Memory use you don't budget for is memory use that surprises you.
Latency dashboards per queue. Not "queue depth" but "enqueue-to-start time, p50 / p95 / p99." Tells you which queue is healthy and which is starving.

What we didn't do #

We never moved to RabbitMQ or Kafka. Redis Celery handled our peak (~6k tasks/min) without falling over. The complexity of moving brokers wasn't justified.

We didn't adopt async task frameworks like arq or Dramatiq. Celery has a long learning curve but is well-understood by the team. Switching was tempting; the cost of training and migration outweighed the benefit.

We never built per-customer queue isolation. We've been close to needing it (one large customer's batch jobs sometimes saturate a worker pool), but for now, fairness scheduling at the worker level (Celery has it; we use it sparingly) has been enough.

Common mistakes to avoid #

The team I work with now has avoided some of our mistakes by knowing about them:

Mixing task types on a single queue. Looks simple; falls down at scale.
Default Celery concurrency for everything. The right concurrency depends on the task type. Pick deliberately.
No timeout on tasks. A stuck task can hold a thread indefinitely. Set time_limit on every task.
Synchronous downstream calls without timeouts. A worker waiting forever on a hung HTTP call is a worker not doing useful work. requests with timeout= always.
Logging every task execution at INFO. At high volume this fills disk and floods aggregation pipelines. Use DEBUG for routine, INFO for unusual.

Operational metrics that mattered #

Three queue-side metrics we always monitor:

Queue depth, per queue — backlog
Enqueue-to-start time, p95 per queue — latency under load
Worker utilisation, per queue — are we saturated, idle, or balanced

Plus three task-side metrics:

Task duration, p95 per task type — drift in task complexity
Task failure rate, per task type — quality regression
Retry rate, per task type — reliability of the underlying operation

The combination tells you whether the issue is the queue (backlog with idle workers = nothing dispatching), the workers (high utilisation, slow tasks), or the tasks themselves (failure or retry spikes).

What I'd tell someone setting up Celery for the first time #

Use Celery if you already know it. It's powerful and the Python community knows it. The shape we ended up with — multiple queues, multiple worker pools, retry/dead-letter discipline — works well, but takes effort to get right.

If you don't know Celery, look at simpler options first. arq for Redis-backed async tasks. Dramatiq if you want Celery's shape with simpler internals. The complexity of Celery is worth it once you have the scaling problems we hit; not worth it before.

Whichever framework you use, the architectural patterns transfer: separate queues for separate latency profiles, retry storms get their own lane, sized concurrency per task type. The framework matters less than the queue design.

Architecture Review: Python Worker Queue Scaling Patterns

Python Worker Queue Scaling Patterns: Architecture Review

What we started with #

What broke first #

Architecture change 1: priority queues #

What broke next #

Architecture change 2: separate retry queue + dead letter #

What broke next #

Architecture change 3: explicit pool sizing per task class #

What we learned about Redis as a broker #

What we'd do from scratch #

What we didn't do #

Common mistakes to avoid #

Operational metrics that mattered #

What I'd tell someone setting up Celery for the first time #

Stay Updated

CI/CD Pipeline Optimization: Speeding Up Your Builds

Real-World RAG Incidents: Lessons from a Production Rollout

More from AI

Token Budgeting for Long-Context Prompts: What to Cut First

Multi-Provider LLM Gateways: Routing, Fallback, and Cost Control

Streaming LLM Responses: SSE, Backpressure, and Cancellation

Token Budgeting for Long-Context Prompts: What to Cut First

Multi-Provider LLM Gateways: Routing, Fallback, and Cost Control

Streaming LLM Responses: SSE, Backpressure, and Cancellation

Choosing an Embedding Model: Dimensions, Cost, and MTEB Reality

External Secrets Operator: One Secrets Workflow Across Clouds

Four Signals That Matter: Choosing SLIs Users Actually Feel

You might have missed

GitOps with Argo CD: Best Practices for 2025

Process Management and Monitoring in Linux

Linux Performance Tuning for Containers and Kubernetes Nodes

About Kiril Urbonas