We run three different job queue systems across our services. The patterns that work across all of them, the differences that matter, and the operational gotchas.
We run three different job queue systems in production — Sidekiq for our Ruby services, Celery for Python, BullMQ for Node. Different ecosystems, same fundamental design. The patterns that matter are the same across all three; the differences are mostly syntactic. This post is what we've learned running them at modest scale.
Why push work onto a queue at all? Three reasons recur:
Latency — work the user doesn't need to wait for. Sending a welcome email, indexing a document, regenerating a thumbnail. Do it after returning the response; the user doesn't care.
Reliability — work that needs to retry on failure. Calling an unreliable third-party API; processing a webhook; running periodic reconciliation. Jobs persist; retries are automatic.
Bursty load — work that arrives in waves. A queue smooths the spike; workers process at their own pace.
Each is a real need. Most apps that grow past a few users have some background-job system.
All three job queues share the same fundamentals: the app serializes a small payload onto a broker, worker processes pull jobs and execute them, failures are retried on a schedule, and jobs that exhaust their retries land in a dead-letter queue.
The mental model is identical across them. Switching languages is annoying (re-learn the syntax) but not architecturally meaningful.
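To make that concrete, here is a minimal sketch of the shared shape in Celery (Sidekiq and BullMQ look the same modulo syntax). The broker URL and the `load_document` / `search_index` helpers are illustrative stand-ins, not real code from our services.

```python
from celery import Celery

# One broker, one task definition, one enqueue call; the same three pieces
# exist in Sidekiq and BullMQ under different names.
celery_app = Celery("jobs", broker="redis://localhost:6379/0")

@celery_app.task
def index_document(document_id):
    # Look up current state at execution time, not enqueue time.
    doc = load_document(document_id)   # hypothetical data-access helper
    search_index.add(doc)              # hypothetical search client

# In the request handler: enqueue and return immediately.
index_document.delay(42)
```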
A few real differences worth knowing:
Sidekiq is the slickest of the three. Excellent Web UI out of the box, mature ecosystem, well-thought-out APIs, paid Pro/Enterprise tiers with serious features. Standard for Ruby. Memory-efficient because it runs many jobs as threads in a single process, which MRI handles fine for I/O-bound work.
Celery is the oldest and most flexible — supports multiple backends (RabbitMQ, Redis, etc.), multiple result stores, multiple worker models (thread, process, gevent, eventlet). The flexibility is also the problem: lots of ways to misconfigure it, lots of edge cases, awkward defaults. Mature but it shows its age.
BullMQ (the Node successor to Bull) is the newest. Modern API, TypeScript-first, Redis-only (which is a feature: fewer choices). Solid dashboard support. It has matured significantly in the last couple of years.
For new projects we pick the queue that matches the language ecosystem. Cross-language workflows happen via message buses (Kafka, SQS) instead of trying to run one queue system across languages.
Regardless of which queue you're using:
Idempotency. Jobs can run more than once (retries, network blips causing duplicate sends, etc.). Every job is designed to be safe on second execution. Either the operation is naturally idempotent (set a status, don't increment a counter), or the job checks a dedupe key before doing anything with side effects.
We've been burned multiple times by non-idempotent jobs. Always design for at-least-once execution.
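A minimal sketch of the dedupe-key approach, assuming workers can reach Redis; the key name, TTL, and `deliver_email` helper are hypothetical.

```python
import redis

r = redis.Redis()

def send_welcome_email(user_id, email_id):
    # SET NX claims the key only if it doesn't exist yet, so a duplicate
    # execution of the same logical send becomes a no-op.
    claimed = r.set(f"welcome-email:{email_id}", 1, nx=True, ex=86400)
    if not claimed:
        return  # already handled by an earlier attempt
    deliver_email(user_id)  # hypothetical delivery call
```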
Small, focused jobs. A job should do one thing. Don't bundle "send email + update database + call API" into one job. If one step fails and retries, the others run twice. Split into separate jobs chained together.
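In Celery, splitting works naturally as a chain; a sketch, with `send_email`, `update_record`, and `call_partner_api` as hypothetical task names (`.si()` makes each signature immutable so one task's return value isn't forced into the next):

```python
from celery import chain

def onboard_user(user_id):
    # Three small jobs, each retried independently, instead of one bundled job.
    chain(
        send_email.si(user_id),
        update_record.si(user_id),
        call_partner_api.si(user_id),
    ).apply_async()
```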
Pass IDs, not objects. A job takes a user_id, not a user object. Serializing a full object means stale data on retry (the object you serialized 5 minutes ago is older than the current state). Look up the current state when the job runs.
Bounded retries. Default retries (often 25 in Sidekiq, infinite in some Celery configs) are too high. We use 5-10 retries with exponential backoff for transient failures. After that, dead-letter the job and alert.
Timeouts. Every job has a max runtime. Jobs without timeouts hang forever and tie up workers. We set per-job timeouts (typically 5-30 minutes depending on the job).
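In Celery this maps to soft and hard time limits; a sketch assuming the same `celery_app` as the retry example below, with the limits and the `rebuild_report` / `cleanup_partial_output` helpers as placeholders:

```python
from celery.exceptions import SoftTimeLimitExceeded

@celery_app.task(bind=True, soft_time_limit=300, time_limit=330)
def regenerate_report(self, report_id):
    try:
        rebuild_report(report_id)            # hypothetical long-running work
    except SoftTimeLimitExceeded:
        # Soft limit fired at 5 minutes: clean up, then let the job fail
        # before the hard limit kills the worker process at 5m30s.
        cleanup_partial_output(report_id)
        raise
```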
Two patterns work well for different failure types:
Exponential backoff for transient failures. Network blips, rate limits, temporary 503s. Retry at 30s, 1m, 5m, 15m, 60m. By the time you've burned through 5 retries with backoff, the transient issue has either resolved or there's a real problem.
Don't retry on permanent failures. Validation errors, 404s, malformed input. Retrying won't help. We classify exceptions: transient → retry with backoff; permanent → dead-letter immediately, alert someone.
```python
# Celery example
@celery_app.task(bind=True, max_retries=5)
def send_email(self, user_id, template):
    try:
        send(user_id, template)
    except TransientError as exc:
        raise self.retry(exc=exc, countdown=30 * (2 ** self.request.retries))
    except PermanentError:
        # Don't retry; let it fail to dead-letter
        raise
```
The pattern works across all three queues with syntax changes.
A job that exhausts retries goes to a "dead letter" queue (or "failed" in Sidekiq). The dead-letter queue is not the end; it's a triage queue for humans.
Our dead-letter discipline: review the queue daily, re-enqueue jobs once the underlying issue is fixed, delete the ones that will never succeed, and alert when the count grows instead of letting it drift up silently.
Without discipline, dead-letter queues become graveyards of forgotten jobs. With it, they're a useful safety net.
All three queues support "run this job at time X" or "run this job in 5 minutes." Useful for delayed retries, follow-up notifications, and anything else that should happen at a known future time.
A few patterns:
Delay capping. Don't schedule jobs more than 30 days out. Long-delayed jobs survive in Redis state forever, take space, and are forgotten about. For longer-term "in 6 months" jobs, use a real schedule (cron, systemd timer, or a recurring scheduled job that fires daily and checks a database).
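For the recurring-job alternative, Celery's beat scheduler covers the "fire daily and check the database" pattern; a sketch, with the task name and schedule as placeholders:

```python
from celery.schedules import crontab

# Instead of scheduling a job six months out, run a small daily sweep
# that asks the database what is due today.
celery_app.conf.beat_schedule = {
    "send-due-reminders": {
        "task": "jobs.send_due_reminders",
        "schedule": crontab(hour=8, minute=0),
    },
}
```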
Jitter. Schedule retries with random jitter (e.g., "30 to 60 seconds") not exact times. Otherwise retries can synchronize after a downstream outage, creating thundering herds.
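A sketch of jittered backoff in the same style as the Celery retry example above; `push_order` and `TransientError` stand in for application code:

```python
import random

@celery_app.task(bind=True, max_retries=5)
def call_partner_api(self, order_id):
    try:
        push_order(order_id)                  # hypothetical downstream call
    except TransientError as exc:
        base = 30 * (2 ** self.request.retries)
        # Spread retries over [base, 2*base) so jobs that failed together
        # don't all come back at the same instant.
        raise self.retry(exc=exc, countdown=base + random.uniform(0, base))
```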
Idempotency keys on scheduled jobs. Especially important — a job scheduled twice (by accident or by a bug) will run twice unless dedupe is built in.
Workers are processes. Some patterns:
Don't over-thread. Sidekiq runs many threads in one process; Celery can do the same. Sounds efficient. In practice, more threads = more contention on shared state = more weird bugs. We run modestly threaded workers (10-25 threads per process) and scale horizontally instead.
Memory bloat. All three frameworks have a tendency to accumulate memory over time (Ruby's GC, Python references held by job code, Node closures). We set memory limits per worker and let the OS kill them periodically. The orchestrator restarts them.
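Celery, for example, can recycle its prefork worker children on its own; a sketch of the settings we mean, with the numbers as placeholders (Sidekiq and BullMQ deployments more often lean on the orchestrator's memory limits):

```python
# Recycle worker child processes before bloat becomes a problem.
celery_app.conf.worker_max_tasks_per_child = 500        # restart after N jobs
celery_app.conf.worker_max_memory_per_child = 512_000   # resident KiB, then restart
```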
Long-running vs short jobs. Most jobs are short. Long-running jobs (multi-minute or longer) should be on dedicated worker pools to avoid blocking the short-job pool. We have separate queues for "fast" and "slow" work; workers consume from one or the other, not both.
Graceful shutdown. Workers should finish current jobs before exiting on SIGTERM. Otherwise deployments interrupt jobs mid-execution, triggering retries. All three queues support this; configure the grace period.
What we monitor is the same across all three, with queue depth and dead-letter count as the two headline signals.
All three have either built-in metrics or Prometheus exporters. We unify them into a Grafana dashboard per service.
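When an exporter isn't in place yet, queue depth is cheap to check straight from Redis. A sketch for Sidekiq, which stores each queue as a Redis list named `queue:<name>`; the queue name and threshold here are illustrative.

```python
import redis

r = redis.Redis()

def sidekiq_queue_depth(queue="default"):
    # Sidekiq keeps each queue as a Redis list under "queue:<name>".
    return r.llen(f"queue:{queue}")

if sidekiq_queue_depth() > 10_000:
    print("queue depth is growing faster than workers can drain it")
```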
Pick the queue that matches your language ecosystem. Sidekiq for Ruby, Celery for Python, BullMQ for Node. Cross-language workflows go through Kafka or SQS, not a shared queue.
Idempotency is non-negotiable. Every job assumes at-least-once execution. Plan for it.
Pass IDs, not objects. Look up current state when the job runs.
Bounded retries with classified exceptions. Transient → backoff; permanent → dead-letter.
Dead-letter queues require discipline. Without daily review, they become useless. Plan for review time.
Don't over-thread. Modest threading, horizontal scaling.
Monitor queue depth and dead-letter count. Two most important signals.
Job queues are mostly boring infrastructure that works fine once it's set up right. The patterns above are what keep them healthy at modest scale. The teams that struggle with job queues usually have one or two of these problems: non-idempotent jobs, infinite retries, no dead-letter discipline. Fix those three and most of the operational pain disappears.