We run three different job queue systems across our services. The patterns that work across all of them, the differences that matter, and the operational gotchas.
We run three different job queue systems in production — Sidekiq for our Ruby services, Celery for Python, BullMQ for Node. Different ecosystems, same fundamental design. The patterns that matter are the same across all three; the differences are mostly syntactic. This post is what we've learned running them at modest scale.
Why push work onto a queue at all? Three reasons recur:
Latency — work the user doesn't need to wait for. Sending a welcome email, indexing a document, regenerating a thumbnail. Do it after returning the response; the user doesn't care.
Reliability — work that needs to retry on failure. Calling an unreliable third-party API; processing a webhook; running periodic reconciliation. Jobs persist; retries are automatic.
Bursty load — work that arrives in waves. A queue smooths the spike; workers process at their own pace.
Each is a real need. Most apps that grow past a few users have some background-job system.
All three job queues share the same fundamentals: the app serializes a small payload onto a broker, worker processes pull jobs and execute them, failures are retried on a schedule, and jobs that exhaust their retries land in a dead-letter queue.
The mental model is identical across them. Switching languages is annoying (re-learn the syntax) but not architecturally meaningful.
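To make that concrete, here is a minimal sketch of the shared shape in Celery (Sidekiq and BullMQ look the same modulo syntax). The broker URL and the `load_document` / `search_index` helpers are illustrative stand-ins, not real code from our services.

```python
from celery import Celery

# One broker, one task definition, one enqueue call; the same three pieces
# exist in Sidekiq and BullMQ under different names.
celery_app = Celery("jobs", broker="redis://localhost:6379/0")

@celery_app.task
def index_document(document_id):
    # Look up current state at execution time, not enqueue time.
    doc = load_document(document_id)   # hypothetical data-access helper
    search_index.add(doc)              # hypothetical search client

# In the request handler: enqueue and return immediately.
index_document.delay(42)
```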
A few real differences worth knowing:
Sidekiq is the slickest of the three. Excellent Web UI out of the box, mature ecosystem, well-thought-out APIs, paid Pro/Enterprise tiers with serious features. Standard for Ruby. Memory-efficient because it runs many jobs as threads in a single process, which MRI handles fine for I/O-bound work.
Celery is the oldest and most flexible — supports multiple backends (RabbitMQ, Redis, etc.), multiple result stores, multiple worker models (thread, process, gevent, eventlet). The flexibility is also the problem: lots of ways to misconfigure it, lots of edge cases, awkward defaults. Mature but it shows its age.
BullMQ (the Node successor to Bull) is the newest. Modern API, TypeScript-first, Redis-only (which is a feature: fewer choices). Solid dashboard support. It has matured significantly in the last couple of years.
For new projects we pick the queue that matches the language ecosystem. Cross-language workflows happen via message buses (Kafka, SQS) instead of trying to run one queue system across languages.
Regardless of which queue you're using:
Idempotency. Jobs can run more than once (retries, network blips causing duplicate sends, etc.). Every job is designed to be safe on second execution. Either the operation is naturally idempotent (set a status, don't increment a counter), or the job checks a dedupe key before doing anything with side effects.
We've been burned multiple times by non-idempotent jobs. Always design for at-least-once execution.
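A minimal sketch of the dedupe-key approach, assuming workers can reach Redis; the key name, TTL, and `deliver_email` helper are hypothetical.

```python
import redis

r = redis.Redis()

def send_welcome_email(user_id, email_id):
    # SET NX claims the key only if it doesn't exist yet, so a duplicate
    # execution of the same logical send becomes a no-op.
    claimed = r.set(f"welcome-email:{email_id}", 1, nx=True, ex=86400)
    if not claimed:
        return  # already handled by an earlier attempt
    deliver_email(user_id)  # hypothetical delivery call
```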
Small, focused jobs. A job should do one thing. Don't bundle "send email + update database + call API" into one job. If one step fails and retries, the others run twice. Split into separate jobs chained together.
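In Celery, splitting works naturally as a chain; a sketch, with `send_email`, `update_record`, and `call_partner_api` as hypothetical task names (`.si()` makes each signature immutable so one task's return value isn't forced into the next):

```python
from celery import chain

def onboard_user(user_id):
    # Three small jobs, each retried independently, instead of one bundled job.
    chain(
        send_email.si(user_id),
        update_record.si(user_id),
        call_partner_api.si(user_id),
    ).apply_async()
```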
Pass IDs, not objects. A job takes a user_id, not a user object. Serializing a full object means stale data on retry (the object you serialized 5 minutes ago is older than the current state). Look up the current state when the job runs.
Bounded retries. Default retries (often 25 in Sidekiq, infinite in some Celery configs) are too high. We use 5-10 retries with exponential backoff for transient failures. After that, dead-letter the job and alert.
Timeouts. Every job has a max runtime. Jobs without timeouts hang forever and tie up workers. We set per-job timeouts (typically 5-30 minutes depending on the job).
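In Celery this maps to soft and hard time limits; a sketch assuming the same `celery_app` as the retry example below, with the limits and the `rebuild_report` / `cleanup_partial_output` helpers as placeholders:

```python
from celery.exceptions import SoftTimeLimitExceeded

@celery_app.task(bind=True, soft_time_limit=300, time_limit=330)
def regenerate_report(self, report_id):
    try:
        rebuild_report(report_id)            # hypothetical long-running work
    except SoftTimeLimitExceeded:
        # Soft limit fired at 5 minutes: clean up, then let the job fail
        # before the hard limit kills the worker process at 5m30s.
        cleanup_partial_output(report_id)
        raise
```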
Two patterns work well for different failure types:
Exponential backoff for transient failures. Network blips, rate limits, temporary 503s. Retry at 30s, 1m, 5m, 15m, 60m. By the time you've burned through 5 retries with backoff, the transient issue has either resolved or there's a real problem.
Don't retry on permanent failures. Validation errors, 404s, malformed input. Retrying won't help. We classify exceptions: transient → retry with backoff; permanent → dead-letter immediately, alert someone.
```python
# Celery example
@celery_app.task(bind=True, max_retries=5)
def send_email(self, user_id, template):
    try:
        send(user_id, template)
    except TransientError as exc:
        raise self.retry(exc=exc, countdown=30 * (2 ** self.request.retries))
    except PermanentError:
        # Don't retry; let it fail to dead-letter
        raise
```
The pattern works across all three queues with syntax changes.
A job that exhausts retries goes to a "dead letter" queue (or "failed" in Sidekiq). The dead-letter queue is not the end; it's a triage queue for humans.
Our dead-letter discipline: review the queue daily, re-enqueue jobs once the underlying issue is fixed, delete the ones that will never succeed, and alert when the count grows instead of letting it drift up silently.
Without discipline, dead-letter queues become graveyards of forgotten jobs. With it, they're a useful safety net.
All three queues support "run this job at time X" or "run this job in 5 minutes." Useful for delayed retries, follow-up notifications, and anything else that should happen at a known future time.
A few patterns:
Delay capping. Don't schedule jobs more than 30 days out. Long-delayed jobs survive in Redis state forever, take space, and are forgotten about. For longer-term "in 6 months" jobs, use a real schedule (cron, systemd timer, or a recurring scheduled job that fires daily and checks a database).
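For the recurring-job alternative, Celery's beat scheduler covers the "fire daily and check the database" pattern; a sketch, with the task name and schedule as placeholders:

```python
from celery.schedules import crontab

# Instead of scheduling a job six months out, run a small daily sweep
# that asks the database what is due today.
celery_app.conf.beat_schedule = {
    "send-due-reminders": {
        "task": "jobs.send_due_reminders",
        "schedule": crontab(hour=8, minute=0),
    },
}
```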
Jitter. Schedule retries with random jitter (e.g., "30 to 60 seconds") not exact times. Otherwise retries can synchronize after a downstream outage, creating thundering herds.
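A sketch of jittered backoff in the same style as the Celery retry example above; `push_order` and `TransientError` stand in for application code:

```python
import random

@celery_app.task(bind=True, max_retries=5)
def call_partner_api(self, order_id):
    try:
        push_order(order_id)                  # hypothetical downstream call
    except TransientError as exc:
        base = 30 * (2 ** self.request.retries)
        # Spread retries over [base, 2*base) so jobs that failed together
        # don't all come back at the same instant.
        raise self.retry(exc=exc, countdown=base + random.uniform(0, base))
```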
Idempotency keys on scheduled jobs. Especially important — a job scheduled twice (by accident or by a bug) will run twice unless dedupe is built in.
Workers are processes. Some patterns:
Don't over-thread. Sidekiq runs many threads in one process; Celery can do the same. Sounds efficient. In practice, more threads = more contention on shared state = more weird bugs. We run modestly threaded workers (10-25 threads per process) and scale horizontally instead.
Memory bloat. All three frameworks have a tendency to accumulate memory over time (Ruby's GC, Python references held by job code, Node closures). We set memory limits per worker and let the OS kill them periodically. The orchestrator restarts them.
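Celery, for example, can recycle its prefork worker children on its own; a sketch of the settings we mean, with the numbers as placeholders (Sidekiq and BullMQ deployments more often lean on the orchestrator's memory limits):

```python
# Recycle worker child processes before bloat becomes a problem.
celery_app.conf.worker_max_tasks_per_child = 500        # restart after N jobs
celery_app.conf.worker_max_memory_per_child = 512_000   # resident KiB, then restart
```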
Long-running vs short jobs. Most jobs are short. Long-running jobs (multi-minute or longer) should be on dedicated worker pools to avoid blocking the short-job pool. We have separate queues for "fast" and "slow" work; workers consume from one or the other, not both.
Graceful shutdown. Workers should finish current jobs before exiting on SIGTERM. Otherwise deployments interrupt jobs mid-execution, triggering retries. All three queues support this; configure the grace period.
What we monitor is the same across all three, with queue depth and dead-letter count as the two headline signals.
All three have either built-in metrics or Prometheus exporters. We unify them into a Grafana dashboard per service.
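When an exporter isn't in place yet, queue depth is cheap to check straight from Redis. A sketch for Sidekiq, which stores each queue as a Redis list named `queue:<name>`; the queue name and threshold here are illustrative.

```python
import redis

r = redis.Redis()

def sidekiq_queue_depth(queue="default"):
    # Sidekiq keeps each queue as a Redis list under "queue:<name>".
    return r.llen(f"queue:{queue}")

if sidekiq_queue_depth() > 10_000:
    print("queue depth is growing faster than workers can drain it")
```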
Pick the queue that matches your language ecosystem. Sidekiq for Ruby, Celery for Python, BullMQ for Node. Cross-language workflows go through Kafka or SQS, not a shared queue.
Idempotency is non-negotiable. Every job assumes at-least-once execution. Plan for it.
Pass IDs, not objects. Look up current state when the job runs.
Bounded retries with classified exceptions. Transient → backoff; permanent → dead-letter.
Dead-letter queues require discipline. Without daily review, they become useless. Plan for review time.
Don't over-thread. Modest threading, horizontal scaling.
Monitor queue depth and dead-letter count. Two most important signals.
Job queues are mostly boring infrastructure that works fine once it's set up right. The patterns above are what keep them healthy at modest scale. The teams that struggle with job queues usually have one or two of these problems: non-idempotent jobs, infinite retries, no dead-letter discipline. Fix those three and most of the operational pain disappears.