Picking partition counts and keys decides whether your Kafka consumers scale linearly or hit a wall. The patterns that survived rebalances, partition-count changes, and consumer-group ops.
Kafka partition counts and keying decisions look like throwaway config — until you hit a hot partition, or try to add consumers and find out partitions limit your parallelism, or change the partition count and discover all your downstream ordering assumptions broke. The patterns below are what survived two years of running Kafka behind ~40 producer-consumer pairs.
Strip away the marketing: partitions decide two things and only two:
Every other consideration (throughput, durability, retention) flows from those two.
The advice "pick lots of partitions" is half right and half misleading. The actual rule we use:
partitions = max( peak_target_throughput_MB_per_sec / per_partition_throughput_MB_per_sec,
expected_max_consumers )
Per-partition throughput on modest hardware is ~5–10 MB/s for sustained writes. So if you expect peak 50 MB/s and might want 8 parallel consumers eventually, pick max(10, 8) = 10.
Common mistakes:
We aim for "enough headroom for 2–3× current consumer count, sized for peak throughput, rounded to a small number you'll remember." Typically 6, 12, 24, or 48 partitions per topic — not 7 or 19.
The partition key determines which partition a message lands on. partition = hash(key) mod num_partitions. So:
This is the most under-thought decision in Kafka design. What do you actually need ordered?
For an order-processing service: probably "events for the same order_id" must arrive in order. Key = order_id.
For a user-event tracker: probably "events for the same user_id" — key = user_id.
For application logs: usually no ordering needed across messages. Key = null → Kafka round-robins. Maximum parallelism.
The biggest mistake here: keying on something coarse (like "tenant_id" for a multi-tenant SaaS) when many tenants are small but one is huge. All the big tenant's traffic lands on one partition. Hot partition.
You key by tenant_id. Tenant 42 produces 80% of the traffic. The partition holding tenant 42's messages is the bottleneck; the other partitions are mostly idle. You scaled to 12 partitions hoping for 12× throughput; you got ~1.2×.
Three real fixes:
Add randomness to the key. Key = tenant_id + ":" + random_bucket(0..N-1). Spreads tenant 42 across N partitions. Trade: messages for the same tenant aren't strictly ordered anymore, which may or may not matter.
Custom partitioner. Detect hot tenants and route them across more partitions. Most Kafka clients support a custom partitioner. More code; more control.
Per-tenant topics for huge tenants. A separate dedicated topic for tenant 42; consumers handle it specially. Hard to scale to many huge tenants; works for the "a few whales" pattern.
We've used (1) most often. It's a clear concession in ordering guarantees in exchange for actual parallelism.
Adding partitions to a topic is technically easy (kafka-topics --alter). Doing it without breaking things isn't.
When you add partitions, the partition assignment changes for new messages. If your consumers cache state keyed by partition (e.g., "consumer 5 is responsible for partition 5's state"), that state is stale. Worse, if you were keying by user_id and getting consistent partition assignment, the new partition count means user_id now maps to a different partition — so events for the same user could arrive at a different consumer than before.
What we do:
The "just add more partitions later" mindset works for simple stateless consumers. Anything stateful, you pay later.
A consumer group should have:
For autoscaling consumer pools: scale based on consumer lag, not CPU. CPU on a consumer that's keeping up looks fine; consumer lag tells you the actual story.
We use KEDA (Kubernetes Event-Driven Autoscaling) with the Kafka lag scaler. Target lag of ~1000 messages per partition; scale up if exceeded for >2 min, scale down if under for >10 min. Works well.
When a consumer joins or leaves the group, Kafka rebalances: every partition is reassigned. During rebalance:
This pause can be milliseconds (small groups, small state) or seconds (large groups, large state). For latency-sensitive consumers, rebalance is the worst part of running Kafka.
Mitigations:
session.timeout.ms: how long a consumer can be away before it's considered dead. Bigger = fewer rebalances on transient issues, longer detection of real failures. We use 30 seconds.max.poll.interval.ms: how long between polls. If a consumer takes too long processing one message, it misses the deadline and triggers a rebalance. Tune based on your actual processing time + safety margin.The "consumer keeps getting kicked out of the group" problem is almost always a max.poll.interval.ms issue — processing took longer than expected, the broker thought the consumer died, triggered a rebalance.
enable.idempotence=true). Prevents duplicate messages on producer retry. Default in modern clients; verify.acks=all on producers. Wait for all in-sync replicas to confirm. Durability over latency for anything important.enable.auto.commit=false). Commit after successful processing, not on poll. Prevents data loss on consumer crash.Kafka partitions are simple in concept and tricky in operation. The bulk of issues we've debugged trace back to one of three decisions: wrong partition count, wrong key, or unanticipated hot tenant. Make those right at design time and Kafka is one of the most predictable parts of the stack.
Get the latest tutorials, guides, and insights on AI, DevOps, Cloud, and Infrastructure delivered directly to your inbox.
AI agents for incident triage sound great in demos. We've tried it in production. The patterns that earn their keep, the ones that backfire, and where humans still beat agents.
Three caching patterns, three failure modes. The one we use most, the one that bit us, and the rule that decides which pattern fits which workload.
Explore more articles in this category
Vault + Kubernetes auth + Vault Agent Injector. The setup, the failure modes during pod startup, and the patterns that beat raw Kubernetes Secrets.
Production monitoring catches user-facing issues. CI failures stay invisible until someone notices the merge queue is stuck. The metrics and alerts that make pipelines observable.
Static thresholds on error rate produce noisy alerts. Burn-rate alerting flips the question to "are we burning the error budget faster than we can sustain?" — and pages only on real problems.