The shape of our logging stack: what we keep, what we drop, and what we'd build differently.
We collect roughly 800GB of logs per day across our fleet. They go through a Fluent Bit → Kafka → Elasticsearch pipeline, with hot/warm/cold tiering for cost. The setup has been stable for about three years; we've also rethought parts of it twice. This post describes the current configuration and the production reasoning behind it.
Three distinct uses, with different requirements:
The mistake teams make is using one tool for all three. Each has different cost/performance characteristics; trying to unify them creates a system that's expensive AND slow AND incomplete.
We split them into three pipelines that share collection but diverge after.
Fluent Bit runs as a DaemonSet on every Kubernetes node. It tails container logs, parses where it can, and forwards to Kafka.
Why Fluent Bit (not Fluentd, not Vector, not Loki Promtail):
The Fluent Bit config does:
/var/log/containers/*.log
It does NOT:
Logs go to Kafka before they go to Elasticsearch. Why the middle hop?
Backpressure isolation. When Elasticsearch is slow (busy with a query, indexing, etc.), logs back up in Kafka rather than on the node. Fluent Bit keeps forwarding; Kafka queues; ES catches up. Without Kafka, ES backpressure causes log loss.
Multiple consumers. Kafka feeds Elasticsearch (for hot search), Snowflake (for analytics), and S3 (for archival) in parallel. Each consumer reads at its own pace.
Replayability. When we change parsing or routing rules, we can replay from Kafka. Without it, we'd have to re-collect.
Kafka topics are partitioned by source (one topic per Kubernetes namespace, partitioned by node). Retention is 24 hours — long enough for short outages, not long enough to be expensive.
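A minimal sketch of that routing, assuming a `logs.<namespace>` topic naming scheme and a CRC32-based partition choice. Both are illustrative assumptions, not the production configuration, and Kafka's own default partitioner hashes keys differently.

```python
import zlib


def topic_for(namespace: str) -> str:
    """One topic per Kubernetes namespace (the 'logs.' prefix is an assumption)."""
    return f"logs.{namespace}"


def partition_for(node_name: str, num_partitions: int) -> int:
    """Map a node name to a stable partition so each node's logs stay ordered.

    CRC32 is used only for illustration; it is deterministic across runs,
    unlike Python's built-in hash() for strings.
    """
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")
    return zlib.crc32(node_name.encode("utf-8")) % num_partitions


# Hypothetical record; the field names here are placeholders.
record = {"namespace": "checkout", "node": "node-17", "message": "Order placed"}
topic = topic_for(record["namespace"])
partition = partition_for(record["node"], num_partitions=12)
print(topic, partition)
```

Keying on the node keeps per-node ordering intact, at the cost of uneven partitions when one node is much chattier than the others.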
Storage cost: ~$200/month for our Kafka cluster (3 brokers, NVMe disks). Real but small relative to the value.
For recent logs (last 7 days), full-text search and ad-hoc queries: Elasticsearch.
Our cluster:
(logs-production-2024-04-25)
Logs in this tier are searchable with millisecond latency. Engineers go to Kibana for "what happened with this request" queries.
What goes here: every log line, fully indexed.
Cost: ~$1,800/month for the cluster. Most of our logging cost.
After 7 days, logs move to the warm tier: the same Elasticsearch cluster, but with frozen indices (read-only, more compressed). They stay searchable but slower (~5-30s vs <1s for hot).
We use Index Lifecycle Management (ILM) to handle the transition automatically. Logs stay available for incident review for 90 days.
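For readers who have not written an ILM policy, here is a rough sketch of what a policy body with these timings could look like, built as a Python dict so it can be serialized and sent to the ILM policy endpoint. Only the 7-day and 90-day boundaries come from the text. The rollover thresholds are assumptions, and the warm phase uses `readonly` plus `forcemerge` as a stand-in for the frozen-index setup described above.

```python
import json

# Hypothetical policy body. Phase timings follow the 7-day / 90-day tiers;
# every other value is an assumption made for illustration.
ilm_policy = {
    "policy": {
        "phases": {
            "hot": {
                "actions": {
                    # Roll to a fresh index daily or when a primary shard gets large.
                    "rollover": {"max_age": "1d", "max_primary_shard_size": "50gb"}
                }
            },
            "warm": {
                "min_age": "7d",
                "actions": {
                    "readonly": {},
                    "forcemerge": {"max_num_segments": 1},
                },
            },
            "delete": {
                # Cold copies already live in S3, so the index can be removed here.
                "min_age": "90d",
                "actions": {"delete": {}},
            },
        }
    }
}

print(json.dumps(ilm_policy, indent=2))
```

With rollover in use, `min_age` is measured from the time an index rolls over rather than from its creation, which shifts the effective boundaries by up to a day.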
Cost: same cluster, slightly more storage (~$300/month additional storage).
After 90 days, logs go to S3 in compressed Parquet format. They're not in Elasticsearch anymore.
For audit/compliance access, we use Athena to query S3-resident logs. Slower (10-60s queries), but vastly cheaper.
S3 storage cost: ~$120/month for the volume we keep. Glacier for the oldest tiers reduces this further.
Cost optimization happens at the source. We drop logs before they hit Kafka:
Health check logs. Every load balancer health check generates logs. We drop them at Fluent Bit (filter on URL pattern).
Successful internal request logs. "200 OK" for internal service calls — we drop the body, keep counts.
Debug-level logs in production. Apps log debug-level only on demand (via a feature flag). Default is INFO.
Verbose third-party libraries. Some libraries (boto3 in DEBUG mode, the Cloudflare SDK) generate massive log volumes. We filter at the source.
These cuts reduce volume by ~50% with essentially no information loss.
Some categories we keep at full fidelity:
Errors. Every error, every stack trace. Errors are precious data; the cost of keeping them is small relative to their value.
Auth events. Every login, logout, token issue, failed auth. Required for security investigation.
Payment-related events. Anything in the payments path. Required for compliance and dispute resolution.
State changes. "User X changed their email," "Subscription Y was canceled." Audit trail.
We've had cases where dropping the wrong category bit us — once we added a too-aggressive filter that dropped some 4xx error logs, then spent half a day debugging an issue that the dropped logs would have explained immediately. Conservative dropping is the right default.
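One way to keep a filter conservative is to evaluate the keep rules before any drop rule, so a new drop pattern can never shadow an error or an audit event. The sketch below shows that ordering in Python. The real filtering runs in Fluent Bit, and the paths, logger names, and the `category` and `status` fields are assumptions made for illustration.

```python
from typing import Any, Dict

# Illustrative values; real paths, logger names, and category tags will differ.
HEALTH_CHECK_PATHS = ("/healthz", "/ready")
NOISY_LOGGERS = ("botocore", "boto3")
KEEP_CATEGORIES = {"auth", "payment", "state_change"}


def should_drop(record: Dict[str, Any]) -> bool:
    """Return True if the record is noise; keep rules always win over drop rules."""
    level = str(record.get("level", "INFO")).upper()

    # Keep rules first: errors, 4xx/5xx responses, and protected categories.
    if level in ("ERROR", "CRITICAL", "FATAL"):
        return False
    if record.get("category") in KEEP_CATEGORIES:
        return False
    status = record.get("status")
    if isinstance(status, int) and status >= 400:
        return False

    # Drop rules: health checks, debug output, chatty third-party libraries.
    if str(record.get("path", "")).startswith(HEALTH_CHECK_PATHS):
        return True
    if level == "DEBUG":
        return True
    if level == "INFO" and str(record.get("logger", "")).startswith(NOISY_LOGGERS):
        return True
    return False


print(should_drop({"level": "INFO", "path": "/healthz"}))
print(should_drop({"level": "INFO", "path": "/orders", "status": 404}))
```

With this ordering, the 4xx incident described above could not recur through a path-based rule, because status codes of 400 and above are checked before any path pattern.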
Apps must emit structured JSON logs. The schema:
{
  "timestamp": "2024-04-25T18:32:15.234Z",
  "level": "INFO",
  "service": "checkout-api",
  "version": "v3.42.1",
  "request_id": "abc-123",
  "user_id": "user_456",
  "message": "Order placed",
  "order_id": "order_789",
  "amount_usd": 49.99,
  "duration_ms": 142
}
Required fields: timestamp, level, service, message. Conventional fields: request_id, user_id (for joining logs across services).
Non-conventional fields are per-log-line (like order_id, amount_usd).
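As an illustration of how an application might emit this schema, here is a small sketch built on Python's standard `logging` module. The `JsonFormatter` class is a hypothetical helper written for this post, not a library API; per-line fields are passed through `extra=`.

```python
import json
import logging
from datetime import datetime, timezone

# Attributes every LogRecord carries; anything else came in via `extra=`.
_STANDARD_ATTRS = set(vars(logging.LogRecord("", 0, "", 0, "", (), None)))


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with the required fields."""

    def __init__(self, service: str, version: str) -> None:
        super().__init__()
        self.service = service
        self.version = version

    def format(self, record: logging.LogRecord) -> str:
        ts = datetime.fromtimestamp(record.created, tz=timezone.utc)
        payload = {
            "timestamp": ts.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
            "level": record.levelname,
            "service": self.service,
            "version": self.version,
            "message": record.getMessage(),
        }
        # Per-line fields such as request_id or order_id arrive through `extra=`.
        for key, value in vars(record).items():
            if key not in _STANDARD_ATTRS and key not in payload:
                payload[key] = value
        return json.dumps(payload, default=str)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter(service="checkout-api", version="v3.42.1"))
logger = logging.getLogger("checkout")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("Order placed", extra={"request_id": "abc-123", "order_id": "order_789"})
```

Setting `service` and `version` once on the formatter keeps the required fields out of individual call sites, so a log line cannot forget them.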
Why mandate this:
level=ERROR is a fast query; "ERROR appearing in the message" is slow.
With request_id, we can trace a request across all services that touched it.
An aggregation such as "amount_usd for orders today" works only if amount_usd is a real field.
For services that emit unstructured logs, Fluent Bit applies parsers to extract fields. But native structured logging is much cleaner.
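The request_id join mentioned above is simple once every line is JSON. A minimal sketch over an in-memory list of log lines follows; in production this is an Elasticsearch query, and the sample events are made up.

```python
import json
from typing import Dict, Iterable, List


def trace_request(lines: Iterable[str], request_id: str) -> List[Dict]:
    """Return every event carrying request_id, oldest first, across all services."""
    events = []
    for line in lines:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # unstructured line: nothing to join on
        if isinstance(event, dict) and event.get("request_id") == request_id:
            events.append(event)
    # ISO 8601 UTC timestamps in one fixed format sort correctly as strings.
    return sorted(events, key=lambda e: e.get("timestamp", ""))


# Made-up sample lines from two services plus one unstructured line.
sample = [
    '{"timestamp": "2024-04-25T18:32:15.300Z", "service": "payments", "request_id": "abc-123", "message": "Charge ok"}',
    '{"timestamp": "2024-04-25T18:32:15.234Z", "service": "checkout-api", "request_id": "abc-123", "message": "Order placed"}',
    '{"timestamp": "2024-04-25T18:32:15.250Z", "service": "checkout-api", "request_id": "zzz-999", "message": "Other"}',
    "plain text line with no structure",
]
for event in trace_request(sample, "abc-123"):
    print(event["timestamp"], event["service"], event["message"])
```

The unstructured line is silently skipped, which is the practical cost of not emitting JSON: that service drops out of every cross-service trace.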
Common query types and how we handle them:
"Show me all errors from service X in the last hour"
"What happened with request ID Y across all services"
request_id:"Y" query. Returns all logs from all services with that ID.
"How many 5xx errors per minute for the last 30 days"
"Total logged events by user X in the last 6 months"
"All audit events for compliance request"
The pattern: hot for fast, recent debugging; warm for incident reviews; cold for compliance/analytics.
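That pattern reduces to a lookup on how far back a query needs to reach. A small sketch follows, with backend names that are placeholders rather than real endpoints; the 7-day, 90-day, and 7-year windows come from the tiers described earlier.

```python
from datetime import timedelta

HOT_WINDOW = timedelta(days=7)
WARM_WINDOW = timedelta(days=90)
COLD_WINDOW = timedelta(days=7 * 365)


def backend_for(lookback: timedelta) -> str:
    """Name the oldest tier a query with this lookback has to touch."""
    if lookback < timedelta(0):
        raise ValueError("lookback must not be negative")
    if lookback <= HOT_WINDOW:
        return "elasticsearch-hot"
    if lookback <= WARM_WINDOW:
        return "elasticsearch-warm"
    if lookback <= COLD_WINDOW:
        return "athena-s3"
    raise ValueError("lookback exceeds retention")


print(backend_for(timedelta(hours=1)))
print(backend_for(timedelta(days=30)))
print(backend_for(timedelta(days=180)))
```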
We have alerts derived from logs:
(level=ERROR over time)
We use ElastAlert (Elasticsearch-side) or Prometheus (with logs converted to metrics via mtail/promtail). Both work; we've drifted toward Prometheus because the alert tooling is shared with non-log alerts.
What we don't alert on: arbitrary log volume, individual log lines. The cardinality is too high.
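The idea behind alerting on aggregates instead of individual lines can be sketched as a per-minute error counter. This is a toy in-memory version, assuming the JSON schema above; the threshold is arbitrary and the function names are made up for the example.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def errors_per_minute(events: Iterable[Dict]) -> Counter:
    """Count level=ERROR events per (service, minute) bucket."""
    counts: Counter = Counter()
    for event in events:
        if event.get("level") != "ERROR":
            continue
        # First 16 characters of an ISO 8601 timestamp: "YYYY-MM-DDTHH:MM".
        minute = str(event.get("timestamp", ""))[:16]
        counts[(event.get("service", "unknown"), minute)] += 1
    return counts


def breaches(counts: Counter, threshold: int) -> List[Tuple[str, str]]:
    """Return the buckets whose error count exceeds the threshold, sorted."""
    return sorted(key for key, n in counts.items() if n > threshold)
```

Bucketing by service and minute keeps the number of series small and bounded, which is what makes the result usable as a metric.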
Specific issues we've hit:
Elasticsearch hot-shard contention when one service generated 10x its normal logs. The shard for that service's index couldn't keep up; ingestion lagged for everything. Fix: better sharding strategy (per-day per-service), automated rollover.
Fluent Bit memory growth when downstream Kafka had connectivity issues. Fluent Bit buffered logs in memory and was eventually OOM-killed by the kubelet. Fix: tighter memory limits and a restart-on-OOM lifecycle.
Massive index from a runaway service. A bug caused a service to log hundreds of MB/min. Filled disks before we noticed. Now we have per-source volume alerts — > 10x normal volume → page.
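The 10x rule is easy to express in code. The sketch below assumes "normal" means the median of recent per-interval volumes for that source, which is our illustrative choice here rather than a description of the production alert.

```python
from statistics import median
from typing import Sequence

PAGE_MULTIPLIER = 10.0  # "> 10x normal volume" from the text


def should_page(current_bytes: float, history_bytes: Sequence[float]) -> bool:
    """Page when a source's current volume exceeds 10x its typical volume.

    "Normal" is taken as the median of recent intervals (an assumption);
    a median keeps one earlier spike from raising the baseline.
    """
    if not history_bytes:
        return False  # no baseline yet, so nothing to compare against
    baseline = median(history_bytes)
    if baseline <= 0:
        return False  # silent sources need a separate absolute threshold
    return current_bytes > PAGE_MULTIPLIER * baseline


print(should_page(1200, [100, 110, 90, 105]))
print(should_page(900, [100, 110, 90, 105]))
```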
Sensitive data in logs. A service started logging full request bodies including PII. Caught in code review but several days of logs had the data. Cleanup: identifying and redacting affected logs in S3 was painful. Now: pre-commit linting for known PII patterns, plus runtime redaction at Fluent Bit for known sensitive fields.
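Runtime redaction of known fields can be sketched as follows. The field list and the email pattern are hypothetical examples, the production redaction runs in Fluent Bit rather than Python, and a pattern list like this only catches what it already knows about.

```python
import re
from typing import Any, Dict

# Hypothetical field list and pattern; a real deployment would maintain its own.
SENSITIVE_FIELDS = {"password", "card_number", "ssn", "authorization"}
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
REDACTED = "[REDACTED]"


def redact(record: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy with sensitive fields masked and emails scrubbed from strings."""
    clean: Dict[str, Any] = {}
    for key, value in record.items():
        if key.lower() in SENSITIVE_FIELDS:
            clean[key] = REDACTED
        elif isinstance(value, dict):
            clean[key] = redact(value)  # request bodies are often nested
        elif isinstance(value, str):
            clean[key] = EMAIL_RE.sub(REDACTED, value)
        else:
            clean[key] = value
    return clean


print(redact({"message": "Reset sent to jane@example.com", "password": "hunter2"}))
```

Redacting before the record leaves the node means the sensitive value never reaches Kafka, Elasticsearch, or S3, which avoids the after-the-fact cleanup described above.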
Total monthly cost:
Total: ~$2,200/month for ~24TB of log retention across 7 years. Managed log providers (Datadog, Sumo Logic) would charge us $5-10k/month for similar volume, so the self-hosted setup pays for itself many times over. The engineering time was real, though: ~3 weeks of initial setup and ~2 hours/week ongoing.
Pick managed logging unless volume is high. Datadog or similar at < 100GB/day is cheaper than running your own stack. Above ~500GB/day, self-hosted economics start working.
Drop noise at collection. Health checks, access logs for internal services, debug-level. Easy ~50% volume reduction.
Use Kafka in the middle. Backpressure isolation is worth the operational cost.
Mandate structured logs. Get this in place before logs scale. Retrofitting is painful.
Don't index everything for the same duration. Hot 7 days, warm 90 days, cold 7 years. Cost differs 10x between tiers.
Watch for runaway services. Per-source volume alerts catch the bug-that-logs-too-much before it fills your disks.
Logging is one of those infrastructure pieces where teams either have it sorted (and don't think about it) or are constantly fighting it. The pieces aren't exotic; the discipline is in keeping the volume manageable, the schema clean, and the tiers separated by cost. With those three in place, the system is just there, doing its job.