We collect ~800GB of logs per day across our fleet. The shape of our logging stack, what we keep, what we drop, and what we'd build differently.

Log Aggregation: A Working Production Setup

We collect roughly 800GB of logs per day across our fleet. They go through a Fluent Bit → Kafka → Elasticsearch pipeline, with hot/warm/cold tiering for cost. The setup has been stable for about three years; we've also rethought parts of it twice. This post is the current configuration with the production reasoning.

What logs are for #

Three distinct uses, with different requirements:

Debugging recent issues. Engineers need fast search over the last few hours/days. High-resolution, full content.
Audit and compliance. Specific log streams (auth events, financial transactions) need long-term retention for regulatory reasons.
Analytics on log content. Dashboards, alerts derived from log patterns. Aggregated metrics, not individual log lines.

The mistake teams make is using one tool for all three. Each has different cost/performance characteristics; trying to unify them creates a system that's expensive AND slow AND incomplete.

We split them into three pipelines that share collection but diverge after.

Collection: Fluent Bit on every node #

Fluent Bit runs as a DaemonSet on every Kubernetes node. It tails container logs, parses where it can, and forwards to Kafka.

Why Fluent Bit (not Fluentd, not Vector, not Loki Promtail):

Lower memory footprint (~30MB per node) than Fluentd
Excellent Kubernetes metadata enrichment built-in
Stable for ~3 years; we know its quirks

The Fluent Bit config does:

Tails /var/log/containers/*.log
Adds Kubernetes metadata (pod name, namespace, labels, container name)
Parses common formats (JSON, plaintext-with-known-prefix)
Drops noise via filter rules (more on this)
Forwards to Kafka

It does NOT:

Try to enrich semantically (parse stack traces, correlate IDs) — that's downstream
Buffer for hours — buffer is small, ~100MB per node
Compress aggressively — bandwidth is cheap, CPU is more valuable
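To make the metadata step concrete: Kubernetes names container log files `<pod>_<namespace>_<container>-<container id>.log`, so part of the enrichment can be read straight off the path being tailed. The following is a minimal Python sketch of that idea, not Fluent Bit's actual implementation; the function name `metadata_from_path` is hypothetical, and labels would still have to come from the Kubernetes API.

```python
import re
from typing import Optional

# Container log files follow the pattern
# /var/log/containers/<pod>_<namespace>_<container>-<64 hex char id>.log
LOG_PATH_RE = re.compile(
    r"^/var/log/containers/"
    r"(?P<pod>[^_]+)_(?P<namespace>[^_]+)_"
    r"(?P<container>.+)-(?P<container_id>[0-9a-f]{64})\.log$"
)


def metadata_from_path(path: str) -> Optional[dict]:
    """Return pod, namespace, container and container_id, or None if no match."""
    match = LOG_PATH_RE.match(path)
    return match.groupdict() if match else None
```

A path that does not follow the convention yields `None`, which a collector could treat as "forward without enrichment" rather than dropping the line.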

Why Kafka in the middle #

Logs go to Kafka before they go to Elasticsearch. Why the middle hop?

Backpressure isolation. When Elasticsearch is slow (busy with a query, indexing, etc.), logs back up in Kafka rather than on the node. Fluent Bit keeps forwarding; Kafka queues; ES catches up. Without Kafka, ES backpressure causes log loss.

Multiple consumers. Kafka feeds Elasticsearch (for hot search), Snowflake (for analytics), and S3 (for archival) in parallel. Each consumer reads at its own pace.

Replayability. When we change parsing or routing rules, we can replay from Kafka. Without it, we'd have to re-collect.

Kafka topics are partitioned by source (one topic per Kubernetes namespace, partitioned by node). Retention is 24 hours — long enough for short outages, not long enough to be expensive.
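As a rough illustration of that routing, here is a Python sketch that maps a record to a topic and a partition. The `logs.<namespace>` naming scheme and both function names are assumptions for the example; with a real Kafka producer, using the node name as the message key gives the same "one node, one partition" behavior.

```python
import zlib


def topic_for(namespace: str) -> str:
    """One topic per Kubernetes namespace (naming scheme is hypothetical)."""
    return f"logs.{namespace}"


def partition_for(node_name: str, num_partitions: int) -> int:
    """Stable partition choice so all logs from one node stay in order."""
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")
    # crc32 is deterministic across processes, unlike Python's built-in hash().
    return zlib.crc32(node_name.encode("utf-8")) % num_partitions
```

Keeping a node's logs in one partition preserves per-node ordering, at the cost of uneven partitions when one node is much noisier than the rest.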

Cost: ~$200/month for our Kafka cluster (3 brokers, NVMe disks). Real, but small relative to the value.

Hot tier: Elasticsearch (or OpenSearch) #

For recent logs (last 7 days), full-text search and ad-hoc queries: Elasticsearch.

Our cluster:

6 nodes (2 master, 3 data, 1 coordinating)
~24TB total storage across the data nodes
Index per day per source (e.g., logs-production-2024-04-25)
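The index-per-day naming can be derived from a record's source and timestamp. A small sketch, assuming timestamps are ISO 8601 strings and that days are bucketed in UTC (the post does not say which timezone is used); `index_name` is a hypothetical helper.

```python
from datetime import datetime, timezone


def index_name(source: str, timestamp: str) -> str:
    """Build a daily index name such as logs-production-2024-04-25."""
    # Older Python versions do not accept a trailing "Z" in fromisoformat().
    parsed = datetime.fromisoformat(timestamp.replace("Z", "+00:00"))
    day = parsed.astimezone(timezone.utc)
    return f"logs-{source}-{day:%Y-%m-%d}"
```

Bucketing in UTC means a log line written late in the evening in a western timezone lands in the next day's index, which matters when searching around midnight.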

Logs in this tier are searchable with millisecond latency. Engineers go to Kibana for "what happened with this request" queries.

What goes here: every log line, fully indexed.

Cost: ~$1,800/month for the cluster. Most of our logging cost.

Warm tier: 7 days to 90 days #

After 7 days, logs move to warm tier. Same Elasticsearch cluster but with frozen indices (read-only, more compressed). Searchable but slower (~5-30s vs <1s for hot).

We use Index Lifecycle Management (ILM) to handle the transition automatically. Logs stay searchable for incident review until they are 90 days old.

Cost: same cluster, slightly more storage (~$300/month additional storage).

Cold tier: 90 days to 7 years #

After 90 days, logs go to S3 in compressed Parquet format. They're not in Elasticsearch anymore.

For audit/compliance access, we use Athena to query S3-resident logs. Slower (10-60s queries), but vastly cheaper.

S3 storage cost: ~$120/month for the volume we keep. Moving the oldest data to Glacier reduces this further.
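Put together, the tiering is a function of log age alone. A minimal sketch of the boundaries described in the three tier sections (7 days, 90 days, 7 years); in the real pipeline ILM and the S3 export do this work, and the names here are hypothetical.

```python
from datetime import timedelta

HOT_MAX_AGE = timedelta(days=7)
WARM_MAX_AGE = timedelta(days=90)
COLD_MAX_AGE = timedelta(days=7 * 365)


def tier_for_age(age: timedelta) -> str:
    """Return which tier should hold a log line of the given age."""
    if age < HOT_MAX_AGE:
        return "hot"      # Elasticsearch, fully indexed
    if age < WARM_MAX_AGE:
        return "warm"     # Elasticsearch, frozen indices
    if age < COLD_MAX_AGE:
        return "cold"     # Parquet on S3, queried through Athena
    return "expired"      # past the retention window
```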

What we drop at collection #

Cost optimization happens at the source. We drop logs before they hit Kafka:

Health check logs. Every load balancer health check generates logs. We drop them at Fluent Bit (filter on URL pattern).

Successful internal request logs. "200 OK" for internal service calls — we drop the body, keep counts.

Debug-level logs in production. Apps log debug-level only on demand (via a feature flag). Default is INFO.

Verbose third-party libraries. Some libraries (boto3 in DEBUG mode, the Cloudflare SDK) generate massive log volumes. We filter at the source.

These cuts reduce volume by ~50% with essentially no information loss.
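The drop rules above amount to a small predicate over each record. In our setup they live in Fluent Bit filters and in application log levels; the Python below only illustrates the decision logic. The field names (`path`, `logger`, `internal`, `status`), the health-check paths, and the logger list are all assumptions for the example.

```python
import re
from collections import Counter

HEALTH_CHECK_PATH = re.compile(r"^/(healthz|readyz|livez)$")  # hypothetical paths
NOISY_LOGGERS = {"boto3", "botocore"}                         # hypothetical list

# Successful internal calls are dropped, but still counted per service.
internal_ok_counts = Counter()


def should_drop(record: dict) -> bool:
    """Return True if the record is noise that should not reach Kafka."""
    if HEALTH_CHECK_PATH.match(record.get("path", "")):
        return True
    if record.get("level") == "DEBUG":
        return True
    if record.get("logger", "").split(".")[0] in NOISY_LOGGERS:
        return True
    if record.get("internal") and record.get("status") == 200:
        internal_ok_counts[record.get("service", "unknown")] += 1
        return True
    return False
```

Counting before dropping is what keeps the "200 OK" rule from losing information: the request rate survives even though the individual lines do not.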

What we don't drop, even when tempted #

Some categories we keep at full fidelity:

Errors. Every error, every stack trace. Errors are precious data; the cost of keeping them is small relative to their value.

Auth events. Every login, logout, token issue, failed auth. Required for security investigation.

Payment-related events. Anything in the payments path. Required for compliance and dispute resolution.

State changes. "User X changed their email," "Subscription Y was canceled." Audit trail.

We've had cases where dropping the wrong category bit us — once we added a too-aggressive filter that dropped some 4xx error logs, then spent half a day debugging an issue that the dropped logs would have explained immediately. Conservative dropping is the right default.
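One way to make conservative dropping structural rather than a matter of care is to check a keep-list before any drop rule runs. This is a sketch of that idea, not our actual filter chain; the `event_type` field and its values are hypothetical stand-ins for the categories listed above.

```python
from typing import Callable

PROTECTED_LEVELS = {"ERROR"}
PROTECTED_EVENT_TYPES = {"auth", "payment", "state_change"}  # hypothetical values


def is_protected(record: dict) -> bool:
    """True for categories that must be kept at full fidelity."""
    return (
        record.get("level") in PROTECTED_LEVELS
        or record.get("event_type") in PROTECTED_EVENT_TYPES
    )


def keep(record: dict, drop_rule: Callable[[dict], bool]) -> bool:
    """Apply a drop rule, but never to a protected record."""
    if is_protected(record):
        return True
    return not drop_rule(record)
```

With this ordering, a too-aggressive rule such as "drop everything with a 4xx status" can still remove routine lines, but it cannot remove error logs.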

Structured logging: required, not optional #

Apps must emit structured JSON logs. The schema:

```json
{
  "timestamp": "2024-04-25T18:32:15.234Z",
  "level": "INFO",
  "service": "checkout-api",
  "version": "v3.42.1",
  "request_id": "abc-123",
  "user_id": "user_456",
  "message": "Order placed",
  "order_id": "order_789",
  "amount_usd": 49.99,
  "duration_ms": 142
}
```

Required fields: timestamp, level, service, message. Conventional fields: request_id, user_id (for joining logs across services).

Any other fields (like order_id, amount_usd) are specific to the individual log line.

Why mandate this:

Indexed fields are queryable. level=ERROR is a fast query; "ERROR appearing in the message" is slow.
Joins across services. With consistent request_id, we can trace a request across all services that touched it.
Aggregation. "Total amount_usd for orders today" works only if amount_usd is a real field.

For services that emit unstructured logs, Fluent Bit applies parsers to extract fields. But native structured logging is much cleaner.
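For a Python service, native structured logging can be done with the standard `logging` module alone. The formatter below is a minimal sketch that emits the required fields from the schema above; the class name and the convention of passing per-line fields through `extra={"fields": {...}}` are assumptions for the example, not part of any library.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def __init__(self, service: str, version: str):
        super().__init__()
        self.service = service
        self.version = version

    def format(self, record: logging.LogRecord) -> str:
        created = datetime.fromtimestamp(record.created, tz=timezone.utc)
        payload = {
            "timestamp": created.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
            "level": record.levelname,
            "service": self.service,
            "version": self.version,
            "message": record.getMessage(),
        }
        # Per-line fields (request_id, order_id, ...) arrive via extra={"fields": {...}}.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)
```

Usage would look like `logger.info("Order placed", extra={"fields": {"request_id": "abc-123", "order_id": "order_789"}})` after attaching the formatter to a `logging.StreamHandler`.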

Querying patterns #

Common query types and how we handle them:

"Show me all errors from service X in the last hour"

Kibana, hot tier. Fast.

"What happened with request ID Y across all services"

Kibana with request_id:"Y" query. Returns all logs from all services with that ID.

"How many 5xx errors per minute for the last 30 days"

Hot/warm tier with date histogram aggregation. Slower for the warm part.

"Total logged events by user X in the last 6 months"

Athena over S3. Slow (~30s) but feasible.

"All audit events for compliance request"

Athena over S3, filtered by event type. Hours of latency acceptable.

The pattern: hot for fast, recent debugging; warm for incident reviews; cold for compliance/analytics.
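As an example of what sits behind one of these, the "5xx errors per minute" question maps to an Elasticsearch search body with a range filter and a `date_histogram` aggregation. The sketch below builds that body as a Python dict; the `status` field is an assumption (it is not in the schema shown earlier), and the body would be sent to the `_search` endpoint of whichever index pattern covers the period.

```python
def five_xx_per_minute_query(days: int = 30) -> dict:
    """Search body counting 5xx responses per minute over the last `days` days."""
    return {
        "size": 0,  # only the aggregation is needed, not the matching documents
        "query": {
            "bool": {
                "filter": [
                    {"range": {"status": {"gte": 500, "lte": 599}}},
                    {"range": {"timestamp": {"gte": f"now-{days}d"}}},
                ]
            }
        },
        "aggs": {
            "errors_per_minute": {
                "date_histogram": {"field": "timestamp", "fixed_interval": "1m"}
            }
        },
    }
```

Thirty days at one-minute resolution is 43,200 buckets, so for longer windows a coarser interval is the safer choice.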

Alerting from logs #

We have alerts derived from logs:

Error rate per service (count of level=ERROR over time)
Specific known-bad messages ("connection refused" rate)
Audit anomalies (failed-login rate)
4xx/5xx HTTP status rate from access logs

We use ElastAlert (Elasticsearch-side) or Prometheus (with logs converted to metrics via mtail/promtail). Both work; we've drifted toward Prometheus because the alert tooling is shared with non-log alerts.

What we don't alert on: arbitrary log volume, individual log lines. The cardinality is too high.
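The error-rate alert reduces to counting `level=ERROR` records per service per minute and comparing the counts to a threshold. In practice the counting happens in Elasticsearch aggregations or in a log-to-metrics exporter; this Python sketch only shows the reduction, and assumes records follow the schema above with uniform UTC ISO 8601 timestamps.

```python
from collections import Counter
from typing import Iterable


def error_counts_per_minute(records: Iterable[dict]) -> Counter:
    """Count ERROR records keyed by (service, minute)."""
    counts = Counter()
    for record in records:
        if record.get("level") != "ERROR":
            continue
        # "2024-04-25T18:32:15.234Z"[:16] -> "2024-04-25T18:32"
        minute = record["timestamp"][:16]
        counts[(record["service"], minute)] += 1
    return counts
```

The result has one entry per service per minute, which is the low-cardinality shape an alerting system can handle, unlike individual log lines.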

What broke at scale #

Specific issues we've hit:

Elasticsearch hot-shard contention when one service generated 10x its normal logs. The shard for that service's index couldn't keep up; ingestion lagged for everything. Fix: better sharding strategy (per-day per-service), automated rollover.

Fluent Bit memory growth when downstream Kafka had connectivity issues. FB buffered logs in memory until the pods were OOM-killed. Fix: tighter memory limits and a restart-on-OOM lifecycle.

Massive index from a runaway service. A bug caused a service to log hundreds of MB/min. Filled disks before we noticed. Now we have per-source volume alerts — > 10x normal volume → page.

Sensitive data in logs. A service started logging full request bodies including PII. Caught in code review but several days of logs had the data. Cleanup: identifying and redacting affected logs in S3 was painful. Now: pre-commit linting for known PII patterns, plus runtime redaction at Fluent Bit for known sensitive fields.
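Runtime redaction of known sensitive fields is simple to express. Ours runs in Fluent Bit; the Python below is a sketch of the same logic under assumed field names, with a deliberately simple email pattern as an example of value-level scrubbing. Neither the field list nor the pattern is exhaustive.

```python
import re

SENSITIVE_FIELDS = {"password", "ssn", "card_number"}  # hypothetical field names
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact(record: dict) -> dict:
    """Return a copy of the record with sensitive fields and emails masked."""
    clean = {}
    for key, value in record.items():
        if key in SENSITIVE_FIELDS:
            clean[key] = "[REDACTED]"
        elif isinstance(value, dict):
            clean[key] = redact(value)  # nested request bodies
        elif isinstance(value, str):
            clean[key] = EMAIL_RE.sub("[REDACTED_EMAIL]", value)
        else:
            clean[key] = value
    return clean
```

Field-name matching only catches what is already known to be sensitive, which is why it complements the pre-commit linting rather than replacing it.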

Cost #

Total monthly cost:

Fluent Bit (negligible — runs on cluster nodes, no separate compute)
Kafka cluster: ~$200
Elasticsearch cluster: ~$1,800
S3 storage: ~$120
Athena: variable, ~$30/month for our query volume

Total: ~$2,200/month for ~24TB of log retention across 7 years. Managed log providers (Datadog, Sumo Logic) would charge us $5-10k/month for similar volume, so the self-hosted setup pays for itself many times over. The engineering time was real, though: ~3 weeks of initial setup and ~2 hours/week ongoing.
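The arithmetic behind those figures, as a quick check. This assumes the 800GB/day figure is the ingested volume and a 30-day month; both are simplifications for the example.

```python
MONTHLY_COST_USD = {
    "kafka": 200,
    "elasticsearch": 1800,
    "s3": 120,
    "athena": 30,
}
DAILY_VOLUME_GB = 800

total_usd = sum(MONTHLY_COST_USD.values())
elasticsearch_share = MONTHLY_COST_USD["elasticsearch"] / total_usd
cost_per_gb = total_usd / (DAILY_VOLUME_GB * 30)

print(f"total: ${total_usd}/month")                          # total: $2150/month
print(f"elasticsearch share: {elasticsearch_share:.0%}")     # elasticsearch share: 84%
print(f"cost per ingested GB: ${cost_per_gb:.2f}")           # cost per ingested GB: $0.09
```

The listed items sum to $2,150, which rounds to the ~$2,200 quoted above, and the hot tier accounts for most of it.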

What I'd tell a team starting #

Pick managed logging unless volume is high. Datadog or similar at < 100GB/day is cheaper than running your own stack. Above ~500GB/day, self-hosted economics start working.

Drop noise at collection. Health checks, access logs for internal services, debug-level. Easy ~50% volume reduction.

Use Kafka in the middle. Backpressure isolation is worth the operational cost.

Mandate structured logs. Get this in place before logs scale. Retrofitting is painful.

Don't index everything for the same duration. Hot 7 days, warm 90 days, cold 7 years. Cost differs 10x between tiers.

Watch for runaway services. Per-source volume alerts catch the bug-that-logs-too-much before it fills your disks.
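A per-source volume alert is little more than a comparison against a baseline. The sketch below flags any source whose current volume exceeds 10x its normal volume, the threshold mentioned earlier; how the baseline is computed (for example, a trailing average per source) is left open, and the function name is hypothetical.

```python
def runaway_sources(current: dict, baseline: dict, factor: float = 10.0) -> list:
    """Return sources whose current volume exceeds `factor` times their baseline."""
    flagged = []
    for source, volume in current.items():
        normal = baseline.get(source)
        # Sources with no (or zero) baseline are skipped rather than flagged.
        if normal and volume > factor * normal:
            flagged.append(source)
    return sorted(flagged)
```

New services have no baseline and are skipped here, so a real alert would need a separate absolute ceiling to cover them.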

Logging is one of those infrastructure pieces where teams either have it sorted (and don't think about it) or are constantly fighting it. The pieces aren't exotic; the discipline is in keeping the volume manageable, the schema clean, and the tiers separated by cost. With those three in place, the system is just there, doing its job.

About Kiril Urbonas