Why Postgres connection limits bite at unexpected times, the pooling layer we put in front, and the pool-mode tradeoffs we learned the hard way.
A few months into running our biggest application database on RDS Postgres, we started hitting FATAL: remaining connection slots are reserved at unpredictable times. The database had headroom on every metric — CPU under 40%, IOPS modest, memory fine — but new connections were being rejected. That's the canonical signal that you've outgrown direct connections from your app to Postgres. This post is what we learned putting PgBouncer in front, the pool-mode tradeoffs, and the metrics we wired up.
Postgres is a process-per-connection database. Every connection is a backend process with its own memory footprint (a few MB minimum, often more for work_mem-heavy queries). The max_connections setting caps how many can exist concurrently — typically 100-500 by default depending on instance size.
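To see how close you are to the cliff, compare the cap against current usage. A minimal check, run directly against Postgres:

```sql
-- How big is the cap, and how much of it is in use right now?
SHOW max_connections;
SELECT count(*) AS open_connections FROM pg_stat_activity;
```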
The math gets bad fast. A pool size of 50 per app instance × 8 instances × 3 services hitting the same database = 1,200 potential connections. Most of those are idle most of the time, but RDS doesn't know that — slots are reserved as long as the connection is open.
You don't have to be at large scale to hit this. The first time we hit it, we had 4 services × moderate pool sizes = ~600 connections trying to coexist on a default RDS Postgres instance. Most of them did almost nothing all day.
Why not just raise max_connections? You can. RDS lets you. But each connection costs memory, and at high counts Postgres itself slows down from contention on shared internal structures. Common guidance is to keep max_connections below ~300 even on beefy hardware. Past that, the per-connection overhead starts dominating.
The right answer at scale is a connection pooler: your apps make cheap connections to the pooler (in-process or over the local network), and the pooler maintains a small pool of real Postgres connections.
PgBouncer is a tiny daemon that speaks the Postgres wire protocol on one side and connects to your real database on the other. It takes many client connections, maintains a small pool of server connections, and multiplexes traffic between them. The daemon itself uses ~5MB of memory and almost no CPU. We've run it for two years and the daemon has never been the bottleneck.
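For orientation, a minimal pgbouncer.ini sketch — the hostname, database name, and file paths are placeholders, not our actual config:

```ini
[databases]
; clients connect to "appdb" on PgBouncer; it proxies to the real RDS host
appdb = host=mydb.example.rds.amazonaws.com port=5432 dbname=appdb

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432                       ; apps point here instead of 5432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt  ; username/password pairs
```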
PgBouncer supports three pool modes:
Session mode — a server connection is dedicated to a client for the entire session. Identical semantics to direct Postgres connections; nothing changes for the app. But you save no connection slots — it's basically a passthrough.
Transaction mode — a server connection is held only for the duration of an open transaction, then returned to the pool. This is where the real pooling happens. One server connection can serve dozens of client connections that aren't actively in a transaction.
Statement mode — server connection returned after every statement. Most aggressive pooling. Breaks anything that needs state between statements (prepared statements, session settings, etc.).
We use transaction mode for almost everything. Statement mode is too restrictive; session mode doesn't pool. The tradeoff with transaction mode is that some features don't work the way you'd expect.
Things that need server-side state across the whole session:

Plain SET — SET lives until the connection ends, but in transaction mode the connection goes back to the pool after each transaction; the next query may land on a different server connection. Use SET LOCAL (transaction-scoped) instead.

Prepared statements — server-side prepared statements are per-connection, so they break under transaction pooling (recent PgBouncer releases can track them across pooled connections via the max_prepared_statements config).

LISTEN / NOTIFY — requires a persistent connection. If you need pub/sub, use a different channel.

WITH HOLD cursors and cursor-based pagination — cursors don't survive transaction boundaries in transaction mode.

We got bitten by the SET vs SET LOCAL distinction early. A service set application_name at connection time to make queries identifiable in pg_stat_activity. After PgBouncer, the name vanished. We switched to passing it as a query comment instead; problem gone.
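For cases where a setting genuinely is needed, a sketch of the safe pattern — the table and setting here are illustrative:

```sql
BEGIN;
-- Transaction-scoped: reverts automatically at COMMIT/ROLLBACK,
-- so it can't leak to other clients sharing the pooled connection.
SET LOCAL statement_timeout = '5s';
SELECT * FROM orders WHERE id = 42;  -- hypothetical table
COMMIT;
-- A plain SET here would stick to whichever server connection happened
-- to serve this transaction and surprise the next client that gets it.
```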
PgBouncer's default_pool_size controls how many server connections per database+user pair. Tuning it is an iterative process.
Our starting heuristic: pool_size = 2 × (number_of_CPUs_on_database). For a db.r6g.xlarge (4 vCPU), that's 8. Sounds tiny. Works.
The reasoning: queries that aren't I/O bound saturate CPU. Past 2× CPU concurrency, you're just queueing inside Postgres anyway — better to queue at PgBouncer (cheap, observable) than inside Postgres (expensive, opaque).
We bumped it for specific busy DBs (one is at 20 for a 4-vCPU instance because the workload is heavily I/O bound — lots of waits, low average CPU per query). The principle: tune empirically per database.
max_client_conn is the cap on incoming connections. We set this generously (1000+) — clients waiting at PgBouncer is fine; it's clients getting rejected by Postgres that breaks things.
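Pulled together, the sizing knobs from this section as a pgbouncer.ini sketch — the values are the starting points described above, not universal answers:

```ini
pool_mode = transaction
default_pool_size = 8    ; 2 x vCPUs on a 4-vCPU database; tune per DB
max_client_conn = 1000   ; generous: queueing at PgBouncer is cheap
```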
Where do you run PgBouncer? Three options:
On every app pod (as a sidecar). Simplest. App connects to localhost; PgBouncer maintains the pool. No network hop. Downside: each pod has its own pool, so you get no cross-pod pooling — total connections are still pods × per-pod pool size. The fan-in problem shrinks but doesn't go away.
Centralized PgBouncer deployment (a small DaemonSet or per-AZ deployment). Real pooling across all pods. Adds one network hop, but usually a same-AZ one. Single point of failure unless you run multiple replicas.
Cloud-provider pooler (RDS Proxy on AWS). Managed, no ops. AWS handles HA, scaling, patching. Costs ~$0.015 per vCPU-hour of the database it fronts. We use this for the very-large DBs where the operational simplicity wins; PgBouncer for everything else.
We started with sidecar PgBouncer (easy). Moved to centralized deployment when we realized we weren't getting real pooling. Currently a mix — centralized PgBouncer for our two largest internal DBs, RDS Proxy for the production-critical ones.
PgBouncer's SHOW POOLS is your eyes. Six columns matter:

cl_active — client connections currently linked to a server connection.

cl_waiting — client connections queued, waiting for a free server connection.

sv_active — server connections currently serving a client.

sv_idle — server connections sitting idle, ready for immediate use.

sv_used — server connections idle longer than server_check_delay, needing a check query before reuse.

maxwait — how long the oldest queued client has been waiting.
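To eyeball it interactively, connect to PgBouncer's admin console (the virtual pgbouncer database) and run the command directly:

```sql
-- e.g. psql -h 127.0.0.1 -p 6432 -U pgbouncer pgbouncer
SHOW POOLS;   -- one row per database+user pair
```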
We scrape these via the pgbouncer_exporter (Prometheus). Alerts on:
cl_waiting > 10 for > 30s → pool saturation; scale up or investigate slow queries

maxwait > 200ms for > 1m → same thing, latency view

The first time cl_waiting spiked, we had a single slow report query holding multiple server slots. Spotted it in the dashboard within 2 minutes; killed the query; pool drained. Without metrics we'd have spent half an hour figuring out which service was at fault.
A few specific things we hit:
The pool hung after a database failover. RDS multi-AZ failover changed the underlying primary; PgBouncer's open connections to the old primary were dead. Without explicit handling, new client connections sat waiting on dead server connections. We set server_check_delay = 30 so PgBouncer re-checks any server connection that has been idle 30+ seconds before handing it out — dead ones get evicted.
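The relevant knobs, as we understand them (server_check_query shown with its default):

```ini
server_check_delay = 30        ; seconds a server conn may sit idle unchecked
server_check_query = select 1  ; liveness probe run before reuse
```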
LISTEN traffic broke. A pub/sub service used LISTEN to react to row changes. It worked over direct connections and broke through PgBouncer in transaction mode. We moved that service to a separate session-mode PgBouncer pool; the alternative is to let that one service connect directly.
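PgBouncer lets you override the pool mode per database entry, which is how the separate session-mode pool works — a sketch with hypothetical names:

```ini
[databases]
; hypothetical: the LISTEN/NOTIFY service gets its own session-mode pool
events = host=mydb.example.rds.amazonaws.com dbname=appdb pool_mode=session
```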
Prepared statements caused weird errors during a Postgres upgrade. Our Postgres library cached prepared statements by name on the client side. After failover (different server connection), the cached name was stale on the new connection. Adding prepare_threshold=0 to the connection string disabled prepared statements for that service. Fix took longer than it should have because the error message was unhelpful.
Start with PgBouncer in transaction mode for new apps. Cheap insurance against the connection-limit cliff.
Use SET LOCAL, not SET. In transaction-mode PgBouncer this matters; in direct connections it doesn't hurt.
Pool size starts small. Tune empirically. 2× CPU on the database is a reasonable starting point. Most teams over-pool.
Scrape PgBouncer metrics from day one. cl_waiting is the most useful alert signal you'll get for "your database layer is saturated."
RDS Proxy on AWS if you don't want to operate PgBouncer. Costs more, ops-free. Same model — transaction-mode pooling in front of Postgres.
Connection pooling isn't optional once you have a handful of services talking to the same database. PgBouncer is the lightweight, well-understood answer. The mode tradeoffs are real but small; the operational discipline (sizing, metrics) is where the actual work is.