How PgBouncer Saved the Day (After We Almost Killed PostgreSQL)
I woke up to 47 unread messages. Never a good sign.
The first one from my tech lead at 8:23 AM: "production fatt gya hai" (production went down).
The next 46 were variations of panic in the group chat.
Our Stack (Pretty Standard Stuff)
- Mobile app: React Native (iOS + Android)
- Admin dashboard: Next.js
- Backend: NestJS API
- Database: PostgreSQL 14 on a beefy EC2 instance (c5.9xlarge - 36 vCPU, 72GB RAM)
- Users: ~50k active, all in India
Nothing fancy. We weren't doing microservices or Kubernetes. Just a solid monolithic API, a mobile app, and a dashboard. The kind of setup that's supposed to "just work."
The Problem: Everything Was Slow. Everything.
Opened Slack. The backend channel was a warzone:
[9:18 AM] API response times going crazy
[9:21 AM] Users reporting app freezing
[9:23 AM] Dashboard won't load
[9:24 AM] @channel PRODUCTION DOWN
I SSH'd into the EC2 instance. First thing I always check:
htop
CPU usage: 13%. RAM: 18GB used out of 72GB. Disk I/O: barely anything.
What the hell?
The server had resources. But nothing was working.
Checked the NestJS logs:
TimeoutError: Connection timeout
TimeoutError: Connection timeout
TimeoutError: Connection timeout
Hundreds of them. Our app couldn't connect to PostgreSQL.
The Diagnosis: Too Many Cooks in the Database Kitchen
Connected to PostgreSQL:
psql -U appuser -d production
Ran the magic query:
SELECT count(*) FROM pg_stat_activity;
Result: 1,247 connections
Our max_connections was set to 1500. We weren't even at the limit. So why was everything broken?
SELECT state, count(*)
FROM pg_stat_activity
GROUP BY state;
 state               | count
---------------------+-------
 idle                |  1089
 active              |   147
 idle in transaction |    11
1,089 idle connections. Just sitting there. Doing absolutely nothing.
That's when it clicked. PostgreSQL wasn't handling queries poorly. It was babysitting a thousand connections that weren't even doing work.
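pg_stat_activity can also tell you who opened each connection. A quick sketch with the node pg driver that groups connections by application and state (the connection details here are placeholders, not our actual script):
import { Client } from 'pg';

async function main() {
  const client = new Client({
    host: 'localhost',
    port: 5432,
    user: 'appuser',
    password: process.env.DB_PASSWORD,
    database: 'production',
  });
  await client.connect();

  // Which app opened each connection, and what is it doing right now?
  const { rows } = await client.query(`
    SELECT application_name, client_addr, state, count(*) AS connections
    FROM pg_stat_activity
    WHERE datname = current_database()
    GROUP BY application_name, client_addr, state
    ORDER BY connections DESC
  `);
  console.table(rows);

  await client.end();
}

main().catch(console.error);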
How Did This Happen?
We'd launched a feature two weeks ago: real-time order tracking. Users could see their delivery status update live in the app.
The React Native app was polling our API every 3 seconds:
// In the mobile app: poll the live order status every 3 seconds
setInterval(() => {
  fetchOrderStatus(); // hits our NestJS API, which queries PostgreSQL
}, 3000);
Simple enough. Worked great in testing with 10 users. Even 100 users.
But we had 40,000+ users actively tracking orders that morning. Each one was holding a connection open through our API.
On the backend, our NestJS connection pool was configured like this:
// src/database/database.config.ts
TypeOrmModule.forRoot({
type: 'postgres',
host: 'localhost',
port: 5432,
username: 'appuser',
password: process.env.DB_PASSWORD,
database: 'production',
extra: {
max: 20, // Connection pool max
min: 5, // Connection pool min
},
})
The problem? We were running 4 instances of our NestJS app (behind nginx for load balancing). Each instance had its own pool of 20 connections.
4 instances × 20 connections = 80 connections (expected)
But somehow we had 1,247 connections. Where were the rest coming from?
Turns out:
- Our Next.js dashboard (SSR) opening its own connections: ~80
- Our analytics cron jobs: ~40 connections
- Our payment processing workers: ~30 connections
- Old connections that never properly closed: ~900+ connections 😱
We'd been slowly leaking connections for weeks. The real-time feature just pushed us over the edge.
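The classic way connections leak in a Node service: a client gets checked out of the pool and never released when something throws. A simplified sketch of the pattern and the boring fix (not our actual code; the orders table is just for illustration):
import { Pool } from 'pg';

// Connection details come from the usual PG* env vars in this sketch.
const pool = new Pool({ max: 20 });

// Leaky version: if the query throws, release() never runs
// and the connection stays checked out forever.
async function getOrderLeaky(id: number) {
  const client = await pool.connect();
  const result = await client.query('SELECT * FROM orders WHERE id = $1', [id]);
  client.release();
  return result.rows[0];
}

// Fixed version: the finally block guarantees the connection goes back to the pool.
async function getOrder(id: number) {
  const client = await pool.connect();
  try {
    const result = await client.query('SELECT * FROM orders WHERE id = $1', [id]);
    return result.rows[0];
  } finally {
    client.release();
  }
}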
The Temporary "Fix" That Made Things Worse
First instinct: kill idle connections.
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle'
AND state_change < current_timestamp - INTERVAL '5 minutes';
Killed 800+ connections. The app came back online. Users stopped complaining.
We thought we'd fixed it.
Three hours later: same problem. 1,100 idle connections. App down again.
We'd treated the symptom, not the disease.
The Real Fix: PgBouncer (The Tool I Wish I'd Known About Earlier)
I found a YouTube talk by a Zerodha engineer: “How Zerodha Optimized PostgreSQL at Scale.”
The takeaway was blunt: PostgreSQL performance problems often aren't about the queries at all; they're about how you manage connections.
What PgBouncer does:
Instead of letting every app instance open 20 direct connections to PostgreSQL, PgBouncer sits in the middle:
NestJS instances + Next.js SSR + workers (~120 client connections) → PgBouncer → PostgreSQL (25 server connections)
All those app connections go to PgBouncer. But PostgreSQL only sees 25 active connections, the ones actually doing work.
Installing PgBouncer (On Our EC2 Instance)
We were running everything on one EC2 instance (don't judge, we're a startup). PostgreSQL was already there. Adding PgBouncer was straightforward.
sudo apt update
sudo apt install pgbouncer
Configuration (/etc/pgbouncer/pgbouncer.ini)
[databases]
production = host=127.0.0.1 port=5432 dbname=production

[pgbouncer]
listen_addr = 127.0.0.1
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
pool_mode = transaction
default_pool_size = 25
max_client_conn = 1000
server_idle_timeout = 600
log_connections = 1
log_disconnections = 1
Key settings:
- pool_mode = transaction: a server connection is only held for the duration of a transaction, then released back to the pool
- default_pool_size = 25: PostgreSQL sees at most 25 server connections (per database/user pair)
- max_client_conn = 1000: PgBouncer itself will accept up to 1,000 client connections
Setting Up Authentication
Created the user file:
sudo nano /etc/pgbouncer/userlist.txt
"appuser" "md5<password_hash>"
Got the password hash from PostgreSQL:
SELECT usename, passwd FROM pg_shadow WHERE usename = 'appuser';
Started PgBouncer:
sudo systemctl start pgbouncer
sudo systemctl enable pgbouncer
sudo systemctl status pgbouncer
Updating Our Application
The beauty of PgBouncer: minimal code changes.
NestJS Backend
Changed one line in our database config:
// Before
TypeOrmModule.forRoot({
host: 'localhost',
port: 5432, // Direct to PostgreSQL
// ...
})
// After
TypeOrmModule.forRoot({
host: 'localhost',
port: 6432, // Through PgBouncer
// ...
})
But we also reduced our connection pool size (since PgBouncer is handling it now):
extra: {
max: 10, // Down from 20
min: 2, // Down from 5
}
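For reference, the relevant chunk of the config after both changes looks roughly like this. It's a sketch rather than our literal file, and DB_PORT is an assumed env var that makes it easy to flip back to a direct connection while testing:
import { TypeOrmModule } from '@nestjs/typeorm';

// src/database/database.config.ts (sketch)
export const databaseModule = TypeOrmModule.forRoot({
  type: 'postgres',
  host: 'localhost',
  port: parseInt(process.env.DB_PORT ?? '6432', 10), // 6432 = PgBouncer, 5432 = direct to PostgreSQL
  username: 'appuser',
  password: process.env.DB_PASSWORD,
  database: 'production',
  extra: {
    max: 10, // per-instance pool, down from 20
    min: 2,  // down from 5
  },
});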
Next.js Dashboard
Same change in .env:
# Before
DATABASE_URL=postgresql://appuser:password@localhost:5432/production
# After
DATABASE_URL=postgresql://appuser:password@localhost:6432/production
The One Thing That Broke
Our analytics service had a job that used database cursors:
// This broke with PgBouncer transaction mode
const cursor = await connection.query('DECLARE cursor_name CURSOR FOR SELECT ...');
Cursors require holding a connection across multiple queries. Transaction pooling doesn't allow that.
Quick fix: We gave the analytics service a direct PostgreSQL connection (bypassing PgBouncer):
// analytics.config.ts
const analyticsDb = new Pool({
host: 'localhost',
port: 5432, // Direct to PostgreSQL
max: 5, // Small pool just for analytics
});
Everything else went through PgBouncer.
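For what it's worth, cursors aren't completely off the table under transaction pooling: PgBouncer pins one server connection for the length of a transaction, so a cursor that's declared, consumed, and closed inside a single BEGIN/COMMIT still works. A sketch (the table name and batch size are made up):
import { Pool } from 'pg';

const pool = new Pool({ host: 'localhost', port: 6432, max: 5 });

async function streamEvents(handleBatch: (rows: any[]) => Promise<void>) {
  const client = await pool.connect();
  try {
    // Everything between BEGIN and COMMIT runs on one pinned server connection,
    // so the cursor survives even with pool_mode = transaction.
    await client.query('BEGIN');
    await client.query('DECLARE events_cur CURSOR FOR SELECT * FROM analytics_events');
    while (true) {
      const { rows } = await client.query('FETCH 1000 FROM events_cur');
      if (rows.length === 0) break;
      await handleBatch(rows);
    }
    await client.query('CLOSE events_cur');
    await client.query('COMMIT');
  } catch (err) {
    await client.query('ROLLBACK');
    throw err;
  } finally {
    client.release();
  }
}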
The Results
Before PgBouncer:
- Active connections: 1,247 (most idle)
- App crashes: 3-4 times per day
- API p95 latency: 3,200ms
- Query time: actually fast (< 20ms)
- Everyone stressed
After PgBouncer:
- Active connections: 18-25 (all working)
- App crashes: 0 since the switch
- API p95 latency: 180ms
- Query time: still fast (< 20ms)
- Everyone sleeping at night
The database was doing the same work—just not wasting resources managing idle connections.
What We Learned
1. Connection count ≠ database load
We thought PostgreSQL could "handle" 1,500 connections because that was our max_connections setting. Technically true. But each connection is a separate process with its own memory. Even at a rough 5-10 MB per backend, 1,200 mostly idle backends is 6-12 GB of RAM and a lot of scheduler work before a single useful query runs. At that point PostgreSQL was spending more effort managing processes than running queries.
2. Not all idle connections are equal
Some frameworks (looking at you, certain ORMs) hold connections even when they're not actively querying. Those "idle" connections still consume resources.
3. Scaling horizontally creates connection pressure
Every time we added NestJS instances, connection usage climbed in lockstep. Double the instance count without PgBouncer and those 1,200+ connections become 2,400+. Game over.
4. PgBouncer is basically free performance
Barely any code changes (a port number and some pool sizes). No query rewrites. No schema changes. Just instant relief from connection overhead.
5. This should've been day-one infrastructure
PgBouncer is one of those tools that seems optional until you need it desperately. Then you wonder why you waited so long.
Our Current Setup (1 Month Later)
We've since scaled to:
- 8 NestJS instances
- 60k active users
- Peak traffic: 15k requests/minute
PostgreSQL connections? Never exceeded 28.
PgBouncer just works. It's boring. It doesn't show up in our monitoring alerts because there's nothing to alert on.
That's the highest compliment I can give infrastructure.
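If you want a cheap guardrail anyway, a periodic check that yells when the server-side connection count creeps up takes about ten lines. A sketch, with the threshold and the alerting left as placeholders:
import { Pool } from 'pg';

// Direct to PostgreSQL so the check itself doesn't sit in a PgBouncer pool slot.
const pool = new Pool({ host: 'localhost', port: 5432, max: 1 });

const THRESHOLD = 50; // with default_pool_size = 25, anything near this is suspicious

async function checkConnections() {
  const { rows } = await pool.query(
    'SELECT count(*)::int AS total FROM pg_stat_activity'
  );
  if (rows[0].total > THRESHOLD) {
    // Replace with your alerting of choice (Slack webhook, PagerDuty, ...).
    console.warn(`PostgreSQL connection count is ${rows[0].total} (threshold ${THRESHOLD})`);
  }
}

setInterval(checkConnections, 60_000); // every minute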
If You're Facing Connection Issues
Symptoms that mean you need PgBouncer:
- ✅ High connection count but low CPU usage
- ✅ pg_stat_activity showing mostly idle connections
- ✅ Timeout errors but queries are actually fast
- ✅ Problems get worse when you scale up app instances
- ✅ Connection errors during traffic spikes
Setup time: 2-3 hours including testing
Difficulty: Surprisingly easy
ROI: Immediate and dramatic
Final Thought
That morning when production went down, I thought we'd need to refactor our entire backend. Maybe split the database. Maybe rewrite our connection handling. Maybe hire a DBA.
Instead, we installed a 15-year-old tool that's been solving this exact problem forever.
Sometimes the best solution isn't the clever one. It's the one that already exists and works.
P.S. If your PostgreSQL max_connections is above 200 and you're not using PgBouncer, you're probably going to have a bad morning soon. Just saying.