How PgBouncer Saved the Day (After We Almost Killed PostgreSQL)
I woke up to 47 unread messages. Never a good sign.
The first one from my tech lead at 8:23 AM: "production fatt gya hai" (production went down).
The next 46 were variations of panic in the group chat.
Our Stack (Pretty Standard Stuff)
- Mobile app: React Native (iOS + Android)
- Admin dashboard: Next.js
- Backend: NestJS API
- Database: PostgreSQL 14 on a beefy EC2 instance (c5.9xlarge - 36 vCPU, 72GB RAM)
- Users: ~50k active, all in India
Nothing fancy. We weren't doing microservices or Kubernetes. Just a solid monolithic API, a mobile app, and a dashboard. The kind of setup that's supposed to "just work."
The Problem: Everything Was Slow. Everything.
Opened Slack. The backend channel was a warzone:
[9:18 AM] API response times going crazy
[9:21 AM] Users reporting app freezing
[9:23 AM] Dashboard won't load
[9:24 AM] @channel PRODUCTION DOWN
I SSH'd into the EC2 instance. First thing I always check:
htop
CPU usage: 13%. RAM: 18GB used out of 72GB. Disk I/O: barely anything.
What the hell?
The server had resources. But nothing was working.
Checked the NestJS logs:
TimeoutError: Connection timeout
TimeoutError: Connection timeout
TimeoutError: Connection timeout
Hundreds of them. Our app couldn't connect to PostgreSQL.
The Diagnosis: Too Many Cooks in the Database Kitchen
Connected to PostgreSQL:
psql -U appuser -d production
Ran the magic query:
SELECT count(*) FROM pg_stat_activity;
Result: 1,247 connections
Our max_connections was set to 1500. We weren't even at the limit. So why was everything broken?
SELECT state, count(*)
FROM pg_stat_activity
GROUP BY state;
 state               | count
---------------------+-------
 idle                |  1089
 active              |   147
 idle in transaction |    11
1,089 idle connections. Just sitting there. Doing absolutely nothing.
That's when it clicked. PostgreSQL wasn't handling queries poorly. It was babysitting a thousand connections that weren't even doing work.
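pg_stat_activity can also tell you who opened each connection. A quick sketch with the node pg driver that groups connections by application and state (the connection details here are placeholders, not our actual script):
import { Client } from 'pg';

async function main() {
  const client = new Client({
    host: 'localhost',
    port: 5432,
    user: 'appuser',
    password: process.env.DB_PASSWORD,
    database: 'production',
  });
  await client.connect();

  // Which app opened each connection, and what is it doing right now?
  const { rows } = await client.query(`
    SELECT application_name, client_addr, state, count(*) AS connections
    FROM pg_stat_activity
    WHERE datname = current_database()
    GROUP BY application_name, client_addr, state
    ORDER BY connections DESC
  `);
  console.table(rows);

  await client.end();
}

main().catch(console.error);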
How Did This Happen?
We'd launched a feature two weeks ago: real-time order tracking. Users could see their delivery status update live in the app.
The React Native app was polling our API every 3 seconds:
// In the mobile app: poll the live order status every 3 seconds
setInterval(() => {
  fetchOrderStatus(); // hits our NestJS API, which queries PostgreSQL
}, 3000);
Simple enough. Worked great in testing with 10 users. Even 100 users.
But we had 40,000+ users actively tracking orders that morning. Each one was holding a connection open through our API.
On the backend, our NestJS connection pool was configured like this:
// src/database/database.config.ts
TypeOrmModule.forRoot({
type: 'postgres',
host: 'localhost',
port: 5432,
username: 'appuser',
password: process.env.DB_PASSWORD,
database: 'production',
extra: {
max: 20, // Connection pool max
min: 5, // Connection pool min
},
})
The problem? We were running 4 instances of our NestJS app (behind nginx for load balancing). Each instance had its own pool of 20 connections.
4 instances × 20 connections = 80 connections (expected)
But somehow we had 1,247 connections. Where were the rest coming from?
Turns out:
- Our Next.js dashboard (SSR) opening its own connections: ~80
- Our analytics cron jobs: ~40 connections
- Our payment processing workers: ~30 connections
- Old connections that never properly closed: ~900+ connections 😱
We'd been slowly leaking connections for weeks. The real-time feature just pushed us over the edge.
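The classic way connections leak in a Node service: a client gets checked out of the pool and never released when something throws. A simplified sketch of the pattern and the boring fix (not our actual code; the orders table is just for illustration):
import { Pool } from 'pg';

// Connection details come from the usual PG* env vars in this sketch.
const pool = new Pool({ max: 20 });

// Leaky version: if the query throws, release() never runs
// and the connection stays checked out forever.
async function getOrderLeaky(id: number) {
  const client = await pool.connect();
  const result = await client.query('SELECT * FROM orders WHERE id = $1', [id]);
  client.release();
  return result.rows[0];
}

// Fixed version: the finally block guarantees the connection goes back to the pool.
async function getOrder(id: number) {
  const client = await pool.connect();
  try {
    const result = await client.query('SELECT * FROM orders WHERE id = $1', [id]);
    return result.rows[0];
  } finally {
    client.release();
  }
}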
The Temporary "Fix" That Made Things Worse
First instinct: kill idle connections.
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle'
AND state_change < current_timestamp - INTERVAL '5 minutes';
Killed 800+ connections. The app came back online. Users stopped complaining.
We thought we'd fixed it.
Three hours later: same problem. 1,100 idle connections. App down again.
We'd treated the symptom, not the disease.
The Real Fix: PgBouncer (The Tool I Wish I'd Known About Earlier)
I found a YouTube talk by a Zerodha engineer: “How Zerodha Optimized PostgreSQL at Scale.”
The takeaway was blunt: PostgreSQL performance problems often aren't about the queries at all; they're about how you manage connections.
What PgBouncer does:
Instead of letting every app instance open 20 direct connections to PostgreSQL, PgBouncer sits in the middle:
NestJS instances + Next.js SSR + workers (~120 client connections) → PgBouncer → PostgreSQL (25 server connections)
All those app connections go to PgBouncer. But PostgreSQL only sees 25 active connections, the ones actually doing work.
Installing PgBouncer (On Our EC2 Instance)
We were running everything on one EC2 instance (don't judge, we're a startup). PostgreSQL was already there. Adding PgBouncer was straightforward.
sudo apt update
sudo apt install pgbouncer
Configuration (/etc/pgbouncer/pgbouncer.ini)
[databases]
production = host=127.0.0.1 port=5432 dbname=production

[pgbouncer]
listen_addr = 127.0.0.1
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
pool_mode = transaction
default_pool_size = 25
max_client_conn = 1000
server_idle_timeout = 600
log_connections = 1
log_disconnections = 1
Key settings:
- pool_mode = transaction: a server connection is only held for the duration of a transaction, then released back to the pool
- default_pool_size = 25: PostgreSQL sees at most 25 server connections (per database/user pair)
- max_client_conn = 1000: PgBouncer itself will accept up to 1,000 client connections
Setting Up Authentication
Created the user file:
sudo nano /etc/pgbouncer/userlist.txt
"appuser" "md5<password_hash>"
Got the password hash from PostgreSQL:
SELECT usename, passwd FROM pg_shadow WHERE usename = 'appuser';
Started PgBouncer:
sudo systemctl start pgbouncer
sudo systemctl enable pgbouncer
sudo systemctl status pgbouncer
Updating Our Application
The beauty of PgBouncer: minimal code changes.
NestJS Backend
Changed one line in our database config:
// Before
TypeOrmModule.forRoot({
host: 'localhost',
port: 5432, // Direct to PostgreSQL
// ...
})
// After
TypeOrmModule.forRoot({
host: 'localhost',
port: 6432, // Through PgBouncer
// ...
})
But we also reduced our connection pool size (since PgBouncer is handling it now):
extra: {
max: 10, // Down from 20
min: 2, // Down from 5
}
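For reference, the relevant chunk of the config after both changes looks roughly like this. It's a sketch rather than our literal file, and DB_PORT is an assumed env var that makes it easy to flip back to a direct connection while testing:
import { TypeOrmModule } from '@nestjs/typeorm';

// src/database/database.config.ts (sketch)
export const databaseModule = TypeOrmModule.forRoot({
  type: 'postgres',
  host: 'localhost',
  port: parseInt(process.env.DB_PORT ?? '6432', 10), // 6432 = PgBouncer, 5432 = direct to PostgreSQL
  username: 'appuser',
  password: process.env.DB_PASSWORD,
  database: 'production',
  extra: {
    max: 10, // per-instance pool, down from 20
    min: 2,  // down from 5
  },
});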
Next.js Dashboard
Same change in .env:
# Before
DATABASE_URL=postgresql://appuser:password@localhost:5432/production
# After
DATABASE_URL=postgresql://appuser:password@localhost:6432/production
The One Thing That Broke
Our analytics service had a job that used database cursors:
// This broke with PgBouncer transaction mode
const cursor = await connection.query('DECLARE cursor_name CURSOR FOR SELECT ...');
Cursors require holding a connection across multiple queries. Transaction pooling doesn't allow that.
Quick fix: We gave the analytics service a direct PostgreSQL connection (bypassing PgBouncer):
// analytics.config.ts
const analyticsDb = new Pool({
host: 'localhost',
port: 5432, // Direct to PostgreSQL
max: 5, // Small pool just for analytics
});
Everything else went through PgBouncer.
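For what it's worth, cursors aren't completely off the table under transaction pooling: PgBouncer pins one server connection for the length of a transaction, so a cursor that's declared, consumed, and closed inside a single BEGIN/COMMIT still works. A sketch (the table name and batch size are made up):
import { Pool } from 'pg';

const pool = new Pool({ host: 'localhost', port: 6432, max: 5 });

async function streamEvents(handleBatch: (rows: any[]) => Promise<void>) {
  const client = await pool.connect();
  try {
    // Everything between BEGIN and COMMIT runs on one pinned server connection,
    // so the cursor survives even with pool_mode = transaction.
    await client.query('BEGIN');
    await client.query('DECLARE events_cur CURSOR FOR SELECT * FROM analytics_events');
    while (true) {
      const { rows } = await client.query('FETCH 1000 FROM events_cur');
      if (rows.length === 0) break;
      await handleBatch(rows);
    }
    await client.query('CLOSE events_cur');
    await client.query('COMMIT');
  } catch (err) {
    await client.query('ROLLBACK');
    throw err;
  } finally {
    client.release();
  }
}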
The Results
Before PgBouncer:
- Active connections: 1,247 (most idle)
- App crashes: 3-4 times per day
- API p95 latency: 3,200ms
- Query time: actually fast (< 20ms)
- Everyone stressed
After PgBouncer:
- Active connections: 18-25 (all working)
- App crashes: 0 since the switch
- API p95 latency: 180ms
- Query time: still fast (< 20ms)
- Everyone sleeping at night
The database was doing the same work—just not wasting resources managing idle connections.
What We Learned
1. Connection count ≠ database load
We thought PostgreSQL could "handle" 1,500 connections because that was our max_connections setting. Technically true. But each connection is a separate process with its own memory. Even at a rough 5-10 MB per backend, 1,200 mostly idle backends is 6-12 GB of RAM and a lot of scheduler work before a single useful query runs. At that point PostgreSQL was spending more effort managing processes than running queries.
2. Not all idle connections are equal
Some frameworks (looking at you, certain ORMs) hold connections even when they're not actively querying. Those "idle" connections still consume resources.
3. Scaling horizontally creates connection pressure
Every time we added NestJS instances, connection usage climbed in lockstep. Double the instance count without PgBouncer and those 1,200+ connections become 2,400+. Game over.
4. PgBouncer is basically free performance
Barely any code changes (a port number and some pool sizes). No query rewrites. No schema changes. Just instant relief from connection overhead.
5. This should've been day-one infrastructure
PgBouncer is one of those tools that seems optional until you need it desperately. Then you wonder why you waited so long.
Our Current Setup (1 Month Later)
We've since scaled to:
- 8 NestJS instances
- 60k active users
- Peak traffic: 15k requests/minute
PostgreSQL connections? Never exceeded 28.
PgBouncer just works. It's boring. It doesn't show up in our monitoring alerts because there's nothing to alert on.
That's the highest compliment I can give infrastructure.
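If you want a cheap guardrail anyway, a periodic check that yells when the server-side connection count creeps up takes about ten lines. A sketch, with the threshold and the alerting left as placeholders:
import { Pool } from 'pg';

// Direct to PostgreSQL so the check itself doesn't sit in a PgBouncer pool slot.
const pool = new Pool({ host: 'localhost', port: 5432, max: 1 });

const THRESHOLD = 50; // with default_pool_size = 25, anything near this is suspicious

async function checkConnections() {
  const { rows } = await pool.query(
    'SELECT count(*)::int AS total FROM pg_stat_activity'
  );
  if (rows[0].total > THRESHOLD) {
    // Replace with your alerting of choice (Slack webhook, PagerDuty, ...).
    console.warn(`PostgreSQL connection count is ${rows[0].total} (threshold ${THRESHOLD})`);
  }
}

setInterval(checkConnections, 60_000); // every minute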
If You're Facing Connection Issues
Symptoms that mean you need PgBouncer:
- ✅ High connection count but low CPU usage
- ✅ pg_stat_activity showing mostly idle connections
- ✅ Timeout errors but queries are actually fast
- ✅ Problems get worse when you scale up app instances
- ✅ Connection errors during traffic spikes
Setup time: 2-3 hours including testing
Difficulty: Surprisingly easy
ROI: Immediate and dramatic
Final Thought
That morning when production went down, I thought we'd need to refactor our entire backend. Maybe split the database. Maybe rewrite our connection handling. Maybe hire a DBA.
Instead, we installed a 15-year-old tool that's been solving this exact problem forever.
Sometimes the best solution isn't the clever one. It's the one that already exists and works.
P.S. If your PostgreSQL max_connections is above 200 and you're not using PgBouncer, you're probably going to have a bad morning soon. Just saying.