The Database That Melted at 2 AM

It was 2:17 AM on a Tuesday. My phone buzzed with that special kind of urgency that only PagerDuty can deliver. You know the one. The vibration pattern that makes your heart skip a beat before you even look at the screen.

“Database connection pool exhausted. All services returning 503.”

Great. Just great.

I’d been on call for three days straight because my teammate was out with the flu. I was running on coffee, anxiety, and the vague hope that maybe this time it would be something simple. Like a typo. Or a config file that just needed a restart.

Spoiler alert: it wasn’t.

The Setup

We’d planned this migration for weeks. Add a new index to the users table. Simple stuff. The kind of thing that should take five minutes and then you go back to bed. Our staging environment handled it fine. Production? Production had other ideas.

The migration script itself was straightforward. We were using a standard PostgreSQL migration tool, nothing fancy. The index was on a column that wasn’t even heavily queried. Or so we thought.

The First Sign

I rolled out the migration at 1:45 AM. The maintenance window was supposed to be from 1:30 to 2:00. Plenty of time. The migration started, and I watched the logs scroll by. Everything looked normal for the first few minutes.

Then the connection count started climbing.

Our connection pool was set to 100 connections. That’s usually more than enough for our traffic patterns. But by 2:05 AM, we were at 95 connections. Then 98. Then 100.

The migration was holding onto connections like they were going out of style. And it wasn’t releasing them.
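If you ever want to watch this happen in real time, a query along these lines against pg_stat_activity (available in any reasonably recent PostgreSQL) shows how many connections are in each state. Note that an application-side pool of 100 lives in the app or in a pooler like PgBouncer, not in PostgreSQL itself, but the backend sessions it opens still show up here:

```sql
-- Count backend sessions per state to see how full the pool is.
SELECT state,
       count(*) AS connections
FROM pg_stat_activity
GROUP BY state
ORDER BY connections DESC;

-- Compare against the server-side ceiling.
SHOW max_connections;
```

A pile of sessions stuck in "idle in transaction" or "active" that never shrinks is exactly the pattern we were staring at.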

The Cascade

Here’s where it gets fun. When the connection pool maxed out, new requests couldn’t get database connections. So they started queuing. And queuing. And queuing.

Our application servers started timing out. The load balancer saw all these 503s and decided the servers themselves were unhealthy. So health checks started failing. Which triggered auto-scaling. Which tried to spin up new instances. Which tried to get database connections. Which couldn't, because the pool was still exhausted.

It was like watching a Rube Goldberg machine of failure.

By 2:30 AM, I had the entire on-call team in a Zoom call. Our CTO was awake. Our VP of Engineering was awake. Half the company was probably awake, wondering why the app wasn’t working.

The Investigation

The first thing we tried was killing the migration. Simple, right? Just stop it and roll back.

Except the migration had locked the table. And it wasn’t releasing the lock. And it had 47 active connections, each one holding onto resources like a lifeline.

We couldn’t kill the connections without potentially corrupting data. We couldn’t wait for the migration to finish because it was going to take another three hours at the current rate. And we couldn’t increase the connection pool because that would just make the problem worse.
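What finally pointed us at the lock was asking PostgreSQL directly who was blocking whom. A sketch of the kind of query we ran (pg_blocking_pids exists from PostgreSQL 9.6 onward):

```sql
-- List sessions that are waiting, and the PIDs blocking them.
SELECT pid,
       pg_blocking_pids(pid) AS blocked_by,
       wait_event_type,
       state,
       left(query, 60) AS query
FROM pg_stat_activity
WHERE cardinality(pg_blocking_pids(pid)) > 0;
```

Every blocked session traced back to the same PID: the migration, sitting on its table lock.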

I spent the next hour digging through PostgreSQL logs, trying to figure out what the migration was actually doing. The query plan showed it was doing a full table scan. On a table with 47 million rows. Without using the existing indexes.

Turns out, the migration tool we were using had a bug. When creating an index on a column that was part of a composite index, it would sometimes decide to rebuild the entire table instead of just adding the new index. And it would do it in a way that held locks for way longer than necessary.

The Fix

At 4:00 AM, we made the call. We’d have to manually kill the migration process, accept that the index creation would be incomplete, and then manually clean up the partial index. Then we’d increase the connection pool temporarily, restart all the application servers, and hope nothing else broke.

The manual cleanup took another hour. We had to connect directly to the database, check what state the index was in, drop the partial index, and then manually create it using a different method that wouldn’t hold locks.
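The "different method" was PostgreSQL's CREATE INDEX CONCURRENTLY, which builds the index without taking the exclusive lock a plain CREATE INDEX holds for the duration. The catch: a concurrent build that fails or gets killed leaves an INVALID index behind that you have to drop yourself, and CONCURRENTLY can't run inside a transaction block. Roughly what the cleanup looked like (the users/email names here are illustrative, not our actual schema):

```sql
-- Find leftover invalid indexes from the aborted build.
SELECT indexrelid::regclass AS index_name
FROM pg_index
WHERE NOT indisvalid;

-- Drop the partial index, then rebuild it without blocking writes.
-- Note: CREATE INDEX CONCURRENTLY cannot run inside a transaction.
DROP INDEX IF EXISTS users_email_idx;
CREATE INDEX CONCURRENTLY users_email_idx ON users (email);
```

The concurrent build is slower and does more total work, but reads and writes keep flowing while it runs, which is the whole point at 4 AM.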

By 5:30 AM, services were starting to come back online. The connection pool was still stressed, but at least new connections could be established. We spent the next hour monitoring, making sure everything stabilized.

The Aftermath

The incident lasted about six hours total. We lost about four hours of peak traffic in our primary region. Customer support was flooded with tickets. The post-mortem meeting lasted three hours and resulted in 47 action items.

The worst part? The index we were trying to create wasn’t even necessary. We’d optimized the query that was supposed to use it, and it was performing fine without the new index. We just hadn’t cancelled the migration ticket.

What We Learned

First, always test migrations on production-sized datasets. Staging had 50,000 rows. Production had 47 million. That’s not the same thing.
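One cheap way to close that gap is to inflate staging with synthetic rows before rehearsing a migration. A sketch using generate_series, assuming a hypothetical users table with email and created_at columns:

```sql
-- Bulk-load synthetic rows so staging is within an order of magnitude
-- of production's row count before testing a migration against it.
INSERT INTO users (email, created_at)
SELECT 'user' || n || '@example.com',
       now() - (n || ' seconds')::interval
FROM generate_series(1, 10000000) AS n;

-- Refresh planner statistics so query plans resemble production's.
ANALYZE users;
```

Fake data won't match production's distribution exactly, but it would have been enough to surface a full table scan over tens of millions of rows before 1:45 AM.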

Second, connection pool monitoring is critical. We had alerts, but they were set to fire at 90% capacity. By the time the alert fired and anyone was actually looking at it, we were already at 100% and things were breaking.

Third, migration tools can have bugs. Just because something works in staging doesn’t mean it’ll work in production. Especially when the data volumes are different.

And finally, sometimes the best migration is the one you don’t run. We should have cancelled that ticket weeks ago when we realized the index wasn’t needed.

The Silver Lining

On the plus side, we now have much better monitoring for connection pools. We’ve added circuit breakers. We’ve improved our migration testing process. And I’ve learned to always check if a migration is actually necessary before running it at 2 AM.

Also, I got really good at PostgreSQL administration that night. So there’s that.

The next time someone asks me to run a “simple” migration, I’m going to ask a lot more questions. And probably schedule it for a time when I’m not running on three hours of sleep and my third cup of coffee.

Because trust me, you don’t want to be the person explaining to your CTO at 3 AM why the entire application is down because of an index you didn’t actually need.