
Connection Pools: The Thing You Never Think About Until Production Burns

Connection pools sit quietly until they break. Here is what happens when they fail, the warning signs to watch, and how to catch it before production burns.

I want to tell you about the worst production incident I ever caused. It was a Tuesday afternoon. Traffic was normal. No deploys in the last four hours. And then, over the span of about ninety seconds, every API endpoint in our Spring Boot service started returning 500 errors. Not some of them. All of them.

The logs were full of the same exception: SQLTransientConnectionException: HikariPool-1 - Connection is not available, request timed out after 30000ms. Every single request was waiting for a database connection and not getting one. The pool was exhausted.

It took us forty minutes to figure out what happened. A seemingly harmless query in a rarely used admin endpoint had started running long. Not crazy long. About eight seconds per call. But someone had kicked off a batch operation that hit that endpoint in a tight loop, and each request held a connection for the full eight seconds. Within a minute, all ten connections in the pool were occupied by slow queries. Every other request in the entire application (login, feed, notifications, everything) was stuck waiting in line behind them.

Ten connections brought down an entire service. That's the thing about connection pools. You never think about them until they ruin your afternoon.

What a Connection Pool Actually Does

If you already know this, skip ahead. But I've been surprised how many backend engineers treat the connection pool as a black box they never open.

Opening a database connection is expensive. There's a TCP handshake. TLS negotiation if you're using SSL. Authentication. Protocol negotiation. On a typical PostgreSQL setup, creating a new connection takes somewhere between 20 and 100 milliseconds. That doesn't sound like much, but if every HTTP request opens and closes a database connection, you're adding that overhead to every single request. At a few hundred requests per second, it adds up fast.

A connection pool solves this by keeping a set of connections open and reusing them. Your application borrows a connection from the pool, runs a query, and returns it. The next request borrows the same connection. No setup cost. No teardown cost. It's one of those abstractions that works so well you forget it exists.
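To make the borrow-and-return cycle concrete, here is a toy sketch of a fixed-size pool in plain Java. TinyPool and its method names are hypothetical, made up for illustration. This is not how HikariCP is implemented, and a real pool also validates, retires, and replaces connections.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

/** Toy fixed-size pool: resources are created once up front, then borrowed and returned. */
final class TinyPool<T> {
    private final BlockingQueue<T> idle;

    TinyPool(int size, Supplier<T> factory) {
        idle = new ArrayBlockingQueue<>(size);
        for (int i = 0; i < size; i++) {
            idle.add(factory.get()); // pay the setup cost once, not per request
        }
    }

    /** Returns a pooled resource, or null if none frees up within the timeout. */
    T borrow(long timeoutMillis) {
        try {
            return idle.poll(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            return null;
        }
    }

    /** Hands the resource back so the next caller can reuse it. */
    void giveBack(T resource) {
        idle.offer(resource);
    }

    int idleCount() {
        return idle.size();
    }
}
```

The part worth noticing is borrow: when every resource is checked out, the caller waits, and if nothing comes back in time it gets nothing. That waiting line is where the trouble starts.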

Until it doesn't.

The Default Configuration Is Almost Never Right

Here's the thing that bit us. We were running HikariCP, which is the default connection pool in Spring Boot, with mostly default settings. The defaults are reasonable for getting started, but they make assumptions about your workload that might not hold.

The default maximum pool size in HikariCP is 10. Ten connections. For a lot of applications, that's actually fine. The math is simpler than people think. If your average query takes 5 milliseconds and you have 10 connections, you can theoretically handle 2,000 queries per second. Most CRUD applications don't come close to that.

But that math only works if your queries are consistently fast. The moment you have one slow query holding a connection for seconds instead of milliseconds, the entire equation breaks. That one connection isn't serving 200 queries per second anymore. It's serving one query every eight seconds. And you only have ten connections total.
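The arithmetic is easy to check by hand, but a small sketch makes the cliff obvious. The class and method names here are hypothetical, and the model is deliberately crude: it assumes every connection is busy all the time, so it gives an upper bound rather than a prediction.

```java
final class PoolThroughput {
    /** Upper bound on queries/second when each connection is held for holdMillis per query. */
    static double maxQueriesPerSecond(int connections, double holdMillis) {
        return connections * (1000.0 / holdMillis);
    }

    /** Same bound when some connections are stuck on slow queries and the rest stay fast. */
    static double mixedQueriesPerSecond(int fastConns, double fastMillis,
                                        int slowConns, double slowMillis) {
        return maxQueriesPerSecond(fastConns, fastMillis)
                + maxQueriesPerSecond(slowConns, slowMillis);
    }
}
```

With these numbers, maxQueriesPerSecond(10, 5) comes out to 2,000, while ten connections each held for 8 seconds manage 1.25 queries per second between them.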

The formula that actually matters is:

connections_needed ≈ requests_per_second × average_connection_hold_time_in_seconds

What counts is how long each request holds the connection, not how long the query itself runs. If your average query takes 5ms but some requests hold connections for 8 seconds, you need to account for that. And most people don't.
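One way to put numbers on this is Little's Law: the average number of connections in use equals the request rate multiplied by the average time each request holds a connection. The sketch below uses hypothetical names and an invented traffic mix (200 requests per second, 1% of them slow), and it only gives an average, so a real pool needs headroom on top for bursts.

```java
final class PoolSizing {
    /** Little's Law: average connections in use = arrival rate x average hold time. */
    static double averageConnectionsInUse(double requestsPerSecond, double avgHoldMillis) {
        return requestsPerSecond * (avgHoldMillis / 1000.0);
    }

    /** Blends fast and slow requests into a single average hold time. */
    static double blendedHoldMillis(double fastMillis, double slowMillis, double slowFraction) {
        return fastMillis * (1.0 - slowFraction) + slowMillis * slowFraction;
    }
}
```

At 200 requests per second and a 5ms hold time, one connection is busy on average. If just 1% of those requests hold a connection for 8 seconds, the blended hold time is 84.95ms and the average demand jumps to about 17 connections, well past a pool of ten.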

More Connections Is Not the Answer

The first instinct when you hit a pool exhaustion issue is to increase the pool size. Just set it to 50. Or 100. Problem solved, right?

No. And this is where it gets counterintuitive.

Your database has its own limits. PostgreSQL defaults to a maximum of 100 connections (max_connections), and a few of those are reserved for superusers. If you have four application instances each with a pool size of 50, you're already at 200 potential connections, double what the database allows. Once the limit is reached, PostgreSQL rejects new connection attempts outright.
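A quick budget check like the following would have flagged that setup before deployment. The names are hypothetical, and the reserved-connection count is an input you would take from your own server settings (3 is used below as an assumed value for superuser_reserved_connections).

```java
final class ConnectionBudget {
    /** Worst-case connections the database could see if every pool fills up. */
    static int worstCaseConnections(int appInstances, int poolSizePerInstance) {
        return appInstances * poolSizePerInstance;
    }

    /** Largest per-instance pool that keeps the whole fleet under the server limit. */
    static int maxPoolSizePerInstance(int maxConnections, int reservedConnections,
                                      int appInstances) {
        return (maxConnections - reservedConnections) / appInstances; // integer division rounds down
    }
}
```

Four instances with pools of 50 give a worst case of 200. Working backwards from a limit of 100 with 3 reserved, each of the four instances can afford at most 24, and that is before counting migrations, cron jobs, and whoever has a psql session open.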

But even before you hit the hard limit, more connections means more problems. Each database connection consumes memory on the server. PostgreSQL allocates a dedicated backend process per connection. At 100 connections, that's 100 processes, each with its own memory allocation for query execution. On a database server with 8 GB of RAM, you can start running into memory pressure well before you hit the connection limit.

There's also contention. More connections means more concurrent queries means more lock contention. If multiple queries are trying to update the same rows, they'll block each other regardless of how many connections you have. Adding connections in this scenario actually makes things worse because you now have more processes all waiting on the same lock, consuming memory and CPU while doing nothing useful.

The HikariCP wiki has a page on pool sizing that I wish I'd read three years earlier. The short version: for most workloads, the right pool size is surprisingly small. Their formula is connections = (core_count × 2) + effective_spindle_count. For a server with 4 cores and an SSD, that's about 9 or 10, right where the default already is.
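Written out as code, the wiki formula is a one-liner. The names below are hypothetical, and two things are assumptions on my part: that the core count means the database server's cores rather than the application host's, and that an SSD is counted as one effective spindle.

```java
final class PoolSizeFormula {
    /** connections = (core_count x 2) + effective_spindle_count */
    static int suggestedPoolSize(int dbCoreCount, int effectiveSpindleCount) {
        return (dbCoreCount * 2) + effectiveSpindleCount;
    }
}
```

For 4 cores and one effective spindle this gives 9. Treat the result as a starting point to load test against, not a final answer.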

More connections is not the answer. Fix why connections are being held too long.

The Silent Killers

Pool exhaustion is dramatic. It's the fire alarm. But there are quieter ways your connection pool can hurt you that are harder to detect.

Connection leaks

This is the classic one. You borrow a connection from the pool but never return it. In raw JDBC, this happens when you forget to close a connection in a finally block. In Spring, it's less common because the framework manages the connection lifecycle, but it still happens: transactional methods that throw an exception in a way that bypasses the transaction manager, manual DataSource access that doesn't go through the template, test code that opens connections without proper cleanup.
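For the raw JDBC case, try-with-resources is the usual way to make the return unconditional. Here is a minimal sketch using only java.sql and javax.sql; LeakSafe, SqlWork, and withConnection are hypothetical names, not part of any library.

```java
import java.sql.Connection;
import java.sql.SQLException;
import javax.sql.DataSource;

final class LeakSafe {
    /** Work to run against a borrowed connection. */
    interface SqlWork<T> {
        T apply(Connection connection) throws SQLException;
    }

    /**
     * Borrows a connection, runs the work, and always gives the connection back:
     * try-with-resources calls close() on success and on exception alike.
     */
    static <T> T withConnection(DataSource dataSource, SqlWork<T> work) throws SQLException {
        try (Connection connection = dataSource.getConnection()) {
            return work.apply(connection);
        }
    }
}
```

With a pooled DataSource, close() hands the connection back to the pool instead of tearing down the socket, so calling it on every path is exactly what keeps the pool full.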

HikariCP has a leakDetectionThreshold property that logs a warning when a connection has been out of the pool for longer than a threshold. I set this to 30 seconds on every project now. It's saved me more than once.

spring.datasource.hikari.leak-detection-threshold=30000

If you see those warnings in your logs, treat them as bugs, not noise.

Stale connections

Connections can go stale. The database restarts. A network blip kills the TCP connection. A firewall closes idle connections after a timeout. Your pool still thinks the connection is valid, but when your application tries to use it, it fails.

HikariCP handles this reasonably well out of the box with its connection validation, but not every pool does. And if you're behind a connection proxy like PgBouncer, the behavior can be different. I've seen setups where PgBouncer evicts idle connections after 5 minutes, but the application pool holds connections for 30 minutes. Every few minutes, a request would randomly fail because it grabbed a dead connection.

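The fix is making sure your pool retires connections before whatever sits between it and the database drops them: keep the pool's idle timeout and max lifetime shorter than the proxy's or firewall's timeout. Sounds obvious in retrospect. It always does.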
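This is the kind of rule that is easy to encode as a startup or CI check. The sketch below uses hypothetical names and a deliberately conservative assumption of mine: an idle connection might sit in the pool for as long as the max lifetime, so both values have to be below the infrastructure timeout.

```java
final class TimeoutOrdering {
    /**
     * True when the pool retires connections before the layer in front of the
     * database (proxy, firewall, load balancer) would silently drop them.
     */
    static boolean poolRetiresFirst(long poolIdleTimeoutMs, long poolMaxLifetimeMs,
                                    long infraIdleTimeoutMs) {
        return poolIdleTimeoutMs < infraIdleTimeoutMs
                && poolMaxLifetimeMs < infraIdleTimeoutMs;
    }
}
```

The broken setup described above (a proxy evicting after 5 minutes, a pool holding connections for 30) fails this check, while a pool with a 2 minute idle timeout and a 4 minute max lifetime passes it.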

Long transactions

This is the one that got us. A connection is held for the entire duration of a transaction. If you have a @Transactional method that calls an external API, that connection is occupied the entire time the HTTP call is in flight. Your query might have taken 2 milliseconds, but the connection is held for 800 milliseconds while you wait for a third-party service to respond.

I've seen this pattern cause pool exhaustion in services that had plenty of capacity for their database load. The database was fine. The queries were fast. But the connections were tied up waiting on network calls that had nothing to do with the database.

The rule is simple: never do non-database I/O inside a transaction unless you absolutely have to. Read from the database. Close the transaction. Call the external service. Open a new transaction if you need to write the result back. Yes, this means you lose atomicity across the whole operation. That's a trade-off you need to think about explicitly, not one to stumble into by accident while holding connections hostage.
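The read, call, write sequence can be sketched in plain Java. TxRunner here is a hypothetical stand-in for whatever opens and commits a transaction in your stack (in Spring, a TransactionTemplate or @Transactional methods on a separate bean could play that role); none of these names come from a real library.

```java
import java.util.function.Consumer;
import java.util.function.Function;
import java.util.function.Supplier;

final class SplitTransaction {
    /** Stand-in for a transaction manager: runs work while a connection is held. */
    interface TxRunner {
        <T> T inTransaction(Supplier<T> work);
    }

    /**
     * Reads in one short transaction, calls the external service with no
     * connection held, then writes the result in a second short transaction.
     */
    static <I, O> void readCallWrite(TxRunner tx,
                                     Supplier<I> readFromDb,
                                     Function<I, O> callExternalService,
                                     Consumer<O> writeToDb) {
        I input = tx.inTransaction(readFromDb);        // connection held briefly
        O result = callExternalService.apply(input);   // no connection held here
        tx.inTransaction(() -> {                       // second, separate transaction
            writeToDb.accept(result);
            return null;
        });
    }
}
```

The cost is visible in the structure: if the second transaction fails, the first one has already committed, so the write needs to be safe to retry or the read needs to be safe to repeat.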

Monitoring That Actually Helps

After our incident, we added monitoring that I now consider mandatory for any production service with a database.

Active connections. How many connections in the pool are currently in use. If this number is consistently close to your maximum, you're living on the edge. If it spikes and stays high, something is holding connections too long.

Pending requests. How many threads are waiting for a connection. This should normally be zero. If it's not zero, you're already in trouble. If it's growing, you're about to page someone.

Connection acquisition time. How long it takes to get a connection from the pool. This should be under a millisecond. If it's in the hundreds of milliseconds, the pool is under pressure. If it's hitting your connection timeout, requests are failing.

Connection usage time. How long each connection is held before being returned. This is your best indicator for spotting long-running transactions or connection leaks before they become incidents.

HikariCP exposes all of these through JMX and Micrometer. If you're using Spring Boot with Actuator and Prometheus, it's about five minutes of configuration.

management.metrics.enable.hikaricp=true

Set up alerts on pending requests greater than zero and connection acquisition time above 100ms. These two metrics alone would have caught our incident before it became an outage.
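In practice these alerts live in your monitoring system as rules over the exported metrics, but the logic is small enough to state directly. The sketch below uses hypothetical names and takes the raw numbers as plain arguments rather than reading them from HikariCP.

```java
final class PoolAlerts {
    /** Thresholds from the text: any waiting thread, or acquisition slower than 100 ms. */
    static boolean shouldAlert(int pendingThreads, double acquireMillis) {
        return pendingThreads > 0 || acquireMillis > 100.0;
    }

    /** Fraction of the pool in use; values near 1.0 mean there is little headroom left. */
    static double utilization(int activeConnections, int maxPoolSize) {
        return (double) activeConnections / maxPoolSize;
    }
}
```

A healthy pool (no pending threads, sub-millisecond acquisition) stays quiet, and a single waiting thread is enough to fire.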

PgBouncer and Connection Proxies

Once you scale to multiple application instances, you'll probably end up looking at a connection proxy like PgBouncer. The idea is sound: instead of each application instance maintaining its own pool of connections to the database, they all connect to PgBouncer, which maintains a smaller pool of actual database connections and multiplexes.

This works well, but it adds another layer of pool management with its own configuration and its own failure modes. Now you have two pools. The application pool and the proxy pool. Each has its own maximum sizes, timeouts, and eviction policies. And they need to be configured in harmony.

I've seen setups where the application pool holds connections longer than PgBouncer's server idle timeout, causing random connection resets. I've seen transaction-mode PgBouncer break prepared statements because the connection you prepared the statement on isn't the connection that executes it. I've seen PgBouncer run out of connections because the application pool was sized too large.

A proxy doesn't eliminate the pool sizing problem. It moves it. You still need to understand the math. You just have two layers of math now.
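Here is a rough sketch of what the two layers of math look like as a sanity check. All names are hypothetical, and the parameters are generic stand-ins for whatever your proxy calls its client limit and server-side pool size.

```java
final class ProxyMath {
    /** Client connections the proxy must accept if every application pool fills up. */
    static int clientConnectionsNeeded(int appInstances, int appPoolSize) {
        return appInstances * appPoolSize;
    }

    /** True when apps fit within the proxy's client limit and the proxy fits within the database limit. */
    static boolean layersFit(int appInstances, int appPoolSize,
                             int proxyMaxClients, int proxyServerPoolSize,
                             int dbMaxConnections) {
        return clientConnectionsNeeded(appInstances, appPoolSize) <= proxyMaxClients
                && proxyServerPoolSize <= dbMaxConnections;
    }
}
```

Eight instances with pools of 10 fit behind a proxy that accepts 100 clients and keeps 20 server connections to a database that allows 100. Bump the application pools to 50 and the first layer breaks, even though the database never sees more than 20 connections.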

What I Do Now

Every Spring Boot project I start now gets the same connection pool configuration on day one. Not because I've figured out the perfect settings, but because the defaults leave too many things unmonitored.

# Pool sizing (start conservative)
spring.datasource.hikari.maximum-pool-size=10
spring.datasource.hikari.minimum-idle=5

# Timeouts (fail fast rather than hang)
spring.datasource.hikari.connection-timeout=5000
spring.datasource.hikari.validation-timeout=3000

# Leak detection
spring.datasource.hikari.leak-detection-threshold=30000

# Idle timeout and max lifetime
spring.datasource.hikari.idle-timeout=300000
spring.datasource.hikari.max-lifetime=900000

The connection timeout is the important one. The default is 30 seconds. That means a request will hang for 30 seconds waiting for a connection before failing. In most applications, if you can't get a connection in 5 seconds, something is already wrong, and the user isn't going to wait 30 seconds anyway. Fail fast. Return a useful error and let the monitoring tell you why.

I also review every @Transactional method for external calls. If a transactional method calls an HTTP endpoint, sends an email, publishes to a message queue, or does any non-database I/O, that's a code review comment. Every time. No exceptions.

The Boring Stuff Matters

Connection pools aren't exciting. Nobody builds a career talking about HikariCP configuration at conferences. There's no trending blog post about maximum-pool-size settings. It's the kind of infrastructure that only gets attention when it breaks.

But here's the thing. When it breaks, it takes everything with it. Not one endpoint. Not one feature. Everything. Because every part of your application that touches the database shares the same pool. A slow query in your admin panel can bring down your user-facing API. A connection leak in a background job can kill your checkout flow.

The pool is a shared resource, and shared resources are where the scariest production incidents live. They're single points of failure hiding in plain sight.

Spend an hour understanding your connection pool configuration. Set up the monitoring. Add the leak detection. Review your transaction boundaries. It's the most boring hour you'll spend this month, and it might save you from the worst on-call night of your year.


Umur Inan

Principal Software Engineer

Backend engineer focused on JVM systems, distributed architecture, and the failure modes that only show up in production. I write about what I learn building and breaking things at scale.
