I have spent more 3am hours debugging connection pool exhaustion than any other production issue. Every time I think I have it figured out, a new pattern shows up. This tutorial is the playbook I wish I had three years ago: Spring Boot 4 + PostgreSQL + HikariCP + pgbouncer, configured the way services that stay up are configured.
All the code lives at github.com/umur/hikaricp-production-tuning: a runnable Spring Boot 4 service, a docker-compose with PostgreSQL + pgbouncer + Prometheus + Grafana, a k6 load script, a Prometheus alert rule, and a Testcontainers integration test that proves the fix does not pin the pool.
Specifically: how to size the pool using Little's Law instead of guesswork, which HikariCP properties matter and which are noise, what to monitor and what those metrics mean during an incident, and what to do when PostgreSQL caps your connection budget before your traffic does.
The Setup
Assume a Spring Boot 4.0.6 service on Java 21 that talks to PostgreSQL 17. HikariCP has been the default DataSource since Spring Boot 2.x and remains the default in 4.x. The HikariCP version pulled in by Spring Boot 4.0.6 is 6.x.
<parent>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-parent</artifactId>
<version>4.0.6</version>
</parent>
<dependencies>
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-data-jpa</artifactId>
</dependency>
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-actuator</artifactId>
</dependency>
<dependency>
<groupId>io.micrometer</groupId>
<artifactId>micrometer-registry-prometheus</artifactId>
</dependency>
<dependency>
<groupId>org.postgresql</groupId>
<artifactId>postgresql</artifactId>
</dependency>
</dependencies>Starting properties in application.yml:
spring:
datasource:
url: jdbc:postgresql://db:5432/app
username: app
password: ${DB_PASSWORD}
hikari:
pool-name: app-hikari
maximum-pool-size: 20
minimum-idle: 20
connection-timeout: 5000
idle-timeout: 300000
max-lifetime: 1800000
keepalive-time: 60000
leak-detection-threshold: 0
data-source-properties:
prepareThreshold: 5
socketTimeout: 30
tcpKeepAlive: trueThese are the defaults I start with. The rest of the tutorial earns each value.
Sizing the Pool With Little's Law
Most teams size the pool by guess: pick 50, see if it works. The right way is Little's Law, applied to database queries:
pool_size = throughput_qps * avg_query_time_secondsIf your application runs 200 queries per second and the average query takes 25ms, you need 200 * 0.025 = 5 connections, steady state. Add 50 percent headroom for variance and bursts. Round up. So 8 connections. Not 50.
The trap is that average query time includes lock waits, network latency, and the time the connection is held while the application does non-database work between queries in the same transaction. Measure it from the pool's perspective, not from EXPLAIN ANALYZE.
HikariCP exposes this directly:
HikariPoolMXBean pool = ((HikariDataSource) dataSource).getHikariPoolMXBean();
int active = pool.getActiveConnections();
int total = pool.getTotalConnections();
int pending = pool.getThreadsAwaitingConnection();What you actually measure: the 99th percentile of active over a 10-minute window during peak. Set maximum-pool-size to that number plus 30 percent. If your p99 active is 14, set max to 18.
Why not bigger? Each PostgreSQL connection costs about 10MB of backend memory. 100 connections is a gig. Each connection also holds parser state, planner state, and a backend process. More connections increase contention on shared structures inside PostgreSQL, especially the lock manager and the buffer pool. The number that survives load is smaller than you think.
Why minimum-idle Should Equal maximum-pool-size
HikariCP's default sets minimum-idle equal to maximum-pool-size. Some teams override this to a smaller number to "save resources when idle." This is almost always wrong in production.
The reason: connection acquisition is slow. A fresh TCP connection plus a PostgreSQL backend fork plus SSL handshake plus authentication is 50 to 200ms. If your pool shrinks during a quiet minute and then traffic spikes, the first wave of requests waits hundreds of milliseconds for new connections. You will see this as a latency cliff at the start of every traffic spike.
Keep minimum-idle = maximum-pool-size. The pool is a fixed-size resource. Treat it that way.
connection-timeout, idle-timeout, max-lifetime
These three properties confuse everyone. Here is what they actually do.
connection-timeout: how long the application waits for a connection from the pool before giving up. Default is 30s. Set this to 5s in production. If your pool is exhausted for 30s, the right response is to fail fast and shed load, not wait. The 30s default exists because HikariCP does not want to surprise developers. In production, 30s of accumulated waits means a thread pool stuffed with stuck threads and a cascading failure.
idle-timeout: how long an idle connection stays in the pool before being closed. Only relevant if minimum-idle is less than maximum-pool-size. With my recommendation (equal), idle-timeout does nothing. Set it to 5 minutes anyway for sanity.
max-lifetime: the maximum total age of a connection before HikariCP retires it. Default 30 minutes. Of the three, this one matters most. Set it to a value smaller than any infrastructure timeout: AWS RDS proxy session timeout, your load balancer, your pgbouncer server_idle_timeout. The default 30 minutes is fine for most setups. If you use pgbouncer or HAProxy in front, drop to 15 to 25 minutes.
Why it matters: stale connections cause silent failures. The pool hands a thread a connection that the database (or a middlebox) has already closed. The next query throws a confusing error. By forcing rotation younger than any upstream idle reaper, you guarantee HikariCP retires connections before they go bad.
keepalive-time: A HikariCP 6.x Property Worth Knowing
HikariCP 6.x added keepalive-time. It runs the connection-test query on each idle connection at this interval. Set to 60 seconds. This catches connections killed by NAT translation tables, load balancer idle timeouts, or sudden network blips. Without keepalive-time, idle connections silently die and your next query against them fails.
leak-detection-threshold: Turn It On in Staging
Set leak-detection-threshold to 5000 (5 seconds) in staging. If any code path holds a connection longer than 5s without returning it, HikariCP logs the stack trace of where it was acquired. This catches code that does HTTP calls, file I/O, or external API calls while still holding a database connection.
In production, leave it at 0 (disabled) or set it to 30000. Detection itself adds a small bookkeeping cost per acquire and release.
The @Transactional Anti-Pattern That Eats Your Pool
An HTTP call inside a @Transactional method is the most common code path that trips leak-detection-threshold. A database connection stays open for the entire round trip, blocking other threads from acquiring it.
// ANTI-PATTERN: external call inside @Transactional
@Service
public class OrderService {
private final OrderRepository orders;
private final PricingClient pricing;
@Transactional
public Order createOrder(OrderRequest request) {
Order order = orders.save(request.toOrder());
// Holds the DB connection while waiting on HTTP.
// A 2-second round-trip means 2 seconds the connection is
// unavailable to anyone else.
BigDecimal price = pricing.calculate(order.getItems());
order.setPrice(price);
return orders.save(order);
}
}Under 100 requests per second with a 2-second pricing call, this single method occupies 200 connections on average (Little's Law again). A 20-connection pool dies in the first minute of real traffic. Move the external call out of the transaction:
// FIX: external call before the transaction starts
@Service
public class OrderService {
private final OrderRepository orders;
private final PricingClient pricing;
public Order createOrder(OrderRequest request) {
BigDecimal price = pricing.calculate(request.getItems());
return persist(request, price);
}
@Transactional
protected Order persist(OrderRequest request, BigDecimal price) {
Order order = request.toOrder();
order.setPrice(price);
return orders.save(order);
}
}The transaction now lasts as long as the database work and no longer. leak-detection-threshold at 5000 in staging catches the anti-pattern the first time integration tests hit a service with a slow downstream.
Monitoring: What to Actually Watch
HikariCP exposes JMX beans and Micrometer metrics. With spring-boot-starter-actuator + a Micrometer registry, you get all of these for free:
hikaricp.connections.active # in use right now
hikaricp.connections.idle # idle in the pool
hikaricp.connections.pending # threads waiting for a connection
hikaricp.connections.timeout # acquires that timed out (counter)
hikaricp.connections.acquire # time to acquire (histogram)
hikaricp.connections.usage # how long a connection was held (histogram)What each signal means during an incident:
hikaricp.connections.pending> 0 for more than a few seconds means your pool is too small or your queries are too slow. Fix the queries or scale the pool, but only after understanding why.hikaricp.connections.usagep99 spike means transactions are holding connections longer than usual. Look for a lock, a slow downstream service called inside a transaction, or a recent change that added work inside@Transactional.hikaricp.connections.acquirep99 spike withpending> 0 is the classic exhaustion signature. Your application threads are queued waiting for connections.
Set an alert on pending > 0 sustained for 30 seconds. It is the earliest signal that pool capacity is the bottleneck.
Reproducing Pool Exhaustion: Alerts and Dashboard
The monitoring section above is theory. Here is a 60-second setup that lets you see saturation happen in real time. k6 produces the load. Prometheus stores the metrics. Grafana draws the lines. An alert rule fires once pending stays positive.
k6 spike script:
// loadtest.js
import http from 'k6/http';
import { check } from 'k6';
export const options = {
scenarios: {
spike: {
executor: 'ramping-vus',
startVUs: 0,
stages: [
{ duration: '30s', target: 50 },
{ duration: '2m', target: 50 },
{ duration: '10s', target: 0 },
],
},
},
};
export default function () {
const res = http.get('http://localhost:8080/api/orders/recent');
check(res, { 'status 200': (r) => r.status === 200 });
}Prometheus alert rule:
# alerts.yml
groups:
- name: hikaricp
rules:
- alert: HikariPoolSaturated
expr: hikaricp_connections_pending{pool="app-hikari"} > 0
for: 30s
labels:
severity: page
annotations:
summary: "HikariCP pool app-hikari has threads waiting for connections"
description: "{{ $value }} threads pending for 30s. Likely pool exhaustion or query slowdown."Minimal Grafana panel that puts the three load-bearing signals on one chart:
{
"title": "HikariCP active vs pending vs idle",
"type": "timeseries",
"targets": [
{ "expr": "hikaricp_connections_active{pool=\"app-hikari\"}", "legendFormat": "active" },
{ "expr": "hikaricp_connections_pending{pool=\"app-hikari\"}", "legendFormat": "pending" },
{ "expr": "hikaricp_connections_idle{pool=\"app-hikari\"}", "legendFormat": "idle" }
]
}Run the k6 script against a service with maximum-pool-size: 10 and you will watch active climb to 10, pending spike above zero, and the alert fire within 30 seconds. Re-run with maximum-pool-size: 30 and the queue stays empty. The chart makes the difference visible in a way that staring at raw numbers does not.
When PostgreSQL Caps You: pgbouncer
PostgreSQL defaults to max_connections = 100. Each backend process is expensive. If you run 8 service instances with a pool of 20 each, that is 160 connections, already 60 over budget. Bumping max_connections to 500 burns memory and degrades PostgreSQL performance under contention.
The answer is pgbouncer in transaction-pooling mode. pgbouncer multiplexes many client connections onto few real PostgreSQL backends. Your application sees a giant connection pool. PostgreSQL sees 20.
# /etc/pgbouncer/pgbouncer.ini
[databases]
app = host=postgres-primary port=5432 dbname=app
[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
auth_type = scram-sha-256
auth_file = /etc/pgbouncer/userlist.txt
pool_mode = transaction
max_client_conn = 2000
default_pool_size = 20
reserve_pool_size = 5
reserve_pool_timeout = 3
server_idle_timeout = 600
server_lifetime = 3600
query_timeout = 30
ignore_startup_parameters = extra_float_digitsuserlist.txt holds the SCRAM-hashed credentials pgbouncer authenticates clients against:
# /etc/pgbouncer/userlist.txt
"app" "SCRAM-SHA-256$4096:<salt>$<stored_key>:<server_key>"Get the hash by running SELECT rolname, rolpassword FROM pg_authid WHERE rolname = 'app'; in PostgreSQL after creating the user with a password.
The application now connects to pgbouncer on port 6432. HikariCP's pool can stay at 20 per instance. pgbouncer keeps 20 real PostgreSQL connections total no matter how many app instances run.
The trade you make in transaction-pooling mode: you cannot use session-level features. Prepared statements do not survive across transactions. Advisory locks, LISTEN/NOTIFY, and temp tables are out. SET commands need a matching RESET or they leak. Most modern Spring Boot applications do not use these. Verify by switching to transaction-pool mode in staging and running the integration tests.
The Prepared Statement Trap
The JDBC driver caches prepared statements per connection by default. In transaction-pooling mode, the next transaction may run on a different real backend that has not cached that statement. PostgreSQL throws cached plan must not change result type and similar errors.
Two fixes. First, disable server-side prepared statement caching in the JDBC driver:
spring:
datasource:
url: jdbc:postgresql://pgbouncer:6432/app?prepareThreshold=0This pushes PostgreSQL back to planning every query. Modern PostgreSQL plan caching at the SQL layer handles this fine for most workloads.
Second, if you have heavy analytic queries that genuinely benefit from server-side prepared statements, route those to a read replica without pgbouncer in front, and keep the OLTP traffic on pgbouncer.
Running It All Locally with Docker Compose
One file spins up PostgreSQL, pgbouncer, the Spring Boot service, Prometheus, and Grafana. Names are project-prefixed (hikaricp-demo-*) so the stack does not collide with anything else on the same Docker host.
# docker-compose.yml
name: hikaricp-demo
services:
hikaricp-demo-postgres:
image: postgres:17-alpine
container_name: hikaricp-demo-postgres
environment:
POSTGRES_USER: app
POSTGRES_PASSWORD: app
POSTGRES_DB: app
ports:
- "5432:5432"
healthcheck:
test: ["CMD", "pg_isready", "-U", "app"]
interval: 5s
timeout: 3s
retries: 10
hikaricp-demo-pgbouncer:
image: edoburu/pgbouncer:latest
container_name: hikaricp-demo-pgbouncer
environment:
DATABASE_URL: postgres://app:app@hikaricp-demo-postgres:5432/app
POOL_MODE: transaction
MAX_CLIENT_CONN: 2000
DEFAULT_POOL_SIZE: 20
AUTH_TYPE: scram-sha-256
ports:
- "6432:6432"
depends_on:
hikaricp-demo-postgres:
condition: service_healthy
hikaricp-demo-app:
image: hikaricp-demo/app:dev
container_name: hikaricp-demo-app
build:
context: ./app
environment:
SPRING_DATASOURCE_URL: jdbc:postgresql://hikaricp-demo-pgbouncer:6432/app?prepareThreshold=0
SPRING_DATASOURCE_USERNAME: app
SPRING_DATASOURCE_PASSWORD: app
MANAGEMENT_ENDPOINTS_WEB_EXPOSURE_INCLUDE: prometheus,health
ports:
- "8080:8080"
depends_on:
- hikaricp-demo-pgbouncer
hikaricp-demo-prometheus:
image: prom/prometheus:latest
container_name: hikaricp-demo-prometheus
volumes:
- ./prometheus.yml:/etc/prometheus/prometheus.yml
- ./alerts.yml:/etc/prometheus/alerts.yml
ports:
- "9090:9090"
hikaricp-demo-grafana:
image: grafana/grafana:latest
container_name: hikaricp-demo-grafana
environment:
GF_SECURITY_ADMIN_PASSWORD: admin
ports:
- "3000:3000"
depends_on:
- hikaricp-demo-prometheusMatching prometheus.yml:
# prometheus.yml
global:
scrape_interval: 5s
rule_files:
- alerts.yml
scrape_configs:
- job_name: app
metrics_path: /actuator/prometheus
static_configs:
- targets: ['hikaricp-demo-app:8080']docker compose up brings the whole stack online. k6 run loadtest.js from another terminal drives the pool. Watch Grafana at localhost:3000 (admin/admin) and the alert at localhost:9090/alerts.
The Pre-Flight Checklist
Before shipping a service to production with HikariCP, walk this list:
maximum-pool-sizesized via Little's Law, not guessedminimum-idle=maximum-pool-sizeconnection-timeout= 5000max-lifetime= 1800000 or shorter than any middlebox timeoutkeepalive-time= 60000leak-detection-threshold= 5000 in staging, 0 or 30000 in production- Micrometer metrics for
hikaricp.connections.*wired to Prometheus (or your equivalent) - Alert on
hikaricp.connections.pending> 0 for > 30 seconds - If running 4 or more service instances against a single PostgreSQL, pgbouncer in transaction-pooling mode
- If using pgbouncer,
prepareThreshold=0in the JDBC URL
None of this is fancy. All of it is the difference between a service that scales linearly and a service that falls over the first time traffic doubles.