← Back to Tutorials

HikariCP in Production: Tuning, Monitoring, and pgbouncer

Tune HikariCP for production Spring Boot 4 services on Postgres. Little's Law sizing, the properties that matter, monitoring signals, pgbouncer fronting.

I have spent more 3am hours debugging connection pool exhaustion than any other production issue. Every time I think I have it figured out, a new pattern shows up. This tutorial is the playbook I wish I had three years ago: Spring Boot 4 + PostgreSQL + HikariCP + pgbouncer, configured the way services that stay up are configured.

All the code lives at github.com/umur/hikaricp-production-tuning: a runnable Spring Boot 4 service, a docker-compose with PostgreSQL + pgbouncer + Prometheus + Grafana, a k6 load script, a Prometheus alert rule, and a Testcontainers integration test that proves the fix does not pin the pool.

Specifically: how to size the pool using Little's Law instead of guesswork, which HikariCP properties matter and which are noise, what to monitor and what those metrics mean during an incident, and what to do when PostgreSQL caps your connection budget before your traffic does.

The Setup

Assume a Spring Boot 4.0.6 service on Java 21 that talks to PostgreSQL 17. HikariCP has been the default DataSource since Spring Boot 2.x and remains the default in 4.x. The HikariCP version pulled in by Spring Boot 4.0.6 is 6.x.

<parent>
  <groupId>org.springframework.boot</groupId>
  <artifactId>spring-boot-starter-parent</artifactId>
  <version>4.0.6</version>
</parent>

<dependencies>
  <dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-data-jpa</artifactId>
  </dependency>
  <dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-actuator</artifactId>
  </dependency>
  <dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-registry-prometheus</artifactId>
  </dependency>
  <dependency>
    <groupId>org.postgresql</groupId>
    <artifactId>postgresql</artifactId>
  </dependency>
</dependencies>

Starting properties in application.yml:

spring:
  datasource:
    url: jdbc:postgresql://db:5432/app
    username: app
    password: ${DB_PASSWORD}
    hikari:
      pool-name: app-hikari
      maximum-pool-size: 20
      minimum-idle: 20
      connection-timeout: 5000
      idle-timeout: 300000
      max-lifetime: 1800000
      keepalive-time: 60000
      leak-detection-threshold: 0
      data-source-properties:
        prepareThreshold: 5
        socketTimeout: 30
        tcpKeepAlive: true

These are the defaults I start with. The rest of the tutorial earns each value.

Sizing the Pool With Little's Law

Most teams size the pool by guess: pick 50, see if it works. The right way is Little's Law, applied to database queries:

pool_size = throughput_qps * avg_query_time_seconds

If your application runs 200 queries per second and the average query takes 25ms, you need 200 * 0.025 = 5 connections, steady state. Add 50 percent headroom for variance and bursts. Round up. So 8 connections. Not 50.

The trap is that average query time includes lock waits, network latency, and the time the connection is held while the application does non-database work between queries in the same transaction. Measure it from the pool's perspective, not from EXPLAIN ANALYZE.

HikariCP exposes this directly:

HikariPoolMXBean pool = ((HikariDataSource) dataSource).getHikariPoolMXBean();
int active = pool.getActiveConnections();
int total = pool.getTotalConnections();
int pending = pool.getThreadsAwaitingConnection();

What you actually measure: the 99th percentile of active over a 10-minute window during peak. Set maximum-pool-size to that number plus 30 percent. If your p99 active is 14, set max to 18.

Why not bigger? Each PostgreSQL connection costs about 10MB of backend memory. 100 connections is a gig. Each connection also holds parser state, planner state, and a backend process. More connections increase contention on shared structures inside PostgreSQL, especially the lock manager and the buffer pool. The number that survives load is smaller than you think.

Why minimum-idle Should Equal maximum-pool-size

HikariCP's default sets minimum-idle equal to maximum-pool-size. Some teams override this to a smaller number to "save resources when idle." This is almost always wrong in production.

The reason: connection acquisition is slow. A fresh TCP connection plus a PostgreSQL backend fork plus SSL handshake plus authentication is 50 to 200ms. If your pool shrinks during a quiet minute and then traffic spikes, the first wave of requests waits hundreds of milliseconds for new connections. You will see this as a latency cliff at the start of every traffic spike.

Keep minimum-idle = maximum-pool-size. The pool is a fixed-size resource. Treat it that way.

connection-timeout, idle-timeout, max-lifetime

These three properties confuse everyone. Here is what they actually do.

connection-timeout: how long the application waits for a connection from the pool before giving up. Default is 30s. Set this to 5s in production. If your pool is exhausted for 30s, the right response is to fail fast and shed load, not wait. The 30s default exists because HikariCP does not want to surprise developers. In production, 30s of accumulated waits means a thread pool stuffed with stuck threads and a cascading failure.

idle-timeout: how long an idle connection stays in the pool before being closed. Only relevant if minimum-idle is less than maximum-pool-size. With my recommendation (equal), idle-timeout does nothing. Set it to 5 minutes anyway for sanity.

max-lifetime: the maximum total age of a connection before HikariCP retires it. Default 30 minutes. Of the three, this one matters most. Set it to a value smaller than any infrastructure timeout: AWS RDS proxy session timeout, your load balancer, your pgbouncer server_idle_timeout. The default 30 minutes is fine for most setups. If you use pgbouncer or HAProxy in front, drop to 15 to 25 minutes.

Why it matters: stale connections cause silent failures. The pool hands a thread a connection that the database (or a middlebox) has already closed. The next query throws a confusing error. By forcing rotation younger than any upstream idle reaper, you guarantee HikariCP retires connections before they go bad.

keepalive-time: A HikariCP 6.x Property Worth Knowing

HikariCP 6.x added keepalive-time. It runs the connection-test query on each idle connection at this interval. Set to 60 seconds. This catches connections killed by NAT translation tables, load balancer idle timeouts, or sudden network blips. Without keepalive-time, idle connections silently die and your next query against them fails.

leak-detection-threshold: Turn It On in Staging

Set leak-detection-threshold to 5000 (5 seconds) in staging. If any code path holds a connection longer than 5s without returning it, HikariCP logs the stack trace of where it was acquired. This catches code that does HTTP calls, file I/O, or external API calls while still holding a database connection.

In production, leave it at 0 (disabled) or set it to 30000. Detection itself adds a small bookkeeping cost per acquire and release.

The @Transactional Anti-Pattern That Eats Your Pool

An HTTP call inside a @Transactional method is the most common code path that trips leak-detection-threshold. A database connection stays open for the entire round trip, blocking other threads from acquiring it.

// ANTI-PATTERN: external call inside @Transactional
@Service
public class OrderService {

    private final OrderRepository orders;
    private final PricingClient pricing;

    @Transactional
    public Order createOrder(OrderRequest request) {
        Order order = orders.save(request.toOrder());
        // Holds the DB connection while waiting on HTTP.
        // A 2-second round-trip means 2 seconds the connection is
        // unavailable to anyone else.
        BigDecimal price = pricing.calculate(order.getItems());
        order.setPrice(price);
        return orders.save(order);
    }
}

Under 100 requests per second with a 2-second pricing call, this single method occupies 200 connections on average (Little's Law again). A 20-connection pool dies in the first minute of real traffic. Move the external call out of the transaction:

// FIX: external call before the transaction starts
@Service
public class OrderService {

    private final OrderRepository orders;
    private final PricingClient pricing;

    public Order createOrder(OrderRequest request) {
        BigDecimal price = pricing.calculate(request.getItems());
        return persist(request, price);
    }

    @Transactional
    protected Order persist(OrderRequest request, BigDecimal price) {
        Order order = request.toOrder();
        order.setPrice(price);
        return orders.save(order);
    }
}

The transaction now lasts as long as the database work and no longer. leak-detection-threshold at 5000 in staging catches the anti-pattern the first time integration tests hit a service with a slow downstream.

Monitoring: What to Actually Watch

HikariCP exposes JMX beans and Micrometer metrics. With spring-boot-starter-actuator + a Micrometer registry, you get all of these for free:

hikaricp.connections.active   # in use right now
hikaricp.connections.idle     # idle in the pool
hikaricp.connections.pending  # threads waiting for a connection
hikaricp.connections.timeout  # acquires that timed out (counter)
hikaricp.connections.acquire  # time to acquire (histogram)
hikaricp.connections.usage    # how long a connection was held (histogram)

What each signal means during an incident:

  • hikaricp.connections.pending > 0 for more than a few seconds means your pool is too small or your queries are too slow. Fix the queries or scale the pool, but only after understanding why.
  • hikaricp.connections.usage p99 spike means transactions are holding connections longer than usual. Look for a lock, a slow downstream service called inside a transaction, or a recent change that added work inside @Transactional.
  • hikaricp.connections.acquire p99 spike with pending > 0 is the classic exhaustion signature. Your application threads are queued waiting for connections.

Set an alert on pending > 0 sustained for 30 seconds. It is the earliest signal that pool capacity is the bottleneck.

Reproducing Pool Exhaustion: Alerts and Dashboard

The monitoring section above is theory. Here is a 60-second setup that lets you see saturation happen in real time. k6 produces the load. Prometheus stores the metrics. Grafana draws the lines. An alert rule fires once pending stays positive.

k6 spike script:

// loadtest.js
import http from 'k6/http';
import { check } from 'k6';

export const options = {
  scenarios: {
    spike: {
      executor: 'ramping-vus',
      startVUs: 0,
      stages: [
        { duration: '30s', target: 50 },
        { duration: '2m',  target: 50 },
        { duration: '10s', target: 0 },
      ],
    },
  },
};

export default function () {
  const res = http.get('http://localhost:8080/api/orders/recent');
  check(res, { 'status 200': (r) => r.status === 200 });
}

Prometheus alert rule:

# alerts.yml
groups:
  - name: hikaricp
    rules:
      - alert: HikariPoolSaturated
        expr: hikaricp_connections_pending{pool="app-hikari"} > 0
        for: 30s
        labels:
          severity: page
        annotations:
          summary: "HikariCP pool app-hikari has threads waiting for connections"
          description: "{{ $value }} threads pending for 30s. Likely pool exhaustion or query slowdown."

Minimal Grafana panel that puts the three load-bearing signals on one chart:

{
  "title": "HikariCP active vs pending vs idle",
  "type": "timeseries",
  "targets": [
    { "expr": "hikaricp_connections_active{pool=\"app-hikari\"}",  "legendFormat": "active"  },
    { "expr": "hikaricp_connections_pending{pool=\"app-hikari\"}", "legendFormat": "pending" },
    { "expr": "hikaricp_connections_idle{pool=\"app-hikari\"}",    "legendFormat": "idle"    }
  ]
}

Run the k6 script against a service with maximum-pool-size: 10 and you will watch active climb to 10, pending spike above zero, and the alert fire within 30 seconds. Re-run with maximum-pool-size: 30 and the queue stays empty. The chart makes the difference visible in a way that staring at raw numbers does not.

When PostgreSQL Caps You: pgbouncer

PostgreSQL defaults to max_connections = 100. Each backend process is expensive. If you run 8 service instances with a pool of 20 each, that is 160 connections, already 60 over budget. Bumping max_connections to 500 burns memory and degrades PostgreSQL performance under contention.

The answer is pgbouncer in transaction-pooling mode. pgbouncer multiplexes many client connections onto few real PostgreSQL backends. Your application sees a giant connection pool. PostgreSQL sees 20.

# /etc/pgbouncer/pgbouncer.ini
[databases]
app = host=postgres-primary port=5432 dbname=app

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
auth_type = scram-sha-256
auth_file = /etc/pgbouncer/userlist.txt
pool_mode = transaction
max_client_conn = 2000
default_pool_size = 20
reserve_pool_size = 5
reserve_pool_timeout = 3
server_idle_timeout = 600
server_lifetime = 3600
query_timeout = 30
ignore_startup_parameters = extra_float_digits

userlist.txt holds the SCRAM-hashed credentials pgbouncer authenticates clients against:

# /etc/pgbouncer/userlist.txt
"app" "SCRAM-SHA-256$4096:<salt>$<stored_key>:<server_key>"

Get the hash by running SELECT rolname, rolpassword FROM pg_authid WHERE rolname = 'app'; in PostgreSQL after creating the user with a password.

The application now connects to pgbouncer on port 6432. HikariCP's pool can stay at 20 per instance. pgbouncer keeps 20 real PostgreSQL connections total no matter how many app instances run.

The trade you make in transaction-pooling mode: you cannot use session-level features. Prepared statements do not survive across transactions. Advisory locks, LISTEN/NOTIFY, and temp tables are out. SET commands need a matching RESET or they leak. Most modern Spring Boot applications do not use these. Verify by switching to transaction-pool mode in staging and running the integration tests.

The Prepared Statement Trap

The JDBC driver caches prepared statements per connection by default. In transaction-pooling mode, the next transaction may run on a different real backend that has not cached that statement. PostgreSQL throws cached plan must not change result type and similar errors.

Two fixes. First, disable server-side prepared statement caching in the JDBC driver:

spring:
  datasource:
    url: jdbc:postgresql://pgbouncer:6432/app?prepareThreshold=0

This pushes PostgreSQL back to planning every query. Modern PostgreSQL plan caching at the SQL layer handles this fine for most workloads.

Second, if you have heavy analytic queries that genuinely benefit from server-side prepared statements, route those to a read replica without pgbouncer in front, and keep the OLTP traffic on pgbouncer.

Running It All Locally with Docker Compose

One file spins up PostgreSQL, pgbouncer, the Spring Boot service, Prometheus, and Grafana. Names are project-prefixed (hikaricp-demo-*) so the stack does not collide with anything else on the same Docker host.

# docker-compose.yml
name: hikaricp-demo

services:
  hikaricp-demo-postgres:
    image: postgres:17-alpine
    container_name: hikaricp-demo-postgres
    environment:
      POSTGRES_USER: app
      POSTGRES_PASSWORD: app
      POSTGRES_DB: app
    ports:
      - "5432:5432"
    healthcheck:
      test: ["CMD", "pg_isready", "-U", "app"]
      interval: 5s
      timeout: 3s
      retries: 10

  hikaricp-demo-pgbouncer:
    image: edoburu/pgbouncer:latest
    container_name: hikaricp-demo-pgbouncer
    environment:
      DATABASE_URL: postgres://app:app@hikaricp-demo-postgres:5432/app
      POOL_MODE: transaction
      MAX_CLIENT_CONN: 2000
      DEFAULT_POOL_SIZE: 20
      AUTH_TYPE: scram-sha-256
    ports:
      - "6432:6432"
    depends_on:
      hikaricp-demo-postgres:
        condition: service_healthy

  hikaricp-demo-app:
    image: hikaricp-demo/app:dev
    container_name: hikaricp-demo-app
    build:
      context: ./app
    environment:
      SPRING_DATASOURCE_URL: jdbc:postgresql://hikaricp-demo-pgbouncer:6432/app?prepareThreshold=0
      SPRING_DATASOURCE_USERNAME: app
      SPRING_DATASOURCE_PASSWORD: app
      MANAGEMENT_ENDPOINTS_WEB_EXPOSURE_INCLUDE: prometheus,health
    ports:
      - "8080:8080"
    depends_on:
      - hikaricp-demo-pgbouncer

  hikaricp-demo-prometheus:
    image: prom/prometheus:latest
    container_name: hikaricp-demo-prometheus
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - ./alerts.yml:/etc/prometheus/alerts.yml
    ports:
      - "9090:9090"

  hikaricp-demo-grafana:
    image: grafana/grafana:latest
    container_name: hikaricp-demo-grafana
    environment:
      GF_SECURITY_ADMIN_PASSWORD: admin
    ports:
      - "3000:3000"
    depends_on:
      - hikaricp-demo-prometheus

Matching prometheus.yml:

# prometheus.yml
global:
  scrape_interval: 5s
rule_files:
  - alerts.yml
scrape_configs:
  - job_name: app
    metrics_path: /actuator/prometheus
    static_configs:
      - targets: ['hikaricp-demo-app:8080']

docker compose up brings the whole stack online. k6 run loadtest.js from another terminal drives the pool. Watch Grafana at localhost:3000 (admin/admin) and the alert at localhost:9090/alerts.

The Pre-Flight Checklist

Before shipping a service to production with HikariCP, walk this list:

  • maximum-pool-size sized via Little's Law, not guessed
  • minimum-idle = maximum-pool-size
  • connection-timeout = 5000
  • max-lifetime = 1800000 or shorter than any middlebox timeout
  • keepalive-time = 60000
  • leak-detection-threshold = 5000 in staging, 0 or 30000 in production
  • Micrometer metrics for hikaricp.connections.* wired to Prometheus (or your equivalent)
  • Alert on hikaricp.connections.pending > 0 for > 30 seconds
  • If running 4 or more service instances against a single PostgreSQL, pgbouncer in transaction-pooling mode
  • If using pgbouncer, prepareThreshold=0 in the JDBC URL

None of this is fancy. All of it is the difference between a service that scales linearly and a service that falls over the first time traffic doubles.

Share
X LinkedIn HN
UI

Umur Inan

Principal Software Engineer

Backend engineer focused on JVM systems, distributed architecture, and the failure modes that only show up in production. I write about what I learn building and breaking things at scale.