
DNS Is Always the Answer

It really is always DNS. TTL caches, resolver storms, NXDOMAIN under load, split-horizon traps, SNI mismatches. The actual failure modes behind the meme.

"It's always DNS" is a meme because it is true. The reason it is true is that DNS sits at the bottom of every network operation, and it has more failure modes than people remember. When production breaks in a way that does not fit your mental model, DNS deserves to be the first thing you check, not the last.

This post catalogs the real DNS failure modes that hit modern services, the three commands that resolve most of them, and what to do to prevent the next one.

Why DNS is special

Nearly every connection a service opens starts with a name lookup. Load balancers, TLS, service meshes, Kubernetes: all of them layer on top of working DNS resolution. DNS is the substrate, so a DNS misbehavior masquerades as a problem in whichever upper layer happens to log first. A slow service, a failed TLS handshake, an intermittent 502, a pod that flaps between healthy and unhealthy: any of these can be DNS underneath.

The mental model that helps: when a system behaves in a way that does not fit your understanding of how it should fail, suspect a layer below the one you are looking at. DNS is below almost everything.

The TTL that outlived the migration

You move a service from one IP to another. The DNS record is updated immediately, but the old record carried a 24-hour TTL, and every resolver cache that fetched it keeps serving the old answer until its own copy expires. Half your fleet sits behind resolvers whose copy has already expired, so it gets the new IP on its next query. The other half keeps hitting the old IP for the rest of the day, because the cached value is still inside its TTL window.

The symptom is intermittent: 50% of requests succeed, 50% time out. No pattern in the dashboards because the load balancer sees nothing wrong. The fix is waiting out the TTL or forcing a flush on every resolver in the path. Each option is slow. The lesson is that TTLs matter before the migration, not after. Short TTLs are cheap insurance.
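
The arithmetic behind "TTLs matter before the migration" can be sketched in a few lines of Python. This is a rough model, not a tool: it assumes every cache honors the published TTL (some resolvers clamp it), and the function names are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class CutoverPlan:
    lower_ttl_at: int      # seconds from now when the short TTL is published
    earliest_cutover: int  # first moment every cache has seen the short TTL
    worst_case_stale: int  # longest a cache can serve the old IP after cutover


def plan_cutover(old_ttl: int, short_ttl: int, lower_ttl_at: int = 0) -> CutoverPlan:
    # A cache that fetched the record just before the TTL was lowered keeps
    # the old TTL until it expires, so wait one full old TTL before moving.
    earliest = lower_ttl_at + old_ttl
    return CutoverPlan(lower_ttl_at, earliest, short_ttl)


def stale_after_change(ttl_when_cached: int, cached_seconds_before_change: int) -> int:
    """Seconds one cache keeps serving the old answer after the record changes."""
    return max(0, ttl_when_cached - cached_seconds_before_change)


if __name__ == "__main__":
    print(plan_cutover(old_ttl=86_400, short_ttl=60))
    # A cache that fetched the 24-hour record one hour before the change:
    print(stale_after_change(86_400, 3_600), "seconds of stale answers")
```

With a 24-hour TTL, a cache that refreshed an hour before the change serves the old IP for another 23 hours. With the TTL lowered to 60 seconds a day ahead, the worst case is a minute.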

The resolver under load

A local resolver (systemd-resolved, dnsmasq, a node-local DNS cache, CoreDNS itself) can run out of UDP source ports because of poor query multiplexing. Concurrent lookups stack up. Each name lookup that used to take 2 ms now takes 5 seconds, which is the default glibc resolver timeout before a retry.

The downstream effect is that every service that calls another service through a hostname inherits the resolver latency. Everything looks slow at the same time. The "slow service" you are debugging is downstream of the slow resolver, not the cause. Start with the resolver's metrics (latency, SERVFAIL rate, port-exhaustion counters) before opening the next dashboard.
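
If the resolver does not export metrics yet, a quick way to check whether lookups themselves are slow is to time them from the affected box. The sketch below uses Python's standard `socket.gethostbyname`, which goes through the system resolver. The helper names and the 0.5 second threshold are assumptions for illustration, and the resolver function is injectable so the timing logic can be exercised without a network.

```python
import socket
import time
from typing import Callable, Iterable


def time_lookup(name: str, resolve: Callable[[str], object] = socket.gethostbyname) -> float:
    """Return wall-clock seconds spent resolving `name`."""
    start = time.perf_counter()
    try:
        resolve(name)
    except OSError:
        # socket.gaierror is an OSError. A failed lookup still costs time,
        # and the time is what we are measuring here.
        pass
    return time.perf_counter() - start


def slow_lookups(
    names: Iterable[str],
    threshold_s: float = 0.5,
    resolve: Callable[[str], object] = socket.gethostbyname,
) -> dict[str, float]:
    """Map each name whose lookup took at least `threshold_s` to its latency."""
    timings = {name: time_lookup(name, resolve) for name in names}
    return {name: t for name, t in timings.items() if t >= threshold_s}


if __name__ == "__main__":
    # Hypothetical hostnames; substitute the ones your service actually calls.
    for name, seconds in slow_lookups(["api.example.com", "db.internal.example.com"]).items():
        print(f"{name}: {seconds:.3f}s")
```

Lookups that cluster around a whole number of seconds are a hint that you are watching retry timeouts rather than a slow upstream.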

NXDOMAIN under load

The upstream DNS server returns NXDOMAIN intermittently. The cause might be rate limiting, a flapping zone refresh, or a backend problem at your DNS provider. Whatever the cause, the negative cache picks up the NXDOMAIN and serves it for the negative TTL window. That window comes from the zone's SOA record and is usually 5 to 30 minutes.

For that window, every caller behind that resolver is told the hostname does not exist. Not slow, not failing: doesn't exist. The actual upstream blip was 30 seconds. The customer impact is 15 minutes. Negative caching is a feature, except when the negative answer was wrong, at which point it is the longest-lived bug in your incident timeline.
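
A toy model makes the amplification concrete. The sketch below is not how any real resolver is implemented: it caches only negative answers, takes the clock as a parameter, and uses a made-up upstream that wrongly returns NXDOMAIN for 30 seconds. The 900-second negative TTL is an assumed value matching the 15-minute example.

```python
from typing import Callable


class NegativeCache:
    """Caches NXDOMAIN answers for `negative_ttl` seconds (positive answers are not cached)."""

    def __init__(self, negative_ttl: int):
        self.negative_ttl = negative_ttl
        self._expires: dict[str, int] = {}  # name -> time the cached NXDOMAIN expires

    def lookup(self, name: str, now: int, upstream: Callable[[str, int], str]) -> str:
        if self._expires.get(name, 0) > now:
            return "NXDOMAIN"  # served from the negative cache; upstream is never asked
        answer = upstream(name, now)
        if answer == "NXDOMAIN":
            self._expires[name] = now + self.negative_ttl
        return answer


def upstream_with_blip(name: str, now: int) -> str:
    # Hypothetical upstream: wrong NXDOMAIN for the first 30 seconds, healthy afterwards.
    return "NXDOMAIN" if 0 <= now < 30 else "10.0.42.7"


if __name__ == "__main__":
    cache = NegativeCache(negative_ttl=900)
    for t in (10, 60, 600, 909, 910):
        print(t, cache.lookup("api.example.com", t, upstream_with_blip))
```

One unlucky query at second 10 pins the wrong answer until second 910, long after the upstream recovered.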

The split-horizon trap

Internal DNS resolves db.internal.example.com to a private VPC IP. External DNS does not resolve it at all, or resolves it to something different. The service works from your laptop on the VPN. It works from the pod, because the pod's resolver points at internal DNS. It fails from the CI runner that is on a network you did not check.

Split-horizon is a fine pattern when you know it exists. It is a half-day debugging session when you do not. The first thing to ask in any "works for me, doesn't work for them" report is which DNS view each side is using.
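
One cheap way to answer "which view is each side using" is to label the answers each vantage point got. The sketch below uses Python's standard `ipaddress` module and treats private-range addresses as a sign of the internal view. That heuristic is an assumption: it only holds when your internal view hands out private IPs and the external one does not. The vantage-point names and sample answers are invented.

```python
import ipaddress


def describe_view(answers: list[str]) -> str:
    """Label an answer set as 'internal', 'external', 'mixed', or 'no answer'."""
    if not answers:
        return "no answer"
    private = [ipaddress.ip_address(a).is_private for a in answers]
    if all(private):
        return "internal"
    if not any(private):
        return "external"
    return "mixed"


if __name__ == "__main__":
    # Hypothetical results of resolving db.internal.example.com from three places.
    seen = {
        "laptop-on-vpn": ["10.0.42.7"],
        "prod-pod": ["10.0.42.7"],
        "ci-runner": [],
    }
    for vantage, answers in seen.items():
        print(f"{vantage}: {describe_view(answers)}")
```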

The /etc/resolv.conf you forgot

On Kubernetes, a pod's /etc/resolv.conf has a search list generated by the kubelet, plus whatever the node's resolver had. For a pod in the default namespace, that list usually contains default.svc.cluster.local, svc.cluster.local, and cluster.local; in any other namespace, the first entry carries that namespace's name instead. A bare hostname like db goes through the list element by element until one resolves.

That means db resolves to db.default.svc.cluster.local in one namespace and to db.production.svc.cluster.local in another and to nothing at all in a third. Copying a manifest between namespaces silently changes which database the service talks to. The fix is always to use fully qualified names with the trailing dot (db.production.svc.cluster.local.). The bug is always discovered after the migration is live.
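
The expansion order is easy to model. The function below is a simplified sketch of how a glibc-style stub resolver builds its candidate list, assuming the `ndots:5` option that Kubernetes writes into pod resolv.conf files by default. The function name is made up, and real resolvers have more edge cases than this.

```python
def candidate_names(name: str, search: list[str], ndots: int = 5) -> list[str]:
    """Return the names a stub resolver would try, in order (simplified)."""
    if name.endswith("."):
        return [name]  # absolute name: the search list is skipped entirely
    expanded = [f"{name}.{domain}." for domain in search]
    as_is = name + "."
    if name.count(".") >= ndots:
        return [as_is] + expanded  # enough dots: try the literal name first
    return expanded + [as_is]      # too few dots: the search list goes first


if __name__ == "__main__":
    default_ns = ["default.svc.cluster.local", "svc.cluster.local", "cluster.local"]
    prod_ns = ["production.svc.cluster.local", "svc.cluster.local", "cluster.local"]

    print(candidate_names("db", default_ns)[0])
    print(candidate_names("db", prod_ns)[0])
    # Without the trailing dot, even a "full" name is expanded first:
    print(candidate_names("db.production.svc.cluster.local", default_ns))
    print(candidate_names("db.production.svc.cluster.local.", default_ns))
```

The third call shows a second cost of leaving off the dot: a name with four dots still falls under `ndots:5`, so the resolver burns three failed lookups before trying the name you meant.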

Service discovery is DNS

Kubernetes "service discovery" is CoreDNS, which is DNS. Consul service discovery exposes a DNS interface. ECS service discovery uses Route 53. AWS Cloud Map is DNS. Every fancy abstraction in this space sits on top of standard DNS. Failure modes are the same: caches that lag, resolvers under load, negative caching, the search list.

The trap is that the abstraction makes you stop thinking about DNS. Flip that mental model. When service discovery misbehaves, ask what the underlying DNS resolution looks like before opening the orchestrator's docs.

The TLS handshake that failed because SNI

The TLS client sends Server Name Indication during the handshake, and the server uses SNI to pick which certificate to present. A CNAME chain (you point app.example.com at app.example-cdn.com, which points at the CDN's edge) changes where the connection lands but not what the client asks for: SNI still carries the name your code used. If the edge has no certificate for app.example.com, or your client was configured with the CDN hostname and sends that as SNI instead, the certificate presented may not match the hostname your code expected.

The error message looks like a TLS problem: "cert mismatch," "untrusted issuer," "hostname does not match." The cause is the DNS chain in front of the TLS handshake. The TLS layer is doing exactly what you asked it to. You asked it the wrong question because DNS sent you somewhere unexpected.
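
To see why the same certificate passes for one name and fails for the other, it helps to check both names against the certificate's subject alternative names. The matcher below is a deliberately simplified sketch (exact match, or a `*.` wildcard covering exactly one left-most label). It is not a replacement for the verification your TLS library performs, and the SAN list is invented for the example.

```python
def san_matches(hostname: str, san: str) -> bool:
    """Simplified check of one hostname against one subject alternative name."""
    hostname = hostname.lower().rstrip(".")
    san = san.lower().rstrip(".")
    if san.startswith("*."):
        # The wildcard stands in for exactly one label, the left-most one.
        head, _, rest = hostname.partition(".")
        return bool(head) and rest == san[2:]
    return hostname == san


def names_covered(cert_sans: list[str], candidates: list[str]) -> dict[str, bool]:
    """For each candidate hostname, report whether any SAN on the certificate covers it."""
    return {name: any(san_matches(name, san) for san in cert_sans) for name in candidates}


if __name__ == "__main__":
    # Hypothetical certificate served by the CDN edge.
    edge_cert_sans = ["*.example-cdn.com"]
    print(names_covered(edge_cert_sans, ["app.example.com", "app.example-cdn.com"]))
```

If the name your code verifies is not in the covered set, the handshake error is the TLS layer reporting a routing decision that DNS made.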

How to actually debug DNS

Three dig commands resolve most production DNS questions.

dig +short <name> returns just the answer. Use this from the box where the problem is happening to see what the local resolver currently returns.

$ dig +short api.example.com
10.0.42.7

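dig +trace <name> walks from the root servers down through every delegation step, asking the authoritative servers directly instead of trusting a recursive resolver's cache. Use this when the answer is wrong and you need to figure out which authoritative server is wrong. If the trace is right and your resolver still is not, the stale answer is sitting in a cache.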

$ dig +trace api.example.com
; root, then .com, then example.com NS, then api.example.com
; each hop shows you who answered and what they said

dig @<resolver> <name> bypasses the local cache by querying a specific resolver, which still answers from its own cache. Use @8.8.8.8 or @1.1.1.1 to compare what public resolvers see versus what your local one does.

$ dig @8.8.8.8 api.example.com
$ dig @10.0.0.10 api.example.com
; compare the two answers and you find the disagreement

Run all three from at least two networks: your laptop and a prod pod, or a CI runner and an internal box. The thing you are debugging is almost always the difference between two answers.
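
Once you have `dig +short` output from several places, spotting the difference can be mechanical. The sketch below assumes you have pasted or collected the raw output per vantage point. The vantage names and sample outputs are invented, and the parsing keeps only lines that are valid IP addresses, so CNAME lines are ignored.

```python
import ipaddress


def parse_dig_short(output: str) -> frozenset[str]:
    """Keep only the IP addresses from `dig +short` output, dropping CNAME lines."""
    ips = set()
    for line in output.splitlines():
        try:
            ips.add(str(ipaddress.ip_address(line.strip())))
        except ValueError:
            continue  # CNAME target, comment, or blank line
    return frozenset(ips)


def group_by_answer(outputs: dict[str, str]) -> dict[frozenset[str], list[str]]:
    """Group vantage points by the answer set they saw; more than one group is a mismatch."""
    groups: dict[frozenset[str], list[str]] = {}
    for vantage, text in outputs.items():
        groups.setdefault(parse_dig_short(text), []).append(vantage)
    return groups


if __name__ == "__main__":
    # Hypothetical captured outputs for the same name.
    outputs = {
        "laptop": "10.0.42.7\n",
        "prod-pod": "app.example-cdn.com.\n10.0.42.7\n",
        "ci-runner": "",
    }
    for answer, vantages in group_by_answer(outputs).items():
        print(sorted(answer) or "no answer", "<-", vantages)
```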

How to prevent the next one

Short TTLs cost almost nothing and buy you fast rollback. 300 seconds for anything that might move. 60 seconds when you are actively migrating, lowered at least one old TTL ahead of the cutover so every cache has already picked up the short value. The argument against short TTLs ("more query load on the resolver") is real but small; the argument for them ("I can fix a mistake in a minute, not a day") is enormous.
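
The "real but small" part is easy to put numbers on. As a back-of-the-envelope sketch, assume each caching resolver refetches a record about once per TTL while the name stays in use; that ignores prefetching and cache eviction, and the resolver count of 50 is an arbitrary example.

```python
SECONDS_PER_DAY = 86_400


def approx_upstream_queries_per_day(ttl_seconds: int, caching_resolvers: int = 1) -> int:
    """Rough ceiling on upstream queries for one record: one refetch per TTL per cache."""
    return caching_resolvers * (SECONDS_PER_DAY // ttl_seconds)


if __name__ == "__main__":
    for ttl in (86_400, 300, 60):
        per_day = approx_upstream_queries_per_day(ttl, caching_resolvers=50)
        print(f"TTL {ttl:>6}s -> about {per_day} upstream queries/day from 50 resolvers")
```

Under those assumptions, going from a 24-hour TTL to 300 seconds takes one record from 50 upstream queries a day to 14,400, which is one query every six seconds across the whole fleet.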

Treat DNS as an SLO. Alert on resolver query latency, SERVFAIL rate, and NXDOMAIN rate, not just on downstream service latency. When the resolver is sick, every dashboard above it lights up red at once, and the resolver dashboards are the only ones that tell you why.
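
As a sketch of what those alerts compute, the snippet below turns response-code counters into rates and compares them with limits. The counter names, the limit values, and the idea that your resolver exposes counts in this shape are all assumptions; real thresholds depend on your traffic.

```python
def rcode_rates(counts: dict[str, int]) -> dict[str, float]:
    """Convert per-rcode response counts into fractions of all responses."""
    total = sum(counts.values())
    if total == 0:
        return {rcode: 0.0 for rcode in counts}
    return {rcode: n / total for rcode, n in counts.items()}


def breached(counts: dict[str, int], p99_latency_ms: float, limits: dict[str, float]) -> list[str]:
    """Return the names of the limits that the current window exceeds."""
    rates = rcode_rates(counts)
    alerts = []
    if rates.get("SERVFAIL", 0.0) > limits["servfail_rate"]:
        alerts.append("servfail_rate")
    if rates.get("NXDOMAIN", 0.0) > limits["nxdomain_rate"]:
        alerts.append("nxdomain_rate")
    if p99_latency_ms > limits["p99_latency_ms"]:
        alerts.append("p99_latency_ms")
    return alerts


if __name__ == "__main__":
    # Hypothetical five-minute window and hypothetical limits.
    window = {"NOERROR": 9_700, "NXDOMAIN": 200, "SERVFAIL": 100}
    limits = {"servfail_rate": 0.005, "nxdomain_rate": 0.05, "p99_latency_ms": 100.0}
    print(breached(window, p99_latency_ms=40.0, limits=limits))
```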

Standardize on fully qualified names with the trailing dot in config files, especially anywhere a manifest can move between namespaces. The trailing dot says "do not append the search list," which prevents the kind of cross-namespace surprise that takes a senior engineer an afternoon to track down.

Before deploying a DNS change, run dig from at least three networks (laptop, prod pod, CI runner) and confirm the answer is what you expect. DNS changes look reversible, and they usually are, but the TTL on the wrong record is your rollback ceiling.

The punchline is not the point

"It's always DNS" is the punchline. The work happens before you reach the punchline: knowing which of the half-dozen failure modes you are actually looking at, and having the muscle memory to run dig +trace before opening another dashboard. Every senior engineer learns this the same way, which is by losing an afternoon to a TTL or a CNAME chain or a misbehaving negative cache. The faster you internalize that DNS deserves your first check, not your last, the fewer afternoons you give up.


Umur Inan

Principal Software Engineer

Backend engineer focused on JVM systems, distributed architecture, and the failure modes that only show up in production. I write about what I learn building and breaking things at scale.
