DNS Troubleshooting Complete Guide - From Fundamentals to Kubernetes

1. What is DNS

DNS (Domain Name System) is a distributed hierarchical naming system that translates human-readable domain names (e.g., example.com) into IP addresses (e.g., 93.184.216.34) that computers use for communication. Often called the phonebook of the internet, it handles the very first step of almost every network interaction.

1.1 Why DNS Matters

  • Nearly every network operation -- HTTP requests, API calls, email delivery -- begins with a DNS lookup.
  • A DNS outage can cascade into a full service outage.
  • In microservice architectures, DNS is central to service discovery.

2. DNS Resolution Process

When a client looks up a domain name, the following steps produce the corresponding IP address.

2.1 Overall Flow

1. Client → Check local DNS cache (including /etc/hosts)
2. Local cache miss → Query Recursive Resolver (ISP or configured DNS server)
3. Recursive Resolver → Query Root Name Server (.)
4. Root NS → Responds with TLD Name Server (.com, .net, etc.)
5. TLD NS → Responds with Authoritative Name Server
6. Authoritative NS → Returns the final IP address
7. Recursive Resolver → Caches result and responds to client
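
The flow above can be sketched as a toy simulation. This is a minimal illustration, not a resolver: the `ZONES` table and the `resolve` helper are hypothetical, the data is hard-coded to mirror the example.com delegation chain, and no real DNS queries are sent.

```python
# Toy model of the resolution flow: each "server" either refers the
# resolver to a more specific zone or returns the final answer.
# All names and records here are illustrative, hard-coded data.
ZONES = {
    ".": {"referral": {"com.": "a.gtld-servers.net."}},
    "com.": {"referral": {"example.com.": "a.iana-servers.net."}},
    "example.com.": {"answer": {"example.com.": "93.184.216.34"}},
}


def resolve(name, cache):
    """Walk root -> TLD -> authoritative; return (ip, zones_visited)."""
    if not name.endswith("."):
        name += "."
    if name in cache:                       # cache hit: no queries needed
        return cache[name], []
    zone, visited = ".", []
    while True:
        visited.append(zone)
        data = ZONES[zone]
        if "answer" in data:                # authoritative server reached
            ip = data["answer"][name]
            cache[name] = ip                # cache before replying
            return ip, visited
        # Follow the referral whose zone is a suffix of the queried name.
        zone = next(z for z in data["referral"] if name.endswith(z))


if __name__ == "__main__":
    cache = {}
    print(resolve("example.com", cache))    # walks all three zones
    print(resolve("example.com", cache))    # served from the cache
```

The second call returns an empty list of visited zones, which is the point of step 7: once the resolver has cached the answer, the root, TLD, and authoritative servers are not contacted again until the TTL expires.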

2.2 Recursive vs Iterative Queries

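# Trace the delegation path with dig +trace
# (dig performs the iterative queries itself: root, then TLD, then authoritative servers,
# instead of asking the recursive resolver for the final answer)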
$ dig +trace example.com

; <<>> DiG 9.18.18 <<>> +trace example.com
;; global options: +cmd
.                       518400  IN      NS      a.root-servers.net.
.                       518400  IN      NS      b.root-servers.net.
;; Received 239 bytes from 127.0.0.53#53(127.0.0.53) in 1 ms

com.                    172800  IN      NS      a.gtld-servers.net.
com.                    172800  IN      NS      b.gtld-servers.net.
;; Received 1170 bytes from 198.41.0.4#53(a.root-servers.net) in 23 ms

example.com.            172800  IN      NS      a.iana-servers.net.
example.com.            172800  IN      NS      b.iana-servers.net.
;; Received 356 bytes from 192.5.6.30#53(a.gtld-servers.net) in 15 ms

example.com.            86400   IN      A       93.184.216.34
;; Received 56 bytes from 199.43.135.53#53(a.iana-servers.net) in 78 ms

2.3 DNS Record Types

| Record | Description | Example |
|--------|-------------|---------|
| A | IPv4 address mapping | example.com → 93.184.216.34 |
| AAAA | IPv6 address mapping | example.com → 2606:2800:220:1:... |
| CNAME | Canonical name (alias) | www.example.com → example.com |
| MX | Mail exchange server | example.com → mail.example.com |
| NS | Name server delegation | example.com → ns1.example.com |
| TXT | Text record (SPF, DKIM, etc.) | v=spf1 include:... |
| SRV | Service locator | _http._tcp.example.com |
| PTR | Reverse lookup (IP→domain) | 34.216.184.93.in-addr.arpa → example.com |
| SOA | Start of Authority | Zone management metadata |
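
A PTR lookup does not query the IP address itself; it queries a name built from the reversed address under in-addr.arpa (IPv4) or ip6.arpa (IPv6). As a small sketch, Python's standard `ipaddress` module can build that name. The helper name `ptr_name` is made up for this example.

```python
import ipaddress


def ptr_name(ip):
    """Return the name that a reverse (PTR) lookup queries for this IP."""
    return ipaddress.ip_address(ip).reverse_pointer


if __name__ == "__main__":
    print(ptr_name("93.184.216.34"))
    print(ptr_name("2001:db8::1"))
```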

3. Common DNS Issues

3.1 NXDOMAIN (Non-Existent Domain)

This response is returned when the queried domain does not exist.

$ dig nonexistent.example.com

;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 12345
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1

;; QUESTION SECTION:
;nonexistent.example.com.    IN      A

;; AUTHORITY SECTION:
example.com.            900     IN      SOA     ns1.example.com. admin.example.com. 2024010101 3600 900 604800 86400

Possible causes:

  • Typo in the domain name
  • DNS record has not been created yet
  • Domain registration has expired
  • DNS propagation has not completed
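
When checking many names, it can help to pull the status and answer count out of dig's header automatically. The following is a rough sketch that assumes the header layout shown in the output above; `summarize_dig` is a hypothetical helper, not part of any DNS library.

```python
import re


def summarize_dig(output):
    """Extract the response status and answer count from dig output.

    Returns None values if the header lines are missing (for example,
    when the query timed out and dig printed no response at all).
    """
    status = re.search(r"status: ([A-Z]+)", output)
    answers = re.search(r"ANSWER: (\d+)", output)
    return {
        "status": status.group(1) if status else None,
        "answers": int(answers.group(1)) if answers else None,
    }


if __name__ == "__main__":
    sample = (
        ";; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 12345\n"
        ";; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1\n"
    )
    print(summarize_dig(sample))
```

A status of NXDOMAIN means the name does not exist at all, whereas NOERROR with zero answers means the name exists but has no record of the requested type.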

3.2 DNS Timeout

$ dig @10.0.0.1 example.com +timeout=5

; <<>> DiG 9.18.18 <<>> @10.0.0.1 example.com +timeout=5
; (1 server found)
;; global options: +cmd
;; connection timed out; no servers could be reached

Possible causes:

  • DNS server is down or unreachable
  • Firewall blocking UDP/TCP port 53
  • Network connectivity issues
  • DNS server overloaded

3.3 Stale or Wrong Records

# Unexpected IP address returned
$ dig api.myservice.com +short
192.168.1.100    # Expected: 10.0.1.50

# Compare results from multiple DNS servers
$ dig @8.8.8.8 api.myservice.com +short
10.0.1.50
$ dig @1.1.1.1 api.myservice.com +short
10.0.1.50
$ dig @192.168.1.1 api.myservice.com +short
192.168.1.100    # Local DNS cache returning stale value
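
The comparison above can be automated once the answers have been collected. This sketch works on a plain dictionary of resolver to answer, so it needs no network access; `find_outliers` is a hypothetical helper. It treats the most common answer as the reference, which is only a heuristic: the majority is not guaranteed to be the correct record.

```python
from collections import Counter


def find_outliers(answers):
    """Return the resolvers whose answer differs from the most common one.

    answers maps a resolver address to the answer it returned.
    An empty mapping is not handled in this sketch.
    """
    majority, _count = Counter(answers.values()).most_common(1)[0]
    return {r: a for r, a in answers.items() if a != majority}


if __name__ == "__main__":
    observed = {
        "8.8.8.8": "10.0.1.50",
        "1.1.1.1": "10.0.1.50",
        "192.168.1.1": "192.168.1.100",
    }
    print(find_outliers(observed))
```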

4. DNS Debugging Tools

4.1 dig (Domain Information Groper)

dig is the most detailed of the common DNS debugging tools: it prints the full response, including the status, flags, and TTLs.

# Basic lookup
$ dig example.com

# Query specific record types
$ dig example.com MX
$ dig example.com AAAA
$ dig example.com TXT

# Short output
$ dig example.com +short
93.184.216.34

# Query a specific DNS server
$ dig @8.8.8.8 example.com

# Reverse lookup
$ dig -x 93.184.216.34

# Query all record types (many servers refuse or minimize ANY queries, see RFC 8482)
$ dig example.com ANY

# Check response time
$ dig example.com | grep "Query time"
;; Query time: 23 msec

# Use TCP instead of UDP
$ dig +tcp example.com

# Request DNSSEC records (sets the DO bit); an "ad" flag in the reply means the resolver validated them
$ dig +dnssec example.com

4.2 nslookup

Supports both interactive and non-interactive modes.

# Basic lookup
$ nslookup example.com
Server:         127.0.0.53
Address:        127.0.0.53#53

Non-authoritative answer:
Name:   example.com
Address: 93.184.216.34

# Use a specific DNS server
$ nslookup example.com 8.8.8.8

# Query specific record type
$ nslookup -type=MX example.com

# Interactive mode
$ nslookup
> set type=NS
> example.com
Server:         127.0.0.53
Address:        127.0.0.53#53

Non-authoritative answer:
example.com     nameserver = a.iana-servers.net.
example.com     nameserver = b.iana-servers.net.
> exit

4.3 host

A lightweight tool providing concise output.

# Basic lookup
$ host example.com
example.com has address 93.184.216.34
example.com has IPv6 address 2606:2800:220:1:248:1893:25c8:1946
example.com mail is handled by 0 .

# Reverse lookup
$ host 93.184.216.34
34.216.184.93.in-addr.arpa domain name pointer example.com.

# Query specific record type
$ host -t NS example.com
example.com name server a.iana-servers.net.
example.com name server b.iana-servers.net.

# Verbose output
$ host -v example.com

4.4 drill

A tool with enhanced DNSSEC support (from the ldns package).

# Basic lookup
$ drill example.com
;; ->>HEADER<<- opcode: QUERY, rcode: NOERROR, id: 54321
;; QUESTION SECTION:
;; example.com.   IN      A

;; ANSWER SECTION:
example.com.      86400   IN      A       93.184.216.34

# DNSSEC trace
$ drill -DT example.com

# Query a specific server
$ drill @8.8.8.8 example.com

5. DNS Caching Issues

5.1 TTL (Time To Live)

# Check TTL value
$ dig example.com

;; ANSWER SECTION:
example.com.            86400   IN      A       93.184.216.34
#                       ^^^^^ TTL: 86400 seconds = 24 hours

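A long TTL means DNS changes take longer to propagate, because resolvers keep serving the cached record until it expires. Before a DNS migration, lower the TTL at least one full old-TTL period in advance so that every cache has picked up the shorter value by the time the record changes.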

# TTL strategy example
# 1. 24 hours before migration: Lower TTL to 300 seconds (5 minutes)
# 2. Execute migration: Change IP address
# 3. After confirming propagation: Restore TTL to original value
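
The timing in this strategy is simple arithmetic, sketched below. Both helper names are hypothetical, and the calculation assumes resolvers honor the published TTL, which not every resolver does.

```python
from datetime import datetime, timedelta


def earliest_safe_cutover(ttl_lowered_at, old_ttl_seconds):
    """Earliest time by which caches holding the old, long TTL have expired."""
    return ttl_lowered_at + timedelta(seconds=old_ttl_seconds)


def worst_case_stale_until(record_changed_at, current_ttl_seconds):
    """Latest time a cache may still serve the previous record."""
    return record_changed_at + timedelta(seconds=current_ttl_seconds)


if __name__ == "__main__":
    lowered = datetime(2024, 1, 1, 12, 0)
    cutover = earliest_safe_cutover(lowered, 86400)    # old TTL: 24 hours
    print("migrate no earlier than:", cutover)
    print("old IP may be served until:", worst_case_stale_until(cutover, 300))
```

With a 24-hour TTL lowered at noon, the migration should wait until noon the next day; after the change, stale answers should disappear within about five minutes.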

5.2 Negative Caching

NXDOMAIN responses are also cached. The negative cache TTL is the smaller of the SOA record's own TTL and its MINIMUM field (RFC 2308).

$ dig example.com SOA

;; ANSWER SECTION:
example.com.    86400   IN      SOA     ns1.example.com. admin.example.com. (
                                        2024010101 ; Serial
                                        3600       ; Refresh
                                        900        ; Retry
                                        604800     ; Expire
                                        86400 )    ; Minimum TTL (negative cache TTL)
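
Per RFC 2308, a resolver caches a negative answer for the smaller of two values: the TTL of the SOA record returned in the authority section and that SOA's MINIMUM field. The sketch below applies this rule to a single-line SOA record such as the one dig prints in an NXDOMAIN response. `negative_cache_ttl` is a hypothetical helper and does not handle the multi-line parenthesized form.

```python
def negative_cache_ttl(soa_line):
    """Return min(SOA record TTL, SOA MINIMUM) for a one-line SOA record."""
    fields = soa_line.split()
    # name ttl class SOA mname rname serial refresh retry expire minimum
    if len(fields) != 11 or fields[3] != "SOA":
        raise ValueError("expected a single-line SOA record")
    record_ttl = int(fields[1])
    minimum = int(fields[10])
    return min(record_ttl, minimum)


if __name__ == "__main__":
    line = ("example.com. 900 IN SOA ns1.example.com. admin.example.com. "
            "2024010101 3600 900 604800 86400")
    print(negative_cache_ttl(line))
```

For this record the MINIMUM is 86400 but the SOA's own TTL is 900, so the NXDOMAIN is cached for 900 seconds rather than a full day.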

5.3 Managing Local DNS Cache

# Linux: Check systemd-resolved cache statistics
$ resolvectl statistics
Current Cache Size: 152
Cache Hits: 1234
Cache Misses: 567

# Linux: Flush systemd-resolved cache
$ sudo resolvectl flush-caches

# macOS: Flush DNS cache
$ sudo dscacheutil -flushcache && sudo killall -HUP mDNSResponder

# Windows: Flush DNS cache
> ipconfig /flushdns

6. resolv.conf and nsswitch.conf Configuration

6.1 /etc/resolv.conf

$ cat /etc/resolv.conf
# DNS server configuration (maximum 3)
nameserver 8.8.8.8
nameserver 8.8.4.4
nameserver 1.1.1.1

# Default search domains
search mycompany.com prod.mycompany.com

# Options (comments must be on their own lines, starting with # or ; in the first column)
# timeout:2  = wait 2 seconds for a reply
# attempts:3 = try up to 3 times
# ndots:5    = threshold for FQDN determination (details below)
# rotate     = round-robin DNS servers
# edns0      = enable EDNS0
options timeout:2 attempts:3 ndots:5 rotate edns0

Key Configuration Options:

  • nameserver: DNS servers to use (tried in order, maximum 3)
  • search: Domains automatically appended to short hostnames
  • domain: Similar to search but specifies only a single domain
  • options ndots:n: Names with fewer than n dots try search domains first
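
To check what a resolver will actually use, it can help to parse the file the way the options above describe. This is a simplified sketch, not a reimplementation of the glibc parser: `parse_resolv_conf` is a hypothetical helper, it only understands the keywords listed above, and it expects comments on their own lines.

```python
def parse_resolv_conf(text):
    """Parse resolv.conf text into nameservers, search list, and options."""
    config = {"nameservers": [], "search": [], "options": {}}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line[0] in "#;":        # blank line or comment
            continue
        keyword, *values = line.split()
        if keyword == "nameserver" and values:
            config["nameservers"].append(values[0])
        elif keyword in ("search", "domain"):
            config["search"] = values          # the last occurrence wins
        elif keyword == "options":
            for opt in values:
                name, _, value = opt.partition(":")
                config["options"][name] = int(value) if value.isdigit() else True
    config["nameservers"] = config["nameservers"][:3]   # only the first 3 are used
    return config


if __name__ == "__main__":
    sample = """\
# example
nameserver 8.8.8.8
nameserver 8.8.4.4
search mycompany.com prod.mycompany.com
options timeout:2 ndots:5 rotate
"""
    print(parse_resolv_conf(sample))
```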

6.2 /etc/nsswitch.conf

Controls the order of name resolution sources.

$ grep hosts /etc/nsswitch.conf
hosts:          files dns myhostname

# files = Check /etc/hosts first
# dns = Query DNS servers
# myhostname = Resolve local hostname (systemd)

# /etc/hosts file example
$ cat /etc/hosts
127.0.0.1       localhost
127.0.1.1       myserver
10.0.1.50       api.internal.mycompany.com api-internal
192.168.1.100   db-master.mycompany.com

6.3 Checking systemd-resolved

# View current DNS configuration
$ resolvectl status
Global
       Protocols: -LLMNR -mDNS -DNSOverTLS DNSSEC=no/unsupported
resolv.conf mode: stub

Link 2 (eth0)
    Current Scopes: DNS
         Protocols: +DefaultRoute +LLMNR -mDNS -DNSOverTLS DNSSEC=no/unsupported
Current DNS Server: 8.8.8.8
       DNS Servers: 8.8.8.8 8.8.4.4

# Test domain resolution
$ resolvectl query example.com
example.com: 93.184.216.34           -- link: eth0
             2606:2800:220:1:248:1893:25c8:1946 -- link: eth0

7. CoreDNS Troubleshooting in Kubernetes

7.1 CoreDNS Architecture

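Within a Kubernetes cluster, CoreDNS handles service discovery. With the default dnsPolicy (ClusterFirst), DNS queries from Pods are sent to CoreDNS, which answers cluster names itself and forwards everything else upstream.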

# Check CoreDNS Pod status
$ kubectl get pods -n kube-system -l k8s-app=kube-dns
NAME                       READY   STATUS    RESTARTS   AGE
coredns-5d78c9869d-abc12   1/1     Running   0          7d
coredns-5d78c9869d-def34   1/1     Running   0          7d

# Check CoreDNS service
$ kubectl get svc -n kube-system kube-dns
NAME       TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)                  AGE
kube-dns   ClusterIP   10.96.0.10   <none>        53/UDP,53/TCP,9153/TCP   30d

7.2 Inspecting the CoreDNS Corefile

$ kubectl get configmap coredns -n kube-system -o yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns
  namespace: kube-system
data:
  Corefile: |
    .:53 {
        errors
        health {
           lameduck 5s
        }
        ready
        kubernetes cluster.local in-addr.arpa ip6.arpa {
           pods insecure
           fallthrough in-addr.arpa ip6.arpa
           ttl 30
        }
        prometheus :9153
        forward . /etc/resolv.conf {
           max_concurrent 1000
        }
        cache 30
        loop
        reload
        loadbalance
    }

7.3 Checking CoreDNS Logs

# View CoreDNS logs
$ kubectl logs -n kube-system -l k8s-app=kube-dns --tail=50
[INFO] 10.244.0.15:45678 - 12345 "A IN my-service.default.svc.cluster.local. udp 54 false 512" NOERROR qr,aa,rd 106 0.000234s
[INFO] 10.244.0.15:45679 - 12346 "A IN external-api.com. udp 34 false 512" NOERROR qr,rd,ra 62 0.023456s

# Enable log plugin (add log to Corefile)
# .:53 {
#     log
#     errors
#     ...
# }
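
Once query logging is on, the log lines can be filtered for failures and slow responses. The sketch below assumes the log line layout shown in the sample output above; the regular expression and the helper names `parse_log_line` and `slow_or_failed` are made up for this example, and the 20 ms threshold is an arbitrary choice.

```python
import re

# Matches the query log layout shown above:
# client:port - id "type class name proto size do bufsize" rcode flags rsize duration
LOG_RE = re.compile(
    r'\[INFO\] (?P<client>\S+):(?P<port>\d+) - (?P<id>\d+) '
    r'"(?P<qtype>\S+) (?P<qclass>\S+) (?P<name>\S+) (?P<proto>\S+) '
    r'(?P<size>\d+) (?P<do>\S+) (?P<bufsize>\d+)" '
    r'(?P<rcode>\S+) (?P<flags>\S+) (?P<rsize>\d+) (?P<duration>[\d.]+)s'
)


def parse_log_line(line):
    """Return the fields of one query log line as a dict, or None."""
    match = LOG_RE.search(line)
    if not match:
        return None
    entry = match.groupdict()
    entry["duration"] = float(entry["duration"])   # seconds
    return entry


def slow_or_failed(lines, threshold=0.02):
    """Keep entries with a non-NOERROR rcode or a duration above threshold."""
    hits = []
    for line in lines:
        entry = parse_log_line(line)
        if entry and (entry["rcode"] != "NOERROR" or entry["duration"] > threshold):
            hits.append(entry)
    return hits
```

Fed the two sample lines above, `slow_or_failed` keeps only the external-api.com query, since it took about 23 ms while the in-cluster lookup completed in well under a millisecond.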

7.4 Debugging DNS from Inside a Pod

# Create a debug Pod
$ kubectl run dns-debug --image=nicolaka/netshoot --rm -it --restart=Never -- bash

# Test DNS resolution inside the Pod
bash-5.1# nslookup kubernetes.default.svc.cluster.local
Server:         10.96.0.10
Address:        10.96.0.10#53

Name:   kubernetes.default.svc.cluster.local
Address: 10.96.0.1

# Detailed check with dig
bash-5.1# dig kubernetes.default.svc.cluster.local

;; ANSWER SECTION:
kubernetes.default.svc.cluster.local. 30 IN A   10.96.0.1

;; Query time: 1 msec
;; SERVER: 10.96.0.10#53(10.96.0.10)

# Test external domain resolution
bash-5.1# dig example.com +short
93.184.216.34

# Check the Pod's DNS configuration
bash-5.1# cat /etc/resolv.conf
nameserver 10.96.0.10
search default.svc.cluster.local svc.cluster.local cluster.local
options ndots:5

8. ndots Configuration and Search Domains

8.1 How ndots Works

The ndots option specifies that if the query name contains fewer dots (.) than this value, the system will try appending search domains first.

# Kubernetes default: ndots:5
# search default.svc.cluster.local svc.cluster.local cluster.local

# Querying "api.example.com" (2 dots < ndots 5)
# Actual query order:
1. api.example.com.default.svc.cluster.local    → NXDOMAIN
2. api.example.com.svc.cluster.local            → NXDOMAIN
3. api.example.com.cluster.local                → NXDOMAIN
4. api.example.com.                             → Success!

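This means each external domain lookup generates 3 unnecessary DNS queries before the one that succeeds, and often twice that when the resolver requests A and AAAA records together.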
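
The query order can be reproduced with a few lines of code. This is a simplified model of how a glibc-style stub resolver applies ndots and the search list, not the resolver's actual implementation; `query_candidates` is a hypothetical helper.

```python
def query_candidates(name, search, ndots):
    """Return the names a stub resolver would try, in order."""
    if name.endswith("."):
        return [name]                       # already absolute: no search list
    absolute = name + "."
    expanded = [f"{name}.{domain}." for domain in search]
    if name.count(".") >= ndots:
        return [absolute] + expanded        # enough dots: try as-is first
    return expanded + [absolute]            # too few dots: search list first


if __name__ == "__main__":
    search = ["default.svc.cluster.local", "svc.cluster.local", "cluster.local"]
    for ndots in (5, 2):
        print(ndots, query_candidates("api.example.com", search, ndots))
```

With ndots set to 2, the same name is tried as an absolute name first, so an external lookup succeeds on the first query.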

8.2 Optimizing ndots

# Adjust ndots via dnsConfig in Pod spec
apiVersion: v1
kind: Pod
metadata:
  name: optimized-pod
spec:
  containers:
    - name: app
      image: myapp:latest
  dnsConfig:
    options:
      - name: ndots
        value: '2'

# Use an FQDN (trailing dot) to avoid the search-list queries
# Note: dig ignores the search list unless +search is given
# Inefficient:
$ dig +search api.example.com   # Multiple queries due to ndots

# Efficient:
$ dig api.example.com.  # Direct query as FQDN (trailing dot)

8.3 dnsPolicy Options

# ClusterFirst (default): Send queries to CoreDNS, which forwards non-cluster names upstream
apiVersion: v1
kind: Pod
spec:
  dnsPolicy: ClusterFirst

# Default: Inherit the node's DNS configuration (despite the name, this is not the default policy)
spec:
  dnsPolicy: Default

# None: Configure DNS manually via dnsConfig
spec:
  dnsPolicy: None
  dnsConfig:
    nameservers:
    - 8.8.8.8
    - 1.1.1.1
    searches:
    - my-namespace.svc.cluster.local
    - svc.cluster.local
    options:
    - name: ndots
      value: "2"

9. Practical Debugging Scenarios

9.1 Scenario 1: Inter-Service Communication Failure

# Symptom: Pod A cannot connect to Pod B's service
$ kubectl exec pod-a -- curl http://my-service:8080
curl: (6) Could not resolve host: my-service

# Step 1: Check the Pod's DNS configuration
$ kubectl exec pod-a -- cat /etc/resolv.conf
nameserver 10.96.0.10
search default.svc.cluster.local svc.cluster.local cluster.local
options ndots:5

# Step 2: Query CoreDNS directly
$ kubectl exec pod-a -- dig @10.96.0.10 my-service.default.svc.cluster.local
;; status: NXDOMAIN

# Step 3: Verify the service exists
$ kubectl get svc my-service -n default
Error from server (NotFound): services "my-service" not found

# Step 4: Check the correct namespace
$ kubectl get svc --all-namespaces | grep my-service
production   my-service   ClusterIP   10.96.45.123   <none>   8080/TCP   5d

# Solution: Use FQDN with the correct namespace
$ kubectl exec pod-a -- curl http://my-service.production.svc.cluster.local:8080
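
The reason the short name failed is visible from the search list: a Pod in the default namespace never expands `my-service` into a name under the production namespace. The sketch below models this with a hard-coded set of existing services. `pod_search_list` and `first_match` are hypothetical helpers, and the model ignores ndots and the final absolute-name attempt.

```python
def pod_search_list(namespace, cluster_domain="cluster.local"):
    """Search domains Kubernetes writes into a Pod's resolv.conf."""
    return [
        f"{namespace}.svc.{cluster_domain}",
        f"svc.{cluster_domain}",
        cluster_domain,
    ]


def first_match(name, pod_namespace, existing_services):
    """Return the first expanded name that is an existing service, else None."""
    for domain in pod_search_list(pod_namespace):
        candidate = f"{name}.{domain}"
        if candidate in existing_services:
            return candidate
    return None


if __name__ == "__main__":
    services = {"my-service.production.svc.cluster.local"}
    print(first_match("my-service", "default", services))
    print(first_match("my-service.production", "default", services))
```

The shorter form `my-service.production` also resolves from any namespace, because the second search domain (`svc.cluster.local`) completes it.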

9.2 Scenario 2: External Domain Resolution Failure

# Symptom: External API calls from Pod are failing
$ kubectl exec my-pod -- curl https://api.external.com
curl: (6) Could not resolve host: api.external.com

# Step 1: Verify CoreDNS is working for internal queries
$ kubectl exec my-pod -- dig @10.96.0.10 kubernetes.default.svc.cluster.local +short
10.96.0.1    # Internal DNS is working fine

# Step 2: Check CoreDNS upstream forwarding
$ kubectl exec my-pod -- dig @10.96.0.10 api.external.com
;; status: SERVFAIL

# Step 3: Check CoreDNS logs
$ kubectl logs -n kube-system -l k8s-app=kube-dns | grep "api.external.com"
[ERROR] plugin/forward: no nameservers found

# Step 4: Inspect the CoreDNS forward configuration
$ kubectl get configmap coredns -n kube-system -o jsonpath='{.data.Corefile}'
# Look for: forward . /etc/resolv.conf

# Step 5: Check CoreDNS Pod's resolv.conf
$ kubectl exec -n kube-system coredns-5d78c9869d-abc12 -- cat /etc/resolv.conf
nameserver 169.254.169.253   # Cloud DNS may be unreachable

# Solution: Explicitly specify forward targets in the Corefile
# forward . 8.8.8.8 8.8.4.4

9.3 Scenario 3: Intermittent DNS Timeouts

# Symptom: DNS lookups intermittently take more than 5 seconds
$ time dig @10.96.0.10 example.com
;; Query time: 5003 msec    # 5-second timeout before retry

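# Cause: Linux conntrack race condition (DNAT + UDP)
# glibc sends the A and AAAA queries in parallel from the same UDP socket; when both
# pass through DNAT to the kube-dns ClusterIP at once, one conntrack insert can fail and
# that packet is dropped. The resolver then waits out its 5-second timeout before retrying.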

# Verify: Check conntrack table statistics
$ sudo conntrack -S
cpu=0           found=0 invalid=1523 insert=0 insert_failed=156 drop=156
#                                                ^^^^^^^^^^^^^ insert failures indicate the issue

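# Solution 1: Use TCP for DNS
# In the Pod, set the glibc resolver option use-vc (via dnsConfig.options).
# Note: force_tcp in the CoreDNS forward plugin only affects CoreDNS-to-upstream traffic.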

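# Solution 2: Deploy NodeLocal DNSCache
# Note: this manifest contains __PILLAR__* placeholders that must be replaced with
# cluster-specific values first; applying it unmodified will not give a working cache.
$ kubectl apply -f https://raw.githubusercontent.com/kubernetes/kubernetes/master/cluster/addons/dns/nodelocaldns/nodelocaldns.yaml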

# Solution 3: Use single-request-reopen option in Pod
# (a glibc resolver option; musl-based images such as Alpine ignore it)
apiVersion: v1
kind: Pod
spec:
  dnsConfig:
    options:
    - name: single-request-reopen
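
The `conntrack -S` output in this scenario prints one line per CPU, so the insert_failed counters need to be summed to see the node-wide picture. A minimal sketch, assuming the key=value layout shown above; `insert_failed_total` is a hypothetical helper.

```python
def insert_failed_total(conntrack_output):
    """Sum the insert_failed counters across all CPU lines."""
    total = 0
    for line in conntrack_output.splitlines():
        for token in line.split():
            key, _, value = token.partition("=")
            if key == "insert_failed" and value.isdigit():
                total += int(value)
    return total


if __name__ == "__main__":
    sample = (
        "cpu=0           found=0 invalid=1523 insert=0 insert_failed=156 drop=156\n"
        "cpu=1           found=0 invalid=10 insert=0 insert_failed=4 drop=4\n"
    )
    print(insert_failed_total(sample))
```

Sampling this total twice, a minute or so apart, shows whether the counter is still climbing while the timeouts occur; a static value points to an older incident.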

9.4 Scenario 4: Verifying DNS Propagation

# Check propagation across multiple DNS servers after a change
$ for dns in 8.8.8.8 1.1.1.1 9.9.9.9 208.67.222.222; do
    echo "=== $dns ==="
    dig @$dns api.myservice.com +short +timeout=3
done

=== 8.8.8.8 ===
10.0.1.50
=== 1.1.1.1 ===
10.0.1.50
=== 9.9.9.9 ===
192.168.1.100    # Still returning old record
=== 208.67.222.222 ===
192.168.1.100    # Still returning old record

# Check TTL to estimate when the cache will expire
$ dig @9.9.9.9 api.myservice.com | grep -A1 "ANSWER SECTION"
;; ANSWER SECTION:
api.myservice.com.    1423    IN      A       192.168.1.100
#                     ^^^^ Remaining TTL: cache expires in approximately 24 minutes
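
The remaining TTL in the answer line converts directly into a wait time. A small sketch, assuming the standard answer-section layout (name, TTL, class, type, data); `cache_expiry` is a hypothetical helper.

```python
from datetime import timedelta


def cache_expiry(answer_line):
    """Return (name, time until the cached record expires)."""
    name, ttl, *_rest = answer_line.split()
    return name, timedelta(seconds=int(ttl))


if __name__ == "__main__":
    line = "api.myservice.com.    1423    IN      A       192.168.1.100"
    name, remaining = cache_expiry(line)
    print(f"{name} expires from this resolver's cache in {remaining}")
```

Large public resolvers run many cache nodes, so repeated queries can show different remaining TTLs; the largest value observed is the safer estimate.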

10. Useful DNS Debugging One-Liners

# 1. Benchmark DNS response times
$ for i in $(seq 1 10); do dig example.com | grep "Query time"; done

# 2. Bulk lookup for multiple domains
$ for domain in api.example.com web.example.com db.example.com; do
    echo "$domain: $(dig +short $domain)"
done

# 3. Monitor DNS record changes
$ watch -n 5 "dig +short api.myservice.com @8.8.8.8"

# 4. Bulk reverse DNS lookups
$ for ip in 10.0.1.{1..10}; do
    result=$(dig +short -x $ip)
    echo "$ip -> ${result:-NO PTR}"
done

# 5. Fetch DNSSEC signatures (RRSIG) along with the answer
$ dig +dnssec +short example.com
93.184.216.34
A 13 2 86400 20240315000000 20240301000000 12345 example.com. <base64_signature>

# 6. Verify DNS resolution for all Kubernetes services
$ kubectl get svc --all-namespaces -o jsonpath='{range .items[*]}{.metadata.name}.{.metadata.namespace}.svc.cluster.local{"\n"}{end}' | \
  while read fqdn; do
    result=$(kubectl exec dns-debug -- dig +short $fqdn 2>/dev/null)
    echo "$fqdn -> ${result:-FAILED}"
  done

11. Summary and Checklist

When a DNS issue occurs, diagnose in the following order:

  1. Check /etc/resolv.conf settings (nameserver, search, ndots)
  2. Test basic DNS queries with dig or nslookup
  3. Query a specific DNS server (dig @8.8.8.8)
  4. Trace the full resolution path with +trace
  5. Check TTL to determine if caching is the issue
  6. In Kubernetes environments, check CoreDNS Pod status and logs
  7. Review the performance impact of ndots and search domain settings
  8. For intermittent conntrack-related issues, consider deploying NodeLocal DNSCache

DNS is very often the root cause of network problems. Building systematic debugging habits can significantly reduce incident response times.