DNS Troubleshooting Complete Guide - From Fundamentals to Kubernetes

1. What is DNS

DNS (Domain Name System) is a distributed hierarchical naming system that translates human-readable domain names (e.g., example.com) into IP addresses (e.g., 93.184.216.34) that computers use for communication. Often called the phonebook of the internet, it handles the very first step of almost every network interaction.

1.1 Why DNS Matters

  • Nearly every network operation -- HTTP requests, API calls, email delivery -- begins with a DNS lookup.
  • A DNS outage can cascade into a full service outage.
  • In microservice architectures, DNS is central to service discovery.

2. DNS Resolution Process

When a client enters a domain name, the following steps occur to obtain the corresponding IP address.

2.1 Overall Flow

1. Client → Check local DNS cache (including /etc/hosts)
2. Local cache miss → Query Recursive Resolver (ISP or configured DNS server)
3. Recursive Resolver → Query Root Name Server (.)
4. Root NS → Responds with TLD Name Server (.com, .net, etc.)
5. TLD NS → Responds with Authoritative Name Server
6. Authoritative NS → Returns the final IP address
7. Recursive Resolver → Caches result and responds to client
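
The walk from the root down to the authoritative server can be sketched as plain string handling. This is only an illustration: `delegation_chain` is a hypothetical helper name, it sends no DNS queries, and in practice not every label is a separate zone, so a real resolver may skip some of these steps.

```shell
# delegation_chain NAME
# Hypothetical helper: prints the names an iterative resolver asks about,
# from the root down. Pure string handling; no DNS queries are sent.
delegation_chain() (
  name="${1%.}"                    # drop a trailing dot if present
  zone=""
  echo "."
  while [ -n "$name" ]; do
    label="${name##*.}"            # rightmost remaining label
    zone="${label}.${zone}"
    case "$name" in
      *.*) name="${name%.*}" ;;    # strip that label
      *)   name="" ;;              # last label consumed
    esac
    echo "$zone"
  done
)

delegation_chain www.example.com
```

Each printed line corresponds to one hop in the flow above: the root, the TLD, then the zone that holds the record.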

2.2 Recursive vs Iterative Queries

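# A recursive query asks the resolver to do all the work; iterative queries follow referrals one server at a time.
# With +trace, dig sends the iterative queries itself, starting from the root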
$ dig +trace example.com

; <<>> DiG 9.18.18 <<>> +trace example.com
;; global options: +cmd
.                       518400  IN      NS      a.root-servers.net.
.                       518400  IN      NS      b.root-servers.net.
;; Received 239 bytes from 127.0.0.53#53(127.0.0.53) in 1 ms

com.                    172800  IN      NS      a.gtld-servers.net.
com.                    172800  IN      NS      b.gtld-servers.net.
;; Received 1170 bytes from 198.41.0.4#53(a.root-servers.net) in 23 ms

example.com.            172800  IN      NS      a.iana-servers.net.
example.com.            172800  IN      NS      b.iana-servers.net.
;; Received 356 bytes from 192.5.6.30#53(a.gtld-servers.net) in 15 ms

example.com.            86400   IN      A       93.184.216.34
;; Received 56 bytes from 199.43.135.53#53(a.iana-servers.net) in 78 ms

2.3 DNS Record Types

| Record | Description | Example |
|--------|-------------|---------|
| A | IPv4 address mapping | example.com → 93.184.216.34 |
| AAAA | IPv6 address mapping | example.com → 2606:2800:220:1:... |
| CNAME | Canonical name (alias) | www.example.com → example.com |
| MX | Mail exchange server | example.com → mail.example.com |
| NS | Name server delegation | example.com → ns1.example.com |
| TXT | Text record (SPF, DKIM, etc.) | v=spf1 include:... |
| SRV | Service locator | _http._tcp.example.com |
| PTR | Reverse lookup (IP→domain) | 34.216.184.93.in-addr.arpa → example.com |
| SOA | Start of Authority | Zone management metadata |
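
The PTR row is the least obvious one, because the record lives under a name built from the reversed octets of the address. A minimal sketch of that construction for IPv4, assuming a hypothetical helper named `reverse_name` (this is the name that `dig -x` queries for you):

```shell
# reverse_name IPV4
# Hypothetical helper: builds the in-addr.arpa name that holds the PTR record.
# Returns non-zero if the input does not have exactly four dot-separated parts.
reverse_name() {
  echo "$1" | awk -F. '
    NF == 4 { printf "%s.%s.%s.%s.in-addr.arpa.\n", $4, $3, $2, $1; next }
            { exit 1 }'
}

reverse_name 93.184.216.34
```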

3. Common DNS Issues

3.1 NXDOMAIN (Non-Existent Domain)

This response is returned when the queried domain does not exist.

$ dig nonexistent.example.com

;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 12345
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1

;; QUESTION SECTION:
;nonexistent.example.com.    IN      A

;; AUTHORITY SECTION:
example.com.            900     IN      SOA     ns1.example.com. admin.example.com. 2024010101 3600 900 604800 86400

Root Cause Analysis:

  • Typo in the domain name
  • DNS record has not been created yet
  • Domain registration has expired
  • DNS propagation has not completed
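
When checking many names, it can help to pull the status out of the dig header and map it to a first thing to check. The sketch below assumes the header format shown above; `dig_status` and `status_hint` are hypothetical helper names, and the hint wording is illustrative rather than exhaustive.

```shell
# dig_status: read dig output on stdin, print the header's status value.
dig_status() {
  sed -n 's/.*status: \([A-Z]*\),.*/\1/p' | head -n 1
}

# status_hint STATUS: print a first thing to check for that response code.
status_hint() {
  case "$1" in
    NOERROR)  echo "name exists; if ANSWER is 0, that record type is missing" ;;
    NXDOMAIN) echo "name does not exist; check spelling, zone records, registration" ;;
    SERVFAIL) echo "resolver failed; check upstream reachability and DNSSEC" ;;
    REFUSED)  echo "server refused; check ACLs and recursion settings" ;;
    *)        echo "unrecognised status: $1" ;;
  esac
}

# Usage (needs network access, so it is left as a comment):
#   status_hint "$(dig nonexistent.example.com | dig_status)"
```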

3.2 DNS Timeout

$ dig @10.0.0.1 example.com +timeout=5

; <<>> DiG 9.18.18 <<>> @10.0.0.1 example.com +timeout=5
; (1 server found)
;; global options: +cmd
;; connection timed out; no servers could be reached

Root Cause Analysis:

  • DNS server is down or unreachable
  • Firewall blocking UDP/TCP port 53
  • Network connectivity issues
  • DNS server overloaded

3.3 Stale or Wrong Records

# Unexpected IP address returned
$ dig api.myservice.com +short
192.168.1.100    # Expected: 10.0.1.50

# Compare results from multiple DNS servers
$ dig @8.8.8.8 api.myservice.com +short
10.0.1.50
$ dig @1.1.1.1 api.myservice.com +short
10.0.1.50
$ dig @192.168.1.1 api.myservice.com +short
192.168.1.100    # Local DNS cache returning stale value
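
The comparison above can be automated once the answers are collected as "resolver answer" pairs. A small sketch, assuming a hypothetical helper named `stale_resolvers` that takes the expected address as its argument and reads the pairs on stdin:

```shell
# stale_resolvers EXPECTED_IP
# Reads lines of the form "<resolver> <answer>" on stdin and prints every
# resolver whose answer differs from the expected address.
stale_resolvers() {
  awk -v want="$1" '$2 != want { print $1 }'
}

# Sample data taken from the output above.
printf '%s\n' \
  '8.8.8.8 10.0.1.50' \
  '1.1.1.1 10.0.1.50' \
  '192.168.1.1 192.168.1.100' | stale_resolvers 10.0.1.50

# With live data (needs network access):
#   for s in 8.8.8.8 1.1.1.1 192.168.1.1; do
#     echo "$s $(dig @"$s" api.myservice.com +short | head -n 1)"
#   done | stale_resolvers 10.0.1.50
```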

4. DNS Debugging Tools

4.1 dig (Domain Information Groper)

dig is the most detailed of the common DNS debugging tools: it prints the full response, including the header flags, each section, and the TTLs.

# Basic lookup
$ dig example.com

# Query specific record types
$ dig example.com MX
$ dig example.com AAAA
$ dig example.com TXT

# Short output
$ dig example.com +short
93.184.216.34

# Query a specific DNS server
$ dig @8.8.8.8 example.com

# Reverse lookup
$ dig -x 93.184.216.34

# Query ANY (many servers now return only a minimal answer, see RFC 8482; query each type explicitly instead)
$ dig example.com ANY

# Check response time
$ dig example.com | grep "Query time"
;; Query time: 23 msec

# Use TCP instead of UDP
$ dig +tcp example.com

# Request DNSSEC records (RRSIG); a validating resolver sets the "ad" flag in the header when validation succeeded
$ dig +dnssec example.com

4.2 nslookup

Supports both interactive and non-interactive modes.

# Basic lookup
$ nslookup example.com
Server:         127.0.0.53
Address:        127.0.0.53#53

Non-authoritative answer:
Name:   example.com
Address: 93.184.216.34

# Use a specific DNS server
$ nslookup example.com 8.8.8.8

# Query specific record type
$ nslookup -type=MX example.com

# Interactive mode
$ nslookup
> set type=NS
> example.com
Server:         127.0.0.53
Address:        127.0.0.53#53

Non-authoritative answer:
example.com     nameserver = a.iana-servers.net.
example.com     nameserver = b.iana-servers.net.
> exit

4.3 host

A lightweight tool providing concise output.

# Basic lookup
$ host example.com
example.com has address 93.184.216.34
example.com has IPv6 address 2606:2800:220:1:248:1893:25c8:1946
example.com mail is handled by 0 .

# Reverse lookup
$ host 93.184.216.34
34.216.184.93.in-addr.arpa domain name pointer example.com.

# Query specific record type
$ host -t NS example.com
example.com name server a.iana-servers.net.
example.com name server b.iana-servers.net.

# Verbose output
$ host -v example.com

4.4 drill

A tool with enhanced DNSSEC support (from the ldns package).

# Basic lookup
$ drill example.com
;; ->>HEADER<<- opcode: QUERY, rcode: NOERROR, id: 54321
;; QUESTION SECTION:
;; example.com.   IN      A

;; ANSWER SECTION:
example.com.      86400   IN      A       93.184.216.34

# DNSSEC trace
$ drill -DT example.com

# Query a specific server
$ drill @8.8.8.8 example.com

5. DNS Caching Issues

5.1 TTL (Time To Live)

# Check TTL value
$ dig example.com

;; ANSWER SECTION:
example.com.            86400   IN      A       93.184.216.34
#                       ^^^^^ TTL: 86400 seconds = 24 hours

A long TTL means DNS changes take longer to propagate. Before a DNS migration, lower the TTL in advance.

# TTL strategy example
# 1. At least one old TTL (here 24 hours) before migration: Lower TTL to 300 seconds (5 minutes)
# 2. Execute migration: Change IP address
# 3. After confirming propagation: Restore TTL to original value
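
The waiting period in step 1 follows directly from the old TTL: caches may hold the old record for that long after the TTL is lowered. A tiny sketch for turning a TTL in seconds into a readable duration, assuming a hypothetical helper named `ttl_human`:

```shell
# ttl_human SECONDS
# Prints a TTL as hours, minutes and seconds.
ttl_human() (
  t="$1"
  printf '%dh %dm %ds\n' "$((t / 3600))" "$((t % 3600 / 60))" "$((t % 60))"
)

ttl_human 86400   # the TTL before it is lowered
ttl_human 300     # the TTL used during the migration
```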

5.2 Negative Caching

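NXDOMAIN responses are also cached. The negative cache TTL is the lower of the SOA record's own TTL and its MINIMUM field (RFC 2308), so a name that was queried before its record existed can keep failing for that long after the record is added.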

$ dig example.com SOA

;; ANSWER SECTION:
example.com.    86400   IN      SOA     ns1.example.com. admin.example.com. (
                                        2024010101 ; Serial
                                        3600       ; Refresh
                                        900        ; Retry
                                        604800     ; Expire
                                        86400 )    ; Minimum TTL (negative cache TTL)
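
To see how long a negative answer may be cached, compare the TTL of the SOA record with its last field and take the smaller value. The sketch below assumes the SOA record arrives on a single line, as dig prints it by default (the multi-line layout above is the +multiline form); `negative_ttl` is a hypothetical helper name.

```shell
# negative_ttl: read a single-line SOA record on stdin and print
# min(SOA TTL, MINIMUM field), the negative cache TTL per RFC 2308.
negative_ttl() {
  awk '$4 == "SOA" { ttl = $2; min = $NF; print (ttl < min ? ttl : min); exit }'
}

# SOA line as it appears in the AUTHORITY section of an NXDOMAIN response.
echo 'example.com. 900 IN SOA ns1.example.com. admin.example.com. 2024010101 3600 900 604800 86400' \
  | negative_ttl
```

For this record the result is 900 seconds, not 86400, because the SOA record's own TTL is the smaller of the two.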

5.3 Managing Local DNS Cache

# Linux: Check systemd-resolved cache statistics
$ resolvectl statistics
Current Cache Size: 152
Cache Hits: 1234
Cache Misses: 567

# Linux: Flush systemd-resolved cache
$ sudo resolvectl flush-caches

# macOS: Flush DNS cache
$ sudo dscacheutil -flushcache && sudo killall -HUP mDNSResponder

# Windows: Flush DNS cache
> ipconfig /flushdns

6. resolv.conf and nsswitch.conf Configuration

6.1 /etc/resolv.conf

$ cat /etc/resolv.conf
# DNS server configuration (maximum 3)
nameserver 8.8.8.8
nameserver 8.8.4.4
nameserver 1.1.1.1

# Default search domains
search mycompany.com prod.mycompany.com

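# Options (comments must start in the first column, so they are listed here)
# timeout:2   wait 2 seconds for a reply before trying the next server
# attempts:3  send each query up to 3 times before giving up
# ndots:5     threshold for trying a name as-is first (details below)
# rotate      spread queries across the listed servers (round-robin)
# edns0       enable EDNS0
options timeout:2 attempts:3 ndots:5 rotate edns0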

Key Configuration Options:

  • nameserver: DNS servers to use (tried in order, maximum 3)
  • search: Domains automatically appended to short hostnames
  • domain: Similar to search but specifies only a single domain
  • options ndots:n: Names with fewer than n dots try the search domains first; names with n or more dots are tried as-is first (default n is 1)
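
When comparing hosts or Pods, a one-screen summary of these settings is often enough. A minimal sketch, assuming a hypothetical helper named `resolv_summary` that reads a resolv.conf-style file (the path defaults to /etc/resolv.conf) and reports the values that matter most for troubleshooting:

```shell
# resolv_summary [FILE]
# Prints the nameservers, search list and ndots value from a resolv.conf file.
resolv_summary() {
  awk '
    /^[#;]/            { next }                        # comment lines
    $1 == "nameserver" { ns = ns (ns ? " " : "") $2; next }
    $1 == "search"     { $1 = ""; sub(/^ +/, ""); search = $0; next }
    $1 == "options"    {
      for (i = 2; i <= NF; i++)
        if ($i ~ /^ndots:/) { split($i, kv, ":"); ndots = kv[2] }
    }
    END {
      print "nameservers: " ns
      print "search: " search
      print "ndots: " (ndots == "" ? 1 : ndots)        # resolver default is 1
    }
  ' "${1:-/etc/resolv.conf}"
}

# Usage:
#   resolv_summary                      # the local host
#   kubectl exec pod-a -- cat /etc/resolv.conf > /tmp/pod-resolv.conf
#   resolv_summary /tmp/pod-resolv.conf
```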

6.2 /etc/nsswitch.conf

Controls the order of name resolution sources.

$ grep hosts /etc/nsswitch.conf
hosts:          files dns myhostname

# files = Check /etc/hosts first
# dns = Query DNS servers
# myhostname = Resolve local hostname (systemd)

# /etc/hosts file example
$ cat /etc/hosts
127.0.0.1       localhost
127.0.1.1       myserver
10.0.1.50       api.internal.mycompany.com api-internal
192.168.1.100   db-master.mycompany.com

6.3 Checking systemd-resolved

# View current DNS configuration
$ resolvectl status
Global
       Protocols: -LLMNR -mDNS -DNSOverTLS DNSSEC=no/unsupported
resolv.conf mode: stub

Link 2 (eth0)
    Current Scopes: DNS
         Protocols: +DefaultRoute +LLMNR -mDNS -DNSOverTLS DNSSEC=no/unsupported
Current DNS Server: 8.8.8.8
       DNS Servers: 8.8.8.8 8.8.4.4

# Test domain resolution
$ resolvectl query example.com
example.com: 93.184.216.34           -- link: eth0
             2606:2800:220:1:248:1893:25c8:1946 -- link: eth0

7. CoreDNS Troubleshooting in Kubernetes

7.1 CoreDNS Architecture

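Within a Kubernetes cluster, CoreDNS handles service discovery. With the default dnsPolicy (ClusterFirst), every DNS query from a Pod goes to CoreDNS, which answers cluster names itself and forwards everything else upstream.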

# Check CoreDNS Pod status
$ kubectl get pods -n kube-system -l k8s-app=kube-dns
NAME                       READY   STATUS    RESTARTS   AGE
coredns-5d78c9869d-abc12   1/1     Running   0          7d
coredns-5d78c9869d-def34   1/1     Running   0          7d

# Check CoreDNS service
$ kubectl get svc -n kube-system kube-dns
NAME       TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)                  AGE
kube-dns   ClusterIP   10.96.0.10   <none>        53/UDP,53/TCP,9153/TCP   30d

7.2 Inspecting the CoreDNS Corefile

$ kubectl get configmap coredns -n kube-system -o yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns
  namespace: kube-system
data:
  Corefile: |
    .:53 {
        errors
        health {
           lameduck 5s
        }
        ready
        kubernetes cluster.local in-addr.arpa ip6.arpa {
           pods insecure
           fallthrough in-addr.arpa ip6.arpa
           ttl 30
        }
        prometheus :9153
        forward . /etc/resolv.conf {
           max_concurrent 1000
        }
        cache 30
        loop
        reload
        loadbalance
    }

7.3 Checking CoreDNS Logs

# View CoreDNS logs
$ kubectl logs -n kube-system -l k8s-app=kube-dns --tail=50
[INFO] 10.244.0.15:45678 - 12345 "A IN my-service.default.svc.cluster.local. udp 54 false 512" NOERROR qr,aa,rd 106 0.000234s
[INFO] 10.244.0.15:45679 - 12346 "A IN external-api.com. udp 34 false 512" NOERROR qr,rd,ra 62 0.023456s

# Enable log plugin (add log to Corefile)
# .:53 {
#     log
#     errors
#     ...
# }
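
Once the log plugin is enabled, the query lines can be summarised per response code. The sketch below assumes the log line layout shown above (response code in the 12th whitespace-separated field, duration last); `rcode_summary` is a hypothetical helper name, and a customised log format would need different field numbers.

```shell
# rcode_summary: read CoreDNS query log lines on stdin and print, per
# response code, the number of queries and the slowest duration seen.
rcode_summary() {
  awk '
    $1 == "[INFO]" && NF >= 15 {
      count[$12]++
      dur = $NF; sub(/s$/, "", dur)                 # strip the trailing "s"
      if (dur + 0 > max[$12]) max[$12] = dur + 0
    }
    END {
      for (r in count) printf "%s count=%d max=%.6fs\n", r, count[r], max[r]
    }' | sort
}

# Usage:
#   kubectl logs -n kube-system -l k8s-app=kube-dns --tail=1000 | rcode_summary
```

A rising SERVFAIL count or a large maximum duration for NOERROR usually points at the upstream resolvers rather than at cluster records.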

7.4 Debugging DNS from Inside a Pod

# Create a debug Pod
$ kubectl run dns-debug --image=nicolaka/netshoot --rm -it --restart=Never -- bash

# Test DNS resolution inside the Pod
bash-5.1# nslookup kubernetes.default.svc.cluster.local
Server:         10.96.0.10
Address:        10.96.0.10#53

Name:   kubernetes.default.svc.cluster.local
Address: 10.96.0.1

# Detailed check with dig
bash-5.1# dig kubernetes.default.svc.cluster.local

;; ANSWER SECTION:
kubernetes.default.svc.cluster.local. 30 IN A   10.96.0.1

;; Query time: 1 msec
;; SERVER: 10.96.0.10#53(10.96.0.10)

# Test external domain resolution
bash-5.1# dig example.com +short
93.184.216.34

# Check the Pod's DNS configuration
bash-5.1# cat /etc/resolv.conf
nameserver 10.96.0.10
search default.svc.cluster.local svc.cluster.local cluster.local
options ndots:5

8. ndots Configuration and Search Domains

8.1 How ndots Works

The ndots option specifies that if the query name contains fewer dots (.) than this value, the system will try appending search domains first.

# Kubernetes default: ndots:5
# search default.svc.cluster.local svc.cluster.local cluster.local

# Querying "api.example.com" (2 dots < ndots 5)
# Actual query order:
1. api.example.com.default.svc.cluster.local    → NXDOMAIN
2. api.example.com.svc.cluster.local            → NXDOMAIN
3. api.example.com.cluster.local                → NXDOMAIN
4. api.example.com.                             → Success!

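This means each external lookup costs three extra queries that end in NXDOMAIN. Because most clients look up A and AAAA records together, that is often six wasted queries per name.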
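
The ordering rule can be written down as a short function. This is a sketch of the glibc-style behaviour described above, not the resolver's actual code: `query_order` is a hypothetical helper that takes the name, the ndots value and the search domains, and prints the candidate names in the order they would be tried.

```shell
# query_order NAME NDOTS SEARCH_DOMAIN...
# Prints the candidate names in lookup order. No DNS queries are sent.
query_order() (
  name="$1"; ndots="$2"; shift 2
  case "$name" in
    *.) echo "$name"; exit 0 ;;       # trailing dot: absolute, no search list
  esac
  dots=$(printf '%s\n' "$name" | awk -F. '{ print NF - 1 }')
  if [ "$dots" -ge "$ndots" ]; then   # enough dots: try the name as-is first
    echo "${name}."
  fi
  for domain in "$@"; do
    echo "${name}.${domain}."
  done
  if [ "$dots" -lt "$ndots" ]; then   # too few dots: the name as-is comes last
    echo "${name}."
  fi
  exit 0
)

query_order api.example.com 5 default.svc.cluster.local svc.cluster.local cluster.local
```

Running it with ndots set to 2, or with a trailing dot on the name, shows why both optimisations avoid the wasted lookups for external names.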

8.2 Optimizing ndots

# Adjust ndots via dnsConfig in Pod spec
apiVersion: v1
kind: Pod
metadata:
  name: optimized-pod
spec:
  containers:
    - name: app
      image: myapp:latest
  dnsConfig:
    options:
      - name: ndots
        value: '2'
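
# Use an FQDN (trailing dot) to skip the search list.
# dig applies the search list only with +search; applications that use the
# system resolver (curl, most language runtimes) apply it by default.
# Inefficient:
$ dig +search api.example.com   # Search domains are tried first (ndots:5)

# Efficient:
$ dig +search api.example.com.  # Queried directly as an FQDN (trailing dot)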

8.3 dnsPolicy Options

# ClusterFirst (default): Use CoreDNS first
apiVersion: v1
kind: Pod
spec:
  dnsPolicy: ClusterFirst

# Default: Use the node's DNS configuration as-is
spec:
  dnsPolicy: Default

# None: Configure DNS manually via dnsConfig
spec:
  dnsPolicy: None
  dnsConfig:
    nameservers:
    - 8.8.8.8
    - 1.1.1.1
    searches:
    - my-namespace.svc.cluster.local
    - svc.cluster.local
    options:
    - name: ndots
      value: "2"

9. Practical Debugging Scenarios

9.1 Scenario 1: Inter-Service Communication Failure

# Symptom: Pod A cannot connect to Pod B's service
$ kubectl exec pod-a -- curl http://my-service:8080
curl: (6) Could not resolve host: my-service

# Step 1: Check the Pod's DNS configuration
$ kubectl exec pod-a -- cat /etc/resolv.conf
nameserver 10.96.0.10
search default.svc.cluster.local svc.cluster.local cluster.local
options ndots:5

# Step 2: Query CoreDNS directly
$ kubectl exec pod-a -- dig @10.96.0.10 my-service.default.svc.cluster.local
;; status: NXDOMAIN

# Step 3: Verify the service exists
$ kubectl get svc my-service -n default
Error from server (NotFound): services "my-service" not found

# Step 4: Check the correct namespace
$ kubectl get svc --all-namespaces | grep my-service
production   my-service   ClusterIP   10.96.45.123   <none>   8080/TCP   5d

# Solution: Use FQDN with the correct namespace
$ kubectl exec pod-a -- curl http://my-service.production.svc.cluster.local:8080

9.2 Scenario 2: External Domain Resolution Failure

# Symptom: External API calls from Pod are failing
$ kubectl exec my-pod -- curl https://api.external.com
curl: (6) Could not resolve host: api.external.com

# Step 1: Verify CoreDNS is working for internal queries
$ kubectl exec my-pod -- dig @10.96.0.10 kubernetes.default.svc.cluster.local +short
10.96.0.1    # Internal DNS is working fine

# Step 2: Check CoreDNS upstream forwarding
$ kubectl exec my-pod -- dig @10.96.0.10 api.external.com
;; status: SERVFAIL

# Step 3: Check CoreDNS logs
$ kubectl logs -n kube-system -l k8s-app=kube-dns | grep -i "error"
[ERROR] plugin/forward: no nameservers found

# Step 4: Inspect the CoreDNS forward configuration
$ kubectl get configmap coredns -n kube-system -o jsonpath='{.data.Corefile}'
# Look for: forward . /etc/resolv.conf

# Step 5: Check CoreDNS Pod's resolv.conf
$ kubectl exec -n kube-system coredns-5d78c9869d-abc12 -- cat /etc/resolv.conf
nameserver 169.254.169.253   # Cloud DNS may be unreachable

# Solution: Explicitly specify forward targets in the Corefile
# forward . 8.8.8.8 8.8.4.4

9.3 Scenario 3: Intermittent DNS Timeouts

# Symptom: DNS lookups intermittently take more than 5 seconds
$ time dig @10.96.0.10 example.com
;; Query time: 5003 msec    # 5-second timeout before retry

# Cause: Linux conntrack race condition (DNAT + UDP)
# UDP DNS packets can collide in the conntrack table

# Verify: Check conntrack table statistics
$ sudo conntrack -S
cpu=0           found=0 invalid=1523 insert=0 insert_failed=156 drop=156
#                                                ^^^^^^^^^^^^^ insert failures indicate the issue

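# Solution 1: Use TCP for DNS
# On the client side, set the glibc resolver option use-vc in the Pod's dnsConfig.
# force_tcp in the CoreDNS forward block only switches CoreDNS-to-upstream queries to TCP.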

# Solution 2: Deploy NodeLocal DNSCache
$ kubectl apply -f https://raw.githubusercontent.com/kubernetes/kubernetes/master/cluster/addons/dns/nodelocaldns/nodelocaldns.yaml

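# Solution 3: Set single-request-reopen in the Pod's dnsConfig
# (glibc only; musl-based images such as Alpine ignore this option)
apiVersion: v1
kind: Pod
metadata:
  name: my-pod              # hypothetical name
spec:
  containers:
  - name: app               # hypothetical container
    image: my-app:1.0       # hypothetical image
  dnsConfig:
    options:
    - name: single-request-reopen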
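
The conntrack check in this scenario prints one line per CPU, so the failures have to be added up to see the full picture. A small sketch that sums insert_failed across all lines follows; the function name sum_insert_failed is hypothetical, and the parsing assumes the key=value layout shown in the sample output above.

```bash
# Sum the insert_failed counters from "conntrack -S" output on stdin.
# (sum_insert_failed is a hypothetical helper for illustration.)
sum_insert_failed() {
  awk '{
    for (i = 1; i <= NF; i++) {
      if ($i ~ /^insert_failed=/) {
        split($i, kv, "=")
        total += kv[2]
      }
    }
  } END { print total + 0 }'
}
```

Usage: `sudo conntrack -S | sum_insert_failed`. Running it twice a few minutes apart and comparing the totals shows whether the counter is still growing, which is a stronger signal than a single non-zero value.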

9.4 Scenario 4: Verifying DNS Propagation

# Check propagation across multiple DNS servers after a change
$ for dns in 8.8.8.8 1.1.1.1 9.9.9.9 208.67.222.222; do
    echo "=== $dns ==="
    dig @$dns api.myservice.com +short +timeout=3
done

=== 8.8.8.8 ===
10.0.1.50
=== 1.1.1.1 ===
10.0.1.50
=== 9.9.9.9 ===
192.168.1.100    # Still returning old record
=== 208.67.222.222 ===
192.168.1.100    # Still returning old record

# Check TTL to estimate when the cache will expire
$ dig @9.9.9.9 api.myservice.com | grep -A1 "ANSWER SECTION"
;; ANSWER SECTION:
api.myservice.com.    1423    IN      A       192.168.1.100
#                     ^^^^ Remaining TTL: cache expires in approximately 24 minutes
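
The TTL arithmetic above can be scripted. The sketch below reads answer-section lines (name, TTL, class, type, data) and prints the remaining cache lifetime rounded up to whole minutes. The function name answer_ttl_minutes is hypothetical.

```bash
# Convert the TTL column of dig answer lines on stdin into minutes (rounded up).
# Expected input format: NAME TTL IN TYPE RDATA
# (answer_ttl_minutes is a hypothetical helper for illustration.)
answer_ttl_minutes() {
  awk '$3 == "IN" && $2 ~ /^[0-9]+$/ {
    printf "%s %s %d min\n", $1, $5, int(($2 + 59) / 60)
  }'
}
```

Usage: `dig @9.9.9.9 api.myservice.com +noall +answer | answer_ttl_minutes`. The +noall +answer flags limit dig's output to the answer section, so no grep is needed.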

10. Useful DNS Debugging One-Liners

# 1. Benchmark DNS response times
$ for i in $(seq 1 10); do dig example.com | grep "Query time"; done

# 2. Bulk lookup for multiple domains
$ for domain in api.example.com web.example.com db.example.com; do
    echo "$domain: $(dig +short $domain)"
done

# 3. Monitor DNS record changes
$ watch -n 5 "dig +short api.myservice.com @8.8.8.8"

# 4. Bulk reverse DNS lookups
$ for ip in 10.0.1.{1..10}; do
    result=$(dig +short -x $ip)
    echo "$ip -> ${result:-NO PTR}"
done

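# 5. Show DNSSEC signatures (RRSIG) alongside the answer
$ dig +dnssec +short example.com
93.184.216.34
A 13 2 86400 20240315000000 20240301000000 12345 example.com. <base64_signature>
# Validation status is reported by the "ad" flag in the full output of a validating resolver
$ dig +dnssec example.com | grep flags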

# 6. Verify DNS resolution for all Kubernetes services
$ kubectl get svc --all-namespaces -o jsonpath='{range .items[*]}{.metadata.name}.{.metadata.namespace}.svc.cluster.local{"\n"}{end}' | \
  while read -r fqdn; do
    result=$(kubectl exec dns-debug -- dig +short "$fqdn" 2>/dev/null)
    echo "$fqdn -> ${result:-FAILED}"
  done
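
The benchmark in one-liner 1 prints one Query time line per run, which is hard to compare by eye. A minimal sketch that condenses those lines into a summary follows; the function name query_time_stats is hypothetical.

```bash
# Summarize ";; Query time: N msec" lines from dig output on stdin.
# (query_time_stats is a hypothetical helper for illustration.)
query_time_stats() {
  awk '/Query time:/ {
    t = $4 + 0                      # fourth field is the time in msec
    if (n == 0 || t < min) min = t
    if (n == 0 || t > max) max = t
    sum += t
    n++
  } END {
    if (n == 0) { print "no samples"; exit 1 }
    printf "n=%d min=%d avg=%.1f max=%d msec\n", n, min, sum / n, max
  }'
}
```

Usage: `for i in $(seq 1 10); do dig example.com; done | query_time_stats`. Repeated queries for the same name are usually served from the resolver's cache after the first run, so the minimum tends to reflect cache hits and the maximum the initial lookup.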

11. Summary and Checklist

When a DNS issue occurs, diagnose in the following order:

  1. Check /etc/resolv.conf settings (nameserver, search, ndots)
  2. Test basic DNS queries with dig or nslookup
  3. Query a specific DNS server (dig @8.8.8.8)
  4. Trace the full resolution path with +trace
  5. Check TTL to determine if caching is the issue
  6. In Kubernetes environments, check CoreDNS Pod status and logs
  7. Review the performance impact of ndots and search domain settings
  8. For intermittent conntrack-related issues, consider deploying NodeLocal DNSCache
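
Step 1 of the checklist can be reduced to a quick summary. The sketch below extracts the nameserver, search, and ndots settings from a resolv.conf file; the function name summarize_resolv_conf is hypothetical, and the file path defaults to /etc/resolv.conf when no argument is given.

```bash
# Print the nameservers, search list, and ndots value from a resolv.conf file.
# Usage: summarize_resolv_conf [FILE]
# (summarize_resolv_conf is a hypothetical helper for illustration.)
summarize_resolv_conf() {
  awk '
    $1 == "nameserver" { ns = ns (ns ? " " : "") $2 }
    $1 == "search"     { $1 = ""; sub(/^ +/, ""); search = $0 }
    $1 == "options" {
      for (i = 2; i <= NF; i++)
        if ($i ~ /^ndots:/) { split($i, kv, ":"); ndots = kv[2] }
    }
    END {
      print "nameservers: " (ns ? ns : "(none)")
      print "search: " (search ? search : "(none)")
      # The resolver default is ndots:1 when the option is absent.
      print "ndots: " (ndots != "" ? ndots : "1 (default)")
    }
  ' "${1:-/etc/resolv.conf}"
}
```

Inside a cluster, the same summary can be taken from a Pod by saving the output of kubectl exec POD -- cat /etc/resolv.conf to a file and passing that file as the argument.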

DNS is a frequent root cause of network problems. Working through the checks above in a fixed order makes incident response faster and more predictable.
