Complete Guide to Cloud Network Architecture Troubleshooting — Practical Debugging for AWS, GCP, and Azure
- 1. VPC Networking Fundamentals and Troubleshooting Overview
- 2. Security Groups vs NACLs Debugging
- 3. VPC Peering and Transit Gateway Issues
- 4. AWS/GCP/Azure-Specific Network Troubleshooting
- 5. VPC Flow Logs Analysis
- 6. NAT Gateway and Internet Gateway Issues
- 7. Cross-Region and Hybrid Cloud Connectivity
- 8. Cloud DNS Troubleshooting (Route 53, Cloud DNS)
- 9. Practical Cloud Network Debugging Scenarios
- 10. Troubleshooting Toolkit and Automation
- Conclusion
- Quiz
1. VPC Networking Fundamentals and Troubleshooting Overview
Network issues are among the most frequent problems in cloud environments, and among the hardest to diagnose. Unlike on-premises setups, you cannot physically access the hardware, so troubleshooting requires a systematic approach built on the tools and logs the cloud provider exposes.
1.1 VPC (Virtual Private Cloud) Basic Structure
A VPC is a logically isolated network space within the cloud. While terminology varies across cloud vendors, the core concepts remain the same.
| Concept | AWS | GCP | Azure |
|---|---|---|---|
| Virtual Network | VPC | VPC | VNet |
| Subnet | Subnet | Subnet | Subnet |
| Routing Table | Route Table | Routes | Route Table |
| Firewall (Instance) | Security Group | Firewall Rules | NSG |
| Firewall (Subnet) | NACL | Firewall Policies | NSG (Subnet-attached) |
1.2 Systematic Troubleshooting Approach
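When diagnosing network issues, it helps to work top-down: start at the application layer of the OSI model, move down to the network layer, and then check the cloud-specific security and infrastructure controls.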
Troubleshooting Order:
1. Application Layer (L7) — DNS resolution, HTTP response codes
2. Transport Layer (L4) — Port connectivity, TCP handshake
3. Network Layer (L3) — IP routing, subnet CIDR
4. Security Layer — Security Groups, NACLs, firewall rules
5. Infrastructure Layer — IGW, NAT GW, VPN, Peering status
1.3 Basic Connectivity Test Commands
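# Basic connectivity test (ICMP must be allowed by the Security Group and NACL; a failed ping alone does not prove the host is down)
ping -c 4 10.0.1.50
# Specific port connectivity check (options placed before the host for portability across nc variants)
nc -zv -w 5 10.0.1.50 443
# TCP-based route tracing (Linux traceroute; -T usually requires root)
sudo traceroute -T -p 443 10.0.1.50
# DNS lookup test
dig +trace example.com
# Query a specific resolver directly (assuming a 10.0.0.0/16 VPC, 10.0.0.2 is the Amazon-provided DNS at the VPC base address + 2)
nslookup example.com 10.0.0.2
# HTTP-level debugging with curl
curl -v --connect-timeout 5 https://api.internal.example.com/health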
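To tie these commands back to the layered order in 1.2, one option is to map curl's exit status to the layer that most likely failed. The sketch below is illustrative only: `explain_curl_exit` is a hypothetical helper name, and the hints are rules of thumb rather than definitive diagnoses.

```shell
# Map a curl exit status to the layer that most likely failed.
# explain_curl_exit is a hypothetical helper; the hints are heuristics.
explain_curl_exit() {
  case "$1" in
    0)     echo "OK: request completed; check the HTTP status code next" ;;
    6)     echo "L7/DNS: could not resolve host; check resolver and hosted zone" ;;
    7)     echo "L4: connection failed or refused; check listener, then SG/NACL" ;;
    28)    echo "L3/L4: timed out; packets are likely dropped (SG, NACL, or route)" ;;
    35|60) echo "L7/TLS: handshake or certificate problem" ;;
    *)     echo "Unclassified curl exit code $1; see EXIT CODES in man curl" ;;
  esac
}

# Usage (the URL is the placeholder from the commands above):
#   curl -sS -o /dev/null --connect-timeout 5 https://api.internal.example.com/health
#   explain_curl_exit $?
```

A timeout (28) usually points at silently dropped packets, while a refusal (7) means the packet reached a host that had nothing listening, so the two lead to different parts of the checklist.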
2. Security Groups vs NACLs Debugging
2.1 Key Differences Between Security Groups and NACLs
Misunderstanding how these two security mechanisms differ is a common source of wasted troubleshooting time.
Security Group (Stateful):
- Return traffic for an allowed connection is permitted automatically, in either direction
- Applied at the instance (ENI) level
- Allow rules only (implicit deny)
- Rule evaluation: All rules are evaluated together; traffic is allowed if any rule matches
NACL (Stateless):
- Inbound/outbound rules must be configured separately
- Applied at the subnet level
- Both allow and deny rules supported
- Rule evaluation: Processed in rule number order (first match wins)
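The "first match wins" behavior of NACLs is easy to misjudge, so here is a toy model of it. This is a simplified sketch: the rule format (`<rule-number> <allow|deny> <from-port> <to-port>`) and the function name `nacl_decision` are made up for illustration, and protocol and CIDR matching are deliberately left out.

```shell
# Toy model of NACL evaluation: rules are checked in ascending rule-number
# order and the first rule whose port range matches decides the outcome.
# Rule format (one per line): <rule-number> <allow|deny> <from-port> <to-port>
nacl_decision() {
  port="$1"
  sort -n | awk -v port="$port" '
    port >= $3 && port <= $4 { print $2 " (rule " $1 ")"; found = 1; exit }
    END { if (!found) print "deny (rule *)" }
  '
}

rules='200 allow 0 65535
100 deny 22 22'

printf '%s\n' "$rules" | nacl_decision 22    # deny (rule 100)
printf '%s\n' "$rules" | nacl_decision 443   # allow (rule 200)
```

Rule 100 blocks port 22 even though rule 200 would allow it, because evaluation stops at the lowest-numbered match. A Security Group has no equivalent ordering: any matching allow rule is enough.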
2.2 Common Security Group Issues and Fixes
# Check Security Group rules with AWS CLI
aws ec2 describe-security-groups \
--group-ids sg-0abc1234def56789 \
--query 'SecurityGroups[*].{
GroupId:GroupId,
Ingress:IpPermissions,
Egress:IpPermissionsEgress
}' \
--output json
# Check Security Groups attached to a specific instance
aws ec2 describe-instances \
--instance-ids i-0abcdef1234567890 \
--query 'Reservations[*].Instances[*].SecurityGroups' \
--output table
Case 1: Missing Outbound Rules
Problem: EC2 fails to call external APIs
Cause: Default outbound rule (0.0.0.0/0 allow) was deleted from the Security Group
Fix:
aws ec2 authorize-security-group-egress \
--group-id sg-0abc1234def56789 \
--protocol tcp --port 443 \
--cidr 0.0.0.0/0
Case 2: Missing Self-Referencing Rules
Problem: Instances within the same Security Group cannot communicate
Cause: The SG lacks a rule allowing itself as a source
Fix:
aws ec2 authorize-security-group-ingress \
--group-id sg-0abc1234def56789 \
--protocol -1 \
--source-group sg-0abc1234def56789
2.3 NACL Debugging Patterns
# Check NACL rules
aws ec2 describe-network-acls \
--network-acl-ids acl-0abc1234 \
--query 'NetworkAcls[*].Entries' \
--output json
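Ephemeral Port Issues (the Most Common NACL Problem):
Problem: Responses to outbound HTTPS requests are never received
Cause: No inbound NACL rule allows the ephemeral port range (1024-65535) that return traffic is addressed to
Note: The actual ephemeral range depends on the client. Many Linux kernels use 32768-61000, Windows Server 2008 and later uses 49152-65535, and NAT Gateways and Elastic Load Balancing use 1024-65535, so 1024-65535 is the safe superset.
NACL Rule Example (correct configuration for a subnet that both serves and initiates HTTPS):
Inbound:
Rule 100: TCP 443 ALLOW (0.0.0.0/0) <-- Requests from clients
Rule 110: TCP 1024-65535 ALLOW (0.0.0.0/0) <-- Responses to outbound requests (essential)
Rule *: All traffic DENY
Outbound:
Rule 100: TCP 443 ALLOW (0.0.0.0/0) <-- Outbound requests
Rule 110: TCP 1024-65535 ALLOW (0.0.0.0/0) <-- Responses to inbound clients
Rule *: All traffic DENY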
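To check whether a NACL port range actually covers the ephemeral ports a given host uses, a small comparison like the following can help. It is a sketch under assumptions: `range_covers` is a hypothetical helper, the `/proc` path exists only on Linux, and the fallback values are the common Linux default rather than something read from your system.

```shell
# Check whether a NACL port range fully covers a client's ephemeral range.
# range_covers <rule-from> <rule-to> <ephemeral-from> <ephemeral-to>
range_covers() {
  [ "$1" -le "$3" ] && [ "$2" -ge "$4" ]
}

# On Linux the local ephemeral range can be read from /proc (if present).
if [ -r /proc/sys/net/ipv4/ip_local_port_range ]; then
  read -r eph_from eph_to < /proc/sys/net/ipv4/ip_local_port_range
else
  eph_from=32768; eph_to=60999   # assumed fallback: common Linux default
fi

if range_covers 1024 65535 "$eph_from" "$eph_to"; then
  echo "NACL range 1024-65535 covers local ephemeral ports $eph_from-$eph_to"
else
  echo "NACL range 1024-65535 does NOT cover $eph_from-$eph_to"
fi
```

Running this on the client that initiates the connection shows which range its return traffic will be addressed to. A narrower NACL rule, such as 32768-61000, would fail the check for traffic that passes through a NAT Gateway.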
3. VPC Peering and Transit Gateway Issues
3.1 VPC Peering Troubleshooting
VPC Peering provides private network connectivity between two VPCs, but comes with several constraints.
VPC Peering Constraints:
x Transitive peering is NOT supported
VPC-A <-> VPC-B <-> VPC-C: A cannot reach C through B; A and C need their own peering connection
x Overlapping CIDRs are NOT allowed
x Inter-region peering does NOT support Security Group references; use the peer VPC's CIDR block in rules instead
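Because overlapping CIDRs block peering outright, it can be worth checking candidate VPC ranges before requesting a connection. The following is a minimal IPv4-only sketch; `cidrs_overlap` is a hypothetical helper name, and it uses plain awk arithmetic rather than any AWS API.

```shell
# cidrs_overlap <cidr-a> <cidr-b>: prints "overlap" or "disjoint" (IPv4 only).
cidrs_overlap() {
  awk -v a="$1" -v b="$2" '
    function ip2n(ip,   o) {
      split(ip, o, ".")
      return ((o[1] * 256 + o[2]) * 256 + o[3]) * 256 + o[4]
    }
    # First address of the block: round the IP down to a multiple of the block size.
    function lo(cidr,   p, size) {
      split(cidr, p, "/")
      size = 2 ^ (32 - p[2])
      return int(ip2n(p[1]) / size) * size
    }
    # Last address of the block.
    function hi(cidr,   p) {
      split(cidr, p, "/")
      return lo(cidr) + 2 ^ (32 - p[2]) - 1
    }
    BEGIN {
      if (lo(a) <= hi(b) && lo(b) <= hi(a)) print "overlap"
      else print "disjoint"
    }
  '
}

cidrs_overlap 10.0.0.0/16 10.1.0.0/16    # disjoint
cidrs_overlap 10.0.0.0/16 10.0.128.0/17  # overlap
```

Two ranges overlap exactly when each one starts no later than the other ends, which is the single comparison in the `BEGIN` block.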
# Check VPC Peering status
aws ec2 describe-vpc-peering-connections \
--filters "Name=status-code,Values=active,pending-acceptance" \
--query 'VpcPeeringConnections[*].{
PeeringId:VpcPeeringConnectionId,
Status:Status.Code,
Requester:RequesterVpcInfo.{VpcId:VpcId,CidrBlock:CidrBlock},
Accepter:AccepterVpcInfo.{VpcId:VpcId,CidrBlock:CidrBlock}
}'
# Check peering routes in routing tables (only routes whose target is a peering connection)
aws ec2 describe-route-tables \
--filters "Name=vpc-id,Values=vpc-0abc1234" \
--query 'RouteTables[*].Routes[?VpcPeeringConnectionId]'
Common Mistake: Missing Route Table Entries
Problem: Peering connection is Active but communication fails
Cause: One or both VPCs have no route to the peer's CIDR; an Active peering connection does not add routes automatically
Fix:
# Add VPC-B CIDR to VPC-A route table
aws ec2 create-route \
--route-table-id rtb-aaa111 \
--destination-cidr-block 10.1.0.0/16 \
--vpc-peering-connection-id pcx-0abc1234
# Add VPC-A CIDR to VPC-B route table
aws ec2 create-route \
--route-table-id rtb-bbb222 \
--destination-cidr-block 10.0.0.0/16 \
--vpc-peering-connection-id pcx-0abc1234
3.2 Transit Gateway Problem Resolution
Transit Gateway connects multiple VPCs and on-premises networks using a hub-and-spoke model.
# Check Transit Gateway route tables
aws ec2 describe-transit-gateway-route-tables \
--transit-gateway-route-table-ids tgw-rtb-0abc1234
# Check Transit Gateway attachment states
aws ec2 describe-transit-gateway-attachments \
--filters "Name=transit-gateway-id,Values=tgw-0abc1234" \
--query 'TransitGatewayAttachments[*].{
AttachmentId:TransitGatewayAttachmentId,
ResourceType:ResourceType,
State:State
}'
# Query routes in TGW route table
aws ec2 search-transit-gateway-routes \
--transit-gateway-route-table-id tgw-rtb-0abc1234 \
--filters "Name=type,Values=static,propagated"
TGW Route Propagation Issues:
Problem: Communication fails between VPCs connected via TGW
Checklist:
[ ] Is the TGW Attachment in "available" state?
[ ] Are routes propagated in the TGW Route Table?
[ ] Do each VPC's subnet route tables have TGW routes?
[ ] Is the TGW Route Table Association correct?
4. AWS/GCP/Azure-Specific Network Troubleshooting
4.1 AWS-Specific Network Issues
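# Use VPC Reachability Analyzer (it analyzes configuration and sends no packets)
aws ec2 create-network-insights-path \
--source i-0abc1234 \
--destination i-0def5678 \
--protocol TCP \
--destination-port 443
# Start the analysis with the NetworkInsightsPathId returned by the previous command
aws ec2 start-network-insights-analysis \
--network-insights-path-id nip-0abc1234
# Check results (use the NetworkInsightsAnalysisId returned by start-network-insights-analysis)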
aws ec2 describe-network-insights-analyses \
--network-insights-analysis-ids nia-0abc1234 \
--query 'NetworkInsightsAnalyses[*].{
Status:Status,
PathFound:NetworkPathFound,
Explanations:Explanations
}'
4.2 GCP Network Debugging
# Check GCP firewall rules
gcloud compute firewall-rules list \
--filter="network=my-vpc" \
--format="table(name,direction,priority,allowed,sourceRanges,targetTags)"
# Run Connectivity Test (GCP's Reachability Analyzer)
gcloud network-management connectivity-tests create my-test \
--source-instance=projects/my-project/zones/us-central1-a/instances/vm-1 \
--destination-instance=projects/my-project/zones/us-central1-b/instances/vm-2 \
--protocol=TCP \
--destination-port=8080
# Check results
gcloud network-management connectivity-tests describe my-test
4.3 Azure Network Diagnostics
# Check NSG rules
az network nsg rule list \
--nsg-name my-nsg \
--resource-group my-rg \
--output table
# Connectivity check with Network Watcher
az network watcher test-connectivity \
--resource-group my-rg \
--source-resource vm-source \
--dest-resource vm-dest \
--dest-port 443
# Check effective routes
az network nic show-effective-route-table \
--name my-nic \
--resource-group my-rg
5. VPC Flow Logs Analysis
5.1 Flow Logs Setup and Field Reference
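VPC Flow Logs are a primary data source for network troubleshooting: they record metadata for each flow (addresses, ports, protocol, and whether it was accepted or rejected), not packet contents.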
# Enable AWS VPC Flow Logs
aws ec2 create-flow-logs \
--resource-type VPC \
--resource-ids vpc-0abc1234 \
--traffic-type ALL \
--log-destination-type cloud-watch-logs \
--log-group-name /vpc/flow-logs \
--deliver-logs-permission-arn arn:aws:iam::123456789012:role/flowlogsRole \
--max-aggregation-interval 60
Flow Log Record Format:
<version> <account-id> <interface-id> <srcaddr> <dstaddr> <srcport> <dstport> <protocol> <packets> <bytes> <start> <end> <action> <log-status>
Example:
2 123456789012 eni-0abc1234 10.0.1.50 10.0.2.100 49152 443 6 20 4000 1625000000 1625000060 ACCEPT OK
2 123456789012 eni-0abc1234 203.0.113.5 10.0.1.50 0 0 1 4 336 1625000000 1625000060 REJECT OK
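The positional format is hard to read at a glance, so a small decoder can help when scanning raw records. The sketch below assumes the default version 2 format shown above (14 space-separated fields). `describe_flows` is a hypothetical helper, and only the three most common protocol numbers are named.

```shell
# Print default-format (version 2) flow log records in a readable form.
# Lines that are not 14-field version-2 records (e.g. a header) are skipped.
describe_flows() {
  awk '
    BEGIN { name[1] = "icmp"; name[6] = "tcp"; name[17] = "udp" }
    NF >= 14 && $1 == 2 {
      proto = ($8 in name) ? name[$8] : ("proto-" $8)
      printf "%s %s %s:%s -> %s:%s packets=%s bytes=%s\n", \
        $13, proto, $4, $6, $5, $7, $9, $10
    }
  '
}

describe_flows <<'EOF'
2 123456789012 eni-0abc1234 10.0.1.50 10.0.2.100 49152 443 6 20 4000 1625000000 1625000060 ACCEPT OK
2 123456789012 eni-0abc1234 203.0.113.5 10.0.1.50 0 0 1 4 336 1625000000 1625000060 REJECT OK
EOF
# ACCEPT tcp 10.0.1.50:49152 -> 10.0.2.100:443 packets=20 bytes=4000
# REJECT icmp 203.0.113.5:0 -> 10.0.1.50:0 packets=4 bytes=336
```

The second record is a rejected ICMP flow (protocol 1), which is why both ports are 0.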
5.2 Analysis with CloudWatch Logs Insights
# Run each query below separately (Logs Insights comments start with #)
# Top 10 source IPs with rejected traffic
fields @timestamp, srcAddr, dstAddr, dstPort, action
| filter action = "REJECT"
| stats count(*) as rejectedCount by srcAddr
| sort rejectedCount desc
| limit 10
# Traffic pattern analysis for a specific IP
fields @timestamp, srcAddr, dstAddr, srcPort, dstPort, protocol, action
| filter srcAddr = "10.0.1.50" or dstAddr = "10.0.1.50"
| sort @timestamp desc
| limit 100
# Detect possible port scanning (many distinct destination ports rejected from one source)
fields srcAddr, dstPort, action
| filter action = "REJECT"
| stats count_distinct(dstPort) as portCount by srcAddr
| filter portCount > 100
| sort portCount desc
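When the logs are already on disk (for example, downloaded from S3), the port-scan check can be approximated offline. This is a rough sketch for default-format records. `scan_suspects` is a hypothetical helper, and the threshold you pass is a judgment call rather than a recommended value.

```shell
# scan_suspects <min-distinct-ports>: flow log records on stdin.
# Prints "<source-ip> <distinct-rejected-destination-ports>" for each source
# that was rejected on at least <min-distinct-ports> different ports.
scan_suspects() {
  awk -v min="$1" '
    # Count each (source, destination port) pair only once.
    $13 == "REJECT" && !seen[$4, $7]++ { ports[$4]++ }
    END { for (s in ports) if (ports[s] >= min) print s, ports[s] }
  '
}

# Usage (flow-logs.txt is a placeholder file name):
#   scan_suspects 100 < flow-logs.txt
```

Counting distinct ports rather than raw records is the important part: a single client retrying one blocked port many times is noise, not a scan.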
5.3 Large-Scale Flow Log Analysis with Athena
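-- Create Athena table (`start` and `end` are reserved words and are quoted)
CREATE EXTERNAL TABLE IF NOT EXISTS vpc_flow_logs (
version int,
account_id string,
interface_id string,
srcaddr string,
dstaddr string,
srcport int,
dstport int,
protocol bigint,
packets bigint,
bytes bigint,
`start` bigint,
`end` bigint,
action string,
log_status string
)
PARTITIONED BY (dt string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
LOCATION 's3://my-flow-logs-bucket/AWSLogs/123456789012/vpcflowlogs/'
TBLPROPERTIES ("skip.header.line.count"="1");
-- Register the partition before querying it (the region segment of the path is an assumption; adjust it to where your logs are delivered)
ALTER TABLE vpc_flow_logs ADD IF NOT EXISTS
PARTITION (dt = '2026/03/08')
LOCATION 's3://my-flow-logs-bucket/AWSLogs/123456789012/vpcflowlogs/ap-northeast-2/2026/03/08/';
-- Rejected traffic by hour
SELECT
date_trunc('hour', from_unixtime("start")) AS event_hour,
srcaddr,
dstaddr,
dstport,
protocol,
sum(packets) AS total_packets,
sum(bytes) AS total_bytes
FROM vpc_flow_logs
WHERE action = 'REJECT'
AND dt = '2026/03/08'
GROUP BY 1, 2, 3, 4, 5
ORDER BY total_packets DESC
LIMIT 50;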
6. NAT Gateway and Internet Gateway Issues
6.1 Internet Gateway (IGW) Troubleshooting
Internet Access via IGW Checklist:
[ ] Is an IGW attached to the VPC?
[ ] Does the public subnet route table have a 0.0.0.0/0 -> IGW route?
[ ] Does the instance have a Public IP or Elastic IP assigned?
[ ] Does the Security Group outbound allow the traffic?
[ ] Do the NACL rules allow both inbound and outbound traffic?
# Check IGW attachment status
aws ec2 describe-internet-gateways \
--filters "Name=attachment.vpc-id,Values=vpc-0abc1234" \
--query 'InternetGateways[*].{IGWId:InternetGatewayId,State:Attachments[0].State}'
# Check public subnet routing
aws ec2 describe-route-tables \
--filters "Name=association.subnet-id,Values=subnet-pub123" \
--query 'RouteTables[*].Routes[?DestinationCidrBlock==`0.0.0.0/0`]'
6.2 NAT Gateway Problem Resolution
# Check NAT Gateway status
aws ec2 describe-nat-gateways \
--filter "Name=vpc-id,Values=vpc-0abc1234" \
--query 'NatGateways[*].{
NatGatewayId:NatGatewayId,
State:State,
SubnetId:SubnetId,
PublicIp:NatGatewayAddresses[0].PublicIp,
PrivateIp:NatGatewayAddresses[0].PrivateIp
}'
Common NAT Gateway Issues:
Issue 1: No internet access through NAT Gateway
Cause: A public NAT Gateway was placed in a private subnet
Fix: Place the NAT Gateway in a public subnet (one whose route table sends 0.0.0.0/0 to the IGW) and point the private subnets' default route at it
Issue 2: Intermittent connection failures
Cause: Source port exhaustion; a NAT Gateway supports up to 55,000 simultaneous connections to each unique destination
Check: CloudWatch Metrics > ErrorPortAllocation
Fix: Distribute traffic across multiple NAT Gateways, or use VPC Endpoints for the destination service
Issue 3: Unexpectedly high NAT Gateway costs
Cause: AWS service traffic (S3, DynamoDB) routing through NAT
Fix: Create Gateway-type VPC Endpoints (available for S3 and DynamoDB) so this traffic bypasses the NAT Gateway
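For Issue 2, it can help to find out which destination is attracting the connections. The sketch below counts established TCP connections per peer address from `ss -tn` style output. `busy_destinations` is a hypothetical helper, and since the NAT Gateway limit applies to the combined traffic of every client behind it, a single host's count is only a partial view.

```shell
# busy_destinations <threshold>: "ss -tn" style output on stdin.
# Prints "<count> <peer-address:port>" for peers with at least <threshold>
# established connections, busiest first.
busy_destinations() {
  awk -v limit="$1" '
    $1 == "ESTAB" { n[$5]++ }
    END { for (d in n) if (n[d] >= limit) print n[d], d }
  ' | sort -rn
}

# Usage on a Linux host behind the NAT Gateway (the threshold is arbitrary):
#   ss -tn | busy_destinations 1000
```

If one destination dominates and it is an AWS service, a VPC Endpoint removes that traffic from the NAT Gateway entirely. Otherwise, connection pooling or an additional NAT Gateway are the usual options.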
# Create VPC Endpoint (S3 example — reduces NAT costs)
aws ec2 create-vpc-endpoint \
--vpc-id vpc-0abc1234 \
--service-name com.amazonaws.ap-northeast-2.s3 \
--route-table-ids rtb-priv111 rtb-priv222
# Check NAT Gateway CloudWatch metrics
aws cloudwatch get-metric-statistics \
--namespace AWS/NATGateway \
--metric-name ErrorPortAllocation \
--dimensions Name=NatGatewayId,Value=nat-0abc1234 \
--start-time 2026-03-07T00:00:00Z \
--end-time 2026-03-08T00:00:00Z \
--period 3600 \
--statistics Sum
7. Cross-Region and Hybrid Cloud Connectivity
7.1 Cross-Region VPC Connectivity
Cross-Region Connection Options:
1. Inter-Region VPC Peering — 1:1 connection, simple setup
2. Transit Gateway Inter-Region Peering — Hub-spoke, large-scale
3. AWS Cloud WAN — Global network management (newest)
# Check Inter-Region TGW Peering status
aws ec2 describe-transit-gateway-peering-attachments \
--filters "Name=state,Values=available,pendingAcceptance" \
--query 'TransitGatewayPeeringAttachments[*].{
AttachmentId:TransitGatewayAttachmentId,
RequesterTgw:RequesterTgwInfo.TransitGatewayId,
AccepterTgw:AccepterTgwInfo.TransitGatewayId,
State:State
}'
7.2 Hybrid Cloud Connectivity (VPN & Direct Connect)
# Check Site-to-Site VPN tunnel status
aws ec2 describe-vpn-connections \
--vpn-connection-ids vpn-0abc1234 \
--query 'VpnConnections[*].{
VpnId:VpnConnectionId,
State:State,
Tunnel1Status:VgwTelemetry[0].Status,
Tunnel1StatusMessage:VgwTelemetry[0].StatusMessage,
Tunnel2Status:VgwTelemetry[1].Status,
Tunnel2StatusMessage:VgwTelemetry[1].StatusMessage
}'
VPN Tunnel Down Diagnosis:
When a VPN tunnel is DOWN, check the following:
[ ] Customer Gateway (CGW) configuration
- IKE version (v1 or v2)
- Pre-shared key match
- Phase 1/Phase 2 encryption parameter match
[ ] On-premises firewall allows UDP 500 (IKE), UDP 4500 (NAT-T), and IP protocol 50 (ESP)
[ ] NAT-Traversal configuration
[ ] BGP session status (when using dynamic routing)
[ ] Dead Peer Detection (DPD) settings match
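Before walking the checklist, it helps to know which tunnel is actually down and what status message AWS reports for it. The sketch below filters the telemetry rows; `down_tunnels` is a hypothetical helper, and the IP addresses and status messages in the sample are illustrative only.

```bash
# Hypothetical helper: read tab-separated "outside-ip, status, message" rows
# and print the tunnels whose status is not UP.
down_tunnels() {
  awk -F '\t' '$2 != "UP" { printf "%s is %s (%s)\n", $1, $2, $3 }'
}

# Sample rows (made up) standing in for the output of:
#   aws ec2 describe-vpn-connections --vpn-connection-ids vpn-0abc1234 \
#     --query 'VpnConnections[0].VgwTelemetry[].[OutsideIpAddress,Status,StatusMessage]' \
#     --output text
printf '203.0.113.10\tUP\t1 BGP ROUTES\n203.0.113.20\tDOWN\tIPSEC IS DOWN\n' | down_tunnels
```

An empty result means both tunnels report UP, so the problem is more likely in routing or BGP than in the IPsec negotiation.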
7.3 Multi-Cloud Network Connectivity
AWS <-> GCP Connection Options:
1. IPSec VPN (AWS VPN + GCP Cloud VPN)
2. Partner Interconnect
3. Third-party SD-WAN solutions
AWS <-> Azure Connection Options:
1. IPSec VPN (AWS VPN + Azure VPN Gateway)
2. ExpressRoute + Direct Connect (dedicated line)
3. Cloud Exchange via Megaport, etc.
# Check GCP VPN tunnel status
gcloud compute vpn-tunnels describe my-tunnel \
--region=us-central1 \
--format="value(status,detailedStatus)"
# Check Azure VPN connection status
az network vpn-connection show \
--name my-vpn-connection \
--resource-group my-rg \
--query '{Status:connectionStatus,EgressBytes:egressBytesTransferred,IngressBytes:ingressBytesTransferred}'
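Each provider reports a healthy tunnel with a different string (`UP` on AWS, `ESTABLISHED` on GCP, `Connected` on Azure), which makes multi-cloud health checks easy to get wrong. A small sketch of a normalizing function follows; `tunnel_is_up` is a hypothetical name, and the status values in the loop are sample inputs rather than live output.

```bash
# Hypothetical helper: succeed when a provider-specific status string means "tunnel is up".
tunnel_is_up() {
  case "$1:$2" in
    aws:UP|gcp:ESTABLISHED|azure:Connected) return 0 ;;
    *) return 1 ;;
  esac
}

# Sample provider:status pairs, as the three commands above might return them.
for pair in aws:UP gcp:FIRST_HANDSHAKE azure:Connected; do
  provider=${pair%%:*}
  status=${pair#*:}
  if tunnel_is_up "$provider" "$status"; then
    echo "$provider: up"
  else
    echo "$provider: NOT up ($status)"
  fi
done
```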
8. Cloud DNS Troubleshooting (Route 53, Cloud DNS)
8.1 AWS Route 53 Problem Resolution
# Check Route 53 hosted zone records
aws route53 list-resource-record-sets \
--hosted-zone-id Z0123456789ABCDEFGHIJ \
--query 'ResourceRecordSets[?Name==`api.example.com.`]'
# Check Route 53 Resolver query logs (VPC internal DNS)
aws route53resolver list-resolver-query-log-configs
# DNS resolution test
aws route53 test-dns-answer \
--hosted-zone-id Z0123456789ABCDEFGHIJ \
--record-name api.example.com \
--record-type A
Private Hosted Zone Issues:
Problem: Private DNS records are not resolving within the VPC
Checklist:
[ ] Is the private hosted zone associated with the VPC?
[ ] Is enableDnsSupport set to true for the VPC?
[ ] Is enableDnsHostnames set to true for the VPC?
[ ] Is the DHCP Options Set using AmazonProvidedDNS?
# Check VPC DNS settings
aws ec2 describe-vpc-attribute \
--vpc-id vpc-0abc1234 \
--attribute enableDnsSupport
aws ec2 describe-vpc-attribute \
--vpc-id vpc-0abc1234 \
--attribute enableDnsHostnames
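Both attributes need to be true for private hosted zone records to resolve, so the two checks above can be combined into one verdict. The following is a sketch under assumptions: `check_vpc_dns` is a hypothetical helper, and the arguments are sample values of the kind returned by adding `--query 'EnableDnsSupport.Value' --output text` (or `EnableDnsHostnames.Value`) to the commands above.

```bash
# Hypothetical helper: report which VPC DNS attributes still need to be enabled.
# Arguments: the EnableDnsSupport value, then the EnableDnsHostnames value.
check_vpc_dns() {
  support=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  hostnames=$(printf '%s' "$2" | tr '[:upper:]' '[:lower:]')
  result=ok
  if [ "$support" != "true" ]; then echo "enableDnsSupport is off"; result=fail; fi
  if [ "$hostnames" != "true" ]; then echo "enableDnsHostnames is off"; result=fail; fi
  [ "$result" = ok ]
}

# Sample values: DNS support enabled, DNS hostnames disabled.
check_vpc_dns True False || echo "private hosted zone records may not resolve"
```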
8.2 GCP Cloud DNS Debugging
# Check Cloud DNS records
gcloud dns record-sets list \
--zone=my-dns-zone \
--filter="name=api.example.com."
# Check VPCs associated with a private DNS zone
gcloud dns managed-zones describe my-private-zone \
--format="value(privateVisibilityConfig.networks)"
# Check DNS policies
gcloud dns policies list
8.3 Hybrid DNS Configuration Issues
DNS Resolution in Hybrid Environments:
- Route 53 Resolver inbound/outbound endpoints are required
- On-premises -> AWS: Use Inbound Endpoint
- AWS -> On-premises: Use Outbound Endpoint + Forwarding Rule
              +-----------------------------+
              |      Route 53 Resolver      |
On-prem   --> |  Inbound        Outbound    | --> On-prem DNS
DNS query     |  Endpoint       Endpoint    |     Forwarding
              +-----------------------------+
# Check Route 53 Resolver endpoint status
aws route53resolver list-resolver-endpoints \
--query 'ResolverEndpoints[*].{
Id:Id,
Name:Name,
Direction:Direction,
Status:Status,
IpAddressCount:IpAddressCount
}'
# Check forwarding rules
aws route53resolver list-resolver-rules \
--query 'ResolverRules[*].{
Id:Id,
DomainName:DomainName,
RuleType:RuleType,
Status:Status,
TargetIps:TargetIps
}'
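When several forwarding rules overlap, Resolver applies the one with the most specific domain name, which is a common source of "the wrong DNS server answered" reports. The sketch below mimics that selection locally so a rule list can be sanity-checked; `best_rule` is a hypothetical helper and the domain names are examples.

```bash
# Hypothetical helper: print the most specific rule domain that matches a query name.
# Usage: best_rule <query-name> <rule-domain>...
best_rule() {
  query=${1%.}
  shift
  best=""
  for domain in "$@"; do
    domain=${domain%.}
    case "$query" in
      "$domain"|*."$domain")
        # Keep the longest matching suffix seen so far.
        if [ "${#domain}" -gt "${#best}" ]; then best=$domain; fi
        ;;
    esac
  done
  [ -n "$best" ] && echo "$best"
}

# Example rule domains, as listed in the DomainName column above.
best_rule db.corp.example.com example.com corp.example.com other.net
```

If the function prints nothing, no forwarding rule matches and the query would be handled by the default resolver behavior instead.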
9. Practical Cloud Network Debugging Scenarios
Scenario 1: External API Call Timeout from ECS Service
Symptom: Connection timeout when calling external API from ECS Fargate task
Environment: Fargate task deployed in a private subnet
Debugging Steps:
1. Verify task network mode (awsvpc)
2. Check Security Group outbound rules on the task's ENI
3. Verify NAT Gateway route in subnet route table
4. Check NAT Gateway status and placement (public subnet)
5. Check NACL rules (outbound + ephemeral port inbound)
Root Cause: NAT Gateway was deleted but a blackhole route remained
Fix: Create a new NAT Gateway and update the route table
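A leftover route like the one in this scenario shows up with the state `blackhole` in the route table. A minimal sketch for spotting it follows; `blackhole_routes` is a hypothetical helper, and the rows are sample data shaped like the output of the query shown in the comment.

```bash
# Hypothetical helper: print routes whose state is "blackhole".
# Input rows: destination, NAT gateway ID, state (tab-separated).
blackhole_routes() {
  awk -F '\t' '$3 == "blackhole" { printf "%s -> %s\n", $1, $2 }'
}

# Sample rows (made up) standing in for the output of:
#   aws ec2 describe-route-tables --route-table-ids rtb-priv111 \
#     --query 'RouteTables[0].Routes[].[DestinationCidrBlock,NatGatewayId,State]' \
#     --output text
printf '10.0.0.0/16\tNone\tactive\n0.0.0.0/0\tnat-0abc1234\tblackhole\n' | blackhole_routes
```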
Scenario 2: RDS Inaccessible Across VPC Peering
Symptom: Application in VPC-A cannot connect to RDS in VPC-B on TCP 3306
Environment: VPC-A (10.0.0.0/16) <-> Peering <-> VPC-B (10.1.0.0/16)
Debugging Steps:
1. Check Peering status -> Active
2. VPC-A route table -> 10.1.0.0/16 -> pcx-xxx (OK)
3. VPC-B route table -> 10.0.0.0/16 -> pcx-xxx (OK)
4. RDS Security Group inbound check -> 10.0.0.0/16 on 3306 NOT allowed (FOUND IT!)
5. The RDS Security Group only allowed traffic from Security Groups in VPC-B
Fix: Add inbound rule to RDS Security Group allowing 3306 from VPC-A CIDR (10.0.0.0/16)
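The check in step 4 comes down to whether the client's address falls inside any CIDR the Security Group allows. The sketch below does that comparison with shell integer arithmetic; `ip_to_int`, `ip_in_cidr`, and the address 10.0.1.25 are hypothetical and only illustrate the scenario.

```bash
# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
  old_ifs=$IFS
  IFS=.
  set -- $1
  IFS=$old_ifs
  echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
}

# Succeed when the address (first argument) falls inside the CIDR block (second argument).
ip_in_cidr() {
  network=${2%/*}
  bits=${2#*/}
  mask=$(( (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF ))
  [ $(( $(ip_to_int "$1") & mask )) -eq $(( $(ip_to_int "$network") & mask )) ]
}

app_ip=10.0.1.25   # hypothetical application address in VPC-A
for allowed in 10.1.0.0/16 10.0.0.0/16; do
  if ip_in_cidr "$app_ip" "$allowed"; then
    echo "$app_ip matches $allowed"
  else
    echo "$app_ip does not match $allowed"
  fi
done
```

With only 10.1.0.0/16 allowed, the application's address never matches, which is consistent with the timeout seen in this scenario.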
Scenario 3: Lambda Function Cannot Access VPC Resources
Symptom: VPC-connected Lambda cannot access ElastiCache, and internet is also blocked
Environment: Lambda connected to private subnets
Debugging Steps:
1. Check Lambda execution role for VPC permissions
- ec2:CreateNetworkInterface
- ec2:DescribeNetworkInterfaces
- ec2:DeleteNetworkInterface
2. Verify assigned subnets (minimum 2 AZs recommended)
3. Check Lambda SG -> ElastiCache SG connectivity
4. Internet access requires: private subnet + NAT Gateway
Root Cause: Lambda Security Group outbound was restrictive and
ElastiCache port (6379) was not allowed
Fix: Add TCP 6379 outbound allow to Lambda SG
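A restrictive outbound rule set like this one can be confirmed by testing the egress rules against the port in question. The following is a simplified sketch: `egress_allows_port` is a hypothetical helper, it looks only at protocol and port range (not at the rule's destination), and the sample row represents a group that allows HTTPS only.

```bash
# Hypothetical helper: succeed when any egress rule covers the given TCP port.
# Input rows: protocol, from-port, to-port (tab-separated); protocol -1 means all traffic.
egress_allows_port() {
  awk -F '\t' -v port="$1" '
    $1 == "-1" { found = 1 }
    $1 == "tcp" && ($2 + 0) <= (port + 0) && (port + 0) <= ($3 + 0) { found = 1 }
    END { exit (found ? 0 : 1) }
  '
}

# Sample row (made up) standing in for the output of:
#   aws ec2 describe-security-groups --group-ids <lambda-sg-id> \
#     --query 'SecurityGroups[0].IpPermissionsEgress[].[IpProtocol,FromPort,ToPort]' \
#     --output text
printf 'tcp\t443\t443\n' | egress_allows_port 6379 || echo "TCP 6379 egress is not allowed"
```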
Scenario 4: Cross-Region Data Transfer Latency Spike
Symptom: Data replication latency spike between us-east-1 and ap-northeast-2
Environment: Using Inter-Region VPC Peering
Debugging Steps:
1. Check VPC Peering status -> Active
2. Review network bandwidth -> CloudWatch NetworkIn/NetworkOut metrics on the replicating instances
3. Check instance type network performance limits
4. Optimize TCP Window Scaling and buffer sizes
5. Consider AWS Global Accelerator or CloudFront
Optimizations:
- Use parallel streams for bulk transfers
- TCP window size tuning: sysctl -w net.core.rmem_max=16777216
- Leverage S3 Transfer Acceleration (for file transfers)
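Whether a 16 MiB buffer is enough depends on the bandwidth-delay product (BDP) of the path: bandwidth multiplied by round-trip time. The sketch below computes it; `bdp_bytes` is a hypothetical helper, and the 1000 Mbit/s bandwidth and 180 ms round-trip time are assumed figures, not measurements.

```bash
# Bandwidth-delay product in bytes: (Mbit/s * 1,000,000 / 8) * (RTT ms / 1000).
bdp_bytes() {
  echo $(( $1 * $2 * 125 ))
}

link_mbps=1000        # assumed usable bandwidth
rtt_ms=180            # assumed round-trip time between the two regions
buffer_max=16777216   # the rmem_max value from the tuning line above

needed=$(bdp_bytes "$link_mbps" "$rtt_ms")
echo "BDP: $needed bytes, rmem_max: $buffer_max bytes"
if [ "$needed" -gt "$buffer_max" ]; then
  echo "A single TCP stream likely cannot fill the link with this buffer size"
fi
```

When the BDP exceeds the buffer limit, parallel streams or a larger buffer are the usual ways to recover throughput.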
10. Troubleshooting Toolkit and Automation
10.1 Comprehensive Diagnostic Script
#!/bin/bash
# cloud-network-diag.sh: comprehensive cloud network diagnostic script
# Usage: ./cloud-network-diag.sh <target-ip> [port] [vpc-id]
TARGET_IP=$1
TARGET_PORT=${2:-443}
VPC_ID=${3:-"auto"}  # accepted for compatibility; not used by the checks below
if [ -z "$TARGET_IP" ]; then
  echo "Usage: $0 <target-ip> [port] [vpc-id]" >&2
  exit 1
fi
echo "=== Cloud Network Diagnostics Starting ==="
echo "Target: ${TARGET_IP}:${TARGET_PORT}"
echo ""
# 1. Basic connectivity check
echo "--- 1. Basic Connectivity Check ---"
ping -c 3 -W 2 "$TARGET_IP" 2>/dev/null
nc -zv -w 3 "$TARGET_IP" "$TARGET_PORT" 2>&1
# 2. Reverse DNS check (the target is an IP address, so look up its PTR record)
echo "--- 2. DNS Resolution Check ---"
PTR=$(dig +short -x "$TARGET_IP" 2>/dev/null)
echo "${PTR:-No PTR record found for $TARGET_IP}"
# 3. Route tracing
echo "--- 3. Route Tracing ---"
traceroute -m 15 -w 2 "$TARGET_IP" 2>/dev/null
# 4. AWS Security Group check (when running on an instance)
echo "--- 4. Metadata-Based Security Group Check ---"
# --max-time keeps the script from hanging when it runs outside EC2
TOKEN=$(curl -s --max-time 2 -X PUT "http://169.254.169.254/latest/api/token" \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 60" 2>/dev/null)
if [ -n "$TOKEN" ]; then
  # This metadata path returns Security Group names, not IDs
  SG_NAMES=$(curl -s --max-time 2 -H "X-aws-ec2-metadata-token: $TOKEN" \
    http://169.254.169.254/latest/meta-data/security-groups 2>/dev/null)
  echo "Attached Security Groups: $SG_NAMES"
else
  echo "Instance metadata not reachable; skipping Security Group check"
fi
echo "=== Diagnostics Complete ==="
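The way the port check fails is itself a clue: a timeout usually means packets were silently dropped (Security Group, NACL, or routing), while a refusal usually means the host answered but nothing is listening. The sketch below maps `nc` output to those cases; `classify_nc_result` is a hypothetical helper, and the exact message wording varies between `nc` implementations.

```bash
# Hypothetical helper: map the text printed by nc to a likely failure domain.
classify_nc_result() {
  case "$1" in
    *succeeded*|*Connected*) echo "reachable" ;;
    *refused*)               echo "host reachable, port closed: check the listener" ;;
    *"timed out"*|*timeout*) echo "packets dropped: check SG, NACL, and routes" ;;
    *)                       echo "unknown: inspect the raw output" ;;
  esac
}

# Sample messages in the style of OpenBSD nc.
classify_nc_result "nc: connect to 10.0.1.50 port 443 (tcp) failed: Connection refused"
classify_nc_result "nc: connect to 10.0.1.50 port 443 (tcp) timed out: Operation now in progress"
```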
10.2 CloudWatch Alarm Setup (Network Monitoring)
# NAT Gateway ErrorPortAllocation alarm
aws cloudwatch put-metric-alarm \
--alarm-name "NATGateway-PortAllocation-Error" \
--metric-name ErrorPortAllocation \
--namespace AWS/NATGateway \
--dimensions Name=NatGatewayId,Value=nat-0abc1234 \
--statistic Sum \
--period 300 \
--threshold 100 \
--comparison-operator GreaterThanThreshold \
--evaluation-periods 2 \
--alarm-actions arn:aws:sns:ap-northeast-2:123456789012:network-alerts
# VPN tunnel down alarm
aws cloudwatch put-metric-alarm \
--alarm-name "VPN-Tunnel-Down" \
--metric-name TunnelState \
--namespace AWS/VPN \
--dimensions Name=VpnId,Value=vpn-0abc1234 \
--statistic Maximum \
--period 300 \
--threshold 0.5 \
--comparison-operator LessThanThreshold \
--evaluation-periods 2 \
--alarm-actions arn:aws:sns:ap-northeast-2:123456789012:network-alerts
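With `--period 300` and `--evaluation-periods 2`, an alarm needs two consecutive breaching five-minute datapoints, so it reacts after roughly ten minutes at the earliest. The sketch below mirrors that rule for the `GreaterThanThreshold` case; `alarm_would_fire` is a hypothetical helper, it handles integer datapoints only, and it ignores CloudWatch's missing-data handling.

```bash
# Hypothetical helper: succeed when the last N datapoints all exceed the threshold.
# Usage: alarm_would_fire <threshold> <evaluation-periods> <datapoint>...
alarm_would_fire() {
  threshold=$1
  periods=$2
  shift 2
  # Drop older datapoints so that only the last N remain.
  while [ "$#" -gt "$periods" ]; do shift; done
  [ "$#" -eq "$periods" ] || return 1
  for value in "$@"; do
    [ "$value" -gt "$threshold" ] || return 1
  done
  return 0
}

# Sample datapoints (made up), oldest first: the last two both exceed 100.
alarm_would_fire 100 2 40 150 220 && echo "ALARM"
```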
Conclusion
Systematic methodology is the key to cloud network troubleshooting. Here are the main takeaways:
- Layered approach: Diagnose from upper to lower layers based on the OSI model
- Leverage logs: Always enable VPC Flow Logs and DNS Query Logs
- Use native tools: Actively use each cloud's Reachability Analyzer and Connectivity Tests
- Understand security rules: Clearly distinguish Stateful (SG) vs Stateless (NACL) behavior
- Optimize costs: Use VPC Endpoints to reduce NAT Gateway costs
- Monitor proactively: Build early detection with CloudWatch alarms
Network issues are often hard to reproduce, so the most important thing is to have monitoring and logging properly configured at all times.
Quiz
Q1: What is the main topic covered in "Complete Guide to Cloud Network Architecture Troubleshooting — Practical Debugging for AWS, GCP, and Azure"?
The guide covers VPC networking fundamentals, Security Group and NACL debugging, VPC Peering, Transit Gateway, Flow Logs analysis, NAT/IGW issues, cross-region connectivity, and DNS troubleshooting in cloud environments.
Q2: What are the key differences between Security Groups and NACLs when debugging?
Security Groups are stateful, apply at the instance (ENI) level, accept allow rules only, and evaluate all rules together. NACLs are stateless, apply at the subnet level, support both allow and deny rules, and evaluate rules in rule-number order with the first match applied. Confusing the two leads to wasted troubleshooting time.
Q3: Explain the core concept behind VPC Peering and Transit Gateway issues.
VPC Peering provides private network connectivity between two VPCs, but comes with several constraints.
Q4: What approach is recommended for AWS/GCP/Azure-specific network troubleshooting?
The guide treats each provider separately: AWS-specific network issues, GCP network debugging, and Azure network diagnostics.
Q5: How does VPC Flow Logs analysis work?
VPC Flow Logs are the essential data source for network troubleshooting. The guide covers Flow Logs setup and field reference, analysis with CloudWatch Logs Insights, and large-scale Flow Log analysis with Athena.