Complete Guide to Cloud Network Architecture Troubleshooting — Practical Debugging for AWS, GCP, and Azure

1. VPC Networking Fundamentals and Troubleshooting Overview

Network issues in cloud environments are among the most frequent problems and among the hardest to diagnose. Unlike on-premises setups, where you can physically access the hardware, cloud troubleshooting depends on a systematic approach built on the tools and logs each vendor provides.

1.1 VPC (Virtual Private Cloud) Basic Structure

A VPC is a logically isolated network space within the cloud. While terminology varies across cloud vendors, the core concepts remain the same.

Concept             | AWS            | GCP               | Azure
Virtual Network     | VPC            | VPC               | VNet
Subnet              | Subnet         | Subnet            | Subnet
Routing Table       | Route Table    | Routes            | Route Table
Firewall (Instance) | Security Group | Firewall Rules    | NSG
Firewall (Subnet)   | NACL           | Firewall Policies | NSG (Subnet-attached)

1.2 Systematic Troubleshooting Approach

When diagnosing network issues, working top-down through the OSI model, from the application layer to the underlying infrastructure, is usually the most effective order.

Troubleshooting Order:
1. Application Layer (L7): DNS resolution, HTTP response codes
2. Transport Layer (L4): Port connectivity, TCP handshake
3. Network Layer (L3): IP routing, subnet CIDR
4. Security Layer: Security Groups, NACLs, firewall rules
5. Infrastructure Layer: IGW, NAT GW, VPN, Peering status

1.3 Basic Connectivity Test Commands

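# Basic connectivity test (ICMP must be allowed by the security rules, so a failed ping alone proves little)
ping -c 4 10.0.1.50

# Specific port connectivity check (options go before the host and port)
nc -zv -w 5 10.0.1.50 443

# TCP route tracing to a specific port (-T sends TCP SYN probes and usually needs root)
sudo traceroute -T -p 443 10.0.1.50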

# DNS lookup test
dig +trace example.com
nslookup example.com 10.0.0.2

# HTTP-level debugging with curl
curl -vvv --connect-timeout 5 https://api.internal.example.com/health
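
The exit code of the curl probe above already hints at which layer from section 1.2 to inspect first. The following is a minimal sketch of that mapping, assuming curl's documented exit codes (6, 7, 28, 35, 60); the function name `classify_curl_exit` and the wording of the hints are illustrative, not part of any tool.

```shell
#!/usr/bin/env bash
# Map a curl exit code to the layer worth checking first.
# The function name and hint texts are illustrative assumptions.
classify_curl_exit() {
  case "$1" in
    0)     echo "ok: request completed, inspect the HTTP status code (L7)" ;;
    6)     echo "dns: name did not resolve, check resolver and hosted zones (L7)" ;;
    7)     echo "connect: connection failed, check the listener and port (L4)" ;;
    28)    echo "timeout: packets likely dropped, check SG, NACL, and routes (L3/security)" ;;
    35|60) echo "tls: TCP works, check certificates and TLS settings (L7)" ;;
    *)     echo "other: see the EXIT CODES section of the curl man page" ;;
  esac
}

# Typical use after a real probe:
#   curl -sS -o /dev/null --connect-timeout 5 https://api.internal.example.com/health
#   classify_curl_exit "$?"
classify_curl_exit 28
```

A timeout (28) and a failed connect (7) point in different directions: silent drops usually mean a security rule or route, while an immediate failure usually means the packet arrived and nothing accepted it.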

2. Security Groups vs NACLs Debugging

2.1 Key Differences Between Security Groups and NACLs

Confusing these two security mechanisms is a common source of wasted troubleshooting time.

Security Group (Stateful):
  - Inbound allow automatically permits corresponding response traffic
  - Applied at the instance (ENI) level
  - Allow rules only (implicit deny)
  - Rule evaluation: All rules are evaluated to determine access

NACL (Stateless):
  - Inbound/outbound rules must be configured separately
  - Applied at the subnet level
  - Both allow and deny rules supported
  - Rule evaluation: Processed in rule number order (first match wins)
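
The "first match wins" behavior of NACLs can be simulated locally. This is a rough sketch, not an AWS tool: the rule line format (`NUMBER ACTION PORT_FROM PORT_TO`) and the function name `nacl_eval` are assumptions made for illustration, and protocol and CIDR matching are left out.

```shell
#!/usr/bin/env bash
# Evaluate a destination port against NACL-style rules:
# lowest rule number first, first match wins, implicit deny otherwise.
# Assumed rule line format: NUMBER ACTION PORT_FROM PORT_TO
nacl_eval() {
  local port="$1"
  sort -n | awk -v port="$port" '
    (port + 0) >= ($3 + 0) && (port + 0) <= ($4 + 0) {
      print $2 " (rule " $1 ")"; found = 1; exit
    }
    END { if (!found) print "DENY (rule *)" }
  '
}

rules='200 ALLOW 443 443
100 DENY 443 443
110 ALLOW 1024 65535'

printf '%s\n' "$rules" | nacl_eval 443     # rule 100 is reached before rule 200
printf '%s\n' "$rules" | nacl_eval 50000
printf '%s\n' "$rules" | nacl_eval 22
```

With these sample rules, port 443 is denied by rule 100 even though rule 200 would allow it. A Security Group has no equivalent ordering problem because it only contains allow rules.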

2.2 Common Security Group Issues and Fixes

# Check Security Group rules with AWS CLI
aws ec2 describe-security-groups \
  --group-ids sg-0abc1234def56789 \
  --query 'SecurityGroups[*].{
    GroupId:GroupId,
    Ingress:IpPermissions,
    Egress:IpPermissionsEgress
  }' \
  --output json

# Check Security Groups attached to a specific instance
aws ec2 describe-instances \
  --instance-ids i-0abcdef1234567890 \
  --query 'Reservations[*].Instances[*].SecurityGroups' \
  --output table

Case 1: Missing Outbound Rules

Problem: EC2 fails to call external APIs
Cause: Default outbound rule (0.0.0.0/0 allow) was deleted from the Security Group

Fix:
aws ec2 authorize-security-group-egress \
  --group-id sg-0abc1234def56789 \
  --protocol tcp --port 443 \
  --cidr 0.0.0.0/0

Case 2: Missing Self-Referencing Rules

Problem: Instances within the same Security Group cannot communicate
Cause: The SG lacks a rule allowing itself as a source

Fix:
aws ec2 authorize-security-group-ingress \
  --group-id sg-0abc1234def56789 \
  --protocol -1 \
  --source-group sg-0abc1234def56789

2.3 NACL Debugging Patterns

# Check NACL rules
aws ec2 describe-network-acls \
  --network-acl-ids acl-0abc1234 \
  --query 'NetworkAcls[*].Entries' \
  --output json

Ephemeral Port Issues — The Most Common NACL Problem:

Problem: Responses to outbound HTTPS requests never arrive
Cause: No inbound NACL rule allows the ephemeral port range (1024-65535) that the return traffic is addressed to

NACL Rule Example (correct configuration):
Inbound:
  Rule 100: TCP 443    ALLOW (0.0.0.0/0)
  Rule 110: TCP 1024-65535 ALLOW (0.0.0.0/0)  <-- This is essential
  Rule *:   All traffic DENY

Outbound:
  Rule 100: TCP 443 ALLOW (0.0.0.0/0)
  Rule 110: TCP 1024-65535 ALLOW (0.0.0.0/0)  <-- Response traffic
  Rule *:   All traffic DENY
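
Whether a narrower allow range is safe depends on the ephemeral ports the client actually uses. The sketch below checks that a port range fits inside an allowed range; the function name `ephemeral_covered` is hypothetical, and `32768 60999` is used as sample input because it is the common Linux default (the live value is in `/proc/sys/net/ipv4/ip_local_port_range`).

```shell
#!/usr/bin/env bash
# Check whether a host's ephemeral port range fits inside a NACL allow range.
# Usage: ephemeral_covered "LOW HIGH" ALLOW_FROM ALLOW_TO
ephemeral_covered() {
  local low high
  read -r low high <<< "$1"
  if [ "$low" -ge "$2" ] && [ "$high" -le "$3" ]; then
    echo "covered"
  else
    echo "not covered: $low-$high falls outside $2-$3"
  fi
}

# Sample input: the common Linux default range.
# On a real host: ephemeral_covered "$(cat /proc/sys/net/ipv4/ip_local_port_range)" 1024 65535
ephemeral_covered "32768 60999" 1024 65535
ephemeral_covered "32768 60999" 49152 65535
```

The second call shows why 1024-65535 is the usual choice: a rule written for a 49152-65535 range would drop return traffic for a client using the Linux default.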

3. VPC Peering and Transit Gateway Issues

3.1 VPC Peering Troubleshooting

VPC Peering provides private network connectivity between two VPCs, but comes with several constraints.

VPC Peering Constraints:
  x Transitive peering is NOT supported
    VPC-A <-> VPC-B <-> VPC-C: A cannot reach C directly
  x Overlapping CIDRs are NOT allowed
  x Inter-region peering does not support Security Group references; use the peer VPC's CIDR block in rules instead

# Check VPC Peering status
aws ec2 describe-vpc-peering-connections \
  --filters "Name=status-code,Values=active,pending-acceptance" \
  --query 'VpcPeeringConnections[*].{
    PeeringId:VpcPeeringConnectionId,
    Status:Status.Code,
    Requester:RequesterVpcInfo.{VpcId:VpcId,CidrBlock:CidrBlock},
    Accepter:AccepterVpcInfo.{VpcId:VpcId,CidrBlock:CidrBlock}
  }'

# Check peering routes in routing tables
aws ec2 describe-route-tables \
  --filters "Name=vpc-id,Values=vpc-0abc1234" \
  --query 'RouteTables[*].Routes[?GatewayId!=`local`]'

Common Mistake: Missing Route Table Entries

Problem: Peering connection is Active but communication fails
Cause: Peering routes were not added to both VPCs' route tables

Fix:
# Add VPC-B CIDR to VPC-A route table
aws ec2 create-route \
  --route-table-id rtb-aaa111 \
  --destination-cidr-block 10.1.0.0/16 \
  --vpc-peering-connection-id pcx-0abc1234

# Add VPC-A CIDR to VPC-B route table
aws ec2 create-route \
  --route-table-id rtb-bbb222 \
  --destination-cidr-block 10.0.0.0/16 \
  --vpc-peering-connection-id pcx-0abc1234
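
Because peering is rejected when CIDRs overlap, it can help to test candidate ranges before requesting a connection. The following is a minimal IPv4-only sketch in plain bash; the helper names (`ip_to_int`, `cidr_range`, `cidrs_overlap`) are made up for this example, and input validation is omitted.

```shell
#!/usr/bin/env bash
# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
  local a b c d
  IFS=. read -r a b c d <<< "$1"
  echo $(( (a << 24) | (b << 16) | (c << 8) | d ))
}

# Print "first last" (as integers) for an IPv4 CIDR block.
cidr_range() {
  local ip="${1%/*}" prefix="${1#*/}"
  local base mask
  base=$(ip_to_int "$ip")
  mask=$(( (0xFFFFFFFF << (32 - prefix)) & 0xFFFFFFFF ))
  echo "$(( base & mask )) $(( (base & mask) | (~mask & 0xFFFFFFFF) ))"
}

# Two blocks overlap when each one starts before the other ends.
cidrs_overlap() {
  local a_first a_last b_first b_last
  read -r a_first a_last <<< "$(cidr_range "$1")"
  read -r b_first b_last <<< "$(cidr_range "$2")"
  if [ "$a_first" -le "$b_last" ] && [ "$b_first" -le "$a_last" ]; then
    echo "overlap"
  else
    echo "no overlap"
  fi
}

cidrs_overlap 10.0.0.0/16 10.1.0.0/16      # the two VPCs used in this section
cidrs_overlap 10.0.0.0/16 10.0.128.0/20
```

The first pair matches the VPC-A and VPC-B ranges used above and can be peered; the second pair cannot, because the /20 sits inside the /16.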

3.2 Transit Gateway Problem Resolution

Transit Gateway connects multiple VPCs and on-premises networks using a hub-and-spoke model.

# Check Transit Gateway route tables
aws ec2 describe-transit-gateway-route-tables \
  --transit-gateway-route-table-ids tgw-rtb-0abc1234

# Check Transit Gateway attachment states
aws ec2 describe-transit-gateway-attachments \
  --filters "Name=transit-gateway-id,Values=tgw-0abc1234" \
  --query 'TransitGatewayAttachments[*].{
    AttachmentId:TransitGatewayAttachmentId,
    ResourceType:ResourceType,
    State:State
  }'

# Query routes in TGW route table
aws ec2 search-transit-gateway-routes \
  --transit-gateway-route-table-id tgw-rtb-0abc1234 \
  --filters "Name=type,Values=static,propagated"

TGW Route Propagation Issues:

Problem: Communication fails between VPCs connected via TGW
Checklist:
  [ ] Is the TGW Attachment in "available" state?
  [ ] Are routes propagated in the TGW Route Table?
  [ ] Do each VPC's subnet route tables have TGW routes?
  [ ] Is the TGW Route Table Association correct?

4. AWS/GCP/Azure-Specific Network Troubleshooting

4.1 AWS-Specific Network Issues

# Use VPC Reachability Analyzer (powerful diagnostic tool)
aws ec2 create-network-insights-path \
  --source i-0abc1234 \
  --destination i-0def5678 \
  --protocol tcp \
  --destination-port 443

aws ec2 start-network-insights-analysis \
  --network-insights-path-id nip-0abc1234

# Check results
aws ec2 describe-network-insights-analyses \
  --network-insights-analysis-ids nia-0abc1234 \
  --query 'NetworkInsightsAnalyses[*].{
    Status:Status,
    PathFound:NetworkPathFound,
    Explanations:Explanations
  }'

4.2 GCP Network Debugging

# Check GCP firewall rules
gcloud compute firewall-rules list \
  --filter="network=my-vpc" \
  --format="table(name,direction,priority,allowed,sourceRanges,targetTags)"

# Run Connectivity Test (GCP's Reachability Analyzer)
gcloud network-management connectivity-tests create my-test \
  --source-instance=projects/my-project/zones/us-central1-a/instances/vm-1 \
  --destination-instance=projects/my-project/zones/us-central1-b/instances/vm-2 \
  --protocol=TCP \
  --destination-port=8080

# Check results
gcloud network-management connectivity-tests describe my-test

4.3 Azure Network Diagnostics

# Check NSG rules
az network nsg rule list \
  --nsg-name my-nsg \
  --resource-group my-rg \
  --output table

# Connectivity check with Network Watcher
az network watcher test-connectivity \
  --resource-group my-rg \
  --source-resource vm-source \
  --dest-resource vm-dest \
  --dest-port 443

# Check effective routes
az network nic show-effective-route-table \
  --name my-nic \
  --resource-group my-rg

5. VPC Flow Logs Analysis

5.1 Flow Logs Setup and Field Reference

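VPC Flow Logs are an essential data source for network troubleshooting. They record connection metadata (addresses, ports, protocol, and the accept/reject decision), not packet contents.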

# Enable AWS VPC Flow Logs
aws ec2 create-flow-logs \
  --resource-type VPC \
  --resource-ids vpc-0abc1234 \
  --traffic-type ALL \
  --log-destination-type cloud-watch-logs \
  --log-group-name /vpc/flow-logs \
  --deliver-logs-permission-arn arn:aws:iam::123456789012:role/flowlogsRole \
  --max-aggregation-interval 60

Flow Log Record Format:
<version> <account-id> <interface-id> <srcaddr> <dstaddr> <srcport> <dstport> <protocol> <packets> <bytes> <start> <end> <action> <log-status>

Example:
2 123456789012 eni-0abc1234 10.0.1.50 10.0.2.100 49152 443 6 20 4000 1625000000 1625000060 ACCEPT OK
2 123456789012 eni-0abc1234 203.0.113.5 10.0.1.50 0 0 1 4 336 1625000000 1625000060 REJECT OK
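
For a quick look at a handful of records without CloudWatch or Athena, the default format can be parsed with awk. This is a small sketch assuming the default version 2 field order shown above (field 4 is `srcaddr`, field 13 is `action`); the function name `rejects_by_source` is illustrative, and the last two input records are made-up sample data added to the two examples above.

```shell
#!/usr/bin/env bash
# Count REJECT records per source address in default-format flow logs.
# Field positions follow the record format: $4 = srcaddr, $13 = action.
rejects_by_source() {
  awk '$13 == "REJECT" { count[$4]++ }
       END { for (src in count) print count[src], src }' | sort -rn
}

# The last two records are hypothetical sample data.
rejects_by_source <<'EOF'
2 123456789012 eni-0abc1234 10.0.1.50 10.0.2.100 49152 443 6 20 4000 1625000000 1625000060 ACCEPT OK
2 123456789012 eni-0abc1234 203.0.113.5 10.0.1.50 0 0 1 4 336 1625000000 1625000060 REJECT OK
2 123456789012 eni-0abc1234 203.0.113.5 10.0.1.50 51000 22 6 1 40 1625000060 1625000120 REJECT OK
2 123456789012 eni-0abc1234 198.51.100.7 10.0.1.50 40000 3389 6 1 40 1625000060 1625000120 REJECT OK
EOF
```

Custom flow log formats reorder or add fields, so the field numbers would need adjusting if the log was not created with the default format.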

5.2 Analysis with CloudWatch Logs Insights

# Top 10 source IPs with rejected traffic
fields @timestamp, srcAddr, dstAddr, dstPort, action
| filter action = "REJECT"
| stats count(*) as rejectedCount by srcAddr
| sort rejectedCount desc
| limit 10

# Traffic pattern analysis for a specific IP
fields @timestamp, srcAddr, dstAddr, srcPort, dstPort, protocol, action
| filter srcAddr = "10.0.1.50" or dstAddr = "10.0.1.50"
| sort @timestamp desc
| limit 100

# Detect port scanning: sources rejected on many distinct destination ports
fields srcAddr, dstPort, action
| filter action = "REJECT"
| stats count_distinct(dstPort) as portCount by srcAddr
| filter portCount > 100
| sort portCount desc

5.3 Large-Scale Flow Log Analysis with Athena

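-- Create Athena table (`end` is a reserved word; `start` is escaped alongside it to be safe)
CREATE EXTERNAL TABLE IF NOT EXISTS vpc_flow_logs (
  version int,
  account_id string,
  interface_id string,
  srcaddr string,
  dstaddr string,
  srcport int,
  dstport int,
  protocol bigint,
  packets bigint,
  bytes bigint,
  `start` bigint,
  `end` bigint,
  action string,
  log_status string
)
PARTITIONED BY (dt string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
LOCATION 's3://my-flow-logs-bucket/AWSLogs/123456789012/vpcflowlogs/'
TBLPROPERTIES ("skip.header.line.count"="1");

-- Register each partition before querying it (region and date in the path are examples)
ALTER TABLE vpc_flow_logs
ADD IF NOT EXISTS PARTITION (dt = '2026/03/08')
LOCATION 's3://my-flow-logs-bucket/AWSLogs/123456789012/vpcflowlogs/ap-northeast-2/2026/03/08/';

-- Rejected traffic per hour (column names that clash with keywords are double-quoted in queries)
SELECT
  date_trunc('hour', from_unixtime("start")) AS event_hour,
  srcaddr,
  dstaddr,
  dstport,
  protocol,
  sum(packets) AS total_packets,
  sum(bytes) AS total_bytes
FROM vpc_flow_logs
WHERE action = 'REJECT'
  AND dt = '2026/03/08'
GROUP BY 1, 2, 3, 4, 5
ORDER BY total_packets DESC
LIMIT 50;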

6. NAT Gateway and Internet Gateway Issues

6.1 Internet Gateway (IGW) Troubleshooting

Internet Access via IGW Checklist:
  [ ] Is an IGW attached to the VPC?
  [ ] Does the public subnet route table have a 0.0.0.0/0 -> IGW route?
  [ ] Does the instance have a Public IP or Elastic IP assigned?
  [ ] Does the Security Group outbound allow the traffic?
  [ ] Do the NACL rules allow both inbound and outbound traffic?

# Check IGW attachment status
aws ec2 describe-internet-gateways \
  --filters "Name=attachment.vpc-id,Values=vpc-0abc1234" \
  --query 'InternetGateways[*].{IGWId:InternetGatewayId,State:Attachments[0].State}'

# Check public subnet routing
aws ec2 describe-route-tables \
  --filters "Name=association.subnet-id,Values=subnet-pub123" \
  --query 'RouteTables[*].Routes[?DestinationCidrBlock==`0.0.0.0/0`]'

6.2 NAT Gateway Problem Resolution

# Check NAT Gateway status
aws ec2 describe-nat-gateways \
  --filter "Name=vpc-id,Values=vpc-0abc1234" \
  --query 'NatGateways[*].{
    NatGatewayId:NatGatewayId,
    State:State,
    SubnetId:SubnetId,
    PublicIp:NatGatewayAddresses[0].PublicIp,
    PrivateIp:NatGatewayAddresses[0].PrivateIp
  }'

Common NAT Gateway Issues:

Issue 1: No internet access through NAT Gateway
  Cause: NAT Gateway was placed in a private subnet
  Fix: NAT Gateway must be located in a public subnet

Issue 2: Intermittent connection drops
  Cause: Port allocation exhausted (a NAT Gateway supports up to 55,000 simultaneous connections to each unique destination)
  Check: CloudWatch Metrics > ErrorPortAllocation
  Fix: Distribute traffic across multiple NAT Gateways, or use VPC Endpoints for AWS service traffic

Issue 3: NAT Gateway cost explosion
  Cause: AWS service traffic (S3, DynamoDB) routing through NAT
  Fix: Create VPC Endpoints (Gateway type) for private routing

# Create VPC Endpoint (S3 example, reduces NAT costs)
aws ec2 create-vpc-endpoint \
  --vpc-id vpc-0abc1234 \
  --service-name com.amazonaws.ap-northeast-2.s3 \
  --route-table-ids rtb-priv111 rtb-priv222

# Check NAT Gateway CloudWatch metrics
aws cloudwatch get-metric-statistics \
  --namespace AWS/NATGateway \
  --metric-name ErrorPortAllocation \
  --dimensions Name=NatGatewayId,Value=nat-0abc1234 \
  --start-time 2026-03-07T00:00:00Z \
  --end-time 2026-03-08T00:00:00Z \
  --period 3600 \
  --statistics Sum

7. Cross-Region and Hybrid Cloud Connectivity

7.1 Cross-Region VPC Connectivity

Cross-Region Connection Options:
  1. Inter-Region VPC Peering: 1:1 connection, simple setup
  2. Transit Gateway Inter-Region Peering: Hub-spoke, large-scale
  3. AWS Cloud WAN: Global network management (newest)

# Check Inter-Region TGW Peering status
aws ec2 describe-transit-gateway-peering-attachments \
  --filters "Name=state,Values=available,pendingAcceptance" \
  --query 'TransitGatewayPeeringAttachments[*].{
    AttachmentId:TransitGatewayAttachmentId,
    RequesterTgw:RequesterTgwInfo.TransitGatewayId,
    AccepterTgw:AccepterTgwInfo.TransitGatewayId,
    State:State
  }'

7.2 Hybrid Cloud Connectivity (VPN & Direct Connect)

# Check Site-to-Site VPN tunnel status
aws ec2 describe-vpn-connections \
  --vpn-connection-ids vpn-0abc1234 \
  --query 'VpnConnections[*].{
    VpnId:VpnConnectionId,
    State:State,
    Tunnel1Status:VgwTelemetry[0].Status,
    Tunnel1StatusMessage:VgwTelemetry[0].StatusMessage,
    Tunnel2Status:VgwTelemetry[1].Status,
    Tunnel2StatusMessage:VgwTelemetry[1].StatusMessage
  }'

VPN Tunnel Down Diagnosis:

When a VPN tunnel is DOWN, check the following:
  [ ] Customer Gateway (CGW) configuration
    - IKE version (v1 or v2)
    - Pre-shared key match
    - Phase 1/Phase 2 encryption parameter match
  [ ] On-premises firewall allows UDP 500 (IKE), UDP 4500 (NAT-T), and IP protocol 50 (ESP)
  [ ] NAT-Traversal configuration
  [ ] BGP session status (when using dynamic routing)
  [ ] Dead Peer Detection (DPD) settings match

7.3 Multi-Cloud Network Connectivity

AWS <-> GCP Connection Options:
  1. IPSec VPN (AWS VPN + GCP Cloud VPN)
  2. Partner Interconnect
  3. Third-party SD-WAN solutions

AWS <-> Azure Connection Options:
  1. IPSec VPN (AWS VPN + Azure VPN Gateway)
  2. ExpressRoute + Direct Connect (dedicated line)
  3. Cloud Exchange via Megaport, etc.

# Check GCP VPN tunnel status
gcloud compute vpn-tunnels describe my-tunnel \
  --region=us-central1 \
  --format="value(status,detailedStatus)"

# Check Azure VPN connection status
az network vpn-connection show \
  --name my-vpn-connection \
  --resource-group my-rg \
  --query '{Status:connectionStatus,EgressBytes:egressBytesTransferred,IngressBytes:ingressBytesTransferred}'

8. Cloud DNS Troubleshooting (Route 53, Cloud DNS)

8.1 AWS Route 53 Problem Resolution

# Check Route 53 hosted zone records
aws route53 list-resource-record-sets \
  --hosted-zone-id Z0123456789ABCDEFGHIJ \
  --query 'ResourceRecordSets[?Name==`api.example.com.`]'

# Check Route 53 Resolver query logs (VPC internal DNS)
aws route53resolver list-resolver-query-log-configs

# DNS resolution test
aws route53 test-dns-answer \
  --hosted-zone-id Z0123456789ABCDEFGHIJ \
  --record-name api.example.com \
  --record-type A

Private Hosted Zone Issues:

Problem: Private DNS records are not resolving within the VPC
Checklist:
  [ ] Is the private hosted zone associated with the VPC?
  [ ] Is enableDnsSupport set to true for the VPC?
  [ ] Is enableDnsHostnames set to true for the VPC?
  [ ] Is the DHCP Options Set using AmazonProvidedDNS?

# Check VPC DNS settings
aws ec2 describe-vpc-attribute \
  --vpc-id vpc-0abc1234 \
  --attribute enableDnsSupport

aws ec2 describe-vpc-attribute \
  --vpc-id vpc-0abc1234 \
  --attribute enableDnsHostnames

8.2 GCP Cloud DNS Debugging

# Check Cloud DNS records
gcloud dns record-sets list \
  --zone=my-dns-zone \
  --filter="name=api.example.com."

# Check VPCs associated with a private DNS zone
gcloud dns managed-zones describe my-private-zone \
  --format="value(privateVisibilityConfig.networks)"

# Check DNS policies
gcloud dns policies list

8.3 Hybrid DNS Configuration Issues

DNS Resolution in Hybrid Environments:
  - Route 53 Resolver inbound/outbound endpoints are required
  - On-premises -> AWS: Use Inbound Endpoint
  - AWS -> On-premises: Use Outbound Endpoint + Forwarding Rule

                 +-----------------------------+
                 |      Route 53 Resolver      |
  On-prem   -->  |  Inbound        Outbound    |  -->  On-prem DNS
  DNS query      |  Endpoint       Endpoint    |       (forwarding rule)
                 +-----------------------------+

# Check Route 53 Resolver endpoint status
aws route53resolver list-resolver-endpoints \
  --query 'ResolverEndpoints[*].{
    Id:Id,
    Name:Name,
    Direction:Direction,
    Status:Status,
    IpAddressCount:IpAddressCount
  }'

# Check forwarding rules
aws route53resolver list-resolver-rules \
  --query 'ResolverRules[*].{
    Id:Id,
    DomainName:DomainName,
    RuleType:RuleType,
    Status:Status,
    TargetIps:TargetIps
  }'

9. Practical Cloud Network Debugging Scenarios

Scenario 1: External API Call Timeout from ECS Service

Symptom: Connection timeout when calling external API from ECS Fargate task
Environment: Fargate task deployed in a private subnet

Debugging Steps:
  1. Verify task network mode (awsvpc)
  2. Check Security Group outbound rules on the task's ENI
  3. Verify NAT Gateway route in subnet route table
  4. Check NAT Gateway status and placement (public subnet)
  5. Check NACL rules (outbound + ephemeral port inbound)

Root Cause: NAT Gateway was deleted but a blackhole route remained
Fix: Create a new NAT Gateway and update the route table

Scenario 2: RDS Inaccessible Across VPC Peering

Symptom: Application in VPC-A cannot connect to RDS in VPC-B on TCP 3306
Environment: VPC-A (10.0.0.0/16) <-> Peering <-> VPC-B (10.1.0.0/16)

Debugging Steps:
  1. Check Peering status -> Active
  2. VPC-A route table -> 10.1.0.0/16 -> pcx-xxx (OK)
  3. VPC-B route table -> 10.0.0.0/16 -> pcx-xxx (OK)
  4. RDS Security Group inbound check -> 10.0.0.0/16 on 3306 NOT allowed (FOUND IT!)
  5. RDS only permitted references from VPC-B Security Groups

Fix: Add inbound rule to RDS Security Group allowing 3306 from VPC-A CIDR (10.0.0.0/16)

Scenario 3: Lambda Function Cannot Access VPC Resources

Symptom: VPC-connected Lambda cannot access ElastiCache, and internet is also blocked
Environment: Lambda connected to private subnets

Debugging Steps:
  1. Check Lambda execution role for VPC permissions
     - ec2:CreateNetworkInterface
     - ec2:DescribeNetworkInterfaces
     - ec2:DeleteNetworkInterface
  2. Verify assigned subnets (minimum 2 AZs recommended)
  3. Check Lambda SG -> ElastiCache SG connectivity
  4. Internet access requires a NAT Gateway route from the private subnets (a VPC-connected Lambda gets no public IP)

Root Cause: Lambda Security Group outbound was restrictive and
  ElastiCache port (6379) was not allowed
Fix: Add TCP 6379 outbound allow to Lambda SG

Scenario 4: Cross-Region Data Transfer Latency Spike

Symptom: Data replication latency spike between us-east-1 and ap-northeast-2
Environment: Using Inter-Region VPC Peering

Debugging Steps:
  1. Check VPC Peering status -> Active
  2. Review network bandwidth -> CloudWatch EC2 NetworkIn/NetworkOut metrics
  3. Check instance type network performance limits
  4. Optimize TCP Window Scaling and buffer sizes
  5. Consider AWS Global Accelerator or CloudFront

Optimizations:
  - Use parallel streams for bulk transfers
  - TCP buffer size tuning: sysctl -w net.core.rmem_max=16777216 net.core.wmem_max=16777216
  - Leverage S3 Transfer Acceleration (for file transfers)
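
A useful way to choose a buffer size is the bandwidth-delay product (BDP): the number of bytes that must be in flight to keep a path full. The sketch below computes it with integer arithmetic; the function name `bdp_bytes` and the sample figures (1000 Mbit/s, 180 ms round trip) are assumptions for illustration, not measurements of the regions in this scenario.

```shell
#!/usr/bin/env bash
# Bandwidth-delay product: bytes that must be in flight to keep a path full.
# bytes = (Mbit/s * 1,000,000 / 8) * (RTT ms / 1000) = Mbit/s * RTT ms * 125
bdp_bytes() {
  local mbps="$1" rtt_ms="$2"
  echo $(( mbps * rtt_ms * 125 ))
}

# Sample figures (assumptions, not measurements): 1000 Mbit/s at 180 ms RTT.
bdp_bytes 1000 180
```

For these sample figures the result is 22,500,000 bytes, which is larger than the 16,777,216-byte maximum set above. In that situation a single TCP stream cannot fill the path, which is one reason parallel streams help for bulk transfers.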

10. Troubleshooting Toolkit and Automation

10.1 Comprehensive Diagnostic Script

#!/bin/bash
# cloud-network-diag.sh: basic cloud network diagnostics
# Usage: ./cloud-network-diag.sh <target-ip> [port]

TARGET_IP=$1
TARGET_PORT=${2:-443}

if [ -z "$TARGET_IP" ]; then
  echo "Usage: $0 <target-ip> [port]" >&2
  exit 1
fi

echo "=== Cloud Network Diagnostics Starting ==="
echo "Target: ${TARGET_IP}:${TARGET_PORT}"
echo ""

# 1. Basic connectivity check (ICMP may be blocked even when TCP works)
echo "--- 1. Basic Connectivity Check ---"
ping -c 3 -W 2 "$TARGET_IP" 2>/dev/null || echo "ping failed (ICMP may be blocked)"
nc -zv -w 3 "$TARGET_IP" "$TARGET_PORT" 2>&1

# 2. Reverse DNS lookup (the target is an IP address, so use -x)
echo "--- 2. DNS Resolution Check ---"
PTR=$(dig +short -x "$TARGET_IP" 2>/dev/null)
echo "${PTR:-No PTR record found for $TARGET_IP}"

# 3. Route tracing
echo "--- 3. Route Tracing ---"
traceroute -m 15 -w 2 "$TARGET_IP" 2>/dev/null

# 4. Security groups from instance metadata (IMDSv2; only works on EC2)
echo "--- 4. Metadata-Based Security Group Check ---"
TOKEN=$(curl -s --max-time 2 -X PUT "http://169.254.169.254/latest/api/token" \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 60" 2>/dev/null)
if [ -n "$TOKEN" ]; then
  # This metadata path returns security group names, not IDs
  SG_NAMES=$(curl -s --max-time 2 -H "X-aws-ec2-metadata-token: $TOKEN" \
    http://169.254.169.254/latest/meta-data/security-groups 2>/dev/null)
  echo "Attached Security Groups: $SG_NAMES"
else
  echo "Instance metadata not reachable; skipping (not running on EC2?)"
fi

echo "=== Diagnostics Complete ==="

10.2 CloudWatch Alarm Setup (Network Monitoring)

# NAT Gateway ErrorPortAllocation alarm
aws cloudwatch put-metric-alarm \
  --alarm-name "NATGateway-PortAllocation-Error" \
  --metric-name ErrorPortAllocation \
  --namespace AWS/NATGateway \
  --dimensions Name=NatGatewayId,Value=nat-0abc1234 \
  --statistic Sum \
  --period 300 \
  --threshold 100 \
  --comparison-operator GreaterThanThreshold \
  --evaluation-periods 2 \
  --alarm-actions arn:aws:sns:ap-northeast-2:123456789012:network-alerts

# VPN tunnel down alarm
aws cloudwatch put-metric-alarm \
  --alarm-name "VPN-Tunnel-Down" \
  --metric-name TunnelState \
  --namespace AWS/VPN \
  --dimensions Name=VpnId,Value=vpn-0abc1234 \
  --statistic Maximum \
  --period 300 \
  --threshold 0.5 \
  --comparison-operator LessThanThreshold \
  --evaluation-periods 2 \
  --alarm-actions arn:aws:sns:ap-northeast-2:123456789012:network-alerts

Conclusion

Systematic methodology is the key to cloud network troubleshooting. Here are the main takeaways:

  1. Layered approach: Diagnose from upper to lower layers based on the OSI model
  2. Leverage logs: Always enable VPC Flow Logs and DNS Query Logs
  3. Use native tools: Actively use each cloud's Reachability Analyzer and Connectivity Tests
  4. Understand security rules: Clearly distinguish Stateful (SG) vs Stateless (NACL) behavior
  5. Optimize costs: Use VPC Endpoints to reduce NAT Gateway costs
  6. Monitor proactively: Build early detection with CloudWatch alarms

Network issues are often hard to reproduce, so the most important thing is to have monitoring and logging properly configured at all times.