- 1. VPC Networking Fundamentals and Troubleshooting Overview
- 2. Security Groups vs NACLs Debugging
- 3. VPC Peering and Transit Gateway Issues
- 4. AWS/GCP/Azure-Specific Network Troubleshooting
- 5. VPC Flow Logs Analysis
- 6. NAT Gateway and Internet Gateway Issues
- 7. Cross-Region and Hybrid Cloud Connectivity
- 8. Cloud DNS Troubleshooting (Route 53, Cloud DNS)
- 9. Practical Cloud Network Debugging Scenarios
- 10. Troubleshooting Toolkit and Automation
- Conclusion
1. VPC Networking Fundamentals and Troubleshooting Overview
Network issues are among the most frequent problems in cloud environments, and among the hardest to diagnose. Unlike on-premises setups, where you can physically access the hardware, cloud troubleshooting depends on a systematic approach built on the tools and logs that each cloud vendor provides.
1.1 VPC (Virtual Private Cloud) Basic Structure
A VPC is a logically isolated network space within the cloud. While terminology varies across cloud vendors, the core concepts remain the same.
| Concept | AWS | GCP | Azure |
|---|---|---|---|
| Virtual Network | VPC | VPC | VNet |
| Subnet | Subnet | Subnet | Subnet |
| Routing Table | Route Table | Routes | Route Table |
| Firewall (Instance) | Security Group | Firewall Rules | NSG |
| Firewall (Subnet) | NACL | Firewall Policies | NSG (Subnet-attached) |
1.2 Systematic Troubleshooting Approach
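When diagnosing network issues, it is usually most effective to work top-down through the OSI model, starting at the application layer, and then check the cloud-specific security and infrastructure components.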
Troubleshooting Order:
1. Application Layer (L7) — DNS resolution, HTTP response codes
2. Transport Layer (L4) — Port connectivity, TCP handshake
3. Network Layer (L3) — IP routing, subnet CIDR
4. Security Layer — Security Groups, NACLs, firewall rules
5. Infrastructure Layer — IGW, NAT GW, VPN, Peering status
1.3 Basic Connectivity Test Commands
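# Basic connectivity test (ICMP must be allowed by the Security Group and NACL)
ping -c 4 10.0.1.50
# Specific port connectivity check (options go before the host and port)
nc -zv -w 5 10.0.1.50 443
# Detailed TCP route tracing (TCP mode usually requires root)
sudo traceroute -T -p 443 10.0.1.50
# DNS lookup test
dig +trace example.com
# Query a specific resolver (10.0.0.2 is the VPC resolver of a 10.0.0.0/16 VPC)
nslookup example.com 10.0.0.2
# HTTP-level debugging with curl
curl -v --connect-timeout 5 https://api.internal.example.com/health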
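Where `nc` is not installed (minimal container images, for example), the port check can be approximated in Python. The sketch below is an illustration rather than part of the original toolkit: the function name `tcp_port_open` is hypothetical, and it only tests whether the TCP handshake completes within the timeout.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP handshake to host:port completes within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable networks.
        return False
```

Calling `tcp_port_open("10.0.1.50", 443)` from an instance in the same VPC corresponds to the `nc` check above. A `False` result does not distinguish a refused connection from silently dropped packets, so it is worth noting how long the call took before it returned.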
2. Security Groups vs NACLs Debugging
2.1 Key Differences Between Security Groups and NACLs
Failure to understand the differences between these two security mechanisms leads to wasted troubleshooting time.
Security Group (Stateful):
- Inbound allow automatically permits corresponding response traffic
- Applied at the instance (ENI) level
- Allow rules only (implicit deny)
- Rule evaluation: All rules are evaluated to determine access
NACL (Stateless):
- Inbound/outbound rules must be configured separately
- Applied at the subnet level
- Both allow and deny rules supported
- Rule evaluation: Processed in rule number order (first match wins)
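The difference in rule evaluation can be made concrete with a small model. This is a simplified sketch that only considers TCP port ranges (real rules also match protocol and CIDR); the `Rule` class and both function names are illustrative assumptions, not AWS APIs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    number: int     # NACL rule number; ignored for Security Groups
    port_from: int
    port_to: int
    allow: bool     # Security Groups only ever carry allow=True


def nacl_allows(rules: list[Rule], port: int) -> bool:
    """NACL: the lowest-numbered matching rule decides; the default is deny."""
    for rule in sorted(rules, key=lambda r: r.number):
        if rule.port_from <= port <= rule.port_to:
            return rule.allow
    return False  # the implicit "*" deny entry


def sg_allows(rules: list[Rule], port: int) -> bool:
    """Security Group: any matching allow rule permits the traffic."""
    return any(r.allow and r.port_from <= port <= r.port_to for r in rules)


# A low-numbered deny shadows a broader allow in a NACL.
nacl = [Rule(90, 443, 443, False), Rule(100, 0, 65535, True)]
print(nacl_allows(nacl, 443))  # False: rule 90 matches first
print(nacl_allows(nacl, 80))   # True: only rule 100 matches
```

Because order matters only for the NACL, renumbering a rule can change behavior there, while reordering Security Group rules never does.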
2.2 Common Security Group Issues and Fixes
# Check Security Group rules with AWS CLI
aws ec2 describe-security-groups \
--group-ids sg-0abc1234def56789 \
--query 'SecurityGroups[*].{
GroupId:GroupId,
Ingress:IpPermissions,
Egress:IpPermissionsEgress
}' \
--output json
# Check Security Groups attached to a specific instance
aws ec2 describe-instances \
--instance-ids i-0abcdef1234567890 \
--query 'Reservations[*].Instances[*].SecurityGroups' \
--output table
Case 1: Missing Outbound Rules
Problem: EC2 fails to call external APIs
Cause: Default outbound rule (0.0.0.0/0 allow) was deleted from the Security Group
Fix:
aws ec2 authorize-security-group-egress \
--group-id sg-0abc1234def56789 \
--protocol tcp --port 443 \
--cidr 0.0.0.0/0
Case 2: Missing Self-Referencing Rules
Problem: Instances within the same Security Group cannot communicate
Cause: The SG lacks a rule allowing itself as a source
Fix:
aws ec2 authorize-security-group-ingress \
--group-id sg-0abc1234def56789 \
--protocol -1 \
--source-group sg-0abc1234def56789
2.3 NACL Debugging Patterns
# Check NACL rules
aws ec2 describe-network-acls \
--network-acl-ids acl-0abc1234 \
--query 'NetworkAcls[*].Entries' \
--output json
Ephemeral Port Issues — The Most Common NACL Problem:
Problem: Outbound HTTP request responses are not received
Cause: No inbound NACL rule for ephemeral port range (1024-65535)
NACL Rule Example (correct configuration):
Inbound:
Rule 100: TCP 443 ALLOW (0.0.0.0/0)
Rule 110: TCP 1024-65535 ALLOW (0.0.0.0/0) <-- This is essential
Rule *: All traffic DENY
Outbound:
Rule 100: TCP 443 ALLOW (0.0.0.0/0)
Rule 110: TCP 1024-65535 ALLOW (0.0.0.0/0) <-- Response traffic
Rule *: All traffic DENY
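One way to audit an inbound NACL for this problem is to evaluate every ephemeral port against the entries in rule-number order. The following is a minimal sketch under simplifying assumptions: entries are plain tuples for TCP only, CIDRs are ignored, and the function name `blocked_ephemeral_ports` is hypothetical.

```python
EPHEMERAL_PORTS = range(1024, 65536)  # 1024-65535 inclusive


def blocked_ephemeral_ports(entries):
    """entries: list of (rule_number, port_from, port_to, action) tuples,
    where action is "allow" or "deny". Returns the ephemeral ports that
    the NACL would drop on the inbound side."""
    ordered = sorted(entries, key=lambda e: e[0])
    blocked = []
    for port in EPHEMERAL_PORTS:
        decision = "deny"  # implicit "*" rule
        for _, port_from, port_to, action in ordered:
            if port_from <= port <= port_to:
                decision = action  # first match wins
                break
        if decision != "allow":
            blocked.append(port)
    return blocked


# Only 443 is allowed inbound: every response to an outbound request is dropped.
print(len(blocked_ephemeral_ports([(100, 443, 443, "allow")])))  # 64512
```

With the second rule from the example above added, `(110, 1024, 65535, "allow")`, the function returns an empty list.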
3. VPC Peering and Transit Gateway Issues
3.1 VPC Peering Troubleshooting
VPC Peering provides private network connectivity between two VPCs, but comes with several constraints.
VPC Peering Constraints:
x Transitive peering is NOT supported
VPC-A <-> VPC-B <-> VPC-C: A cannot reach C through B; A and C need their own peering connection
x Overlapping CIDRs are NOT allowed
x Inter-region peering does NOT support Security Group references; use the peer VPC's CIDR block in rules instead
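The CIDR constraint can be checked before a peering request is created. A minimal sketch using Python's standard `ipaddress` module; the VPC names and the third CIDR are made-up example values, and `overlapping_pairs` is a hypothetical helper name.

```python
import ipaddress
from itertools import combinations


def overlapping_pairs(vpc_cidrs: dict[str, str]) -> list[tuple[str, str]]:
    """Return every pair of VPC names whose CIDR blocks overlap."""
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in vpc_cidrs.items()}
    return [(a, b) for a, b in combinations(nets, 2) if nets[a].overlaps(nets[b])]


vpcs = {
    "vpc-a": "10.0.0.0/16",
    "vpc-b": "10.1.0.0/16",
    "vpc-c": "10.0.128.0/17",  # sits inside vpc-a's range
}
print(overlapping_pairs(vpcs))  # [('vpc-a', 'vpc-c')]
```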
# Check VPC Peering status
aws ec2 describe-vpc-peering-connections \
--filters "Name=status-code,Values=active,pending-acceptance" \
--query 'VpcPeeringConnections[*].{
PeeringId:VpcPeeringConnectionId,
Status:Status.Code,
Requester:RequesterVpcInfo.{VpcId:VpcId,CidrBlock:CidrBlock},
Accepter:AccepterVpcInfo.{VpcId:VpcId,CidrBlock:CidrBlock}
}'
# Check peering routes in routing tables
aws ec2 describe-route-tables \
--filters "Name=vpc-id,Values=vpc-0abc1234" \
--query 'RouteTables[*].Routes[?GatewayId!=`local`]'
Common Mistake: Missing Route Table Entries
Problem: Peering connection is Active but communication fails
Cause: Peering routes were not added to both VPCs' route tables
Fix:
# Add VPC-B CIDR to VPC-A route table
aws ec2 create-route \
--route-table-id rtb-aaa111 \
--destination-cidr-block 10.1.0.0/16 \
--vpc-peering-connection-id pcx-0abc1234
# Add VPC-A CIDR to VPC-B route table
aws ec2 create-route \
--route-table-id rtb-bbb222 \
--destination-cidr-block 10.0.0.0/16 \
--vpc-peering-connection-id pcx-0abc1234
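Why a missing entry breaks an otherwise active peering becomes clearer when route selection is modeled explicitly: the most specific (longest-prefix) matching route wins. The sketch below represents a route table as a plain dict of CIDR to target; the `lookup_route` name and the NAT target ID are illustrative assumptions.

```python
import ipaddress


def lookup_route(routes: dict[str, str], destination: str):
    """Return the target of the most specific matching route, or None."""
    dest = ipaddress.ip_address(destination)
    matches = [
        (ipaddress.ip_network(cidr), target)
        for cidr, target in routes.items()
        if dest in ipaddress.ip_network(cidr)
    ]
    if not matches:
        return None
    # Longest prefix = most specific route.
    return max(matches, key=lambda m: m[0].prefixlen)[1]


vpc_a_routes = {"10.0.0.0/16": "local", "0.0.0.0/0": "nat-0abc1234"}
# Without a peering route, traffic for VPC-B follows the default route.
print(lookup_route(vpc_a_routes, "10.1.0.25"))  # nat-0abc1234

vpc_a_routes["10.1.0.0/16"] = "pcx-0abc1234"
print(lookup_route(vpc_a_routes, "10.1.0.25"))  # pcx-0abc1234
```

The same check has to pass in the opposite direction from VPC-B's route table, otherwise return traffic is lost.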
3.2 Transit Gateway Problem Resolution
Transit Gateway connects multiple VPCs and on-premises networks using a hub-and-spoke model.
# Check Transit Gateway route tables
aws ec2 describe-transit-gateway-route-tables \
--transit-gateway-route-table-ids tgw-rtb-0abc1234
# Check Transit Gateway attachment states
aws ec2 describe-transit-gateway-attachments \
--filters "Name=transit-gateway-id,Values=tgw-0abc1234" \
--query 'TransitGatewayAttachments[*].{
AttachmentId:TransitGatewayAttachmentId,
ResourceType:ResourceType,
State:State
}'
# Query routes in TGW route table
aws ec2 search-transit-gateway-routes \
--transit-gateway-route-table-id tgw-rtb-0abc1234 \
--filters "Name=type,Values=static,propagated"
TGW Route Propagation Issues:
Problem: Communication fails between VPCs connected via TGW
Checklist:
[ ] Is the TGW Attachment in "available" state?
[ ] Are routes propagated in the TGW Route Table?
[ ] Do each VPC's subnet route tables have TGW routes?
[ ] Is the TGW Route Table Association correct?
4. AWS/GCP/Azure-Specific Network Troubleshooting
4.1 AWS-Specific Network Issues
# Use VPC Reachability Analyzer (powerful diagnostic tool)
aws ec2 create-network-insights-path \
--source i-0abc1234 \
--destination i-0def5678 \
--protocol tcp \
--destination-port 443
aws ec2 start-network-insights-analysis \
--network-insights-path-id nip-0abc1234
# Check results
aws ec2 describe-network-insights-analyses \
--network-insights-analysis-ids nia-0abc1234 \
--query 'NetworkInsightsAnalyses[*].{
Status:Status,
PathFound:NetworkPathFound,
Explanations:Explanations
}'
4.2 GCP Network Debugging
# Check GCP firewall rules
gcloud compute firewall-rules list \
--filter="network=my-vpc" \
--format="table(name,direction,priority,allowed,sourceRanges,targetTags)"
# Run Connectivity Test (GCP's Reachability Analyzer)
gcloud network-management connectivity-tests create my-test \
--source-instance=projects/my-project/zones/us-central1-a/instances/vm-1 \
--destination-instance=projects/my-project/zones/us-central1-b/instances/vm-2 \
--protocol=TCP \
--destination-port=8080
# Check results
gcloud network-management connectivity-tests describe my-test
4.3 Azure Network Diagnostics
# Check NSG rules
az network nsg rule list \
--nsg-name my-nsg \
--resource-group my-rg \
--output table
# Connectivity check with Network Watcher
az network watcher test-connectivity \
--resource-group my-rg \
--source-resource vm-source \
--dest-resource vm-dest \
--dest-port 443
# Check effective routes
az network nic show-effective-route-table \
--name my-nic \
--resource-group my-rg
5. VPC Flow Logs Analysis
5.1 Flow Logs Setup and Field Reference
VPC Flow Logs are an essential data source for network troubleshooting: they record accepted and rejected traffic for each network interface.
# Enable AWS VPC Flow Logs
aws ec2 create-flow-logs \
--resource-type VPC \
--resource-ids vpc-0abc1234 \
--traffic-type ALL \
--log-destination-type cloud-watch-logs \
--log-group-name /vpc/flow-logs \
--deliver-logs-permission-arn arn:aws:iam::123456789012:role/flowlogsRole \
--max-aggregation-interval 60
Flow Log Record Format:
<version> <account-id> <interface-id> <srcaddr> <dstaddr> <srcport> <dstport> <protocol> <packets> <bytes> <start> <end> <action> <log-status>
Example:
2 123456789012 eni-0abc1234 10.0.1.50 10.0.2.100 49152 443 6 20 4000 1625000000 1625000060 ACCEPT OK
2 123456789012 eni-0abc1234 203.0.113.5 10.0.1.50 0 0 1 4 336 1625000000 1625000060 REJECT OK
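For ad hoc analysis outside CloudWatch, a record in this default format can be split into named fields. A minimal sketch, assuming the 14-field version 2 layout shown above; `FlowRecord` and `parse_flow_record` are hypothetical names, and custom log formats would need a different field list.

```python
from typing import NamedTuple, Optional


class FlowRecord(NamedTuple):
    version: int
    account_id: str
    interface_id: str
    srcaddr: str
    dstaddr: str
    srcport: Optional[int]
    dstport: Optional[int]
    protocol: Optional[int]
    packets: Optional[int]
    bytes: Optional[int]
    start: int
    end: int
    action: str
    log_status: str


# Fields converted to int; "-" (used in NODATA/SKIPDATA records) becomes None.
_INT_FIELDS = {"version", "srcport", "dstport", "protocol",
               "packets", "bytes", "start", "end"}


def parse_flow_record(line: str) -> FlowRecord:
    parts = line.split()
    if len(parts) != len(FlowRecord._fields):
        raise ValueError(f"expected {len(FlowRecord._fields)} fields, got {len(parts)}")
    values = []
    for name, raw in zip(FlowRecord._fields, parts):
        if name in _INT_FIELDS:
            values.append(None if raw == "-" else int(raw))
        else:
            values.append(raw)
    return FlowRecord(*values)


rec = parse_flow_record(
    "2 123456789012 eni-0abc1234 10.0.1.50 10.0.2.100 49152 443 6 20 4000 "
    "1625000000 1625000060 ACCEPT OK"
)
print(rec.dstport, rec.action, rec.end - rec.start)  # 443 ACCEPT 60
```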
5.2 Analysis with CloudWatch Logs Insights
# Top 10 source IPs with rejected traffic
fields @timestamp, srcAddr, dstAddr, dstPort, action
| filter action = "REJECT"
| stats count(*) as rejectedCount by srcAddr
| sort rejectedCount desc
| limit 10
# Traffic pattern analysis for a specific IP
fields @timestamp, srcAddr, dstAddr, srcPort, dstPort, protocol, action
| filter srcAddr = "10.0.1.50" or dstAddr = "10.0.1.50"
| sort @timestamp desc
| limit 100
# Detect anomalous port scanning (many distinct destination ports per source)
fields srcAddr, dstPort, action
| filter action = "REJECT"
| stats count_distinct(dstPort) as portCount by srcAddr
| filter portCount > 100
| sort portCount desc
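The signal for a port scan is the number of distinct destination ports a source was rejected on, not the raw number of rejected records. The same idea can be sketched offline in Python; the tuple layout and the `suspected_scanners` name are assumptions for illustration, and the threshold of 100 is taken from the query above.

```python
from collections import defaultdict


def suspected_scanners(records, min_ports=100):
    """records: iterable of (srcaddr, dstport, action) tuples.
    Returns {srcaddr: distinct rejected ports} for sources above min_ports."""
    ports_by_src = defaultdict(set)
    for src, dstport, action in records:
        if action == "REJECT":
            ports_by_src[src].add(dstport)
    return {
        src: len(ports)
        for src, ports in ports_by_src.items()
        if len(ports) > min_ports
    }


# One source probes 150 ports; another retries a single blocked port 500 times.
records = [("203.0.113.5", p, "REJECT") for p in range(1, 151)]
records += [("10.0.1.50", 443, "REJECT")] * 500
print(suspected_scanners(records))  # {'203.0.113.5': 150}
```

The client that retries one blocked port is a misconfiguration rather than a scan, which is why counting distinct ports separates the two cases.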
5.3 Large-Scale Flow Log Analysis with Athena
-- Create Athena table
CREATE EXTERNAL TABLE IF NOT EXISTS vpc_flow_logs (
version int,
account_id string,
interface_id string,
srcaddr string,
dstaddr string,
srcport int,
dstport int,
protocol bigint,
packets bigint,
bytes bigint,
`start` bigint,
`end` bigint,
action string,
log_status string
)
PARTITIONED BY (dt string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
LOCATION 's3://my-flow-logs-bucket/AWSLogs/123456789012/vpcflowlogs/'
TBLPROPERTIES ("skip.header.line.count"="1");
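-- Rejected traffic trends by hour
-- (the dt partition must be registered with ALTER TABLE ... ADD PARTITION before querying)
SELECT
date_trunc('hour', from_unixtime(start)) as event_hour,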
srcaddr,
dstaddr,
dstport,
protocol,
sum(packets) as total_packets,
sum(bytes) as total_bytes
FROM vpc_flow_logs
WHERE action = 'REJECT'
AND dt = '2026/03/08'
GROUP BY 1, 2, 3, 4, 5
ORDER BY total_packets DESC
LIMIT 50;
6. NAT Gateway and Internet Gateway Issues
6.1 Internet Gateway (IGW) Troubleshooting
Internet Access via IGW Checklist:
[ ] Is an IGW attached to the VPC?
[ ] Does the public subnet route table have a 0.0.0.0/0 -> IGW route?
[ ] Does the instance have a Public IP or Elastic IP assigned?
[ ] Does the Security Group outbound allow the traffic?
[ ] Do the NACL rules allow both inbound and outbound traffic?
# Check IGW attachment status
aws ec2 describe-internet-gateways \
--filters "Name=attachment.vpc-id,Values=vpc-0abc1234" \
--query 'InternetGateways[*].{IGWId:InternetGatewayId,State:Attachments[0].State}'
# Check public subnet routing
aws ec2 describe-route-tables \
--filters "Name=association.subnet-id,Values=subnet-pub123" \
--query 'RouteTables[*].Routes[?DestinationCidrBlock==`0.0.0.0/0`]'
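The routing item in the checklist can be automated against the JSON that `describe-route-tables` returns. The sketch below assumes a single route table object has already been loaded into a dict; the helper names are hypothetical, and only the `Routes`, `DestinationCidrBlock`, `GatewayId`, and `NatGatewayId` keys of the response are used.

```python
def default_route_target(route_table: dict):
    """Return the target ID of the 0.0.0.0/0 route, or None if there is none."""
    for route in route_table.get("Routes", []):
        if route.get("DestinationCidrBlock") == "0.0.0.0/0":
            return route.get("GatewayId") or route.get("NatGatewayId")
    return None


def is_public_subnet(route_table: dict) -> bool:
    """A subnet is public when its default route points at an Internet Gateway."""
    target = default_route_target(route_table)
    return target is not None and target.startswith("igw-")


public_rt = {"Routes": [
    {"DestinationCidrBlock": "10.0.0.0/16", "GatewayId": "local"},
    {"DestinationCidrBlock": "0.0.0.0/0", "GatewayId": "igw-0abc1234"},
]}
private_rt = {"Routes": [
    {"DestinationCidrBlock": "10.0.0.0/16", "GatewayId": "local"},
    {"DestinationCidrBlock": "0.0.0.0/0", "NatGatewayId": "nat-0abc1234"},
]}
print(is_public_subnet(public_rt), is_public_subnet(private_rt))  # True False
```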
6.2 NAT Gateway Problem Resolution
# Check NAT Gateway status
aws ec2 describe-nat-gateways \
--filter "Name=vpc-id,Values=vpc-0abc1234" \
--query 'NatGateways[*].{
NatGatewayId:NatGatewayId,
State:State,
SubnetId:SubnetId,
PublicIp:NatGatewayAddresses[0].PublicIp,
PrivateIp:NatGatewayAddresses[0].PrivateIp
}'
Common NAT Gateway Issues:
Issue 1: No internet access through NAT Gateway
Cause: NAT Gateway was placed in a private subnet
Fix: NAT Gateway must be located in a public subnet
Issue 2: Intermittent connection drops
Cause: Port allocation exhausted (a NAT Gateway supports up to 55,000 simultaneous connections to each unique destination)
Check: CloudWatch Metrics > ErrorPortAllocation
Fix: Distribute traffic across multiple NAT Gateways or use VPC Endpoints
Issue 3: NAT Gateway cost explosion
Cause: AWS service traffic (S3, DynamoDB) routing through NAT
Fix: Create VPC Endpoints (Gateway type) for private routing
# Create VPC Endpoint (S3 example — reduces NAT costs)
aws ec2 create-vpc-endpoint \
--vpc-id vpc-0abc1234 \
--service-name com.amazonaws.ap-northeast-2.s3 \
--route-table-ids rtb-priv111 rtb-priv222
# Check NAT Gateway CloudWatch metrics
aws cloudwatch get-metric-statistics \
--namespace AWS/NATGateway \
--metric-name ErrorPortAllocation \
--dimensions Name=NatGatewayId,Value=nat-0abc1234 \
--start-time 2026-03-07T00:00:00Z \
--end-time 2026-03-08T00:00:00Z \
--period 3600 \
--statistics Sum
7. Cross-Region and Hybrid Cloud Connectivity
7.1 Cross-Region VPC Connectivity
Cross-Region Connection Options:
1. Inter-Region VPC Peering — 1:1 connection, simple setup
2. Transit Gateway Inter-Region Peering — Hub-spoke, large-scale
3. AWS Cloud WAN — Global network management (newest)
# Check Inter-Region TGW Peering status
aws ec2 describe-transit-gateway-peering-attachments \
--filters "Name=state,Values=available,pendingAcceptance" \
--query 'TransitGatewayPeeringAttachments[*].{
AttachmentId:TransitGatewayAttachmentId,
RequesterTgw:RequesterTgwInfo.TransitGatewayId,
AccepterTgw:AccepterTgwInfo.TransitGatewayId,
State:State
}'
7.2 Hybrid Cloud Connectivity (VPN & Direct Connect)
# Check Site-to-Site VPN tunnel status
aws ec2 describe-vpn-connections \
--vpn-connection-ids vpn-0abc1234 \
--query 'VpnConnections[*].{
VpnId:VpnConnectionId,
State:State,
Tunnel1Status:VgwTelemetry[0].Status,
Tunnel1StatusMessage:VgwTelemetry[0].StatusMessage,
Tunnel2Status:VgwTelemetry[1].Status,
Tunnel2StatusMessage:VgwTelemetry[1].StatusMessage
}'
VPN Tunnel Down Diagnosis:
When a VPN tunnel is DOWN, check the following:
[ ] Customer Gateway (CGW) configuration
- IKE version (v1 or v2)
- Pre-shared key match
- Phase 1/Phase 2 encryption parameter match
[ ] On-premises firewall allows UDP 500 (IKE), UDP 4500 (NAT-T), and IP protocol 50 (ESP)
[ ] NAT-Traversal configuration
[ ] BGP session status (when using dynamic routing)
[ ] Dead Peer Detection (DPD) settings match
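Before walking the checklist, it helps to know which tunnel is down and what status message AWS reports for it. A minimal sketch that reads one `VpnConnections` element from `describe-vpn-connections` output (queried without the `--query` projection); the function name and the sample IPs and messages are invented for illustration.

```python
def down_tunnels(vpn_connection: dict) -> list[dict]:
    """Return outside IP and status message for every tunnel that is not UP."""
    return [
        {
            "outside_ip": tunnel.get("OutsideIpAddress"),
            "message": tunnel.get("StatusMessage", ""),
        }
        for tunnel in vpn_connection.get("VgwTelemetry", [])
        if tunnel.get("Status") != "UP"
    ]


# Sample data (made up): tunnel 1 is healthy, tunnel 2 is down.
vpn = {
    "VpnConnectionId": "vpn-0abc1234",
    "VgwTelemetry": [
        {"OutsideIpAddress": "198.51.100.10", "Status": "UP", "StatusMessage": ""},
        {"OutsideIpAddress": "198.51.100.20", "Status": "DOWN", "StatusMessage": "example message"},
    ],
}
print(down_tunnels(vpn))
```

A connection with one tunnel down still passes traffic, so this kind of check is mainly useful for catching lost redundancy before the second tunnel fails.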
7.3 Multi-Cloud Network Connectivity
AWS <-> GCP Connection Options:
1. IPSec VPN (AWS VPN + GCP Cloud VPN)
2. Partner Interconnect
3. Third-party SD-WAN solutions
AWS <-> Azure Connection Options:
1. IPSec VPN (AWS VPN + Azure VPN Gateway)
2. ExpressRoute + Direct Connect (dedicated line)
3. Cloud Exchange via Megaport, etc.
# Check GCP VPN tunnel status
gcloud compute vpn-tunnels describe my-tunnel \
--region=us-central1 \
--format="value(status,detailedStatus)"
# Check Azure VPN connection status
az network vpn-connection show \
--name my-vpn-connection \
--resource-group my-rg \
--query '{Status:connectionStatus,EgressBytes:egressBytesTransferred,IngressBytes:ingressBytesTransferred}'
8. Cloud DNS Troubleshooting (Route 53, Cloud DNS)
8.1 AWS Route 53 Problem Resolution
# Check Route 53 hosted zone records
aws route53 list-resource-record-sets \
--hosted-zone-id Z0123456789ABCDEFGHIJ \
--query 'ResourceRecordSets[?Name==`api.example.com.`]'
# Check Route 53 Resolver query logs (VPC internal DNS)
aws route53resolver list-resolver-query-log-configs
# DNS resolution test
aws route53 test-dns-answer \
--hosted-zone-id Z0123456789ABCDEFGHIJ \
--record-name api.example.com \
--record-type A
Private Hosted Zone Issues:
Problem: Private DNS records are not resolving within the VPC
Checklist:
[ ] Is the private hosted zone associated with the VPC?
[ ] Is enableDnsSupport set to true for the VPC?
[ ] Is enableDnsHostnames set to true for the VPC?
[ ] Is the DHCP Options Set using AmazonProvidedDNS?
# Check VPC DNS settings
aws ec2 describe-vpc-attribute \
--vpc-id vpc-0abc1234 \
--attribute enableDnsSupport
aws ec2 describe-vpc-attribute \
--vpc-id vpc-0abc1234 \
--attribute enableDnsHostnames
8.2 GCP Cloud DNS Debugging
# Check Cloud DNS records
gcloud dns record-sets list \
--zone=my-dns-zone \
--filter="name=api.example.com."
# Check VPCs associated with a private DNS zone
gcloud dns managed-zones describe my-private-zone \
--format="value(privateVisibilityConfig.networks)"
# Check DNS policies
gcloud dns policies list
8.3 Hybrid DNS Configuration Issues
DNS Resolution in Hybrid (On-premises <-> AWS) Environments:
- Route 53 Resolver inbound/outbound endpoints are required
- On-premises -> AWS: Use Inbound Endpoint
- AWS -> On-premises: Use Outbound Endpoint + Forwarding Rule
+-----------------------------+
| Route 53 Resolver |
On-prem --> | Inbound Outbound | --> On-prem DNS
DNS query | Endpoint Endpoint | Forwarding
+-----------------------------+
# Check Route 53 Resolver endpoint status
aws route53resolver list-resolver-endpoints \
--query 'ResolverEndpoints[*].{
Id:Id,
Name:Name,
Direction:Direction,
Status:Status,
IpAddressCount:IpAddressCount
}'
# Check forwarding rules
aws route53resolver list-resolver-rules \
--query 'ResolverRules[*].{
Id:Id,
DomainName:DomainName,
RuleType:RuleType,
Status:Status,
TargetIps:TargetIps
}'
9. Practical Cloud Network Debugging Scenarios
Scenario 1: External API Call Timeout from ECS Service
Symptom: Connection timeout when calling external API from ECS Fargate task
Environment: Fargate task deployed in a private subnet
Debugging Steps:
1. Verify task network mode (awsvpc)
2. Check Security Group outbound rules on the task's ENI
3. Verify NAT Gateway route in subnet route table
4. Check NAT Gateway status and placement (public subnet)
5. Check NACL rules (outbound + ephemeral port inbound)
Root Cause: NAT Gateway was deleted but a blackhole route remained
Fix: Create a new NAT Gateway and update the route table
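A leftover route like this shows up with `State` set to `blackhole` in `describe-route-tables` output, so it can be detected mechanically. The sketch below assumes the `RouteTables` list from that response has been loaded into Python; `blackhole_routes` is a hypothetical helper name and the sample IDs are made up.

```python
def blackhole_routes(route_tables: list[dict]) -> list[tuple[str, str]]:
    """Return (route table ID, destination CIDR) for every blackhole route."""
    found = []
    for table in route_tables:
        for route in table.get("Routes", []):
            if route.get("State") == "blackhole":
                found.append((
                    table.get("RouteTableId", "unknown"),
                    route.get("DestinationCidrBlock", "unknown"),
                ))
    return found


tables = [{
    "RouteTableId": "rtb-priv111",
    "Routes": [
        {"DestinationCidrBlock": "10.0.0.0/16", "GatewayId": "local", "State": "active"},
        # Target NAT Gateway no longer exists.
        {"DestinationCidrBlock": "0.0.0.0/0", "NatGatewayId": "nat-0abc1234", "State": "blackhole"},
    ],
}]
print(blackhole_routes(tables))  # [('rtb-priv111', '0.0.0.0/0')]
```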
Scenario 2: RDS Inaccessible Across VPC Peering
Symptom: Application in VPC-A cannot connect to RDS in VPC-B on TCP 3306
Environment: VPC-A (10.0.0.0/16) <-> Peering <-> VPC-B (10.1.0.0/16)
Debugging Steps:
1. Check Peering status -> Active
2. VPC-A route table -> 10.1.0.0/16 -> pcx-xxx (OK)
3. VPC-B route table -> 10.0.0.0/16 -> pcx-xxx (OK)
4. RDS Security Group inbound check -> 10.0.0.0/16 on 3306 NOT allowed (FOUND IT!)
5. RDS only permitted references from VPC-B Security Groups
Fix: Add inbound rule to RDS Security Group allowing 3306 from VPC-A CIDR (10.0.0.0/16)
Scenario 3: Lambda Function Cannot Access VPC Resources
Symptom: VPC-connected Lambda cannot access ElastiCache, and internet is also blocked
Environment: Lambda connected to private subnets
Debugging Steps:
1. Check Lambda execution role for VPC permissions
- ec2:CreateNetworkInterface
- ec2:DescribeNetworkInterfaces
- ec2:DeleteNetworkInterface
2. Verify assigned subnets (minimum 2 AZs recommended)
3. Check Lambda SG -> ElastiCache SG connectivity
4. Internet access requires: private subnet + NAT Gateway
Root Cause: Lambda Security Group outbound was restrictive and
ElastiCache port (6379) was not allowed
Fix: Add TCP 6379 outbound allow to Lambda SG
Scenario 4: Cross-Region Data Transfer Latency Spike
Symptom: Data replication latency spike between us-east-1 and ap-northeast-2
Environment: Using Inter-Region VPC Peering
Debugging Steps:
1. Check VPC Peering status -> Active
2. Review network throughput -> CloudWatch NetworkIn/NetworkOut metrics on the sending and receiving instances
3. Check instance type network performance limits
4. Optimize TCP Window Scaling and buffer sizes
5. Consider AWS Global Accelerator or CloudFront
Optimizations:
- Use parallel streams for bulk transfers
- TCP window size tuning: sysctl -w net.core.rmem_max=16777216
- Leverage S3 Transfer Acceleration (for file transfers)
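Whether a buffer size such as the one in the `sysctl` line is large enough can be estimated from the bandwidth-delay product (BDP): the number of bytes that must be in flight to keep a link full. The sketch below is a back-of-the-envelope calculation; the 1 Gbit/s bandwidth and 180 ms round-trip time are illustrative assumptions, not measurements, and the function names are hypothetical.

```python
def bdp_bytes(bandwidth_mbps: float, rtt_ms: float) -> int:
    """Bytes in flight needed to keep the link full: bandwidth x RTT."""
    # Mbit/s x ms = 1000 bits, and 1000 bits / 8 = 125 bytes.
    return round(bandwidth_mbps * rtt_ms * 125)


def max_throughput_mbps(buffer_bytes: int, rtt_ms: float) -> float:
    """Upper bound on single-stream throughput for a given receive buffer."""
    # bytes * 8 bits / (rtt_ms / 1000 s) / 1e6 = bytes * 8 / (rtt_ms * 1000)
    return buffer_bytes * 8 / (rtt_ms * 1000)


print(bdp_bytes(1000, 180))                # 22500000
print(max_throughput_mbps(16777216, 180))  # roughly 746
```

Under these assumed numbers the BDP (about 22.5 MB) exceeds the 16 MiB buffer, so a single TCP stream would top out near 746 Mbit/s. This is one reason parallel streams help on long-distance links.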
10. Troubleshooting Toolkit and Automation
10.1 Comprehensive Diagnostic Script
#!/bin/bash
# cloud-network-diag.sh: Cloud network comprehensive diagnostic script
# Usage: ./cloud-network-diag.sh <target-ip> [port] [vpc-id]
if [ -z "$1" ]; then
  echo "Usage: $0 <target-ip> [port] [vpc-id]" >&2
  exit 1
fi
TARGET_IP=$1
TARGET_PORT=${2:-443}
VPC_ID=${3:-"auto"}  # currently unused by the checks below
echo "=== Cloud Network Diagnostics Starting ==="
echo "Target: ${TARGET_IP}:${TARGET_PORT}"
echo ""
# 1. Basic connectivity check
echo "--- 1. Basic Connectivity Check ---"
ping -c 3 -W 2 "$TARGET_IP" 2>/dev/null
nc -zv -w 3 "$TARGET_IP" "$TARGET_PORT" 2>&1
# 2. DNS resolution check (reverse lookup, since the target is an IP address)
echo "--- 2. DNS Resolution Check ---"
PTR=$(dig +short -x "$TARGET_IP" 2>/dev/null)
echo "${PTR:-No PTR record found for $TARGET_IP}"
# 3. Route tracing
echo "--- 3. Route Tracing ---"
traceroute -m 15 -w 2 "$TARGET_IP" 2>/dev/null
# 4. AWS Security Group check via IMDSv2 (only works on an EC2 instance)
echo "--- 4. Metadata-Based Security Group Check ---"
TOKEN=$(curl -s --connect-timeout 2 -X PUT "http://169.254.169.254/latest/api/token" \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 60" 2>/dev/null)
if [ -n "$TOKEN" ]; then
  # This metadata path returns Security Group names, not IDs
  SG_NAMES=$(curl -s --connect-timeout 2 -H "X-aws-ec2-metadata-token: $TOKEN" \
    http://169.254.169.254/latest/meta-data/security-groups 2>/dev/null)
  echo "Attached Security Groups: $SG_NAMES"
else
  echo "Instance metadata not reachable; skipping Security Group check"
fi
echo "=== Diagnostics Complete ==="
10.2 CloudWatch Alarm Setup (Network Monitoring)
# NAT Gateway ErrorPortAllocation alarm
aws cloudwatch put-metric-alarm \
--alarm-name "NATGateway-PortAllocation-Error" \
--metric-name ErrorPortAllocation \
--namespace AWS/NATGateway \
--dimensions Name=NatGatewayId,Value=nat-0abc1234 \
--statistic Sum \
--period 300 \
--threshold 100 \
--comparison-operator GreaterThanThreshold \
--evaluation-periods 2 \
--alarm-actions arn:aws:sns:ap-northeast-2:123456789012:network-alerts
# VPN tunnel down alarm
aws cloudwatch put-metric-alarm \
--alarm-name "VPN-Tunnel-Down" \
--metric-name TunnelState \
--namespace AWS/VPN \
--dimensions Name=VpnId,Value=vpn-0abc1234 \
--statistic Maximum \
--period 300 \
--threshold 0.5 \
--comparison-operator LessThanThreshold \
--evaluation-periods 2 \
--alarm-actions arn:aws:sns:ap-northeast-2:123456789012:network-alerts
Conclusion
Systematic methodology is the key to cloud network troubleshooting. Here are the main takeaways:
- Layered approach: Diagnose from upper to lower layers based on the OSI model
- Leverage logs: Always enable VPC Flow Logs and DNS Query Logs
- Use native tools: Actively use each cloud's Reachability Analyzer and Connectivity Tests
- Understand security rules: Clearly distinguish Stateful (SG) vs Stateless (NACL) behavior
- Optimize costs: Use VPC Endpoints to reduce NAT Gateway costs
- Monitor proactively: Build early detection with CloudWatch alarms
Network issues are often hard to reproduce, so the most important thing is to have monitoring and logging properly configured at all times.