BGP Internet Routing Deep Dive 2025: AS, Peering, Path Selection, RPKI, Anycast — How the Internet Really Works

Intro: The Internet Is Not Actually One Network

What you call "the internet" is not a single network. It is a loose federation of 75,000+ independent networks. Each one is called an AS (Autonomous System).

  • Google = AS 15169
  • Facebook = AS 32934
  • AT&T = AS 7018
  • Cloudflare = AS 13335
  • SK Telecom = AS 9318

When you reach Google from home Wi-Fi, packets usually cross a handful of ASes. Each AS tells the next, "that prefix goes this way." This runs on BGP (Border Gateway Protocol).

Why BGP Matters

  • The entire internet routing layer depends on BGP.
  • If BGP stops for 30 minutes, the global internet effectively halts.
  • BGP mistakes have killed pieces of the internet many times.
  • 2021 Facebook outage (6 hours): BGP.
  • 2019 Cloudflare outage: Verizon's BGP error.

BGP is as essential as TCP/IP but far less known.


1. Autonomous System: Internet Building Block

An AS is a set of IP networks under a single administrative entity:

  • Unique ASN (AS Number).
  • Independent routing policy.
  • Talks to other ASes via BGP.
  • Manages one or more IP prefixes.

Examples: ISPs (Comcast, KT, NTT), hyperscalers (Google, Amazon), universities, CDNs (Cloudflare, Akamai).

ASN Allocation

IANA manages ASNs and delegates to regional RIRs: ARIN (North America), RIPE NCC (Europe, Middle East, Central Asia), APNIC (Asia-Pacific), LACNIC (Latin America and the Caribbean), AFRINIC (Africa).

  • 16-bit ASN: 0–65,535 (nearly exhausted).
  • 32-bit ASN: 0–4,294,967,295 (since 2007).

Prefix Allocation

Google prefix examples:
- 8.8.8.0/24  (DNS)
- 74.125.0.0/16
- 142.250.0.0/15

/24 = 256 addresses, /16 = 65536 addresses. Each AS advertises its prefixes via BGP.
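
The prefix sizes can be checked with Python's standard `ipaddress` module. This is a small sketch; the helper name `prefix_size` is made up for illustration.

```python
import ipaddress

def prefix_size(prefix: str) -> int:
    """Return how many addresses a CIDR prefix covers."""
    return ipaddress.ip_network(prefix).num_addresses

# The Google prefixes listed above.
for p in ("8.8.8.0/24", "74.125.0.0/16", "142.250.0.0/15"):
    print(p, prefix_size(p))
```

Each bit removed from the prefix length doubles the address count, so the /15 covers twice as many addresses as a /16.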

Current Scale (2025)

  • ASes: ~75,000+
  • IPv4 prefixes: ~1M
  • IPv6 prefixes: ~200K
  • BGP full table: ~1M routes (multiple GB of memory per router).

2. BGP Basics: Path Vector Protocol

BGP is unusual among routing protocols:

  • Path vector: each route carries the full AS path.
  • Policy-based: not shortest path.
  • TCP-based: port 179.
  • Incremental updates.
  • Stateful sessions.

BGP Messages

  • OPEN: session negotiation.
  • UPDATE: route info (NLRI + path attributes + withdrawn routes).
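  • KEEPALIVE: session liveness (sent at one third of the hold time: 60s with the common 180s hold timer, 30s with the 90s hold time suggested in RFC 4271).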
  • NOTIFICATION: error + teardown.

Path Attributes

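  • AS_PATH (well-known mandatory): ordered list of ASes the route traversed, most recent first. Example: [65001, 7018, 15169] was originated by AS 15169.
  • NEXT_HOP (well-known mandatory): IP address of the next-hop router.
  • ORIGIN (well-known mandatory): IGP / EGP / Incomplete.
  • LOCAL_PREF (well-known discretionary, carried only within an AS over iBGP): internal preference, higher is better.
  • MED (optional, non-transitive): preference among multiple entry points of the same AS, lower is better.
  • COMMUNITIES (optional, transitive): policy tags.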

BGP UPDATE Example

UPDATE from 10.0.0.1:
  Withdrawn: []
  Attributes:
    ORIGIN = IGP
    AS_PATH = [65001, 7018, 15169]
    NEXT_HOP = 10.0.0.1
    LOCAL_PREF = 100
    MED = 50
  NLRI: [8.8.8.0/24, 8.8.4.0/24]

Session States

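Idle → Connect → OpenSent → OpenConfirm → Established (exchange routes). A sixth state, Active, is entered when the outbound TCP connection attempt fails; the router then waits for the peer to connect or retries.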
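
The state progression can be sketched as a small table-driven state machine. This is a heavy simplification of the RFC 4271 FSM: the event names are hypothetical labels, timers are ignored, and any unexpected event simply drops the session back to Idle. The sketch also includes the Active state, which a session enters when its TCP connection attempt fails.

```python
from enum import Enum

class State(Enum):
    IDLE = "Idle"
    CONNECT = "Connect"
    ACTIVE = "Active"
    OPEN_SENT = "OpenSent"
    OPEN_CONFIRM = "OpenConfirm"
    ESTABLISHED = "Established"

# (current state, event) -> next state. Event names are illustrative only.
TRANSITIONS = {
    (State.IDLE, "start"): State.CONNECT,
    (State.CONNECT, "tcp_connected"): State.OPEN_SENT,
    (State.CONNECT, "tcp_failed"): State.ACTIVE,
    (State.ACTIVE, "tcp_connected"): State.OPEN_SENT,
    (State.OPEN_SENT, "open_received"): State.OPEN_CONFIRM,
    (State.OPEN_CONFIRM, "keepalive_received"): State.ESTABLISHED,
    (State.ESTABLISHED, "keepalive_received"): State.ESTABLISHED,
    (State.ESTABLISHED, "update_received"): State.ESTABLISHED,
}

def step(state: State, event: str) -> State:
    """Advance one event; anything not in the table resets to Idle."""
    return TRANSITIONS.get((state, event), State.IDLE)

def run(events, state: State = State.IDLE) -> State:
    """Feed a sequence of events through the state machine."""
    for event in events:
        state = step(state, event)
    return state

happy_path = ["start", "tcp_connected", "open_received", "keepalive_received"]
print(run(happy_path).value)
```

Routes are exchanged only in Established, which is why a session stuck in Active or Connect shows zero received prefixes.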


3. Peering and Transit

Two ways ASes connect:

Transit — customer-provider:

  • Customer pays provider for full internet reachability.
  • Provider forwards customer traffic anywhere.

Peering — equal exchange:

  • Two ASes exchange traffic directly between their customers.
  • Usually free (or low-cost).
  • Third-party traffic not included.

Tier 1, 2, 3

Tier 1 (roughly a dozen networks): peers only, buys no transit. Lumen (formerly Level 3), Cogent, AT&T, NTT, Telia, Deutsche Telekom, Tata.

Tier 2: buys some transit, peers a lot. Most ISPs.

Tier 3: mostly transit, limited peering.

Modern hyperscalers (Google, Microsoft, Facebook) built their own global backbones and sit effectively at Tier 1 level.

Peering Flavors

  • Private peering: dedicated cross-connect, high bandwidth, low latency.
  • Public peering (IXP): many ASes at one exchange — DE-CIX, LINX, AMS-IX.

4. BGP Path Selection

BGP picks by policy, not shortest path. Why: money, politics, performance, trust.

Decision process (standard order):

  1. Highest LOCAL_PREF
  2. Shortest AS_PATH
  3. Lowest ORIGIN (IGP < EGP < Incomplete)
  4. Lowest MED (from same AS)
  5. eBGP over iBGP
  6. Lowest IGP cost to NEXT_HOP
  7. Oldest route
  8. Lowest Router ID

Most decisions end at steps 1-3.
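
The decision order can be sketched as a sort key: compare routes attribute by attribute and take the first difference. This is a simplified model, not a router implementation. The `Route` fields are hypothetical, vendor-specific steps are omitted, and MED is compared across all routes here even though real routers by default compare it only between routes from the same neighboring AS.

```python
from dataclasses import dataclass

ORIGIN_RANK = {"IGP": 0, "EGP": 1, "INCOMPLETE": 2}

@dataclass(frozen=True)
class Route:
    prefix: str
    as_path: tuple
    local_pref: int = 100
    origin: str = "IGP"
    med: int = 0
    ebgp: bool = True
    igp_cost: int = 0
    age_seconds: int = 0
    router_id: str = "0.0.0.0"

def preference_key(r: Route):
    """Smaller tuple = more preferred, mirroring steps 1-8 in order."""
    return (
        -r.local_pref,                                   # 1. highest LOCAL_PREF
        len(r.as_path),                                  # 2. shortest AS_PATH
        ORIGIN_RANK[r.origin],                           # 3. lowest ORIGIN
        r.med,                                           # 4. lowest MED (simplified)
        0 if r.ebgp else 1,                              # 5. eBGP over iBGP
        r.igp_cost,                                      # 6. lowest IGP cost to NEXT_HOP
        -r.age_seconds,                                  # 7. oldest route
        tuple(int(x) for x in r.router_id.split(".")),   # 8. lowest router ID
    )

def best_path(routes):
    """Pick the preferred route among candidates for one prefix."""
    return min(routes, key=preference_key)

# Documentation ASNs (64496-64511) and a documentation prefix.
via_customer = Route("203.0.113.0/24", (64500, 64501, 64502), local_pref=200)
via_peer = Route("203.0.113.0/24", (64510,), local_pref=100)
print(best_path([via_customer, via_peer]).as_path)
```

The customer route wins despite its longer AS path, because LOCAL_PREF is compared before path length.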

LOCAL_PREF Economics

Customer route: LOCAL_PREF = 200 (revenue)
Peer route:     LOCAL_PREF = 100 (free)
Transit route:  LOCAL_PREF = 50  (cost)

AS_PATH Prepending

Repeat your own ASN to make a path look longer and less attractive:

Normal:    AS_PATH = [65001, 7018, 15169]
Prepended: AS_PATH = [65001, 65001, 65001, 7018, 15169]

Used for traffic engineering — primary/backup links, load balancing.
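
As a minimal sketch, prepending is just list concatenation on the AS_PATH before the route is advertised. The function name and arguments are illustrative.

```python
def prepend(as_path, own_asn, times):
    """Return a new AS_PATH with own_asn repeated `times` extra times in front."""
    return [own_asn] * times + list(as_path)

normal = [65001, 7018, 15169]
prepended = prepend(normal, own_asn=65001, times=2)
print(len(normal), len(prepended))
```

Prepending only influences step 2 of the decision process, so a neighbor that sets a higher LOCAL_PREF on the prepended path will still use it.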


5. iBGP vs eBGP

  • eBGP: between different ASes, adds ASN to AS_PATH, TTL=1 by default.
  • iBGP: inside same AS, does not add ASN, requires full mesh.

Full Mesh Problem

N iBGP routers need N(N-1)/2 sessions. 100 routers → 4950 sessions.

Solutions:

  • Route Reflector: central router reflects client routes to other clients.
  • Confederation: split AS into sub-ASes.
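
The session counts can be checked with a few lines of Python. The route-reflector layout below is an assumption for illustration: every client peers with every reflector, and the reflectors are fully meshed among themselves.

```python
def full_mesh_sessions(n: int) -> int:
    """iBGP sessions needed for a full mesh of n routers: n(n-1)/2."""
    return n * (n - 1) // 2

def route_reflector_sessions(n: int, reflectors: int = 2) -> int:
    """Sessions when n routers include `reflectors` RRs serving all others."""
    clients = n - reflectors
    return clients * reflectors + full_mesh_sessions(reflectors)

print(full_mesh_sessions(100))           # 4950
print(route_reflector_sessions(100, 2))  # 197
```

Full mesh grows quadratically while the reflector design grows linearly in the number of routers, which is why large networks rarely run a flat iBGP mesh.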

iBGP Next-Hop Trap

iBGP does not modify NEXT_HOP. Internal routers must be able to reach the external peer's IP, or the route is invalid. Fix: next-hop-self on the edge router.


6. Convergence and Stability

  • Local change: seconds.
  • Tier 1 change: minutes.
  • Route flap damping: ignore flapping routes for a while (now used conservatively).
  • Hold timer: default 180s, kill session on missed keepalives.
  • BFD: sub-second failure detection, independent of BGP timers.
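
The hold-timer rule can be sketched as follows, assuming timestamps in seconds and the common convention that keepalives go out at one third of the hold time. A hold time of 0 disables the check, as RFC 4271 allows. Function names are illustrative.

```python
def keepalive_interval(hold_time: int) -> int:
    """Conventional keepalive interval: one third of the hold time."""
    return hold_time // 3

def session_expired(last_message_at: float, now: float, hold_time: int = 180) -> bool:
    """True when no KEEPALIVE or UPDATE arrived within the hold time."""
    if hold_time == 0:  # hold time 0 means the timer is not used
        return False
    return now - last_message_at > hold_time

print(keepalive_interval(180))         # 60
print(session_expired(0.0, 181.0))     # True
```

With a 180s hold timer a silent failure can go unnoticed for up to three minutes, which is the gap BFD is meant to close.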

7. BGP Hijacking and RPKI

BGP is trust-based — anyone can claim "I own this prefix."

Famous Incidents

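  • 2008 Pakistan / YouTube: Pakistan Telecom announced a /24 more specific than YouTube's own prefix; the announcement leaked globally and made YouTube unreachable for roughly two hours.
  • 2017 Rostelecom: briefly announced prefixes belonging to Visa, MasterCard, and other financial companies.
  • 2018 Amazon Route 53: attackers hijacked prefixes used by Route 53 DNS servers, redirected users of a crypto wallet (MyEtherWallet) to a phishing site, and stole cryptocurrency reported at around $150,000.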

RPKI: Resource Public Key Infrastructure

  1. RIRs issue resource certificates to the holders of IP address space and ASNs.
  2. The holder creates a ROA (Route Origin Authorization) for each prefix.
  3. ROA = "AS 15169 may originate 8.8.8.0/24."
  4. Routers validate the origin of BGP announcements against ROAs.
  5. Networks that enforce validation drop announcements whose origin contradicts a ROA.

Adoption (2025)

  • ~50% of prefixes covered by ROAs.
  • Strong enforcement by Cloudflare, Google, Microsoft.

Limits

  • Validates only origin, not the full path.
  • Mid-path manipulation still possible.
  • BGPsec (full path signing) exists but has near-zero deployment.
  • Misconfigured ROAs can reject legitimate routes.

8. Anycast: One IP, Many Locations

Advertise the same IP from many places; BGP routes each user to the nearest site in routing terms, which is usually but not always the geographically closest.

1.1.1.1 (Cloudflare DNS)
Advertised from: US East, US West, London, Tokyo, ... 300+ POPs.

Benefits

  • Low latency (physically close servers).
  • Automatic failover.
  • DDoS absorbed across POPs.
  • Single endpoint for clients.

Use Cases

  • DNS (1.1.1.1, 8.8.8.8).
  • CDN (Cloudflare, Akamai, Fastly).
  • NTP, DDoS scrubbing.

Limits

  • TCP connection loss if path shifts mid-session.
  • Complex setup.
  • Hard to debug "which POP am I hitting?"
  • Better fit for UDP (DNS, QUIC).

HTTPS + anycast was historically awkward (TCP) — solved via sticky routing, and natively by QUIC / HTTP/3 with connection migration.


9. Historical BGP Outages

  • 2008 Pakistan / YouTube: accidental global leak.
  • 2010 China Telecom: for about 18 minutes announced roughly 15% of the world's prefixes, pulling some of their traffic through China.
  • 2014 Indosat: misannounced hundreds of thousands of prefixes.
  • 2019 Verizon / Cloudflare / DQE: Verizon propagated a BGP optimizer's leak unchecked. Hours of outage. Lesson: route filtering.
  • 2021 Facebook (6 h): removed itself from the internet (see quiz Q5).
  • 2023 Cloudflare: self-inflicted BGP misconfiguration outage.

Common lessons: BGP is fragile, automation amplifies mistakes, peer validation is essential, RPKI must spread, keep out-of-band management.


10. BGP Operations in Practice

Full Table vs Default Route

  • Small site: only a default from ISP, small memory, limited failover.
  • Mid-size: full table, fine-grained control, multi-homed.

Multi-homing

        Your AS
       /       \
   ISP A      ISP B

Redundancy, load balancing, negotiating leverage.

Route Filtering (Cisco example)

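ip prefix-list FILTER-IN deny 0.0.0.0/0 le 7
ip prefix-list FILTER-IN deny 0.0.0.0/0 ge 25
ip prefix-list FILTER-IN deny 10.0.0.0/8 le 32
ip prefix-list FILTER-IN deny 172.16.0.0/12 le 32
ip prefix-list FILTER-IN deny 192.168.0.0/16 le 32
ip prefix-list FILTER-IN permit 0.0.0.0/0 le 32

The first two entries reject prefixes shorter than /8 or longer than /24; the next three reject RFC 1918 space. A production filter should also block the remaining bogons and martians (unallocated, loopback, link-local, and documentation ranges).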
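
The length-based entries of the prefix-list can be mirrored in Python to test what they accept. This is a sketch for IPv4 only; the function name and default bounds are illustrative and simply restate the `le 7` / `ge 25` denies.

```python
import ipaddress

def accept_by_length(prefix: str, min_len: int = 8, max_len: int = 24) -> bool:
    """Accept an IPv4 prefix only if its length is within [min_len, max_len]."""
    net = ipaddress.ip_network(prefix)
    return min_len <= net.prefixlen <= max_len

for p in ("8.8.8.0/24", "8.8.8.0/25", "0.0.0.0/0", "10.0.0.0/8"):
    print(p, accept_by_length(p))
```

Note that 10.0.0.0/8 passes the length check, which is why bogon ranges need their own explicit deny entries.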

RPKI Validation

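! IOS-XE-style sketch; AS 65001 and the refresh interval are placeholders, and syntax varies by platform
router bgp 65001
 bgp rpki server tcp 192.0.2.10 port 323 refresh 600
!
route-map RPKI-CHECK permit 10
 match rpki valid
 set local-preference 200
route-map RPKI-CHECK permit 20
 match rpki not-found
 set local-preference 100
route-map RPKI-CHECK deny 30
 match rpki invalid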

Looking Glass

Public BGP viewers: bgp.he.net, looking-glass.nl. Check how others see your announcements.

Debug Commands

  • show bgp summary
  • show bgp ipv4 unicast
  • show bgp ipv4 unicast 8.8.8.0/24
  • show bgp ipv4 unicast neighbors 10.0.0.1 received-routes

11. Modern BGP Trends

  • SDN + BGP: BGP as an SDN control-plane protocol; BGP Flowspec for distributing traffic-filtering rules.
  • Segment Routing: SR-MPLS, SRv6; distributed via BGP-LU and BGP-LS.
  • BGP EVPN: advertise MAC addresses via BGP; VXLAN for L2 over L3; standard in modern data centers.
  • Hyperscaler backbones: Google B4, Facebook Express Backbone, Azure Network — bypass the public internet.
  • CDN edges: hundreds of anycast POPs.

Quiz

Q1. Why does BGP use "policy-based" routing instead of "shortest path"?

A. IGPs (OSPF, IS-IS) operate within a single administrative domain where shortest path is adequate. BGP connects tens of thousands of independently run ASes with conflicting interests.

Economics first: transit costs money; peering is free. An operator prefers the cheaper path even if longer. LOCAL_PREF encodes this — customer (200) > peer (100) > transit (50).

Performance vs hop count: 3 congested hops can be worse than 5 uncongested ones. Policy lets you override naive metrics.

Trust and politics: operators avoid suspect ASes, bypass hostile jurisdictions, comply with regulation.

Traffic engineering: AS_PATH prepending, MED, communities balance load and build primary/backup paths.

Valley-free routing: peering contracts say "only advertise my customer routes to you" — pure shortest path cannot express this.
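
The valley-free export rule is small enough to write down directly. This sketch assumes three relationship labels ("customer", "peer", "provider") and the conventional rule that customer routes go to everyone while peer and provider routes go only to customers.

```python
def should_export(learned_from: str, send_to: str) -> bool:
    """Decide whether a route learned from one neighbor type may be
    advertised to another neighbor type under valley-free policy."""
    if learned_from == "customer":
        return True               # customer routes earn revenue: tell everyone
    return send_to == "customer"  # peer/provider routes go only to customers

for src in ("customer", "peer", "provider"):
    for dst in ("customer", "peer", "provider"):
        print(f"{src:>8} -> {dst:<8} {should_export(src, dst)}")
```

The rule depends on who the neighbors are rather than on any path metric, which is why a shortest-path algorithm cannot express it.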

BGP is "the internet is politics." It finds an economically and politically acceptable path, not a technically optimal one. That is a feature: 75,000 independent organizations only cooperate because each is allowed to encode its own interests.

Q2. What separates peering from transit, and why do Tier 1 ISPs only peer?

A. Transit = customer pays provider for full internet reachability. Peering = two ASes exchange traffic between their own customers, usually free.

Transit (rough): $500–2000/mo per Gbps in metro, much more in secondary markets. Peering (IXP): $500–5000/mo for the port; private peering adds a cross-connect fee (~$200/mo); traffic is free.

Tier 1 is defined as an AS that reaches the whole internet without buying transit. About 12–20 networks qualify, depending on who is counting: AT&T, Verizon, Lumen (Level 3), Cogent, NTT, Telia, Deutsche Telekom, Tata.

They peer only because:

  1. Buying transit would disqualify them by definition.
  2. Settlement-free peering enforces mutual equality; paying would imply subordination.
  3. They already reach everyone via their own backbones and customers.
  4. Their business is selling transit to others — buying transit would cut margin.

Peering requires roughly symmetric traffic, geography, and scale. Asymmetric peering requests (Netflix pushing vastly more than it pulls) historically get rejected or converted to paid peering — like Netflix / Comcast 2014, which triggered the net neutrality debate.

Tier 1 relationships are brittle. Cogent vs Level 3 (2005): Level 3 depeered Cogent, and for several days customers connected only to one of the two networks could not reach the other. "One internet" depends on political agreement.

Hyperscalers (Google, Facebook, Netflix, Amazon) now run their own global networks and effectively act as Tier 1 peers. The power structure has shifted from classic ISPs to content networks.

Q3. How does BGP hijacking work, and how does RPKI defend against it?

A. BGP was designed in the 1980s on implicit trust. It does not verify that an advertised prefix really belongs to the originator.

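Prefix hijacking: attacker announces someone else's prefix. Because routers forward on the longest matching prefix, a more specific announcement wins regardless of BGP attributes: in 2008, Pakistan Telecom's 208.65.153.0/24 beat YouTube's 208.65.152.0/22. (An 8.8.8.0/25 would beat Google's 8.8.8.0/24 in the same way, but most networks filter IPv4 prefixes longer than /24.) Traffic now flows to the attacker.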
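
Why the more specific prefix wins can be shown with a toy forwarding lookup. This is a sketch, not a real FIB: the table maps prefixes to illustrative labels, and the lookup picks the longest prefix that contains the destination.

```python
import ipaddress

def lookup(table: dict, dst: str):
    """Longest-prefix match: return the entry for the most specific
    prefix containing dst, or None if nothing matches."""
    addr = ipaddress.ip_address(dst)
    matches = [
        (ipaddress.ip_network(prefix), owner)
        for prefix, owner in table.items()
        if addr in ipaddress.ip_network(prefix)
    ]
    if not matches:
        return None
    return max(matches, key=lambda m: m[0].prefixlen)[1]

# Labels are illustrative; the /25 stands in for a hijacker's announcement.
table = {
    "8.8.8.0/24": "legitimate origin",
    "8.8.8.0/25": "hijacker",
}
print(lookup(table, "8.8.8.8"))    # inside the /25
print(lookup(table, "8.8.8.200"))  # only covered by the /24
```

Addresses in the lower half of the /24 go to the hijacker even though the legitimate route is still present, because prefix length is checked before any BGP attribute matters.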

AS path manipulation: attacker forges an AS_PATH to impersonate a relationship with the victim AS.

Route leak: accidentally relaying a peer's routes to a transit provider (2019 Verizon).

Incidents: 2008 Pakistan/YouTube, 2018 Amazon Route 53 crypto-wallet theft, 2021 Twitter.

RPKI (2011):

  1. RIRs issue resource certificates to the holders of prefixes and ASNs.
  2. Owners publish ROAs: signed statements "AS X may originate prefix Y up to max-length N."
  3. Validators fetch the RPKI repository and feed routers.
  4. Routers check each BGP UPDATE against ROAs: Valid / Invalid / NotFound.
  5. Invalid routes are dropped.
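
The Valid / Invalid / NotFound check can be sketched in a few lines, following the origin-validation logic of RFC 6811. The `ROA` class and `validate` function are illustrative names; a real validator also verifies the cryptographic signatures, which are skipped here.

```python
import ipaddress
from dataclasses import dataclass

@dataclass(frozen=True)
class ROA:
    prefix: str      # authorized prefix
    max_length: int  # longest prefix length the origin may announce
    asn: int         # authorized origin AS

def validate(prefix: str, origin_asn: int, roas) -> str:
    """Classify an announcement as Valid, Invalid, or NotFound."""
    net = ipaddress.ip_network(prefix)
    covered = False
    for roa in roas:
        roa_net = ipaddress.ip_network(roa.prefix)
        # A ROA covers the announcement if the announced prefix sits inside it.
        if net.version != roa_net.version or not net.subnet_of(roa_net):
            continue
        covered = True
        if roa.asn == origin_asn and net.prefixlen <= roa.max_length:
            return "Valid"
    # Covered by at least one ROA but matched by none: Invalid.
    return "Invalid" if covered else "NotFound"

roas = [ROA("8.8.8.0/24", 24, 15169)]
print(validate("8.8.8.0/24", 15169, roas))  # Valid
print(validate("8.8.8.0/24", 64512, roas))  # Invalid: wrong origin
print(validate("8.8.8.0/25", 15169, roas))  # Invalid: longer than max_length
print(validate("1.1.1.0/24", 13335, roas))  # NotFound: no covering ROA
```

The third case shows why max-length matters: even the legitimate origin cannot announce a more specific prefix than the ROA allows, which blunts more-specific hijacks.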

Limits:

  • Origin-only — middle-path tampering still possible.
  • ~50% of prefixes covered in 2025.
  • Misconfigured ROAs can reject legitimate routes (Cloudflare 2019).

BGPsec would sign the full path but has near-zero deployment because of CPU and memory cost and the need for universal upgrade.

Complementary defenses: IRR-based filtering, BGP monitoring (BGPStream, RIPE RIS, Cloudflare Radar), MANRS (800+ participating ASes committing to best practices).

Lesson: BGP will remain partially insecure for a long time. Defense in depth — encryption, authentication, active monitoring — is mandatory for anything important.

Q4. How does Anycast deliver "one IP, many locations"?

A. Advertise the same prefix from multiple sites over BGP. Each client's ISP picks one route via normal BGP path selection — usually to the nearest site by AS hops.

Cloudflare: 1.1.1.0/24 advertised from 300+ POPs via the same AS 13335.
Korean user → Tokyo POP
UK user     → London POP
US user     → Virginia POP

BGP "closeness" is AS-hop based, not physical distance. Peering choices of the local ISP can put a user on a surprising POP.
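
A toy model of that selection: given the AS paths one ISP has learned toward each POP, pick the POP with the shortest path. The POP names and ASNs other than 13335 are made up (documentation ASNs), and a real router would apply LOCAL_PREF and the rest of the decision process first.

```python
def pick_pop(paths_by_pop: dict) -> str:
    """Return the POP whose advertised AS path is shortest.
    Ties go to the first POP in insertion order."""
    return min(paths_by_pop, key=lambda pop: len(paths_by_pop[pop]))

# Hypothetical view from an ISP in Seoul: the geographically closest POP
# is reachable only through an extra transit AS.
seen_from_seoul = {
    "Seoul": [64500, 64501, 13335],
    "Tokyo": [64502, 13335],
}
print(pick_pop(seen_from_seoul))
```

Here the user lands on Tokyo despite a POP in the same city, purely because of how the local ISP is connected.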

Great for: DNS (1.1.1.1, 8.8.8.8), CDNs, DDoS absorption, NTP, QUIC/HTTP/3.

TCP historically struggled: path shifts mid-connection → RST. Solutions: sticky routing (short-lived flows tolerate it), session sync (rare, expensive), L4 load balancers behind the anycast VIP, or moving to QUIC with connection IDs that survive IP changes.

Cloudflare's design: every customer behind the same anycast IP range, every POP serves every customer, adding a POP auto-balances. Effect: sub-50ms global latency, automatic failover, DDoS distributed across the edge.

Tradeoffs: non-deterministic routing (hard to debug), needs BGP expertise, requires replicated data at every POP, international routing can look weird.

Anycast is a small idea with disproportionate consequences: "advertise from many places" lets a protocol designed in the late 1980s power modern global services. Users see 1.1.1.1; BGP quietly chooses the best POP for them.

Q5. What really caused Facebook's 6-hour outage in October 2021?

A. 2021-10-04 15:40 UTC: a Facebook engineer ran a command meant to assess backbone capacity. A bug in the audit tool that should have blocked it let the command through, and it took down every backbone connection between Facebook's data centers.

The DNS servers had a safety feature: if they detect they are isolated from the backbone, they withdraw their own BGP announcements so stale responses don't leak. Normally good; here every DNS server withdrew simultaneously. Facebook, Instagram, WhatsApp, Messenger all became unresolvable.

Cascading failures:

  1. Internal tools (VPN, SSO, mail, Workplace) all lived on facebook.com — engineers couldn't log in remotely.
  2. Data-center badge readers depended on Facebook's auth API. Staff couldn't physically enter either.
  3. Engineers coordinated on Twitter and Discord.

Recovery required on-site engineers, manual router reconfiguration, and staged BGP re-advertisement to avoid a traffic storm. Full restore at 21:05 UTC — about 6 hours.

Facebook's post-mortem improvements: stricter config validation, out-of-band management independent of BGP, gradual DNS withdraw, badge systems decoupled from the main service auth, internal tools independent of the public domain.

Industry lessons:

  1. Blast radius — one command should not destroy an entire organization.
  2. Dependency cycles — if your fix tools depend on your broken service, you can't fix it.
  3. BGP == existence — with BGP down, physically healthy servers are invisible to the internet.
  4. Change is the cause — most major outages trace to a recent change.
  5. Post-mortem culture — Facebook publishing the root cause publicly is what lets the industry learn.

Similar patterns: AWS S3 2017 (typo), Slack 2022 (network config), Google Cloud 2020 (auth). The pattern repeats: automation amplifies small mistakes into global incidents.

BGP is simultaneously the most essential and one of the most fragile layers of the internet.


Closing

Summary

  1. AS: internet building blocks — 75,000+.
  2. BGP: policy-based path vector protocol.
  3. Peering vs Transit: the economic structure of the internet.
  4. Path Selection: many-stage decision process.
  5. RPKI: cryptographic origin validation.
  6. Anycast: BGP-powered global distribution.
  7. Hijacking: consequence of trust-based BGP.
  8. Outages: BGP mistakes have outsized impact.

What BGP Teaches

BGP is an extreme case of distributed systems:

  • 75,000 independent organizations voluntarily cooperating.
  • No central authority.
  • Economics and politics dominate technology.
  • Convergence without a consensus algorithm.
  • Security bolted on after the fact.

That it works at all after 35+ years is remarkable. Understanding BGP means understanding that "the internet" is tens of thousands of contracts, a dozen or so Tier 1 networks, hundreds of IXPs, over a million prefixes, and thousands of BGP updates per second, held together by agreement, not by design.

Next time a packet lands on your screen, remember each hop involved money, politics, and policy. BGP is what makes that miracle routine — and also what makes it fragile.

