Internet Archive & Digital Preservation & Web Archiving 2026 — Wayback Machine / archive.today / Conifer / Browsertrix / WARC / Perma.cc / NDL WARP / OASIS Deep Dive
Author: Youngju Kim (@fjvbn20031)
"Universal access to all knowledge. If we cannot keep all of it, at least let us keep a copy before it disappears." — Brewster Kahle, founder of the Internet Archive, 2019 TED Talk
The web has a remarkably short lifespan. A 2014 Harvard Law School study reported that roughly half of the URLs cited in US Supreme Court opinions no longer led to the originally cited material, and a 2024 Pew Research Center report found that about 38% of pages that existed in 2013 were no longer accessible by 2023. This is called linkrot, and it means the web we cite and share every day is quietly losing several percent of itself each year, permanently.
As of May 2026, the digital preservation ecosystem fighting linkrot is richer than ever. Internet Archive, founded by Brewster Kahle in 1996, holds 935B+ pages in its Wayback Machine and effectively acts as the world's global backup. archive.today (archive.ph) is the anonymous, immediate-capture archive that has become a staple for journalists. The Webrecorder family (Conifer, Browsertrix Crawler, Browsertrix Cloud, Replay.web.page) is pushing a new model based on high-fidelity, interaction-aware archives. The Hachette v. Internet Archive ruling of 2024 dealt a serious blow to IA's Controlled Digital Lending program, but also forced the community to rethink how digital preservation is governed.
This post surveys the May 2026 state of digital preservation across six axes — global nonprofit / anonymous / Webrecorder / government / academic / self-hosted — and ends with practical strategies for individuals, researchers and libraries living in the linkrot era.
1. The 2026 digital preservation map — global / government / self-hosted / academic
Here's an at-a-glance categorization of the major preservation tools and services.
| Category | Flagship projects | Operated by | Primary users |
|---|---|---|---|
| Global nonprofit | Internet Archive (Wayback Machine), archive.today, Common Crawl | Nonprofit / anonymous | Everyone |
| Webrecorder family | Conifer, Browsertrix Crawler, Browsertrix Cloud, Replay.web.page | Webrecorder Software (Ilya Kreymer) | Journalists / researchers / curators |
| Government / national libraries | Library of Congress Web Archives (LCWA), Japan's NDL WARP, Korea's OASIS, UK Web Archive | National libraries | Governments / scholars |
| Academic / legal | Perma.cc (Harvard Law), Permanent.org | Nonprofits and consortia | Lawyers / scholars / individuals |
| Self-hosted | ArchiveBox, SingleFile, Pywb, Wallabag | Open source | Developers / library IT |
| Standards / infrastructure | WARC (ISO 28500), WACZ, Heritrix, Pywb | IIPC / Webrecorder | Operators |
This is not just taxonomy — it captures a real governance difference. IA is a US nonprofit funded by donations and library partners. archive.today is run by an anonymous operator on donations. National libraries are tax-funded and operate under legal-deposit law. Perma.cc is a Harvard-led academic consortium. ArchiveBox is open source that anyone can self-host.
The key insight of 2026 is that the question is no longer "which one should I trust" but "how do I keep copies in several different places at once" — the LOCKSS (Lots Of Copies Keep Stuff Safe) principle. The Hachette ruling cast doubt over IA's future and made the case for distributed, governance-diverse preservation stronger than ever.
2. Internet Archive — Brewster Kahle, founded 1996
Internet Archive is a nonprofit digital library founded by Brewster Kahle in San Francisco in May 1996. Kahle came out of Thinking Machines, created WAIS (Wide Area Information Servers) in 1989, then founded Alexa Internet and sold it to Amazon in 1999 for about $250M, plowing part of the proceeds into building IA in earnest.
As of May 2026, the scale is enormous.
- Total data: approximately 866PB+ (petabytes) — the largest single nonprofit digital archive in the world
- Wayback Machine: 935B+ pages (billion pages)
- Books: 42 million digitized (includes Open Library)
- Audio: 20 million items (including the Live Music Archive and Grateful Dead collections)
- Video: 10 million items (TV News Archive, films, academic lectures)
- Software: 1 million+ items (DOS, Mac OS Classic, game ROMs, MAME emulations playable in-browser)
- Images: 5 million+
IA replicates data across five data centers — San Francisco HQ, Richmond, Petaluma, Vancouver, and Amsterdam — onto its own petabyte-scale Petabox nodes, with open-source Hadoop, Solr and Elasticsearch driving search.
Annual operating cost runs around $35M–$40M, with more than 90% from individual donations and library/archive partner dues. Kahle calls IA the "digital Library of Alexandria," and the mission is free access to all of it.
The five most-used entry points on archive.org are:
- web.archive.org — Wayback Machine for web snapshots
- archive.org/details/ — collections and item pages
- openlibrary.org — Open Library catalog and lending
- scholar.archive.org — academic search
- archive.org/details/software — software and games (playable directly in the browser via emulators)
3. Wayback Machine — a time machine for 935B+ pages
Wayback Machine is the web-snapshot search-and-replay service within the Internet Archive. Crawling started in 1996 and the public interface opened in 2001. The name is a reference to the WABAC time machine from the Peabody and Sherman cartoons.
Basic usage is simple. At web.archive.org, you enter a URL and a calendar view shows you every time the page was captured. Each dot is one snapshot; clicking it replays the page as it looked at that moment.
The internal pipeline looks like this.
```text
[Crawler — Heritrix / Save Page Now]
        |
        v
[WARC files (~tens of GB per day)]
        |
        v
[CDX index — URL + timestamp + offset]
        |
        v
[Pywb replay engine] <- user request
        |
        v
[Replayed page delivered to the client]
```
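The CDX step in the pipeline above can be pictured as a sorted index that is searched for the capture closest to the requested moment. The sketch below is only an illustration of that idea: the row layout `(url, timestamp, warc_file, offset)` and the sample data are assumptions made for the example, and real CDX indexes carry more fields and use a canonicalized URL key.

```python
from bisect import bisect_left
from datetime import datetime

# Hypothetical, simplified index rows: (url, 14-digit timestamp, WARC file, byte offset).
INDEX = sorted([
    ("example.com/", "20200101000000", "a.warc.gz", 0),
    ("example.com/", "20230615120000", "b.warc.gz", 4096),
    ("example.com/", "20260301080000", "c.warc.gz", 1024),
    ("example.org/", "20210101000000", "a.warc.gz", 9000),
])


def parse_ts(timestamp):
    """Convert a YYYYMMDDhhmmss string into a datetime."""
    return datetime.strptime(timestamp, "%Y%m%d%H%M%S")


def closest_capture(index, url, timestamp):
    """Return the row for `url` whose capture time is nearest to `timestamp`, or None."""
    keys = [(row[0], row[1]) for row in index]
    i = bisect_left(keys, (url, timestamp))
    # Only the rows just before and just after the insertion point can be the nearest.
    neighbours = [
        index[j] for j in (i - 1, i)
        if 0 <= j < len(index) and index[j][0] == url
    ]
    if not neighbours:
        return None
    target = parse_ts(timestamp)
    return min(neighbours, key=lambda row: abs(parse_ts(row[1]) - target))
```

The returned file name and offset are what a replay engine would then use to seek straight to one record inside a large WARC file instead of scanning it.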
The URL pattern is highly consistent.
```text
https://web.archive.org/web/[YYYYMMDDhhmmss]/[original URL]
https://web.archive.org/web/2026*/https://example.com    # calendar of all 2026 captures
https://web.archive.org/web/2/https://example.com        # redirects to the most recent capture
```
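Because the pattern is this regular, snapshot URLs can be built and taken apart with a few lines of code. The helper names below (`snapshot_url`, `calendar_url`, `parse_snapshot_url`) are made up for this sketch and are not part of any Internet Archive library.

```python
import re
from datetime import datetime

WAYBACK = "https://web.archive.org/web"
SNAPSHOT_RE = re.compile(r"^https?://web\.archive\.org/web/(\d{14})/(.+)$")


def snapshot_url(original_url, when):
    """URL asking the Wayback Machine for the capture nearest to `when` (UTC)."""
    return f"{WAYBACK}/{when.strftime('%Y%m%d%H%M%S')}/{original_url}"


def calendar_url(original_url, year):
    """URL listing every capture of `original_url` made in `year`."""
    return f"{WAYBACK}/{year}*/{original_url}"


def parse_snapshot_url(url):
    """Split a snapshot URL into (capture datetime, original URL)."""
    match = SNAPSHOT_RE.match(url)
    if match is None:
        raise ValueError(f"not a Wayback snapshot URL: {url}")
    timestamp, original = match.groups()
    return datetime.strptime(timestamp, "%Y%m%d%H%M%S"), original
```

For example, `snapshot_url("https://example.com", datetime(2026, 5, 16, 10, 0, 0))` yields a URL with the timestamp `20260516100000` in the path.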
Notable features as of May 2026 include:
- 935B+ pages — cumulative since 1996 (crossed 900B in August 2025)
- TimeTravel API — federated lookup across other archives (Library of Congress, UK Web Archive, etc.) via the Memento protocol (RFC 7089)
- Save Page Now (SPN) — user-initiated immediate capture (next chapter)
- Wayback Machine Chrome extension — auto-redirect from broken links to nearest snapshot
- Brozzler — IA's headless-Chromium crawler used alongside Heritrix for JS-heavy sites
- CDX Server API — direct index queries, popular with researchers
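The Memento protocol mentioned in the list works through ordinary HTTP content negotiation: the client sends an `Accept-Datetime` header to a TimeGate, which redirects to the capture nearest that moment. The sketch below only builds such a request and does not send it. The TimeGate pattern `https://web.archive.org/web/<original URL>` is an assumption based on the Wayback Machine's Memento support, and `timegate_request` is a hypothetical helper name.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
from urllib.request import Request


def timegate_request(original_url, when):
    """Build (but do not send) a Memento TimeGate request for `original_url`."""
    if when.tzinfo is None:
        raise ValueError("`when` must be timezone-aware")
    # RFC 7089 expects an HTTP-date in GMT, e.g. "Wed, 01 Jan 2020 00:00:00 GMT".
    http_date = format_datetime(when.astimezone(timezone.utc), usegmt=True)
    timegate = f"https://web.archive.org/web/{original_url}"  # assumed TimeGate pattern
    return Request(timegate, headers={"Accept-Datetime": http_date})
```

Passing the result to `urllib.request.urlopen` would follow the redirect to the selected capture; the same header works against any other Memento-compliant archive.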
A typical researcher query against the CDX API:
```bash
# All capture metadata for a domain
curl "https://web.archive.org/cdx/search/cdx?url=example.com/*&output=json&limit=100"
# Only captures since 2020
curl "https://web.archive.org/cdx/search/cdx?url=example.com&from=20200101&to=20260101&output=json"
```
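With `output=json` the CDX server returns a list of rows whose first row holds the field names, so a little post-processing makes the result easier to work with. This is a minimal sketch: `cdx_query_url` and `rows_to_dicts` are hypothetical helper names, and the sample row in the usage note is made up.

```python
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def cdx_query_url(url, **params):
    """Build a CDX query URL; extra keyword arguments become query parameters."""
    query = {"url": url, "output": "json", **params}
    return f"{CDX_ENDPOINT}?{urlencode(query)}"


def rows_to_dicts(rows):
    """Turn the JSON output (first row = field names) into a list of dicts."""
    if not rows:
        return []
    header, *captures = rows
    return [dict(zip(header, capture)) for capture in captures]
```

Since `from` is a Python keyword, date filters have to be passed as `cdx_query_url("example.com", **{"from": "20200101", "to": "20260101"})`. After fetching and decoding the response, `rows_to_dicts(rows)[0]["timestamp"]` gives the capture time of the first hit.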
4. Hachette v. Internet Archive 2024 — a serious blow to IA
When COVID-19 closed schools and libraries in March 2020, IA started a temporary program called the National Emergency Library (NEL). Normally, IA practiced Controlled Digital Lending (CDL) — one digital loan per physical copy owned — but during NEL it briefly removed the per-copy cap, arguing that the emergency required it.
In June 2020, four major publishers — Hachette Book Group, HarperCollins, John Wiley, and Penguin Random House — sued IA for copyright infringement (Hachette v. Internet Archive, 1:20-cv-04160, S.D.N.Y.). Two questions were at the center.
- Is CDL itself lawful? IA argued that the 1-to-1 ratio between owned physical and lent digital copies fits the first-sale doctrine. The publishers argued digital duplication itself is a new infringement.
- Was the NEL waiver fair use? IA argued public-health and educational necessity. The publishers argued it was straightforward unauthorized copying.
In March 2023, Judge John G. Koeltl granted summary judgment for the publishers, ruling CDL was not fair use. On September 4, 2024, the US Second Circuit Court of Appeals affirmed, finalizing IA's loss.
The core reasoning was that IA's digital lending directly competed with the publishers' e-book market and was not transformative. The monetary terms were resolved in a confidential negotiated judgment rather than a public damages award, but IA still lost the following.
- About 500,000 digital books removed from the lending catalog (starting late 2023)
- A US legal precedent against CDL that chilled similar programs at other libraries
- Direct pressure on operating funds — IA runs on roughly $40M a year, while settlement exposure was several multiples of that
After the ruling IA scaled back lending and shifted weight to web archiving, software preservation and scholarly materials. A separate music-industry lawsuit (UMG v. Internet Archive, tied to the "Great 78 Project") remained active through 2024, leaving IA's future uncertain.
The lesson is clear. No single nonprofit can carry all of digital preservation. LOCKSS, multi-archive backups, and a mix of government/academic/anonymous governance matter more than ever.
5. archive.today (archive.ph / archive.is) — the anonymous archive
archive.today is a web-snapshot service started in 2012 by an anonymous operator. The same site is mirrored at archive.today, archive.ph, archive.is, archive.li, archive.fo and a few other domains. The operator does not disclose their identity (IPs trace to the Czech Republic).
The three biggest differences from IA are:
- Non-cooperation with DMCA — IA honors robots.txt and publisher requests; archive.today, anonymously operated overseas, effectively preserves everything forever.
- Static post-render snapshots — a headless browser renders the page first, then stores both the rendered HTML and a full-page PNG screenshot.
- Instant capture + permanent short URLs — a 5–6 character code like `archive.ph/abc12` for permanent citation.
Among journalists and the OSINT community, archive.today is effectively standard. Citing a news site that may edit or delete the article, capturing a politician's tweet that may disappear, or quoting a paywalled piece — all of these now routinely come with an archive.today URL.
Usage is dead simple.
```text
# Save
https://archive.ph/?url=https://example.com/article
# Or directly
https://archive.ph/https://example.com/article
```
The snapshot page shows the original URL, capture timestamp, short URL, and links to other snapshots. You can download both the rendered HTML and the full-page PNG.
The limitations:
- Weak search indexing — no broad full-text search like Wayback Machine
- No real API — automation relies on scraping
- Single-operator dependency — everything rides on one anonymous person, a SPOF risk
- Variable speed — under load, captures queue or are rejected
Even so, the "archive that does not honor takedown requests" niche is unique, and archive.today complements rather than competes with the Wayback Machine.
6. Save Page Now — fast, on-demand archiving
Save Page Now (SPN) is IA's user-initiated immediate archiving feature. Anyone can submit a URL for capture into Wayback Machine, and the v2 release in 2019 added options to capture outbound links, attachments and embedded media along with the target page.
Three entry points:
- Web UI — `web.archive.org/save`
- Bookmarklet — a one-click JavaScript bookmark
- Official Chrome / Firefox / Safari extension — right-click "Save Page Now"
For automation, use the SPN2 API. After getting a key, one POST queues the job.
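```bash
# The Accept header asks for a JSON response (which carries the job_id) instead of HTML
curl -X POST "https://web.archive.org/save/" \
  -H "Accept: application/json" \
  -H "Authorization: LOW <access_key>:<secret>" \
  -d "url=https://example.com/article&capture_all=1"
```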
The response includes a job_id and you can poll /save/status/<job_id>. Most jobs finish in 10–60 seconds; JS-heavy pages can take 2–3 minutes.
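The submit-then-poll flow can be sketched with the standard library alone. This is a sketch under stated assumptions, not a reference client: the function names are hypothetical, and it assumes the status endpoint returns JSON with a `status` field that reads `pending` until the job has finished. The polling loop takes the status fetcher as a parameter so it can be exercised without network access.

```python
import json
import time
from urllib.parse import urlencode
from urllib.request import Request, urlopen

SAVE_ENDPOINT = "https://web.archive.org/save"


def build_save_request(url, access_key, secret):
    """Build the POST request that queues a capture job."""
    body = urlencode({"url": url, "capture_all": "1"}).encode()
    headers = {
        "Accept": "application/json",
        "Authorization": f"LOW {access_key}:{secret}",
    }
    return Request(SAVE_ENDPOINT, data=body, headers=headers, method="POST")


def fetch_status(job_id, access_key, secret):
    """Fetch the current status of a capture job (performs a network call)."""
    req = Request(
        f"{SAVE_ENDPOINT}/status/{job_id}",
        headers={"Accept": "application/json",
                 "Authorization": f"LOW {access_key}:{secret}"},
    )
    with urlopen(req) as resp:
        return json.load(resp)


def wait_for_capture(get_status, interval=5.0, max_polls=36, sleep=time.sleep):
    """Call get_status() until the job is no longer pending, then return the result."""
    for _ in range(max_polls):
        status = get_status()
        if status.get("status") != "pending":  # assumed field name and value
            return status
        sleep(interval)
    raise TimeoutError("capture still pending after polling limit")
```

In real use, `get_status` would be `lambda: fetch_status(job_id, key, secret)`; with the default settings the loop gives up after about three minutes, in line with the timings quoted above.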
A common journalistic pattern: capture via SPN before citing an article, then publish both the original URL and the web.archive.org URL side by side. If the source is later edited or deleted, the citation still resolves.
Since 2023, the average time from SPN capture to Wayback Machine index visibility is under 5 minutes, meaning "see a piece, capture it, tweet it" is realistic.
7. Conifer (formerly Webrecorder.io) — interactive archives
Conifer is the service rebranded from Webrecorder.io in 2020. Users browse a site through it, and every interaction is recorded into a WARC. Behind it is Ilya Kreymer, who created Pywb at IA, joined Rhizome (a New York digital-art preservation nonprofit), then spun out Webrecorder Software.
Crawler-based archiving had clear limits.
- Difficult to capture login-gated content
- Only partial capture of JS-heavy SPAs, infinite scroll, dynamic loaders
- User-triggered modals and dropdowns never captured
- Paywalled and subscription content out of reach
Conifer addresses these by letting a human drive a browser while recording all traffic into a WARC. The flow:
- Sign up at conifer.rhizome.org, create a new collection
- Click "Start Recording" — a proxied browser opens in a new tab
- Browse normally — log in, scroll, click, open modals
- All network traffic is silently saved to a WARC in the background
- Stop recording — the collection is now permanent, replayable on demand
Free tier is 5GB, paid tiers go to 100GB+. Digital art, interactive fiction and interactive data visualization — anywhere the code and interaction are the work — benefit most. MoMA, Rhizome's ArtBase, and the British Library all use Conifer for digital-art preservation.
The limit is clear: it does not scale. A 10-page news site is fine; a 10,000-page wiki is not. Browsertrix in the next chapter is the response.
8. Browsertrix Crawler + Browsertrix Cloud — automated high-fidelity crawling
The Webrecorder family's second tool is Browsertrix, which automates Conifer's human-driven approach with a Puppeteer-controlled headless browser.
| Product | Form factor | License |
|---|---|---|
| Browsertrix Crawler | Docker image and CLI | AGPL-3.0 |
| Browsertrix Cloud | SaaS wrapper around Browsertrix Crawler | Paid / nonprofit free tier |
A basic Browsertrix Crawler run:
```bash
docker run -v "$PWD/crawls:/crawls" \
  -it webrecorder/browsertrix-crawler crawl \
  --url https://example.com \
  --scopeType domain \
  --depth 3 \
  --behaviors autoscroll,autoplay,autofetch,siteSpecific \
  --generateWACZ \
  --collection my-crawl
```
Key flags:
- `--url` — start URL (can be repeated)
- `--scopeType` — `page`, `prefix`, `host`, `domain`, `any` — how far to follow links
- `--depth` — how many hops from the seed
- `--behaviors` — site-specific simulations (autoscroll, autoplay, infinite scroll, etc.)
- `--generateWACZ` — package the output as WACZ
- `--profile` — apply a pre-built browser profile (e.g. with login cookies)
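To make the scope types concrete, here is one way the decision "should this link be followed?" could be expressed. This is an illustration of what the scope names suggest, not Browsertrix Crawler's actual implementation: the `in_scope` function is hypothetical, and the `domain` rule is a simplification that treats any subdomain of the seed host (minus a leading `www.`) as in scope.

```python
from urllib.parse import urlsplit


def in_scope(seed, candidate, scope_type):
    """Decide whether `candidate` would be followed from `seed` under a scope type."""
    s, c = urlsplit(seed), urlsplit(candidate)
    seed_host = s.hostname or ""
    cand_host = c.hostname or ""
    if scope_type == "any":
        return True
    if scope_type == "page":
        # Only the seed page itself.
        return candidate == seed
    if scope_type == "prefix":
        # Everything under the seed's directory, e.g. /blog/ for /blog/index.html.
        directory = s.path[: s.path.rfind("/") + 1] or "/"
        return candidate.startswith(f"{s.scheme}://{s.netloc}{directory}")
    if scope_type == "host":
        return cand_host == seed_host
    if scope_type == "domain":
        base = seed_host[4:] if seed_host.startswith("www.") else seed_host
        return cand_host == base or cand_host.endswith("." + base)
    raise ValueError(f"unknown scope type: {scope_type}")
```

The practical takeaway is that `host` keeps a crawl on one hostname while `domain` also pulls in subdomains such as a CDN or a blog host, which can enlarge a crawl considerably.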
The real differentiator is --behaviors. Browsertrix ships site-specific heuristics for Twitter/X, YouTube, Instagram, Facebook, Medium and others — "Twitter only loads the full timeline after you scroll to the end", "YouTube needs autoplay enabled to capture video content", and so on.
Browsertrix Cloud is the SaaS form. After a 2024 beta it shipped GA in 2025. As of May 2026 the following organizations run on it.
- Stanford Libraries — English literature and arts digital collections
- The New York Times R&D — internal news-article archive
- Internet Archive — selected curated collections
- Bibliothèque nationale de France — French cultural-heritage sites
Pricing combines free nonprofit/educational tiers with usage-based paid plans, but you can always download the raw WACZ — no vendor lock-in.
9. Replay.web.page + WACZ — the rise of a new format
The third pillar of the Webrecorder stack is the WACZ format and Replay.web.page.
WACZ (Web Archive Collection Zipped), proposed by Webrecorder in 2021, is effectively "WARC files inside a ZIP, plus an index and metadata." Layout:
```text
my-collection.wacz (ZIP container)
|-- archive/
|   |-- data-001.warc.gz
|   |-- data-002.warc.gz
|-- indexes/
|   |-- index.cdx.gz           # CDXJ index
|   |-- index.idx              # secondary index
|-- pages/
|   |-- pages.jsonl            # page list and metadata
|-- metadata.yaml              # collection metadata
|-- datapackage.json           # Frictionless Data Package standard
|-- datapackage-digest.json    # SHA-256 hashes
```
WARC is a low-level format that just records HTTP request/response sequences. WACZ wraps it with standardized metadata, signatures, a page list, and indexes so an entire collection becomes a single distributable, verifiable file. WACZ supports a detached cryptographic signature for tamper detection.
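Because a WACZ is an ordinary ZIP file, the "verifiable" part can be checked with the standard library. The sketch below assumes that `datapackage.json` lists each file under `resources` with a `path` and a `hash` of the form `sha256:<hex digest>`. It does not check the optional signature, and `make_demo_wacz` builds a toy archive that exists only for this example.

```python
import hashlib
import io
import json
import zipfile


def verify_wacz(wacz_file):
    """Return {path: bool} telling whether each listed resource matches its recorded hash."""
    results = {}
    with zipfile.ZipFile(wacz_file) as zf:
        package = json.loads(zf.read("datapackage.json"))
        for resource in package.get("resources", []):
            algorithm, _, expected = resource["hash"].partition(":")
            actual = hashlib.new(algorithm, zf.read(resource["path"])).hexdigest()
            results[resource["path"]] = (actual == expected)
    return results


def make_demo_wacz(payload, recorded_payload=None):
    """Build a tiny in-memory WACZ-like ZIP; recorded_payload lets the demo fake tampering."""
    digest = hashlib.sha256(payload if recorded_payload is None else recorded_payload).hexdigest()
    package = {"resources": [{"path": "archive/data.warc.gz", "hash": f"sha256:{digest}"}]}
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as zf:
        zf.writestr("archive/data.warc.gz", payload)
        zf.writestr("datapackage.json", json.dumps(package))
    buffer.seek(0)
    return buffer
```

Running `verify_wacz` on a downloaded `.wacz` path works the same way as on the in-memory demo, since `zipfile.ZipFile` accepts both file names and file-like objects.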
WACZ shines when paired with Replay.web.page, a client-side replay engine.
```html
<script src="https://cdn.jsdelivr.net/npm/replaywebpage/ui.js"></script>
<replay-web-page source="https://example.com/my-archive.wacz" url="https://example.com/page"></replay-web-page>
```
Drop that Web Component into a page and the browser downloads the WACZ and replays it locally — no server-side replay engine like Pywb required. Host the WACZ on any static host (GitHub Pages, Netlify, Cloudflare Pages, S3) and you have a permanent archive.
Useful scenarios:
- Museums and libraries publish digital collections forever with no server cost
- Journalists embed a WACZ in the article body to permanently preserve a paywalled source
- Academic papers attach a WACZ of cited URLs as supplemental material
- Individual bloggers self-host WACZ snapshots of pages they cite
WACZ is being standardized via an IIPC (International Internet Preservation Consortium) working group. As of May 2026, spec 1.1.1 is the stable version.
10. WARC (ISO 28500) + Heritrix + Pywb — the infrastructure trio
If WACZ is the new package, WARC is the underlying standard.
WARC (Web ARChive) is ISO 28500, standardized in 2009. It succeeded the ARC format IA had used since 1996, and stores multiple HTTP request/response pairs sequentially in one file. ISO 28500:2017 is the current revision and remains valid in 2026.
A WARC record looks roughly like this.
```text
WARC/1.1
WARC-Type: response
WARC-Record-ID: <urn:uuid:abc-123>
WARC-Date: 2026-05-16T10:00:00Z
WARC-Target-URI: https://example.com/page
Content-Type: application/http; msgtype=response
Content-Length: 12345

HTTP/1.1 200 OK
Content-Type: text/html

<!doctype html>...
```
Record types include response (the actual response), request, warcinfo (file metadata), metadata (auxiliary), and revisit (deduplication reference). A WARC file is typically rolled at around 1GB.
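The record layout is simple enough to read by hand: a version line, named headers, a blank line, then exactly `Content-Length` bytes of payload. The parser below is a teaching sketch for a single uncompressed record and makes no attempt at gzip members, multi-record files or header folding. The names `parse_warc_record`, `PAYLOAD` and `SAMPLE` are made up for the example.

```python
def parse_warc_record(raw):
    """Parse one uncompressed WARC record into (version, headers, payload)."""
    head, _, rest = raw.partition(b"\r\n\r\n")
    lines = head.decode("utf-8").split("\r\n")
    version = lines[0]
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    # The payload length is declared up front, so no scanning for a terminator is needed.
    length = int(headers["Content-Length"])
    return version, headers, rest[:length]


# A made-up record for demonstration; the payload is the captured HTTP response.
PAYLOAD = b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<!doctype html>"
SAMPLE = (
    b"WARC/1.1\r\n"
    b"WARC-Type: response\r\n"
    b"WARC-Target-URI: https://example.com/page\r\n"
    b"Content-Type: application/http; msgtype=response\r\n"
    + b"Content-Length: %d\r\n\r\n" % len(PAYLOAD)
    + PAYLOAD
    + b"\r\n\r\n"
)
```

For real archives a maintained library such as warcio is the safer choice, since production WARC files are usually gzip-compressed record by record.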
Heritrix is IA's Java-based, large-scale distributed crawler, in development since 2003. It is the reference implementation of WARC and is used not only by IA but by the Library of Congress, the British Library, the National and University Library of Iceland, and effectively every national-library web-archiving program. Current stable as of May 2026 is 3.4.0, Apache License 2.0.
Heritrix's strength is proven stability at the billions-of-pages scale — robots.txt compliance, per-domain politeness, distributed multi-instance operation, disk caches, URL normalization, dedup. Its weakness is no JS rendering, which is why IA runs Brozzler (headless Chromium) alongside it.
Pywb (Python Wayback) is the WARC replay engine Ilya Kreymer wrote at IA and open-sourced. Written in Python, it lets anyone run a personal Wayback Machine. Conifer, Browsertrix Cloud, parts of the Library of Congress, and a long list of university libraries all run on Pywb.
A minimal Pywb setup:
```bash
pip install pywb
wb-manager init my-archive
wb-manager add my-archive ./my-crawl.warc.gz
wayback --port 8080
# Browse to http://localhost:8080/my-archive/2026*/https://example.com
```
This trio — WARC + Heritrix + Pywb — is the de facto standard preservation stack.
11. Perma.cc — Harvard Law and permanent citation links
Perma.cc is a permanent-citation service launched in 2013 by Harvard Law School Library. The motivation was research showing over 70% of URLs cited in law-review articles eventually broke.
The flow:
- A registered user (typically a lawyer, legal scholar, or journal editor) submits a URL.
- Perma.cc captures both rendered HTML and a PNG screenshot.
- A permanent short URL of the form `perma.cc/ABC1-DEF2` is issued.
- Papers and court opinions cite both the original URL and the perma.cc URL.
The crucial part is governance. Perma.cc is run by a consortium of 160+ law libraries (Perma.cc Registrars), and captures are stored across distributed storage owned by the consortium. Even if Harvard alone collapsed, another library would inherit the data.
Pricing:
- Public users: 10 captures per month free
- Faculty / registrars: unlimited free (the home library covers dues)
- Subscriber organizations: paid plans
Perma.cc has effectively become the standard citation tool in US legal practice. Since the 20th edition of The Bluebook (the US legal-citation guide), perma.cc URLs have been formally recommended as the "stable citation form when a URL is at risk of decay," and as of 2024 many US Supreme Court opinions include perma.cc URLs directly in their text.
Perma.cc itself uses WARC as the backend and lets organizations export whole collections as WARC. That is the model to follow: users see permanent short URLs, but the underlying storage is an open standard.
12. Permanent.org — preserving personal digital heritage
Permanent.org is a nonprofit digital-heritage service founded in 2017. It is designed to let individuals preserve their own photos, documents and videos for life and beyond. Some ex-IA engineers are involved, bringing infrastructure know-how with them.
Distinctive features:
- One-time payment — instead of a monthly subscription, you pay once and keep storage forever (current rates around $10 for 100GB, $50 for 1TB)
- Beneficiary system — designated heirs inherit accounts automatically on death
- Open content option — collections can be made publicly accessible if the user chooses
- Migration guarantee — format conversions and storage refreshes are included in the dues
The bet is straightforward: unlike cloud storage (Google Drive, iCloud, Dropbox) where missed payments mean lost files, Permanent.org collects a one-time fee and runs an endowment to sustain storage for a century or more. Roughly 70% of funds go to the endowment and 30% to current operations.
Anyone who has watched family photos drift from one cloud to another, eventually losing some, will recognize the problem. Permanent.org's nonprofit + one-time + legal-inheritance structure is one of the more interesting experiments in the space.
As of 2026 it has roughly 10,000 members and 200TB stored — small, but considered an important model in digital-heritage circles.
13. ArchiveBox + SingleFile — the self-hosted approach
If you want your own preservation stack on your own hardware, ArchiveBox is the de facto standard. Started by Nick Sweeting in 2017, this Python-based open-source tool combines "bookmark manager + Wayback Machine + permanent storage" in a self-hostable package.
Features:
- Multiple backends — a single URL is captured simultaneously as WARC, HTML, PDF, PNG screenshot, YouTube-DL (video), Git clone, Readability-extracted article and more (7–10 formats by default)
- CLI + Web UI — `archivebox add <url>` to add, web UI for search and browsing
- JSON + SQLite — metadata stored in standard formats, easy to export
- Docker / Docker Compose — `docker run -v ./data:/data archivebox/archivebox`
Initial setup is short.
```bash
# Docker
docker run -v "$PWD/data:/data" -it archivebox/archivebox init --setup
# Add a URL
docker run -v "$PWD/data:/data" archivebox/archivebox add 'https://example.com/article'
# Run the web UI
docker run -v "$PWD/data:/data" -p 8000:8000 archivebox/archivebox server 0.0.0.0:8000
```
ArchiveBox's strength is format diversity. A WARC alone needs a replay engine; PDF, PNG and standalone HTML can be read by any tool in any future. ArchiveBox is the cleanest implementation of the preservation community's "format diversity" principle.
SingleFile is a Chrome / Firefox extension by Gildas Lormeau that saves the page as currently rendered in your browser as a single, self-contained HTML file, inlining images, CSS, fonts and JS so the file is dependency-free.
```bash
# A CLI version also exists
npm install -g single-file-cli
single-file https://example.com output.html
```
ArchiveBox uses SingleFile as one of its backends. For the lightest "I want to keep this page" workflow, the SingleFile extension and one click are hard to beat.
Adjacent tools include Wallabag (RSS / read-it-later focus), Hypothesis (annotation focus), and Zotero (scholarly citation focus). None of them quite matches ArchiveBox for the combination of WARC + Markdown export + multi-format.
14. Korea — OASIS at the National Library, KEPRI archive, Academy of Korean Studies
Web archiving in Korea is led by the National Library of Korea (NLK).
OASIS (Online Archiving and Searching Internet Sources) is the Korean government's official web archiving program, started in 2003 at oasis.nl.go.kr. As of May 2026:
- Cumulative websites collected: approximately 32 million
- Annual new collection: approximately 2 million
- Storage: approximately 2PB
- Targets: government and public-sector sites, scholarly resources, cultural and current-affairs sites, plus event-based special collections (elections, disasters, etc.)
OASIS rests on the 2010 amended Library Act and its enforcement decree, which established Korea's online legal-deposit regime. Digital works published in Korea must be deposited at NLK; web collection mixes opt-in agreements with national-library-driven harvesting.
Common user entry points:
- `oasis.nl.go.kr/search` — keyword search
- `oasis.nl.go.kr/wayback/YYYYMMDDhhmmss/original-URL` — Wayback-style point-in-time replay
- Bulk academic exports available on application
KEPRI (Korea Electric Power Research Institute) Digital Archive preserves blueprints, technical reports and standards for the power industry. Access is primarily through industry and academic channels rather than open public browsing.
The Academy of Korean Studies (AKS) runs the Korean Studies Resource Center (kostma.aks.ac.kr) and a Korean Studies portal, digitizing primary Korean Studies sources — old manuscripts, lineage records (jokbo), local gazetteers (eupji), and colonial-era newspapers — and making them freely accessible. Its IIIF viewer supports page-level zoom and annotation for rare manuscripts.
A handy map of Korean archives by domain:
| Domain | Institution | URL |
|---|---|---|
| General web | National Library of Korea | oasis.nl.go.kr |
| Korean Studies primary sources | Academy of Korean Studies | kostma.aks.ac.kr |
| National records | National Archives of Korea | archives.go.kr |
| Academic papers | KISS, RISS, KCI | kiss.kstudy.com, riss.kr, kci.go.kr |
| News | BIGKinds (Korea Press Foundation) | bigkinds.or.kr |
| Film and broadcasting | Korean Film Archive, KBS Archive | koreafilm.or.kr |
| Power and engineering | KEPRI digital archive | kepri.re.kr |
OASIS is small compared to the Wayback Machine's 935B pages, but it represents government-level responsibility for preserving the .kr domain. If IA is the global backup, OASIS is the primary keeper of Korea's cultural heritage on the web.
15. Japan — National Diet Library NDL + WARP
Japan's official web archiving is run by the National Diet Library (NDL) through WARP (Web Archiving Project).
WARP started as a pilot in 2002 and gained legal authority in the 2010 amendment of the National Diet Library Law. As of May 2026:
- Cumulative URLs collected: approximately 2.7 billion
- Storage: approximately 1.5PB
- Targets: national, prefectural and municipal government sites (collected in full), public-interest corporations, academic sites, and selectively collected current-affairs and event sites
A distinctive feature: government sites are harvested in full without per-site consent. Article 25-3 of the National Diet Library Law authorizes NDL to collect and preserve government online materials. Private sites operate under an opt-in model that requires owner consent.
Entry points:
- `warp.da.ndl.go.jp/search/` — keyword search
- `warp.da.ndl.go.jp/info:ndljp/pid/ID` — NDL persistent identifier
- `warp.da.ndl.go.jp/waybackmachine/YYYYMMDDhhmmss/URL` — Wayback-style replay
WARP's harvest frequency varies by site type — central ministries once or twice a month, prefectures and municipalities quarterly, academic institutions every six months. When elections or disasters occur, daily emergency harvests are turned on. After the 2011 Tohoku earthquake NDL put about 10,000 sites into emergency-harvest mode.
NDL also runs Digital Collections, a project digitizing Japanese books, magazines, dissertations, audio and video. Around 600,000 items are freely readable online, and an additional 2,000,000 items are available via library-transmission services.
The National Archives of Japan (NAJ) publishes digitized government records, and the Diet Proceedings Search System offers full-text search of Diet minutes since 1947. NDL + NAJ + the Diet proceedings search form the three pillars of Japan's digital recordkeeping.
16. The linkrot crisis — how we respond
The linkrot numbers from the intro are taken seriously across academia, journalism and government. The 2024 Pew Research report documented:
- 38% of webpages that existed in 2013 were no longer accessible by 2023
- 11% of the reference links on English-language Wikipedia pointed to pages that were no longer accessible
- 21% of the government webpages examined contained at least one broken link
- Nearly one in five tweets were no longer publicly visible within months of being posted
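The 38% over ten years figure can be turned into a yearly rate with a line of arithmetic. The sketch assumes, purely for illustration, that pages disappear at a constant rate each year, which the Pew data does not claim: if a fraction r is lost annually, then (1 - r)^10 = 0.62.

```python
def annual_loss_rate(fraction_lost, years):
    """Constant yearly loss rate r such that 1 - (1 - r)**years == fraction_lost."""
    return 1 - (1 - fraction_lost) ** (1 / years)


def surviving_fraction(rate, years):
    """Fraction of pages still reachable after `years` at a constant yearly loss `rate`."""
    return (1 - rate) ** years


rate = annual_loss_rate(0.38, 10)
print(f"implied yearly loss: {rate:.1%}")
print(f"surviving after 20 years at that rate: {surviving_fraction(rate, 20):.0%}")
```

Under that assumption the implied loss is a little under 5% per year, which is where the "several percent each year" framing comes from.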
Linkrot has many causes.
- DNS / server shutdown — companies fold, hosting expires
- URL structure changes — CMS migrations, site redesigns
- Policy-driven deletion — copyright, defamation, GDPR right-to-erasure
- Social media account closure — user-initiated or platform action
- Paywalling — content that was effectively public becomes private
Strategies divide into the production side and the citation side.
Producer side
- Use standard formats — WARC and WACZ
- Distribute across multiple archives — IA, archive.today, Perma.cc, your own ArchiveBox
- Get legal and contractual cover — explicit licensing, library partnerships
- Plan format migrations — review and re-encode every 10 years or so
Citer side
- Capture at the moment of citation — Save Page Now and archive.today together
- Cite the original plus at least two archived URLs — original + perma.cc + web.archive.org
- Include the full text or critical excerpts — even if every URL dies, the meaning of the citation survives
- Keep a local PDF backup — on your machine or NAS
Especially in academic journals and news media, "every URL we cite is archived before publication" is becoming a standard. The New York Times, The Atlantic, ProPublica, Japan's NHK, and Korea's Hankyoreh and Kyunghyang have all adopted internal citation-preservation policies.
17. Who should care about digital preservation — librarians, journalists, researchers, citizens
Digital preservation is no longer just for librarians and archivists. By 2026, almost every information worker carries part of the responsibility for preserving their domain.
Librarians and archivists
- Run your own collections with WARC + Pywb
- Use Browsertrix Cloud or Conifer as curation tools
- Join the Perma.cc consortium (academic libraries)
- Collaborate under LOCKSS with peer institutions
Journalists
- Capture every external URL through archive.today / Save Page Now before publication
- Politicians' social posts, corporate filings, and other deletion-prone material should be in at least two archives
- Use WACZ to build your own media archive (see NYT R&D, ProPublica)
Researchers
- Make citation URLs permanent with Perma.cc
- Preserve data and code separately on Zenodo, OSF, or the GitHub Archive Program
- Store interviews and field notes on Permanent.org with family-rights metadata
Citizens and individuals
- Back up family photos and documents to Permanent.org plus a NAS-plus-cloud combo
- Periodically back up your blog and social posts with SingleFile / ArchiveBox
- Save favorite pages to the Wayback Machine on first read (bookmarklet or extension)
Developers and operators
- Periodically back up company wikis and internal docs with ArchiveBox
- Bundle external dependency docs (SaaS docs, blog posts) into WACZ at build time
- Mirror open-source projects onto Software Heritage (softwareheritage.org)
The single most important fact: there is no such thing as "digital forever". The person who will keep your data alive is you; global infrastructure (IA, archive.today, NDL, OASIS) is auxiliary. In 2026, the default citizen posture is "I am responsible for what happens to my digital records 30 years from now."
18. References
- Internet Archive — https://archive.org
- Wayback Machine — https://web.archive.org
- archive.today — https://archive.today (mirrors: archive.ph, archive.is)
- Save Page Now — https://web.archive.org/save/
- Common Crawl — https://commoncrawl.org
- Conifer — https://conifer.rhizome.org
- Webrecorder Software — https://webrecorder.net
- Browsertrix Crawler — https://github.com/webrecorder/browsertrix-crawler
- Browsertrix Cloud — https://browsertrix.com
- Replay.web.page — https://replayweb.page
- WACZ Specification — https://specs.webrecorder.net/wacz/latest/
- WARC (ISO 28500) — https://www.iso.org/standard/68004.html
- Heritrix — https://github.com/internetarchive/heritrix3
- Pywb — https://github.com/webrecorder/pywb
- Brozzler — https://github.com/internetarchive/brozzler
- Hachette v. Internet Archive (2024 ruling) — Second Circuit decision, 2024.09.04
- Perma.cc — https://perma.cc
- Permanent.org — https://www.permanent.org
- ArchiveBox — https://archivebox.io
- SingleFile — https://github.com/gildas-lormeau/SingleFile
- Library of Congress Web Archives — https://www.loc.gov/programs/web-archiving/
- National Library of Korea OASIS — https://oasis.nl.go.kr
- Academy of Korean Studies — https://www.aks.ac.kr (Korean Studies Resource Center: kostma)
- KEPRI Digital Archive — https://www.kepri.re.kr
- National Diet Library (NDL), Japan — https://www.ndl.go.jp
- NDL WARP — https://warp.da.ndl.go.jp
- IIPC (International Internet Preservation Consortium) — https://netpreserve.org
- Software Heritage — https://www.softwareheritage.org
- LOCKSS — https://www.lockss.org
- Memento Protocol (RFC 7089) — https://datatracker.ietf.org/doc/html/rfc7089
- Pew Research linkrot study (2024) — https://www.pewresearch.org/internet/2024/05/17/when-online-content-disappears/
- Harvard Law School Library on linkrot — https://cyber.harvard.edu/research/linkrot