The Internet Archive’s Wayback Machine and the fragile history of the web

The Internet Archive’s Wayback Machine is one of those tools people often discover in a moment of irritation. A page is gone. A company has changed its policy page. A government document no longer loads. A journalist needs proof that a sentence existed before it was quietly edited away. You paste in a URL, choose a date, and hope the archive caught it. Quite often, it did. The service began with the Internet Archive’s web archiving efforts in 1996, opened public access about five years later, and now advertises access to more than 1 trillion saved web pages.

That scale can make the Wayback Machine look magical. It is not magic. It is infrastructure. It is a public-facing replay system built on crawling, indexing, storage standards, and a long-running institutional commitment to preserving a medium that was never built to preserve itself. The web changes constantly. Pages break, disappear, move behind paywalls, or survive only as half-functioning shells. The Wayback Machine exists because the live web is a bad place to keep history.

A memory system for a disposable medium

The most useful way to understand the Wayback Machine is to stop thinking about it as a quirky nostalgia site. Yes, it lets you look at old homepages, dead fan sites, forgotten corporate logos, and awkward redesigns from 2003. That is the fun part. The serious part is that it gives the web something it rarely has on its own: temporal memory. A page on the live web usually shows only its current state. A page in a web archive can reveal a sequence of states, which is far closer to how historians, reporters, researchers, and courts need to understand public information.

The need for that memory is no longer a niche concern. Pew Research Center found that a quarter of webpages that existed at some point between 2013 and 2023 were no longer accessible by October 2023. Its study also found broken links across government and news sites, and even within Wikipedia references. The Library of Congress has written about the same decay in the language archivists use every day: link rot and reference rot. One is disappearance. The other is quieter and sometimes worse. A link still works, but the page no longer says what it said when someone cited it.

Legal scholarship has been warning about that problem for years. A Harvard Law Review Forum piece on Perma described severe rates of link and reference rot in legal scholarship and U.S. Supreme Court opinions. That matters well beyond law journals. It tells you something blunt about the web as a public record: without preservation, citation on the web is often a wager against time. You may be pointing readers to something real today and something nonexistent or altered tomorrow.

That is the hole the Wayback Machine fills. It does not solve every preservation problem, and it does not give perfect coverage. It does something simpler and more radical. It keeps versions. The archive lets a reader compare a page across dates, inspect capture timestamps, and retrieve a public record that no longer depends on the current owner of the site deciding to keep it online. That changes the balance of power between publisher and public. A page owner still controls the live page. The archive can preserve evidence that another version existed.

That is also why the Wayback Machine can feel politically charged even when it is doing something technically plain. It reduces the ease with which institutions can pretend the past vanished just because the URL changed. Governments revise guidance. Companies rewrite claims. Media outlets correct or restructure articles. None of that is inherently suspicious. The point is that public memory should not depend only on the current version of a page. On the modern web, the absence of history is often treated as normal. The Wayback Machine refuses that bargain.

An invisible layer of public infrastructure

The Wayback Machine is famous, but its deeper importance comes from the fact that it has become part of other people’s workflows. Journalists use it to trace edits and deletions. Researchers use it to study site evolution, media history, and platform behavior. Ordinary users use it to recover lost pages, old documentation, or dead links. The Internet Archive itself has highlighted how reporters rely on web archives in digital investigations, and recent research surfaced multiple recurring journalistic uses, from following page changes to verifying disappearing material.

Google’s 2024 integration made that role even more visible. The Internet Archive announced that users could reach archived versions of pages directly through Google Search’s “More About This Page” panel. That is a small interface choice with a big implication. It treats web history as something a search engine user may reasonably need, not as a specialist trick reserved for archivists. A search result is no longer just a pointer to the present page. It can also become a path to prior versions.

The browser extension tells the same story from a different angle. The Wayback Machine extension can detect dead pages and suggest archived versions, which turns web archiving into a repair layer for broken browsing. That is an unglamorous function, but it matters. A web archive is not only for historians looking backward decades at a time. It is also for people trying to follow a citation that broke sometime last month. In that sense the archive behaves less like a museum and more like public maintenance for a decaying medium.

Government archiving makes the civic value clearer. The End of Term Web Archive preserves U.S. government websites during presidential transitions, precisely because those transitions are moments when public pages can change, vanish, or be rewritten. The project describes itself as a collaborative effort to collect, preserve, and make government sites accessible at the end of administrations. Internet Archive posts about the 2024–2025 crawl stressed the same point: preserving these pages protects access to policy history, public messaging, and official information that may later disappear from the live web.

That is why the Wayback Machine keeps showing up in stories that have nothing to do with nostalgia. It appears in reporting about misinformation, in legal disputes, in scholarship on digital media, and in public accountability work. The web is where institutions publish, retract, revise, and sometimes quietly erase. A functioning public archive turns those acts into something that can be checked. It does not end disputes about authenticity. It gives those disputes a record to work with. That alone makes it more than a convenience tool.

The shape of a capture

A lot of confusion around the Wayback Machine comes from thinking it archives “the internet” in one undifferentiated sweep. It does not. It archives web content through distinct mechanisms, each with its own strengths and blind spots. The Internet Archive’s help materials explain that the Wayback Machine draws from many different web crawls, and that each capture can belong to a specific collection with its own story about who collected it, when, and why. Some captures come from broad crawling. Some come from user requests. Some come from partner programs focused on preserving particular domains or institutions.

The simplest public entry point is Save Page Now. That tool lets a user archive a specific page on demand. The help documentation is careful about what that does and does not do. Save Page Now saves one page at a time; it does not automatically add that URL to future crawls, and it does not preserve an entire site just because one page was requested. That limit matters. People often assume one successful save means an entire domain is safely preserved. It does not. It means you likely preserved the page you asked for and whatever embedded resources the system could fetch for that save.
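
To make that concrete, here is a minimal sketch of requesting an on-demand capture programmatically. It assumes the simple anonymous GET form of the endpoint remains available; the authenticated SPN2 API offers more options and status polling, and rate limits can change.

```python
import requests

# A minimal sketch: trigger an on-demand capture via Save Page Now.
# The anonymous GET endpoint used here is the simplest public form;
# availability and rate limits are assumptions that may change.
target = "https://example.com/some-page"

resp = requests.get(f"https://web.archive.org/save/{target}", timeout=120)
resp.raise_for_status()

# On success the service redirects to the new snapshot, so the final
# URL after redirects is the persistent archived address.
print("Archived at:", resp.url)
```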

For broader work, the Internet Archive directs organizations toward Archive-It, its subscription service for building collections and archiving whole sites. That division is useful because it shows the Wayback Machine is not only a public viewer. It sits inside a larger preservation ecosystem that includes institutional collections, event-based crawls, domain-focused projects, and developer tools. The public calendar view is the visible tip of a much larger archiving system.

The main ways the Wayback Machine gathers and serves material

| Part of the system | What it does | Best use |
| --- | --- | --- |
| Broad web crawls | Collects pages at scale from across the web | Long-term background coverage |
| Save Page Now | Preserves a page on request | Urgent single-page saving |
| Archive-It and partner collections | Archives whole sites or curated collections | Institutional, thematic, or event-based preservation |
| Wayback APIs and CDX indexes | Exposes capture data for lookup and analysis | Research, automation, and verification |

That table matters because it corrects a common misunderstanding. People often judge the archive as though it were a single uniform record. It is not. It is a layered record assembled from different collection logics. A page may be present because it was part of a large crawl. Another may exist only because someone saved it in time. A third may be preserved because a library, archive, or project team targeted it deliberately. Coverage is shaped by mission, timing, crawl scope, permissions, and technical feasibility.

That is also why provenance matters. The Internet Archive’s help pages encourage users to inspect capture details and the collections behind them. A serious user should do that. If you are using the Wayback Machine for evidence, research, or reporting, a capture date alone is not the whole story. You want to know whether the page came from a routine crawl, a partner archive, or an on-demand save. The archive does not only store pages. It stores context about the act of capture, and that context is part of the record.

Inside a saved page

Underneath the friendly calendar interface sits a set of preservation standards and indexing systems that matter a great deal if you want to understand what the archive is really showing you. The Library of Congress describes the WARC format as the standard way of combining harvested web resources and related metadata into archival files. WARC is not just a dump of HTML. It is a preservation container designed to store payloads, metadata, request and response information, and the relationships between archived resources. A web archive is closer to a forensic package than a screenshot folder.
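
A short sketch shows what that container looks like in practice, using the open-source warcio Python library to walk a local WARC file (the filename is illustrative):

```python
from warcio.archiveiterator import ArchiveIterator

# Walk a local WARC file and print the target URL and capture date of
# each HTTP response record. Request records, metadata records, and
# revisit records travel in the same container alongside responses.
with open("example.warc.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type == "response":
            print(record.rec_headers.get_header("WARC-Target-URI"),
                  record.rec_headers.get_header("WARC-Date"))
```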

The Wayback Machine also relies on indexing layers such as CDX. Internet Archive documentation explains that a CDX file summarizes archived web documents line by line, making them queryable and sortable. That matters for more than developer curiosity. It is one reason researchers can ask structured questions about captures across time instead of manually clicking through calendars for hours. The replay view feels simple because the hard work happened earlier in collection, storage, and indexing.
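
For example, the public CDX endpoint can return capture metadata as JSON, with the first row naming the fields. A sketch, assuming the currently documented output format:

```python
import requests

# Query the public CDX API for recent capture metadata about one URL.
params = {"url": "example.com", "output": "json", "limit": "5"}
rows = requests.get("https://web.archive.org/cdx/search/cdx",
                    params=params, timeout=30).json()

# The first row is a header naming the columns; the rest are captures.
header, *captures = rows
for cap in captures:
    record = dict(zip(header, cap))
    print(record["timestamp"], record["statuscode"], record["original"])
```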

Then there is Memento, the HTTP framework that gives web archives a language for time-based access. RFC 7089 describes it as a way to bridge the present and past web through datetime negotiation, TimeMaps, and prior-state resources. The mementoweb project puts it more plainly: it adds a time dimension to HTTP. That framing is useful because it captures the real conceptual achievement. The Wayback Machine is not only saving pages; it is helping normalize the idea that a web resource may need to be requested as it existed at a particular moment, not only as it exists now.
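
In practice, the Wayback Machine exposes those Memento mechanics over plain HTTP. A sketch of datetime negotiation and a TimeMap request, assuming the endpoints as currently published; the exact headers returned may vary:

```python
import requests

# RFC 7089 datetime negotiation: ask the TimeGate for the capture
# closest to a given moment via the Accept-Datetime header.
headers = {"Accept-Datetime": "Thu, 01 Jan 2015 00:00:00 GMT"}
resp = requests.get("https://web.archive.org/web/https://example.com/",
                    headers=headers, timeout=30)
print(resp.url)                               # the selected memento
print(resp.headers.get("Memento-Datetime"))   # its capture time

# A TimeMap enumerates all known mementos for the URL in link format.
timemap = requests.get(
    "https://web.archive.org/web/timemap/link/https://example.com/",
    timeout=30)
print(timemap.text[:300])
```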

The technical side has also changed as the web changed. In 2019 the Internet Archive described a major rewrite of Save Page Now built on Brozzler, browser-based crawling software capable of running page JavaScript during capture. The Brozzler repository describes it as a distributed crawler that uses a real browser such as Chrome or Chromium to fetch pages, embedded URLs, and links. That does not solve every modern web problem, but it narrows an old gap. Earlier archival workflows often struggled badly with client-side rendering. A browser-driven capture system can replay some modern pages more faithfully than old crawler models could.

Faithful is not the same as perfect. The archive’s own help documentation says dynamic pages vary widely in how well they can be stored. Standard HTML is straightforward. Forms, complex JavaScript interactions, and elements that require live communication with the originating host can fall apart in replay. That limitation is not a bug in the narrow sense. It is a consequence of what the web has become. Many modern pages are not static documents at all. They are temporary performances assembled on demand from scripts, APIs, authentication states, and remote assets. Archiving them is far harder than archiving a page from the late 1990s.

An archive full of holes

A Wayback capture can look complete even when it is not. The page loads. The text is there. The headline appears. The layout more or less resembles the original. Then you notice missing images, dead scripts, broken styles, or a search box that does nothing. The Internet Archive’s help pages say this bluntly: broken or gray images usually mean those images were not archived on their servers. That is a useful reminder that replay is an act of reconstruction. The archive can only replay what it captured. If a resource was missed, blocked, or loaded from a source the crawler could not fetch, the replay may be partial.

Some of those gaps come from technical complexity. Client-side apps, interactive forms, authenticated sessions, media loaded on demand, geofenced content, and pages built through third-party calls can all reduce archival completeness. Some gaps come from crawl scope. Save Page Now may preserve the page you submitted, but not the entire site around it. Some come from indexing delays. The help page on recently archived pages notes that new saves can take time to become visible because different indexes cover different periods. A user who assumes failure the moment a fresh save does not instantly replay may simply be ahead of the system.

Other holes are social and legal rather than technical. The Internet Archive provides a process for site owners or account holders to request exclusion of their archives from web.archive.org, and copyright complaints follow separate policy channels. That means the archive is not a unilateral preservation machine that simply takes and keeps everything forever. It exists inside a field of permissions, objections, takedowns, and negotiated limits. Preservation is public-facing, but it is not sovereign.

The robots question sits near the center of that tension. Internet Archive writing has argued that robots.txt rules built for search engines do not necessarily fit archival goals, yet robots behavior and archival exclusion remain live governance issues around what gets captured and replayed. Users sometimes treat absence in the Wayback Machine as proof a page never existed or was never public. That is a mistake. It may have existed and simply not been captured, or it may have been excluded, blocked, or later removed. Silence in an archive is not clean evidence of nonexistence.

That problem has grown sharper in the AI era. In April 2026, multiple news reports described a new wave of major publishers blocking the Wayback Machine’s crawler, largely over concerns that archived pages could be exploited for AI training or other unauthorized reuse. Recent reporting also noted Reddit’s restrictions on the Internet Archive’s ability to archive most of its content. Those moves do not merely affect historians decades from now. They affect journalists, researchers, and readers right now, because material that would once have been captured as part of the public web record may no longer enter that record at all.

Evidence, accountability, and the politics of deletion

The Wayback Machine is useful because it is a record. It is contentious for the same reason. Once archived pages start being used as evidence, people stop talking about cute old websites and start arguing about authenticity, admissibility, and interpretation. Legal writing on the subject has long wrestled with those issues. A Fordham Law Review note called “Best Evidence and the Wayback Machine” examined how courts handled archived web pages and argued that archived materials raise both authentication and best-evidence questions. That is a healthy caution. An archived page is powerful evidence, but it still requires careful handling.

Lawyers dealing with link rot have often paired that caution with a practical response: preserve cited web material before it moves. Perma.cc grew out of exactly that problem, especially in legal and academic citations. The Harvard Law Review Forum piece on Perma set out the scale of link and reference rot, while recent American Bar Association guidance still recommends tools such as the Internet Archive and Perma.cc to reduce the damage. The distinction matters. Perma is targeted, citation-focused, and narrower. The Wayback Machine is broad, public, and historical. They are not rivals so much as different answers to the same instability.

Journalism reveals another layer. Internet Archive reporting on web archives in investigations described reporters using archived pages to follow quiet edits, disappearing claims, or vanished posts. That use is especially important because the web has normalized silent revision. Print culture had editions. Broadcast had recordings. The live web often offers only replacement. A web archive reintroduces version history into public discourse. That does not settle arguments about motive or meaning, but it can settle a narrower and vital question: did this page say this on this date?

Government pages make the accountability stakes even plainer. The End of Term Web Archive exists precisely because public institutions are not exempt from digital forgetting. Agencies redesign sites, alter wording, retire reports, move PDFs, and reorganize databases. A preserved version lets researchers track those changes across administrations. That is not merely administrative housekeeping. It is part of the documentary basis for democratic accountability. A public statement should remain inspectable even after the institution that posted it decides to move on.

The politics of deletion runs through all of this. Some removals are justified. Privacy matters. Copyright matters. Harmful publication can create real reasons to limit replay. But there is also a broader public interest in not letting the web become a medium where every inconvenient version is replaced without trace. The Wayback Machine keeps that tension visible. It does not give a neat philosophical answer. It forces society to keep making choices about which forms of forgetting are legitimate and which are a loss to the public record.

Pressure on the archive itself

People often talk about the Wayback Machine as though it were a neutral layer hovering above the internet. It is not. It is built and operated by a real institution with legal exposure, technical vulnerabilities, staffing constraints, and political critics. That matters because a fragile institution cannot guarantee stable public memory. The cyberattacks of 2024 made that visible. Internet Archive service updates explained that the organization resumed the Wayback Machine in stages after attacks in October 2024 and brought archive.org back in provisional read-only mode while it rebuilt defenses. Brewster Kahle later wrote that services returned in phases while the organization focused heavily on security work.

Those incidents are easy to treat as isolated operations news. They are not. They show that digital preservation depends on operational resilience. You do not have a public memory system if it can be knocked offline for long stretches or if restoring access becomes too expensive. The Internet Archive’s collections remained safe, but the attacks demonstrated the gap between storing data and providing dependable public access to it. An archive must do both.

Legal pressure matters in a different way. The Hachette litigation was about the Archive’s book-lending program rather than the Wayback Machine itself, but it still weakened the broader institution. In late 2024 the Internet Archive announced it would not seek Supreme Court review in that case and would continue honoring the publisher agreement tied to the outcome. In 2025 the Archive also warned of a major recording-industry lawsuit over 78rpm preservation. Those disputes do not erase the Wayback Machine, but they shape the environment around the nonprofit that runs it. A preservation institution under sustained legal and financial attack has fewer margins everywhere.

The platform blocking trend adds a third form of pressure. News organizations and platforms restricting archival crawling are not attacking the Internet Archive in the same direct way a lawsuit or DDoS campaign does. Still, the result can be similar from a public-record perspective: less of the contemporary web enters the archive. That is a structural risk, not a passing annoyance. The web becomes harder to preserve at the very moment more public life moves onto services that are tightly controlled, dynamically rendered, and guarded by anti-scraping policy.

That is why the one-trillion-page milestone should be read in two ways. It is an extraordinary preservation achievement. It is also a reminder that success in web archiving is never final. The archive can cross a historic threshold and still face new barriers to collecting what tomorrow’s web will become. Past scale does not guarantee future coverage.

Reading the record with care

The best Wayback users are not the ones who click most quickly. They are the ones who read the archive like a record with structure. The Internet Archive’s search guidance still points users toward the exact URL and date selection flow, and that is usually the smartest place to begin. A homepage may have hundreds of captures while a subpage has none. A page can move from HTTP to HTTPS, from one subdomain to another, or from a plain path to a slugged CMS URL. The archive is literal-minded about addresses, so good searching often begins with URL variations rather than keyword guessing.
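
One practical way to work with that literal-mindedness is to search by URL prefix rather than guessing exact addresses. A sketch using the CDX API's documented matchType and collapse options (the URL is illustrative):

```python
import requests

# Find distinct archived URLs under a path prefix, which helps when a
# page moved between URL patterns over the years. matchType="prefix"
# and collapse="urlkey" are documented CDX query options.
params = {
    "url": "example.com/blog/",
    "matchType": "prefix",
    "collapse": "urlkey",   # one row per distinct URL
    "output": "json",
    "limit": "10",
}
rows = requests.get("https://web.archive.org/cdx/search/cdx",
                    params=params, timeout=30).json()
for row in rows[1:]:        # skip the header row
    print(row[2])           # the "original" URL column
```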

The timestamp matters as much as the page. A capture from 10:03 and a capture from 19:47 on the same date may not show the same thing. If you are checking whether a statement appeared before an edit or removal, you need the narrowest possible window. The Memento framework and capture metadata exist for exactly that reason. Time is not decoration in a web archive. It is the point of the archive.
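
The CDX index supports exactly that kind of narrowing: its from and to parameters bound a query to a date window. A sketch, with an illustrative URL and window:

```python
import requests

# List every capture of one page inside a narrow date window, so a
# claimed edit can be bracketed between specific capture timestamps.
params = {
    "url": "example.com/statement.html",   # illustrative URL
    "from": "20240301",                    # inclusive, YYYYMMDD
    "to": "20240302",
    "output": "json",
}
rows = requests.get("https://web.archive.org/cdx/search/cdx",
                    params=params, timeout=30).json()
for row in rows[1:]:
    print(row[1])   # the 14-digit capture timestamp
```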

If you need to preserve something important, waiting for a general crawl is a bad strategy. Use Save Page Now. The Internet Archive’s help pages and 2025 blog guidance make that explicit: you can archive a page in real time, and the service will give you a persistent archived URL once the save completes. For urgent public-interest material, that can be the difference between having a record and losing it. The archive is strongest when people treat preservation as an active habit rather than a passive hope.

For deeper work, the APIs matter. The developer documentation and tutorial pages show how to check whether a site exists in the archives and how to retrieve capture information programmatically. That is not only for engineers. Researchers, newsroom developers, librarians, and investigative teams can use those interfaces to verify coverage, study change over time, or automate retrieval. The public calendar remains the entry point. The API layer is where the Wayback Machine becomes research infrastructure.
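
The simplest of those interfaces is the availability endpoint, which returns the snapshot closest to a requested timestamp. A sketch of the documented query:

```python
import requests

# Check whether a URL exists in the archive and find the snapshot
# closest to a target date via the public availability API.
resp = requests.get("https://archive.org/wayback/available",
                    params={"url": "example.com",
                            "timestamp": "20200101"},
                    timeout=30).json()

closest = resp.get("archived_snapshots", {}).get("closest")
if closest:
    print(closest["timestamp"], closest["url"])
else:
    print("No snapshot found for that URL")
```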

One more caution belongs here. A Wayback capture proves less than some users want and more than skeptics sometimes admit. It may not prove the full functionality of a live site, the intent behind a change, or the completeness of every embedded asset. It can still prove that a public-facing archived version existed at a specific capture time and displayed particular content. Serious use depends on staying inside those boundaries. The archive becomes more trustworthy the moment you stop asking it to do impossible things.

A public memory worth defending

The Wayback Machine matters because the web is not self-preserving. It never was. The oldest optimistic idea about the internet was that publishing digitally would make information plentiful and durable. The durable part turned out to be false. Digital publication is easy. Durable public access is hard. Pages disappear. Interfaces change. Owners lock down archives. Platforms shutter old content. Links decay. Reference trails break. A public archive steps into that failure and says that yesterday’s web still belongs to the historical record.

That is why the Internet Archive’s achievement deserves real respect. Starting from web archiving work in 1996, opening public access in 2001, and crossing the threshold of one trillion archived pages in 2025 is not just a story about scale. It is a story about insisting that digital memory should remain publicly accessible rather than being outsourced entirely to private platforms, proprietary search indexes, and whoever currently controls a domain name.

But admiration is not enough. The record the Wayback Machine offers is partial, contested, and under pressure. That is not a reason to dismiss it. It is the reason to take it seriously. A public memory system for the web will always be incomplete. The alternative is not perfection. The alternative is disappearance without witness. For a medium that now carries government records, journalism, research, commerce, culture, and personal history, that would be an astonishingly reckless way to live.

FAQ

What is the Wayback Machine?

The Wayback Machine is the Internet Archive’s web archiving service that lets users view saved versions of webpages from earlier dates. It began with web archiving efforts in 1996 and opened public access in 2001.

Who runs the Wayback Machine?

It is run by the Internet Archive, a nonprofit digital library organization. The service is one part of the Archive’s broader mission to preserve digital and cultural materials for public access.

How many pages has it archived?

The Wayback Machine now advertises access to more than 1 trillion archived web pages. The Internet Archive publicly celebrated that milestone in October 2025.

Why does the Wayback Machine matter?

It matters because the live web loses material constantly through deletion, redesign, broken links, and silent edits. Web archives preserve earlier versions that readers, journalists, and researchers can still inspect.

What is link rot?

Link rot is what happens when a URL that once worked no longer leads to the original content. Archivists often pair that with “reference rot,” where the link still works but the content has changed.

Can I save a page myself?

Yes. Save Page Now lets users request an immediate archive of a specific page and returns an archived URL once the save completes.

Does Save Page Now archive an entire site?

No. The Internet Archive says Save Page Now does not automatically add the page to future crawls and does not archive whole sites just because one page was saved.

How do I search the Wayback Machine well?

The most reliable starting point is the exact URL, then the calendar view, then the specific capture time. URL changes, redirects, and subdomain differences often explain why a page is harder to find than expected.

Why do some archived pages look broken?

A replay may miss images, scripts, styles, or other resources if they were not captured. Dynamic pages, forms, and content that depends on live services also tend to replay imperfectly.

Can the Wayback Machine archive JavaScript-heavy pages?

Sometimes. The Archive improved Save Page Now with Brozzler, a browser-based crawler that can run page JavaScript during capture, but modern dynamic sites remain harder to preserve than traditional HTML pages.

What is WARC?

WARC is the standard archival file format used to store harvested web resources and related metadata. It is a core building block of professional web archiving.

What is Memento?

Memento is an HTTP framework for time-based access to prior versions of web resources. It gives web archives a standard way to expose earlier states of a page.

Can site owners remove material from the Wayback Machine?

The Internet Archive provides a process for requesting exclusion of archived site material, and copyright complaints follow separate policy routes. That means the archive is public, but not immune to removal requests.

Why are some sites missing entirely?

A site may be absent because it was never captured, was technically difficult to archive, was excluded, or blocked archival crawling. Absence in the archive does not automatically mean the page never existed.

Are publishers blocking the Wayback Machine now?

Yes, some major publishers and platforms have recently restricted archival crawling, often citing AI-scraping and reuse concerns. That trend has become a real threat to future web preservation.

Is the Wayback Machine used in journalism and research?

Very much so. Journalists use it to track edits and deletions, while researchers use it to study media history, web change, and institutional communication over time.

Can Wayback captures be used in court?

They can be used, but courts still care about authentication, admissibility, and the scope of what an archived page actually proves. Legal commentary has treated Wayback evidence as useful but not automatic.

What is the End of Term Web Archive?

It is a collaborative project that preserves U.S. government websites during presidential transitions so the public record remains accessible even as administrations change sites and messaging.

Did the Wayback Machine face outages or attacks recently?

Yes. The Internet Archive reported cyberattacks in October 2024 and restored services in stages, bringing archive.org back in provisional read-only form while it rebuilt defenses.

Is the Wayback Machine enough on its own?

No single archive is enough. The Wayback Machine is crucial, but long-term web memory also depends on targeted tools such as Perma.cc, institutional archiving, standards like WARC, and deliberate preservation habits.

Author:
Jan Bielik
CEO & Founder of Webiano Digital & Marketing Agency

This article is an original analysis supported by the sources cited below.

Wayback Machine – Internet Archive
The official Wayback Machine homepage, including the current public scale of the archive and the Internet Archive’s description of the service.

Wayback Machine general information
Internet Archive help page explaining what the Wayback Machine is, where captures come from, and how the service developed.

Using the Wayback Machine
Official guidance on searching archived pages, reading capture information, and using Save Page Now.

Search – A basic guide
Internet Archive instructions for finding archived websites by URL and navigating the search interface.

Save pages in the Wayback Machine
Official documentation on manually archiving pages and using the browser extension.

Why can’t I see the Web page I archived recently?
Explanation of indexing delays and why a newly saved page may not appear immediately.

Archive whole web sites
Internet Archive help page describing Archive-It for large-scale or institutional web archiving.

How do I request to remove something from archive.org?
Official removal and exclusion guidance for archived material.

Want to help preserve the web? Save Page Now!
Internet Archive blog post encouraging public use of Save Page Now and explaining its preservation value.

The Wayback Machine’s Save Page Now is new and improved
Internet Archive post on the Save Page Now rewrite and better capture of modern web pages.

Brozzler
The Internet Archive’s browser-based crawling project used to improve capture of dynamic web content.

Wayback Machine APIs
Official API overview for querying Wayback capture data.

See whether a website exists in the archives
Developer tutorial on checking archived snapshots programmatically.

CDX file format
Internet Archive documentation for the CDX index format used to summarize archived captures.

WARC, Web ARChive file format
Library of Congress overview of the standard archival container used in web archiving.

RFC 7089: HTTP Framework for Time-Based Access to Resource States
The formal specification for Memento and time-based access to prior states of web resources.

Time Travel for the Web – Memento
A plain-language introduction to Memento and time-based navigation across web archives.

When online content disappears
Pew Research Center analysis of disappearing pages, broken links, and digital decay across the web.

Diving into digital ephemera: identifying defunct URLs in the web archives
Library of Congress discussion of link rot, reference rot, and web archival recovery work.

Perma: Scoping and Addressing the Problem of Link and Reference Rot in Legal Citations
Harvard Law Review Forum piece on link rot in legal scholarship and the development of Perma.cc.

Best evidence and the Wayback Machine
Fordham Law Review analysis of how archived webpages are treated as evidence.

How to avoid link rot
American Bar Association guidance on preserving online legal references, including the Internet Archive and Perma.cc.

New feature alert: access archived webpages directly through Google Search
Internet Archive announcement about Wayback links appearing through Google Search’s page information tools.

Wayback Machine Chrome extension now available
Official post on the browser extension that surfaces archived versions of dead pages.

9 ways web archives are used in digital investigations
Internet Archive article on recurring journalistic uses of web archives.

Studying the histories of digital media using the Wayback Machine
Internet Archive piece on academic use of web archives for historical and media research.

End of Term Web Archive
Official homepage for the collaborative archive of U.S. government websites during presidential transitions.

Background
Project description for the End of Term Web Archive and its preservation mission.

Update on the 2024/2025 End of Term Web Archive
Internet Archive update on the scale and purpose of the recent End of Term crawl.

Internet Archive services update 2024-10-21
Official service-status update describing restoration of the Wayback Machine after cyberattacks.

Learning from cyberattacks
Brewster Kahle’s post on the 2024 attacks and the phased return of Archive services.

End of Hachette v. Internet Archive
Internet Archive statement on the end of the publishers’ lawsuit and its institutional implications.

Take action: defend the Internet Archive
Internet Archive post on the recording-industry lawsuit and the pressure facing the institution.

One trillion web pages archived: Internet Archive celebrates a civilization-scale milestone
Official Internet Archive recap of the trillion-page milestone and its public significance.

News outlets are blocking Wayback Machine from archiving their pages
Reporting on major publishers restricting archival crawling amid AI-related concerns.

AI could mean the end of the Wayback Machine as news websites are increasingly blocking it
Coverage of publisher blocks, public-record concerns, and the growing pressure on the archive.

Reddit will block the Internet Archive
Reporting on Reddit’s decision to restrict most archival access for the Wayback Machine.