The AI text crossover will arrive sooner in chat than on the open web

AI may already be producing enough text each year to rival the stock of publicly available human text that powered the pre-ChatGPT web. But that does not mean the open internet has already become mostly AI-written. The decisive split is between text generated by AI systems and text actually published, indexed, crawled, preserved, and read on public websites.

On the broadest count, the crossover is likely 2026 to 2027, and may already be underway if private chatbot and API outputs are included. On the narrower and more socially visible count — public AI-written web pages overtaking the pre-AI human web corpus — the crossover is more likely to fall in the late 2030s or 2040s, unless automated publishing accelerates sharply.

The real question is not one date

The question sounds simple: when will AI generate as much text as the whole internet contained before generative AI? The answer breaks in two because “generated” and “online” are not the same thing.

A chatbot answer written for one user may be online in the technical sense that it passes through cloud infrastructure, but it usually does not become part of the public web. It is not indexed by Google. It is not saved by the Internet Archive. It is not linked by other sites. It is not sitting on a public URL where the next search crawler or training crawler can reliably find it. A product description, spam article, auto-generated FAQ page, SEO glossary, AI-translated help page, or synthetic news recap is different. That text enters the public web and becomes part of the material future search engines, answer engines, archives, regulators, publishers, and AI developers must handle.

The fastest crossover is private inference volume. The slower crossover is public web publication. Those two curves now move at different speeds. Chatbots and APIs can generate trillions of tokens in a day. Websites, even automated ones, publish more slowly because public pages face hosting costs, domain reputation limits, spam filters, ranking systems, human review, business incentives, legal exposure, and reader trust.

The strongest currently available anchor for the old human corpus is Epoch AI’s estimate that the effective stock of human-generated public text data is about 300 trillion tokens, with a 90% confidence interval from 100 trillion to 1,000 trillion tokens. That is not the literal total of every word that ever appeared online. It is an estimate of usable public human text for language-model training, adjusted for quality and repeated training passes. But it is the best public baseline for a serious calculation.

Against that baseline, current AI usage is already enormous. OpenAI said in July 2025 that ChatGPT received 2.5 billion prompts per day. Alphabet said in April 2026 that its first-party models process more than 16 billion tokens per minute via direct API use by customers, a figure that counts processed tokens, inputs as well as outputs, rather than newly generated output alone. These two numbers alone show that AI text production has moved from novelty to industrial throughput.

The public web tells a calmer story. A 2026 study using Internet Archive samples estimated that by mid-2025, roughly 35% of newly published websites were AI-generated or AI-assisted, up from near zero before ChatGPT’s launch in late 2022. That is a large shift, but it still refers to new sites, not the whole accumulated web.

A token is the cleanest unit, but it is still imperfect

Any estimate needs a unit. Pages are too crude because one page may contain 80 words, another 8,000, and another only navigation, ads, cookie banners, JavaScript, and boilerplate. Bytes are better, but web pages contain markup, scripts, images, duplicated menus, tracking code, comments, and template text. Words are intuitive, but language models count subword units. The closest practical unit is the token.

A token is not exactly a word. In English, a rough rule is that one token is about three-quarters of a word, though the ratio changes by language, writing system, punctuation, formatting, code, and URL density. A 1,000-word article may be around 1,300 tokens. A short page of product copy may be 300 tokens. A software documentation page with code snippets may be far denser. This matters because the crossover depends on whether we count compact human prose, scraped web pages, deduplicated text, boilerplate, comments, code, captions, transcripts, forum posts, or machine-generated private chat.
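
For readers who want the conversion explicit, the sketch below applies the rough English rule in Python; the 0.75 words-per-token ratio is the approximation described above, not a property of any particular tokenizer.

```python
# Rough words-to-tokens conversion for English prose using the
# ~0.75 words-per-token rule of thumb. Real ratios vary by language,
# formatting, code density, and tokenizer.
WORDS_PER_TOKEN = 0.75

def words_to_tokens(word_count: int) -> int:
    """Approximate token count for an English text of word_count words."""
    return round(word_count / WORDS_PER_TOKEN)

print(words_to_tokens(1_000))  # ~1,333 tokens for a 1,000-word article
print(words_to_tokens(225))    # ~300 tokens for a short product page
```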

A token-based estimate avoids the worst page-count traps, but it does not remove judgment. We still have to decide whether duplicated syndication counts. We still have to decide whether scraped boilerplate counts. We still have to decide whether private AI outputs count if they were never made public. We still have to decide whether AI-assisted human writing counts as AI-generated, human-generated, or a mixed category.

The “AI-generated or AI-assisted” phrase matters. Many texts now pass through models without being born from them. A journalist may use a model to structure notes but write the final story. A developer may ask for a draft and rewrite it. A company may translate a help-center page with AI and have a support lead approve it. A student may use AI to polish a paragraph. A search engine may synthesize a summary from source pages. All of those are mixed cases.

For a clean forecast, this article uses three categories. Generated AI text is text produced mostly by a model. AI-assisted text is human-directed text that has been drafted, rewritten, translated, summarized, or polished with a model. Public AI web text is generated or AI-assisted text that ends up on a publicly accessible URL and can be crawled, indexed, archived, or cited.

The date changes with the category. Generated AI text across chatbots may hit the old human-text stock first. AI-assisted text in business documents, email, code comments, slide notes, reports, and internal knowledge bases may be even larger, but it is mostly private. Public AI web text is the visible layer, and it is the layer that changes search, publishing, trust, and future training datasets.

The old human web was never a neat library

The pre-AI web was not a clean corpus of human wisdom. It was a messy mixture of news articles, blogs, forum comments, ecommerce pages, PDFs, support docs, social posts, scraped copies, spam, machine translations, parked pages, legal notices, autogenerated templates, software repositories, comments, academic abstracts, wikis, recipes, lyrics pages, travel pages, affiliate pages, and abandoned personal sites. Some of it was written with care. Some was bulk-produced by low-cost content farms long before modern LLMs.

That history matters because “without AI” does not mean “pure human expression.” The web has always contained automation. Template-generated pages, database-driven catalogs, weather pages, stock tickers, sports box scores, spam pages, scraped directories, spun articles, synonym-swapped SEO text, and machine-translated pages existed before ChatGPT. The difference after late 2022 is not that automation began. The difference is that fluent, cheap, general-purpose text generation became available to hundreds of millions of people and nearly every content operation.

Common Crawl gives one public view of scale. It says it maintains an open repository of web crawl data spanning over 300 billion pages across 15 years and adding 3 to 5 billion new pages each month. Its April 2026 crawl alone contained 2.19 billion web pages, 379.2 TiB of uncompressed content, captures from 43.2 million hosts, 35.4 million registered domains, and 660.5 million URLs not visited in prior crawls.

Those figures are large, but they do not translate directly into unique human text. A web crawl contains duplicates, near-duplicates, redirects, calendars, category pages, pagination, mirrored pages, repeated navigation, legal boilerplate, dynamic pages, and machine-created structures. A crawl can grow by hundreds of terabytes without adding hundreds of terabytes of unique article-like prose.

Netcraft offers another view. Its March 2026 web server survey received responses from 1.427 billion sites across 297.6 million domains and 14.2 million web-facing computers. That does not mean 1.427 billion active editorial websites. It counts sites responding to web server requests, including parked domains, placeholders, infrastructure pages, and low-content properties.

The human web was always smaller than raw site counts suggest and larger than cleaned training datasets suggest. That is why the crossover question needs ranges, not a single dramatic date.

The strongest public baseline is about 300 trillion tokens

Epoch AI’s 300-trillion-token estimate is central because it tries to measure the effective stock of human-generated public text data for training large language models. The authors give a wide interval because public human text is hard to measure and because training usefulness is not the same as raw text volume. Their peer-reviewed ICML version projects that model developers may train on datasets roughly equal to the available stock of public human text between 2026 and 2032 if current development trends continue.

That projection is often misunderstood. It does not say the internet will vanish. It does not say no new human text will be written. It does not say every page has been scraped perfectly. It says that the growth of training datasets is approaching the available stock of usable public human text. When models need tens or hundreds of trillions of tokens, the accessible public human corpus stops feeling infinite.

This matters for the AI text crossover because the same figure gives a usable benchmark: if AI systems generate roughly 300 trillion tokens of text, they have produced a volume comparable to the effective public human text stock that existed before the generative-AI wave. But the benchmark is not the whole internet in the storage sense. It is closer to the stock of public human text that could plausibly train language models.

The interval is wide. At the low end, 100 trillion tokens can be matched quickly by modern chatbots. At the high end, 1,000 trillion tokens requires sustained large-scale generation. The midpoint, 300 trillion, is useful because it is large enough to avoid hype and small enough to be reachable with present usage.

The public web after 2022 adds a second complication. New human text keeps appearing. So the target is not fixed unless we define it as the pre-generative-AI stock. If the question means “when will AI produce as much text as the web had accumulated before ChatGPT became mainstream,” then the baseline is roughly static. If it means “when will AI-written public text exceed all human-written public text ever online,” the target keeps moving as people continue publishing, posting, documenting, commenting, and transcribing.

For this article, the headline estimate uses the static version: the accumulated public human text stock before broad LLM adoption. The moving-target version comes later, and it pushes the public-web crossover further out.

Current AI systems can generate the baseline faster than the web can absorb it

The raw generation side is the most startling. OpenAI’s 2.5 billion daily prompts, reported in July 2025, do not disclose average answer length. But even modest assumptions produce huge annual text volumes. At an average of 300 output tokens per prompt, ChatGPT would produce about 274 trillion output tokens per year. At 700 output tokens per prompt, it would produce about 639 trillion output tokens per year. Those are calculations from the prompt volume, not confirmed OpenAI output totals.

Alphabet’s April 2026 disclosure is even larger in processed-token terms. More than 16 billion tokens per minute via direct API use equals about 23 trillion processed tokens per day and about 8.4 quadrillion processed tokens per year. Processed tokens include inputs and outputs, and direct API use excludes some consumer product use, so the number cannot be treated as AI-written output. But it shows the scale of model traffic now running through one major provider.
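
Both figures can be turned into annual volumes with simple arithmetic. The Python sketch below reproduces the calculations; the 300- and 700-token average answer lengths are illustrative assumptions, since neither company has disclosed average output length.

```python
# Back-of-envelope annual token volumes from the disclosed figures.
DAYS_PER_YEAR = 365

# ChatGPT: 2.5 billion prompts per day (reported July 2025), with assumed
# average answer lengths of 300 and 700 output tokens per prompt.
prompts_per_day = 2.5e9
for avg_output_tokens in (300, 700):
    tokens_per_year = prompts_per_day * avg_output_tokens * DAYS_PER_YEAR
    print(f"{avg_output_tokens} tokens/answer -> {tokens_per_year / 1e12:.0f}T output tokens/year")
# -> 274T and 639T output tokens per year

# Google first-party models: 16+ billion processed tokens per minute via
# direct API use (reported April 2026); processed counts inputs and outputs.
tokens_per_minute = 16e9
tokens_per_day = tokens_per_minute * 60 * 24
tokens_per_year = tokens_per_day * DAYS_PER_YEAR
print(f"{tokens_per_day / 1e12:.0f}T processed tokens/day, "
      f"{tokens_per_year / 1e15:.1f} quadrillion processed tokens/year")
# -> about 23T per day and 8.4 quadrillion per year
```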

If private chatbot answers, API completions, draft documents, code explanations, meeting summaries, emails, internal reports, synthetic training examples, and agent traces are counted, the AI text crossover is likely around 2026 to 2027 and may already be happening. The uncertainty is not whether models can produce that much text. The uncertainty is how much of the processed-token stream is output, how much output is natural-language text rather than code or structured data, how much is saved, how much is duplicated, and how much should count as “online.”

This is the first major answer to the headline question. If the phrase “AI generates” means “AI systems produce text anywhere in cloud products,” then the old human web is no longer an unreachable mountain. It is a yearly or even sub-yearly industrial volume. The internet’s accumulated public human text took decades to produce. AI systems can now produce comparable token counts rapidly because they are called billions of times a day and because each call may produce hundreds or thousands of tokens.

But this answer is also the least visible. Most private AI text disappears into chat histories, documents, workflows, API logs, temporary memory, code editors, customer-service systems, and internal tools. It changes work habits before it changes the public surface of the web.

Public publication remains the slower constraint

Publishing is a bottleneck. A model can draft a 1,500-word article in seconds, but getting that article onto the web at scale is not only a generation problem. It requires a content management system, a domain, hosting, indexing, internal links, external links, templates, titles, images, compliance decisions, spam avoidance, reputation management, and some expected return. Low-quality operators can automate much of this, but they still collide with platform friction.

Google’s March 2024 spam policy update is relevant here. Google says “scaled content abuse” is when many pages are generated mainly to manipulate search rankings and not help users, and it applies whether content is created through automation, human work, or both. The policy does not ban AI content as such. It targets the purpose and quality of scaled publishing.

That policy creates an economic brake. Publishing a huge number of AI pages is easy. Getting those pages discovered, ranked, clicked, monetized, trusted, and defended from spam classifiers is harder. The brake is imperfect. A lot of low-value AI content still gets published. But there is a difference between generating text and making public text that matters in search, answer engines, social discovery, and reputation systems.

Graphite’s web-article study captures that split. It found that AI-generated articles surpassed human-written articles in its sample in November 2024, but also reported that the proportion plateaued after May 2024. Its search and answer-engine study found that although many AI articles are being published, 86% of articles ranking in Google Search were written by humans, and 82% of articles cited by ChatGPT and Perplexity were written by humans in its sample.

That does not settle the whole internet, but it points to a pattern: AI content may dominate some publishing flows without dominating visible authority. Automated text can flood the edges of the web while human-written or deeply human-edited pages still hold the center of search, citations, reputation, and user trust.

AI-written websites are already a large share of new web creation

The strongest web-wide AI-content estimate available in the current research record is the 2026 paper “The Impact of AI-Generated Text on the Internet,” by researchers affiliated with Imperial College London, the Internet Archive, and Stanford. The study built representative samples of websites published between 2022 and 2025 using the Internet Archive and applied an AI text detector. Its headline result is that by mid-2025, roughly 35% of newly published websites were classified as AI-generated or AI-assisted, up from zero before ChatGPT’s launch in late 2022.

The study also found correlations with lower semantic diversity and more positive sentiment, while not finding statistically strong evidence for some feared outcomes such as web-scale factual degradation or strict stylistic monoculture. That distinction is useful. It suggests the web may become more synthetic before it becomes obviously false in every measurable way. The risk is not only wrong facts. It is sameness, incentives, opacity, and the erosion of confidence when readers no longer know whether a page reflects a human source, a model average, a marketing operation, or a content farm.

A 35% share of newly published websites is not a small number. It means the public web has entered a mixed-authorship period. But a share of new websites is not cumulative volume. If a city builds 35% of new apartments from a new material this year, the whole city is not 35% made of that material. The old stock remains. The same is true online. The accumulated human web is large, and older pages do not disappear all at once.

The cumulative crossover therefore depends on how much new AI text is published every year, how much old human text remains online, how fast AI text grows, how much is removed or deindexed, and whether AI pages are short spam pages or long substantive documents. A billion AI-generated thin pages might contain less text than a smaller number of long reports, documentation hubs, product catalogs, knowledge bases, transcripts, or synthetic Q&A archives.

The public-web crossover is not the moment when AI becomes common. That moment has already arrived. The crossover is the moment when accumulated public AI text volume rivals the accumulated pre-AI human web text stock. That is a much higher bar.

Articles crossed faster than the wider web

Article publishing is one of the first places AI text becomes visible because the format is easy to automate. A model can draft a listicle, explainer, buyer guide, local-service page, glossary entry, product roundup, crypto explainer, health-adjacent article, travel itinerary, or software tutorial with minimal structure. Many such pages are created for traffic rather than editorial need.

Graphite’s article sample found that AI-generated articles overtook human-written articles in November 2024, with rapid growth after ChatGPT’s launch. It selected English-language articles from Common Crawl with article schema, minimum length, and publication dates from January 2020 to May 2025. It used Surfer’s AI detector and classified an article as AI-generated if more than half of its content was predicted to be AI-generated.

This result is powerful but narrow. It covers a specific kind of web page: English-language articles and listicles matching a defined sample process. It does not cover ecommerce database pages, social posts, PDFs, documentation, support pages, forum discussions, video transcripts, comments, academic papers, newsletters, or private documents. It also depends on detector performance, which remains contested across the field.

Even so, it helps explain public perception. People often experience the web through articles, snippets, search results, and answer-engine citations. If a large share of newly published article-like pages becomes AI-written, users may feel the web changing even before the cumulative token count crosses the old human corpus.

Article volume is also not the same as article influence. Graphite’s separate search and answer-engine study found that AI-generated articles were less represented among Google rankings and citations from ChatGPT and Perplexity than in raw publishing volume. That creates a two-layer internet: a production layer where AI text is abundant, and a visibility layer where human-authored or human-edited work still appears to hold an advantage.

This is why a clean “AI has overtaken the internet” claim can mislead. AI may overtake new articles before it overtakes new pages, and new pages before cumulative pages, and cumulative low-value pages before trusted pages, and trusted pages before source material used by researchers. Each layer has its own clock.

Cumulative public AI text is likely years behind private AI output

To estimate the public-web crossover, start with the 300-trillion-token benchmark. Then ask how much AI-generated or AI-assisted text is being published on public URLs per year. There is no official answer. The best approach is to triangulate from Common Crawl scale, the 35% new-website estimate, the article-level studies, and the difference between raw crawled pages and unique text.

A rough public-web scenario looks like this. Suppose the open web adds many billions of newly discovered URLs per year, but only part of that is substantive text. Suppose each substantive new public page contains hundreds to low thousands of tokens after cleaning. Suppose 35% to 50% of that new text is AI-generated or AI-assisted in the current period. Under those assumptions, public AI text might be accumulating at single-digit to low tens of trillions of tokens per year, not hundreds of trillions.
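
To make that scenario concrete, here is a minimal sketch of the accumulation arithmetic; every input is an assumption chosen to match the ranges described above, not a measured figure.

```python
# Illustrative scenario for annual public AI-token accumulation.
# Every parameter below is an assumption, not a measured value.
new_substantive_pages_per_year = 10e9        # assumed substantive new public pages per year
tokens_per_page_low, tokens_per_page_high = 300, 2_000   # assumed tokens per page after cleaning
ai_share_low, ai_share_high = 0.35, 0.50     # assumed AI-generated or AI-assisted share

low = new_substantive_pages_per_year * tokens_per_page_low * ai_share_low
high = new_substantive_pages_per_year * tokens_per_page_high * ai_share_high
print(f"~{low / 1e12:.0f}T to ~{high / 1e12:.0f}T public AI tokens per year")
# -> roughly 1T to 10T per year under these assumptions
```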

The range can move. If automated publishing expands across ecommerce, local SEO, programmatic landing pages, AI-translated documentation, synthetic Q&A, and enterprise help centers, the annual public AI-token figure could rise sharply. If search engines, hosts, ad networks, regulators, browser vendors, and answer engines impose stronger friction, it could flatten. If AI assistants reduce clicks to websites, the incentive to mass-publish AI pages may fall because fewer pages will earn traffic.

This is the core reason the public-web crossover likely falls later than the private-output crossover. AI systems can generate hundreds of trillions of tokens a year across chat and API use. Public sites may publish a much smaller share of that output because most AI writing is consumed inside workflows.

A conservative public-web estimate points to the late 2030s or 2040s for published AI text to rival the pre-AI human public text stock. A more aggressive scenario, with programmatic AI publishing growing and AI-assisted text counted broadly, could pull the date into the early-to-mid 2030s. A restrictive scenario, with search and platform systems heavily suppressing low-value synthetic pages, could push it beyond 2050 for the visible, crawlable web.
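
Turning annual accumulation into a crossover year is a single division against the 300-trillion-token benchmark. The sketch below shows how assumed constant rates map to dates; real rates would change over time, so this only illustrates why the scenarios land decades apart.

```python
# Years for cumulative public AI text to reach the ~300T benchmark under
# assumed constant annual rates (illustrative; real rates would change).
BENCHMARK_TOKENS = 300e12
START_YEAR = 2025

scenarios = {"aggressive": 40e12, "conservative": 20e12, "restrictive": 5e12}
for name, tokens_per_year in scenarios.items():
    crossover_year = START_YEAR + BENCHMARK_TOKENS / tokens_per_year
    print(f"{name}: ~{crossover_year:.0f}")
# -> aggressive ~2032, conservative ~2040, restrictive ~2085
```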

Those ranges are not dramatic enough for viral headlines, but they fit the evidence better than a single year.

Core estimates behind the crossover

Measurement anchors for the AI text crossover

Measure | Current public anchor | Meaning for the forecast
Effective public human text stock | About 300T tokens, 100T–1,000T range | Best baseline for the pre-AI public human corpus
ChatGPT prompt volume | 2.5B prompts per day in July 2025 | Private AI output can reach web-scale text volume quickly
Google direct API model traffic | 16B+ tokens per minute in April 2026 | Processed-token scale is already enormous, though not all output
New AI-assisted websites | About 35% of newly published sites by mid-2025 | AI is now a large share of new public web creation
AI article publishing | AI articles surpassed human articles in one 2024 sample | Article formats crossed earlier than the whole public web

This table should be read as a map, not a proof. The measures come from different methods and count different things. The safest conclusion is directional: AI generation volume has already reached web-historical scale, while public AI web accumulation is still catching up.

The most defensible date range

The direct answer is this: AI-generated text probably reaches the volume of the pre-AI public human text stock around 2026–2027 if all chatbot and API outputs are counted. Publicly published AI web text probably reaches that same stock much later, with a central estimate in the late 2030s or 2040s.

The first date range comes from current model use. ChatGPT alone, at 2.5 billion prompts per day, would generate about 274 trillion output tokens per year if the average answer were 300 tokens. That is already close to Epoch AI’s 300-trillion-token public human-text benchmark. With longer average answers or additional providers, the number exceeds the benchmark. This is a rough calculation, but it is grounded in a disclosed prompt-volume figure and a transparent output-length assumption.

The second date range comes from publication friction. A large part of AI output never becomes public. The public web is not a sink that absorbs every chatbot answer. It absorbs content that somebody chooses to publish, maintain, index, monetize, promote, syndicate, or archive. Even with 35% of newly published websites classified as AI-generated or AI-assisted by mid-2025, the accumulated public human web remains vast.

A useful shorthand is:

Generated anywhere: likely 2026–2027.
Published on the open web: likely late 2030s to 2040s.
Visible in trusted search and answer results: later still, or perhaps never in raw volume terms if ranking systems continue to favor human-led authority.

That last clause matters. AI content can dominate raw volume while not dominating trusted attention. Search engines, recommendation systems, answer engines, universities, publishers, enterprise buyers, and users do not treat every token equally. A million copied AI pages may add volume but little authority. A single standards document, court filing, scientific paper, investigative report, or government dataset may shape knowledge more than thousands of synthetic posts.

Three scenarios for the crossover

Plausible timelines by counting method

Counting method | Plausible crossover | Reason
All AI text generated in chatbots and APIs | 2026–2027 | Current prompt and token-processing volumes are already near or above public human-text stock scale
All saved AI-assisted business and consumer text | 2027–2030 | Internal drafts, reports, emails, summaries, and code comments multiply quickly but remain partly private
Publicly crawlable AI-written web text | Late 2030s–2040s | Publication, indexing, spam controls, and traffic incentives slow accumulation
Trusted AI-written pages visible in search and answer engines | Uncertain, likely later | Ranking, citations, authority, and human review filter raw synthetic volume

The table separates volume from visibility. The fastest-growing layer is not the one readers see most often. The public internet may feel increasingly synthetic long before synthetic pages become the majority of trusted, cited, or commercially valuable web text.

Search policy is a brake on public AI volume

Google’s policy stance matters because search traffic remains one of the main reasons businesses publish text to the open web. If search systems rewarded every AI page equally, automated publishing would scale without much restraint. If search systems demote low-value scaled content, many AI pages become commercially weak even if they are cheap to produce.

Google’s March 2024 update framed the issue around scaled abuse rather than AI authorship. The policy says abuse occurs when many pages are generated mainly to manipulate rankings and not help users, “no matter how it’s created.” That phrasing is central. It avoids the impossible task of banning every AI-assisted sentence and targets the economic behavior that floods search results with low-value pages.

For publishers and SEO teams, this creates a practical rule: AI use is not the core risk; scaled unoriginal publishing is. A human can produce spam. A model can assist useful research. A team can use AI to clean transcripts, structure expert notes, translate verified documentation, or draft summaries that editors fact-check. A content farm can use AI to manufacture thousands of near-identical pages with no source depth. The ranking and policy question is not only who typed the sentence. It is whether the page adds verifiable value.

This brake does not stop AI publication. It shapes it. Some operators publish at scale and accept churn: domains burn, pages vanish, traffic spikes and falls. Others use AI quietly in editorial workflows while preserving human reporting, expertise, and accountability. Enterprise sites use AI to expand help centers, product documentation, and customer support knowledge bases. Local businesses use AI for service pages. Ecommerce sites use AI for product descriptions. Agencies use AI for outlines and drafts.

The net effect is uneven. Low-value public AI pages face more friction than private AI output. High-value AI-assisted pages face less friction if they carry human review, original data, expert experience, or brand trust. This means the future public web may not be a simple flood of raw synthetic text. It may be a layered web where AI is present everywhere but visible authority depends on human proof.

Answer engines change the incentive to publish

The old web economy rewarded pages that could attract clicks. AI answer engines change that economy by giving users answers directly, sometimes with citations, sometimes with links, and sometimes with enough summary that users never visit the source. That does not reduce text generation. It may reduce the incentive to publish weak pages for search traffic.

Google’s AI Overviews reached more than 1.5 billion monthly users by Q1 2025, according to Alphabet’s investor call. Google later made AI Mode a larger part of Search, and by Q1 2026 Alphabet said people were returning to Search more with AI Mode and AI Overviews. These are not side experiments. They are becoming front-door information interfaces.

For publishers, this creates tension. AI summaries can use web material to answer questions while reducing the need to click through. Reuters reported in July 2025 that independent publishers filed an EU antitrust complaint over Google’s AI Overviews, alleging harm to traffic, readership, and revenue, and arguing that publishers could not opt out of AI use without losing search visibility. Google disputed the claims and said AI experiences create new discovery opportunities.

Regulators are now examining that control point. The UK Competition and Markets Authority proposed requirements under which publishers would have more control over how their content is used in AI Overviews and AI Mode, including opt-out options and attribution measures. Google has said it is developing further controls to let sites opt out of generative AI features in Search.

This affects the AI text crossover because AI answers may reduce the payoff from mass-producing public pages. If users get answers from search interfaces, content farms may produce fewer pages for traditional clicks. Or they may produce more pages to feed answer engines. The outcome depends on whether answer engines reward public sources, licensed sources, direct feeds, structured data, human brands, or synthetic pages that mimic expertise.

The web’s public layer may become more curated, not less

A common fear is that the web becomes an infinite landfill of machine text. Some of that is already true at the margins. Yet the more AI text appears, the more valuable curation becomes. Search engines, answer engines, social platforms, browsers, archives, universities, journalists, regulators, and users all gain reasons to separate source-backed material from synthetic repetition.

Graphite’s finding that human-written articles dominate Google rankings and answer-engine citations, despite AI articles being abundant in publishing volume, points to this direction. It does not prove search engines can always detect AI writing. It suggests that ranking systems may already select for signals that correlate with human work: originality, links, brand, expertise, user engagement, source depth, topical authority, updates, and editorial reputation.

The public web may therefore split into three layers. The first layer is synthetic bulk, made of low-cost pages that target long-tail queries, affiliate traffic, ad impressions, or automated site growth. The second layer is AI-assisted operations, where companies and publishers use models to speed writing, translation, formatting, support, and documentation while keeping human accountability. The third layer is human-source authority, where readers and machines prefer material tied to reporting, data, expertise, law, science, lived experience, or accountable organizations.

The volume crossover could happen in the first layer without changing the third layer. That is why the phrase “AI will generate more text than humans” can be true and incomplete. Text volume is not the same as knowledge supply. A model can generate a thousand variations of a paragraph about mortgage rates, but that does not replace the central bank decision, the lender’s actual offer, the law, the borrower data, or the journalist’s verification.

The future web may contain more AI text and more demand for human provenance at the same time. That is not a contradiction. Synthetic abundance often raises the market value of trusted scarcity.

Detection is fragile but still useful for population estimates

AI text detection is not a magic scanner. It can fail when models improve, when humans edit outputs, when non-native writers are misclassified, when text is short, when content is technical, when writers mimic model style, or when models are prompted to vary tone. Any estimate that relies on detectors must be read with caution.

But fragile does not mean worthless. Population-level research can use detectors with calibration, false-positive controls, baseline periods, confidence intervals, and sensitivity checks. The Wikipedia study by Brooks, Eggert, and Peskoff calibrated thresholds to maintain a 1% false-positive rate on pre-GPT-3.5 articles and found that detectors flagged over 5% of newly created English Wikipedia articles as AI-generated, with lower shares for German, French, and Italian.
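
The calibration logic is worth spelling out. The sketch below shows the general approach: pick the detector threshold that flags about 1% of a known-human, pre-LLM baseline, then apply it to new documents. The score distributions here are hypothetical placeholders, and the study's actual pipeline differs in detail.

```python
import numpy as np

# Hypothetical detector scores (higher = more "AI-like"); placeholders only.
rng = np.random.default_rng(0)
baseline_scores = rng.beta(2, 8, size=10_000)   # known-human, pre-LLM articles
new_scores = np.concatenate([rng.beta(2, 8, size=9_400),   # unknown mix of human
                             rng.beta(8, 2, size=600)])    # and synthetic articles

# Calibrate: pick the threshold that flags ~1% of the human baseline,
# so the expected false-positive rate is about 1%.
threshold = np.quantile(baseline_scores, 0.99)

flagged_share = float((new_scores > threshold).mean())
print(f"threshold={threshold:.3f}, flagged share of new articles={flagged_share:.1%}")
```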

The Internet Archive-based web study similarly tries to estimate broad patterns rather than prove authorship for any single page. That is the right use case. It is risky to accuse a specific writer based on a detector. It is more defensible to examine large samples and ask whether the distribution of new web text has shifted after the launch of LLM tools.

The detection challenge also affects the crossover date. If AI-assisted writing becomes deeply edited, detector-based estimates will undercount it. If human writers adopt AI-like phrasing because the web’s style changes, detector-based estimates may overcount it. If AI systems become better at imitating human variation, future public estimates may become less reliable even as AI use rises.

That is why the best future metrics will combine signals: publication metadata, editing history, authorship disclosure, content provenance standards, model-watermark research, server-side generation logs, platform policies, crawl samples, corpus deduplication, and economic incentives. Detector scores alone cannot answer the crossover question. They can show the direction of travel.

Model collapse makes the public crossover technically important

The public-web crossover is not only a media or SEO story. It affects AI development. Models are trained on large corpora, and the web has been a core source for those corpora. If future web crawls contain high shares of AI-generated text, future training datasets may include more synthetic material unless developers filter it.

The model-collapse literature is the warning label. The Nature paper “AI models collapse when trained on recursively generated data” defines model collapse as a degenerative process in which generated data pollutes the training set for later generations, causing models to misperceive the original distribution. The authors stress that access to real human-produced data matters, especially for distribution tails.

The risk is not that every synthetic token is poison. Synthetic data can be useful when produced, filtered, and mixed carefully. It can train models in math, code, reasoning traces, controlled simulations, rare tasks, and safety evaluations. The risk is indiscriminate recursion: models trained on web crawls that increasingly contain unmarked outputs from earlier models, especially if those outputs compress the diversity, errors, bias, optimism, omissions, or stylistic habits of prior systems.

The Internet Archive-based study makes that risk more concrete. Its estimate that 35% of newly published websites were AI-generated or AI-assisted by mid-2025 means future web-scale corpora will contain a large synthetic fraction unless filtering improves. The authors explicitly frame this as a concern for future foundation models trained on contemporary internet data.

This is where the public-web date matters more than the private-output date. Private chatbot outputs do not automatically enter training corpora unless providers use them, users publish them, or logs are included under policy. Public web pages are different. They are crawlable. They can be scraped by many actors. They can propagate into datasets without strong provenance. The public crossover therefore changes the training environment for the whole field.

Synthetic data is not the same as public AI slop

The term “AI slop” is useful for low-value public content, but it should not be stretched to cover all synthetic data. AI-generated text can be junk, but it can also be controlled, labeled, tested, and useful. The distinction matters for the forecast because private synthetic data may grow faster than public synthetic pages, and it may improve models without flooding the open web.

Synthetic data includes generated math problems, code tasks, unit tests, translation pairs, safety examples, simulated dialogues, tool-use traces, planning examples, and structured reasoning demonstrations. It can be created for a purpose and filtered against known answers. Public AI slop is different. It is usually made to fill pages cheaply, attract attention, imitate authority, or scale content operations without adding new evidence.

The best AI labs already treat data quality as a major differentiator. FineWeb, for example, is a cleaned and deduplicated English web dataset derived from Common Crawl; Hugging Face now describes the processed dataset as more than 18.5 trillion tokens, up from roughly 15 trillion at its original release. The FineWeb paper describes a 15-trillion-token corpus from 96 Common Crawl snapshots and a 1.3-trillion-token FineWeb-Edu subset filtered for educational text.

RefinedWeb showed a similar lesson earlier: filtered web data alone could produce strong models, and its authors extracted five trillion tokens from Common Crawl after filtering. The point is that web-scale data is not used raw. It is cleaned, deduplicated, filtered, scored, and mixed.

As public AI content grows, the data-preparation burden rises. Developers need to identify source quality, remove duplicates, track provenance, separate synthetic and human-originated data, and preserve rare or high-value human material. The old web could be messy and still useful because much of it was human-originated. The new web may require stronger filtering because fluency no longer signals human origin, effort, or truth.
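
As a small illustration of what that preparation layer involves, the sketch below performs the simplest possible step, exact deduplication after light normalization. Production pipelines such as FineWeb or RefinedWeb add URL filtering, language identification, near-duplicate detection, and quality scoring on top of this.

```python
import hashlib

def dedup_exact(documents: list[str]) -> list[str]:
    """Keep the first occurrence of each document after light normalization."""
    seen: set[str] = set()
    kept: list[str] = []
    for doc in documents:
        normalized = " ".join(doc.lower().split())   # collapse whitespace, lowercase
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept

docs = ["The same boilerplate page.", " the  same boilerplate page.", "An original report."]
print(dedup_exact(docs))   # -> ['The same boilerplate page.', 'An original report.']
```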

The crossover will not look like a visible switch

There will be no morning when the internet changes color and becomes “mostly AI.” The crossover will be statistical, uneven, and disputed. It will arrive earlier in some languages, sectors, and formats than others. It will be hidden in private systems before it becomes visible in public search results.

English commercial articles may cross early because tools, incentives, and market size are aligned. Low-cost affiliate publishing may cross early. Programmatic local pages may cross early. Product descriptions may cross early. AI-translated help content may cross early. Social comments may be harder to classify. Scientific abstracts may show measurable AI phrasing without being fully AI-written. Legal, medical, and government texts may adopt AI more slowly in public but use it heavily in drafting.

Languages with fewer high-quality public corpora may face different timelines. English has the largest AI tooling market and the most web-scale training material. Smaller languages may see a faster proportional shift if translation tools flood them with synthetic content, or a slower one if local institutions, regulation, and lower monetization reduce mass publishing. The crossover is likely to be language-specific, not global in one stroke.

Format also matters. AI text in code comments and developer documentation may grow rapidly. AI text in peer-reviewed research may be constrained by journals, disclosure rules, and reputational risk, though editing assistance is already common. AI text in ecommerce may explode because every product, variant, and marketplace can generate descriptions. AI text in news may remain more contested because trust, liability, and sourcing matter.

This unevenness will produce contradictory headlines. One study may say AI articles have overtaken human articles. Another may say Google’s top results remain mostly human-written. A third may say 35% of newly published websites are AI-assisted. A fourth may say model providers process quadrillions of tokens. All can be directionally true because they count different layers.

The business impact starts before the crossover

Publishers do not need to wait for a 50% global crossover to feel the impact. The business shift begins when synthetic content changes user expectations, ad markets, search visibility, production costs, and competitive pressure. A human publisher competing against AI-generated commodity explainers feels pressure long before AI text exceeds the whole pre-AI web.

The first impact is cost compression. A company that once paid for hundreds of short service pages can now generate drafts cheaply. That does not mean the pages are good. It means the supply of plausible text has expanded. When supply expands, generic content loses price power. Writers, agencies, and publishers who sold interchangeable text are exposed.

The second impact is trust scarcity. As low-cost text rises, brands with proven expertise, named authors, primary data, editorial standards, citations, and user trust become more valuable. A page that says something original, shows real testing, cites primary documents, includes experience, or carries legal accountability has a stronger claim than a synthetic summary.

The third impact is distribution volatility. Search engines and answer engines may change how they rank, cite, or summarize pages. Publishers may lose traffic even if their content is used as source material. This is why AI Overviews and publisher opt-out debates matter. They shape whether the open web remains economically worth publishing to.

The fourth impact is content governance. Companies must decide which AI uses are acceptable, which require review, which require disclosure, and which are too risky. A regulated business cannot treat generated advice like disposable SEO copy. A medical publisher, financial firm, law firm, university, or government agency needs source control, review trails, and accountability.

The business problem is not that AI will write everything. The business problem is that AI makes average text cheap, so value moves toward proof, judgment, data, brand, and distribution.

SEO and GEO shift from text production to source authority

Search engine optimization used to reward publishing useful pages at scale when those pages matched demand better than competitors. Generative engines add a new layer: pages may be read not only by humans and search crawlers, but by answer engines that extract, summarize, cite, or ignore them. The future contest is not only “rank on Google.” It is also “be selected, trusted, and cited by AI systems.”

This is where GEO — generative engine optimization — enters, though the phrase can become empty if treated as a trick. The practical version is clear: make content easier for machines to verify and safer for humans to trust. That means primary sources, structured facts, named entities, concise definitions, clear dates, author credentials, citations, schema, original research, durable URLs, and content that answers real questions without hiding behind fluff.

AI-generated pages that merely rephrase existing sources may feed the volume curve, but they are weak candidates for durable visibility. Answer engines need sources that reduce risk. A page with original test data, legal text, official pricing, direct documentation, human reporting, or first-party experience gives a model something to ground on. A generic generated article gives it another paraphrase.

Graphite’s findings on search and answer-engine citations fit this view. Human-written articles were far more present among Google rankings and ChatGPT/Perplexity citations than raw AI article publishing volume might suggest. The study did not evaluate heavily human-edited AI-assisted content, which may perform better.

For publishers and businesses, the message is blunt: do not compete with AI at the level of paragraph volume. Compete at the level of evidence. The more synthetic text exists, the more retrieval systems need signals that a page is anchored to something outside the model’s own language distribution.

The open web could shrink in value even as text volume grows

The public web can grow in token volume while shrinking in economic value for publishers. That is the paradox behind many current disputes. If AI summaries satisfy users, publishers may receive fewer visits. If fewer visits produce less revenue, fewer publishers can fund reporting or expert content. If fewer expert sources remain public, answer engines have less reliable material to summarize. The machine-generated layer can expand while the human source layer weakens.

This dynamic is visible in the publisher complaints around AI Overviews. Publishers argue that their content is being used to produce summaries that reduce traffic and revenue. Google argues that AI features create new opportunities and that traffic changes have many causes. Regulators are trying to define control, attribution, and opt-out rights.

The crossover date is relevant here because it can distract from the nearer issue. The web does not have to become majority-AI by volume for journalism to suffer. If search traffic falls, ad rates weaken, affiliate revenues shift, and users rely on summaries, the economics of original publishing can degrade before synthetic text becomes dominant in cumulative volume.

The same applies to independent blogs, forums, and niche experts. If their material is scraped, summarized, and displaced by AI interfaces, fewer may publish openly. Some may move to private communities, newsletters, paywalls, walled platforms, or licensed feeds. That would make the public web less representative of human knowledge even if people keep writing privately.

The public corpus of the future may not be limited by human ability to write. It may be limited by whether humans still have reasons to publish openly.

Data licensing becomes a central market

As public human text becomes scarce relative to model demand, licensed data becomes more valuable. The AI industry has already moved from broad scraping toward a mixed system of public web data, licensed archives, enterprise data, user data under product terms, synthetic data, and domain-specific datasets.

Epoch’s projection that public human-generated text could be fully used by model developers between 2026 and 2032 helps explain the rush. If the accessible public web no longer supplies clean growth, labs seek private corpora, publisher deals, book archives, code repositories, video transcripts, customer-service logs, scientific databases, enterprise documents, and human-feedback data.

This market will not treat all text equally. High-volume generic text is cheap because models can generate it. High-trust, high-signal, human-originated text becomes expensive. Legal documents with metadata, expert Q&A, verified medical content, financial filings, support conversations with outcomes, code with test results, scientific papers with peer review, and forum threads with authentic human problem-solving all carry more training value than synthetic paraphrases.

That shift affects publishers. A publisher’s archive may be worth more as licensed AI training or grounding material than as ad-supported page views. But licensing also raises hard questions: who owns the value of user comments, journalist work, editor curation, photographs, transcripts, and historical archives? What compensation is fair? What consent is required? What happens when opting out of AI use also affects search visibility?

The public AI text crossover may therefore accelerate the pricing of human-origin data. Once synthetic text is abundant, the scarce asset is not fluent language. It is verified human context.

Compute growth removes one barrier and exposes another

AI text volume depends on compute, inference cost, demand, and distribution. Compute capacity has expanded rapidly. Epoch AI estimated that global AI computing capacity has been growing by about 3.3× per year since 2022, equivalent to a doubling time of about seven months. That growth supports more inference, lower costs, longer contexts, more agentic workflows, and more generated text.
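
The doubling-time conversion is a one-line formula: for a growth factor g per year, the doubling time in months is 12 × ln(2) / ln(g). A quick check of the 3.3× figure:

```python
import math

annual_growth_factor = 3.3
doubling_time_months = 12 * math.log(2) / math.log(annual_growth_factor)
print(f"{doubling_time_months:.1f} months")   # -> about 7.0 months
```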

Alphabet’s token-processing disclosure shows what that means in practice. Direct API use of Google’s first-party models already runs at more than 16 billion tokens per minute. This is not only consumer chat. It is developers and customers building AI into products, workflows, agents, support systems, coding tools, analytics, and document pipelines.

But compute growth does not automatically put more text on public websites. It expands what models can do. It lowers the cost of drafts, summaries, translations, code generation, and agent actions. The publication layer still depends on incentives. If the economic reward shifts away from public pages and toward closed assistants, internal systems, apps, and licensed feeds, AI text volume may grow mostly outside the open web.

Energy and infrastructure also shape the ceiling. The International Energy Agency projects large growth in data-center electricity use through 2030, with strong regional concentration and grid challenges even when data centers remain a smaller share of total global electricity-demand growth than some other sectors.

The result is a two-sided forecast. Compute growth makes the private-output crossover early. Energy, cost, policy, and distribution pressures make the public-web crossover slower and more uneven. The AI industry can generate far more text than the public web has a reason to host.

The role of Internet Archive and Common Crawl becomes more sensitive

Public web archives are no longer passive historical projects. They have become infrastructure for AI research, media accountability, legal evidence, cultural memory, and measurement of the synthetic web. The Internet Archive celebrated one trillion archived web pages in October 2025. Common Crawl says its open corpus spans over 300 billion pages and is cited in more than 10,000 research papers.

These archives are central to measuring the crossover. Without longitudinal crawl data, researchers cannot compare pre- and post-ChatGPT web composition. Without archives, deleted AI spam, vanished human pages, edited articles, and shifting site structures become hard to study. The Internet Archive-based AI-text study shows the value of historical sampling: it can estimate changes in newly published websites across time.

But archives also face pressure. AI companies want data. Publishers want control. Platforms block crawlers. Anti-scraping systems can block legitimate preservation. Lawsuits and licensing deals reshape access. If the open web becomes harder to crawl, measuring AI-generated public text becomes harder. The more contested the web becomes, the less transparent the crossover may be.

Common Crawl’s April 2026 release shows the scale of the measurement task: billions of pages, hundreds of terabytes, tens of millions of hosts, and hundreds of millions of newly seen URLs in a single crawl. Raw scale alone is not enough. Researchers need sampling methods, language detection, boilerplate removal, deduplication, publication-date inference, page-type classification, AI-text estimation, and uncertainty bounds.

The public AI text crossover will be known only if the public web remains measurable. That is not guaranteed.

Wikipedia shows the value of human governance

Wikipedia is a useful case because it is open, heavily moderated, historically human-edited, and attractive to AI users. The ACL study on AI-generated content in Wikipedia found over 5% of newly created English Wikipedia articles were flagged as AI-generated under calibrated detector thresholds. It also reported that flagged pages tended to be lower quality, self-promotional, or partial toward a viewpoint.

This does not mean Wikipedia is overrun. It means even a tightly governed public knowledge project has to handle AI-assisted writing. Wikipedia’s value comes from citations, edit history, neutrality norms, talk pages, reversions, volunteer review, and community rules. Those are forms of human governance. AI can draft text, but it does not replace the social system that decides whether the text belongs.

The lesson applies to the wider web. Sites with strong governance will absorb AI differently from sites built for volume. A news organization with editors, corrections, and source rules may use AI for transcription, translation, drafts, and summaries without becoming a content farm. A scientific journal may allow language editing but require disclosure and author responsibility. A government agency may use AI to simplify documents but keep legal review. A spam network may publish model outputs directly.

The crossover therefore has a governance dimension. AI text volume rises fastest where review is weakest and incentives reward scale. It rises more slowly, or more safely, where reputation and accountability matter. This is why the cumulative web may become more polarized: low-governance zones fill with AI bulk; high-governance zones use AI but maintain human control.

Readers may respond by trusting domains, authors, institutions, communities, and provenance signals more than individual paragraphs. The author page, editorial policy, citation trail, review history, and institutional reputation become part of the content itself.

Scientific and educational publishing face a different clock

Scientific, academic, and educational writing will not follow the same timeline as commercial articles. AI tools are already used for editing, translation, coding, literature review, summarization, and drafting. But academic publishing carries norms around authorship, originality, disclosure, peer review, and citation. Those norms slow full automation while allowing widespread assistance.

The AI-generated text issue in science is not only volume. It is whether AI changes the distribution of questions, wording, citations, peer review, and publication incentives. If AI makes mediocre papers easier to draft, journals face more submissions. If AI helps non-native English speakers polish manuscripts, access improves. If AI fabricates citations or smooths uncertainty, trust suffers. If AI systems summarize papers for researchers, citation patterns may shift.

Educational content is also attractive for AI generation. FineWeb-Edu’s 1.3-trillion-token subset shows the value model developers place on educationally useful text. If AI-generated educational pages flood the web, future models may train on material that is fluent but not pedagogically tested. If human teachers and experts use AI to produce reviewed materials, the outcome is different.

The public-web crossover in education may arrive earlier than in peer-reviewed science because educational explainers are easy to generate and easy to monetize. But trusted educational authority may remain tied to institutions, experts, curricula, assessments, and user outcomes. Volume and trust diverge again.

The more AI writes educational material, the more important it becomes to know who checked it. This is not nostalgia for human typing. It is a practical requirement when errors can scale.

Language imbalance could reshape the web

English dominates many AI datasets, tools, and commercial publishing incentives. That means the AI text crossover is likely to arrive first, or at least become most measurable, in English. But smaller languages face their own risks.

One risk is synthetic translation flooding. A site can now translate thousands of English pages into Slovak, Czech, Polish, Hungarian, Romanian, or other languages at low cost. Some of those translations may be useful. Others may be awkward, incorrect, culturally thin, or created only to capture long-tail search traffic. Because the public corpus in a smaller language is smaller to begin with, AI-translated or AI-generated pages can change its composition faster.

Another risk is source dilution. If local human reporting, forums, expert blogs, and institutional documents are sparse, AI-generated summaries may become a large share of available material for certain topics. Future models trained on those languages may ingest a higher synthetic fraction. The model then reproduces a thinner version of the language, with fewer idioms, local references, and domain-specific expressions.

The opposite can also happen. Smaller language communities may rely more on trusted institutions, public broadcasters, universities, government sites, and local expert communities. If those actors maintain human review and disclosure, the trusted layer may remain resilient even as low-quality synthetic pages rise around it.

For Central Europe, including Slovakia, this matters for media, public administration, tourism, ecommerce, law, medicine, and education. AI translation will make multilingual publishing easier, but it will also lower the cost of misleading local pages. Search engines and public institutions will need stronger source signals in local languages, not only English.

The global crossover date hides local web realities. A smaller language can feel synthetic earlier even if global public-web token counts have not crossed.

The quality question is harder than the volume question

A trillion tokens of bad text and a trillion tokens of useful text are not equal. The volume question is measurable in principle. The quality question is harder. AI-generated text can be accurate or wrong, useful or empty, original in structure or derivative in substance, readable or bland, transparent or deceptive. Human text has the same spread, but human authorship historically carried signals of accountability, experience, and social context.

The Internet Archive-based study did not find strong evidence for every feared quality decline. It found evidence of semantic contraction and a shift toward more positive sentiment, but no clear macro-level degradation in factual accuracy and no strict stylistic monoculture. That is a warning against lazy doom narratives. The web can change in subtle ways before it becomes visibly broken.

Semantic contraction may matter more than obvious error. If many pages converge toward similar phrasing, similar advice, similar examples, and similar source choices, the web becomes less diverse even when it remains grammatically polished. Search results may appear full but feel repetitive. Answer engines may cite multiple pages that all paraphrase the same underlying source. Users may encounter agreement without independent verification.

Positive sentiment shift matters too. LLMs often produce agreeable, polite, upbeat, safe-sounding text. That tone can be useful in customer support, but it can distort criticism, risk communication, reviews, political analysis, health advice, and product evaluation. A web filled with synthetic positivity may be less abrasive and less honest.

The danger is not only hallucination. It is averaged language replacing situated judgment. That is the deeper editorial issue behind the AI text crossover.

Human text will not disappear, but it may move

A common mistake is to imagine humans stop writing once AI writes more. That is unlikely. Humans will keep writing messages, posts, notes, code, diaries, emails, reviews, fiction, research, journalism, comments, legal arguments, documentation, and social updates. The question is where that writing lives and whether it remains public.

If the public web becomes less rewarding, more human text may move into private channels: messaging apps, Slack, Discord, WhatsApp, closed forums, newsletters, paid communities, internal knowledge bases, private documents, and social platforms that restrict crawling. That would make the open web more synthetic not because humans stopped writing, but because human writing became less accessible.

This is already visible in the broader shift from open blogs and forums to platforms and private groups. AI may accelerate it. If people feel that public writing is scraped without fair return, they may publish less openly. If public comments are flooded by bots, communities may close. If answer engines summarize without clicks, creators may reserve original work for subscribers. If AI spam weakens search discovery, niche experts may rely on direct audiences.

The training-data consequences are large. Future models may have more access to AI-generated public pages and less access to fresh human-originated public writing. That would make licensing deals and user-data policies more central. It would also make public-interest archives more valuable.

The public web could therefore become both larger and poorer: more tokens, fewer authentic human contexts. The true loss would not be a lower word count. It would be a weaker public record of human experience.

The pre-AI web target keeps moving if humans keep publishing

The static benchmark answers one version of the question: how long until AI generates as much text as the public human web had accumulated before generative AI became mainstream? But the moving benchmark asks when cumulative AI public text exceeds cumulative human public text, including new human writing after 2022.

That second version is harder and later. Humans still publish a lot. Newsrooms, governments, courts, researchers, standards bodies, companies, forums, developers, creators, and everyday users keep adding text. Even if AI writes half of new article-like pages, human text continues to expand the denominator. AI must not only match the old stock; it must outpace ongoing human additions.

The moving target also raises classification problems. If a human writes with AI support, where does the text go? If a journalist uses AI transcription but writes the story, is the story human? If a company uses AI to translate a human-authored manual, is the translated page AI? If a model writes a draft and a domain expert rewrites it heavily, is it mixed? The future web will contain too much hybrid work for binary labels.

A practical approach is to use three ratios instead of one. The first is share of new public pages that are AI-generated or AI-assisted. The second is share of cumulative public text volume that is AI-generated or AI-assisted. The third is share of trusted visible information surfaces that rely on AI-generated or AI-assisted text. Each ratio answers a different public concern.

The first ratio may already be high. The second will take longer to cross. The third may remain lower if ranking and citation systems favor verified human sources. This layered measurement is less catchy than a single date, but it is more useful.
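
As a rough illustration of how these three ratios could be tracked, here is a minimal Python sketch. The page records, field names, and sample values are hypothetical; a real measurement would draw on crawl data, detector scores, and ranking or citation logs rather than a hand-written list.

```python
# Minimal sketch of the three-ratio view of the AI text transition.
# All page records below are hypothetical placeholders.

pages = [
    # tokens, published after 2022, AI-generated or AI-assisted, shown on a trusted surface
    {"tokens": 1200, "new": True,  "ai": True,  "trusted": False},
    {"tokens": 900,  "new": True,  "ai": False, "trusted": True},
    {"tokens": 4000, "new": False, "ai": False, "trusted": True},
    {"tokens": 700,  "new": True,  "ai": True,  "trusted": True},
]

def share(items, predicate, weight=lambda p: 1):
    """Weighted share of items satisfying predicate (0.0 if the pool is empty)."""
    pool = sum(weight(p) for p in items)
    hits = sum(weight(p) for p in items if predicate(p))
    return hits / pool if pool else 0.0

# Ratio 1: share of newly published pages that are AI-generated or AI-assisted.
new_pages = [p for p in pages if p["new"]]
ratio_new_pages = share(new_pages, lambda p: p["ai"])

# Ratio 2: share of cumulative public text volume (token-weighted) that is AI text.
ratio_cumulative_tokens = share(pages, lambda p: p["ai"], weight=lambda p: p["tokens"])

# Ratio 3: share of trusted visible surfaces that rely on AI text.
trusted_pages = [p for p in pages if p["trusted"]]
ratio_trusted_surfaces = share(trusted_pages, lambda p: p["ai"])

print(ratio_new_pages, ratio_cumulative_tokens, ratio_trusted_surfaces)
```

Even in this toy example the three numbers differ, which is the point: a high share of new AI pages can coexist with a much lower AI share of cumulative tokens and of trusted surfaces.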

The crossover could arrive suddenly if agents start publishing

The late-2030s public-web estimate assumes public publishing grows through current channels: websites, CMS tools, ecommerce platforms, SEO operations, support docs, knowledge bases, and content farms. AI agents could change that. If agents begin creating, updating, testing, localizing, and interlinking pages autonomously for businesses, the public-web accumulation rate could accelerate.

Imagine a commerce platform where every product variant gets AI descriptions, comparisons, FAQs, troubleshooting pages, compatibility notes, local-language versions, seasonal landing pages, and support snippets. Imagine a travel platform generating pages for every itinerary, neighborhood, hotel cluster, and user intent. Imagine a software company generating documentation pages from code changes. Imagine legal, accounting, insurance, and medical directories generating localized explainers for every jurisdiction and scenario.

Some of this already exists in early form. Agents would make it cheaper and more continuous. Public text would no longer be published in batches. It would be generated and refreshed as operations run. A change in inventory, price, regulation, weather, route, product feature, or user demand could trigger new public text.

But autonomous publishing also collides with risk. The more pages an agent creates, the more errors can scale. Companies will need approval workflows, source grounding, rollback systems, logging, compliance checks, hallucination tests, and monitoring. Search systems may punish mass-generated pages if they appear unoriginal or manipulative. Regulators may scrutinize automated advice in sensitive sectors.

Agents could pull the public-web crossover closer, but only if businesses trust them enough to publish at scale and platforms reward the output. That is plausible in some verticals and unlikely in others.

A likely sequence of milestones

The AI text transition is best understood as a sequence, not a single finish line. The first milestone is already past: AI-generated text became common in drafts, chats, marketing, coding, schoolwork, and support. The second milestone is also underway: AI-assisted or AI-generated pages became a large share of newly published websites and articles in some samples.

The third milestone is private-output parity with the old human web. This is likely around 2026–2027 under reasonable assumptions and may already be occurring once the outputs of the large providers are combined. The fourth milestone is public-new-content parity, where AI-generated or AI-assisted pages become about half of newly published public text across major categories. Article samples suggest some slices have crossed; the broader web may be near or moving toward that state.

The fifth milestone is cumulative public-web parity with the pre-AI human corpus. That is the slower one: likely late 2030s or 2040s under central assumptions. The sixth milestone is trusted-surface parity, where AI-authored pages dominate what people actually see in high-trust search results, answer citations, academic references, news, and institutional knowledge. That date is uncertain because trust systems can filter raw volume.

The seventh milestone is training-corpus contamination becoming unavoidable without provenance filtering. That may arrive before cumulative public parity because future crawls overweight recent pages, and because model developers need fresh data. The Internet Archive study’s 35% figure for newly published websites already makes this issue real.

This sequence helps avoid confusion. The web can be AI-heavy in new content, AI-comparable in private output, and still human-dominated in cumulative trusted archives. All three can be true at the same time.

The practical forecast for publishers and businesses

For publishers, the safest planning assumption is that generic AI text will be abundant forever. It will not remain a novelty. It will not be reliably detectable forever. It will not be enough to publish “an article about a topic” and expect that article to stand out. Every commodity explainer now competes with near-zero-cost drafts.

The response is not to avoid AI entirely. It is to use AI where it reduces waste and keep humans where judgment, evidence, and accountability create value. AI is useful for transcription, outline testing, translation drafts, summarizing source documents, extracting entities, checking consistency, generating schema drafts, turning expert notes into rough structure, and creating internal briefs. It is risky when used to invent facts, replace reporting, fabricate expertise, or publish at scale without review.

Businesses should classify content by risk. Low-risk operational text, such as internal summaries or draft metadata, can use more automation. Medium-risk public pages need human review and source grounding. High-risk content in finance, health, law, safety, politics, and reputation-sensitive domains needs named accountability, citations, and documented review.

For SEO and GEO, the priority is source strength. Publish material that contains one or more of these: first-party data, original images or tests, named expert analysis, primary-source citations, local experience, product evidence, legal or technical accuracy, community knowledge, or an angle that is not merely scraped from existing pages. If AI can generate the same page from the top 10 search results, the page is weak.

This is the business version of the crossover forecast. By the time cumulative public AI text overtakes old human text, the market will already have punished generic content. The companies that adapt early will not wait for the crossover. They will build provenance, review, and original evidence now.

The practical forecast for AI developers

For AI developers, the crossover means data provenance becomes central infrastructure. It is no longer safe to treat the public web as a mostly human-origin corpus. New crawls need synthetic-content estimation, deduplication, source scoring, time-aware filtering, and perhaps explicit separation of pre-2023 and post-2023 materials.

The easiest move is to use older human web data, but that data becomes stale. The world changes: laws, products, prices, software APIs, medical guidance, scientific results, geopolitical events, and cultural references all move. Models need fresh data. Fresh public data is increasingly mixed with AI output. That creates a trade-off between freshness and human provenance.

Developers will need a data stack that can use both. Older human-origin corpora preserve distributional richness. Licensed sources provide verified freshness. Human feedback and expert data add judgment. Synthetic data can expand controlled tasks. Retrieval systems can ground answers in current sources. Evaluation suites can test whether synthetic-heavy training harms rare knowledge, minority dialects, uncertainty expression, or factual calibration.
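
A minimal sketch of what time-aware filtering and source scoring might look like inside such a stack, assuming hypothetical per-document fields such as crawl_date, synthetic_score, and source_score. The cutoff date, thresholds, and bucket names are placeholders for illustration, not recommendations.

```python
from datetime import date

# Hypothetical document records; in practice these fields would come from
# crawl metadata, a synthetic-text classifier, and a domain reputation system.
docs = [
    {"url": "https://example.org/a", "crawl_date": date(2021, 5, 1),
     "synthetic_score": 0.05, "source_score": 0.9},
    {"url": "https://example.org/b", "crawl_date": date(2025, 3, 1),
     "synthetic_score": 0.80, "source_score": 0.4},
    {"url": "https://example.org/c", "crawl_date": date(2025, 6, 1),
     "synthetic_score": 0.20, "source_score": 0.8},
]

CUTOFF = date(2023, 1, 1)      # rough pre-/post-generative-AI boundary
SYNTHETIC_THRESHOLD = 0.5      # placeholder detector threshold

def bucket(doc):
    """Route a document into a provenance bucket for separate handling."""
    if doc["crawl_date"] < CUTOFF:
        return "pre_2023_human_likely"
    if doc["synthetic_score"] >= SYNTHETIC_THRESHOLD:
        return "post_2023_synthetic_likely"
    if doc["source_score"] >= 0.7:
        return "post_2023_high_trust"
    return "post_2023_unverified"

for doc in docs:
    print(doc["url"], "->", bucket(doc))
```

The useful property of a scheme like this is not the specific thresholds but the separation itself: older human-likely data, fresh high-trust data, and synthetic-likely data can then be weighted, licensed, or excluded independently.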

The model-collapse risk does not mean “never use synthetic data.” It means do not let synthetic data recursively replace the human distribution without control.

The winning data strategy is likely hybrid: preserve human-origin data, license high-signal fresh sources, generate synthetic data for targeted skills, and keep provenance metadata throughout the pipeline. The public AI text crossover makes that strategy less optional.

The practical forecast for regulators

Regulators do not need to decide whether AI text is good or bad in the abstract. The policy questions are narrower: disclosure, competition, consumer protection, copyright, data rights, election integrity, platform accountability, and public-interest access.

Search and answer interfaces are the immediate battleground. If a dominant search platform uses publisher content to produce AI answers, regulators must decide what control publishers should have, what counts as fair attribution, whether opt-outs are meaningful, and whether search visibility is being tied to AI reuse. The CMA’s proposed measures and EU complaints show that this is no longer a theoretical debate.

Another regulatory question is labeling. Mandatory labels for all AI-assisted text may be impossible to enforce and too broad to be useful. A better target may be high-risk domains and deceptive uses: synthetic impersonation, fabricated reviews, automated political persuasion, medical or financial advice without accountability, fake local news sites, and content that falsely claims human experience.

A third question is data access for researchers. Measuring the AI web requires archives, crawls, platform data, and legal space for public-interest research. If anti-scraping rules and platform restrictions block all measurement, society loses visibility into the synthetic shift.

A fourth question is consumer harm. A web page generated by AI is not harmful by itself. Harm appears when users are misled, when advice is unsafe, when fake expertise is monetized, when scams scale, when copyrighted work is laundered, when public debate is manipulated, or when original publishers are displaced without compensation.

Regulation will be most useful where it protects provenance, competition, and accountability rather than trying to police every AI-written sentence.

The practical forecast for readers

Readers need new habits because fluent prose has become cheap. The old habit — “this sounds polished, so it is probably credible” — is broken. AI makes polish easy. The better habit is to look for source trails, dates, named responsibility, original evidence, and signs of real contact with the subject.

A trustworthy page usually has some external anchor. It cites primary sources. It names the author or institution. It explains methods. It distinguishes fact from interpretation. It links to documents, data, product pages, legal text, scientific papers, or direct observations. It updates when facts change. It does not hide behind generic claims. It does not produce perfect confidence on topics that deserve uncertainty.

Readers should also compare the type of question to the type of source. For a product specification, use the manufacturer or official docs. For law, use statutes, regulators, courts, or qualified legal analysis. For health, use medical institutions and clinicians. For breaking news, use reputable outlets and primary statements. For local information, check local sources. For historical or scientific topics, check archives, books, papers, and institutions.

AI-generated content is not always wrong. Human content is not always right. But when AI volume rises, the burden shifts from reading style to checking provenance. The question becomes not “does this sound human?” but “what is this based on, and who is accountable for it?”

That habit will matter more before the public-web crossover than after it. Waiting for a global majority-AI moment misses the fact that synthetic text is already common in search, social feeds, marketing, support, and internal workflows.

The crossover may be invisible because AI is becoming infrastructure

AI text is less likely to remain a separate category called “AI content” and more likely to become infrastructure inside ordinary products. Search results, email clients, writing apps, code editors, office suites, customer-support systems, analytics dashboards, ecommerce platforms, CMS tools, translation systems, and documentation workflows will all generate or rewrite text. Much of that output will not carry a label.

Alphabet’s Q1 2026 remarks show this infrastructure direction: AI features across Search, Cloud, subscriptions, Gemini, developer APIs, and enterprise products, with direct API token processing growing sharply. OpenAI’s enterprise report shows ChatGPT at more than 800 million weekly users as of late 2025, embedding AI into work habits.

When AI is infrastructure, the line between generated and assisted text fades. A human may dictate notes, ask AI to structure them, edit the output, paste it into a CMS, have another AI check grammar, and use an SEO tool to rewrite headings. Is the final page AI-generated? Human? Hybrid? The answer depends on the threshold.

This is why future measurement may need graded labels. A page can be human-authored with AI editing, AI-drafted with human review, AI-generated from structured data, AI-translated from human source text, or fully autonomous. Each category has different risk and value.
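
A graded scheme could be as simple as an enumeration rather than a binary flag. The sketch below only encodes the five categories listed above; the names and values are illustrative, not an existing standard.

```python
from enum import Enum

class AIUse(Enum):
    """Graded provenance labels instead of a binary human/AI flag (illustrative)."""
    HUMAN_AUTHORED_AI_EDITED = "human-authored with AI editing"
    AI_DRAFTED_HUMAN_REVIEWED = "AI-drafted with human review"
    AI_GENERATED_FROM_DATA = "AI-generated from structured data"
    AI_TRANSLATED_HUMAN_SOURCE = "AI-translated from human source text"
    FULLY_AUTONOMOUS = "fully autonomous generation and publication"

page_label = AIUse.AI_DRAFTED_HUMAN_REVIEWED
print(page_label.value)  # -> "AI-drafted with human review"
```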

The public discussion often wants a binary because binaries are easy. The real web is becoming mixed. The crossover will happen inside workflows before it appears in labels. That makes disclosure and provenance more valuable but also more difficult.

The answer for Slovakia and smaller markets

The question behind this analysis was originally asked in Slovak, and the local implications matter. Smaller markets often experience global AI shifts through translation, SEO automation, ecommerce, tourism, and public-sector communication. A Slovak business can generate English, German, Czech, Hungarian, and Polish content cheaply. Foreign sites can generate Slovak pages cheaply. That changes local search competition.

For Slovak publishers and businesses, the first risk is not that all Slovak text online becomes AI-written overnight. The first risk is that generic Slovak content becomes cheap and crowded. Travel guides, local service pages, product descriptions, recipe pages, crypto explainers, health-adjacent summaries, finance explainers, and software tutorials can be generated at scale. Local expertise, real customer data, original photos, legal accuracy, price accuracy, and named accountability become stronger differentiators.

The second risk is translation quality. AI-translated pages may rank for Slovak queries even if they lack local context. They may use terms that sound plausible but are not how people speak. They may miss regulatory specifics. They may translate legal or medical language too loosely. In small markets, such errors can spread because there are fewer high-quality competing sources.

The third risk is public information. Government, municipalities, schools, health institutions, and public agencies should treat AI as a tool for accessibility, not as a replacement for verified source text. AI can simplify documents and translate them, but the official version must remain clear, dated, and accountable.

The opportunity is also real. Smaller organizations can use AI to publish better documentation, serve multilingual users, and explain complex topics more clearly. The winners will be those that combine AI speed with local human knowledge. In Slovak search results, that combination may beat both raw AI pages and slow legacy content.

The public web may need provenance standards

A future web with mixed human and AI authorship needs better provenance. Not every page needs a badge, but high-value information should carry clearer signals about origin, evidence, and review. This could include structured metadata for author, organization, date, sources, revision history, AI use, review status, and content type.
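
As an illustration only, such a structured record could look like the sketch below. The field names are hypothetical and do not correspond to any existing metadata standard; real schemas would have to be negotiated across platforms and tools.

```python
# Hypothetical provenance record for a single public page
# (illustrative field names, not an existing standard).
provenance = {
    "author": "Example Author",
    "organization": "Example Health Portal",
    "content_type": "medical explainer",
    "date_published": "2026-04-12",
    "date_last_reviewed": "2026-05-02",
    "sources": [
        "https://example.gov/guideline-2026",
        "https://example-journal.org/article/123",
    ],
    "revision_history": ["2026-04-12 initial draft", "2026-05-02 clinical review"],
    "ai_use": "AI-drafted with human review",
    "review_status": "reviewed by a named clinician",
}
```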

Provenance standards will not solve everything. Bad actors can lie. Metadata can be stripped. Small publishers may lack technical capacity. Platforms may disagree on schemas. Readers may ignore labels. But structured provenance gives search engines, answer engines, archives, and browsers more to work with. It can make trustworthy content easier to identify and synthetic spam easier to discount.

The strongest provenance is not a label saying “human-written.” It is a trail of evidence. A scientific article has authors, affiliations, methods, citations, peer review, data, and corrections. A news article has reporting, named sources, documents, photos, and editorial accountability. A product review has testing, images, measurements, and disclosure. A legal guide has statutes, cases, dates, and qualified authorship.

AI can assist all of those formats. It cannot replace the external anchors. The future trust signal is not the absence of AI. It is the presence of verifiable grounding.

This matters for the crossover because raw AI volume will keep rising. Trust systems need a way to preserve value when volume no longer helps. A trillion extra tokens do not make the web more useful unless users and machines can separate grounded knowledge from synthetic repetition.

The best estimate remains a range, not a prophecy

A precise year would be false precision. The available evidence supports a range. The private-output crossover is near because model usage is already huge. The public-web crossover is later because publication is slower, filtered, and economically constrained. The visible-trust crossover is uncertain because ranking and citation systems may continue to favor human-source authority.

The most defensible estimate is:

AI systems likely generate a volume of text comparable to the pre-AI public human text stock in 2026–2027 when private chatbot and API outputs are counted. Publicly crawlable AI-written or AI-assisted web text likely reaches that same cumulative volume in the late 2030s or 2040s under central assumptions, with an aggressive early-2030s scenario and a restrictive post-2050 scenario.

That answer rests on four pillars. The old human public text stock is roughly 300 trillion effective tokens, with a wide uncertainty range. Chatbot and API output can reach hundreds of trillions of tokens per year. New public web content already contains a large AI-assisted share. Public accumulation remains slower than generation because most AI text is not published.
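
A back-of-envelope sketch shows how those pillars combine. The average output length per answer below is an assumption of this sketch, not a disclosed figure, and the calculation covers only one provider.

```python
# Rough back-of-envelope comparison (all assumptions are illustrative).
human_stock_tokens = 300e12   # Epoch AI central estimate of the pre-AI public text stock

prompts_per_day = 2.5e9       # OpenAI's reported ChatGPT prompt volume
tokens_per_answer = 300       # ASSUMPTION: average output length per answer
chatgpt_tokens_per_year = prompts_per_day * tokens_per_answer * 365
# ~2.7e14 tokens per year, the same order of magnitude as the 300-trillion-token stock

years_to_match_stock = human_stock_tokens / chatgpt_tokens_per_year
print(f"{chatgpt_tokens_per_year:.2e} tokens/year, "
      f"~{years_to_match_stock:.1f} years to match the old stock")
```

Under these assumptions a single large chatbot service generates on the order of the old public human stock in roughly a year, which is why the private-output crossover lands in 2026–2027 once other providers and API traffic are added.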

The final judgment is simple but not simplistic: AI has probably already reached the old web’s scale as a text generator; it has not yet reached the old web’s scale as a public, trusted, crawlable publishing layer. The difference between those two facts will define the next decade of search, publishing, AI training, and online trust.

The date to remember is not the only thing that matters

Dates help people reason about change, but the date is not the main issue. The main issue is that fluent text is no longer scarce. Public evidence is. Human provenance is. First-hand reporting is. Original data is. Expert review is. Editorial responsibility is. Community memory is. Local knowledge is. Trustworthy archives are.

The pre-AI web was flawed, messy, commercial, spam-filled, and uneven, but it carried a vast record of human expression and institutional output. The AI web will be larger in tokens. It may be better in some ways: more accessible, more multilingual, more explanatory, easier to search, easier to summarize. It may be worse in others: more repetitive, more synthetic, more detached from experience, easier to manipulate, harder to trace.

The crossover is therefore not a contest between humans and machines over who types more words. Machines will win that contest because they already operate at industrial speed. The real contest is over which text becomes trusted, cited, preserved, and economically supported.

The internet’s future will not be decided by the amount of AI text alone. It will be decided by whether the public web keeps enough human-origin, source-backed material for AI systems and people to rely on.

Questions readers ask about the AI text crossover

When will AI generate as much text as the pre-AI internet contained?

If all chatbot and API outputs are counted, the crossover is likely 2026–2027 and may already be underway. If only publicly crawlable AI-written web pages are counted, the likely crossover moves to the late 2030s or 2040s.

Has AI already written more than humans online?

Not across the whole accumulated public web. AI may already produce comparable annual text volume in private systems, and some article samples show AI-generated articles overtaking human-written articles. But the cumulative public web still contains a huge stock of older human-origin text.

Does “AI-generated text” include AI-assisted writing?

It depends on the study. Some research combines AI-generated and AI-assisted content because the boundary is hard to detect. A human-edited AI draft is different from a fully automated page, but both may contain model-written language.

What is the best estimate of the pre-AI public human text stock?

Epoch AI estimates the effective stock of human-generated public text data at about 300 trillion tokens, with a 90% confidence interval from 100 trillion to 1,000 trillion tokens.

Why use tokens instead of words or pages?

Tokens are closer to how language models process text. Pages vary too much in length, markup, boilerplate, and duplication. Words are easier to understand, but tokens are better for comparing AI generation and training data.

Does private ChatGPT output count as internet text?

It counts as AI-generated text, but usually not as public web text. Most chatbot answers are not indexed, crawled, linked, archived, or visible to the public.

Could AI-generated public web text cross earlier than the late 2030s?

Yes. If AI agents publish large amounts of product pages, support docs, translations, local landing pages, and synthetic Q&A at scale, the public-web crossover could move into the early-to-mid 2030s.

Could the crossover happen later than 2040?

Yes. Stronger search spam enforcement, weaker traffic incentives, publisher opt-outs, hosting controls, regulation, and user rejection of low-value AI pages could push the public-web crossover beyond 2050.

Are AI articles already more common than human articles?

One Graphite study found that AI-generated articles surpassed human-written articles in its English-language article sample in November 2024. That does not mean all web text or all trusted search results are now AI-written.

Are Google results mostly AI-written now?

Graphite’s 2025 search study found the opposite in its sample: 86% of articles ranking in Google Search were written by humans, while 14% were AI-generated.

Are AI answer engines citing mostly AI-written content?

In Graphite’s sample, the articles cited by ChatGPT and Perplexity were mostly human-written, with 18% classified as AI-generated.

Does Google ban AI content?

Google does not ban AI content simply because it is AI-generated. Its policy targets scaled content abuse, especially pages created mainly to manipulate rankings and provide little value.

Why does model collapse matter?

Model collapse is a risk when future models train on large amounts of recursively generated data. If public web crawls contain more AI text, developers need stronger filtering and provenance to preserve human-origin data.

Is synthetic data always bad for AI training?

No. Synthetic data can be useful when controlled, labeled, tested, and mixed carefully. The risk is uncontrolled recursive training on low-quality or unmarked AI-generated public text.

Which parts of the web will become AI-written first?

Commercial articles, affiliate pages, product descriptions, local SEO pages, translations, support docs, and programmatic landing pages are likely to shift early because they are easy to generate and monetize.

Will human writers lose value?

Generic text loses value. Human writers with reporting skills, domain expertise, judgment, voice, data access, editing ability, and accountability become more valuable because those qualities are harder to automate.

What should publishers do now?

Publishers should use AI for workflow support but compete on original evidence, named expertise, primary sources, human reporting, testing, citations, and trust signals. Producing more generic pages is a weak strategy.

What should businesses do now?

Businesses should classify content by risk, keep human review for public and regulated material, document sources, disclose AI use where needed, and avoid large-scale unoriginal pages built only for search traffic.

What should readers do now?

Readers should look for dates, sources, author accountability, primary documents, original evidence, and review history. Polished writing is no longer a reliable signal of credibility.

What is the simplest answer?

AI is likely matching the old human web in generated text volume around now, but the public open web will take much longer to become majority AI by accumulated text volume.

Author:
Jan Bielik
CEO & Founder of Webiano Digital & Marketing Agency


This article is an original analysis supported by the sources cited below

Will we run out of data to train large language models?
Epoch AI’s analysis estimating the effective stock of human-generated public text data and explaining the training-data ceiling for large language models.

Position: Will we run out of data? Limits of LLM scaling based on human-generated data
Peer-reviewed ICML paper by Villalobos and co-authors on when language-model training datasets may reach the available stock of public human text.

The Impact of AI-Generated Text on the Internet
Research paper using Internet Archive samples to estimate the share of newly published websites classified as AI-generated or AI-assisted.

April 2026 Crawl Archive Now Available
Common Crawl release notes for the April 2026 crawl, including page count, uncompressed size, host count, domain count, and newly discovered URLs.

Common Crawl
Common Crawl’s main project page describing its open web corpus, historical scale, monthly additions, and research use.

March 2026 Web Server Survey
Netcraft’s survey of responding sites, domains, and web-facing computers, used as a separate view of the public web’s infrastructure scale.

Celebrating 1 Trillion Web Pages Archived
Internet Archive announcement marking one trillion web pages preserved by the Wayback Machine.

More Articles Are Now Created by AI Than Humans
Graphite study estimating that AI-generated articles surpassed human-written articles in its English-language web article sample.

How Does AI-Generated Content Perform in Search and Answer Engines?
Graphite study comparing AI-generated and human-written article visibility in Google Search, ChatGPT citations, and Perplexity citations.

ChatGPT users send 2.5 billion prompts a day
TechCrunch report on OpenAI’s disclosure that ChatGPT receives 2.5 billion daily prompts.

The state of enterprise AI
OpenAI report page stating that ChatGPT serves more than 800 million users every week and discussing workplace adoption.

Alphabet earnings call, Q1 2026
Google and Alphabet CEO Sundar Pichai’s Q1 2026 remarks, including direct API token-processing volume and AI product momentum.

Alphabet Investor Relations 2025 Q1 Earnings Call
Alphabet investor call transcript describing AI Overviews reaching more than 1.5 billion monthly users and AI Mode query behavior.

Google I/O 2025 announcements
Google’s official I/O 2025 announcement roundup, including Gemini app usage and product expansion.

What web creators should know about our March 2024 core update and new spam policies
Google Search Central explanation of scaled content abuse and how the policy applies regardless of whether content is made by humans or automation.

Google’s AI Overviews hit by EU antitrust complaint from independent publishers
Reuters report on publisher complaints over Google AI Overviews, traffic concerns, opt-out issues, and Google’s response.

CMA proposes package of measures to improve Google search services in UK
UK Competition and Markets Authority announcement proposing publisher controls and transparency requirements for Google Search AI features.

Google’s response to the CMA’s consultation on potential requirements for Search
Google’s response to UK CMA proposals, including its statement that it is developing controls for sites to opt out of generative AI features in Search.

AI models collapse when trained on recursively generated data
Nature paper defining model collapse and explaining risks from training future generative models on recursively generated data.

The Rise of AI-Generated Content in Wikipedia
ACL Anthology paper estimating AI-generated content in newly created Wikipedia pages using calibrated detector thresholds.

FineWeb dataset
Hugging Face dataset page describing FineWeb as a large cleaned and deduplicated English web corpus derived from Common Crawl.

The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale
Paper introducing FineWeb and FineWeb-Edu, used to explain modern web-corpus cleaning, filtering, and educational data selection.

The RefinedWeb Dataset for Falcon LLM
NeurIPS dataset paper describing extraction of five trillion tokens from Common Crawl through filtering and deduplication.

Global AI computing capacity is doubling every 7 months
Epoch AI analysis estimating rapid growth in global AI computing capacity since 2022.

The 2025 AI Index Report
Stanford HAI report page covering AI investment, adoption, and broader industry trends relevant to generative AI scale.

Energy demand from AI
International Energy Agency analysis of data-center electricity demand, regional growth, and AI infrastructure implications.