Google’s index is getting pickier, not broken

Pages are disappearing from Google, but the useful question is not whether Google has pushed one giant “deindex” button. The useful question is which pages are disappearing, from which parts of which sites, and because of which mechanism. Google has not announced a mass deindexing campaign against ordinary websites. It has confirmed recent ranking and spam updates, kept tightening its spam-policy perimeter, and continues to state in its own Search Central documentation that it does not guarantee crawling, indexing, or serving, even when a page follows Google Search Essentials.

A real indexing squeeze, not a single purge

That distinction matters because the same business symptom can come from different causes. A page can be crawled but not indexed. A page can be discovered but not crawled. A page can be folded into another canonical URL. A page can be blocked by noindex. A site section can lose visibility after a spam-policy enforcement action. A page can remain indexed but lose impressions after a broad ranking update. A hacked site can suffer manual action or security-related suppression. To the business owner, the line chart looks the same: fewer pages visible, fewer impressions, fewer clicks, less revenue. To the recovery team, the cause changes everything.

The current anxiety has a clear trigger. Google’s March 2026 spam update began on March 24, 2026, and completed on March 25. Two days later, Google released the March 2026 core update, which began on March 27 and completed on April 8. Google’s status dashboard also shows a February 2026 Discover update before those two March events. That sequence created a dense period of volatility across Search and Discover, especially for publishers, affiliates, ecommerce sites, SaaS sites, programmatic SEO operations, and sites carrying third-party commercial content.

The blunt answer is this: Google is not simply “deindexing the web.” Google is becoming more selective about which URLs deserve crawl attention, index storage, canonical representation, and search visibility. In some cases, pages are genuinely deindexed. In other cases, they were never indexed. In many cases, they are technically known to Google but not considered worth storing or serving. In others, Google has chosen a different URL as canonical. The word “deindexing” has become a bucket for several different search failures.

This is not just a technical SEO story. It is an editorial story, a product story, a content-governance story, a security story, and a business-risk story. The web now produces huge volumes of near-duplicate, AI-assisted, templated, affiliate-heavy, and programmatic pages. Google has little incentive to index every crawlable URL when many pages add little beyond keyword targeting. The index has become a proof system. A page has to prove that it is accessible, indexable, canonical, useful, distinct, internally supported, and trustworthy enough to deserve a place.

For site owners, the right response is not panic. It is diagnosis. Treating every missing URL as a Google penalty leads to bad fixes. Teams rewrite indexed pages that only lost rankings. They request indexing for pages that remain weak. They delete pages that should have been consolidated. They noindex pages that had recoverable demand. They blame core updates for robots mistakes and blame robots files for content-quality problems. The first job is to name the mechanism.

The practical position is narrower and stronger than the panic headline. Google is not broken, but the old assumption that every crawlable page deserves a place in Google’s index is broken. Sites that publish with discipline will adapt. Sites that publish pages mainly because a keyword exists will keep seeing more of their inventory ignored, consolidated, delayed, or suppressed.

Google’s own documentation removes the biggest myth

Google’s Search Central documentation is direct about a point many site owners still resist: Google does not guarantee that it will crawl, index, or serve a page, even if the page follows Google Search Essentials. Google describes Search as a three-stage process: crawling, indexing, and serving search results. It also says not all pages make it through each stage.

That one statement should change how indexing is discussed. A sitemap is not an order. A 200 status code is not a contract. A submitted URL is not a ticket to visibility. A long article is not automatically a valuable article. A canonical tag is a strong signal, not an absolute command. A page can be technically valid and still fail the quality, duplication, demand, or trust test.

Crawling is discovery and fetching. Indexing is analysis, selection, storage, and clustering. Serving is retrieval and ranking for a query. These are related but separate. A page can be crawled and not indexed. A page can be indexed and not served for meaningful queries. A page can be served for one query class and lose another. A page can be indexed under a canonical URL the site owner did not expect.

The indexing stage is where many current problems concentrate. Google has to decide whether a page is worth storing in a useful form. It has to understand the main content, media, structured data, canonical signals, links, duplication, language, mobile rendering, and site-level patterns. If the page is thin, duplicative, weakly linked, rendered poorly, or too similar to existing indexed pages, Google may decline to store it as a distinct result.

This does not always mean the page is “bad.” Search is comparative. A decent article can be less useful than ten existing articles. A product page can be accurate but redundant because it repeats manufacturer copy. A local page can be true but too generic to justify indexing. A comparison page can target a real keyword but lack any evidence that the author has used the products. Indexing is not a moral judgment. It is a retrieval value judgment.

That value judgment is becoming harsher because the supply of pages is expanding faster than the supply of genuinely original information. AI-assisted writing has lowered the cost of producing passable pages. Programmatic SEO has lowered the cost of creating thousands of keyword-targeted URLs. CMSs, ecommerce systems, filters, tags, and internal search functions create URL inventory by default. Search engines do not need all of it.

The site owner’s mistake is often emotional. They treat every page as an asset because somebody paid to create it. Google treats the page as one possible answer among billions. The fact that a page cost money, contains keywords, and sits in a sitemap does not make it useful enough for the index.

The March 2026 timing created a noisy diagnostic window

The March 2026 spam update and core update were close enough together to blur the evidence for many sites. Google’s spam update began on March 24 at 12:00 Pacific time and ended on March 25 at 07:30. Google’s core update began on March 27 at 02:00 and ended on April 8 at 06:00. Those are not vague SEO calendar guesses; they are the dates published in Google’s Search Status Dashboard.

A site that collapsed on March 24 or March 25 should examine spam-policy exposure, hacked content, malicious behavior, third-party content, scaled content, doorway-like pages, and manual-action risk. A site that declined during the March 27 to April 8 core-update window should look at ranking-system reassessment, query-level losses, content competitiveness, site quality, and whether pages remain indexed. A site that saw index exclusions weeks later may be dealing with delayed crawling, reporting lag, a site release, technical changes, or a separate issue.

The February 2026 Discover update adds another layer. Google’s ranking incident history lists the February Discover update before the March spam and core updates. For publishers, that means Discover, Search rankings, and indexing reports could all show movement in a short period. A newsroom looking only at total Google traffic might merge three different problems into one narrative.

Search Engine Land reported that the March 2026 core update completed after 12 days and 4 hours and noted that it was Google’s first broad ranking update of the year. The report also said the core update followed the March 2026 spam update and the February 2026 Discover update. That sequence matters because a core update can reduce impressions without removing pages from the index.

Many sites misread ranking drops as deindexing. If a page remains indexed but loses non-brand impressions, the issue is not index inclusion. It may be weaker relevance, changed intent, stronger competitors, SERP layout changes, stale content, poor content satisfaction, or site-level reassessment. If a page is missing from Google’s index, the investigation shifts to Page Indexing, URL Inspection, canonical choice, noindex, robots, status codes, rendering, internal links, sitemaps, duplication, and page value.

A correct diagnosis starts with dates. Mark Google’s update windows, then mark your own releases. Did the site change robots.txt? Did a CMS plugin update meta robots tags? Did a CDN rule add X-Robots-Tag headers? Did a redesign change canonical tags? Did product filters become crawlable? Did the sitemap start listing redirected URLs? Did a JavaScript deployment move main content out of initial HTML? Did a third-party ad script alter user behavior? The answer may sit in engineering history, not Google’s blog.
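
One way to keep those dates straight is to line your release history up against Google's update windows programmatically. The sketch below, in Python, uses the update windows cited above and a hypothetical internal release log; the release entries are placeholders for your own change history.

```python
from datetime import date

# Google update windows cited in this article (start date, end date).
# Day-level precision is assumed to be enough for a first pass.
UPDATE_WINDOWS = {
    "March 2026 spam update": (date(2026, 3, 24), date(2026, 3, 25)),
    "March 2026 core update": (date(2026, 3, 27), date(2026, 4, 8)),
}

# Hypothetical internal release log; replace with your own deploy history.
RELEASES = {
    "robots.txt edit": date(2026, 3, 20),
    "CMS plugin update": date(2026, 3, 26),
    "JavaScript framework migration": date(2026, 4, 2),
}

for release, released_on in RELEASES.items():
    overlaps = [
        name for name, (start, end) in UPDATE_WINDOWS.items()
        if start <= released_on <= end
    ]
    if overlaps:
        print(f"{release} ({released_on}) overlaps {', '.join(overlaps)}")
    else:
        print(f"{release} ({released_on}) sits outside Google's update windows")
```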

The March 2026 window is a reminder that SEO investigations must separate correlation from cause. Google updates create volatility, but they also tempt teams to stop looking at their own systems. The update date tells you where to begin. The affected URL pattern tells you what to fix.

Deindexing, exclusion, canonical consolidation, and ranking loss are different problems

The phrase “Google deindexed my pages” is often wrong even when the commercial pain is real. Deindexing should mean a URL was previously indexed and is no longer indexed. Many reported cases are not that. They are ranking declines, canonical consolidations, crawl delays, duplicate exclusions, noindex mistakes, or Search Console reporting changes.

A ranking loss means the page remains in the index but appears lower or for fewer queries. This can happen after core updates, SERP changes, competitor improvements, intent shifts, freshness decay, or site-quality reassessment. A ranking loss is not solved by requesting indexing because the page is already indexed.

An exclusion means Google discovered or crawled the URL but did not index it. “Crawled — currently not indexed” and “Discovered — currently not indexed” are exclusion patterns, but they point to different stages. One means Google fetched the page and did not index it. The other means Google knows the URL but has not yet crawled it.

Canonical consolidation means Google indexed another URL as the representative version. A product variant, parameter URL, syndicated article, sort order, print version, HTTP duplicate, or trailing-slash variant may seem missing when it has been clustered under another canonical. This may be correct or wrong depending on the business intent.

A manual action or spam-related removal is stronger. Google’s manual-action and reconsideration documentation makes clear that manual reviews and reconsideration requests exist for policy violations and security issues, not for ordinary ranking drops. A reconsideration request is for review after fixing problems identified in a manual action or security issue notification.

The distinction changes the work plan. If a page is indexed but has lost rankings, examine content, queries, competitors, intent, snippets, freshness, internal links, and core-update impact. If a page is not indexed, examine technical access, indexability, canonical signals, duplication, internal links, sitemaps, rendered content, and page value. If a manual action exists, clean the violation and request review. If a site is hacked, treat it as a security incident before treating it as an SEO problem.

Precision also changes executive communication. “Google deindexed 40% of our pages” creates panic. “Google is no longer indexing 18,000 low-value filter URLs, while core category pages remain indexed” may be good news. “Our key service pages are indexed but lost impressions after the March 2026 core update” calls for content and authority work. “Our coupon subdirectory lost visibility after site reputation abuse enforcement” calls for a policy and business-model decision.

Do not call it deindexing until you confirm that the canonical URL was indexed before and is not indexed now. Everything else needs a different label.

Search Console labels are clues, not verdicts

Search Console’s Page Indexing report is useful because it shows how Google groups indexing outcomes. It is also dangerous because the labels are easy to overread. A label is a clue, not a full diagnosis.

“Crawled — currently not indexed” means Google fetched the page and did not index it at the time reflected in the report. It does not automatically mean a penalty. It may point to weak content, duplication, poor internal support, rendering problems, canonical confusion, low site trust, stale content, or Google simply choosing not to store the page.

“Discovered — currently not indexed” means Google knows the URL exists but has not crawled it yet. On small sites, that may mean weak discovery or low priority. On large sites, it often points to crawl demand problems. Google may know about thousands or millions of URLs and decide that many are not worth fetching quickly.

“Duplicate without user-selected canonical” means Google found duplication and chose a canonical because the site did not clearly indicate one. “Alternate page with proper canonical tag” means Google accepted a canonical relationship. These are not always problems. Duplicates should often stay out of the index. The issue is whether Google chose the right representative URL.

“Excluded by ‘noindex’ tag” is often self-inflicted. Google’s noindex documentation explains that noindex can block a page from appearing in Search results. If this status appears on pages that should rank, the fix is not content editing. The fix is directive cleanup.

“Blocked by robots.txt” is different from noindex. Robots.txt is mainly a crawl-management tool. Google’s robots.txt guide says robots.txt is used primarily to manage crawler traffic; it is not the right way to protect private information, and compliance is voluntary, so obedience varies by crawler. For Google, a robots.txt block keeps Googlebot from fetching the content and seeing the other signals on the page.

The most common mistake is measuring total indexed URLs without asking whether those URLs should be indexed. A decline in indexed pages can be healthy if Google stops indexing duplicate filters, weak tags, or campaign leftovers. A stable indexed-page count can hide serious risk if key commercial pages are replaced by low-value URLs. The better metric is intended-index coverage: the share of URLs the business actually wants in search results that Google has indexed as the selected canonical.
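
Intended-index coverage is straightforward to compute once both lists exist. A minimal sketch, assuming you maintain a list of URLs the business wants indexed and an export of URLs Google currently reports as indexed under their own canonical; all URLs here are hypothetical:

```python
def intended_index_coverage(intended_urls, indexed_canonicals):
    """Share of wanted URLs that Google indexes as the selected canonical."""
    intended = set(intended_urls)
    indexed = set(indexed_canonicals)
    covered = intended & indexed
    missing = sorted(intended - indexed)      # wanted but not indexed
    unexpected = sorted(indexed - intended)   # indexed but never wanted
    coverage = len(covered) / len(intended) if intended else 0.0
    return coverage, missing, unexpected

coverage, missing, unexpected = intended_index_coverage(
    ["https://example.com/category/shoes", "https://example.com/guides/fit"],
    ["https://example.com/category/shoes", "https://example.com/?sort=price"],
)
print(f"intended-index coverage: {coverage:.0%}")
print("missing from index:", missing)
print("indexed but not intended:", unexpected)
```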

That requires discipline. Every site needs an inventory of pages that should be indexed, pages that should exist but not be indexed, pages that should consolidate, and pages that should be removed. Without that map, Search Console becomes a wall of warnings with no business meaning.

“Crawled but not indexed” is the uncomfortable center of the debate

“Crawled — currently not indexed” is the label that frustrates site owners most because it feels like rejection. Google saw the page and did not keep it. In many cases, that is exactly the practical meaning. The page was accessible enough to crawl but not compelling enough, clear enough, distinct enough, or trusted enough to index.

The label can appear for many reasons. The page may be too similar to other pages. It may have thin main content. It may be overloaded with boilerplate. It may be orphaned or weakly linked. It may rely on JavaScript that Google rendered poorly. It may return a technically valid page but contain little useful information. It may target a query already answered by stronger sources. It may belong to a site section with a pattern of low-value pages. It may sit inside a duplicate cluster where another URL wins.

The wrong answer is often “add more words.” Length is not value. A 600-word page with original data, clear pricing, expert detail, or a specific answer can deserve indexing. A 3,000-word AI-assisted article that repeats public information without evidence may not. Google has no shortage of long pages. It needs useful pages.

The second wrong answer is “request indexing.” A request can prompt crawling; it does not create value. If Google already crawled the page and declined to index it, another crawl with the same content and signals may produce the same result. The request button is useful after a real fix, not instead of one.

For publishers, crawled-not-indexed often appears on commodity news: rewritten press releases, duplicate wire stories, late summaries of widely covered events, and articles with no original reporting. For ecommerce sites, it appears on product pages using manufacturer descriptions, empty categories, thin filters, and variant URLs. For SaaS sites, it appears on comparison pages, alternatives pages, integration pages, and glossaries that use the same generic structure across many keywords. For local businesses, it appears on doorway-like location pages with swapped city names and no local proof.

A useful investigation samples both failed and successful pages. Take 100 crawled-not-indexed URLs from one template and compare them with indexed URLs from the same template. Look at internal links, content uniqueness, crawl depth, canonical choice, sitemap inclusion, lastmod accuracy, rendered HTML, status code, author credibility, page freshness, query demand, external links, and boilerplate ratio. Patterns will usually appear.
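
The comparison does not need sophisticated tooling. A minimal sketch, using hypothetical pre-crawled attributes for two samples from the same template; in practice these numbers come from your own crawler export:

```python
from statistics import mean

indexed_sample = [
    {"words": 820, "internal_links": 14, "crawl_depth": 2},
    {"words": 650, "internal_links": 9, "crawl_depth": 3},
]
not_indexed_sample = [
    {"words": 310, "internal_links": 1, "crawl_depth": 5},
    {"words": 280, "internal_links": 0, "crawl_depth": 6},
]

def profile(sample):
    # Average each attribute across the sample.
    return {key: round(mean(page[key] for page in sample), 1)
            for key in ("words", "internal_links", "crawl_depth")}

print("indexed:    ", profile(indexed_sample))
print("not indexed:", profile(not_indexed_sample))
# Large gaps in internal links or crawl depth usually explain more than
# small gaps in word count.
```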

The central question is not “Why did Google crawl but not index?” It is “What does this page add that makes it worth storing separately?” If the answer is weak, the fix is editorial and architectural, not only technical.

“Discovered but not indexed” points to priority and crawl demand

“Discovered — currently not indexed” sits earlier in the chain: Google knows the URL exists but has not crawled it. The page may be good or bad; Google has not fetched it to find out. The problem is priority.

Google discovers URLs through links, sitemaps, and other signals. Discovery does not mean Google will crawl immediately. A sitemap can help discovery, but Google’s sitemap documentation says a sitemap does not guarantee every item will be crawled and indexed. That gap is where many large sites get into trouble.

Large ecommerce sites, marketplaces, publishers, directories, and programmatic SEO sites can generate more URLs than Google wants to fetch. Filters, sort orders, internal search results, tag pages, pagination, calendar pages, session identifiers, tracking parameters, and low-value archives expand URL inventory quickly. Google may discover the URLs and then decide not to crawl many of them.

The issue is not always server capacity. Crawl demand is tied to perceived value. If Google sees a site producing many duplicate, thin, stale, or low-priority URLs, it has less reason to crawl new URLs aggressively. If important pages sit beside millions of weak URLs, the site has made Google’s job harder.

Small sites can also see this status. A new website with few external links may have low crawl demand. A service page linked only from an XML sitemap may not look important. A blog article buried in pagination may be discovered but not prioritized. A page that has no internal links beyond a sitemap is not strongly endorsed by its own site.

The fix is hierarchy. Important pages need internal links from relevant pages, clean sitemap inclusion, correct canonical signals, and a reason to be crawled. Low-value URL patterns need rules. Some should be noindexed. Some should be canonicalized. Some should be blocked from crawling. Some should be removed. The choice depends on whether Google needs to crawl the URL to see the signal.

Sitemaps should not be dumps. They should list canonical, important, indexable URLs. If a sitemap contains redirected URLs, noindexed pages, 404s, duplicates, and low-value parameters, it trains Google to treat the sitemap as noisy.
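
A basic hygiene pass can be automated. The sketch below assumes the third-party requests library and a hypothetical sitemap URL; it flags entries that return the wrong status or carry a noindex signal, and the meta-tag check is deliberately crude and worth verifying by hand:

```python
import xml.etree.ElementTree as ET
import requests  # third-party: pip install requests

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def audit_sitemap(sitemap_url, limit=50):
    """Flag sitemap entries that are redirected, broken, or noindexed."""
    tree = ET.fromstring(requests.get(sitemap_url, timeout=10).content)
    for loc in tree.findall(".//sm:loc", NS)[:limit]:
        url = loc.text.strip()
        resp = requests.get(url, timeout=10, allow_redirects=False)
        problems = []
        if resp.status_code != 200:
            problems.append(f"status {resp.status_code}")
        if "noindex" in resp.headers.get("X-Robots-Tag", "").lower():
            problems.append("noindex in X-Robots-Tag")
        if "noindex" in resp.text[:5000].lower():
            problems.append("possible meta noindex")  # crude string check
        if problems:
            print(url, "->", ", ".join(problems))

audit_sitemap("https://example.com/sitemap.xml")  # hypothetical sitemap
```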

Discovery is not priority. A URL known to Google is not a URL Google has decided to spend resources on.

Canonical decisions decide which URL survives

Canonicalization is one of the biggest reasons pages appear to vanish. Google does not want to show every duplicate or near-duplicate version of a page. It groups similar URLs and selects a representative canonical. The site can influence that choice, but Google can still choose differently when signals conflict.

Google’s canonical guidance treats redirects, rel=canonical annotations, and sitemap inclusion as canonical signals, with different strengths. Internal linking consistency also matters because links tell Google which version the site uses in practice. A canonical tag is strong, but it is not a magical override for every contradictory signal.

Canonical conflict is common. A page declares one canonical, internal links point to a parameter version, the sitemap lists another version, redirects are inconsistent, hreflang references a different URL, and duplicate content is accessible through filters. Google has to reconcile the mess. If it chooses a URL the business did not expect, the team may call the preferred URL deindexed, but Google may simply have chosen another representative.
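
The mess is easier to see when the signals sit side by side. A minimal sketch with hypothetical signal values; in practice the values come from the page HTML, the XML sitemap, redirect checks, and an internal-link crawl:

```python
# Canonical signals for one piece of content, collected from different sources.
signals = {
    "rel=canonical": "https://example.com/product/blue-widget",
    "sitemap entry": "https://example.com/product/blue-widget",
    "most-linked internal URL": "https://example.com/product/blue-widget?ref=nav",
    "redirect target": "https://example.com/product/blue-widget",
}

variants = set(signals.values())
if len(variants) == 1:
    print("canonical signals agree on", variants.pop())
else:
    print("canonical conflict; Google may pick its own representative:")
    for source, url in signals.items():
        print(f"  {source:<26} {url}")
```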

Migrations create canonical instability. HTTP to HTTPS moves, domain changes, subdomain consolidation, trailing-slash changes, slug rewrites, pagination changes, CMS rebuilds, and international URL changes can all leave duplicate paths live. Google may keep old canonicals longer than expected or choose unexpected URLs if redirects, canonicals, and internal links are not aligned.

Publishers face canonical problems with syndication and wire content. If multiple sites publish the same article, Google has to decide which version deserves visibility. A publisher adding no original reporting to widely syndicated material should not expect every version to behave like a unique search result. Originality and canonical agreements matter.

Ecommerce sites face canonical problems through variants and filters. A product might appear under multiple categories, colors, sizes, sort orders, tracking URLs, and internal search paths. A category might have dozens of filter states. Some filtered pages deserve search visibility because they match real demand and have stable inventory. Most do not. Without rules, Google will decide for the site.

SaaS sites face canonical-like redundancy when they publish many comparison, alternative, glossary, and integration pages that differ mainly by inserted product names. Even if canonical tags are technically correct, the pages may look too similar to deserve separate indexing. Canonical tags cannot solve a content strategy that produces hundreds of near-identical pages.

Stable indexing requires one clear URL for one clear piece of value. The sitemap, canonical tag, redirects, internal links, hreflang, and content strategy should say the same thing.

Noindex, robots.txt, headers, and status codes still cause avoidable losses

Some indexing crises are not Google becoming pickier. They are site owners instructing Google to leave.

Noindex is the clearest case. Google’s noindex documentation explains that a noindex rule can block indexing so a page does not appear in Search results. Noindex can be placed in a meta tag or HTTP header. If it appears on a page that should rank, Google may drop that page from results.

Noindex mistakes happen through CMS settings, staging flags, plugins, theme changes, template logic, environment variables, CDN rules, and X-Robots-Tag headers. A staging site launches with noindex left on. A WordPress setting discourages search engines. A plugin applies noindex to entire content types. A header rule meant for PDFs hits all pages in a directory. A headless CMS template injects robots directives inconsistently.

Robots.txt creates a different problem. Google’s robots.txt guide explains that robots.txt is mainly used to manage crawler traffic. If a page is blocked by robots.txt, Google may not be able to fetch the content and see noindex, canonical, or updated page content. Robots.txt is not the same as noindex, and using it as a privacy tool is unsafe.

Status codes can quietly destroy indexing. A page that looks fine in a browser may return a 404, 403, 500, redirect chain, or soft 404 to crawlers. JavaScript apps can return 200 for missing pages or 404 for valid routes before hydration. A CDN can block Googlebot by region. A firewall can challenge crawlers. A login wall can hide content. Google needs accessible pages with correct responses.

Headers are often ignored because marketers rarely inspect them. X-Robots-Tag directives can apply noindex outside the visible HTML. Cache rules can serve stale noindex headers. Server-side redirects can send Googlebot to different destinations from users. Security tools can block crawlers after rate thresholds.

The right audit is mechanical. Fetch the exact URL as Googlebot where possible. Inspect headers. Check robots.txt. Check meta robots. Check canonical. Check status code. Check rendered HTML. Check mobile output. Check whether the indexed version differs from the live version. Check representative templates, not only the home page.
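
A minimal version of that fetch can be scripted. The sketch below assumes the third-party requests library and a hypothetical URL; note that sending a Googlebot user-agent string does not make the request come from Googlebot, and some CDNs verify crawlers and may answer real crawls differently. The regex checks are crude, and a proper crawler is better for template-level audits.

```python
import re
import requests  # third-party: pip install requests

UA = {"User-Agent": "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"}

def audit(url):
    resp = requests.get(url, headers=UA, timeout=10, allow_redirects=False)
    html = resp.text
    meta_robots = re.search(r'<meta[^>]+name=["\']robots["\'][^>]*>', html, re.I)
    canonical = re.search(r'<link[^>]+rel=["\']canonical["\'][^>]*>', html, re.I)
    report = {
        "status": resp.status_code,
        "redirect location": resp.headers.get("Location"),
        "X-Robots-Tag": resp.headers.get("X-Robots-Tag"),
        "meta robots": meta_robots.group(0) if meta_robots else None,
        "canonical tag": canonical.group(0) if canonical else None,
    }
    for key, value in report.items():
        print(f"{key:>18}: {value}")

audit("https://example.com/important-page")  # hypothetical URL
```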

Before blaming Google for deindexing, verify that the site is not blocking, noindexing, redirecting, canonicalizing, or mis-serving the page. Many “Google removed us” cases begin with a release mistake.

JavaScript can make Google see a weaker page than users see

JavaScript-heavy sites carry indexing risk because the HTML Google first fetches may not contain the main value of the page. Google can render JavaScript, but rendering adds dependency, delay, and failure points. A page that looks complete to a user can look thin, broken, or inconsistent to a crawler.

The risk is highest when main content, internal links, title tags, canonical tags, robots directives, structured data, status behavior, or language alternates depend on client-side execution. If a script fails, an API is blocked, content is delayed, or rendering differs by user agent, Google may process a weaker version of the page.

JavaScript can also create signal conflicts. The server sends one canonical and the client changes it. The server includes noindex and the client removes it. Structured data appears only after hydration. Internal links appear only after user interaction. The route returns a generic shell with no main content. Infinite-scroll pages expose content visually but not through crawlable links.

For important search pages, server-rendered or statically rendered content is safer. That does not mean modern frameworks are bad. It means critical SEO elements should be present in initial HTML whenever possible: title, canonical, robots directives, main content, primary navigation links, structured data, hreflang, and correct HTTP status.

URL Inspection is useful here because it can show Google’s indexed version, test the live page, and reveal rendered output and crawl information. But live testing is not the same as immediate index update. A page can pass the live test after a fix while the indexed version still reflects an older crawl.

JavaScript issues often show uneven patterns. Some pages index, others do not. Pages with static content index faster. Pages requiring several API calls stall. Pages blocked by consent logic render poorly. Pages hidden behind personalization lack content. Pages with infinite scroll expose fewer links than expected.

The practical test is harsh: fetch the initial HTML and ask whether the page still has enough search value. Then compare it with the rendered version. If the answer changes dramatically, the page has indexing risk.
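
The initial-HTML side of that test can be approximated with a short script. A minimal sketch, assuming the third-party requests library and a hypothetical URL; the rendered comparison still needs a headless browser or URL Inspection:

```python
from html.parser import HTMLParser
import requests  # third-party: pip install requests

class ContentProbe(HTMLParser):
    """Counts visible words and crawlable links in the initial HTML only."""
    def __init__(self):
        super().__init__()
        self.words = 0
        self.links = 0
        self._skip_depth = 0
    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        if tag == "a" and any(name == "href" for name, _ in attrs):
            self.links += 1
    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1
    def handle_data(self, data):
        if not self._skip_depth:
            self.words += len(data.split())

resp = requests.get("https://example.com/spa-route", timeout=10)  # hypothetical URL
probe = ContentProbe()
probe.feed(resp.text)
print(f"initial HTML: about {probe.words} words, {probe.links} links")
# A large gap between these numbers and the rendered page means the page's
# search value depends on client-side execution.
```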

If Google needs a fragile chain of scripts, APIs, hydration, consent, and client state to see the page’s value, the page is less reliable as an index candidate.

Crawl budget is really URL governance

Crawl budget is often discussed as if it were only about server performance. On large sites, it is also URL governance. Googlebot must decide what to crawl, how often to refresh it, and whether the site’s new URLs deserve attention. A site that generates endless weak URLs wastes that decision-making process.

Google’s crawl-budget material has long emphasized that large sites and frequently changing sites need to care more about crawl management than small sites. It also identifies low-value URL patterns that can hurt crawling and indexing, including faceted navigation, session identifiers, duplicate content, soft errors, hacked pages, infinite spaces, low-quality content, and spam.

Faceted navigation is the classic ecommerce problem. One category can generate thousands of combinations: brand, size, color, price, rating, availability, sort order, shipping, and promotional filters. Some combinations match real search demand. Most exist only because the interface allows them. If every combination becomes crawlable and indexable, Google has to sort a mess the site should have governed.

Publishers have their own crawl waste: tag pages, date archives, author archives, short briefs, duplicate topic pages, photo galleries, old live blogs, internal search results, AMP leftovers, print versions, and thin syndicated copies. Some archive pages are useful. Many are not. The site needs rules.

SaaS sites create crawl waste through generated glossaries, comparison pages, alternative pages, integration pages, changelog duplicates, documentation versions, and parameterized demo pages. Local businesses create crawl waste through location pages with no local proof. Marketplaces create crawl waste through expired listings, thin seller profiles, internal search pages, and filtered result sets.

The answer is not to get Googlebot to crawl more junk. The answer is to reduce junk. Each URL pattern should have a role: index, noindex, canonicalize, block, redirect, remove, or serve only internally. The decision depends on whether the URL has search demand, unique value, stable content, internal support, and a clear canonical relationship.
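
Those roles work best when they are written down as executable rules rather than tribal knowledge. A minimal sketch with hypothetical URL patterns and roles; real rules should come from the intended index and from demand data:

```python
from urllib.parse import urlparse, parse_qs

def classify(url):
    """Assign one governance role per URL pattern."""
    parsed = urlparse(url)
    params = parse_qs(parsed.query)
    if parsed.path.startswith("/search"):
        return "block crawl"                # internal search results
    if "sessionid" in params or "ref" in params:
        return "canonicalize to clean URL"  # tracking and session noise
    if "sort" in params:
        return "noindex"                    # sort orders add nothing distinct
    if len(params) > 1:
        return "noindex"                    # stacked filters, usually thin
    return "index"

for url in [
    "https://example.com/search?q=shoes",
    "https://example.com/shoes?sort=price",
    "https://example.com/shoes?color=blue",
    "https://example.com/shoes?color=blue&size=42",
    "https://example.com/shoes",
]:
    print(f"{classify(url):<30} {url}")
```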

Crawl budget is therefore editorial. It asks whether the site knows which pages matter. If everything is indexable, nothing is prioritized. If the sitemap lists everything, the sitemap says little. If internal links expose infinite combinations, Google will spend time discovering combinations instead of refreshing important pages.

The best crawl-budget fix is often not technical speed. It is deleting, consolidating, noindexing, or blocking URL patterns that should never have competed for Googlebot’s attention.

AI-assisted publishing changed the economics of indexing

AI did not create low-quality SEO content, but it lowered the cost of producing it. That changes the index economics. A team that once published 50 pages a month can now publish 5,000. A local SEO operator can create pages for every city and service combination. A SaaS company can generate comparison pages for every competitor pair. An affiliate can publish product roundups at industrial scale. A publisher can rewrite every public announcement into a thin article.

Google’s spam policy around scaled content abuse focuses on pages generated at scale primarily to manipulate search rankings and not help users. The policy is not limited to automation. Google’s March 2024 policy announcement named expired domain abuse, scaled content abuse, and site reputation abuse as new spam policies, explicitly responding to practices that harm search quality.

The key phrase is “no matter how it is created.” AI content is not automatically spam, and human-written content is not automatically safe. The question is whether the publishing pattern produces large amounts of unoriginal content with little value, mainly to manipulate rankings.

That matters for deindexing because Google does not need another version of the same generic page. It does not need 10,000 city pages with swapped location names. It does not need 500 “best software” pages with the same product blurbs. It does not need AI-generated definitions of terms already well covered by authoritative sources. It does not need product pages that copy manufacturer feeds without useful additions.

AI can still support strong content. It can help editors classify pages, summarize internal data, create drafts from original reporting, identify gaps, generate outlines, and maintain documentation. The difference is governance. AI-assisted content backed by real expertise, original data, product experience, verified sources, and human editing can be useful. AI-assisted content used to fill keyword maps without new information is weak.

The indexing squeeze is therefore rational. The supply of pages has exploded. Search engines must decide which pages are worth storing and serving. If Google indexed every plausible page created by modern content systems, search results would drown in redundancy.

The winning standard is not “human-written.” The winning standard is “information-rich, accurate, distinct, useful, and defensible.” AI does not remove that requirement. It makes the requirement more important.

Scaled content abuse is about purpose, not only production method

Scaled content abuse is often misunderstood as “AI content policy.” It is broader than that. Google’s policy language targets the production of many pages primarily to manipulate search rankings and not help users. A human content farm can violate that. A scripted database can violate that. A hybrid AI-human workflow can violate that. A legitimate database-driven site can avoid it if the pages truly serve users.

The difference lies in purpose and substance. A flight-tracking page can be programmatic and useful because it provides real-time data. A real estate listing page can be programmatic and useful because it contains unique property information. A job listing page can be programmatic and useful if it has real openings and clear details. Programmatic SEO is not the problem. Programmatic thinness is.

A weak programmatic page has a template, a keyword variable, and little else. The city changes, but the proof does not. The product name changes, but the analysis does not. The industry changes, but the advice does not. The page exists because a keyword tool showed demand, not because the business has something distinct to offer.

This creates index fragility. Google may initially crawl or index some of these pages, especially under a strong domain. Over time, as systems reassess duplication, usefulness, and site-level patterns, more pages may fall into “crawled but not indexed” or lose visibility. That can feel like sudden deindexing, but it is often delayed recognition of weak scale.

Scaled publishing needs an indexation budget. A site should ask how many pages it can maintain, update, link to, review, defend, and improve. If the business cannot maintain 50,000 pages, it should not pretend each page deserves Google’s index. Indexable inventory should be a curated asset, not a CMS byproduct.

Good scaled pages share three traits. They contain unique data or expertise. They answer a real user need that differs from nearby pages. They are supported by the site’s internal architecture. Weak scaled pages share the opposite traits: repeated copy, shallow variables, and no internal proof that the pages matter.

Scale is not the enemy. Empty scale is. Google’s index can handle large sites. It has less patience for large sites that cannot justify their own URL inventory.

Site reputation abuse turned deindexing into a publisher-revenue fight

Site reputation abuse is one of the clearest places where Google’s indexing and visibility decisions have become politically and commercially explosive. Google’s policy targets third-party pages published on a site in an attempt to abuse search rankings by taking advantage of the host site’s ranking signals. Google’s November 2024 clarification says third-party content alone is not a violation; the violation occurs when the content is published to abuse rankings through the host site’s reputation.

That distinction matters to publishers. Many media companies built commercial-content operations on trusted domains: coupon subdirectories, product-review partnerships, finance offers, betting pages, voucher hubs, affiliate directories, and white-label commerce sections. Some of these projects were editorially connected. Others were thin commercial plays using the publisher’s domain strength.

Google calls the abusive version parasite SEO. Publishers and commercial partners often describe it as monetization. The conflict moved beyond SEO forums into EU regulatory scrutiny. Reuters reported in April 2025 that German media company ActMeraki complained to EU antitrust regulators over Google’s site reputation abuse policy, arguing that the policy penalized websites. Reuters also reported that Google defended the policy as a response to user complaints about site reputation abuse and said enforcement includes review and reconsideration.

The conflict escalated. Reuters reported on November 13, 2025, that the European Commission opened an antitrust investigation into Google’s spam policy after publisher complaints. The Commission said its monitoring indicated Google was demoting news media and publisher content when sites included content from commercial partners. Google defended the policy as anti-spam enforcement against attempts to game rankings.

For the deindexing debate, site reputation abuse matters because affected sections can disappear or lose visibility even when the core editorial site remains intact. A news publisher may still rank for journalism while a coupon directory under the same domain collapses. A site owner may call that deindexing. Google may frame it as policy enforcement or demotion of abusive third-party content.

This creates a strategic test for publishers: does the content belong under the brand in a way users would reasonably expect? A newspaper’s restaurant reviews are different from a rented casino directory. A technology publication’s tested laptop reviews are different from a white-label coupon hub with little editorial involvement. A health publisher’s medically reviewed guide is different from unrelated lead-generation pages.

The site reputation abuse fight is not only about URLs. It is about whether a domain’s trust can be rented. Google is saying no. Publishers are testing whether regulators will accept that answer.

Back button hijacking shows that page behavior now belongs in search risk

Google’s April 2026 back button hijacking policy is not mainly an indexing policy, but it belongs in this discussion because it shows Google’s spam perimeter expanding from text and links into user behavior. Google announced that pages engaging in back button hijacking may be subject to manual spam actions or automated demotions, with enforcement beginning on June 15, 2026.

Back button hijacking manipulates the browser experience so users cannot easily return to the previous page. It can insert fake history states, send users to unwanted pages, show unsolicited recommendations or ads, or trap users after they click from Search. Google’s announcement gives site owners a two-month window before enforcement.

The SEO implication is direct. Search quality does not end when Google ranks a page. If a page behaves deceptively after the click, it becomes a bad search result. That behavior may come from the publisher’s own code, ad networks, engagement scripts, recommendation widgets, affiliate scripts, or third-party libraries. Site owners remain responsible for what users and crawlers experience.

This matters for deindexing because severe spam or malicious behavior can lead to manual actions, automated demotions, or trust loss. A site may think its content is fine while its monetization layer damages search performance. Aggressive ad tech, interstitials, pop-unders, browser-history manipulation, and deceptive navigation patterns can convert a content issue into a policy issue.

The broader lesson is that Google is judging the page as a user experience, not only a document. The page’s content, code, behavior, security, redirects, ads, scripts, and trust signals all matter. A page that matches a query but traps users is not a good result.

Indexing risk now includes what the page does, not only what the page says. SEO teams need product, engineering, ad operations, and compliance in the same room.

News publishers face a harder originality and transparency test

News publishers feel indexing pressure more sharply because they depend on crawl speed, freshness, Top Stories eligibility, Discover visibility, and high-volume article production. They also face intense commercial pressure to publish fast and monetize hard. That combination creates weak spots.

Commodity news is especially vulnerable. If dozens of outlets publish the same wire copy, press-release rewrite, or short summary, Google has little reason to index and surface every version. The publisher that adds original reporting, documents, local context, expert explanation, verified timelines, live updates, or practical service information has a stronger claim.

Originality in news is not only about being first. It is about adding something verifiable. A local angle, a sourced timeline, a government document, a quote gathered by the newsroom, a court filing, a map, a data table, or a clear explanation can separate a useful article from a duplicate. Original reporting is an indexing advantage because it gives Google a reason to store and serve the page as distinct.

News archives also create indexing risk. Years of thin briefs, old tags, low-value author pages, duplicate topic pages, outdated explainers, empty galleries, old live blogs, and syndicated copies can bloat a publisher’s URL inventory. Some archives are public records and should remain. Others should be updated, consolidated, noindexed, or removed.

Transparency is part of trust. Google’s Manual Actions and policy documentation for News and Discover emphasizes clear dates, bylines, author information, publication information, publisher details, and contact information. These signals help users and systems understand accountability.

Discover adds its own volatility. A publisher may lose Discover traffic without being deindexed from web search. A page can remain indexed and still stop appearing in Discover. Treating every Discover loss as deindexing leads to wrong fixes. Discover strategy needs freshness, originality, expertise, packaging discipline, and avoidance of sensational presentation.

Commercial content is the hardest publisher issue. Site reputation abuse enforcement and EU scrutiny show the tension between revenue diversification and search trust. A publisher that hosts unrelated third-party commercial content under a trusted news domain may face visibility losses that do not affect core journalism.

Newsrooms need indexing policy inside editorial operations. Which short briefs should remain indexable? Which wires need canonical treatment? Which stories require updates? Which tags deserve curation? Which commerce sections fit the brand? Which AI workflows are allowed? Which pages should be noindexed after a short shelf life? These are editorial governance questions with search consequences.

Ecommerce sites usually lose control through filters, variants, and inventory churn

Ecommerce indexing problems often begin with URL multiplication. A store may have 20,000 products but expose millions of crawlable URLs through filters, sorting, search results, pagination, tracking parameters, product variants, currency changes, language versions, and category paths. Google may know about far more URLs than the business actually wants indexed.

The indexable set should be intentional. Core categories, high-demand filtered categories, canonical product pages, buying guides, and useful comparison pages usually deserve search visibility. Internal search results, sort orders, thin filter combinations, duplicate variant URLs, empty categories, and sessionized paths usually do not.

Product pages face their own value test. A product page that repeats manufacturer copy is weak. A strong product page adds unique specs, clear availability, original images, shipping information, returns detail, compatibility, reviews, FAQs, comparisons, use cases, and structured data that matches visible content. The page should be useful even when it targets a product name.

Out-of-stock and discontinued products need rules. If a product will return and still has demand, keeping the page indexed may make sense. If there is a replacement, a clear alternative or redirect may be better. If the product is gone and has no useful informational value, returning 404 or 410 can be cleaner. Blindly deleting product pages with backlinks can waste authority; blindly keeping every dead SKU can create crawl waste.
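
Those rules only work when they are explicit enough to automate. A minimal sketch of the decision order described above; the field names and the example product are hypothetical, and the thresholds belong to merchandising, not to code:

```python
def product_page_action(product):
    """Pick one indexing action for an out-of-stock or discontinued product."""
    if product["in_stock"]:
        return "keep indexed"
    if product["restock_expected"] and product["search_demand"]:
        return "keep indexed, show expected availability"
    if product["replacement_url"]:
        return f"301 redirect to {product['replacement_url']}"
    if product["has_backlinks"]:
        return "keep the page, link to closest alternatives"
    return "return 410 and drop from the sitemap"

discontinued = {
    "in_stock": False,
    "restock_expected": False,
    "search_demand": False,
    "replacement_url": "https://example.com/widget-v2",
    "has_backlinks": True,
}
print(product_page_action(discontinued))  # 301 redirect to .../widget-v2
```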

Category pages need real content and stable inventory. A category with no products, repeated boilerplate, or thin filters is a poor index candidate. A category that helps users compare products, understand differences, filter usefully, and reach available items is stronger. Ecommerce SEO often over-focuses on text blocks while ignoring inventory quality, internal linking, faceted logic, and canonical consistency.

Marketplaces face the same problems at larger scale. Listings expire, sellers duplicate content, filters explode, location pages thin out, and internal search creates infinite combinations. Indexing every listing is rarely the right goal. The goal is to index pages that satisfy search demand and remain useful long enough for Google to crawl, store, and serve.

Ecommerce indexation is not about making every URL visible. It is about making the right commercial inventory visible through clean, canonical, useful pages.

SaaS and B2B sites are overproducing comparison pages

SaaS indexing problems are often less obvious than ecommerce problems because the URLs look cleaner. The waste hides in content patterns: feature pages, alternatives pages, “best software” lists, comparison pages, integration pages, templates, industry pages, use-case pages, glossaries, and documentation variants.

The most fragile SaaS page type is the generic comparison page. A company publishes “Product A vs Product B,” “Product A alternatives,” “Best tools for X,” and “Software for Y industry” at scale. Each page repeats the same structure, same claims, same feature language, and same vague buying advice. The names change. The evidence does not.

Google can crawl these pages and decline to index many of them because they lack original information. A useful comparison page should show actual product knowledge: screenshots, workflow details, pricing caveats, feature limitations, migration notes, customer fit, integrations, support differences, security details, and a transparent comparison method. Without that, the page is a keyword container.

Integration pages have a similar problem. If the integration exists, show setup steps, permissions, data flow, limitations, examples, troubleshooting, API details, and real use cases. If the integration does not exist and the page only captures demand, the page is weak and potentially misleading.

Glossaries can be useful when they reflect real domain expertise. They become index bloat when they define obvious terms with generic paragraphs. A B2B site does not need hundreds of thin definitions unless those entries connect to product context, examples, standards, diagrams, workflows, or practitioner detail.

Documentation pages usually have stronger indexing claims because they answer implementation needs. But docs can also create duplication through versioning, old releases, parameterized pages, and internal search results. Documentation requires canonical and archival rules.

SaaS sites should publish fewer “keyword pages” and more evidence pages. The index is more likely to reward pages that prove product experience than pages that imitate a buyer’s guide format.

Local sites are punished by doorway-style repetition

Local SEO creates a strong temptation to generate location pages for every city, suburb, neighborhood, and service combination. Some of those pages are useful. Many are not.

A legitimate local page proves relevance. It may show a real office, service area, local staff, project examples, reviews from that area, local regulations, pricing considerations, travel times, maps, photos, case studies, permits, neighborhood-specific concerns, and internal links to related services. It does not need to be huge. It needs to be real.

A doorway-style local page swaps a place name into boilerplate. The plumbing service is “trusted in Bratislava,” “trusted in Košice,” “trusted in Trnava,” and “trusted in Žilina,” but the content is otherwise the same. The page exists because the keyword exists. Google has little reason to index dozens of near-identical pages from one business if the pages do not add local value.

Local service sites also create duplication through service pages. “Emergency plumber,” “24-hour plumber,” “burst pipe repair,” “leak repair,” and “water damage plumber” may all overlap. Separate pages make sense only when the user need, service detail, examples, and internal links differ. Otherwise, consolidation into stronger pages may work better.

Local businesses often rely on sitemaps and forget internal links. A location page hidden in a sitemap but not linked from the service hub, navigation, footer, case studies, or related pages looks unimportant. If the business claims a page matters, the site should show it.

Reviews and proof matter. A local page with real projects, photos, testimonials, and operational detail is more defensible than a generic page with stock claims. Local expertise is visible through specifics.

The safest local indexation strategy is to publish only the location pages the business can prove. Thin geographic scale is easy to generate and easy for Google to ignore.

Pages lost from the index may be liabilities, not assets

A falling indexed-page count is not automatically bad. Some pages should leave the index. A cleaner indexable inventory can improve crawl efficiency, reduce duplication, and concentrate signals on useful pages.

Healthy exclusions include duplicate parameter URLs, weak tag pages, internal search results, old campaign pages, test pages, staging URLs, thin author archives, printer versions, empty categories, sort orders, obsolete filters, duplicate PDFs, and low-value user-generated pages. If Google stops indexing these, the site may be better off.

Unhealthy exclusions affect intended search assets: canonical product pages, core categories, service pages, high-value location pages, original reporting, evergreen guides, documentation, research, tools, calculators, and pages with proven demand. Those require immediate diagnosis.

The first audit question should be “Did we want this indexed?” Many sites cannot answer. Their sitemaps include everything. Their CMS exposes archives by default. Their filters create crawlable paths. Their noindex rules are inconsistent. Their content teams publish pages without assigning an index role. Google then imposes order from outside.

A mature site classifies URLs into five groups. Some should be indexed and ranked. Some should be indexable only when they meet a quality threshold. Some should exist for users but not be indexed. Some should consolidate into another URL. Some should be removed, redirected, or return 404/410.

That classification turns Search Console from a panic report into an operating system. “Crawled but not indexed” on an internal search page is fine. The same label on a revenue-driving category is a problem. “Alternate page with proper canonical tag” on a sort parameter is fine. The same label on a primary service page is not.

Panic causes destructive fixes. Teams delete pages with backlinks, noindex pages that should be improved, redirect unrelated pages to the home page, rewrite content without diagnosing canonical issues, or request indexing for everything. A controlled audit prevents those errors.

The goal is not maximum indexation. The goal is correct indexation. A site with fewer indexed pages can perform better when the indexed set is cleaner, stronger, and more useful.

Diagnostic signals that separate common causes

Common indexing symptoms and first checks

Symptom | Likely mechanism | First place to check | Bad first reaction
Indexed-page count drops while traffic holds | Duplicate cleanup or low-value URL removal | Page Indexing by URL type | Reindexing every excluded URL
Core pages show “Crawled — currently not indexed” | Quality, duplication, weak internal links, rendering, or canonical conflict | URL Inspection and template comparison | Adding words without adding value
Many URLs show “Discovered — currently not indexed” | Crawl demand, weak linking, or too much URL inventory | Crawl stats, logs, sitemaps, internal links | Repeated manual indexing requests
Important pages show noindex exclusion | Meta robots or X-Robots-Tag rule | HTML, headers, CMS settings | Editing copy instead of directives
Whole site vanishes or spam queries appear | Security issue or manual-action risk | Manual Actions, Security Issues, logs, DNS | Waiting for rankings to return
Pages remain indexed but impressions collapse | Ranking change, query shift, SERP change, or core update | Performance report and SERP comparison | Calling it deindexing

This diagnostic frame keeps the first decision where it belongs: mechanism before remedy. The same traffic decline can require content work, engineering work, security work, policy cleanup, or no action at all.

A fast investigation starts with segmentation

Total site numbers are too blunt. An indexing investigation should begin by segmenting URLs by type and template: articles, evergreen guides, products, categories, filters, tags, author pages, location pages, service pages, comparison pages, integration pages, docs, PDFs, listings, profiles, partner content, and internal search results.

Once segments exist, compare index status by group. Are product pages affected or only filters? Are recent articles excluded or old tags? Are location pages missing or service hubs? Are partner directories hit while editorial pages remain stable? Segment-level evidence turns a vague Google complaint into a solvable problem.

Next, align dates. Mark Google’s March 2026 spam update, March 2026 core update, and any relevant Discover changes. Then mark your own releases: migrations, CMS updates, template changes, robots.txt edits, sitemap generator changes, CDN rules, security tools, ad-tech deployments, JavaScript framework changes, and content launches.

Then inspect representative URLs. Use URL Inspection on affected and unaffected examples from the same template. Check Google-selected canonical, last crawl, crawl status, indexing permission, sitemap discovery, referring page, rendered output, loaded resources, and live test results. Do not inspect only one page and generalize to the whole site.

Logs add evidence that Search Console cannot fully provide. Server logs can show whether Googlebot is crawling affected sections, wasting time on parameters, hitting errors, encountering redirects, or avoiding new URLs. For large sites, logs often reveal the crawl-budget problem behind “Discovered — currently not indexed.”
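
A first pass over the logs does not need a log platform. The sketch below assumes a combined-format access log in a hypothetical file; real formats vary, and filtering by user-agent string is only an approximation, because verified Googlebot identification requires reverse DNS checks:

```python
import re
from collections import Counter

LOG_LINE = re.compile(
    r'"(?:GET|HEAD) (?P<path>\S+) HTTP/[^"]*" (?P<status>\d{3}) .* "(?P<agent>[^"]*)"$'
)

sections = Counter()
statuses = Counter()
with open("access.log", encoding="utf-8", errors="ignore") as handle:  # hypothetical file
    for line in handle:
        match = LOG_LINE.search(line)
        if not match or "Googlebot" not in match.group("agent"):
            continue
        path = match.group("path").split("?")[0]
        section = "/" + path.strip("/").split("/")[0] if path != "/" else "/"
        sections[section] += 1
        statuses[match.group("status")] += 1

print("Googlebot hits by section:", sections.most_common(10))
print("Googlebot hits by status code:", statuses.most_common())
```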

Content review comes after technical eligibility. Evaluate whether the affected pages are unique, current, internally linked, useful, sourced, and better than competing results. Compare them with indexed pages from the same site. The useful question is not whether the page is “good enough” in isolation. It is whether Google has a reason to keep it as a separate result.

Security checks should run early if the decline is sudden or broad. Look for spam queries, unfamiliar indexed URLs, changed snippets, hacked pages, malicious redirects, DNS changes, unknown subdomains, injected scripts, strange server access, and Search Console security messages.

Good diagnosis samples patterns, not feelings. The affected URL group tells you more than a total indexed-page graph.

Recovery begins with an intended index

Recovery starts by defining what should be in Google. Without an intended index, a site cannot tell the difference between healthy exclusion and harmful deindexing.

The intended index is a controlled list of canonical URLs that deserve search visibility. It should include pages with real search value: core categories, products, services, locations with proof, original articles, evergreen resources, documentation, tools, research, and other useful assets. It should exclude internal search results, thin tags, duplicate filters, session URLs, staging pages, thank-you pages, low-value UGC, old campaigns, and pages that exist only for internal use.

For each intended-index URL, the site should be able to answer five questions. Does it return HTTP 200? Is it crawlable? Is it indexable? Is it canonical? Is it useful enough to deserve search traffic? Failure at any point can keep the page out of Google or make its index status unstable.
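
The first four of those questions can be spot-checked automatically; the fifth still needs human judgment. The sketch below is one way to do it, assuming the requests and beautifulsoup4 packages and a placeholder URL; it is not a substitute for URL Inspection, which shows what Google actually recorded.

```python
import urllib.robotparser
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup   # pip install beautifulsoup4

def check(url: str, user_agent: str = "Googlebot") -> dict:
    parsed = urlparse(url)
    robots = urllib.robotparser.RobotFileParser()
    robots.set_url(f"{parsed.scheme}://{parsed.netloc}/robots.txt")
    robots.read()

    resp = requests.get(url, headers={"User-Agent": user_agent},
                        timeout=30, allow_redirects=False)
    soup = BeautifulSoup(resp.text, "html.parser")
    meta = soup.find("meta", attrs={"name": "robots"})
    canonical = soup.find("link", attrs={"rel": "canonical"})

    header = resp.headers.get("X-Robots-Tag", "").lower()
    noindex = "noindex" in header or (
        meta is not None and "noindex" in meta.get("content", "").lower()
    )
    return {
        "status_200": resp.status_code == 200,
        "crawlable": robots.can_fetch(user_agent, url),
        "indexable": not noindex,
        "self_canonical": canonical is None
                          or urljoin(url, canonical.get("href", "")) == url,
    }

# Intended-index URLs to verify (placeholder example).
for url in ["https://www.example.com/services/roof-repair"]:
    print(url, check(url))
```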

Technical cleanup comes first because it removes false barriers. Fix accidental noindex, X-Robots-Tag headers, robots blocks, bad status codes, blocked resources, redirect chains, canonical conflicts, sitemap noise, mobile rendering gaps, and internal links to non-canonical URLs.

Then reduce URL waste. Consolidate duplicate pages. Noindex user-facing pages that should not appear in search. Redirect replaced pages to close equivalents. Return 404 or 410 for pages with no value and no replacement. Block crawl traps only where Google does not need to see noindex or canonical signals.
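
One way to keep that cleanup consistent is to encode the choice as a simple rule. The function below is a hypothetical illustration of that decision order, not a recommendation from Google's documentation.

```python
def action_for(url_info: dict) -> str:
    """Pick one cleanup action for a low-value URL.

    url_info is a hypothetical dict with keys:
      duplicate_of   - canonical-equivalent URL, or None
      replacement    - close replacement URL, or None
      has_user_value - page is still useful to on-site visitors
    """
    if url_info.get("duplicate_of"):
        return f"canonicalize to {url_info['duplicate_of']}"
    if url_info.get("replacement"):
        return f"301 redirect to {url_info['replacement']}"
    if url_info.get("has_user_value"):
        return "keep live, add noindex"
    return "return 404 or 410"

print(action_for({"duplicate_of": None,
                  "replacement": "/new-guide",
                  "has_user_value": False}))
```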

Then improve the pages that matter. Add original information, stronger internal links, updated facts, better evidence, clearer authorship, product detail, local proof, methodology, media, structured data that matches visible content, and practical usefulness. Remove weak overlap between pages.

For manual actions, the path is different. Fix the violation thoroughly, document the work, and request review through Search Console. Do not file reconsideration requests for ordinary ranking drops or algorithmic reassessment. Google’s reconsideration documentation defines the request as a review after fixing manual action or security issue problems.

Recovery is a sequence: define the intended index, fix access and signals, reduce waste, improve value, strengthen internal support, and measure recrawling. Skipping steps creates noise.

Stronger publishing standards are now an SEO requirement

The long-term answer is not a better indexing hack. It is stricter publishing. Search teams need to stop asking only whether a page can be produced and ask whether the page should be indexable.

Every indexable page should pass a purpose gate, uniqueness gate, source gate, maintenance gate, and indexation gate. Purpose asks what user need the page serves. Uniqueness asks what it adds beyond existing pages. Source asks where the information comes from. Maintenance asks who will keep it accurate. Indexation asks whether the page belongs in Google or should remain accessible only inside the site.

This is especially important for AI-assisted workflows. AI can support drafting, classification, summarization, translation, and updating, but it also produces plausible sameness. A site using AI without editorial gates will publish pages that look complete but add little. Google does not need more plausible sameness.

Programmatic SEO needs thresholds. A generated page should be indexable only when it has enough unique data, search demand, internal support, and user value. If a city has no proof, do not index the city page. If a product comparison lacks real analysis, do not publish it. If a filtered category has no stable inventory, noindex or canonicalize it.
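
A sketch of such an indexation gate for generated pages. The fields and thresholds are invented for illustration; each team would define its own based on data, demand, and proof.

```python
from dataclasses import dataclass

@dataclass
class GeneratedPage:
    url: str
    unique_data_points: int      # facts not repeated on sibling pages
    monthly_search_demand: int   # estimated demand for the page's queries
    internal_inlinks: int        # links from indexable pages on the site
    has_local_proof: bool        # real projects, reviews, or inventory

def indexation_decision(page: GeneratedPage) -> str:
    # Illustrative thresholds; tune per template and market.
    if page.unique_data_points < 5 or not page.has_local_proof:
        return "noindex or consolidate"
    if page.monthly_search_demand < 10:
        return "publish but noindex until demand appears"
    if page.internal_inlinks < 3:
        return "hold: add internal links before indexing"
    return "indexable"

print(indexation_decision(GeneratedPage("/cities/springfield", 2, 40, 1, False)))
```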

Publishers need format rules. Breaking news may be short and temporary. Explainers should be updated and linked. Live blogs need closure. Syndicated articles need canonical handling. Tag pages need curation. Commerce projects need editorial fit. AI summaries need verification. Old articles need archival policy.

Ecommerce teams need URL-pattern rules. Which filters are indexable? Which variants canonicalize? What happens to discontinued products? Which categories require minimum inventory? Which internal search pages are blocked or noindexed? Which product pages need unique copy?

SaaS teams need proof standards. A comparison page should prove product knowledge. An integration page should show setup and limitations. A glossary page should add domain context. A template page should provide something usable. A feature page should show real product detail.

A smaller indexable site with stronger pages is better than a larger site filled with weak, unmaintained URLs. The index is less forgiving of volume without substance.

Recovery actions should differ by page type

Practical recovery choices by page type

Page type | Keep indexed when | Improve first | Remove, noindex, or consolidate when
News article | It adds original reporting, verified context, or useful local detail | Dates, bylines, sourcing, updates, related links | It is duplicate wire copy with no added value
Ecommerce category | It matches real demand and has useful, stable inventory | Product coverage, filters, internal links, canonicals | It is a thin filter or sort variation
Product page | It has demand, availability, and unique product value | Specs, images, reviews, FAQs, alternatives | It is permanently discontinued with no useful replacement
SaaS comparison | It contains real methodology and product evidence | Screenshots, trade-offs, pricing caveats, freshness | It repeats generic claims across many pages
Local service page | It proves real service relevance in the area | Local examples, reviews, regulations, staff, links | It only swaps the location name
Tag or archive | It is curated and useful as a topic hub | Intro copy, selection rules, internal links | It is an auto-generated list with thin value

The table matters because index recovery is not one universal action. Missing articles, product pages, filters, comparison pages, local pages, and archive pages each need a different decision.

Measuring recovery without fooling yourself

Index recovery is slow and noisy. Google does not recrawl every fixed page instantly. Search Console reports can lag. A live test may pass before the indexed version updates. A page may return to the index without regaining rankings. A page may be indexed as the wrong canonical. Measurement must be careful.

The first metric should be intended-index coverage. Take the list of URLs that should be indexed and measure how many are indexed as the selected canonical. Do not mix that with every discovered URL. A site with fewer total indexed pages but stronger intended-index coverage may be healthier than a site with a bloated index.
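
A sketch of that coverage calculation, assuming two hypothetical inputs: a plain-text list of intended canonical URLs and a CSV of inspection results (for example, collected via the URL Inspection API) with url, coverage, and google_canonical columns.

```python
import csv

# intended_index.txt: one canonical URL per line (the pages that should rank).
with open("intended_index.txt") as f:
    intended = {line.strip() for line in f if line.strip()}

# inspections.csv: hypothetical export with columns url, coverage, google_canonical.
indexed_as_canonical = set()
with open("inspections.csv", newline="") as f:
    for row in csv.DictReader(f):
        cov = row["coverage"].lower()
        is_indexed = "indexed" in cov and "not indexed" not in cov
        if row["url"] in intended and is_indexed \
                and row["google_canonical"] == row["url"]:
            indexed_as_canonical.add(row["url"])

coverage = len(indexed_as_canonical) / len(intended) if intended else 0.0
print(f"Intended-index coverage: {coverage:.1%} "
      f"({len(indexed_as_canonical)}/{len(intended)})")
print("Sample of missing pages:", sorted(intended - indexed_as_canonical)[:20])
```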

The second metric is segment movement. Track affected templates separately: products, categories, filters, articles, tags, locations, docs, and partner pages. A total graph can hide progress. Important pages may recover while junk pages keep falling.

The third metric is crawl behavior. Use Search Console Crawl Stats and server logs to see whether Googlebot is returning to fixed sections, wasting crawl on parameters, hitting errors, or refreshing high-value pages. Crawl recovery often comes before stable index recovery.

The fourth metric is canonical accuracy. For important pages, verify that Google selected the intended canonical. A page can look missing because another URL wins the canonical cluster. Fixing canonical signals may take time as Google reprocesses duplicates.

The fifth metric is impressions by query class. A recovered page should earn impressions for relevant queries, not merely appear in an index check. Separate brand, non-brand, local, product, informational, transactional, and long-tail queries. Ranking recovery and indexing recovery are related but not the same.
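
Query classification can be rough and still useful. The sketch below pulls query-level data from the Search Console Search Analytics API and buckets it with placeholder rules; the token, property, brand terms, and keyword lists are all assumptions to replace.

```python
from collections import defaultdict

import requests

ACCESS_TOKEN = "ya29.placeholder"                 # obtain via your OAuth flow
SITE = "https://www.example.com/"                 # placeholder property
API = ("https://www.googleapis.com/webmasters/v3/sites/"
       f"{requests.utils.quote(SITE, safe='')}/searchAnalytics/query")

BRAND_TERMS = ("examplecorp", "example corp")     # placeholder brand terms

def classify(query: str) -> str:
    q = query.lower()
    if any(term in q for term in BRAND_TERMS):
        return "brand"
    if any(word in q for word in ("buy", "price", "cheap")):
        return "transactional"
    if "near me" in q:
        return "local"
    return "informational"

resp = requests.post(
    API,
    headers={"Authorization": f"Bearer {ACCESS_TOKEN}"},
    json={"startDate": "2026-03-01", "endDate": "2026-04-30",
          "dimensions": ["query"], "rowLimit": 5000},
    timeout=60,
)
resp.raise_for_status()

impressions = defaultdict(int)
for row in resp.json().get("rows", []):
    impressions[classify(row["keys"][0])] += row.get("impressions", 0)

print(dict(impressions))
```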

The sixth metric is change history. Keep a log of technical fixes, content updates, internal-link additions, sitemap changes, redirects, noindex changes, and template releases. Without a change log, teams cannot connect recovery to action.

A recovered page is not only indexed. It is indexed as the right canonical, crawled reliably, internally supported, earning relevant impressions, and worth maintaining.

Business impact goes beyond SEO dashboards

Indexing problems are revenue problems. For publishers, missing pages can reduce ad inventory, subscriptions, affiliate revenue, Discover reach, Top Stories exposure, and brand visibility. For ecommerce, missing categories and product pages can shift demand into paid search. For SaaS, missing comparison, integration, and documentation pages can weaken pipeline and support discovery. For local businesses, missing service and location pages remove lead paths.

The cost is often hidden because analytics reports traffic, not lost opportunity. A page that never indexed has no obvious decline. A category that vanished may be covered by paid campaigns. A local page that never appears may be blamed on weak sales rather than missing organic visibility. A publisher may chase headline tactics when the real issue is archive bloat and weak originality.

Indexing also affects the wider answer-engine environment. Google’s index is not the only discovery system, but public crawlability, links, citations, structured content, authorship, and freshness all influence how content is found, referenced, and summarized across search and AI systems. A page that is not indexed by Google may still be discoverable elsewhere, but it has lost one of the strongest public-web visibility channels.

The risk is highest for business models built on long-tail page scale. Programmatic SEO, affiliate publishing, local page networks, marketplace listings, and ecommerce filters depend on search engines accepting many distinct URLs. A more selective index raises the cost per successful page. Each page needs more proof, better maintenance, clearer architecture, and stronger internal support.

The upside is that cleaner competitors can win. If Google ignores weak page floods, sites with original information, better architecture, and stronger trust can gain relative visibility. The index squeeze punishes low-value volume but rewards disciplined coverage.

The business question is not “How many pages are indexed?” It is “How much of our search-worthy business is represented by indexed, trusted, canonical pages?”

Original information gain is the strongest defense

Information gain is the practical heart of modern indexation. A page that adds little to existing results is easier to ignore. A page that adds original reporting, data, evidence, experience, or utility is harder to replace.

For publishers, information gain can come from documents, interviews, local reporting, court filings, original photos, timelines, expert analysis, verified corrections, and practical service information. For ecommerce, it can come from original product photos, testing, measurements, compatibility data, reviews, stock information, return details, and buying guidance. For SaaS, it can come from screenshots, workflows, benchmarks, migration notes, API examples, limitations, and support data. For local businesses, it can come from project proof, local regulations, service-area detail, pricing factors, and staff expertise.

Many crawled-not-indexed pages fail here. They answer a query, but no better than existing results. They define a term, but without domain insight. They summarize news, but without reporting. They list products, but without testing. They describe a service, but without local proof. They compare tools, but without evidence.

Originality is not only being first. It can be being clearer, more precise, more local, more current, better sourced, more practical, or more complete for a specific audience. A short page with unique data can beat a long generic article. A local guide with jurisdiction-specific detail can beat a national overview. A product review with original testing can beat an affiliate roundup.

Maintenance is part of information gain. Old pages lose value when facts, prices, screenshots, regulations, products, and processes change. Updating a page meaningfully is different from changing the date. A useful update should alter the main content, links, data, or structured information in a way that improves the page.

The hard editorial test is simple: would a knowledgeable reader learn something here that they would not get from the top existing results? If not, indexing risk rises.

Internal links tell Google what the site values

Internal links are not decoration. They are how a site expresses priority, context, and hierarchy. A sitemap can list a URL, but internal links show whether the site actually values it.

A page with no internal links is weakly endorsed by its own domain. It may be discovered through a sitemap, but discovery is not the same as priority. Important pages should be reachable through navigation, hubs, related content, breadcrumbs, category links, documentation links, or contextual references.

Weak internal linking is common. Blog posts vanish into pagination after publication. Evergreen guides are not linked from new articles. Product pages are accessible only through filters. Location pages sit in XML sitemaps but not service hubs. Comparison pages are hidden because the brand does not want to promote competitors. Documentation pages rely on internal search instead of crawlable navigation.
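
One way to surface these gaps is to compare what the sitemap claims is important with what internal links actually reach. The sketch below assumes a small site that can be crawled from the homepage, uses a placeholder domain, and caps the crawl so it stays a diagnostic sample rather than a full crawl.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urldefrag, urljoin, urlparse

import requests
from bs4 import BeautifulSoup

DOMAIN = "https://www.example.com"            # placeholder
SITEMAP = f"{DOMAIN}/sitemap.xml"
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# 1. URLs the site claims are important.
sitemap_urls = {
    loc.text.strip()
    for loc in ET.fromstring(requests.get(SITEMAP, timeout=30).content)
                 .findall(".//sm:loc", NS)
}

# 2. URLs actually reachable through internal links (breadth-first, capped).
seen, queue, linked = set(), [DOMAIN + "/"], set()
while queue and len(seen) < 500:
    page = queue.pop(0)
    if page in seen:
        continue
    seen.add(page)
    try:
        html = requests.get(page, timeout=30).text
    except requests.RequestException:
        continue
    for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
        target = urldefrag(urljoin(page, a["href"]))[0]
        if urlparse(target).netloc == urlparse(DOMAIN).netloc:
            linked.add(target)
            queue.append(target)

# Trailing-slash and tracking-parameter normalization is left out for brevity.
orphans = sitemap_urls - linked
print(f"{len(orphans)} sitemap URLs with no discovered internal links")
```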

Internal links also prevent content overlap. A topical hub can show which page is the primary guide and which pages are supporting resources. Without this structure, a site may publish many similar articles that compete with one another. Google then has to decide which one matters.

For ecommerce, internal links from categories, buying guides, breadcrumbs, related products, and popular filters support crawl and context. For local sites, service hubs should link to real location pages, and location pages should link back to relevant services and case studies. For publishers, breaking news should link to evergreen explainers, and explainers should link to current updates when relevant.

Anchor text should be natural and specific. Over-optimized exact-match linking across thousands of pages can look manipulative. Vague anchors waste context. The best internal links help readers and clarify relationships.

If a page matters, the site should prove it through links. If the site will not link to the page, Google has reason to treat it as low priority.

Sitemaps are signals, not commands

Sitemaps help Google discover URLs and understand what a site considers important. They do not force indexing. That point is central to the deindexing debate because many site owners believe that sitemap inclusion should guarantee coverage.

A clean sitemap should list canonical, indexable, important URLs. Too many sitemaps do the opposite. They include redirected URLs, noindexed pages, 404s, duplicate variants, parameter URLs, expired campaigns, and obsolete content. That turns a priority signal into noise.

Segmented sitemaps are better for large sites. Separate articles, products, categories, videos, images, docs, locations, and news content when scale warrants it. Segmentation makes monitoring easier. If product-page coverage drops, the product sitemap reveals the problem. If news articles are discovered but not indexed, the article or news sitemap shows the affected group.

Sitemap timestamps should be honest. The lastmod field should reflect meaningful updates, not template changes, copyright-year changes, or automatic daily refreshes. If every URL claims to be updated every day, Google has less reason to trust the signal.

Sitemaps must align with canonicals. If the sitemap lists URL A, the page canonical points to URL B, internal links point to URL C, and redirects send users to URL D, Google receives mixed signals. The sitemap should reinforce the site’s canonical architecture, not contradict it.
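
A small audit script can catch most of this drift. The sketch below assumes a single XML sitemap at a placeholder URL and flags entries that do not return 200, carry a noindex signal, or declare a different canonical.

```python
import xml.etree.ElementTree as ET

import requests
from bs4 import BeautifulSoup

SITEMAP = "https://www.example.com/sitemap.xml"   # placeholder
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

root = ET.fromstring(requests.get(SITEMAP, timeout=30).content)
for loc in root.findall(".//sm:loc", NS):
    url = loc.text.strip()
    resp = requests.get(url, timeout=30, allow_redirects=False)
    problems = []
    if resp.status_code != 200:
        problems.append(f"status {resp.status_code}")
    if "noindex" in resp.headers.get("X-Robots-Tag", "").lower():
        problems.append("noindex header")
    soup = BeautifulSoup(resp.text, "html.parser")
    meta = soup.find("meta", attrs={"name": "robots"})
    if meta and "noindex" in meta.get("content", "").lower():
        problems.append("noindex meta")
    canonical = soup.find("link", attrs={"rel": "canonical"})
    if canonical and canonical.get("href") and canonical["href"] != url:
        problems.append(f"canonical -> {canonical['href']}")
    if problems:
        print(url, "|", ", ".join(problems))
```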

News publishers need special discipline. News sitemaps should focus on fresh eligible articles, not old archives. Publication dates, update times, titles, and article URLs must be accurate. A noisy news sitemap can weaken freshness signals.

Ecommerce sites need inventory rules. Permanently discontinued products should not live forever in sitemaps unless they have informational value. Empty categories should not be listed as important. Variant URLs should not fight canonical product pages.

A sitemap works best when it says the same thing as internal links, canonical tags, redirects, and editorial judgment. It is a supporting signal, not a command to Google.

Security failures can look like sudden deindexing

Security problems can produce the most dramatic indexing losses. A hacked page, DNS compromise, injected spam, cloaked redirects, malware, malicious subdomain, or unauthorized content injection can make Google distrust a site quickly.

A security-related collapse often has telltale signs: unfamiliar indexed pages, spammy titles, strange snippets, gambling or pharmaceutical queries, unexpected redirects, unknown subdomains, suspicious files, altered DNS records, changed favicons, injected scripts, unauthorized CMS users, server access anomalies, or Search Console security warnings.
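
A quick, imperfect check is to scan indexed queries for terms the site would never target. The sketch below assumes a hypothetical CSV export of the Performance report with query and impressions columns, and the marker list is only an example.

```python
import csv

# Hypothetical Performance report export with columns "query" and "impressions".
SPAM_MARKERS = ("casino", "viagra", "replica", "crack", "free download")

with open("queries_export.csv", newline="") as f:
    for row in csv.DictReader(f):
        query = row["query"].lower()
        if any(marker in query for marker in SPAM_MARKERS):
            print("Suspicious query:", row["query"],
                  "| impressions:", row["impressions"])
```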

This type of problem should not be treated as an ordinary content-quality issue. A hacked site needs incident response: identify the entry point, clean malicious files, patch vulnerabilities, rotate credentials, review DNS, remove injected pages, fix redirects, audit server logs, update plugins, harden access, and request review when applicable.

Security also damages crawl efficiency. Hacked pages and spam injections create low-value URLs that Google may crawl and judge as harmful. Even after cleanup, Google may need time to recrawl and reassess trust.

Third-party scripts create another risk. Google’s back button hijacking policy makes clear that site owners can be exposed by code they include that manipulates user behavior. Ad networks, engagement widgets, affiliate scripts, consent tools, and tag-manager snippets can all affect the page experience.

High-value sites need monitoring. Track robots.txt changes, sitemap changes, noindex changes, canonical shifts, DNS records, SSL changes, indexed-query anomalies, server errors, redirects, unfamiliar URL growth, and Googlebot access. Early detection shortens recovery.
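
Monitoring does not need to be elaborate to be useful. The sketch below snapshots a few critical files at placeholder URLs and reports when their content changes; in practice it would run on a schedule and route changes to an alerting channel.

```python
import hashlib
import json
import pathlib

import requests

# Placeholder targets; add key templates, sitemaps, and robots.txt variants.
WATCHED = [
    "https://www.example.com/robots.txt",
    "https://www.example.com/sitemap.xml",
]
STATE_FILE = pathlib.Path("watch_state.json")

state = json.loads(STATE_FILE.read_text()) if STATE_FILE.exists() else {}
changed = []
for url in WATCHED:
    body = requests.get(url, timeout=30).content
    digest = hashlib.sha256(body).hexdigest()
    if state.get(url) and state[url] != digest:
        changed.append(url)
    state[url] = digest

STATE_FILE.write_text(json.dumps(state, indent=2))
for url in changed:
    print("CHANGED:", url)   # wire this to email, Slack, or a ticket queue
```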

If a whole site or major section disappears quickly and spam signals appear, assume security or manual action until proven otherwise.

Manual actions require a different recovery path

Manual actions are not the same as algorithmic losses. A manual action means Google reviewers identified a policy issue. Search Console’s Manual Actions report is the place to check. Google’s documentation says that after fixing issues listed in the report, the site owner can request review and should explain the quality issue, describe the fixes, and document the outcome.

Algorithmic suppression is different. A core update or spam system can reduce visibility without a manual action notice. There is no reconsideration request for a core-update loss. There may be no specific message when Google’s systems decide a page is not useful enough to index or rank.

This distinction matters because the language changes the response. Do not tell stakeholders “we have a penalty” unless there is a manual action, security issue, or clear policy enforcement. A ranking decline after a core update is not automatically a penalty. A crawled-not-indexed URL is not a manual action. A duplicate canonical exclusion is not a penalty.

Manual-action recovery should be complete and documented. Remove or fix the violating content. Address the root cause, not only examples. If spam came through user-generated content, improve moderation. If hacked content appeared, patch the vulnerability. If site reputation abuse caused the action, remove, noindex, move, or rebuild the content according to policy. If unnatural behavior came from scripts, remove the behavior.

Reconsideration requests should be honest. Vague claims that the site is now compliant are weak. A strong request explains what happened, what was changed, how many URLs were affected, which systems were fixed, and how the site will prevent recurrence.

Manual-action recovery can restore eligibility, but it does not guarantee previous rankings. If the site relied on weak pages before the action, those pages may not return to previous performance. Policy cleanup and content competitiveness are separate tasks.

Manual action means fix and request review. Algorithmic loss means improve and wait for reassessment. Technical exclusion means repair signals and recrawl.

AI Overviews and AI Mode raise the stakes for publishers

Indexing anxiety is now tied to a wider publisher fear: search visibility may decline even when content remains indexed, because AI answers can satisfy users before they click. That is not the same as deindexing, but it affects the business value of being indexed.

Reuters reported on April 30, 2026, that Italy’s communications regulator AGCOM asked the European Commission to investigate Google’s AI-powered search features over publisher concerns. AGCOM acted after a complaint from Italian newspaper publishers, who argued that AI Overviews and AI Mode could divert users away from original news sources and threaten publisher economics, especially for smaller and independent outlets.

Reuters also reported in December 2025 that Google faced an EU antitrust investigation into its use of publishers’ online content and YouTube videos for AI services, including concerns around compensation and the ability to refuse use without search penalties.

This adds a new layer to the indexing debate. A page can be indexed but receive fewer clicks if the answer is summarized in the search interface. A publisher may experience traffic loss without deindexing. A site owner who mislabels that as deindexing will apply the wrong fixes.

For GEO and answer-engine visibility, the requirements overlap with classic indexing: crawlable content, clear authorship, trustworthy sourcing, direct definitions, original evidence, structured sections, accurate dates, and canonical stability. Pages that are too weak for indexing are unlikely to become trusted answer sources.

But AI surfaces also reward extractable clarity. A strong page should state key facts directly, explain mechanisms, show evidence, and make source relationships clear. That does not mean writing robotic snippets. It means writing with enough precision that a human reader and a retrieval system can identify the page’s contribution.

Publishers face a double challenge: earning index inclusion and earning clicks in a search environment that may answer more queries directly. The defensive move is not mass-producing more generic pages. It is publishing work that users and systems have reason to seek out: original reporting, analysis, tools, data, local coverage, expert interpretation, and living resources.

AI search does not make indexing irrelevant. It makes weak indexation less valuable and strong authority more important.

Regulatory pressure will not make weak pages worth indexing

The EU investigations into Google’s spam policy and AI search tools show that search visibility is now part of competition policy. Publishers argue that Google’s rules and AI features can harm revenue, restrict visibility, and shift value away from original content. Google argues that spam enforcement protects users and that AI search is part of product development.

Reuters’ November 2025 report on the EU spam-policy investigation centered on Google’s site reputation abuse enforcement and publisher complaints that commercial partner content was being demoted. Reuters’ April 2025 report showed similar publisher concern after ActMeraki complained to EU regulators.

These cases may affect how Google explains or enforces certain policies, especially where publisher monetization and dominant-platform obligations intersect. They will not remove the basic search problem. The web contains too much spam, duplication, manipulation, and low-value content for any serious search engine to index and rank everything generously.

Regulation may demand more transparency or different treatment in specific cases. It may test whether Google’s rules are fair to publishers under the Digital Markets Act or other EU laws. It may affect AI content-use practices. It may produce commitments, fines, process changes, or policy adjustments. Those outcomes matter.

But a weak page remains weak. A regulator cannot make a duplicate filter URL useful. A legal complaint cannot make a thin city page locally credible. A policy change cannot make a generic affiliate roundup original. A commercial partnership cannot automatically make third-party content fit a publisher’s brand.

Businesses should not build search strategies around enforcement gaps. They should build direct audiences, brand demand, email lists, referral channels, partnerships, paid acquisition discipline, and product value. Organic search remains powerful, but dependence on fragile indexing tactics is dangerous.

Regulatory pressure may change parts of the rulebook. It will not restore the old era when publishing more pages was enough.

A better answer for site owners

The best answer to “Is Google deindexing pages?” is precise.

Yes, pages are being removed, excluded, consolidated, delayed, or left out of Google’s index every day. Some of this is normal. Some of it is healthy cleanup. Some reflects stricter quality thresholds. Some follows spam-policy enforcement. Some is caused by site owners through noindex, robots.txt, canonical conflicts, rendering problems, sitemap noise, redirects, and wrong status codes. Some is caused by hacked pages or manual actions. Some traffic loss is not deindexing at all; it is ranking loss or reduced click-through from changing search results.

The current evidence does not support a single mass deindexing campaign against all ordinary websites. It does support a clear operational reality: Google is more selective about which URLs deserve index storage and search visibility.

For site owners, the response should be disciplined. Confirm whether the page is truly deindexed. Segment affected URLs. Align the timing with Google updates and site releases. Inspect canonicals, noindex, robots, headers, status codes, rendering, internal links, sitemaps, and content value. Compare excluded pages with indexed controls. Review security and manual actions. Remove crawl waste. Improve pages that matter. Build an intended index.

The old model treated indexing as a technical default. Publish a page, put it in a sitemap, wait for Google. That model is weaker now because the web has changed. The index has less patience for pages that exist only because a keyword exists.

The stronger model treats indexing as earned. Every indexable page needs a role, proof, maintenance, internal support, and technical clarity. Pages without those things should be improved, consolidated, noindexed, or removed.

Google’s index is getting pickier, not broken. The sites that understand that will stop chasing indexation volume and start managing indexation quality.

Search questions about Google deindexing and indexing problems

Is Google deindexing pages in 2026?

Google has not announced one mass deindexing campaign against normal websites in 2026. It has confirmed the March 2026 spam update and March 2026 core update, and its own documentation says indexing is not guaranteed. Many pages are being excluded, consolidated, delayed, or left unindexed because of quality, duplication, crawl demand, technical signals, spam policy issues, or manual actions.

What does deindexing mean?

Deindexing means a URL that was previously in Google’s index is no longer indexed. It is different from ranking loss, where the page remains indexed but appears lower or for fewer queries. It is also different from canonical consolidation, where Google indexes another URL as the representative version.

Does “Crawled — currently not indexed” mean Google penalized my page?

No. It means Google crawled the URL but has not indexed it. The cause may be weak content, duplication, poor internal linking, rendering issues, canonical conflict, low perceived value, or ordinary selection. It is not automatically a penalty.

Does “Discovered — currently not indexed” mean the page is bad?

Not necessarily. It means Google knows about the URL but has not crawled it yet. This often points to crawl demand, weak internal links, sitemap noise, or too much low-value URL inventory, especially on large sites.

Can a sitemap force Google to index a page?

No. A sitemap helps Google discover URLs and understand which pages a site considers important, but it does not guarantee crawling or indexing. Google still decides whether a URL belongs in the index.

Can noindex remove a page from Google?

Yes. A noindex meta tag or X-Robots-Tag header can prevent a page from appearing in Google Search. If noindex is accidentally applied to important pages, those pages can drop out of results after Google crawls them.

Is robots.txt the same as noindex?

No. Robots.txt manages crawling. Noindex manages indexing. If a page is blocked by robots.txt, Google may not be able to crawl it and see a noindex or canonical signal.

Why did Google choose a different canonical URL?

Google may choose a different canonical if the site sends mixed signals through redirects, canonical tags, internal links, sitemaps, duplicate content, hreflang, or URL patterns. Canonical tags are strong signals, not absolute commands.

Did the March 2026 core update deindex pages?

The March 2026 core update was a broad ranking update, not a publicly announced deindexing campaign. Some pages may have lost visibility during reassessment, but ranking loss and deindexing are different problems.

Did the March 2026 spam update remove pages?

The March 2026 spam update applied globally and to all languages. Google did not publicly list every target. Sites violating spam policies can rank lower or not appear in results, so spam enforcement can remove or suppress pages.

Can AI content be deindexed?

AI-assisted content is not automatically deindexed. The risk is scaled, unoriginal, low-value content created mainly to manipulate rankings. Google’s scaled content abuse policy focuses on purpose and value, not only production method.

Can a manual action remove a whole site?

Yes. Severe spam or security issues can create site-wide impact. Site owners should check Search Console’s Manual Actions and Security Issues reports if a site or major section suddenly disappears.

How do I check whether a page is really deindexed?

Use Search Console’s URL Inspection tool for the exact URL. Check whether Google reports it as indexed, which canonical it selected, whether indexing is allowed, and whether the live page is crawlable and renderable.

Should I request indexing for every excluded URL?

No. Request indexing only after fixing the root cause and only for important URLs. If the page remains weak, duplicate, blocked, noindexed, or canonicalized elsewhere, requesting indexing will not solve the problem.

Should I delete pages Google does not index?

Not automatically. First decide whether the page should be indexed. If it has no search value, noindex, consolidate, redirect, or remove it. If it is important, improve technical signals, internal links, and content value before pruning.

Can internal links improve indexing?

Yes. Internal links help Google discover pages, understand hierarchy, and see which URLs the site values. Important pages should not rely only on sitemaps.

Is a falling indexed-page count always bad?

No. A falling count can be healthy if Google stops indexing duplicate, low-value, or unintended URLs. It is bad when intended, useful, canonical business pages fall out of the index.

How long does indexing recovery take?

Technical fixes can be recognized after recrawling, but quality reassessment and canonical changes can take longer. Manual actions require review after cleanup. Core-update recovery may depend on broader reassessment by Google’s systems.

What is the safest long-term indexing strategy?

Build an intended index of pages that truly deserve search visibility. Keep technical signals consistent, reduce duplicate and low-value URL inventory, publish original and useful content, strengthen internal links, maintain pages over time, and monitor security.

Author:
Jan Bielik
CEO & Founder of Webiano Digital & Marketing Agency

This article is an original analysis supported by the sources cited below

In-depth guide to how Google Search works
Google Search Central documentation explaining crawling, indexing, serving, and Google’s statement that crawling, indexing, and serving are not guaranteed.

March 2026 core update
Google Search Status Dashboard entry recording the March 2026 core update rollout dates.

March 2026 spam update
Google Search Status Dashboard entry recording the March 2026 spam update and its global rollout.

All incidents reported for Ranking
Google Search Status Dashboard history showing recent ranking, spam, core, and Discover updates.

Google Search’s core updates
Google Search Central guidance on analyzing and responding to broad core update impact.

Spam policies for Google web search
Google Search Central documentation defining spam policies, including scaled content abuse and other manipulative practices.

What web creators should know about our March 2024 core update and new spam policies
Google Search Central blog post announcing spam policies for expired domain abuse, scaled content abuse, and site reputation abuse.

New ways we’re tackling spammy, low-quality content on Search
Google product blog post explaining ranking and spam-policy changes aimed at reducing low-quality, unoriginal content.

Updating our site reputation abuse policy
Google Search Central blog post clarifying that third-party content alone is not a violation, but abusing host-site ranking signals is.

Introducing a new spam policy for back button hijacking
Google Search Central blog post announcing back button hijacking as a spam-policy violation with enforcement from June 15, 2026.

Google Search technical requirements
Google Search Central documentation explaining basic technical requirements for Google Search access and indexing.

Block Search indexing with noindex
Google Search Central documentation explaining noindex meta tags and X-Robots-Tag headers.

Robots.txt introduction and guide
Google Search Central documentation explaining how robots.txt manages crawler traffic and its limits.

How to specify a canonical URL
Google Search Central documentation on canonical signals, duplicate URLs, and preferred canonical selection.

Fix canonicalization issues
Google Search Central documentation for diagnosing canonical selection problems.

Understand JavaScript SEO basics
Google Search Central documentation explaining JavaScript rendering and search-related technical risks.

Crawl budget management
Google documentation explaining when crawl budget matters and how large sites should manage crawlable inventory.

What crawl budget means for Googlebot
Google Search Central blog post explaining crawl demand, crawl rate, and low-value URL patterns.

Learn about sitemaps
Google Search Central documentation explaining what sitemaps do and why they do not guarantee indexing.

Build and submit a sitemap
Google Search Central documentation on sitemap construction and accurate lastmod use.

URL Inspection tool
Search Console Help documentation explaining URL Inspection, indexed data, live testing, and indexing requests.

Manual actions report
Search Console Help documentation explaining manual actions, issue repair, and review requests.

Reconsideration requests
Search Console Help documentation defining reconsideration requests for manual actions and security issues.

Google March 2026 core update rollout is now complete
Search Engine Land report on the completion, timing, and context of the March 2026 core update.

Google’s spam policy hit by EU antitrust complaint from German media company
Reuters report on ActMeraki’s EU antitrust complaint over Google’s site reputation abuse policy.

Google hit with EU antitrust investigation into its spam policy
Reuters report on the European Commission investigation into Google’s spam policy and publisher complaints.

Google faces EU antitrust investigation over AI Overviews, YouTube
Reuters report on EU scrutiny of Google’s use of publisher content in AI Overviews and AI services.

Italy’s media regulator asks EU to investigate Google AI search tools over publisher concerns
Reuters report on AGCOM’s April 2026 request for EU review of Google AI Overviews and AI Mode over publisher concerns.