The robots.txt rules Google ignores are finally getting named

Robots.txt has always looked deceptively simple. A plain text file sits at the root of a site, gives crawlers a few instructions, and then disappears from daily conversation until something breaks. Yet that file is one of the few places where publishing, engineering, SEO, infrastructure, and now AI governance all meet in the same twelve lines of text.

Google’s latest movement around robots.txt matters because it is not a ranking update, a new report, or a flashy Search feature. It is a cleanup job. Google has been looking at real robots.txt files across the web, using HTTP Archive data and BigQuery, to identify the unsupported directives site owners actually write into those files. The current Google documentation still says Google supports four fields in robots.txt: user-agent, allow, disallow, and sitemap. Other fields, including crawl-delay, are not supported by Google.

That gap between what site owners write and what Google’s crawler actually honors is the story. Search Engine Journal reported on April 23, 2026, that Google may expand its unsupported robots.txt documentation after Gary Illyes and Martin Splitt discussed the work on Search Off the Record. The project began with a community pull request, then widened into a data-led review of commonly used unsupported rules. A public commit in Google’s robotstxt repository already shows several unsupported tags added to a reporting list, including content-signal, content-usage, domain, request-rate, revisit-after, and visit-time, with comments tying the additions to HTTP Archive custom metrics.

The practical lesson is blunt: robots.txt is not a place for wishful configuration. If a directive is not supported by the crawler you care about, it is not a hidden control. It is a comment with a colon.

Google’s robots.txt expansion is really a cleanup of the messy real web

The phrase “robots.txt docs expand” sounds small until you look at the reason behind it. Google is not expanding robots.txt because the protocol suddenly became new. The Robots Exclusion Protocol dates back to 1994 and was standardized as RFC 9309 in 2022. The RFC defines the method service owners use to tell automated clients, or crawlers, which content they may access. It also formalizes behavior around parsing, caching, errors, redirects, and the structure of a valid /robots.txt file.

What changed is Google’s interest in documenting the web as it is actually configured, not as clean examples suggest it should be configured. Robots.txt files are full of inherited rules, copied snippets, CMS defaults, hosting panel templates, plugin residue, AI crawler experiments, and directives originally meant for other search engines. Some are harmless. Some mislead teams into thinking a crawler has been controlled when nothing has changed. Some expose private paths without protecting them. Some block resources Google needs to render a page correctly.

Google’s current guidance states that robots.txt is mainly for managing crawler traffic and is not a mechanism for keeping a web page out of Google. To keep a web page out of Google, the page needs noindex or password protection. That distinction is easy to say and surprisingly easy to violate. Many teams still use robots.txt as if it were an indexing switch, a security layer, an AI licensing mechanism, a crawl-rate dial, and a canonicalization tool all at once.

The expansion effort appears to be aimed at reducing that confusion. If Google publicly names the unsupported directives it sees most often, Search Console warnings and crawler behavior become easier to interpret. A site owner seeing crawl-delay, noindex, nofollow, host, request-rate, or visit-time in an old robots.txt file can stop asking whether Google secretly honors it. The answer becomes visible.

That visibility has real value. Technical SEO work often fails not because the team lacks skill, but because old assumptions remain undocumented inside infrastructure. A rule gets added during a migration. A plugin appends a directive. A developer copies a template from a forum. Years later, another team treats the line as policy. Google’s documentation expansion turns those ghosts into auditable objects.

The work also signals something about Google’s Search Relations posture. Rather than guessing which unsupported tags deserve documentation, Google looked to HTTP Archive data. The public pull request to HTTP Archive’s custom metrics shows the parser was changed to capture any valid Key: Value pair instead of only a fixed set of known rule types. That lets analysts count custom directives across real robots.txt files.

For SEOs, developers, and publishers, the message is not “add more rules.” It is the opposite. A shorter robots.txt file with supported directives is usually stronger than a long file filled with instructions Google ignores.

The supported field list is still short for a reason

Google’s robots.txt support has a narrow center. Its documentation names four supported fields: user-agent, allow, disallow, and sitemap. The first three determine which crawler may access which paths. The fourth points crawlers toward sitemap URLs. Everything else is outside Google’s supported robots.txt control set unless a specific Google product documents a separate preference mechanism.
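
As a concrete anchor, a minimal file that uses only Google's supported fields might look like the following sketch; the host and paths are hypothetical.

```
User-agent: *
Disallow: /cart/
Disallow: /internal-search/
Allow: /internal-search/help

Sitemap: https://www.example.com/sitemap.xml
```

Every line here maps to a documented behavior, which is exactly the property longer files tend to lose.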

That narrowness can frustrate site owners who want robots.txt to do more. A single text file looks like a convenient control plane. Why not set crawl speed? Why not declare AI training rights? Why not mark pages as noindex? Why not name a preferred host? The answer is partly technical and partly historical. Robots.txt was created to express crawl access, not every downstream use of content. Crawler access is a pre-fetch decision. Indexing, serving, snippets, AI use, licensing, canonicalization, and ranking are later decisions handled by different systems.

The supported fields also keep parsing predictable. A crawler reading robots.txt needs to know which group applies to its user agent, which paths are allowed, which are disallowed, and whether a sitemap exists. Google explains that only one group is valid for a crawler: the most specific matching user-agent group. If multiple groups apply to the same specific user agent, Google combines them internally, while user-agent-specific groups and the wildcard group are not combined. A messy file can therefore behave differently from what a human scanning it expects.
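
A short illustration of that grouping behavior, with hypothetical paths: because Googlebot matches the more specific group below, it follows only that group and ignores the wildcard rules, while other compliant crawlers follow the wildcard group.

```
User-agent: *
Disallow: /internal-search/

User-agent: Googlebot
Disallow: /beta/
```

If /internal-search/ should also be off limits to Googlebot, the rule has to be repeated inside the Googlebot group.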

The sitemap field is a special case. It is not a crawl-blocking rule. Google, Bing, and other major search engines support it in robots.txt, and Google’s documentation says it is not tied to a particular user agent. A sitemap line can point to a sitemap or sitemap index file, and the referenced sitemap does not need to be on the same host as the robots.txt file.

The tight list of supported fields protects site owners from a worse problem: false precision. If Google supported many half-defined directives, every crawler would interpret them slightly differently. That already happens outside the standard. A directive like crawl-delay has had different meanings across engines and tools. Google’s clearer position is safer for operators: for Google, unsupported means unsupported, even if the word looks official.

The April 2026 discussion does not appear to change the supported list. It appears to change the unsupported list, which is a different thing. Google can document a directive as commonly seen and unsupported without making it active. That distinction matters. A documented unsupported directive is not a new capability. It is a clearer warning label.

For a technical team, the clean mental model is simple: use robots.txt to manage crawling access, use noindex to manage indexing, use canonical signals to consolidate duplicates, use HTTP status codes for resource state, use authentication for private content, and use crawler-specific policies only where the crawler documents them.

Unsupported directives are not harmless decoration

A line that Google ignores can still cause damage. It can waste time during audits. It can mislead non-technical stakeholders. It can create false compliance records. It can make teams think they have solved a privacy, indexing, or AI-use problem when they have only written text into a public file.

Google’s 2019 post about unsupported robots.txt rules was direct. When Google open-sourced its robots.txt parser, it said it had analyzed rules unsupported by the internet draft, including crawl-delay, nofollow, and noindex. Google also said it was retiring code that handled unsupported and unpublished rules such as noindex in robots.txt. For anyone relying on noindex inside robots.txt, Google pointed to alternatives such as robots meta tags, X-Robots-Tag headers, 404 or 410 status codes, password protection, and Search Console removals.
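
For teams migrating away from noindex-in-robots.txt, the page-level equivalents are short. The first line belongs in the HTML of a crawlable page; the second is an HTTP response header, which also works for non-HTML files such as PDFs.

```
<meta name="robots" content="noindex">

X-Robots-Tag: noindex
```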

That post is still one of the clearest references because it separates crawler access from index control. A Disallow rule tells a compliant crawler not to fetch a URL. It does not tell the search engine to erase all knowledge of the URL. If other pages link to a disallowed URL, Google may still know the URL exists and show it without a snippet. Google’s robots.txt introduction says exactly that: a disallowed page can still be indexed if linked elsewhere, and private content should be protected with stronger methods.

Unsupported directives create a second problem: they make robots.txt harder to reason about under pressure. During a migration, traffic drop, staging leak, indexation accident, or crawl spike, teams need to know which lines matter. A robots.txt file with twenty unsupported directives, legacy comments, copied bot blocks, stale sitemaps, and environment-specific paths becomes a risk surface. The file may still parse, but the human process around it breaks down.

The public Google robotstxt commit is revealing because the unsupported list it expands includes tags that sound operationally serious: request-rate, revisit-after, visit-time, domain, content-signal, and content-usage. Those are not silly strings. They look like policy. Some may have meaning for other crawlers or emerging content-use systems. For Googlebot crawl access, the commit’s comment says the library “doesn’t use them for anything,” though other search engines may.

That is the uncomfortable middle ground. Unsupported by Google does not always mean globally meaningless. It means the directive should not be counted on for Googlebot behavior. A publisher may still decide to include a content-use signal, a third-party bot rule, or an AI crawler directive for another system. The audit question is narrower: which crawler or product is each line meant to affect, and is that support documented?

Once that question becomes mandatory, many bloated robots.txt files shrink fast.

HTTP Archive changed the documentation conversation

The most interesting part of the April 2026 robots.txt story is not the list of tags. It is the research method. Search Engine Journal reported that Google used HTTP Archive to study robots.txt rules across large numbers of sites, but the first attempt ran into a problem: HTTP Archive’s default crawl did not typically request robots.txt files. The project then used a custom JavaScript parser to extract robots.txt rules line by line, with the resulting data made available in HTTP Archive’s custom metrics dataset.

That is a useful pattern for SEO work beyond robots.txt. It replaces anecdote with distribution. The SEO industry often treats rare edge cases as common because rare edge cases produce memorable screenshots, conference questions, and long forum threads. Data from many real sites can reset that instinct. If Google sees that usage drops sharply after user-agent, allow, and disallow, then documentation can focus on the long tail without pretending that every weird directive deserves equal attention.

The public HTTP Archive custom metrics pull request gives a glimpse of the mechanics. The parser was changed to capture any valid Key: Value pair, dynamically count rule types, normalize unknown rules, exclude sitemap from user-agent breakdowns because sitemaps are global, and preserve schema consistency. The examples even show custom or broken lines such as unicorns, stray HTML-derived fields, and known unsupported directives like crawl_delay and noindex.
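
The actual HTTP Archive custom metric is written in JavaScript; the sketch below is a simplified Python illustration of the same idea, with its own hypothetical output structure rather than the real schema.

```python
from collections import Counter, defaultdict

def summarize_robots(text: str):
    """Count every Key: Value field in a robots.txt body, grouped per user agent."""
    rule_counts = Counter()           # overall field frequency, including unknown fields
    per_agent = defaultdict(Counter)  # field frequency inside each user-agent group
    current_agents = ["*"]            # rules before any user-agent line are counted under * for simplicity
    prev_was_agent = False

    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()          # drop comments and surrounding whitespace
        if ":" not in line:
            continue                                 # skip blank and malformed lines
        key, value = line.split(":", 1)
        key = key.strip().lower().replace("_", "-")  # normalize field spelling
        value = value.strip()
        if not key:
            continue
        rule_counts[key] += 1
        if key == "user-agent":
            # consecutive user-agent lines share one group of rules
            current_agents = current_agents + [value.lower()] if prev_was_agent else [value.lower()]
            prev_was_agent = True
        else:
            prev_was_agent = False
            if key != "sitemap":                     # sitemap is global, keep it out of group breakdowns
                for agent in current_agents:
                    per_agent[agent][key] += 1
    return rule_counts, per_agent
```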

That matters because robots.txt data is messy by nature. A server may return an HTML error page with a 200 status code. A CMS may generate comments, plugin blocks, or malformed lines. A CDN may serve a managed file. A staging environment may leak production rules. A single site can have different robots.txt behavior across http, https, www, non-www, subdomains, and ports. The parser has to deal with real failure, not classroom syntax.

For technical SEO teams, this is a reminder to stop treating robots.txt as a static checklist item. Robots.txt deserves log-level and infrastructure-level thinking. A page in a CMS is not the same as a root file served by the host. A deployed file is not the same as the file in the repository. A valid file path is not the same as a valid rule. A 200 response is not proof the content is a usable robots.txt file.

The data-led approach also helps explain why Google may document unsupported tags rather than ignore the problem. If enough site owners include the same ineffective directives, the issue is no longer just user error. It becomes a communication failure between documentation, tools, CMS defaults, plugins, and crawler reality.

Better documentation will not fix bad files automatically. It will make bad files easier to identify.

The open-source commit gives the clearest clue

The public google/robotstxt commit from April 2026 is more concrete than the podcast summary. It adds content-signal, content-usage, domain, request-rate, revisit-after, and visit-time to a list of unsupported tags, alongside older entries such as clean-param, crawl-delay, host, noarchive, nofollow, and noindex. The commit message says the added tags were identified as frequently used in robots.txt files via HTTP Archive BigQuery tables.

That list is a map of old and new confusion.

crawl-delay, request-rate, revisit-after, and visit-time belong to the dream of controlling crawler speed from a text file. Many site owners want a simple throttle. Google’s current crawl-rate guidance points elsewhere. It says Google’s crawler infrastructure calculates crawl rate algorithmically, aiming to crawl as many pages as possible without overwhelming the server. For emergency short-term reductions, Google recommends returning 500, 503, or 429 instead of 200 for crawl requests, with warnings about broad effects.
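
As an illustration of that guidance, here is a sketch assuming Flask (an assumption, not part of Google's guidance) that answers every request with a 503 while a short-lived incident flag is set; the point is that the emergency throttle lives in the HTTP layer, not in robots.txt.

```python
from flask import Flask, Response

app = Flask(__name__)
MAINTENANCE = False  # hypothetical flag an operator flips during an incident


@app.before_request
def shed_load_during_incident():
    # Return 503 with Retry-After so crawlers back off temporarily.
    # Sustained 5xx responses can hurt Search visibility, so this must stay short-lived.
    if MAINTENANCE:
        return Response("Service temporarily unavailable", status=503,
                        headers={"Retry-After": "3600"})


@app.get("/")
def home():
    return "OK"
```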

noindex, nofollow, and noarchive belong to the indexing and serving layer. Those instructions live in meta robots tags or X-Robots-Tag headers when supported. Google’s robots meta documentation says meta and header controls can be read and followed only if crawlers are allowed to access the pages containing them. This is a common trap: blocking a page in robots.txt can prevent Google from seeing a noindex tag on the page.

host and domain belong to older host preference and regional search engine behavior. Google has other ways to interpret canonical hosts, redirects, and site structure. A host directive in robots.txt should not be treated as a Google canonicalization signal.

content-signal and content-usage belong to the new AI-content governance era. Cloudflare’s Content Signals Policy uses a Content-Signal line to express preferences such as search=yes and ai-train=no, while also saying those signals express preferences and are not technical countermeasures against scraping. Google adding these terms to an unsupported reporting list does not settle the broader legal or standards debate. It does clarify one operational point: such lines should not be confused with Google’s standard robots.txt crawl-control fields.

That distinction is where the real work sits. A publisher may include content signals for rights reservation or AI policy. A technical SEO should still label them correctly: policy expression, not Googlebot crawl access control.

Once those categories are separated, robots.txt becomes easier to govern. The same file may contain crawler access rules, sitemap pointers, third-party bot rules, and rights signals. Each line needs an owner, a purpose, and a documented support target.

Robots.txt is a crawl control file, not an indexing guarantee

The most expensive robots.txt mistakes usually start with one false belief: if a URL is disallowed, it cannot appear in Google. Google’s documentation says otherwise. A robots.txt file tells crawlers which URLs they can access; it is used mainly to avoid overloading a site with requests, and it is not a mechanism for keeping a web page out of Google.

The difference is mechanical. Crawling is fetching. Indexing is storing and evaluating. Serving is showing a result. Robots.txt acts before fetch. If Google is blocked from fetching a page, it cannot see the page content, the HTML, the visible text, the canonical tag, the meta robots tag, or the internal links on that page. Yet Google can still learn that the URL exists from external links, internal links on allowed pages, sitemaps, redirects, historical data, or other discovery paths.

That is why a blocked URL can appear without a useful snippet. Google may know the URL exists but not have page content to describe it. For sensitive pages, this is worse than many teams expect. A disallowed admin path or private document path may become visible as a bare URL because robots.txt is public and the URL appears somewhere else. Google’s useful rules documentation warns not to use robots.txt to block private content because the file can be viewed by anyone and can disclose private paths.

The correct control depends on the goal.

If the goal is to reduce crawling of faceted URLs, internal search results, calendar traps, duplicate sort orders, cart paths, or low-value parameter combinations, robots.txt can be appropriate. If the goal is to remove a page from Google’s index, noindex, a proper removal flow, authentication, or the correct status code is safer. If the goal is to protect confidential material, robots.txt is the wrong tool.

The trap deepens with noindex. Teams sometimes block a page in robots.txt and place noindex on the page, expecting double protection. Google’s meta robots documentation warns that meta and header settings can be read only if the crawler is allowed to access the page. If robots.txt blocks crawling, Google may never see the noindex. The stronger-looking combination can undermine the intended indexing control.

The reliable sequence is plain: allow Google to crawl a page long enough to see noindex, let it process the directive, then remove or block only when the index state is understood. For private pages, skip the dance and require authentication.

Robots.txt controls access. It does not erase discovery.

Crawl budget is still the best reason to care

Robots.txt becomes strategically important on large, messy, frequently changing sites. Google’s crawl budget documentation says the web is too large for Google to crawl and index every available URL, so there are limits to the time and resources Google can spend on a single hostname. Crawl budget is determined by crawl capacity limit and crawl demand.

That framing cuts through a lot of SEO folklore. Small websites do not need to obsess over crawl budget. A hundred-page business site with stable URLs rarely needs a complex robots.txt strategy. A marketplace, publisher archive, ecommerce catalog, classified site, travel aggregator, documentation portal, or faceted product database may need one badly. The difference is URL inventory.

Google says perceived inventory is one factor site owners can control. If Google knows about many duplicate, removed, unimportant, or unwanted URLs, crawling time can be wasted. Its crawl budget documentation explicitly names robots.txt as a tool for blocking URLs that users may need but that site owners do not want Google to crawl or reprocess, such as infinite scrolling pages that duplicate information or differently sorted versions of the same page.

The key word is “crawl,” not “rank.” Robots.txt is not a ranking boost. It does not make weak pages stronger. It does not concentrate PageRank in a magical way. Its value is operational: it can keep Googlebot out of places where crawling produces little or no search value.

Faceted navigation is the classic case. Google’s faceted navigation guidance describes overcrawling, slower discovery crawls, and huge combinations of filter URLs. It recommends preventing crawling of faceted navigation URLs when those URLs do not need to appear in Google Search or other Google products. A clothing site with size, color, brand, price, material, availability, shipping speed, discount, and sort parameters can generate millions of URL combinations. Most are not distinct search landing pages. Letting crawlers explore all of them is not openness. It is waste.

Robots.txt is useful when the unwanted area is pattern-based and crawl-level blocking is acceptable. It is weaker when individual URLs need indexing decisions, when the content must be seen for canonicalization, or when only some parameter combinations have search value. The right answer may combine clean internal linking, canonical tags, parameter handling in application logic, sitemap discipline, and selective robots.txt blocking.

Crawl budget work is really inventory work. Robots.txt is one inventory control, not the whole system.

The cache delay makes robots.txt a poor emergency switch

Robots.txt changes do not behave like a light switch. Google’s update documentation says Google’s crawlers notice robots.txt changes during automatic crawling and update the cached version every 24 hours. If a faster update is needed, site owners can use the Request a recrawl function in the Search Console robots.txt report.

That cache behavior is often forgotten during emergencies. A developer blocks the wrong directory. A staging rule ships to production. A CMS generates a global Disallow: /. A CDN serves the wrong file. Someone fixes robots.txt and expects Google to resume crawling immediately. The fix may be correct, but the crawler may still be working from a cached version for a period of time.

The delay cuts both ways. If a bad rule blocks important content, recovery is not instant. If a team tries to use robots.txt to pause crawling for a few hours during server trouble, Google may not pick up the temporary rule quickly enough, and the rule may remain in effect after the emergency passes. That is why Google’s crawl-rate emergency guidance points to server response codes rather than rapid robots.txt edits. For short-term urgent crawl reduction, Google recommends returning 500, 503, or 429, while warning that long use can affect Search visibility.

Robots.txt is better for durable policy than momentary control. Block /cart/, /checkout/, internal search results, generated calendars, or low-value parameter spaces because those areas should not be crawled. Do not rewrite robots.txt every hour to steer a crawler through operational turbulence.

The file’s public nature also makes emergency edits risky. A rushed rule may reveal paths, block assets, or affect more hosts than intended. Remember that robots.txt applies to the protocol, host, and port where it is posted. Google’s creation guidance says a file at https://example.com/robots.txt applies only to paths on that exact protocol, host, and port, not to subdomains or alternate protocols. In an outage, those boundaries are easy to misread.

The safe operational pattern is boring and strong: version robots.txt, review changes like code, test before deployment, monitor Search Console, inspect logs after major releases, and keep emergency crawl reduction separate from permanent crawl policy.

A robots.txt file should be stable enough that every line has a reason to survive a migration.

File location and host boundaries still trip up mature teams

Robots.txt has one unforgiving rule before any directive is parsed: the file must be in the right place. Google’s guidance says the file must be named robots.txt, must be located at the root of the site host to which it applies, and applies only to paths within the same protocol, host, and port. A robots.txt file on a subdomain controls that subdomain, not the parent domain. A file in a subdirectory is not valid for robots.txt control.
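
A quick way to internalize the scope rule is to read each robots.txt URL as governing exactly one protocol, host, and port; the hosts below are hypothetical.

```
https://example.com/robots.txt          applies to https://example.com/ paths only
https://shop.example.com/robots.txt     applies to the shop subdomain, not the parent domain
http://example.com/robots.txt           separate from the https version
https://example.com/pages/robots.txt    not a valid location for robots.txt control
```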

That sounds basic. It still causes real failures.

Modern websites rarely live on one clean host. A brand may use example.com, www.example.com, shop.example.com, support.example.com, cdn.example.com, regional ccTLDs, staging hosts, app subdomains, API hosts, image hosts, and non-standard ports. Each may need a separate robots.txt decision. Some should allow crawling. Some should block crawling. Some should not exist publicly. Some should require authentication rather than rely on robots.txt.

Migrations make the problem sharper. A site may redirect non-www to www, but robots.txt behavior still has to be valid where crawlers request it. The RFC says robots.txt is found at /robots.txt in the top-level path of the service and must be UTF-8 plain text. It also defines behavior for redirects, unavailable status, and unreachable status. A redirecting robots.txt endpoint is not automatically wrong, but chained redirects, protocol mismatches, CDN rules, and host-level rewrites increase room for mistakes.

The “one file per host” rule is especially important for international SEO. A multilingual site may have language folders under one host, language subdomains, or separate country domains. Robots.txt follows host boundaries, not language strategy. Blocking /de/ on www.example.com says nothing about de.example.com. Allowing a crawler on the root domain says nothing about a staging subdomain exposed by mistake.

The same logic applies to assets. If images, JavaScript, or CSS are served from a CDN host, the CDN host’s robots.txt behavior may affect whether Google can fetch those resources. Google’s introduction warns against blocking resource files when their absence makes the page harder for Google’s crawler to understand. A site can accidentally allow HTML while blocking the resources needed for rendering.

The audit question should be host-based before it is directive-based. List every crawlable host. Fetch /robots.txt on each. Check status, content type, redirects, caching headers, size, and content. Then review rules. Many teams skip straight to syntax and never notice they are editing the wrong host.

Syntax details decide real crawler behavior

Robots.txt is forgiving in some places and strict in others. That combination creates false confidence. Field names are case-insensitive. Spacing is flexible. Comments are allowed. Invalid lines may be ignored. Yet paths are case-sensitive, rule grouping changes behavior, the most specific matching user-agent group wins, and unsupported fields do not become active because they look reasonable.

Google’s creation guide states that each group begins with a User-agent line, that a user agent can match only one rule set, the first, most specific group that matches it, and that rules are case-sensitive. It also says the default assumption is that a user agent can crawl any page or directory not blocked by a disallow rule.

The case sensitivity point is not academic. Disallow: /File.asp and Disallow: /file.asp may control different URLs on many servers. A rule copied from a lowercase URL pattern may not block uppercase variants. A migration that changes path casing can leave old blocks ineffective. A CMS that normalizes visible links while the server still accepts mixed-case paths can create audit noise.

Grouping is another common failure. A wildcard group and a Googlebot-specific group are not simply layered in the way many humans expect. Google’s spec says user-agent-specific groups and global groups are not combined. If a team writes broad blocks under User-agent: * and then creates a small Googlebot-specific group for one exception, it may accidentally override the broader expectation. The crawler follows the most specific applicable group, not every group that looks relevant.

The allow and disallow interaction is also precise. allow is used to override a broader disallow for a subdirectory or page. Google supports * wildcards in paths and $ end anchors. RFC 9309 requires support for special characters including # for comments, $ for end-of-match, and * for wildcard matching. Small syntax choices can therefore decide whether a rule blocks a whole directory, a path prefix, or an exact URL.
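
A few illustrative patterns show how much those characters change the match; the paths are hypothetical.

```
User-agent: *
# Case-sensitive: blocks /private/ but not /Private/
Disallow: /private/
# * matches any sequence of characters, so this blocks URLs ending in .pdf anywhere on the host
Disallow: /*.pdf$
# $ anchors to the end of the URL: /drafts is blocked, /drafts/2026 is not
Disallow: /drafts$
# A longer, more specific allow can carve an exception out of a broader disallow
Allow: /private/press-kit/
```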

This is why robots.txt should be tested against real URLs, not reviewed by eye alone. For each important rule, teams should maintain examples of URLs expected to be allowed and disallowed. That makes regressions visible when a CMS changes URL patterns, a platform adds parameters, or a migration introduces new paths.
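
A first-pass regression harness can live in the repository next to the file. The sketch below uses Python's standard urllib.robotparser; note that it implements the original exclusion protocol, applies rules in file order, and does not reproduce Google's * and $ wildcard matching, so wildcard-heavy rules still need checking against Search Console's robots.txt report or Google's open-source parser. Rules and URLs are illustrative.

```python
from urllib import robotparser

ROBOTS = """\
User-agent: *
Allow: /checkout/help
Disallow: /checkout/
"""

# URL -> whether we expect the crawler to be allowed to fetch it
EXPECTED = {
    "https://www.example.com/checkout/help": True,     # specific allow carves out an exception
    "https://www.example.com/checkout/step-1": False,  # blocked by the broader disallow
    "https://www.example.com/shoes/": True,            # matched by no rule, so crawlable
}

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS.splitlines())

for url, should_allow in EXPECTED.items():
    actual = rp.can_fetch("Googlebot", url)
    marker = "OK  " if actual == should_allow else "FAIL"
    print(f"{marker} {url} expected={should_allow} got={actual}")
```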

Robots.txt is a pattern-matching file. Treating it like a human-readable policy document is where mistakes begin.

Faceted navigation is where robots.txt earns its keep

Faceted navigation remains one of the strongest reasons to maintain a serious robots.txt strategy. Ecommerce, marketplaces, job boards, travel sites, real estate portals, recipe databases, SaaS directories, and publisher archives can create enormous URL spaces through filters, sorting, pagination, and internal search. Search users may need some of those pages. Crawlers do not need all of them.

Google’s faceted navigation documentation describes the core problem well: crawlers often have to access many faceted URLs before determining that they are useless, and crawling spent on useless URLs leaves less time for new, useful URLs. The issue is not that filtered pages are always bad. The issue is that combinations explode faster than editorial value.

A simple category page such as /shoes/ may deserve indexing. A filtered page such as /shoes/?color=black may also deserve indexing if users search for black shoes and the page has stable inventory, useful content, and internal links. A URL like /shoes/?color=black&size=8&sort=price-desc&in_stock=true&discount=20&page=17 is less likely to be a strong search landing page. Multiply that by every category and filter, and the crawl space becomes unmanageable.

Robots.txt can block known low-value parameter patterns. Google’s faceted navigation guide gives examples of disallowing specific query-parameter patterns while allowing an all-products listing. That kind of pattern is practical because it handles a class of URLs before Google spends resources fetching each one.
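
A hedged sketch of that kind of class-level blocking, with hypothetical parameter names; the right list depends entirely on which filtered pages carry real search demand.

```
User-agent: *
# Block filter and sort combinations that never deserve a crawl
Disallow: /*?*sort=
Disallow: /*?*size=
Disallow: /*?*in_stock=
```

Because Google applies the most specific matching rule, a longer Allow pattern can still carve out an individual filtered page that does deserve crawling.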

The hard part is deciding which classes deserve access. Blocking too much can remove useful long-tail landing pages from crawling. Blocking too little can let crawler resources disappear into parameter noise. Canonical tags may reduce indexing duplication over time, but Google still has to crawl a page to see the canonical. Internal nofollow can reduce discovery but is less durable than fixing URL generation and crawl paths. URL fragments may prevent some crawl expansion because Google generally does not use fragments in crawling and indexing, but that requires product and engineering decisions.

The strongest faceted strategy starts with search demand and inventory stability. Which filtered pages have durable search value? Which combinations are generated only by UI state? Which produce thin, empty, or near-duplicate content? Which should be linked internally? Which should appear in XML sitemaps? Robots.txt then enforces the crawl boundary.

Faceted navigation is not a robots.txt problem first. It is a URL design problem that robots.txt can help contain.

AI crawlers made robots.txt politically important again

For years, robots.txt was mostly a technical SEO and infrastructure topic. AI crawlers changed its public meaning. Publishers, brands, artists, software companies, and communities now look at robots.txt as a way to express who may fetch content, who may use it for search, who may use it for AI training, and who may use it for real-time AI answers.

The problem is that the old robots.txt model was built around access, not downstream use. Cloudflare’s 2025 analysis of AI crawler traffic found that about 14% of the top 10,000 domains it could inspect had allow or disallow directives targeting AI bots, with GPTBot both the most blocked and the most explicitly allowed among the sampled domains. Cloudflare also noted that some tokens, such as Google-Extended, are not user-agent substrings and may signal crawler purpose rather than map directly to request logs.

That is a very different world from “block /cart/.” A publisher may want normal Google Search crawling but not AI model training. A brand may want ChatGPT Search visibility but not foundation model training. A documentation site may want AI assistants to retrieve current pages because support costs fall when answers are accurate. A paid publisher may want strict access control and licensing. These are business decisions, not just crawl directives.

OpenAI’s crawler documentation reflects that separation. It describes OAI-SearchBot for surfacing websites in ChatGPT search features and GPTBot for crawling content that may be used in training generative AI foundation models. It says each setting is independent, so a webmaster can allow OAI-SearchBot while disallowing GPTBot. That kind of crawler separation is what many publishers want across the web.
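
Translated into robots.txt, that separation looks like the following sketch, based on OpenAI's documented user-agent tokens:

```
# Allow retrieval for ChatGPT search features
User-agent: OAI-SearchBot
Allow: /

# Refuse crawling for generative AI model training
User-agent: GPTBot
Disallow: /
```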

Google’s ecosystem is more complicated because Googlebot, Google-Extended, Search, AI features, and product-specific crawlers do not map neatly to the same categories in every use case. Cloudflare’s analysis notes that Google uses Google-Extended in robots.txt to determine whether content can be used for AI training, while traffic itself still comes from standard Google user agents such as Googlebot.

This is where Google’s unsupported directive documentation becomes more important. If files now include content-signal, content-usage, AI bot blocks, product-specific tokens, and old crawl directives, site owners need sharper labels. A Disallow line for a documented user-agent token is a crawl access rule. A content signal may be a rights or preference expression. A Google-ignored unsupported tag is not Google crawl control.

AI did not make robots.txt more powerful by itself. It made misunderstandings more expensive.

Content signals belong in a different mental bucket

Cloudflare’s Content Signals Policy is a good example of why unsupported directive documentation needs careful reading. The policy introduces a Content-Signal line that can express preferences such as search=yes, ai-input=no, or ai-train=no. Cloudflare says the signals define whether content may be used for building a search index, real-time AI input, or AI model training, and says restrictions may reserve rights under EU copyright law.
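
In a file, the signal sits alongside ordinary access rules but does a different job; the example below uses the syntax Cloudflare documents, with a hypothetical path in the access rule.

```
# Preference expression under the Content Signals Policy, not a Google crawl-access field
Content-Signal: search=yes, ai-train=no

User-agent: *
Disallow: /members/
```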

Yet Cloudflare also says content signals express preferences and are not technical countermeasures against scraping. It recommends combining them with WAF rules and bot management when publishers need stronger control.

That distinction matters for Google’s expanded unsupported tags list. If content-signal appears in an unsupported reporting list for Google’s robots parser, it does not mean content signals are worthless in every context. It means Google’s standard robots.txt parser should not be assumed to treat Content-Signal as one of its supported crawl-access fields. A legal or standards signal can live in the same file without being the same kind of instruction as Disallow.

This is the new governance problem. Robots.txt is becoming a mixed-purpose public declaration file. It may contain:

  • crawl access rules for Googlebot, Bingbot, GPTBot, OAI-SearchBot, ClaudeBot, or other crawlers;
  • sitemap references;
  • comments explaining internal policy;
  • content-use signals or rights reservations;
  • legacy unsupported directives;
  • CMS or plugin output.

That mix is not automatically bad. It is dangerous when teams treat every line as equally enforceable. A compliance lead may read ai-train=no as a rights statement. An SEO may read Disallow: / under a bot token as access control. A developer may read request-rate as a throttle. A crawler may ignore two of those three lines. A lawyer may care about one of them even if the crawler does not.

The solution is not to ban all non-Google lines. The solution is line ownership. Every non-comment line in robots.txt should answer four questions: who added it, which crawler or system supports it, what behavior it is expected to change, and how the team verifies that behavior.

The future robots.txt audit is part SEO, part infrastructure, part content rights, and part vendor management.

Google’s crawler ecosystem is bigger than Googlebot

Many SEO discussions still talk about Googlebot as if it were the only Google crawler that matters. Google’s newer crawling infrastructure documentation makes that view harder to defend. Google documents common crawlers, special-case crawlers, and user-triggered fetchers. Common crawlers are used for Google products and always obey robots.txt rules when crawling automatically. Special-case crawlers may or may not respect robots.txt rules. User-triggered fetchers ignore robots.txt because a user initiated the fetch.

That taxonomy matters because robots.txt is not a universal instruction to “Google.” It is a set of rules applied to specific automated clients under specific conditions. Google’s common crawlers page lists user-agent strings, robots.txt user-agent tokens, and affected products. It warns that HTTP user-agent strings can be spoofed and that verification is needed.

This is especially relevant for teams reading server logs. A request that says Googlebot is not automatically Googlebot. Google’s verification documentation explains that site owners can verify whether a request is actually from Google and provides reverse DNS and IP range information for different categories. Without verification, bot blocking rules can punish legitimate crawlers or allow spoofed ones.
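
The documented verification flow is forward-confirmed reverse DNS. A minimal Python sketch, with the sample IP shown purely as an illustration:

```python
import socket

def is_verified_googlebot(ip: str) -> bool:
    """Reverse-resolve the IP, check the Google hostname, then forward-confirm it."""
    try:
        host = socket.gethostbyaddr(ip)[0]               # reverse DNS lookup
    except OSError:
        return False
    if not host.endswith((".googlebot.com", ".google.com")):
        return False                                     # not a Google crawler hostname
    try:
        forward_ips = socket.gethostbyname_ex(host)[2]   # forward lookup to confirm
    except OSError:
        return False
    return ip in forward_ips

# Example: a spoofed "Googlebot" request from a random IP fails this check
print(is_verified_googlebot("203.0.113.50"))
```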

Google’s crawling documentation has also moved beyond Search Central alone. The crawling changelog says that in late 2025 Google migrated crawling documentation to a separate crawling infrastructure site because the infrastructure is shared across products including Google Shopping, News, Gemini, AdSense, and more. That move explains why robots.txt guidance now belongs to a broader product environment. Crawling is not just a Search feature.

For publishers, the implication is practical. Blocking all Google crawling may have effects beyond web search. Allowing Googlebot may support Search freshness but does not automatically answer every product-specific question. AdsBot and Storebot behavior may need separate review. User-triggered fetches may appear in logs even where robots.txt blocks automatic crawling. AI-related Google access may involve policy signals that do not look like classic user-agent matching.

A clean robots.txt file therefore begins with crawler inventory. Which Google crawlers appear in logs? Which ones are verified? Which products matter to the business? Which paths should each product access? Which requests are automatic and which are user-triggered? Only after those answers are clear should the file be edited.

“Block Google” is not a technical requirement. It is a vague sentence waiting to become an outage.

Rendering resources are easy to block by accident

One of the quietest robots.txt risks is blocking files that Google needs to understand a page. Google’s introduction says resource files such as unimportant images, scripts, or styles can be blocked if pages loaded without them are not significantly affected. It also warns not to block resources when their absence makes the page harder for Google’s crawler to understand.

That warning grew more important as pages became more dependent on JavaScript, CSS, client-side rendering, APIs, and component libraries. A crawler fetching only raw HTML may not see the same content a user sees. Google’s crawling overview says crawlers use rendering to load a site in full and see a page more like a real person. Blocking JavaScript or CSS can therefore damage Google’s ability to evaluate layout, hidden content, navigation, main content, links, structured data injection, or mobile presentation.

Old robots.txt templates often blocked /assets/, /includes/, /scripts/, /js/, /css/, or /wp-content/ because those paths looked non-content. Some of those blocks are now harmful. A modern page’s content may rely on JavaScript bundles under those directories. A product page may need API calls to render price, availability, reviews, images, and variant data. A documentation site may hydrate navigation and code examples through static assets.

The danger is not limited to Google rankings. AI search systems, accessibility tools, link preview fetchers, social crawlers, monitoring systems, and internal quality tools may all need some resources. A broad robots.txt block can create a web of partial failures that no single team owns.

The right test is rendered output, not file category. If blocking a resource changes what a crawler can understand about the page, the block deserves scrutiny. If a resource is truly irrelevant to content understanding, blocking may be safe. If it is heavy but important, performance work is better than crawler blindness.

There is also a crawl budget angle. Blocking infinite API endpoints or generated JSON feeds may make sense. Blocking all JavaScript because it consumes crawl resources is crude. Google’s crawler infrastructure supports compression, HTTP caching through ETag and Last-Modified, and file-size limits. Better caching and cleaner architecture can reduce load without hiding the page’s meaning.

Do not block what Google needs to see the page as a page.

Status codes and robots.txt failures change crawler assumptions

Robots.txt is fetched like any other resource, but the response code carries special meaning. RFC 9309 says if a crawler successfully downloads robots.txt, it must follow parseable rules. If the file is unavailable through 400-level statuses, the crawler may access resources. If the file is unreachable due to server or network errors, the crawler must assume complete disallow, at least under the standard’s definition.

Google documents its own handling in more detail. For robots.txt, Google treats most 4xx errors as if no valid robots.txt file exists, meaning no crawl restrictions, while 5xx errors can stop crawling temporarily and trigger use of a cached version. The current Google spec page also says robots.txt is generally cached for up to 24 hours, possibly longer when refreshing fails.

That behavior makes robots.txt availability an SEO infrastructure concern. A broken robots.txt endpoint can open or close crawling depending on the failure mode. A 404 may mean crawl everything. A persistent 5xx may mean crawling pauses or falls back to cached rules. A CDN misconfiguration, WAF challenge, malformed redirect, blocked Google IP range, or origin outage can affect Googlebot’s access before it sees a single content URL.

The same issue appears at URL level. Google’s HTTP status documentation says 5xx errors and 429 responses prompt crawlers to temporarily slow crawling, while 401 and 403 should not be used to limit crawl rate. This matters when teams try to use security controls, rate limiting, or bot defenses without separating verified search crawlers from abusive traffic.

A healthy robots.txt setup needs monitoring. At minimum, teams should track the status code, content hash, response size, cache headers, and fetch success for each important host. Changes should trigger alerts. Search Console’s robots.txt report can help with validation and recrawl requests, but server-side monitoring catches problems before SEO tools notice them.
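
A monitoring pass can be a few lines of scheduled code. The sketch below assumes the requests library and hypothetical hosts; thresholds and alerting are left to real infrastructure.

```python
import hashlib
import requests

HOSTS = ["https://www.example.com", "https://shop.example.com"]  # hypothetical hosts

def check_robots(base_url: str) -> dict:
    resp = requests.get(f"{base_url}/robots.txt", timeout=10, allow_redirects=True)
    body = resp.content
    return {
        "final_url": resp.url,                              # where redirects ended up
        "status": resp.status_code,                         # 200 vs 4xx vs 5xx changes crawler assumptions
        "content_type": resp.headers.get("Content-Type", ""),
        "size_bytes": len(body),
        "sha256": hashlib.sha256(body).hexdigest(),         # alert when this changes unexpectedly
        "looks_like_html": body.lstrip().startswith(b"<"),  # catches an error page served as robots.txt
    }

for host in HOSTS:
    print(check_robots(host))
```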

Robots.txt should also stay below size limits and avoid accidental HTML responses. The HTTP Archive examples in the custom metric pull request show oversized files and broken content being parsed into strange fields. An error page served as robots.txt with a 200 response is not a theoretical risk. It is the kind of production mess large datasets reveal.

Crawler control starts with serving the control file reliably.

A compact audit framework for robots.txt

Robots.txt audits often become line-by-line debates too early. A better audit starts with intent, then checks deployment, then checks syntax, then checks crawler-specific support. That sequence prevents the most common mistake: polishing a file that is being served from the wrong host or filled with unsupported assumptions.

Robots controls at a glance

| Control goal | Best-fit mechanism | Robots.txt role |
| --- | --- | --- |
| Reduce crawling of low-value URL patterns | Disallow rules, URL design, internal linking discipline | Strong when patterns are stable and indexing is not needed |
| Remove a page from Google results | noindex, 404 or 410, removal tools, authentication | Weak if used alone because URLs may still be discovered |
| Protect private content | Authentication, authorization, server access control | Wrong tool because robots.txt is public and voluntary |
| Manage AI crawler access | Crawler-specific documented user-agent rules, verified bot controls | Useful only where the crawler documents support |
| Express content-use preferences | Content signals, licensing, legal policy, standards-based declarations | Possible as a public signal, not the same as crawl blocking |
| Lower emergency crawl load | 503, 500, or 429 for short periods, infrastructure controls | Poor emergency switch because of caching and delayed refresh |

This table should not replace judgment. It is a triage tool. If the goal and mechanism do not match, the robots.txt line should be rewritten, moved to another control layer, or removed.

A good audit has six passes.

The first pass is host inventory. Fetch /robots.txt from every public host, protocol, and subdomain that matters. Include CDNs, media hosts, international hosts, staging hosts, app hosts, and non-standard ports where relevant.

The second pass is response validation. Check status codes, redirects, content type, encoding, file size, caching, and whether the response is actually plain-text rules rather than HTML.

The third pass is ownership. Every line should have a current owner. “Plugin added this” is not an owner. “Inherited from old agency” is not an owner. A rule without ownership should be treated as suspect.

The fourth pass is support. Mark each directive as supported by Google, supported by another named crawler, a content-use signal, a comment, or unsupported legacy residue. This is where Google’s expanded unsupported documentation will save time.

The fifth pass is URL testing. For each rule, test real examples expected to be allowed and disallowed. Include case variants, query strings, trailing slashes, encoded characters, and representative faceted URLs.

The sixth pass is impact review. Compare blocked patterns against traffic, logs, index coverage, XML sitemaps, internal links, rendering resources, and business-critical pages. A syntactically correct block can still be commercially wrong.

The audit should end with a shorter file, a changelog, and a monitoring plan. The best robots.txt files are not clever. They are explainable.

CMS and plugin defaults deserve suspicion

Many robots.txt mistakes come from tools that are trying to help. CMS platforms, SEO plugins, security plugins, ecommerce extensions, staging plugins, CDN products, and hosting panels can all generate or modify robots.txt. That convenience is useful until nobody knows which layer is winning.

Google’s introduction notes that CMS users may not need to edit robots.txt directly and may instead have search settings or page visibility controls. That is friendly advice for site owners, but technical teams need to understand the deployment path. A WordPress virtual robots.txt response, a physical file at the root, a server rewrite, and a CDN-managed robots.txt feature can collide. The browser shows one file. The repository contains another. The CMS setting suggests a third. The CDN may serve a fourth.

Plugin defaults are especially risky because they can include outdated SEO beliefs. Some add crawl-delay. Some block asset directories. Some generate bot-specific blocks. Some append sitemap lines. Some treat noindex settings and robots.txt settings as if they were interchangeable. A plugin may be reasonable for a small blog and dangerous for a complex ecommerce site.

Managed robots.txt features for AI crawlers add another layer. Cloudflare says millions of domains use its managed robots.txt feature and that it can include content signals such as Content-Signal: search=yes, ai-train=no. That may be desirable policy, but it must be understood by SEO teams so they do not confuse it with Google’s supported crawl directives.

The answer is not to disable every automation. It is to document precedence. Which system serves robots.txt in production? Which users can change it? Are changes reviewed? Does staging differ from production? Are generated sitemap URLs accurate? Does the CDN cache the file? How quickly can the team roll back? Which monitoring catches unauthorized changes?

Robots.txt deserves the same release discipline as redirects, canonical templates, XML sitemaps, and server status behavior. It can affect discovery, crawl load, search visibility, product feeds, paid surfaces, AI search access, and legal signaling. A checkbox in a plugin UI is not enough governance for that much influence.

If a tool writes robots.txt, the tool becomes part of your crawl infrastructure.

Search Console warnings should be treated as evidence, not annoyance

Search Console has long surfaced robots.txt issues, and the April 2026 discussion suggests Google’s expanded unsupported list may make those warnings more useful. Search Engine Journal noted that Search Console already surfaces some unrecognized robots.txt tags and that documenting more unsupported directives could align public documentation with what people see in reports.

Warnings about unknown or unsupported tags are easy to dismiss because they often do not cause immediate ranking drops. That is the wrong standard. A warning means the file contains a line whose intended behavior does not match Google’s supported parser. Even when the current effect is harmless, the line may be evidence of outdated process, copied configuration, or misunderstood control.

A mature team should classify each warning.

Some unsupported lines can be deleted because no current system relies on them. Old noindex or nofollow directives in robots.txt often fall into this group. Some lines should remain because they target another documented crawler. A Bing-specific, Yandex-specific, or AI-crawler-specific directive may be valid for that crawler even if Google ignores it. Some lines should be moved because they express a goal better handled elsewhere, such as page-level indexing control or server-side access control. Some lines should remain as policy signals with a comment explaining that they are not Google crawl controls.

This classification prevents overcorrection. A Google-centered audit should not delete every line Google ignores if the site has a multi-crawler strategy. Yet it should never allow unsupported lines to masquerade as Google directives.

Search Console should also be paired with server logs. Search Console tells you what Google reports. Logs tell you what crawlers request, how often, from which IPs, with which user agents, and with what status codes. Google’s verification documentation is valuable here because spoofed user agents are common.

The best workflow is cyclical: inspect Search Console, verify in logs, test affected URLs, adjust robots.txt or other controls, request recrawl when needed, and watch crawl behavior after deployment. That turns robots.txt from a set-and-forget file into a maintained control surface.

A robots.txt warning is not an insult from Google. It is a free audit lead.

Publishers need separate policies for search visibility and AI use

The robots.txt conversation has become sharper for publishers because search visibility and AI use no longer feel like the same bargain. Classic search crawlers indexed content and sent users back through links. AI systems may crawl content for training, grounding, summaries, or answers that reduce the need to click. The technical controls are still catching up with the business conflict.

OpenAI’s documentation is notable because it separates OAI-SearchBot from GPTBot. A site can allow one and disallow the other, at least according to OpenAI’s documented controls. That separation gives publishers a practical policy structure: allow retrieval for search visibility where it returns attribution or traffic; restrict model training where the publisher sees no acceptable exchange.

Cloudflare’s Content Signals Policy takes a different route by separating search, AI input, and AI training as content-use categories. Whether that approach becomes widely standardized is still uncertain, but the categories themselves are useful for internal decision-making.

Publishers should not let robots.txt syntax drive policy. Policy should come first. A newsroom, ecommerce brand, SaaS company, university, forum, or documentation site should decide what it wants from search engines, AI search products, training crawlers, user-triggered agents, content scrapers, and commercial data brokers. Then the technical team can map those decisions to supported controls.

For Google, that mapping needs extra care. Google’s crawlers support many products, and Google’s documentation distinguishes common crawlers, special-case crawlers, and user-triggered fetchers. A broad block may remove desirable visibility. A narrow signal may not do what stakeholders think. A content-use line may express preference without changing crawl access.

The risk is binary thinking. “Allow AI” and “block AI” are too crude. A site may want to allow ChatGPT Search and Perplexity-style retrieval but block training crawlers. It may want Google Search, Google News, Google Images, and Google Shopping but not certain downstream AI uses. It may want to block aggressive crawlers while preserving citation opportunities in answer engines. Each choice needs a crawler list, a legal position, and a measurement plan.

The new robots.txt policy question is not who may crawl. It is who may crawl, for what purpose, under which terms, with what verification, and with what business trade-off.

The safest robots.txt files are boring, short, and tested

A strong robots.txt file does not show off. It avoids cleverness. It names only necessary crawler groups. It uses supported directives for the crawlers that matter. It links to current sitemap files. It avoids stale comments, obsolete plugin blocks, and unsupported instructions masquerading as policy. It is tested against real URL examples. It is monitored like infrastructure.

This is where Google’s expansion of unsupported documentation should improve day-to-day SEO work. Once common unsupported directives are named, teams can stop debating folklore. crawl-delay is not a Google crawl throttle. noindex in robots.txt is not a Google indexing command. request-rate and visit-time are not magic speed controls for Googlebot. content-signal may belong to content-use policy, but it is not one of Google’s four supported robots.txt fields.

The file should also be humble about what it cannot do. Robots.txt cannot force bad actors to obey. Google’s introduction says instructions in robots.txt cannot enforce crawler behavior; it is up to crawlers to obey them. For security, use authentication. For abusive bots, use bot management and firewall rules. For legal rights, use contracts, licenses, rights reservations, and counsel. For indexing, use index controls that crawlers can actually see.

A boring file still needs nuance. Large sites may need parameter blocks, crawler-specific groups, AI crawler rules, and content signals. The point is not minimalism for its own sake. The point is that every line should do a job.

Here is a healthy editorial rule for robots.txt governance: if nobody can explain the line in one sentence and name the crawler that supports it, the line is not ready for production.

That rule catches copied fragments. It catches unsupported directives. It catches staging leaks. It catches forgotten disallows. It catches content-use signals with no policy owner. It also makes migrations cleaner, because robots.txt becomes a maintained artifact rather than a superstition drawer.

Google’s documentation expansion is useful because it gives teams permission to remove noise. When unsupported rules are named, pretending they work becomes harder.

The robots.txt update is a search quality story disguised as documentation

Robots.txt documentation rarely feels like a search quality issue. It should. Crawling is where search quality starts. If crawlers waste time in junk URL spaces, miss important updates, cannot render pages, or obey misunderstood blocks, search systems receive worse inputs. If site owners rely on unsupported directives, they make decisions on false feedback. If AI crawlers blur the line between access and use, publishers lose trust in the open web exchange.

Google’s current work around unsupported robots.txt directives is modest in scope, but the direction is healthy. It uses real web data. It identifies common confusion. It clarifies the boundary between supported behavior and ignored syntax. It fits a broader documentation shift in which Google’s crawling guidance now lives in infrastructure documentation because the same crawling systems serve many Google products.

The biggest beneficiaries will be teams that treat the update as an audit trigger rather than a news item. Pull your robots.txt files. Search for unsupported directives. Check every host. Validate resource access. Separate indexing goals from crawling goals. Separate AI-use policy from Googlebot crawl access. Review plugin output. Verify crawler identities. Test representative URLs. Remove dead lines.
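A minimal audit sketch for the "search for unsupported directives" step, assuming Google's four supported fields and the unsupported tags named so far; the host list is a placeholder, and the script only flags field names rather than validating rule logic.

import urllib.request

# Fields Google documents as supported, plus unsupported tags from the
# google/robotstxt reporting list; extend both sets for your own crawler policy.
SUPPORTED = {"user-agent", "allow", "disallow", "sitemap"}
KNOWN_UNSUPPORTED = {
    "crawl-delay", "noindex", "nofollow", "host", "content-signal",
    "content-usage", "domain", "request-rate", "revisit-after", "visit-time",
}

HOSTS = ["https://www.example.com", "https://cdn.example.com"]  # placeholder hosts

for host in HOSTS:
    with urllib.request.urlopen(f"{host}/robots.txt", timeout=10) as response:
        body = response.read().decode("utf-8", errors="replace")
    for number, raw in enumerate(body.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or ":" not in line:
            continue
        field = line.split(":", 1)[0].strip().lower()
        if field not in SUPPORTED:
            label = "known unsupported" if field in KNOWN_UNSUPPORTED else "unknown"
            print(f"{host}/robots.txt line {number}: {field} ({label})")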

The update does not require panic. It rewards discipline.

Robots.txt has always been a public negotiation with machines. The file says, “Here is how I want automated clients to access this site.” The old mistake was assuming that any sentence written into that negotiation had force. Google is making the ignored sentences easier to spot.

For technical SEO, that is good news. For publishers, it is a reminder to define AI and search policies with more precision. For developers, it is a reason to treat robots.txt as deployable infrastructure. For executives, it is a warning that content access rules now sit close to business strategy.

The web does not need longer robots.txt files. It needs clearer ones.

Questions technical teams are asking about Google’s robots.txt expansion

Does Google now support more robots.txt directives?

No confirmed Google documentation change has made additional directives supported for Googlebot crawl control. Google’s current documentation still names user-agent, allow, disallow, and sitemap as supported fields. The April 2026 discussion and public commit point to better documentation of commonly used unsupported tags, not a broader supported control set.

Which robots.txt directives does Google support?

Google supports user-agent, allow, disallow, and sitemap in robots.txt. allow and disallow control crawl access by path. user-agent identifies the crawler group. sitemap points to sitemap files and is not tied to a specific user-agent group.
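A file that uses only those four fields can still express quite a lot; the paths and sitemap URL here are placeholders.

User-agent: *
Disallow: /cart/
Allow: /cart/help

Sitemap: https://www.example.com/sitemap.xml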

Does Google support crawl-delay in robots.txt?

No. Google’s robots.txt specification says fields such as crawl-delay are not supported. Google manages crawl rate algorithmically and provides separate guidance for reducing crawl rate during emergencies or unusual load.

Can robots.txt keep a page out of Google Search results?

Not reliably. Robots.txt can stop Google from crawling a URL, but Google may still discover and show the URL if it is linked elsewhere. To keep a page out of search results, use noindex where Google can crawl the page, password protection, removal tools, or appropriate HTTP status codes.

Why can a blocked URL still appear in Google?

A blocked URL can still appear because crawling and discovery are different. Google may know a URL exists from links, sitemaps, redirects, or previous crawls even if robots.txt prevents fetching the content. In that case, the result may appear without a useful snippet.

Should I put noindex in robots.txt?

No. noindex in robots.txt is not a supported Google robots.txt directive. Use a robots meta tag or X-Robots-Tag header instead, and make sure Google is allowed to crawl the URL so it can see the directive.
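For reference, the two supported placements look roughly like this; the header form also works for non-HTML files such as PDFs.

In the page's HTML head:
<meta name="robots" content="noindex">

Or as an HTTP response header:
X-Robots-Tag: noindex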

Can I block private content with robots.txt?

No. Robots.txt is public and voluntary. It can reveal private paths and does not stop users or non-compliant crawlers from accessing content. Use authentication and server-side authorization for private material.

What did Google’s April 2026 robots.txt work involve?

Google discussed analyzing real robots.txt files using HTTP Archive data and BigQuery to identify common unsupported directives. A public google/robotstxt commit added several frequently used unsupported tags to a reporting list, including content-signal, content-usage, domain, request-rate, revisit-after, and visit-time.

Does content-signal work for Googlebot?

content-signal is not one of Google's supported robots.txt fields for crawl access. It may be used as a content-use preference or rights signal in systems such as Cloudflare's Content Signals Policy, but it should not be treated as a Googlebot crawl directive.

Should publishers use AI crawler rules in robots.txt?

Publishers can use documented AI crawler user-agent rules where specific crawlers support them. The decision should come from business policy: whether the site wants visibility in AI search, whether it permits training use, and whether it needs stronger enforcement through bot management, licensing, or access controls.

What is the difference between GPTBot and OAI-SearchBot?

OpenAI documents GPTBot as a crawler for content that may be used in training generative AI foundation models, while OAI-SearchBot is used to surface websites in ChatGPT search features. OpenAI says the settings are independent, so sites can allow one and disallow the other.

Does Googlebot obey robots.txt?

Google’s common crawlers, including Googlebot, obey robots.txt rules for automatic crawls. Some Google special-case crawlers may or may not respect robots.txt, and user-triggered fetchers ignore robots.txt because a user initiated the fetch.

Can user-agent strings be spoofed?

Yes. Google warns that HTTP user-agent strings can be spoofed. Site owners should verify Google crawler requests using Google’s documented verification methods, reverse DNS, and published IP ranges.
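A rough verification sketch following the documented reverse-then-forward DNS method; the IP is a placeholder from your own access logs, and production checks should also consult Google's published IP range files and hostname suffixes.

import socket

def looks_like_googlebot(ip: str) -> bool:
    """Reverse-DNS the IP, check the hostname suffix, then forward-confirm it."""
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)  # reverse DNS lookup
    except socket.herror:
        return False
    if not hostname.endswith((".googlebot.com", ".google.com")):
        return False
    try:
        return ip in socket.gethostbyname_ex(hostname)[2]  # forward confirmation
    except socket.gaierror:
        return False

print(looks_like_googlebot("66.249.66.1"))  # placeholder IP taken from server logs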

How often does Google refresh robots.txt?

Google says its crawlers usually update the cached robots.txt version every 24 hours during automatic crawling. A faster refresh can be requested through the Search Console robots.txt report.

Should robots.txt be used during a server emergency?

Robots.txt is a poor short-term emergency switch because of caching. Google recommends short-term use of 500, 503, or 429 responses for urgent crawl reduction, while warning that long use can affect Search visibility.
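The shape of that emergency response is just a status code, optionally with a Retry-After hint; whether a given crawler honors the hint varies, so treat it as advisory rather than a guarantee.

HTTP/1.1 503 Service Unavailable
Retry-After: 3600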

What robots.txt mistakes are common on ecommerce sites?

Common ecommerce mistakes include allowing crawlers into unlimited faceted navigation, blocking important product resources, disallowing pages that need noindex, relying on unsupported crawl-rate directives, and forgetting that separate hosts or CDNs need separate robots.txt review.
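As an illustration of the faceted-navigation case, a sketch like the one below blocks filter recombinations that create near-infinite URL spaces; the parameter names are placeholders, and any block like this should follow a crawl-log review rather than guesswork.

User-agent: *
# Filter and sort recombinations that only re-order or re-slice existing products
Disallow: /*?*sort=
Disallow: /*?*color=

Pattern length matters under Google's matching rules, so test any Allow overrides against real URLs before relying on them.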

Can blocking JavaScript or CSS hurt Google’s understanding of a page?

Yes. Google warns against blocking resource files when doing so makes the page harder to understand. Modern pages often need JavaScript and CSS for rendering content, navigation, layout, and structured information.
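When the culprit is a broad directory block, one hedged fix is to carve out the rendering resources; under Google's documented matching, the more specific (longer) rule wins, so the Allow lines below override the shorter Disallow for those file types. The directory name is a placeholder, and often the cleaner fix is simply not to block the resource directory at all.

User-agent: Googlebot
Disallow: /assets/
Allow: /assets/*.css
Allow: /assets/*.js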

Is a long robots.txt file bad?

Not automatically, but long files are harder to govern. A long file full of unsupported directives, stale comments, and copied bot blocks is risky. Every active line should have an owner, a supported target, and a tested purpose.

Should all unsupported directives be removed?

Not always. Some unsupported-by-Google lines may target other crawlers or express content-use policy. They should be labeled and documented. Remove lines that have no current purpose, no supporting crawler, or no owner.

What should technical teams do now?

Audit every robots.txt file by host, validate status and content, classify supported and unsupported directives, test real URLs, check resource access, review AI crawler policy, verify Googlebot requests in logs, and remove legacy lines that do not do what the team thought they did.

Author:
Jan Bielik
CEO & Founder of Webiano Digital & Marketing Agency


This article is an original analysis supported by the sources cited below.

Robots.txt introduction and guide
Google’s Search Central guide explaining what robots.txt is used for, where its limits sit, and why it should not be used to hide pages from Google.

How Google interprets the robots.txt specification
Google’s detailed crawling infrastructure documentation for supported fields, grouping, matching, caching, status handling, and syntax.

Create and submit a robots.txt file
Google’s implementation guide covering file location, host boundaries, UTF-8 format, rule writing, and testing basics.

Update your robots.txt file
Google’s documentation explaining robots.txt cache refresh behavior and the Search Console recrawl option.

Useful robots.txt rules
Google’s examples of common robots.txt patterns, with warnings about privacy, indexing, AdsBot behavior, and media crawling.

Robots meta tag, data-nosnippet, and X-Robots-Tag specifications
Google’s documentation for page-level and header-level indexing and serving controls, including the requirement that crawlers must access the page to read them.

Robots Refresher: robots.txt — a flexible way to control how machines explore your website
Google Search Central’s refresher on robots.txt usage, examples, and the role of robots.txt in crawler access decisions.

A note on unsupported rules in robots.txt
Google’s 2019 post explaining unsupported robots.txt rules such as noindex, nofollow, and crawl-delay, and the alternatives site owners should use.

Google may expand unsupported robots.txt rules list
Search Engine Journal’s April 2026 report on Google’s plan to document more commonly used unsupported robots.txt directives based on HTTP Archive analysis.

Google’s robots.txt docs expand, deep links get rules, EU steps in
Search Engine Journal’s SEO Pulse summary placing the robots.txt documentation expansion in the broader April 2026 search news cycle.

Search Off the Record podcast
Google Search Central’s official podcast hub for Search Off the Record, where Google’s Search Relations team discusses crawling, documentation, and Search systems.

Update robots_txt.js to extract unknown rules dynamically
The HTTP Archive custom metrics pull request that changed robots.txt extraction to capture and count unknown directives across real-world files.

Add content-signal, content-usage, domain, request-rate, revisit-after, and visit-time to the unsupported tags list
Google’s public robotstxt repository commit adding frequently used unsupported tags identified through HTTP Archive BigQuery tables.

RFC 9309 Robots Exclusion Protocol
The IETF standard defining the Robots Exclusion Protocol, including access methods, parsing rules, special characters, redirects, and error handling.

Things to know about Google’s web crawling
Google’s overview of crawling, recrawling, rendering, and the purpose of Google’s crawler ecosystem.

List of Google’s common crawlers
Google’s reference list of common crawler user-agent strings, robots.txt tokens, and product associations.

Verify requests from Google crawlers and fetchers
Google’s verification guide for confirming whether requests claiming to be Googlebot or another Google crawler are legitimate.

Reduce the Google crawl rate
Google’s guidance for handling excessive crawler load, including emergency use of 500, 503, or 429 responses.

Crawl budget management
Google’s documentation explaining crawl capacity, crawl demand, perceived inventory, and when robots.txt can reduce crawl waste.

Managing crawling of faceted navigation URLs
Google’s documentation on overcrawling caused by faceted navigation, parameter combinations, and crawl-control options.

How HTTP status codes affect Google’s crawlers
Google’s crawling infrastructure guide explaining how status codes such as 429, 5xx, 401, and 403 affect crawl behavior.

Google’s crawling documentation changelog
Google’s changelog documenting the migration of crawling documentation to the crawling infrastructure site and later crawler-related updates.

From Googlebot to GPTBot: who’s crawling your site in 2025
Cloudflare’s analysis of AI crawler traffic, robots.txt rules for AI bots, and the changing relationship between crawler access and content use.

Giving users choice with Cloudflare’s new Content Signals Policy
Cloudflare’s announcement and explanation of content signals for search, AI input, and AI training preferences in robots.txt.

Overview of OpenAI crawlers
OpenAI’s official documentation explaining OAI-SearchBot, GPTBot, robots.txt controls, and crawler purpose separation.