The new robots.txt map for Google, OpenAI, Amazon, Yandex and Baidu

The file once treated as a narrow SEO setting has become a front-line policy layer for the machine-readable web. Robots.txt now tells search engines, AI answer engines, shopping assistants, ad checkers, product crawlers, archives and training-data bots what they may fetch, what they should skip, and sometimes what the fetched material may be used for. The change is not theoretical. Google, OpenAI, Anthropic, Amazon, Apple, Perplexity, Mistral, DuckDuckGo, Yandex, Baidu, Seznam, Qwant, Mojeek, Common Crawl and AI2 now publish crawler controls, user-agent tokens, or opt-out mechanisms that site owners have to understand one by one.

The old robots.txt bargain has been rewritten by AI

Robots.txt was built for a smaller web. The original bargain was blunt: a website placed a plain text file at the root of a host, and well-behaved crawlers read it before fetching pages. The file did not lock a door. It expressed a preference. RFC 9309, the current IETF specification for the Robots Exclusion Protocol, keeps that character intact: crawlers are “requested” to honor the rules, and the protocol is not access authorization.

That distinction matters more in 2026 than it did in the search-only web. Robots.txt is not security, not copyright licensing, not paywall enforcement, and not a substitute for authentication. It is a machine-readable signal. Its power depends on whether the visiting system chooses to read it, parse it correctly, cache it responsibly and obey it. Some major operators do. Some say they do, with caveats. Some crawlers identify themselves poorly. Some fetchers argue that a live user request is not the same thing as automatic crawling.

The sharpest change is that one company may now operate several agents with different jobs. OpenAI separates GPTBot, OAI-SearchBot, OAI-AdsBot and ChatGPT-User. Anthropic separates ClaudeBot, Claude-User and Claude-SearchBot. Amazon separates Amazonbot, Amzn-SearchBot and Amzn-User. Mistral separates MistralAI-User and MistralAI-Index. Google uses common crawlers, special-case crawlers, user-triggered fetchers and control-only tokens such as Google-Extended.

The result is a web governance problem disguised as a text-file problem. A publisher no longer asks only, “Should Google crawl this page?” The real questions are more granular. Should this page appear in web search? Should it appear in AI answers? Should it train foundation models? Should a chatbot fetch it when a user asks directly? Should a shopping assistant use it? Should an ad crawler inspect it? Should an archive or research dataset copy it?

The practical answer is that robots.txt now needs to be managed by purpose, not by brand alone.

The protocol still starts with host, path and user agent

The mechanics remain simple enough to fit in a few lines. A crawler requests /robots.txt at the top level of a site. The file contains records headed by User-agent: and followed by Allow: or Disallow: lines. A wildcard group, User-agent: *, gives default rules to crawlers that do not have a more specific group. Google’s robots.txt documentation explains the same host-level rule that SEO teams have relied on for years: rules apply only to the host, protocol and port where the file is served. A robots file for https://example.com/ is not automatically the robots file for https://sub.example.com/, http://example.com/, or another port.
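
The host boundary can be made concrete with a small helper. The Python sketch below derives the robots.txt URL that governs a given page. The function name `robots_url_for` is hypothetical and the example.com URLs are placeholders; the only point is that scheme, host and port together pick the file.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url_for(page_url: str) -> str:
    """Return the robots.txt URL for the origin (scheme, host, port) of page_url."""
    parts = urlsplit(page_url)
    # netloc carries the host and any explicit port, so each origin maps to its own file.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


for url in (
    "https://example.com/docs/page.html",
    "https://sub.example.com/docs/page.html",
    "http://example.com/docs/page.html",
    "https://example.com:8443/docs/page.html",
):
    print(url, "->", robots_url_for(url))
```

Each of the four pages resolves to a different robots.txt URL, which is why a rule published on one host says nothing about its siblings.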

That host boundary is where many crawler policies fail in real deployments. A company may block GPTBot on the main domain but forget the documentation subdomain. A publisher may allow Googlebot on the marketing site and accidentally block it on the image CDN. An ecommerce team may disallow /search/ for all crawlers, then wonder why product discovery suffers in a search surface that relies on indexable category pages. Robots.txt is small, but it touches every domain and subdomain architecture decision.

Google’s documentation also gives a useful reminder that status codes matter. A successful 2xx response causes Google crawlers to process the file; most 4xx responses are treated as if no valid robots file exists; server errors may trigger a temporary stop and cached-rule behavior. Google also enforces a 500 KiB file size limit. The exact behavior differs by crawler, but the operational lesson is the same: a broken robots.txt file can open or close the wrong parts of a site without anyone noticing until logs, indexing reports or AI referrals change.
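
The status-code behavior described above can be written down as a small decision function, which is useful for monitoring. This is a minimal sketch of the Google-style handling summarized in the text, not any crawler’s actual implementation. The names `RobotsOutcome`, `classify_robots_response` and `usable_body` are hypothetical, and grouping 429 with server errors is an assumption based on a reading of Google’s documentation; redirects are deliberately not modeled.

```python
from enum import Enum

MAX_ROBOTS_BYTES = 500 * 1024  # the 500 KiB limit the text attributes to Google


class RobotsOutcome(Enum):
    PARSE_RULES = "parse the returned rules"
    NO_RESTRICTIONS = "behave as if no robots.txt exists"
    BACK_OFF = "pause crawling or reuse cached rules"


def classify_robots_response(status: int) -> RobotsOutcome:
    """Map the HTTP status of a /robots.txt fetch to a coarse crawler reaction."""
    if 200 <= status <= 299:
        return RobotsOutcome.PARSE_RULES
    # Assumption: 429 is handled like a server error rather than like other 4xx codes.
    if status == 429 or 500 <= status <= 599:
        return RobotsOutcome.BACK_OFF
    if 400 <= status <= 499:
        return RobotsOutcome.NO_RESTRICTIONS
    raise ValueError(f"status {status} (for example a redirect) is not modeled in this sketch")


def usable_body(body: bytes) -> bytes:
    """Keep only the part of the file that fits inside the size limit."""
    return body[:MAX_ROBOTS_BYTES]


for code in (200, 404, 429, 503):
    print(code, "->", classify_robots_response(code).value)
```

A check like this, run against the live file on every deploy, catches the case where an accidental 404 or 503 silently changes what crawlers are allowed to do.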

The file is also case-sensitive in paths. /Private/ and /private/ are different paths on many servers. Robots parsers can differ in edge cases. Nonstandard directives such as Crawl-delay, Request-rate or Clean-param are not universal. Yandex supports several advanced directives; Google does not use every extension in the same way; Amazon states that its crawlers do not support Crawl-delay; Mojeek says MojeekBot does not support crawl-delay at this time.
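
Path case sensitivity and the patchy support for extensions are easy to see with the parser in Python’s standard library. The sketch below uses `urllib.robotparser`; the agent name `ExampleBot` and the URLs are placeholders. The parser reading a `Crawl-delay` value only shows what the file says, not whether any particular crawler honors it.

```python
from urllib.robotparser import RobotFileParser

RULES = """\
User-agent: *
Disallow: /private/
Crawl-delay: 10
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

# Paths are compared case-sensitively, so only the lowercase path is blocked.
print(parser.can_fetch("ExampleBot", "https://example.com/private/report.html"))  # False
print(parser.can_fetch("ExampleBot", "https://example.com/Private/report.html"))  # True

# The value is readable here, but crawlers that ignore Crawl-delay will never act on it.
print(parser.crawl_delay("ExampleBot"))  # 10
```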

This is why serious robots governance starts with a crawler inventory. The inventory should list the user-agent token, operator, purpose, verification method, business value and risk. A single Disallow: / line may be technically correct and commercially wrong.
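
One way to keep such an inventory honest is to store it as data and generate the file from it. The sketch below is an illustration under stated assumptions: `CrawlerEntry` and `render_robots` are hypothetical names, the field values are placeholders rather than recommendations, and a real inventory would carry path-level rules instead of a single allow flag.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class CrawlerEntry:
    token: str           # user-agent token used in robots.txt
    operator: str        # company or project running the crawler
    purpose: str         # search, training, user fetch, ads check, ...
    verification: str    # how the team confirms the traffic is genuine
    business_value: str
    risk: str
    allow: bool          # site-wide decision for this token


def render_robots(entries: Iterable[CrawlerEntry], default_allow: bool = True) -> str:
    """Render one robots.txt group per inventory entry, then a wildcard default."""
    groups = []
    for entry in entries:
        rule = "Allow: /" if entry.allow else "Disallow: /"
        groups.append(
            f"# {entry.operator}: {entry.purpose}\n"
            f"User-agent: {entry.token}\n{rule}"
        )
    default_rule = "Allow: /" if default_allow else "Disallow: /"
    groups.append(f"User-agent: *\n{default_rule}")
    return "\n\n".join(groups) + "\n"


# Placeholder values for illustration only.
INVENTORY = [
    CrawlerEntry("GPTBot", "OpenAI", "model training", "to be confirmed",
                 "low", "content reuse", allow=False),
    CrawlerEntry("OAI-SearchBot", "OpenAI", "ChatGPT search", "to be confirmed",
                 "referrals", "server load", allow=True),
]

print(render_robots(INVENTORY))
```

Generating the file from the inventory means every group has a recorded owner and purpose, so a blanket block is a visible decision rather than an accident.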

Google still anchors search crawling, but Google-Extended changed the policy model

Google remains the reference case because its crawler documentation is unusually detailed and because Google Search, Discover, Images, Video, News, Shopping and testing tools each have different crawl implications. Google’s common crawlers page says common crawlers are used to build search indexes, run product-specific crawls and perform analysis, and that they always obey robots.txt rules when crawling automatically. It also lists user-agent tokens such as Googlebot, Googlebot-Image, Googlebot-Video, Googlebot-News, Storebot-Google, GoogleOther, GoogleOther-Image, GoogleOther-Video, Google-CloudVertexBot and Google-Extended.

The ordinary search logic is familiar. A User-agent: Googlebot group affects Google Search and features tied to Search, including Discover and search surfaces that use images, videos or news content. Blocking Googlebot is a high-impact decision because it can remove the page from crawling and reduce visibility in search products. Blocking a narrow crawler such as Googlebot-Image affects image use and image search surfaces, not necessarily the whole page.

The more modern policy shift is Google-Extended. Google says Google-Extended does not have a separate HTTP user-agent string. Crawling is done with existing Google user agents, while Google-Extended functions as a standalone robots.txt product token. Google says publishers can use it to manage whether content Google crawls may be used for training future Gemini models and for grounding in Gemini Apps and Grounding with Google Search on Vertex AI. Google also states that Google-Extended does not affect inclusion in Google Search and is not used as a Google Search ranking signal.

That is a new pattern: a robots.txt token that does not identify a separate crawler in logs but controls a downstream use of already crawled content. Applebot-Extended works in a similar policy category, though Apple defines its own behavior. This creates a split between crawling and usage. A publisher may allow Googlebot for search and disallow Google-Extended for Gemini model training or grounding. The server log may still show Googlebot. The control is not about whether a Google crawler ever sees the page. It is about what Google says it may do with content it crawls.
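
The split between crawling and usage looks like this in a file. The sketch below parses an example policy with Python’s `urllib.robotparser`; the URL is a placeholder. It only reports what the file expresses. Since Google-Extended has no HTTP user-agent string of its own, the second check describes a usage preference, not a request that would ever appear in logs.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /

User-agent: Google-Extended
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

URL = "https://example.com/article"
# Search crawling stays open.
print(parser.can_fetch("Googlebot", URL))        # True
# The control-only token is disallowed, signalling the usage opt-out.
print(parser.can_fetch("Google-Extended", URL))  # False
```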

Google announced Google-Extended in 2023 as a publisher control for whether sites help improve Bard and Vertex AI generative APIs, and later documentation ties the token to Gemini and Vertex grounding use cases. For site owners, the operational rule is direct: do not treat every Google user agent as the same policy surface. Googlebot, GoogleOther, Storebot-Google, Google-InspectionTool, Google-CloudVertexBot and Google-Extended answer different questions.

OpenAI separates search visibility from training consent

OpenAI’s crawler documentation is one of the clearest examples of the new three-part model: search crawler, training crawler and user-triggered fetcher. OpenAI says it uses OAI-SearchBot and GPTBot robots.txt tags so webmasters can manage how their sites and content work with AI. It also states that each setting is independent: a site can allow OAI-SearchBot to appear in ChatGPT search results while disallowing GPTBot to indicate that crawled content should not be used for training generative AI foundation models.

That sentence carries the core strategic lesson. Blocking GPTBot is not the same as blocking ChatGPT Search visibility. Blocking OAI-SearchBot is the search visibility decision. OpenAI describes OAI-SearchBot as the agent used to surface websites in ChatGPT search features, while GPTBot is used to crawl content that may be used in training OpenAI’s generative AI foundation models.

OpenAI also lists ChatGPT-User. It is used for certain user actions in ChatGPT and Custom GPTs, and OpenAI says it is not used for automatic web crawling. Because these requests are initiated by users, OpenAI notes that robots.txt rules may not apply. OpenAI also says ChatGPT-User is not used to determine whether content may appear in Search; OAI-SearchBot is the control for search opt-outs and automatic crawl.
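
The independence of these settings can be checked mechanically before a file ships. The following sketch builds a per-token access table with Python’s `urllib.robotparser`; `access_table` is a hypothetical helper and the URL is a placeholder. The entry for ChatGPT-User reflects only what the file says, since user-initiated fetches may not consult robots.txt at all.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: OAI-SearchBot
Allow: /

User-agent: GPTBot
Disallow: /
"""


def access_table(robots_txt: str, tokens: list[str], url: str) -> dict[str, bool]:
    """Return, for each token, whether the file permits fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {token: parser.can_fetch(token, url) for token in tokens}


table = access_table(
    ROBOTS_TXT,
    ["OAI-SearchBot", "GPTBot", "ChatGPT-User"],
    "https://example.com/guide",
)
for token, allowed in table.items():
    print(f"{token}: {'allowed' if allowed else 'blocked'}")
```

With this file, search visibility stays open, training access is declined, and the unlisted user-triggered agent falls through to the default of no restriction.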

That caveat is one of the most contested areas of the new bot economy. A publisher may believe that Disallow: / should stop all automated access. A platform may argue that a user-triggered fetch is closer to a browser action than a crawl. The result is a policy gap. Robots.txt is designed for automatic clients, but AI products blur the line between automatic discovery and user-directed retrieval.

OpenAI also lists OAI-AdsBot, used to validate the safety of pages submitted as ads on ChatGPT. OpenAI says OAI-AdsBot visits pages submitted as ads and that content collected by OAI-AdsBot is not used to train generative AI foundation models. This is another reason not to block by company name without understanding the function. An ads validation bot, a search crawler, a training crawler and a live assistant fetcher do not carry the same risk or the same commercial upside.

Anthropic now publishes separate Claude controls for training, user retrieval and search

Anthropic’s current crawler article, dated April 7, 2026, says it uses several robots to gather public web data for model development, search the web and retrieve web content at users’ direction. It lists three bot classes: ClaudeBot, Claude-User and Claude-SearchBot.

ClaudeBot is Anthropic’s training-related crawler. Anthropic says ClaudeBot collects web content that could contribute to training, and that restricting ClaudeBot signals that the site’s future materials should be excluded from AI model training datasets. Claude-User supports user-directed Claude requests; disabling it can prevent the system from retrieving content in response to user queries and may reduce visibility for user-directed web search. Claude-SearchBot is used to improve search result quality; blocking it may reduce visibility and accuracy in user search results.

The notable part is not only the naming. It is the explicit link between robots.txt and AI visibility. Anthropic is telling publishers that blocking search-oriented Claude bots may reduce the chance their pages surface inside Claude’s search experience. This is the same trade-off OpenAI creates with OAI-SearchBot and Perplexity creates with PerplexityBot: block the crawler and you protect a use case, but you may also reduce answer-engine visibility.

Anthropic says its bots honor industry-standard robots.txt directives, respect anti-circumvention technologies such as CAPTCHAs, and aim to avoid intrusive crawling. It also supports the nonstandard Crawl-delay extension for limiting crawling activity. That makes Anthropic more granular than many AI operators, but it also puts work on publishers. A simple “block all AI” file may block training, user retrieval and search discovery at once. A more precise file can block ClaudeBot while allowing Claude-SearchBot, or make the opposite choice.

The strategic question is not whether Claude should be allowed. It is which Claude behavior has value. A paywalled publisher may reject training access but allow search snippets to public articles. A software documentation site may allow Claude-User because developers ask direct questions in Claude. A legal database may block every Claude agent except a signed partner integration. Robots.txt cannot decide that strategy. It can only express it.

Amazon has moved from Amazonbot to a three-agent model

Amazon’s crawler documentation now mirrors the wider market. It lists Amazonbot, Amzn-SearchBot and Amzn-User, and says webmasters can manage how sites and content are used by Amazon with independent crawler settings. Amazon says changes may take about 24 hours to reflect.

Amazonbot is the broad crawler. Amazon says it is used to improve products and services, provide more accurate information to customers, and may be used to train Amazon AI models. Amzn-SearchBot is used to improve search experiences in Amazon products and services; Amazon says allowing it can make content eligible to appear in experiences such as Alexa and Rufus, and that it does not crawl content for generative AI model training. Amzn-User supports user actions, such as responding to Alexa queries needing current information, and Amazon says it does not crawl content for generative AI training.

This turns Amazon into a discovery platform for more than ecommerce. A recipe site, product review site, local business page, support document or travel guide may be relevant to Alexa-style answers or Rufus-style product and shopping experiences. Blocking all Amazon agents may be sensible for some publishers. For others, the finer choice is to block Amazonbot for training while allowing Amzn-SearchBot where Amazon-driven discovery matters.

Amazon’s robots policy is unusually specific on caching. It says Amazon respects the Robots Exclusion Protocol, honors user-agent and allow/disallow directives, fetches host-level robots.txt files or uses a cached copy from the last 30 days, and behaves as if the file does not exist when it cannot fetch one. Amazon also says its crawlers respect link-level rel=nofollow, page-level robots meta tags of noarchive, noindex and none, and do not support crawl-delay.

That cached-copy window changes how teams should deploy robots updates. If a site makes a high-stakes Amazon policy change, the effect may not be instant across every system. The same is true across several platforms that cite 24-hour or multi-day propagation windows. Robots.txt is fast to edit. Crawler compliance is not always immediate.
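
For change planning, the two Amazon figures quoted above can be turned into dates. The sketch below is a rough planning aid, not a statement of how Amazon schedules fetches: `propagation_window` is a hypothetical helper, and treating the 30-day cached-copy window as a worst case is an interpretation of the documentation, not something it states.

```python
from datetime import datetime, timedelta, timezone

TYPICAL_DELAY = timedelta(hours=24)  # "about 24 hours" for changes to reflect
CACHE_WINDOW = timedelta(days=30)    # cached copy "from the last 30 days"


def propagation_window(deployed_at: datetime) -> tuple[datetime, datetime]:
    """Return (typical, conservative) times by which a change should be in effect."""
    return deployed_at + TYPICAL_DELAY, deployed_at + CACHE_WINDOW


deployed = datetime(2026, 3, 1, 12, 0, tzinfo=timezone.utc)
typical, conservative = propagation_window(deployed)
print(typical.isoformat())       # 2026-03-02T12:00:00+00:00
print(conservative.isoformat())  # 2026-03-31T12:00:00+00:00
```

Logging both dates next to a deploy gives a team a concrete point after which continued crawling of a newly blocked path deserves investigation.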

Applebot and Applebot-Extended split discovery from foundation model training

Apple’s Applebot documentation says Applebot powers search technology integrated into Spotlight, Siri and Safari, and that enabling Applebot in robots.txt lets website content appear in search results for Apple users. It also says content crawled by Applebot may be used to help train Apple foundation models for generative AI features across Apple products, including Apple Intelligence, Services and Developer Tools. Publishers can opt out from training use by disallowing Applebot-Extended.

The Apple model is similar in principle to Google-Extended. Applebot-Extended is not an ordinary page crawler. Apple says Applebot-Extended is a secondary user agent that gives publishers controls over how website content can be used by Apple, and that Applebot-Extended does not crawl webpages. Instead, it is used to determine how data crawled by Applebot may be used. Pages that disallow Applebot-Extended can still be included in search results.

That is a crucial distinction for SEO and AI governance. Blocking Applebot can reduce discoverability in Apple search surfaces. Blocking Applebot-Extended controls training use while leaving Applebot crawling in place, according to Apple’s documentation. For publishers serving mobile users, Safari, Siri and Spotlight visibility can matter even though Apple’s search surfaces do not show up in market-share reports the way Google and Bing do.

Apple also notes a special case: iTMS traffic may come from Applebot hosts and is identified with User-Agent: iTMS, but iTMS does not follow robots.txt because it is not a general search crawler and only crawls URLs tied to registered Apple Podcasts content. That caveat is another reminder that not every automated Apple request is governed by the same robot rules. Some product-specific fetchers operate under separate submission or registration workflows.

Applebot has one more unusual rule. Apple says that if robots instructions do not mention Applebot but mention Googlebot, Applebot will follow Googlebot instructions. That may help smaller sites that only write Google rules, but it can also surprise teams that intended Googlebot treatment to be Google-specific. Explicit Applebot groups are cleaner.
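
The fallback can be modeled in a few lines to see which group Applebot would end up reading. This is a sketch of the rule as described above, not Apple’s parser: `declared_tokens` and `applebot_effective_token` are hypothetical helpers, and the paths are placeholders.

```python
from urllib.robotparser import RobotFileParser


def declared_tokens(robots_txt: str) -> set[str]:
    """Collect the lowercased user-agent tokens named in the file."""
    tokens = set()
    for line in robots_txt.splitlines():
        field, _, value = line.partition(":")
        if field.strip().lower() == "user-agent":
            tokens.add(value.split("#")[0].strip().lower())
    return tokens


def applebot_effective_token(robots_txt: str) -> str:
    """Pick the group Applebot would follow: its own, else Googlebot's, else '*'."""
    tokens = declared_tokens(robots_txt)
    if "applebot" in tokens:
        return "Applebot"
    if "googlebot" in tokens:
        return "Googlebot"
    return "*"


ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /drafts/

User-agent: *
Disallow: /
"""

token = applebot_effective_token(ROBOTS_TXT)
parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())
print(token)                                                     # Googlebot
print(parser.can_fetch(token, "https://example.com/news/"))      # True
print(parser.can_fetch(token, "https://example.com/drafts/x"))   # False
```

In this file the wildcard group blocks everything, yet Applebot would inherit the far more permissive Googlebot group, which is exactly the surprise an explicit Applebot group avoids.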

Microsoft Bing still matters because other systems depend on Bing-scale indexing

Bingbot remains a conventional search crawler in the eyes of many SEO teams, but Bing’s importance extends through Microsoft products, syndication partners, AI features and search APIs. Microsoft’s Bing crawler documentation lists Bing crawlers such as Bingbot and AdIdxBot, with AdIdxBot used by Bing Ads for ad-related crawling and landing-page quality checks.

Bing’s own blog history says Bingbot would honor robots.txt directives written for the older msnbot, a reminder that crawler naming often evolves while compatibility rules persist. Bing also provides a robots.txt tester in Bing Webmaster Tools that checks whether a URL is allowed or blocked for Bingbot and BingAdsBot.

For publishers, the larger point is practical. Blocking Bingbot may affect more than traffic from Bing.com. Bing data and Microsoft search infrastructure can touch other discovery routes, including partner search products and AI-enabled search surfaces. Even where another AI system uses its own crawler, Bing visibility remains part of the broader answer-engine ecosystem because model-grounded products often draw from search indexes, search APIs or web-scale retrieval systems.

Bing also has a long relationship with Crawl-delay. A Bing Webmaster Blog post on Bingbot crawl-rate control notes that a crawl delay in robots.txt can override the direction set in Bing Webmaster Tools. This contrasts with platforms that ignore crawl-delay or do not support it. A single robots directive can mean different things to different crawler families.

Yandex supports robots.txt with advanced rules and exceptions

Yandex’s English Webmaster documentation describes robots.txt as a file that contains site indexing parameters for search engine robots. Yandex says the file can restrict indexing of website pages by bots and reduce site load, while also warning that pages restricted in robots.txt can still participate in Yandex search; removal from search should use noindex or HTTP header rules rather than blocking the bot from reading the page.

Yandex supports the Robots Exclusion Protocol with advanced features. Its documentation lists directives including User-agent, Disallow, Sitemap, Clean-param and Allow. It also says robots from other search engines and services may interpret directives differently.

The Yandex user-agent logic deserves attention. Yandex says its indexing bot checks records starting with User-agent: and looks for either the substring Yandex or *. If User-agent: Yandex is detected, the User-agent: * group is ignored. Specific robot directives, such as YandexBot, can be used for the main indexing bot.
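
The group selection described here can be sketched as a simple precedence check. `select_yandex_group` is a hypothetical helper, not Yandex code. Placing a robot-specific token such as YandexBot ahead of the generic Yandex group is an assumption based on a reading of Yandex’s documentation; the Yandex-over-wildcard rule comes from the text above.

```python
from typing import Iterable, Optional


def select_yandex_group(declared: Iterable[str], robot: str = "YandexBot") -> Optional[str]:
    """Return the declared token whose group the robot would use, or None."""
    by_lower = {token.lower(): token for token in declared}
    # Assumed precedence: specific robot, then "Yandex", then the wildcard group.
    for candidate in (robot, "Yandex", "*"):
        if candidate.lower() in by_lower:
            return by_lower[candidate.lower()]
    return None


print(select_yandex_group(["*", "Yandex"]))               # Yandex
print(select_yandex_group(["*", "Googlebot"]))            # *
print(select_yandex_group(["Yandex", "YandexBot", "*"]))  # YandexBot
print(select_yandex_group(["Googlebot"]))                 # None
```

The practical consequence is that rules placed only in the wildcard group stop applying to Yandex as soon as a Yandex group exists, so shared restrictions have to be repeated there.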

Yandex also documents exceptions. Some Yandex robots download documents for purposes other than indexing and may ignore general User-agent: * restrictions. Its server-log table lists whether specific robots take into account the general robots.txt rules. The main YandexBot indexing robot is marked as taking general rules into account, while several product or ad-related robots are marked differently.

That makes Yandex a strong example of a regional search platform where robots governance needs exact user-agent targeting. A rule for Yandex and a rule for YandexBot are not interchangeable if a site needs precision. If a site has Russian-language, Eastern European or CIS-market exposure, Yandex rules belong in the crawler inventory even when Google and Bing dominate internal SEO reporting.

Baidu treats robots.txt as a root-level crawler scope file

Baidu’s English robots page says search engines use spiders to visit sites automatically and get contents, and that before accessing a site, the spider checks whether a robots.txt file exists in the site root. If it exists, the spider works according to the instructions.

Baidu also states a point that many publishers misunderstand: if pages blocked by robots.txt are linked from other sites, they may still appear in Baidu search results, but Baidu says it will not crawl, index or show the content of pages blocked by robots.txt; the descriptions in results may come from other sites.

Baidu’s documentation defines User-agent, Disallow and Allow, includes examples for Baiduspider, and notes support for the wildcard characters * and $. It also says the order of Disallow and Allow records matters, because the robot acts on the first record that matches a URL.
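
Why record order matters is easiest to see with a first-match evaluator. The sketch below is a simplified stand-in, not Baidu’s parser: `first_match_allowed` is a hypothetical function, it matches plain path prefixes only, and the * and $ wildcards are not modeled.

```python
from typing import List, Tuple

Rule = Tuple[str, str]  # (directive, path prefix), kept in file order


def first_match_allowed(rules: List[Rule], path: str) -> bool:
    """Apply the first rule whose prefix matches; allow when nothing matches."""
    for directive, prefix in rules:
        if path.startswith(prefix):
            return directive.lower() == "allow"
    return True


ALLOW_FIRST = [("Allow", "/catalog/public/"), ("Disallow", "/catalog/")]
DISALLOW_FIRST = [("Disallow", "/catalog/"), ("Allow", "/catalog/public/")]

path = "/catalog/public/item.html"
print(first_match_allowed(ALLOW_FIRST, path))     # True
print(first_match_allowed(DISALLOW_FIRST, path))  # False
```

The same two records give opposite results depending on order. A parser that prefers the longest matching rule would allow the path in both cases, so a file tuned for one crawler family can behave differently for another.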

For international SEO teams, the Baidu lesson is simple. China-focused pages need Baiduspider rules reviewed separately from Google and Bing rules. A site can be technically perfect for Googlebot and still mishandle Baidu if encoding, host paths, wildcard behavior or Chinese-market subdomains are not reviewed. Baidu may not be central for every publisher, but for brands that need Chinese-language discoverability, Baiduspider is not a footnote.

DuckDuckGo separates organic search from DuckAssistBot

DuckDuckGo’s DuckAssistBot page says DuckAssistBot is a crawler for DuckDuckGo Search that crawls pages in real time for AI-assisted answers and prominently cites sources. DuckDuckGo says the data is not used to train AI models. It also says publishers can opt out by disallowing the DuckAssistBot user agent in robots.txt, with changes taking effect after 72 hours, and that opting out does not affect organic DuckDuckGo search rankings or whether websites appear in standard search results.

That makes DuckAssistBot a clean case of answer-engine access without training use. It is not a broad web search crawler in the same sense as Googlebot or Bingbot. It is tied to AI-assisted answers. The strategic choice is therefore separate from ordinary DuckDuckGo organic visibility.

A publisher that blocks DuckAssistBot may still remain in DuckDuckGo organic search, but it may remove itself from a real-time AI answer source pool. For publishers that care about citations and referral traffic from answer surfaces, allowing DuckAssistBot may be sensible. For publishers that see AI answer extraction as a substitute for visits, blocking may be a business decision.

The same pattern appears across more platforms: answer engines want permission to fetch content for answers, while publishers want proof that fetches bring attribution, traffic or licensing value. Robots.txt is the negotiation signal. It is not the negotiation itself.

Perplexity shows the trust problem around user-triggered AI fetches

Perplexity’s official crawler documentation says Perplexity collects data using crawlers and user agents that gather and index information from the internet, operating either automatically or in response to user requests. It lists PerplexityBot as the crawler designed to surface and link websites in Perplexity search results and says it is not used to crawl content for AI foundation models. It also lists Perplexity-User, which supports user actions and may visit a page when users ask Perplexity a question. Perplexity says that because a user requested the fetch, this fetcher generally ignores robots.txt rules.

Perplexity’s help center says PerplexityBot respects robots.txt and will not index full or partial text content from sites that disallow it, although Perplexity may still index the domain, headline and a brief factual summary. It also says Perplexity does not build foundation models and that PerplexityBot is not used for AI model pre-training.

The controversy is that Perplexity has also been accused of behavior that went beyond its declared crawler model. Cloudflare published a 2025 report alleging that Perplexity used stealth, undeclared crawlers to evade no-crawl directives, modified user agents, changed source ASNs and sometimes ignored or failed to fetch robots.txt. Cloudflare said it delisted Perplexity as a verified bot and added detection rules.

This dispute captures the fragile nature of robots.txt in the AI era. The file works only when crawler identity is honest, user-agent strings are stable, IP ranges are verifiable and operators accept the publisher’s signal. Perplexity’s official documentation presents one policy. Cloudflare’s research alleges conflicting behavior in observed traffic. Site owners cannot resolve that debate through syntax alone. They need logs, bot verification, WAF rules, rate limits and business rules.

Cloudflare’s report also states a best-practice principle that many publishers now use as a standard for every AI bot: be transparent, use a unique user agent, publish IP ranges or sign requests, define the purpose clearly, separate bots by activity and follow robots.txt. That is a useful checklist for evaluating every crawler, not only Perplexity.

Mistral’s crawler split follows the search and user-action pattern

Mistral AI’s crawler documentation says it employs web crawlers, robots and user agents to execute tasks for its products, either automatically or upon user request, and uses specific robots.txt tags to help webmasters manage how sites and content interact with AI.

The two listed agents have separate roles. MistralAI-User is for user actions in Le Chat; when users ask Le Chat a question, it may visit a web page to answer and include a source link. Mistral says it is not used for automatic web crawling and not used to crawl content for generative AI training. MistralAI-Index is for automated indexing, used to index content for Mistral AI’s search engine to help answer user questions in Le Chat. Mistral says content crawled by MistralAI-Index is not used for generative AI training.

Mistral is important because it shows how the European AI market is converging on the same crawler vocabulary. Search index, user action and training are separate uses. Even when a company says it is not using a crawler for training, the publisher still has to decide whether answer visibility is worth access.

For documentation-heavy sites, MistralAI-Index may be useful because technical users increasingly ask assistants for product, API and troubleshooting answers. For news, finance, health or legal publishers, the decision may depend on licensing, attribution and jurisdiction. Robots.txt can express consent at URL level, but it cannot price that consent.

Common Crawl and AI2 show the research and dataset side of robots.txt

Search and answer engines are not the only robots.txt processors that matter. Open datasets and research crawlers have become central to AI training pipelines. Common Crawl’s CCBot is one of the most important examples because Common Crawl data has been widely used in academic and commercial machine-learning work.

Common Crawl says CCBot identifies itself with a CCBot/2.0 user-agent string and provides a simple robots.txt block example using User-agent: CCBot and Disallow: /. It also warns that crawlers can falsely identify themselves as CCBot and recommends verification; CCBot now runs on dedicated IP ranges with reverse DNS and a JSON list.
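
Verification of this kind is usually done with a forward-confirmed reverse DNS check. The sketch below is a generic illustration, not Common Crawl’s published procedure: `is_verified_crawler` is a hypothetical function, the `.crawl.example.org` suffix and the 192.0.2.x addresses are placeholders, and the real hostname suffix or IP list should come from the operator’s documentation. The default lookups use Python’s `socket` module and cover IPv4 only; the demo swaps in fake lookups so it runs without network access.

```python
import socket
from typing import Callable, Iterable, List


def _reverse(ip: str) -> str:
    """PTR lookup: IP address to hostname."""
    return socket.gethostbyaddr(ip)[0]


def _forward(hostname: str) -> List[str]:
    """A-record lookup: hostname to IPv4 addresses."""
    return socket.gethostbyname_ex(hostname)[2]


def is_verified_crawler(
    ip: str,
    allowed_suffixes: Iterable[str],
    reverse: Callable[[str], str] = _reverse,
    forward: Callable[[str], List[str]] = _forward,
) -> bool:
    """True if ip reverse-resolves into an allowed domain and resolves back to itself."""
    try:
        hostname = reverse(ip).rstrip(".").lower()
    except OSError:
        return False
    if not any(hostname.endswith(suffix) for suffix in allowed_suffixes):
        return False
    try:
        return ip in forward(hostname)
    except OSError:
        return False


# Fake DNS data for a network-free demo; all names and addresses are placeholders.
fake_ptr = {"192.0.2.10": "node1.crawl.example.org", "192.0.2.66": "node1.crawl.example.org"}
fake_a = {"node1.crawl.example.org": ["192.0.2.10"]}
suffixes = [".crawl.example.org"]

print(is_verified_crawler("192.0.2.10", suffixes, fake_ptr.__getitem__, fake_a.__getitem__))  # True
print(is_verified_crawler("192.0.2.66", suffixes, fake_ptr.__getitem__, fake_a.__getitem__))  # False
```

The second address claims the crawler’s hostname in its PTR record but fails the forward lookup, which is the spoofing case the user-agent string alone cannot catch.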

AI2, the Allen Institute for AI, publishes a crawler notice for AI2 Bot. It says the AI2 Bot explores certain domains to find web content and that this content is used to train open language models. It lists the user-agent header Mozilla/5.0 (compatible) AI2Bot (+https://www.allenai.org/crawler) and says the user-agent string can be used to filter or reject traffic.

These crawlers are different from search crawlers because there is often no direct referral loop. A search crawler can send traffic. A training-data crawler may never send a visit back. That does not mean a publisher should always block it. Some institutions want their public knowledge represented in open models. Some open-source projects want their documentation widely usable. Some public-interest publishers may value research reuse. Commercial publishers often reach the opposite conclusion.

The policy split is not “AI good” or “AI bad.” It is whether the crawler creates reciprocal value for the site owner, and whether the use matches the site’s rights strategy.

Qwant, Seznam and Mojeek keep the independent search crawler model alive

Independent and regional search engines still process robots.txt in recognizably search-like ways. Qwant says it uses web crawlers to improve its index and that its user-agent strings always contain Qwantbot. It says the crawler respects robots rules and provides reverse DNS and IP-range verification guidance.

Seznam, the Czech search engine, says SeznamBot fully complies with the robots exclusion standard and reads the robots.txt file first when accessing a website, adjusting behavior according to the directives. It also says it may take days, and sometimes weeks for less frequently visited sites, for SeznamBot to recheck restrictions and update the index.

Mojeek says MojeekBot is the web crawler for the Mojeek search engine and obeys the Robot Exclusion Standard. Mojeek also says MojeekBot will obey the first record with a user-agent containing MojeekBot, or the first User-agent: * group if no MojeekBot group exists. It states that MojeekBot does not support the nonstandard crawl-delay directive at this time.

These engines do not get as much attention as Google, Bing or AI answer engines, but they show that robots.txt remains a universal publishing layer. A publisher with international visibility needs rules for local and independent search engines, not only for the largest crawler in its analytics dashboard. If a site serves Czech, French, privacy-search, independent-search or regional audiences, blocking every non-Google crawler can quietly reduce reach.

Meta, social bots and preview fetchers are part of the same access debate

Social crawlers are often misclassified because they do not behave like search crawlers. A Facebook, LinkedIn, Slack, X, Pinterest or messaging-preview bot may fetch a URL only when a user shares a link. The crawl may generate a card, title, image, description or preview. The visit may not create search visibility, but it controls how the page appears in social feeds and private messages.

Meta’s developer documentation lists user-agent strings and uses for Meta’s common web crawlers. Public bot directories also show Meta crawlers such as facebookexternalhit, FacebookBot, meta-externalagent, meta-externalfetcher, meta-externalads and meta-webindexer. Cloudflare Radar lists Meta-ExternalAgent as verified traffic with a meta-externalagent/1.1 user agent and a robots.txt example using User-Agent: meta-externalagent and Disallow: /.

The Meta case is complicated because social preview fetching, ads checks, AI indexing and external content collection can overlap in server logs. Publishers may want link previews to work on Facebook and Instagram but may not want AI-related external-agent access. That requires exact agent groups, not a blanket Meta block.

The same applies to LinkedInBot, Slackbot, Twitterbot, Pinterestbot and other preview fetchers. Blocking them is rarely an SEO decision. It is a distribution and presentation decision. A blocked preview bot can make a page look broken when a reader shares it, even if Google rankings are unaffected.

Brave, Yahoo, Naver, Sogou and other crawlers extend the map

The crawler map keeps expanding beyond the names that dominate policy debates. Brave Search has its own crawler, Bravebot. Cloudflare Radar lists Bravebot as a verified AI Search crawler operated by Brave Software and shows its user-agent string plus a standard robots.txt block example.

Yahoo’s Slurp has historically crawled and indexed pages for Yahoo Search, though Yahoo search results have also depended on partners at different times. Yahoo’s help documentation describes Slurp as the Yahoo Search robot for crawling and indexing web page information. Naver’s Yeti, Sogou’s spider, PetalBot, 360Spider, Daum, CocCocBot, Mail.Ru-related crawlers and many country-specific or product-specific bots may appear in logs depending on market exposure.

Not all of these crawlers deserve the same priority. A Slovak publisher focused on Central Europe may care more about Googlebot, Bingbot, SeznamBot and Applebot than Baiduspider or Sogou. A multinational ecommerce site may need every major search, shopping and AI crawler in a managed allowlist. A private SaaS documentation site may block nearly everything except a few search and answer engines.

The right map is not the longest map. It is the map that matches the site’s audience, licensing model, infrastructure tolerance and commercial goals.

The crawler list now divides into seven practical categories

The market is messy, but the categories are becoming clear. Every robots.txt governance plan should classify crawlers by function before deciding to allow or block.

Major crawlers and robots.txt controls

| Category | Common user-agent tokens | Main purpose | Publisher decision |
| --- | --- | --- | --- |
| Search indexing | Googlebot, Bingbot, Baiduspider, YandexBot, SeznamBot, Qwantbot, MojeekBot, Bravebot | Discover and index pages for search results | Usually allow public, indexable content |
| AI search indexing | OAI-SearchBot, Claude-SearchBot, PerplexityBot, MistralAI-Index, Amzn-SearchBot | Make pages eligible for AI search or answer retrieval | Allow when citation and discovery matter |
| AI training | GPTBot, ClaudeBot, Amazonbot, AI2Bot, CCBot, Applebot-Extended, Google-Extended | Collect or control use of content for model development | Decide by rights, licensing and brand strategy |
| User-triggered fetches | ChatGPT-User, Claude-User, Perplexity-User, MistralAI-User, Amzn-User | Fetch pages in response to a user action | Treat separately from bulk crawling |
| Product and shopping | Storebot-Google, Amzn-SearchBot, product-specific agents | Support shopping, product answers or assistant surfaces | Allow where product visibility matters |
| Ads and validation | AdsBot-Google, AdIdxBot, OAI-AdsBot, meta-externalads | Check landing pages, ads quality or policy compliance | Usually allow for active ad programs |
| Social previews and sharing | facebookexternalhit, FacebookBot, LinkedInBot, Slackbot, Twitterbot, Pinterestbot | Build link previews and social cards | Allow if sharing appearance matters |

This table is compact by design. The real operational sheet should add verification domains, IP JSON feeds, rate limits, last-seen dates, business owners and escalation contacts for every user agent that appears in logs.
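
One way to hold that operational sheet is a small record type. The sketch below is an assumption about how such a sheet might be structured: the `CrawlerRecord` fields mirror the items listed above, while the `stale_entries` helper, the 90-day threshold and the sample dates are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass
class CrawlerRecord:
    user_agent: str
    category: str
    decision: str  # "allow" or "block"
    verification_domains: list[str] = field(default_factory=list)
    ip_feed_url: Optional[str] = None
    rate_limit_per_minute: Optional[int] = None
    last_seen: Optional[date] = None
    business_owner: str = ""
    escalation_contact: str = ""


def stale_entries(records, today, max_age_days=90):
    """Return user agents never seen in logs, or not seen within max_age_days."""
    return [
        r.user_agent
        for r in records
        if r.last_seen is None or (today - r.last_seen).days > max_age_days
    ]


records = [
    CrawlerRecord("YandexBot", "search", "allow",
                  verification_domains=["yandex.ru", "yandex.net", "yandex.com"],
                  last_seen=date(2026, 1, 10), business_owner="SEO"),
    CrawlerRecord("GPTBot", "training", "block", business_owner="Legal"),
]
print(stale_entries(records, today=date(2026, 2, 1)))  # ['GPTBot']
```

A stale entry is not automatically a dead one; a crawler that rarely appears may still deserve a rule, so the output is a review list rather than a delete list.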

Robots.txt is a traffic policy, not a content security system

The most dangerous robots.txt mistake is using it as a hiding place. Because robots.txt is public, a disallowed path can advertise sensitive locations to anyone who reads the file. The protocol does not require authentication. A malicious bot can ignore it. A scraper can spoof another user agent. A curious person can open the file in a browser.

Sensitive data belongs behind authentication, authorization, network rules or signed access, not behind Disallow:. Robots.txt can keep cooperative crawlers out of staging folders, cart paths, internal search result pages, low-value parameters and duplicate archives, and it can signal that content should stay out of AI training datasets. It cannot protect customer data, confidential PDFs, pre-release product pages or paid content by itself.

This matters for AI because many teams now add long blocklists of AI crawlers and assume the job is done. It is not. A robots block tells named cooperative crawlers not to fetch. It does not remove already crawled data. It does not stop another crawler using another identity. It does not create a license agreement. It does not guarantee that quoted snippets, headlines or factual summaries disappear from every downstream system. Perplexity’s own help page, for example, says blocked pages may still have their domain, headline and a brief factual summary indexed.

Good robots governance is therefore layered. Use robots.txt for cooperative crawler policy. Use meta robots and X-Robots-Tag for index-level control where a crawler must read the page to see the rule. Use canonical tags and parameter handling for duplication. Use WAF and rate limits for abusive traffic. Use authentication for anything private. Use licensing and contracts for commercial reuse.

The new allow or block decision is commercial, not purely technical

A robots.txt file used to be owned by SEO and engineering. It now needs input from legal, editorial, product, data, partnerships and infrastructure. The file can affect search traffic, AI citations, model-training consent, ad validation, product listings, social sharing and server load.

An editorial publisher may decide to allow Googlebot, Bingbot, Applebot, DuckAssistBot, OAI-SearchBot and Claude-SearchBot, while blocking GPTBot, ClaudeBot, CCBot, AI2Bot and Google-Extended. A software company may allow training crawlers because broad AI knowledge of its public APIs reduces support pressure and improves developer answers. A database publisher may block nearly all AI crawlers because licensing is the core business. A government agency may allow broad crawling for public information but block duplicate archives and internal search paths.

There is no universal robots.txt file for the AI era. There are only policy choices. The same user agent can be good for one business and harmful for another. The same company can run one bot worth allowing and another worth blocking.
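
To make the policy choice concrete, the editorial-publisher split described above can be written as data and rendered into a file. This is a sketch under stated assumptions: `POLICY` and `render_robots` are hypothetical names, the two-group layout is one possible shape, and Python's `urllib.robotparser` is used only as a quick sanity check.

```python
from urllib.robotparser import RobotFileParser

# The editorial-publisher split described above; adjust to your own policy.
POLICY = {
    "block": ["GPTBot", "ClaudeBot", "CCBot", "AI2Bot", "Google-Extended"],
    "allow": ["Googlebot", "Bingbot", "Applebot", "DuckAssistBot",
              "OAI-SearchBot", "Claude-SearchBot"],
}


def render_robots(policy):
    """Render one blocked group and one allowed group as robots.txt text."""
    # Several User-agent lines may share a single group of rules.
    lines = [f"User-agent: {agent}" for agent in policy["block"]]
    lines += ["Disallow: /", ""]
    lines += [f"User-agent: {agent}" for agent in policy["allow"]]
    lines += ["Allow: /", ""]
    return "\n".join(lines)


robots_txt = render_robots(POLICY)
parser = RobotFileParser()
parser.parse(robots_txt.splitlines())
for agent in ("GPTBot", "OAI-SearchBot"):
    print(agent, parser.can_fetch(agent, "/news/story"))
# GPTBot False
# OAI-SearchBot True
```

Keeping the policy as data means a change request touches a reviewed list rather than a hand-edited file.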

A practical allow or block matrix for publishers

| Site goal | Usually allow | Usually restrict | Risk to monitor |
| --- | --- | --- | --- |
| Maximize search traffic | Googlebot, Bingbot, Applebot, Baiduspider, YandexBot where relevant | Internal search, carts, duplicate parameters | Accidental deindexing |
| Appear in AI answers | OAI-SearchBot, Claude-SearchBot, PerplexityBot, DuckAssistBot, MistralAI-Index | Training-only bots if rights are sensitive | Answer extraction without visits |
| Protect paid editorial IP | Search crawlers for public pages | GPTBot, ClaudeBot, CCBot, AI2Bot, Google-Extended, Applebot-Extended | Loss of AI visibility |
| Reduce server load | High-value search bots | Aggressive or low-value crawlers, duplicate URL paths | Blocking useful discovery |
| Support ecommerce discovery | Googlebot, Storebot-Google, Bingbot, Amzn-SearchBot | Checkout, cart, account, filters | Product pages missing from surfaces |
| Keep social sharing clean | Preview fetchers and social bots | Private paths and unneeded media | Broken link cards |

The matrix should be reviewed after log analysis, not copied blindly. The same category may need a different rule for news, ecommerce, SaaS documentation, public-sector content, forums, marketplaces and research libraries.

User-triggered fetchers are the unresolved frontier

The largest grey zone is the user-triggered fetch. ChatGPT-User, Claude-User, Perplexity-User, MistralAI-User and Amzn-User exist because a person asks an assistant to open or use a web page. Operators often distinguish these from automatic crawling. Some say robots.txt may not apply. Some give publishers a way to control them anyway. Some use published IP ranges. Some requests may look like a browser rather than a classic bot.

From a publisher’s point of view, the distinction can feel artificial. The server sees automated traffic. The page may be summarized, transformed or quoted. The user may never click through. The assistant may serve as the interface between reader and source. Whether the fetch began with a human prompt does not erase the commercial impact.

From the platform’s point of view, a user-triggered fetch can be compared to a browser retrieving a page for a user. If a person is allowed to open a public URL, the assistant may claim it is acting as the user’s agent. That reasoning is not settled across law, platform policy or publisher expectations.

Robots.txt was not designed to carry all of that nuance. The industry is trying to stretch a 1990s access convention into a 2020s AI-use policy. It works well enough for polite bulk crawlers. It strains under live agents, browser automation, tool use and mixed human-machine retrieval.

This is why new signals such as content credentials, content usage policies, signed bot authentication, Web Bot Auth and AI-specific controls keep appearing in technical and policy discussions. Robots.txt will remain central because it is simple, deployed and known. It will not be enough by itself.

Verification now matters as much as syntax

A user-agent string is easy to spoof. Google warns that HTTP user-agent strings can be spoofed and points site owners to crawler verification. Yandex gives step-by-step reverse DNS checks and says genuine Yandex robots end in yandex.ru, yandex.net or yandex.com, with forward DNS confirmation to catch fake hostnames. Common Crawl warns that crawlers can falsely identify as CCBot and provides reverse DNS and IP-range verification. Qwant, Mojeek and Apple also publish DNS or IP guidance for crawler verification.

A modern robots workflow should pair every important user-agent rule with verification. That means:

  • reverse DNS and forward DNS checks for major search crawlers;
  • published IP JSON feeds where available;
  • request-signing or bot-auth mechanisms where supported;
  • CDN or WAF rules that combine verified identity with user-agent matching;
  • monitoring for traffic that claims a known user agent but comes from unrelated infrastructure.

The point is not paranoia. It is accuracy. A fake Googlebot can hit a site while the real Googlebot is allowed. A stealth scraper can claim to be a browser. A misconfigured internal crawler can accidentally trigger block rules. A CDN rule that blocks by user-agent alone can be bypassed by any actor that changes the string.

Robots.txt tells honest bots what to do. Verification tells the site whether the bot is honest.
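
The reverse-then-forward DNS check can be sketched with Python's `socket` module. The Yandex suffixes come from the guidance cited above; the `verify_crawler_ip` function and its injectable resolver parameters are assumptions of this sketch, and `socket.gethostbyname_ex` covers IPv4 only, so an IPv6 deployment would need a different forward lookup.

```python
import socket

# Suffixes Yandex documents for its genuine robots (leading dot avoids
# matching look-alike names such as "notyandex.ru").
YANDEX_SUFFIXES = (".yandex.ru", ".yandex.net", ".yandex.com")


def verify_crawler_ip(ip, suffixes,
                      reverse=socket.gethostbyaddr,
                      forward=socket.gethostbyname_ex):
    """Return True if `ip` reverse-resolves to an allowed hostname suffix
    and that hostname forward-resolves back to the same `ip`."""
    try:
        hostname = reverse(ip)[0]
    except OSError:
        return False
    hostname = hostname.rstrip(".").lower()
    if not hostname.endswith(suffixes):
        return False
    try:
        addresses = forward(hostname)[2]
    except OSError:
        return False
    return ip in addresses


# Example call against live DNS (not executed here):
# verify_crawler_ip("203.0.113.5", YANDEX_SUFFIXES)
```

The resolver parameters exist so the check can be exercised with fake lookups in tests, and so results can be cached in production instead of resolving on every request.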

Crawl-delay is useful only when the crawler supports it

Crawl-delay is one of the oldest sources of false confidence. Site owners add it to reduce server load, then assume all bots will slow down. They will not. RFC 9309 does not make crawl-delay a universal rule. Support is crawler-specific.

Anthropic says it supports the nonstandard Crawl-delay extension to limit crawling activity. Bing has long documented crawl-delay behavior and crawl-rate settings through Bing Webmaster Tools. Seznam supports a Request-rate directive that lets publishers specify how many documents SeznamBot can download in a period. Amazon says its crawlers do not support the crawl-delay directive. Mojeek says MojeekBot does not support crawl-delay at this time.

The safer operational approach is to use crawl-delay where supported but not depend on it for load protection. Server-side rate limiting, CDN rules, 429 Too Many Requests, Retry-After headers, cache design, sitemap hygiene and URL-parameter control matter more.
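
A minimal sketch of server-side throttling that answers with 429 and a Retry-After value, assuming a sliding-window limit keyed by crawler identity. The `CrawlerRateLimiter` class, its parameters and the injectable clock are hypothetical; a production setup would normally enforce this at the CDN or WAF rather than in application code.

```python
import math
import time
from collections import defaultdict, deque


class CrawlerRateLimiter:
    """Sliding-window limiter: at most max_requests per window_seconds per key."""

    def __init__(self, max_requests, window_seconds, clock=time.monotonic):
        self.max_requests = max_requests
        self.window = window_seconds
        self.clock = clock
        self.hits = defaultdict(deque)

    def check(self, client_key):
        """Return (status_code, headers) for one request from client_key."""
        now = self.clock()
        hits = self.hits[client_key]
        # Drop timestamps that have left the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.max_requests:
            # Seconds until the oldest recorded hit expires.
            retry_after = max(1, math.ceil(self.window - (now - hits[0])))
            return 429, {"Retry-After": str(retry_after)}
        hits.append(now)
        return 200, {}
```

Keying the limiter on a verified identity (for example, user agent plus a confirmed IP range) rather than the raw user-agent string keeps a spoofed name from consuming a legitimate crawler's budget.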

High-volume crawler waste often starts with bad URL architecture. Infinite calendars, faceted navigation, internal search pages, tracking parameters, sort orders and duplicate archives can trap bots. A clean robots file can block the worst traps. A clean site structure reduces the need for emergency crawler throttling.

AI search rewards accessibility but raises extraction risk

Allowing AI search crawlers can increase the chance that pages are cited in answer engines. Blocking them can protect against extraction, but it may also remove the site from a growing discovery channel. The trade-off is different from the Google era because AI answers may satisfy the user without a click.

OpenAI says sites opted out of OAI-SearchBot will not be shown in ChatGPT search answers, though they can still appear as navigational links. Anthropic says disabling Claude-SearchBot may reduce visibility and accuracy in user search results. Perplexity recommends allowing PerplexityBot for appearance in Perplexity search results. DuckDuckGo says DuckAssistBot is for AI-assisted answers and that opt-out does not affect ordinary organic rankings. MistralAI-Index is used to index content for Mistral’s search engine to help answer user questions in Le Chat.

This creates a new SEO discipline: answer-engine access control. The old goal was to be crawled, indexed and ranked. The new goal is to be crawled for the right surfaces, cited with the right attribution, and not used for the wrong downstream purpose.

A publisher may allow AI search crawlers for public evergreen explainers but block them from premium analysis. A SaaS company may allow AI answer crawlers for docs but block customer forums. An ecommerce site may allow product pages but block cart, checkout, account, search and recommendation endpoints. A university may allow public research pages but block student systems, PDFs with licensing restrictions or old course archives.

Robots.txt can apply path-level rules, which makes partial strategies possible. The hard part is maintaining them as products, URLs and crawler policies change.

Training crawlers force a rights decision

Training crawlers are the most sensitive category because they may copy content into datasets that shape model behavior without producing a visible referral path. OpenAI’s GPTBot, Anthropic’s ClaudeBot, Amazonbot, AI2Bot, CCBot, Google-Extended and Applebot-Extended are not identical, but they all sit in the rights-and-reuse conversation.

OpenAI says disallowing GPTBot indicates site content should not be used in training OpenAI’s generative AI foundation models. Anthropic says restricting ClaudeBot signals that future site materials should be excluded from model-training datasets. Amazon says Amazonbot may be used to train Amazon AI models. AI2 says AI2 Bot finds web content used to train open language models. Common Crawl provides a robots.txt block path for CCBot, whose data can feed research and machine-learning work. Google-Extended and Applebot-Extended work as usage controls tied to AI model training or grounding, not as ordinary crawlers.

The rights decision is rarely one-size-fits-all. A news publisher may block training to preserve licensing value. A standards body may allow training to spread accurate technical knowledge. A brand may allow training for public marketing pages but block training for research, paid reports or community content. A developer platform may prefer that assistants know its APIs, because bad AI answers increase support costs.

The practical mistake is to make this decision accidentally. Many sites still have User-agent: * Allow: / and no AI-specific groups. That may be fine. It may also mean the site never made a policy choice. In 2026, silence in robots.txt is itself a policy.

Search removal still requires more than robots.txt

Robots.txt blocks crawling. It does not always remove a URL from search results. Google, Baidu and Yandex all explain versions of this issue: a blocked URL can still be known through links, and if a crawler cannot fetch the page, it cannot read page-level removal instructions. Baidu says blocked pages may still appear if linked by other sites, though content from blocked pages is not crawled or shown. Yandex warns that pages restricted in robots.txt can participate in search and says removal should use noindex in HTML or an HTTP header; if the bot cannot crawl the page, it cannot detect the instruction.

The technical rule is counterintuitive but vital. To remove a page from an index through noindex, the crawler must be allowed to crawl the page and see the noindex. If robots.txt blocks it first, the crawler may never see the removal directive. For public pages that should disappear from search, robots.txt is often the wrong first tool.
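
The conflict is easy to detect mechanically. The sketch below assumes the site can supply a mapping of URLs to whether each page carries a noindex directive; `noindex_conflicts` is a hypothetical helper, and Python's `urllib.robotparser` uses simple first-match prefix rules, so results may differ from a specific search engine's parser.

```python
from urllib.robotparser import RobotFileParser


def noindex_conflicts(robots_txt, agent, pages):
    """Return URLs whose noindex directive the crawler can never see.

    `pages` maps URL -> True if the page carries a noindex directive.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [
        url
        for url, has_noindex in pages.items()
        if has_noindex and not parser.can_fetch(agent, url)
    ]


robots_txt = "User-agent: *\nDisallow: /old/\n"
pages = {
    "https://example.com/old/page": True,    # noindex, but blocked from crawling
    "https://example.com/new/page": True,    # noindex and crawlable: fine
    "https://example.com/old/other": False,  # blocked, no noindex: no conflict
}
print(noindex_conflicts(robots_txt, "Googlebot", pages))
# ['https://example.com/old/page']
```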

Robots.txt is better for avoiding crawl waste, keeping crawlers out of low-value areas and expressing crawler-specific access preferences. Meta robots and X-Robots-Tag are better for indexing decisions on crawlable pages. HTTP authentication, deletion, canonicalization and search-console removal tools each have their place.

Crawl control and index control are related, but they are not the same job.

A publisher’s robots.txt file should now be audited like legal infrastructure

Because robots.txt now expresses consent, access and commercial strategy, it deserves a formal audit cycle. The audit should not be limited to “Does Googlebot pass?” It should test at least six layers.

First, confirm file availability. Every host that serves public content should have the intended robots file at the root. HTTP, HTTPS, subdomains, CDNs and localized domains should be checked separately.

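Second, confirm syntax. The file should parse cleanly in the robots.txt report in Google Search Console, the robots.txt tester in Bing Webmaster Tools and any market-specific tools such as Yandex Webmaster where relevant. Large files should stay below the size limits used by major crawlers; Google, for example, ignores content beyond 500 KiB.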

Third, classify user agents. Each named bot should have a reason for being listed. Dead entries should be removed. New AI crawlers should be evaluated before being added to blocklists or allowlists.

Fourth, test business paths. Public articles, product pages, docs, author pages, category pages, APIs, PDFs, images and feeds should be checked for the crawler categories that matter.

Fifth, inspect logs. Robots.txt is policy. Logs are behavior. The audit should compare expected crawler behavior with real visits, status codes, request rates, blocked paths and suspicious identities.

Sixth, assign owners. SEO may own Googlebot, but legal may own training crawlers, product may own AI search visibility, performance engineering may own rate limits, and paid media may own ads bots. A file without governance will drift.

The organizations that treat robots.txt as a shared policy artifact will make better AI-era access decisions than organizations that leave it as an old SEO template.
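
The first two audit layers lend themselves to a small script. This is a sketch, not a complete audit: `robots_urls` and `audit` are hypothetical helpers, `fetch` is an assumed callable that returns a status code and body for a URL (and follows redirects if that is the desired behavior), and the 500 KiB threshold reflects Google's documented limit rather than a universal one.

```python
from urllib.parse import urlunsplit

MAX_ROBOTS_BYTES = 500 * 1024  # Google's documented limit; other crawlers vary


def robots_urls(hosts, schemes=("https", "http")):
    """Build the robots.txt URL for every host and scheme combination."""
    return [
        urlunsplit((scheme, host, "/robots.txt", "", ""))
        for host in hosts
        for scheme in schemes
    ]


def audit(hosts, fetch):
    """Check availability and size. `fetch(url)` returns (status, body_bytes)."""
    problems = []
    for url in robots_urls(hosts):
        status, body = fetch(url)
        if status != 200:
            problems.append((url, f"status {status}"))
        elif len(body) > MAX_ROBOTS_BYTES:
            problems.append((url, "larger than 500 KiB"))
    return problems
```

Passing the fetcher in as a parameter keeps the audit testable offline and lets each team plug in its own HTTP client, timeouts and redirect policy.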

Blocklists are growing, but they can create hidden costs

Many publishers have responded to AI crawling by adding long lists of user agents to block. The impulse is understandable. AI crawlers can create server load, copy content without clear value, and reduce control over downstream use. Yet blocklists can also create hidden damage.

A broad User-agent: * Disallow: / blocks more than AI training. It can block search visibility, social previews, product feeds, archiving, accessibility checks and ad validation. A long AI-specific blocklist can catch search-oriented AI bots that might have sent citations. A rule copied from another publisher can block a crawler relevant to one region but irrelevant to another. A misspelled user-agent token can create a false sense of protection.
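
Misspelled tokens can be caught before deployment with a fuzzy comparison against a reviewed list. The sketch below uses Python's `difflib`; the `KNOWN_TOKENS` list is a partial sample taken from the agents named in this article, and the `suspicious_tokens` helper and its 0.75 similarity cutoff are assumptions to tune against your own inventory.

```python
import difflib

# Partial sample of tokens named in this article; maintain your own reviewed list.
KNOWN_TOKENS = [
    "GPTBot", "OAI-SearchBot", "ChatGPT-User", "ClaudeBot", "Claude-SearchBot",
    "Claude-User", "PerplexityBot", "CCBot", "AI2Bot", "Google-Extended",
    "Googlebot", "Bingbot", "Applebot", "Applebot-Extended",
]


def suspicious_tokens(robots_txt, known=KNOWN_TOKENS, cutoff=0.75):
    """Map each unrecognized User-agent token to its closest known token (or None)."""
    known_lower = {token.lower(): token for token in known}
    findings = {}
    for line in robots_txt.splitlines():
        key, _, value = line.partition(":")
        if key.strip().lower() != "user-agent":
            continue
        token = value.split("#")[0].strip()  # drop trailing comments
        if token == "*" or token.lower() in known_lower:
            continue
        close = difflib.get_close_matches(
            token.lower(), list(known_lower), n=1, cutoff=cutoff
        )
        findings[token] = known_lower[close[0]] if close else None
    return findings


sample = "User-agent: GTPBot\nDisallow: /\n\nUser-agent: ClaudeBot\nDisallow: /\n"
print(suspicious_tokens(sample))  # {'GTPBot': 'GPTBot'}
```

A `None` result is not proof of an error; it may simply be a legitimate crawler that has not yet been added to the reviewed list.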

The most common hidden cost is blocking AI search while trying to block AI training. A site that blocks OAI-SearchBot, Claude-SearchBot, PerplexityBot, MistralAI-Index and DuckAssistBot may reduce its appearance in AI answer products. That may be the right choice for a publisher negotiating licenses. It may be a bad choice for a B2B brand trying to be cited in buying research.

A good robots policy is not the longest blocklist. It is the clearest set of permissions tied to business goals.

Server logs reveal which crawlers deserve attention

Theoretical crawler lists are useful, but server logs decide priority. A Slovak ecommerce site may discover heavy Googlebot, Bingbot, SeznamBot and Meta preview traffic, with little Baidu or Sogou. A developer documentation site may see GPTBot, ClaudeBot, OAI-SearchBot, PerplexityBot, CCBot, AI2Bot and MistralAI-User. A news site may see aggressive bot traffic from AI agents, social previews, content syndicators and scrapers.

Log analysis should answer concrete questions:

  • Which user agents requested /robots.txt?
  • Which crawlers ignored disallowed paths?
  • Which agents create the most bandwidth and CPU load?
  • Which verified crawlers send referral traffic or visible citations?
  • Which user agents appear to be spoofed?
  • Which blocked bots keep retrying?
  • Which AI search crawlers access pages that later appear in answer referrals?
  • Which training crawlers revisit content after disallow rules change?

The answer may surprise teams. A crawler that looks risky may send useful traffic. A crawler that looks legitimate may be spoofed. A bot that obeys robots.txt may still create too much load if URL traps are open. A crawler that rarely appears may still deserve a rule because its downstream use is sensitive.

Logs also expose propagation delays. OpenAI and Amazon mention roughly 24-hour adjustment windows for some crawler settings. DuckDuckGo says DuckAssistBot opt-out takes effect after 72 hours. Seznam says rechecking restrictions may take days or weeks for less frequently visited sites. A robots update should be monitored after deployment, not assumed complete.

Ecommerce sites need separate rules for search, shopping and assistants

Ecommerce robots strategy is harder than it looks. Product pages, category pages, reviews, images, pricing and availability may be valuable to search engines and shopping assistants. Cart, checkout, account, internal search, filters, sort orders, tracking parameters and recommendation endpoints are often crawl waste or privacy-sensitive.

Google’s Storebot-Google affects Google Shopping surfaces, while ordinary Googlebot affects Google Search and related features. Amazon’s Amzn-SearchBot can make content eligible for Amazon search experiences such as Alexa and Rufus, while Amazonbot may be used for broader product improvement and model training. Bing and ad crawlers may inspect landing pages for paid search. Meta preview and ad crawlers affect social sharing and campaigns.

The ecommerce rule set should start by protecting transactional paths. /cart/, /checkout/, /account/, /login/, /order/, /payment/, /compare/, internal search URLs and session parameters should usually be disallowed or canonicalized. Product and category pages should usually remain crawlable for the search and shopping agents that matter.

AI agents introduce a second layer. Product descriptions, specifications, reviews and FAQs may surface in AI shopping answers. Blocking every AI search crawler can reduce product discovery in answer engines. Allowing every training crawler may feed proprietary descriptions, reviews and merchandising content into models. The compromise is path-based: allow public product pages to selected AI search crawlers, block training agents where rights matter, and keep transactional or personalized paths closed.
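
A path-based compromise is only as good as its tests. The sketch below pairs an illustrative robots.txt with a table of expectations and reports any mismatch; the file contents, product paths and `failed_expectations` helper are assumptions, and Python's `urllib.robotparser` approximates rather than reproduces each crawler's own matching.

```python
from urllib.robotparser import RobotFileParser

# Illustrative policy: training agent blocked, transactional paths closed to all.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /cart/
Disallow: /checkout/
Disallow: /account/
Disallow: /login/
"""

# (user agent, path, expected to be fetchable)
EXPECTATIONS = [
    ("Googlebot", "/products/blue-shoe", True),
    ("OAI-SearchBot", "/products/blue-shoe", True),
    ("GPTBot", "/products/blue-shoe", False),
    ("Googlebot", "/checkout/step-1", False),
    ("OAI-SearchBot", "/cart/", False),
]


def failed_expectations(robots_txt, expectations):
    """Return the (agent, path) pairs whose result differs from the expectation."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [
        (agent, path)
        for agent, path, want in expectations
        if parser.can_fetch(agent, path) != want
    ]


print(failed_expectations(ROBOTS_TXT, EXPECTATIONS))  # []
```

Running the same table against every proposed change turns the merchandising intent into a regression test.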

For ecommerce, robots.txt is now part of merchandising. It decides which machines can learn that a product exists.

News publishers face the hardest trade-off

News publishers sit at the center of the robots.txt conflict because their content is current, expensive to produce, legally protected and highly attractive to AI systems. A search crawler can send traffic. An AI answer engine may summarize the article. A training crawler may absorb reporting without returning a visit. A live user fetch may quote a page in a chat. A social bot may generate a preview that drives readers to the article.

The old web bargain gave publishers enough search traffic to justify broad crawling. The AI bargain is less settled. If an answer engine quotes enough of a story, the user may not click. If a model trains on journalism, the value may be indirect or invisible. If a publisher blocks all AI crawlers, it may lose visibility in the interfaces where readers increasingly ask questions.

That is why many news organizations now split rules by use case. Public homepage and article pages may remain open to Googlebot and Bingbot. AI training bots may be blocked. AI search bots may be allowed only for public free articles, or blocked pending licensing. Paywalled content should be protected by authentication and structured paywall signals, not just robots.txt. Archive pages, tag pages and internal search should be controlled to avoid crawl waste.

A news robots policy is also an editorial policy. It answers: who may use the newsroom’s work, for what surface, under what expectation of attribution and traffic? Robots.txt cannot enforce all of that, but it is the first public signal.

Technical documentation may benefit from selective AI access

Software documentation, standards pages, API references and troubleshooting guides face a different calculus. Users increasingly ask AI assistants how to use a library, fix an error, configure a service or compare an API. If the assistant cannot access current docs, it may hallucinate or rely on outdated examples. That can increase support tickets and damage trust.

For docs sites, allowing AI search and user-triggered fetchers may be valuable. OAI-SearchBot, Claude-SearchBot, PerplexityBot, MistralAI-Index, DuckAssistBot and similar crawlers can help current documentation appear in answer systems. User-triggered fetchers such as ChatGPT-User, Claude-User and MistralAI-User can retrieve pages when a developer asks about a specific URL or feature.

Training access is more complicated. Some documentation owners want models trained on their public docs so generated code and support answers improve. Others worry about outdated examples being baked into models, license conflicts, or AI tools replacing visits to the docs. Path and version rules can help: allow current docs to search crawlers, block deprecated docs from broad indexing, and use clear canonical and noindex rules for old versions.

For documentation, being absent from AI answers can be worse than being summarized. The risk is not only lost traffic; it is wrong answers about your product.

Public-sector and educational sites should prioritize access with safeguards

Government, university and public-interest websites often have a mission to make information discoverable. Blocking all AI and search crawlers may undermine public access. Yet these sites also host sensitive PDFs, outdated documents, personal data, policy drafts, student systems and internal search paths.

The right posture is usually broad access to public information with strict protection for private systems. Googlebot, Bingbot, Applebot, DuckDuckGo, Qwant, Seznam, Baidu or Yandex may matter depending on audience. AI search crawlers may help citizens or students find answers in conversational tools. Training crawlers may be allowed for public knowledge pages or blocked for rights-managed materials.

Accessibility and translation also matter. Some crawlers check pages for accessibility, mobile rendering, snippets or previews. Blocking every non-search bot can weaken those features. Yandex’s documentation, for example, lists product-specific robots such as accessibility and advertising robots with different treatment of general robots rules.

The public-sector mistake is to treat robots.txt as privacy. The correct pattern is to remove private data from public web roots, require authentication for systems, then use robots.txt to keep cooperative crawlers focused on public pages worth discovering.

Regional strategy changes the crawler priority list

A crawler that is irrelevant in one market may be central in another. Baidu matters for Chinese-language discovery. Yandex matters for Russian-language and some regional search behavior. Seznam matters in Czech search. Qwant matters for a privacy-oriented European audience. Naver’s Yeti matters in Korea. Sogou and 360Spider matter in parts of China. Mojeek and Brave matter to independent-search users. DuckDuckGo matters in privacy search and AI-assisted answers.

International sites often make the same mistake: they write a robots.txt file for Google and assume global coverage. That can be costly. Baidu has its own examples and matching behavior. Yandex supports advanced directives such as Clean-param and has exact user-agent handling. Seznam supports its own rate-related extension. Qwant publishes Qwantbot verification. Mojeek has its own user-agent matching behavior.

The crawler policy should follow the market map. If a site has localized domains, each domain should have a localized crawler policy. A Chinese subdomain should be tested for Baiduspider. A Czech site should account for SeznamBot. A Russian-language site should account for YandexBot. A European public site may want Qwantbot and MojeekBot. A global English-language site may care more about AI search and assistants.

Clean URL architecture is still the best crawler control

Robots.txt is often asked to fix problems that URL architecture created. Infinite crawl spaces are the classic example. Faceted navigation generates thousands of parameter combinations. Internal search pages create low-value paths. Tracking parameters duplicate every page. Calendars stretch years into the future. Sort orders and filters multiply product pages. Session IDs create unique URLs for the same content.

A robots file can block many of these paths, but the cleaner fix is structural. Use canonical tags where content duplicates. Use parameter handling where supported. Avoid exposing infinite links. Add nofollow where appropriate. Keep sitemaps clean. Return proper status codes. Use stable URLs. Keep internal search results out of indexable paths unless there is a deliberate reason.
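
Stable URLs usually start with a normalization step that strips noise parameters before links are emitted or canonical tags are written. This is a minimal sketch; the parameter names in `NOISE_PARAMS` are hypothetical examples, and `canonical_url` assumes that parameter order carries no meaning on the site in question.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical parameter names; replace with the ones your logs actually show.
NOISE_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "sessionid", "sort"}


def canonical_url(url, noise=NOISE_PARAMS):
    """Drop noise parameters and the fragment, and sort what remains."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in noise
    ]
    kept.sort()  # same parameters in any order map to one URL
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


print(canonical_url("https://example.com/shoes?utm_source=x&color=red&sort=price"))
# https://example.com/shoes?color=red
```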

Yandex’s Clean-param directive is one example of a search engine-specific tool for URL parameters. Google has its own parsing rules and documentation. Baidu supports wildcards and order-sensitive matching. The more engines a site cares about, the more dangerous it is to depend on one engine’s interpretation.

The best robots.txt file is short because the site architecture is clean. Long files are sometimes necessary, especially for large ecommerce and publishing sites, but they often signal unresolved crawl traps.

AI crawler governance needs a named owner

Most crawler failures are governance failures. Someone copies a blocklist from a blog. A developer deploys a staging rule to production. A legal team requests “block AI” without knowing which bots affect search visibility. SEO allows every crawler to preserve traffic. Infrastructure blocks a noisy bot that paid media needed for ads validation. No one checks subdomains. No one reviews logs after deployment.

A mature governance model names owners by crawler class:

SEO owns classic search crawlers and index health. Editorial or content strategy owns AI search visibility. Legal or licensing owns training-data permissions. Infrastructure owns rate limits, bot verification and WAF rules. Product owns documentation and support-agent access. Paid media owns ad validation bots. Social or communications owns preview fetchers.

This sounds heavy for a text file, but the file now encodes business policy. A ten-line robots change can affect revenue, licensing, visibility and server cost. It deserves change control.

The process does not need to be slow. A strong workflow can be lightweight: proposed change, affected user agents, affected paths, reason, expected impact, test URLs, approval owner, deployment date, log review date. That is enough for most teams.

The worst robots.txt file is the one nobody owns.

The most common mistakes site owners make

The first mistake is blocking all bots with User-agent: * Disallow: / during development and forgetting to remove it at launch. This can keep search engines and AI search crawlers from discovering the site.

The second is using robots.txt to hide confidential files. The file itself is publicly readable, so every disallowed path is visible to anyone who requests it, and crawlers may ignore the rules. The fix is authentication or removal, not secrecy-by-disallow.

The third is blocking pages that need noindex. If the crawler cannot fetch the page, it may never see the noindex directive. Yandex and Baidu both document versions of this issue.

The fourth is treating company names as crawler categories. Blocking Googlebot blocks search; blocking Google-Extended controls AI training and grounding use without affecting Google Search, according to Google. Blocking Amazonbot is not the same as blocking Amzn-SearchBot. Blocking GPTBot is not the same as blocking OAI-SearchBot.
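The difference between tokens from the same operator is easy to demonstrate. The following sketch, again using Python's urllib.robotparser, assumes a hypothetical policy that blocks GPTBot, allows OAI-SearchBot and hides a cart path from everyone else; the helper name allowed and the example.com URL are illustrative.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy: no training crawl, AI search allowed, cart hidden from the rest.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: *
Disallow: /cart/
"""


def allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Ask the standard-library parser whether `agent` may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


URL = "https://example.com/articles/robots-guide"
for agent in ("GPTBot", "OAI-SearchBot", "Googlebot"):
    print(agent, allowed(ROBOTS_TXT, agent, URL))
# GPTBot False
# OAI-SearchBot True
# Googlebot True
```

Two agents from one company receive opposite answers for the same URL, which is why a rule written against a company name rather than a token does not express a usable policy.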

The fifth is ignoring user-triggered fetchers. These agents may access pages even when teams think they have blocked “AI crawlers.” Policies differ across platforms, and some providers explicitly state that robots.txt may not apply to user-initiated fetches.

The sixth is never checking logs. A robots file is only a statement. Logs show whether crawlers read it, obey it, ignore it, or arrive under names the team did not expect.
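A first pass at log review does not need special tooling. The sketch below assumes the server writes the widely used "combined" access log format; the sample lines, the IP addresses (taken from documentation ranges) and the shortened user-agent strings are made up for illustration, and real crawler strings are longer. It counts hits per known token and how many of those hits were requests for /robots.txt.

```python
import re
from collections import Counter

# Matches the request, status, size, referrer and user-agent parts of a combined log line.
LOG_LINE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)

TOKENS = ("GPTBot", "OAI-SearchBot", "ClaudeBot", "PerplexityBot", "CCBot")


def crawler_hits(lines, tokens=TOKENS):
    """Return (hits per token, robots.txt reads per token) for the given log lines."""
    hits = Counter()
    robots_reads = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if not match:
            continue  # skip lines that are not in the assumed format
        agent = match["agent"].lower()
        for token in tokens:
            if token.lower() in agent:
                hits[token] += 1
                if match["path"].split("?")[0] == "/robots.txt":
                    robots_reads[token] += 1
    return hits, robots_reads


# Fabricated sample lines, for illustration only.
SAMPLE = [
    '203.0.113.5 - - [10/Jan/2026:10:00:00 +0000] "GET /robots.txt HTTP/1.1" 200 120 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '203.0.113.5 - - [10/Jan/2026:10:00:02 +0000] "GET /private/report HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '198.51.100.7 - - [10/Jan/2026:10:01:00 +0000] "GET /articles/a HTTP/1.1" 200 900 "-" "Mozilla/5.0 (compatible; OAI-SearchBot/1.0)"',
]

hits, robots_reads = crawler_hits(SAMPLE)
print(dict(hits))          # {'GPTBot': 2, 'OAI-SearchBot': 1}
print(dict(robots_reads))  # {'GPTBot': 1}
```

A token with page hits but no robots.txt reads deserves a closer look, although crawlers cache the file, so absence in a short sample proves little. A user-agent string is also only a claim until the source address has been verified.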

A baseline robots.txt strategy for 2026

A sensible baseline starts with separation. Do not write one rule for “AI.” Separate search indexing, AI search indexing, training, user-triggered fetches, product crawlers, ad validators and social previews.

For most public sites, allow classic search crawlers on public pages: Googlebot, Bingbot, Applebot, and regional crawlers that match the audience. Add Baiduspider, YandexBot, SeznamBot, Qwantbot, MojeekBot, Bravebot or others where market exposure justifies it.

Then decide on AI search crawlers. Allow OAI-SearchBot, Claude-SearchBot, PerplexityBot, DuckAssistBot, MistralAI-Index and Amzn-SearchBot only if answer-engine visibility is desired. For many brands and docs sites, it is. For some publishers, it is a licensing issue.

Then decide on training. GPTBot, ClaudeBot, Amazonbot, AI2Bot, CCBot, Google-Extended and Applebot-Extended should be allowed or blocked based on content rights and model-training strategy. A public-interest organization may choose openness. A subscription publisher may choose restriction.

Then control low-value paths for everyone: internal search, carts, checkout, accounts, login pages, admin paths, session URLs, tracking parameters and infinite filters. Use path rules carefully so public content remains accessible to selected crawlers.
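The classification steps above can be turned into a generated file and checked before deployment. The sketch below is one possible arrangement, not a recommended policy: which tokens are blocked or allowed is an example choice, the low-value paths are hypothetical, and the helper names are illustrative. One detail matters for the path rules: under RFC 9309 a crawler that finds a group naming its own token follows that group and not the * group, so shared Disallow rules are repeated in the named group.

```python
from urllib.robotparser import RobotFileParser

# Example policy only: which tokens land in which class is a business decision.
BLOCKED_TRAINING = ["GPTBot", "ClaudeBot", "CCBot", "Google-Extended"]
ALLOWED_SEARCH = ["Googlebot", "Bingbot", "OAI-SearchBot", "Claude-SearchBot", "PerplexityBot"]
# Hypothetical low-value paths; replace with the site's real ones.
LOW_VALUE_PATHS = ["/search", "/cart/", "/checkout/", "/account/", "/login", "/admin/"]


def group(agents, rules):
    """Build one robots.txt group: User-agent lines, rule lines, then a blank line."""
    return [f"User-agent: {a}" for a in agents] + rules + [""]


def build_robots_txt():
    low_value = [f"Disallow: {p}" for p in LOW_VALUE_PATHS]
    lines = []
    lines += group(BLOCKED_TRAINING, ["Disallow: /"])
    lines += group(ALLOWED_SEARCH, low_value)  # named group repeats the shared rules
    lines += group(["*"], low_value)           # fallback for every other crawler
    return "\n".join(lines)


def can_fetch(robots_txt, agent, path):
    """Check one agent and path with the standard-library parser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, "https://example.com" + path)


ROBOTS_TXT = build_robots_txt()
print(ROBOTS_TXT)
for agent, path in [("GPTBot", "/articles/a"), ("OAI-SearchBot", "/articles/a"),
                    ("OAI-SearchBot", "/cart/1"), ("UnlistedBot", "/checkout/pay")]:
    print(agent, path, can_fetch(ROBOTS_TXT, agent, path))
```

The final loop doubles as a regression test: if a later edit drops the shared rules from the named group, the /cart/ check for OAI-SearchBot flips to allowed and the problem is visible before the file ships.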

Then add verification and monitoring. Use DNS and IP checks for major crawlers. Use WAF rules for abusive traffic. Review logs after changes.
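One common form of DNS check is forward-confirmed reverse DNS: resolve the visiting IP to a hostname, confirm the hostname belongs to the operator's domain, then resolve that hostname back and confirm it returns the same IP. The sketch below is a minimal version under several assumptions: the function name verify_crawler_ip is hypothetical, the domain suffixes are placeholders to confirm against each operator's documentation, the forward lookup used here covers IPv4 only, and the hostnames and addresses in the demonstration are invented. Some operators publish IP ranges instead, in which case a range comparison replaces this check.

```python
import socket


def verify_crawler_ip(ip, allowed_suffixes, reverse=None, forward=None):
    """Forward-confirmed reverse DNS check; resolvers are injectable for testing."""
    reverse = reverse or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward = forward or (lambda host: socket.gethostbyname_ex(host)[2])
    try:
        host = reverse(ip).rstrip(".").lower()
    except OSError:
        return False  # no reverse record
    # Require an exact domain match or a real subdomain, not a lookalike suffix.
    if not any(host == s or host.endswith("." + s) for s in allowed_suffixes):
        return False
    try:
        return ip in forward(host)
    except OSError:
        return False  # hostname does not resolve


# Demonstration with fake resolvers and invented records, so no network is needed.
FAKE_PTR = {
    "203.0.113.5": "crawl-203-0-113-5.googlebot.com",
    "198.51.100.9": "host.example.net",
    "192.0.2.77": "spoofed.googlebot.com",
}
FAKE_A = {
    "crawl-203-0-113-5.googlebot.com": ["203.0.113.5"],
    "spoofed.googlebot.com": ["192.0.2.1"],
}
SUFFIXES = ("googlebot.com", "google.com")  # assumption: confirm in operator docs

for ip in FAKE_PTR:
    ok = verify_crawler_ip(ip, SUFFIXES, FAKE_PTR.__getitem__, FAKE_A.__getitem__)
    print(ip, ok)
# 203.0.113.5 True
# 198.51.100.9 False
# 192.0.2.77 False
```

The third address shows why the forward lookup matters: a reverse record can claim any hostname, and only the round trip ties the name back to the address.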

The baseline is not “allow” or “block.” The baseline is “classify, decide, test and monitor.”

The crawler map will keep changing

The crawler names in today’s robots.txt files will not be stable forever. Google has already added tokens tied to Gemini and Vertex use cases. OpenAI’s crawler set changed as ChatGPT search and ads products developed. Anthropic updated its crawler documentation to split training, search and user access. Amazon added search and user agents around Alexa and Rufus-style experiences. Apple added Applebot-Extended for generative model controls. Mistral now publishes Le Chat-related crawler controls.

New agents will arrive from AI browsers, shopping assistants, vertical search tools, enterprise knowledge systems, code agents, voice assistants and research crawlers. Some will be transparent. Some will be vague. Some will publish IP ranges. Some will not. Some will claim user-triggered status. Some will masquerade as browsers.

A static robots.txt copied once a year will fall behind. The file now needs the same review rhythm as schema markup, sitemaps, consent tools, CDN rules and analytics tagging. For large sites, quarterly review is reasonable. For publishers and high-traffic ecommerce sites, monthly log checks may be necessary.

The direction is clear. Robots.txt is becoming a consent and routing file for machine readers. It is still too weak to carry that burden alone, but it is too widely deployed to ignore.

The answer to who processes robots.txt is now layered

The direct answer is that most major search crawlers process robots.txt: Googlebot, Bingbot, Baiduspider, YandexBot, Applebot, SeznamBot, Qwantbot, MojeekBot, Bravebot and many regional bots. Many AI-related crawlers also process it: GPTBot, OAI-SearchBot, ClaudeBot, Claude-SearchBot, PerplexityBot, DuckAssistBot, MistralAI-Index, Amazonbot, Amzn-SearchBot, CCBot and AI2Bot. Several control tokens, such as Google-Extended and Applebot-Extended, do not behave like ordinary crawlers but are still read through robots.txt as publisher controls for AI-related use.

The second answer is that not every automated fetch follows the same rule. User-triggered fetchers may be treated differently. Perplexity says Perplexity-User generally ignores robots.txt because a user requested the fetch. OpenAI says robots rules may not apply to ChatGPT-User because actions are initiated by a user. Apple says iTMS does not follow robots.txt because it is not a general search crawler.

The third answer is that compliance is not only a documentation claim. It must be verified in logs. Cloudflare’s Perplexity report shows why publishers now look beyond official user-agent lists and ask whether a crawler is transparent, verifiable and consistent with no-crawl signals.

So the best operational answer is not a single list. It is a matrix. Who is the operator? Which token do they use? Is the agent for search, training, user retrieval, ads, shopping, social previews or research? Does it obey robots.txt? Does it honor Crawl-delay or only Disallow and Allow? Does it publish IP ranges? What happens if it is blocked? What business value does it create?
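Such a matrix can start as a small data structure rather than a spreadsheet. The sketch below records only a few rows, using the operator statements summarized in this article; the class name CrawlerProfile, the field names and the label vocabulary ("honors", "may-not-apply", "generally-ignores") are hypothetical, and further columns such as IP publication or business value would be added the same way.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CrawlerProfile:
    operator: str
    token: str
    purpose: str        # "search", "ai-search", "training", "user-fetch" or "control-token"
    robots_stance: str  # the operator's own stated position, not a verified behavior


MATRIX = [
    CrawlerProfile("Google", "Googlebot", "search", "honors"),
    CrawlerProfile("Google", "Google-Extended", "control-token", "honors"),
    CrawlerProfile("OpenAI", "GPTBot", "training", "honors"),
    CrawlerProfile("OpenAI", "OAI-SearchBot", "ai-search", "honors"),
    CrawlerProfile("OpenAI", "ChatGPT-User", "user-fetch", "may-not-apply"),
    CrawlerProfile("Perplexity", "PerplexityBot", "ai-search", "honors"),
    CrawlerProfile("Perplexity", "Perplexity-User", "user-fetch", "generally-ignores"),
]


def tokens_outside_robots(matrix):
    """Tokens whose operators say robots.txt may not govern them."""
    return [p.token for p in matrix if p.robots_stance != "honors"]


def tokens_for(matrix, purpose):
    """All tokens recorded for one purpose class."""
    return [p.token for p in matrix if p.purpose == purpose]


print(tokens_outside_robots(MATRIX))  # ['ChatGPT-User', 'Perplexity-User']
print(tokens_for(MATRIX, "training"))  # ['GPTBot']
```

The first query is the one teams tend to miss: it lists the agents that need a control other than robots.txt, such as WAF rules or authentication, if the goal is to keep them out.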

That is the new map. Robots.txt has become the place where publishers describe their relationship with the machine web. The file is still plain text, but the decisions behind it are now strategic.

Practical questions publishers ask about robots.txt and AI crawlers

Which major companies process robots.txt?

Major search and technology companies that publish robots.txt controls or crawler behavior include Google, Microsoft Bing, OpenAI, Anthropic, Amazon, Apple, Perplexity, Mistral, DuckDuckGo, Yandex, Baidu, Seznam, Qwant, Mojeek, Common Crawl and AI2. Their user-agent names and rules differ, so each should be reviewed separately.

Does Googlebot process robots.txt?

Yes. Google says its common crawlers always obey robots.txt rules when crawling automatically. The Googlebot token governs crawling for Google Search and related search surfaces, while tokens such as Google-Extended control separate AI-related uses.

Does Google-Extended appear in server logs?

Google says Google-Extended does not have a separate HTTP user-agent string. It is a robots.txt product token used as a control for whether crawled content may be used for Gemini model training and grounding.

Does blocking Google-Extended hurt Google Search rankings?

Google states that Google-Extended does not affect inclusion in Google Search and is not used as a ranking signal in Google Search.

Which OpenAI bots should publishers know?

The main OpenAI agents are OAI-SearchBot for ChatGPT search visibility, GPTBot for foundation model training, OAI-AdsBot for ad landing-page checks and ChatGPT-User for user-triggered actions.

Should I block GPTBot if I still want to appear in ChatGPT search?

Blocking GPTBot signals that content should not be used for OpenAI foundation model training. For ChatGPT search visibility, OpenAI points publishers to OAI-SearchBot rather than GPTBot.

Which Anthropic crawlers process robots.txt?

Anthropic lists ClaudeBot for training-related collection, Claude-User for user-directed retrieval and Claude-SearchBot for search result quality. Anthropic says its bots honor robots.txt directives.

What does Amazonbot do?

Amazon says Amazonbot improves Amazon products and services and may be used to train Amazon AI models. Amazon also lists Amzn-SearchBot for search experiences such as Alexa and Rufus, and Amzn-User for user actions.

Does Applebot-Extended crawl pages?

Apple says Applebot-Extended does not crawl webpages. It is used to determine how data crawled by Applebot may be used, including whether content can be used for Apple foundation model training.

Does Bingbot follow robots.txt?

Bing has long documented robots.txt support and Bingbot behavior, including legacy support for directives written for msnbot. Bing Webmaster Tools also includes a robots.txt tester for Bingbot and BingAdsBot.

Does YandexBot process robots.txt?

Yes. Yandex documents robots.txt support and advanced directives, including Disallow, Allow, Sitemap and Clean-param. Its documentation also explains how Yandex user-agent matching works.

Does Baiduspider check robots.txt?

Baidu says its spider checks for a robots.txt file in the root directory before accessing a site and works according to the file’s instructions when present.

Does DuckAssistBot train AI models?

DuckDuckGo says DuckAssistBot crawls pages in real time for AI-assisted answers and that the data is not used to train AI models.

Does PerplexityBot respect robots.txt?

Perplexity says PerplexityBot respects robots.txt and is not used for foundation model training. Its documentation separately says Perplexity-User generally ignores robots.txt because it is user-requested.

Why is Perplexity controversial among publishers?

Cloudflare accused Perplexity in 2025 of using stealth, undeclared crawlers to evade website no-crawl directives, an allegation Perplexity disputed in public reporting. The dispute shows why logs and verification matter.

Does Common Crawl’s CCBot process robots.txt?

Common Crawl provides a CCBot user-agent string and a robots.txt example for blocking CCBot. It also recommends verification because other crawlers may falsely identify as CCBot.

Does AI2Bot use web content for model training?

AI2 says its bot explores domains to find web content and that this content is used to train open language models. It publishes the AI2Bot user-agent header.

Can robots.txt remove a page from search results?

Not reliably. Robots.txt blocks crawling. Search engines may still know a blocked URL from external links, and if a crawler cannot fetch the page, it may not see a noindex tag. Yandex and Baidu both document versions of this issue.

Is robots.txt enough to protect private content?

No. Robots.txt is a voluntary crawler signal, not access control. Private content needs authentication, authorization, network protection or removal from public web roots.

How often should robots.txt be reviewed?

High-traffic publishers, ecommerce sites and documentation sites should review logs at least monthly and audit crawler rules quarterly. AI and answer-engine crawler policies change often, and stale rules can block useful discovery or allow unwanted reuse.

Author:
Jan Bielik
CEO & Founder of Webiano Digital & Marketing Agency

This article is an original analysis supported by the sources cited below

Robots Exclusion Protocol RFC 9309
The IETF specification defining the modern Robots Exclusion Protocol and its limits as a crawler-access convention.

How Google interprets the robots.txt specification
Google’s technical documentation on robots.txt file location, parsing, status-code handling, caching and file limits.

Overview of Google crawlers and fetchers
Google’s overview of common crawlers, special-case crawlers and user-triggered fetchers.

Google’s common crawlers
Google’s detailed list of major crawler user agents, robots.txt tokens and affected products, including Googlebot and Google-Extended.

An update on web publisher controls
Google’s announcement of Google-Extended as a publisher control for generative AI-related uses.

Grounding with Google Search on Vertex AI
Google Cloud documentation explaining that Vertex AI grounding respects Google-Extended disallow rules.

Overview of OpenAI crawlers
OpenAI’s official documentation for OAI-SearchBot, GPTBot, OAI-AdsBot and ChatGPT-User.

Does Anthropic crawl data from the web
Anthropic’s official crawler policy for ClaudeBot, Claude-User and Claude-SearchBot.

About AmazonBot
Amazon’s official documentation for Amazonbot, Amzn-SearchBot, Amzn-User and Amazon’s robots.txt behavior.

About Applebot
Apple’s official documentation for Applebot, Applebot-Extended, iTMS behavior and Apple search crawler controls.

Bing crawler bingbot on the horizon
Bing Webmaster Blog post explaining Bingbot naming and compatibility with robots.txt directives written for msnbot.

Bing Webmaster Tools robots.txt tester
Microsoft Bing documentation about testing robots.txt behavior for Bingbot and BingAdsBot.

Using robots.txt in Yandex Webmaster
Yandex’s official documentation on robots.txt rules, supported directives and Yandex-specific recommendations.

The User-agent directive in Yandex Webmaster
Yandex documentation explaining how Yandex robots interpret user-agent groups.

How to check that a robot belongs to Yandex
Yandex documentation on crawler verification and Yandex robot behavior in server logs.

Baidu robots.txt help page
Baidu’s English-language documentation for Baiduspider, robots.txt placement, directives and examples.

DuckAssistBot help page
DuckDuckGo’s official explanation of DuckAssistBot, AI-assisted answers, opt-out handling and user-agent details.

Perplexity crawlers
Perplexity’s official documentation for PerplexityBot and Perplexity-User.

How Perplexity follows robots.txt
Perplexity’s help-center article describing its robots.txt policy and crawler use.

Perplexity is using stealth undeclared crawlers
Cloudflare’s investigation into alleged Perplexity stealth crawling and crawler-verification norms.

Mistral AI crawlers
Mistral AI’s official documentation for MistralAI-User and MistralAI-Index.

Common Crawl CCBot
Common Crawl’s documentation for CCBot user-agent identification, opt-out and verification.

AI2 crawling notice
Allen Institute for AI documentation describing AI2Bot and its use of web content for open language model training.

Qwant web crawler
Qwant’s official documentation for Qwantbot user-agent identification, robots behavior and verification.

Seznam crawling control
Seznam’s official English documentation for SeznamBot robots.txt compliance and crawler controls.

MojeekBot
Mojeek’s official documentation for MojeekBot, robots.txt behavior and crawler verification.

Bravebot information
Cloudflare Radar’s verified bot profile for Bravebot, including operator, user-agent and robots.txt example.