AI Crawler Policy: robots.txt for the Three-Tier Crawler Landscape

AI crawlers fall into three tiers: retrieval bots (allow, to stay eligible for citations), training scrapers (disallow), and non-compliant bots (block at the WAF, because robots.txt does not stop them). Each tier calls for a different control.

Related lesson: Four Engines, Four Backends — this concept features in a hands-on lesson with quizzes.

The three-tier taxonomy

AI crawlers are not monolithic. Each major provider now operates separate bots for distinct purposes, each with its own user-agent string:

| Tier | Purpose | User-agents | robots.txt behaviour |
|---|---|---|---|
| Tier 1: User-facing retrieval | Powers real-time citations in AI chat and search | ChatGPT-User*, OAI-SearchBot, Claude-User, Claude-SearchBot, PerplexityBot†, Perplexity-User | Allow (drives referral traffic and AI citations) |
| Tier 2: Training scrapers | Ingests content for model training datasets | GPTBot, ClaudeBot, Google-Extended, Meta-ExternalAgent | Disallow (no citation benefit; opts out of training data) |
| Tier 3: Non-compliant bots | Crawlers documented to ignore robots.txt | Bytespider (ByteDance) | CDN/WAF block (robots.txt is ineffective) |

The tier distinction matters. You can block training crawlers without blocking retrieval bots. That keeps your content eligible for AI search citations while opting out of training datasets.

* As of OpenAI's December 2025 policy update, ChatGPT-User no longer respects robots.txt; disallow rules are ignored (coverage).

† Cloudflare documented Perplexity rotating user-agents and ASNs to bypass robots.txt (August 2025 report). Use WAF for hard blocks.

Decision matrix

| Goal | Action |
|---|---|
| Appear in AI search answers (ChatGPT, Claude, Perplexity) | Allow Tier 1 |
| Prevent content entering training datasets | Disallow Tier 2 |
| Stop ByteDance/Bytespider from crawling | WAF custom rule |
| Opt out of everything | Disallow all AI user-agents + WAF |

The emerging practitioner consensus for documentation sites: allow Tier 1, disallow Tier 2.
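
The decision matrix can be turned into a small generator, which helps keep the policy consistent when the list of user-agents changes. The following is a minimal Python sketch, not a definitive tool: the function name `build_robots_txt`, its parameters, and the grouping of several `User-agent` lines under one rule are illustrative choices. Tier 3 is left out on purpose, since robots.txt does not stop those bots.

```python
# Tier membership copied from the taxonomy table; all names here are illustrative.
RETRIEVAL = ["ChatGPT-User", "OAI-SearchBot", "Claude-User",
             "Claude-SearchBot", "PerplexityBot", "Perplexity-User"]
TRAINING = ["GPTBot", "ClaudeBot", "Google-Extended", "Meta-ExternalAgent"]


def build_robots_txt(allow_retrieval=True, allow_training=False, sitemap=None):
    """Render robots.txt text for the goals chosen from the decision matrix."""
    lines = ["User-agent: *", "Allow: /", ""]
    for agents, allowed in ((RETRIEVAL, allow_retrieval), (TRAINING, allow_training)):
        rule = "Allow: /" if allowed else "Disallow: /"
        # Consecutive User-agent lines form one group that shares the rule below.
        lines += [f"User-agent: {agent}" for agent in agents]
        lines += [rule, ""]
    if sitemap:
        lines.append(f"Sitemap: {sitemap}")
    return "\n".join(lines).rstrip() + "\n"


# Default arguments give the "allow Tier 1, disallow Tier 2" policy.
print(build_robots_txt(sitemap="https://example.com/sitemap.xml"))
```

Passing `allow_retrieval=False` produces the "opt out of everything" row, which still needs a WAF rule for bots that ignore robots.txt.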

Reference configuration

This site's robots.txt implements the three-tier policy:

# ── Default: allow all standard crawlers ──────────────────────────────────────
User-agent: *
Allow: /

# ── Tier 1: User-facing retrieval bots (ALLOW) ────────────────────────────────

User-agent: ChatGPT-User
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: Claude-User
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Perplexity-User
Allow: /

# ── Tier 2: Training scrapers (DISALLOW) ──────────────────────────────────────

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

# ── Tier 3: CDN-level block (robots.txt ineffective) ──────────────────────────
# Bytespider — configure WAF custom rule: User-Agent contains "Bytespider" → Block

Sitemap: https://agentpatterns.ai/sitemap.xml
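
Before deploying a policy like this, it can be checked against the intended outcome with Python's standard-library `urllib.robotparser`. The sketch below is one possible check, with assumptions: the helper `check_policy`, the `EXPECTED` table, and the `example.com` URL are illustrative, and the parser reflects Python's reading of the file, not what any given crawler actually does.

```python
from urllib.robotparser import RobotFileParser

# The reference policy with the comment banners and Sitemap line removed.
ROBOTS_TXT = """\
User-agent: *
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: Claude-User
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Perplexity-User
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /
"""

# agent -> should a fetch be permitted under the three-tier policy?
EXPECTED = {
    "OAI-SearchBot": True, "Claude-User": True, "Claude-SearchBot": True,
    "PerplexityBot": True, "Perplexity-User": True, "Googlebot": True,
    "GPTBot": False, "ClaudeBot": False,
    "Google-Extended": False, "Meta-ExternalAgent": False,
}


def check_policy(robots_txt, expected, url="https://example.com/docs/"):
    """Return {agent: (expected, actual)} for each agent whose result differs."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    mismatches = {}
    for agent, want in expected.items():
        got = parser.can_fetch(agent, url)
        if got != want:
            mismatches[agent] = (want, got)
    return mismatches


print(check_policy(ROBOTS_TXT, EXPECTED) or "policy matches expectations")
```

An empty result only shows that the file says what was intended. Whether a crawler honours it is a separate question, covered in the caveats that follow.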

Compliance caveats

robots.txt is advisory, not enforceable. Watch for these caveats:

  • Major providers comply: OpenAI (GPTBot, OAI-SearchBot), Anthropic (ClaudeBot, Claude-SearchBot, Claude-User), and Google (Google-Extended) respect robots.txt directives.
  • ChatGPT-User exempt (December 2025): OpenAI's updated crawler documentation reclassified ChatGPT-User as a user-initiated agent and removed its robots.txt compliance requirement. Disallow rules for ChatGPT-User are now ignored. You can only block interactive ChatGPT browsing at the CDN/WAF layer.
  • Perplexity stealth crawling documented: Cloudflare reported in August 2025 that Perplexity rotates user-agents and ASNs to evade blocks and has been observed ignoring robots.txt. Treat robots.txt rules for PerplexityBot and Perplexity-User as a statement of intent only, and use WAF rules for any hard block.
  • Bytespider ignores it: ByteDance's Bytespider is documented as ignoring robots.txt, so block it at the CDN/WAF level. See Cloudflare WAF custom rules for setup.
  • No legal enforcement: robots.txt does not prevent crawling; it only signals intent. Legal protection requires terms of service (ToS), Computer Fraud and Abuse Act (CFAA) claims, or contractual agreements.
  • EU AI Act alignment: the EU regulatory framework encourages GPAI providers to document and respect publisher opt-out signals, and a robots.txt disallow for training crawlers is the de facto mechanism. Verify specific commitments against the published Code of Practice text as obligations evolve.
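
Where no CDN or WAF is available, the same "User-Agent contains Bytespider, then block" rule can be approximated in the application itself. The following Python sketch is an assumption-laden stand-in, not Cloudflare's rule syntax: `block_user_agents`, `demo_app`, and `status_for` are hypothetical names, and the check wraps any WSGI application. It only catches bots that announce themselves, so it does nothing against the user-agent rotation described above.

```python
BLOCKED_TOKENS = ("bytespider",)  # lowercase substrings to reject


def block_user_agents(app, blocked=BLOCKED_TOKENS):
    """Wrap a WSGI app so requests from blocked user-agents receive a 403."""
    def middleware(environ, start_response):
        user_agent = environ.get("HTTP_USER_AGENT", "").lower()
        if any(token in user_agent for token in blocked):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        return app(environ, start_response)
    return middleware


def demo_app(environ, start_response):
    """Placeholder application that always answers 200."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"ok\n"]


def status_for(user_agent):
    """Send one fake request through the wrapped app and return its status line."""
    seen = []
    wrapped = block_user_agents(demo_app)
    wrapped({"HTTP_USER_AGENT": user_agent},
            lambda status, headers: seen.append(status))
    return seen[0]


print(status_for("Mozilla/5.0 (compatible; Bytespider)"))
print(status_for("Mozilla/5.0 (X11; Linux x86_64) Firefox/128.0"))
```

A CDN or WAF rule remains the better place for this check, because it rejects the request before it reaches the origin and can also match on IP ranges or ASNs.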

Provider user-agent reference

| Provider | Training | Search index | User retrieval |
|---|---|---|---|
| OpenAI | GPTBot | OAI-SearchBot | ChatGPT-User* |
| Anthropic | ClaudeBot | Claude-SearchBot | Claude-User |
| Google | Google-Extended | (standard Googlebot) | Google-CloudVertexBot |
| Perplexity | (PerplexityBot serves both) | PerplexityBot | Perplexity-User |
| Meta | Meta-ExternalAgent | | Meta-ExternalFetcher |

*ChatGPT-User — no longer bound by robots.txt as of OpenAI's December 2025 policy update; block at CDN/WAF if required.

Why allow Tier 1

Blocking all AI crawlers has a compounding cost:

  • Retrieval bots power citation-eligible AI answers — being absent means competitors fill that space
  • AI-referred sessions grew substantially year over year through 2025, so blocking Tier 1 opts out of this traffic source entirely
  • Cloudflare data puts the crawl-to-referral ratio at roughly 1,700:1 for OpenAI and 73,000:1 for Anthropic, that is, about 1,700 and 73,000 crawler requests per referred visit. Training crawlers give no referral return; retrieval bots are the ones that send traffic back
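
To make the ratios concrete, the arithmetic can be sketched in a few lines of Python. The function names and the sample counts are illustrative assumptions; only the two ratios come from the Cloudflare figures quoted above.

```python
def crawl_to_referral_ratio(crawl_requests: int, referral_visits: int) -> float:
    """Crawler requests made per visit referred back to the site."""
    if referral_visits <= 0:
        raise ValueError("ratio is undefined without at least one referral")
    return crawl_requests / referral_visits


def referrals_per_million_crawls(ratio: float) -> float:
    """Invert a crawl-to-referral ratio into referred visits per 1,000,000 crawls."""
    return 1_000_000 / ratio


# Hypothetical log counts: 3,400 crawler hits and 2 referred visits.
print(crawl_to_referral_ratio(3_400, 2))

# The quoted ratios, expressed as referred visits per million crawler requests.
for provider, ratio in (("OpenAI", 1_700), ("Anthropic", 73_000)):
    print(provider, round(referrals_per_million_crawls(ratio), 1))
```

Run against a site's own logs, split by user-agent, the same calculation shows which bots return traffic and which only consume it.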

Key Takeaways

  • The three-tier taxonomy (retrieval / training / non-compliant) maps directly to three distinct controls: allow in robots.txt / disallow in robots.txt / block at the CDN or WAF
  • Blocking training crawlers does not block retrieval bots — they use separate user-agent strings
  • robots.txt compliance is voluntary; most major providers respect it, but ChatGPT-User was exempted in December 2025 and Perplexity has been documented evading blocks — use CDN/WAF rules when hard enforcement is required
  • The default strategy for documentation sites: allow Tier 1, disallow Tier 2, WAF-block Bytespider