AI Crawlers Reference Guide

Overview

This document provides comprehensive information about AI crawlers used by major platforms including ChatGPT, Perplexity, Microsoft Copilot, Google Gemini, and Claude. Each crawler has specific purposes ranging from AI training to search indexing to on-demand content fetching.

AI Crawlers by Platform

ChatGPT (OpenAI)

Crawler	User Agent	Purpose
GPTBot	Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; GPTBot/1.1; +https://openai.com/gptbot	AI training - collects data for training GPT models
OAI-SearchBot	Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; OAI-SearchBot/1.0; +https://openai.com/searchbot	Search functionality - retrieves web content for ChatGPT search
ChatGPT-User	Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; ChatGPT-User/1.0; +https://openai.com/bot	On-demand fetcher - when users ask ChatGPT to visit a page

IP Ranges & Documentation:

GPTBot IPs: https://openai.com/gptbot.json
OAI-SearchBot IPs: https://openai.com/searchbot.json
ChatGPT-User IPs: https://openai.com/chatgpt-user.json
Documentation: https://platform.openai.com/docs/bots

Perplexity

Crawler	User Agent	Purpose
PerplexityBot	Mozilla/5.0 (compatible; PerplexityBot/1.0; +https://www.perplexity.ai/bot)	AI search indexing - indexes content for search results (not for model training)
Perplexity-User	Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Perplexity/1.0; +https://www.perplexity.ai)	On-demand fetcher - retrieves content when users ask questions

IP Ranges & Documentation:

PerplexityBot IPs: https://www.perplexity.com/perplexitybot.json
Perplexity-User IPs: https://www.perplexity.com/perplexity-user.json
Documentation: https://docs.perplexity.ai/guides/bots

Microsoft Copilot / Bing

Crawler	User Agent	Purpose
Bingbot	Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm) Chrome/W.X.Y.Z Safari/537.36	Search indexing - powers Bing Search and Microsoft Copilot answers

Documentation: https://www.bing.com/webmaster/help/which-crawlers-does-bing-use-8c184ec0

Google (Gemini / AI Overview)

Crawler	User Agent	Purpose
Googlebot	Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)	Search indexing - indexes content for Google Search, Discover, Images, Video, News
Google-Extended	None (robots.txt token only; crawling uses existing Google user agents)	AI training control - governs whether crawled content may be used for Gemini Apps and Vertex AI (does NOT affect Google Search)
Gemini-Deep-Research	Varies	On-demand fetcher - used by Gemini Deep Research feature for user queries

IP Ranges & Documentation:

Googlebot IPs: https://developers.google.com/static/search/apis/ipranges/googlebot.json
Documentation: https://developers.google.com/search/docs/crawling-indexing/overview-google-crawlers

Claude (Anthropic)

Crawler	User Agent	Purpose
ClaudeBot	Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ClaudeBot/1.0; +https://www.anthropic.com/)	AI training - collects web content for training Claude AI models
Claude-SearchBot	Claude-SearchBot	Search indexing - indexes content to improve Claude search result quality
Claude-User	Claude-User	On-demand fetcher - retrieves web content at users' direction when using Claude

Documentation: https://support.anthropic.com/en/articles/8896518 (Note: Anthropic does NOT publish IP ranges)

If you want to block some of these AI crawlers or agents, the guide below offers some tips.

Firewall Configuration Guide

Option 1: Using robots.txt (Recommended)

The vendors listed above state that their declared crawlers respect robots.txt directives. Add the rules for the crawlers you want to block to your site's robots.txt file:

# Block OpenAI GPTBot (training)
User-agent: GPTBot
Disallow: /

# Block OpenAI SearchBot
User-agent: OAI-SearchBot
Disallow: /

# Block ChatGPT-User
User-agent: ChatGPT-User
Disallow: /

# Block Perplexity
User-agent: PerplexityBot
Disallow: /

User-agent: Perplexity-User
Disallow: /

# Block Bingbot (affects Bing Search AND Copilot)
User-agent: bingbot
Disallow: /

# Block Googlebot (affects ALL Google products)
User-agent: Googlebot
Disallow: /

# Block Google-Extended (AI training ONLY)
User-agent: Google-Extended
Disallow: /

# Block Anthropic ClaudeBot (training)
User-agent: ClaudeBot
Disallow: /

# Block Claude-SearchBot
User-agent: Claude-SearchBot
Disallow: /

# Block Claude-User
User-agent: Claude-User
Disallow: /
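Before deploying a robots.txt file, it can help to check locally which crawlers it blocks. The following is a minimal sketch using Python's standard urllib.robotparser module. The sample rules, the example.com URL, and the helper name blocked_agents are illustrative assumptions, not part of any vendor's tooling. Python's parser treats a blank line as the end of a group, so a User-agent line separated from its Disallow line by a blank line is ignored. Keeping each pair on adjacent lines avoids this.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that blocks two training crawlers and nothing else.
ROBOTS_TXT = """\
# Block OpenAI GPTBot (training)
User-agent: GPTBot
Disallow: /

# Block Anthropic ClaudeBot (training)
User-agent: ClaudeBot
Disallow: /
"""


def blocked_agents(robots_txt: str, agents: list, url: str) -> dict:
    """Return {agent: True} for every agent that robots_txt disallows on url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # can_fetch() expects the crawler token (e.g. "GPTBot"), not the full
    # browser-style User-Agent header.
    return {agent: not parser.can_fetch(agent, url) for agent in agents}


if __name__ == "__main__":
    result = blocked_agents(
        ROBOTS_TXT,
        ["GPTBot", "ClaudeBot", "Googlebot", "OAI-SearchBot"],
        "https://example.com/article",
    )
    for agent, blocked in result.items():
        print(f"{agent}: {'blocked' if blocked else 'allowed'}")
```

This only shows how one standards-style parser reads the file. Each crawler uses its own parser, so treat the result as a sanity check rather than a guarantee.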

Option 2: IP-Based Firewall Rules

For platforms that publish IP ranges, you can implement IP-based blocking:

Platforms with Published IP Ranges:

OpenAI: GPTBot, OAI-SearchBot, ChatGPT-User (see JSON links above)
Perplexity: PerplexityBot, Perplexity-User (see JSON links above)
Google: Googlebot (see JSON link above)
Microsoft: Bingbot (no range list is linked in this guide; verify via Bing Webmaster Tools)

Important: Anthropic (Claude) does NOT publish IP ranges and advises against IP-based blocking as it prevents the crawler from reading your robots.txt file.
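As a rough illustration of IP-based matching, the sketch below checks whether a client address falls inside a published range list, using only Python's standard ipaddress and json modules. It assumes the range file has a top-level "prefixes" array whose entries carry an "ipv4Prefix" or "ipv6Prefix" key. Verify that schema against the actual JSON files linked above before relying on it. The sample data uses reserved documentation ranges, not real crawler addresses, and the function names are hypothetical.

```python
import ipaddress
import json

# Hypothetical sample in the ASSUMED schema; real files live at the JSON URLs
# listed in this guide and should be fetched and cached by your own tooling.
SAMPLE_RANGES_JSON = """
{
  "prefixes": [
    {"ipv4Prefix": "192.0.2.0/24"},
    {"ipv6Prefix": "2001:db8::/32"}
  ]
}
"""


def load_networks(ranges_json: str) -> list:
    """Parse a range file into a list of ip_network objects."""
    data = json.loads(ranges_json)
    networks = []
    for entry in data.get("prefixes", []):
        cidr = entry.get("ipv4Prefix") or entry.get("ipv6Prefix")
        if cidr:
            networks.append(ipaddress.ip_network(cidr))
    return networks


def ip_in_ranges(ip: str, networks: list) -> bool:
    """Return True if ip belongs to any of the given networks."""
    address = ipaddress.ip_address(ip)
    # An IPv4 address is never contained in an IPv6 network and vice versa.
    return any(address in network for network in networks)


if __name__ == "__main__":
    nets = load_networks(SAMPLE_RANGES_JSON)
    for candidate in ("192.0.2.10", "198.51.100.1", "2001:db8::1"):
        print(candidate, ip_in_ranges(candidate, nets))
```

Because published ranges change, a real deployment would need to refresh the list on a schedule rather than hard-code it.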

Option 3: Web Application Firewall (WAF)

Cloudflare WAF Configuration:

Go to Security → WAF
Create custom rules
Field: User-Agent
Operator: Contains
Value: Crawler name (e.g., PerplexityBot)
Action: Block or Allow as needed
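The logic of such a rule can be sketched in a few lines of Python. This is not Cloudflare's rule syntax or API, only an illustration of a "User-Agent contains" check under stated assumptions: the function name waf_action and the choice of blocked tokens are hypothetical, and the comparison is made case-insensitive here as a design choice. User-Agent headers can be spoofed, so a match identifies only what a client claims to be.

```python
# Hypothetical selection of crawler tokens to block.
BLOCKED_TOKENS = ("GPTBot", "ClaudeBot")


def waf_action(user_agent: str, blocked_tokens=BLOCKED_TOKENS) -> str:
    """Return "block" if the User-Agent contains a blocked token, else "allow"."""
    ua = user_agent.lower()
    for token in blocked_tokens:
        if token.lower() in ua:
            return "block"
    return "allow"


if __name__ == "__main__":
    gptbot_ua = (
        "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); "
        "compatible; GPTBot/1.1; +https://openai.com/gptbot"
    )
    googlebot_ua = (
        "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
    )
    print(waf_action(gptbot_ua))
    print(waf_action(googlebot_ua))
```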

Important Notes

About Blocking Crawlers:

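robots.txt is the recommended method: the vendors listed above state that their declared crawlers respect it
IP blocking has limitations: published ranges change frequently, and some vendors (such as Anthropic) do not publish IPs
Allow 24-48 hours for robots.txt changes to take effect, since crawlers may cache the file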

Selective Blocking Strategy:

Allow search crawlers for visibility in search results
Block training crawlers if you don't want content used for AI training
Allow user-initiated fetchers to enable AI assistant access
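One way to apply this strategy is to group the crawler tokens by purpose and generate robots.txt rules only for the groups you want to block. The sketch below does that for the tokens named in the tables above. The grouping dictionary and the function name build_robots_txt are illustrative assumptions, and Gemini-Deep-Research is left out because this guide lists no robots.txt rule for it.

```python
# Crawler tokens grouped by purpose, taken from the tables in this guide.
CRAWLERS = {
    "training": ["GPTBot", "Google-Extended", "ClaudeBot"],
    "search": [
        "OAI-SearchBot",
        "PerplexityBot",
        "bingbot",
        "Googlebot",
        "Claude-SearchBot",
    ],
    "user": ["ChatGPT-User", "Perplexity-User", "Claude-User"],
}


def build_robots_txt(blocked_purposes: list) -> str:
    """Build robots.txt groups that disallow every crawler in the given purposes."""
    groups = []
    for purpose in blocked_purposes:
        for token in CRAWLERS[purpose]:
            # Keep User-agent and Disallow on adjacent lines; a blank line
            # separates one group from the next.
            groups.append(
                f"# Block {token} ({purpose})\nUser-agent: {token}\nDisallow: /"
            )
    return "\n\n".join(groups) + "\n"


if __name__ == "__main__":
    # Block training crawlers only; search and user-initiated fetchers stay allowed.
    print(build_robots_txt(["training"]))
```

Crawlers not mentioned in the generated file remain allowed by default, which matches the "allow search, block training" approach described above.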

Special Considerations:

Perplexity Warning: Reports indicate Perplexity has used undeclared crawlers to bypass robots.txt. Consider using WAF rules in addition to robots.txt.
Google-Extended: Only affects Gemini Apps and Vertex AI training - blocking it does NOT affect Google Search rankings.
Bingbot: Blocking Bingbot affects BOTH Bing Search AND Microsoft Copilot functionality.

Last Updated: January 2026

Note: Crawler information and IP ranges are subject to change. Always verify with official documentation.