AI Crawlers Reference Guide
Overview
This document covers the AI crawlers used by major platforms, including ChatGPT, Perplexity, Microsoft Copilot, Google Gemini, and Claude. Each crawler serves a specific purpose: AI training, search indexing, or on-demand content fetching.
AI Crawlers by Platform
ChatGPT (OpenAI)
| Crawler | User Agent | Purpose |
|---|---|---|
| GPTBot | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; GPTBot/1.1; +https://openai.com/gptbot | AI training - collects data for training GPT models |
| OAI-SearchBot | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; OAI-SearchBot/1.0; +https://openai.com/searchbot | Search functionality - retrieves web content for ChatGPT search |
| ChatGPT-User | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; ChatGPT-User/1.0; +https://openai.com/bot | On-demand fetcher - when users ask ChatGPT to visit a page |
IP Ranges & Documentation:
- GPTBot IPs: https://openai.com/gptbot.json
- OAI-SearchBot IPs: https://openai.com/searchbot.json
- ChatGPT-User IPs: https://openai.com/chatgpt-user.json
- Documentation: https://platform.openai.com/docs/bots
Perplexity
| Crawler | User Agent | Purpose |
|---|---|---|
| PerplexityBot | Mozilla/5.0 (compatible; PerplexityBot/1.0; +https://www.perplexity.ai/bot) | AI search indexing - indexes content for search results (not for model training) |
| Perplexity-User | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Perplexity/1.0; +https://www.perplexity.ai) | On-demand fetcher - retrieves content when users ask questions |
IP Ranges & Documentation:
- PerplexityBot IPs: https://www.perplexity.com/perplexitybot.json
- Perplexity-User IPs: https://www.perplexity.com/perplexity-user.json
- Documentation: https://docs.perplexity.ai/guides/bots
Microsoft Copilot / Bing
| Crawler | User Agent | Purpose |
|---|---|---|
| Bingbot | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm) Chrome/W.X.Y.Z Safari/537.36 | Search indexing - powers Bing Search and Microsoft Copilot answers |
Documentation: https://www.bing.com/webmaster/help/which-crawlers-does-bing-use-8c184ec0
Google (Gemini / AI Overview)
| Crawler | User Agent | Purpose |
|---|---|---|
| Googlebot | Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) | Search indexing - indexes content for Google Search, Discover, Images, Video, News |
| Google-Extended | Google-Extended (token) | AI training control - a robots.txt token, not a separate crawler; it governs whether content crawled by Google may be used for Gemini Apps and Vertex AI (does NOT affect Google Search) |
| Gemini-Deep-Research | Varies | On-demand fetcher - used by Gemini Deep Research feature for user queries |
IP Ranges & Documentation:
- Googlebot IPs: https://developers.google.com/static/search/apis/ipranges/googlebot.json
- Documentation: https://developers.google.com/search/docs/crawling-indexing/overview-google-crawlers
Claude (Anthropic)
| Crawler | User Agent | Purpose |
|---|---|---|
| ClaudeBot | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ClaudeBot/1.0; +https://www.anthropic.com/) | AI training - collects web content for training Claude AI models |
| Claude-SearchBot | Claude-SearchBot | Search indexing - indexes content to improve Claude search result quality |
| Claude-User | Claude-User | On-demand fetcher - retrieves web content at users' direction when using Claude |
Documentation: https://support.anthropic.com/en/articles/8896518 (Note: Anthropic does NOT publish IP ranges)
If you want to block some of these AI crawlers or agents, the tips below cover the main options.
Firewall Configuration Guide
Option 1: Using robots.txt (Recommended)
The operators of the crawlers listed above state that they respect robots.txt directives. Add the rules you need to your site's robots.txt file:
# Block OpenAI GPTBot (training)
User-agent: GPTBot
Disallow: /

# Block OpenAI SearchBot
User-agent: OAI-SearchBot
Disallow: /

# Block ChatGPT-User
User-agent: ChatGPT-User
Disallow: /

# Block Perplexity
User-agent: PerplexityBot
Disallow: /

User-agent: Perplexity-User
Disallow: /

# Block Bingbot (affects Bing Search AND Copilot)
User-agent: bingbot
Disallow: /

# Block Googlebot (affects ALL Google products)
User-agent: Googlebot
Disallow: /

# Block Google-Extended (AI training ONLY)
User-agent: Google-Extended
Disallow: /

# Block Anthropic ClaudeBot (training)
User-agent: ClaudeBot
Disallow: /

# Block Claude-SearchBot
User-agent: Claude-SearchBot
Disallow: /

# Block Claude-User
User-agent: Claude-User
Disallow: /
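Before deploying rules like these, it can help to check which crawlers they actually block. The following is a minimal sketch using Python's standard `urllib.robotparser` on a shortened rule set. The `ROBOTS_TXT` sample, the `blocked_agents` helper, and the `example.com` URL are illustrative assumptions, and a real crawler may interpret robots.txt differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical, shortened rule set: block two training-related tokens,
# allow everything else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: *
Allow: /
"""


def blocked_agents(robots_txt, agents, url="https://example.com/"):
    """Return the agents from `agents` that may not fetch `url`."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return [agent for agent in agents if not parser.can_fetch(agent, url)]


print(blocked_agents(
    ROBOTS_TXT,
    ["GPTBot", "Google-Extended", "Googlebot", "ClaudeBot"],
))
# prints ['GPTBot', 'Google-Extended']
```

Running the same check against your full robots.txt is a quick way to catch a rule that blocks a search crawler by accident.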
Option 2: IP-Based Firewall Rules
For platforms that publish IP ranges, you can implement IP-based blocking:
Platforms with Published IP Ranges:
- OpenAI: GPTBot, OAI-SearchBot, ChatGPT-User (see JSON links above)
- Perplexity: PerplexityBot, Perplexity-User (see JSON links above)
- Google: Googlebot (see JSON link above)
- Microsoft: Verify via Bing Webmaster Tools
Important: Anthropic (Claude) does NOT publish IP ranges and advises against IP-based blocking as it prevents the crawler from reading your robots.txt file.
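As a rough illustration of IP-based matching, the sketch below loads a range list and tests whether a client address falls inside it. The JSON layout (a `prefixes` list with `ipv4Prefix` and `ipv6Prefix` entries) is an assumption modeled on Google's published `googlebot.json`, so confirm the layout of each provider's file before relying on it. The sample uses documentation-only address ranges, and `load_networks` and `ip_in_ranges` are hypothetical helper names.

```python
import ipaddress
import json

# Sample data in the assumed layout; these are documentation-only ranges,
# not real crawler addresses.
SAMPLE_RANGES_JSON = """
{
  "creationTime": "2025-01-01T00:00:00",
  "prefixes": [
    {"ipv4Prefix": "192.0.2.0/24"},
    {"ipv6Prefix": "2001:db8::/32"}
  ]
}
"""


def load_networks(ranges_json):
    """Parse the assumed JSON layout into a list of ip_network objects."""
    data = json.loads(ranges_json)
    networks = []
    for entry in data.get("prefixes", []):
        prefix = entry.get("ipv4Prefix") or entry.get("ipv6Prefix")
        if prefix:
            networks.append(ipaddress.ip_network(prefix))
    return networks


def ip_in_ranges(ip, networks):
    """Return True if `ip` belongs to any of the given networks."""
    address = ipaddress.ip_address(ip)
    return any(
        address.version == net.version and address in net
        for net in networks
    )


networks = load_networks(SAMPLE_RANGES_JSON)
print(ip_in_ranges("192.0.2.15", networks))    # inside the sample IPv4 range
print(ip_in_ranges("198.51.100.1", networks))  # outside every sample range
```

Because published ranges change, a real deployment would need to re-download the file on a schedule rather than hard-code the prefixes.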
Option 3: Web Application Firewall (WAF)
Cloudflare WAF Configuration:
- Go to Security → WAF
- Create a custom rule with the following settings:
- Field: User-Agent
- Operator: Contains
- Value: Crawler name (e.g., PerplexityBot)
- Action: Block or Allow as needed
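The same "User-Agent contains crawler name" logic can be sketched at the application level, which may be useful for testing a rule before configuring it in a WAF. This is only an illustration: `BLOCKED_TOKENS` and `should_block` are hypothetical names, and matching case-insensitively is a choice made for this sketch, not a description of how any particular WAF evaluates its rules.

```python
# Hypothetical list of crawler names to block; adjust to your own policy.
BLOCKED_TOKENS = ("GPTBot", "ClaudeBot", "PerplexityBot")


def should_block(user_agent, blocked_tokens=BLOCKED_TOKENS):
    """Return True if the User-Agent header contains a blocked crawler name."""
    ua = user_agent.lower()
    return any(token.lower() in ua for token in blocked_tokens)


perplexity_ua = (
    "Mozilla/5.0 (compatible; PerplexityBot/1.0; "
    "+https://www.perplexity.ai/bot)"
)
browser_ua = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120.0 Safari/537.36"

print(should_block(perplexity_ua))  # matches the "PerplexityBot" token
print(should_block(browser_ua))     # no blocked token present
```

User-Agent headers are supplied by the client, so this kind of rule only stops crawlers that identify themselves honestly.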
Important Notes
About Blocking Crawlers:
- robots.txt is the recommended method - the major AI companies listed here state that they respect it
- IP blocking has limitations: published ranges change frequently, and some providers (e.g., Anthropic) do not publish IPs
- Allow 24-48 hours for robots.txt changes to take effect
Selective Blocking Strategy:
- Allow search crawlers for visibility in search results
- Block training crawlers if you don't want content used for AI training
- Allow user-initiated fetchers to enable AI assistant access
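A selective strategy like this can be expressed as a small generator that emits robots.txt rules only for the categories you want to block. The sketch below is an assumption-laden illustration: the `CRAWLERS` grouping follows the purposes given in the tables in this guide, `build_robots_txt` is a hypothetical helper, and Gemini-Deep-Research is left out because its user agent is listed as "Varies".

```python
# Crawler tokens grouped by the purpose listed in this guide's tables.
CRAWLERS = {
    "training": ["GPTBot", "Google-Extended", "ClaudeBot"],
    "search": [
        "OAI-SearchBot",
        "PerplexityBot",
        "bingbot",
        "Googlebot",
        "Claude-SearchBot",
    ],
    "user": ["ChatGPT-User", "Perplexity-User", "Claude-User"],
}


def build_robots_txt(blocked_categories):
    """Build robots.txt rules that disallow every crawler in the given categories."""
    lines = []
    for category in blocked_categories:
        for agent in CRAWLERS[category]:
            lines += [f"User-agent: {agent}", "Disallow: /", ""]
    return "\n".join(lines)


# Block training crawlers only; search crawlers and user-initiated
# fetchers get no rule and therefore stay allowed.
print(build_robots_txt(["training"]))
```

Keeping the category mapping in one place makes it easier to revise the policy when a platform adds or renames a crawler.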
Special Considerations:
- Perplexity Warning: Reports indicate Perplexity has used undeclared crawlers to bypass robots.txt. Consider using WAF rules in addition to robots.txt.
- Google-Extended: Only affects Gemini Apps and Vertex AI training - blocking it does NOT affect Google Search rankings.
- Bingbot: Blocking Bingbot affects BOTH Bing Search AND Microsoft Copilot functionality.
Last Updated: January 2026
Note: Crawler information and IP ranges are subject to change. Always verify with official documentation.