@edmondburnett
Last active July 13, 2024 05:26
robots.txt for blocking known AI bots/crawlers
User-agent: ChatGPT-User
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: GPTBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: Omgilibot
Disallow: /
User-agent: FacebookBot
Disallow: /
User-agent: ImagesiftBot
Disallow: /
User-agent: Applebot-Extended
Disallow: /
User-agent: anthropic-ai
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: Omgili
Disallow: /
User-agent: Diffbot
Disallow: /
User-agent: Bytespider
Disallow: /
User-agent: PerplexityBot
Disallow: /
User-agent: cohere-ai
Disallow: /
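
One way to sanity-check rules like these is to feed them to a standards-following parser and ask what it would conclude. The sketch below uses Python's standard `urllib.robotparser` on a shortened copy of the file; the helper name `is_blocked` and the `example.com` URL are placeholders invented for illustration, not part of the gist. It only shows how a compliant crawler should read the rules and does nothing to stop a crawler that ignores them.

```python
from urllib.robotparser import RobotFileParser

# Shortened copy of the rules above; the full file follows the same pattern.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
"""


def is_blocked(user_agent: str, url: str = "https://example.com/any/page") -> bool:
    """Return True if a compliant parser would refuse `url` for `user_agent`.

    `is_blocked` and the example.com URL are hypothetical, for illustration only.
    """
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made.
    parser.parse(ROBOTS_TXT.splitlines())
    return not parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("GPTBot", "CCBot", "ClaudeBot", "Googlebot"):
        print(f"{agent}: {'blocked' if is_blocked(agent) else 'allowed'}")
```

Agents that are not listed, such as `Googlebot` here, remain allowed because the file has no `User-agent: *` group.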

edmondburnett commented Jun 18, 2024

There are limitations to this. The most obvious one is that many bots will simply not honor your robots.txt.

Also, Microsoft and Google have been shown to essentially bypass the need for direct crawling by their AI bots, training their models instead on data already collected and cached from your site by the Bing and Google search engines. This loophole unfortunately makes it impossible to opt out of AI training without also removing your site from the major search engines.
