@dangovorenefekt
Last active May 9, 2024 23:39
Block Meta and Twitter (nginx)
  1. Modify /etc/nginx/nginx.conf file
  2. Modify /etc/nginx/sites-available/site.conf file
  3. Create /etc/nginx/useragent.rules file

Where to find user agent strings?
https://explore.whatismybrowser.com/useragents/explore/software_name/facebook-bot/
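For example, Meta's crawler typically announces itself with a user agent string along these lines (exact versions vary; check the database above for current strings):

facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)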

Looking for the same, but for Apache2? See: https://techexpert.tips/apache/apache-blocking-bad-bots-crawlers/

Test:

[rubin@reaper ~]$ curl -A "instagram" -I https://plrm.podcastalot.com
HTTP/2 418
server: nginx/1.18.0
date: Mon, 26 Jun 2023 06:07:25 GMT
content-type: text/html
content-length: 197
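For comparison, a request with an ordinary browser user agent (one that matches no pattern in the map below) should still get the site's normal response; the exact headers will differ on your server:

curl -A "Mozilla/5.0 (X11; Linux x86_64)" -I https://plrm.podcastalot.com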
/etc/nginx/nginx.conf (step 1):

http {
    .....
    include /etc/nginx/useragent.rules;
}
/etc/nginx/sites-available/site.conf (step 2):

server {
    ....
    if ($badagent) {
        return 418;
    }
    ....
}
/etc/nginx/useragent.rules (step 3):

map $http_user_agent $badagent {
    default               0;
    ~*AdsBot-Google       1;
    ~*Amazonbot           1;
    ~*anthropic-ai        1;
    ~*AwarioRssBot        1;
    ~*AwarioSmartBot      1;
    ~*Bytespider          1;
    ~*CCBot               1;
    ~*ChatGPT-User        1;
    ~*ClaudeBot           1;
    ~*Claude-Web          1;
    ~*cohere-ai           1;
    ~*DataForSeoBot       1;
    ~*FacebookBot         1;
    ~*facebookexternalhit 1;
    ~*facebook            1;
    ~*facebot             1;
    ~*Google-Extended     1;
    ~*GPTBot              1;
    ~*ImagesiftBot        1;
    ~*magpie-crawler      1;
    ~*omgili              1;
    ~*omgilibot           1;
    ~*peer39_crawler      1;
    ~*peer39_crawler/1.0  1;
    ~*PerplexityBot       1;
    ~*YouBot              1;
    ~*instagram           1;
    ~*tweet               1;
    ~*tweeter             1;
}
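Once all three files are in place, it is worth validating the configuration and reloading nginx before relying on it (commands assume a typical systemd-based install; adjust to your setup):

sudo nginx -t
sudo systemctl reload nginx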
@dangovorenefekt (Author)

Blocking Common Crawl too:

 ~*ccbot         1;

Those trillion-dollar organizations have a proven record of being dishonest.
There is nothing to assume here.
They have been caught red-handed many times. Read the news. To believe in their honesty is a dangerous delusion.

Yes, I am aware there are many ways they can obtain my content, but that does not mean I should give it up for free without making them pay for it - if not me, then some third party. I don't mind, as long as they waste some cents to get it.

@BloodyIron

Thanks for this! Tuning it to my needs, but this should help me with a ClaudeBot and Amazonbot scraping issue I'm having - fuckers not respecting reasonable traffic rates! Some of my sites have crashed as a result... Not interested in feeding LLMs for $000FREEE while they make money off the backs of my work.
