Block Meta and Twitter (nginx)
@dangovorenefekt · Last active April 26, 2024 11:12
  1. Modify the /etc/nginx/nginx.conf file (add the include shown below)
  2. Modify the /etc/nginx/sites-available/site.conf file (add the $badagent check shown below)
  3. Create the /etc/nginx/useragent.rules file (the map block shown below)

Where to find user agent strings?
https://explore.whatismybrowser.com/useragents/explore/software_name/facebook-bot/

Looking for the same but for Apache2? Here: https://techexpert.tips/apache/apache-blocking-bad-bots-crawlers/
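For reference, a rough Apache 2.4 equivalent of the same idea (a sketch only, assuming mod_setenvif and mod_authz_core are enabled; trim the agent regex to the list you actually want to block):

# vhost or .htaccess (Apache 2.4, sketch)
SetEnvIfNoCase User-Agent "facebookexternalhit|facebot|instagram|GPTBot|CCBot" badagent
<RequireAll>
    Require all granted
    Require not env badagent
</RequireAll>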

Test:

[rubin@reaper ~]$ curl -A "instagram" -I https://plrm.podcastalot.com
HTTP/2 418
server: nginx/1.18.0
date: Mon, 26 Jun 2023 06:07:25 GMT
content-type: text/html
content-length: 197
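To spot-check a few more of the blocked agents (example.com below is a placeholder; substitute your own host), a quick loop works:

for ua in facebookexternalhit GPTBot CCBot PerplexityBot; do
    curl -s -o /dev/null -w "%{http_code}  $ua\n" -A "$ua" https://example.com/
done

Each blocked agent should come back as 418, while a normal browser user agent still gets 200.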
# /etc/nginx/nginx.conf
http {
    ...
    include /etc/nginx/useragent.rules;
}
# /etc/nginx/sites-available/site.conf
server {
    ...
    if ($badagent) {
        return 418;
    }
    ...
}
# /etc/nginx/useragent.rules
map $http_user_agent $badagent {
    default                 0;
    ~*AdsBot-Google         1;
    ~*Amazonbot             1;
    ~*anthropic-ai          1;
    ~*AwarioRssBot          1;
    ~*AwarioSmartBot        1;
    ~*Bytespider            1;
    ~*CCBot                 1;
    ~*ChatGPT-User          1;
    ~*ClaudeBot             1;
    ~*Claude-Web            1;
    ~*cohere-ai             1;
    ~*DataForSeoBot         1;
    ~*FacebookBot           1;
    ~*facebookexternalhit   1;
    ~*facebook              1;
    ~*facebot               1;
    ~*Google-Extended       1;
    ~*GPTBot                1;
    ~*ImagesiftBot          1;
    ~*magpie-crawler        1;
    ~*omgili                1;
    ~*omgilibot             1;
    ~*peer39_crawler        1;
    ~*peer39_crawler/1.0    1;
    ~*PerplexityBot         1;
    ~*YouBot                1;
    ~*instagram             1;
    ~*tweet                 1;
    ~*tweeter               1;
}
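After creating or editing these files, validate the configuration and reload nginx (standard commands; adjust for your init system):

sudo nginx -t && sudo systemctl reload nginx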
@kisamoto

Worth noting that most of these bots are 'good bots' (i.e. they will obey robots.txt). So you can avoid the nginx resource usage entirely by adding suitable robots.txt entries.

I think using nginx tests like this could have negative effects on showing OpenGraph metadata (including images).

If choosing this approach, however, I would probably respond with a 403 (Forbidden) instead, since bots are more likely to keep retrying if they think the server might come back online.
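For reference, the robots.txt version of that list looks roughly like this (one group per crawler; the token names below are the ones these crawlers document, but double-check each crawler's docs):

User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: FacebookBot
Disallow: /

User-agent: Google-Extended
Disallow: /

...and so on for the remaining agents.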

@xvilo commented Jun 26, 2023

+1 for @kisamoto

The solution here, although it will work, is not great

@dangovorenefekt (Author)

You miss the point. I don't want my content on those sites in any form, and I don't want my content to feed their algorithms.
Using robots.txt assumes they will 'obey' it. But they may choose not to. It's not mandatory in any way.

@kisamoto

If there is a risk that a bot does not obey robots.txt, why assume it will obey the user-agent rules rather than simply presenting a legitimate browser user agent?

If you're serious about blocking access to your content, I would:

  • Use robots.txt (even if you think they might not obey it, it's a good start);
  • Use your method (for when bad bots are honest about their user agent);
  • Block known Meta/Twitter etc. IP CIDR ranges (for bad bots with a regular user agent);
  • Rate limit (to stop aggressive crawling); a rough nginx sketch of the last two points follows below.
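A minimal sketch of those last two points in nginx (the CIDR ranges here are documentation placeholders, not real Meta/Twitter ranges, and the zone name and rate are arbitrary):

# in the http block
geo $blocked_net {
    default          0;
    192.0.2.0/24     1;   # placeholder CIDR - replace with the crawler ranges you want to block
    198.51.100.0/24  1;   # placeholder CIDR
}
limit_req_zone $binary_remote_addr zone=crawlers:10m rate=1r/s;

# in the server block
if ($blocked_net) {
    return 403;
}
limit_req zone=crawlers burst=5 nodelay;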

@dangovorenefekt (Author)

That is right. This measure alone won't do much; it has to be part of a more complex approach.

@Seirdy commented Jul 6, 2023

I don’t want my content on those sites in any form, and I don’t want my content to feed their algorithms. Using robots.txt assumes they will ‘obey’ it. But they may choose not to.

@dangovorenefekt So you’d like to block the scrapers used by some large companies, most of which appear well-behaved (i.e. they observe robots.txt and noindex directives). Some organizations disclose IPs they use, but you are banking on their honesty.

If we assume that trillion-dollar organizations are dishonest about how they scrape: they can spoof their user-agent, TLS fingerprint, IPs, etc. and use a headless browser very easily. There isn’t really a way to protect yourself from this without also excluding real users (e.g. invasive/inaccessible CAPTCHAs for which workarounds exist).

They can get your content without scraping by downloading other data sets like the Common Crawl (Google did this for Bard), purchasing data sets from other vendors, or acquiring other companies with their own indexes.

The alternative is to assume they’re at least somewhat honest about scraping content. If you use a noindex robots directive in your markup and HTTP headers but allow crawling, their crawlers will visit but won’t index your site no matter what user-agent or data set they use. Check their webmaster documentation to double-check their support for these features.
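For example, the HTTP-header form of a noindex directive can be set in nginx; the markup form is a <meta name="robots" content="noindex"> tag in the page head. A minimal sketch:

# in the server (or a location) block: send noindex as an HTTP response header
add_header X-Robots-Tag "noindex" always;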


POSSE note from https://seirdy.one/notes/2023/07/06/blocking-certain-bots/

@dangovorenefekt (Author)

Blocking Common Crawl too:

 ~*ccbot         1;

Those trillion-dollar organizations have a proven record of being dishonest.
There is nothing to assume here.
They have been caught red-handed many times; read the news. To believe in their honesty is a dangerous delusion.

Yes, I am aware there are many ways they can obtain my content, but that does not mean I should give it up for free without making them pay for it, if not to me then to some third party. I don't mind, as long as they have to waste a few cents to get it.
