- Modify /etc/nginx/nginx.conf file
- Modify /etc/nginx/sites-available/site.conf file
- Create /etc/nginx/useragent.rule file
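The contents of the three files are not reproduced here. As a rough sketch of how they might fit together, assuming the blocking is done with a `map` on `$http_user_agent`: the variable name `$blocked_agent` and the two patterns are hypothetical placeholders, and the 418 status is chosen only to match the test output further down.

```nginx
# /etc/nginx/useragent.rule (hypothetical contents)
# Sets $blocked_agent to 1 when the User-Agent matches a listed pattern.
# "~*" marks a case-insensitive regular expression.
map $http_user_agent $blocked_agent {
    default                 0;
    ~*instagram             1;
    ~*facebookexternalhit   1;
}

# /etc/nginx/nginx.conf, inside the http { } block:
# map is only valid at http level, so the rule file is included there.
include /etc/nginx/useragent.rule;

# /etc/nginx/sites-available/site.conf, inside the server { } block:
# reject any request whose User-Agent matched a pattern above.
if ($blocked_agent) {
    return 418;
}
```

After editing, `nginx -t` checks the syntax before reloading.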
Where to find user agent strings?
https://explore.whatismybrowser.com/useragents/explore/software_name/facebook-bot/
Looking for the same thing for Apache2? See: https://techexpert.tips/apache/apache-blocking-bad-bots-crawlers/
Test:
[rubin@reaper ~]$ curl -A "instagram" -I https://plrm.podcastalot.com
HTTP/2 418
server: nginx/1.18.0
date: Mon, 26 Jun 2023 06:07:25 GMT
content-type: text/html
content-length: 197
@dangovorenefekt So you’d like to block the scrapers used by some large companies, most of which appear well-behaved (i.e. they observe robots.txt and noindex directives). Some organizations disclose IPs they use, but you are banking on their honesty.
If we assume that trillion-dollar organizations are dishonest about how they scrape: they can spoof their user-agent, TLS fingerprint, IPs, etc. and use a headless browser very easily. There isn’t really a way to protect yourself from this without also excluding real users (e.g. invasive/inaccessible CAPTCHAs for which workarounds exist).
They can get your content without scraping by downloading other data sets like the Common Crawl (Google did this for Bard), purchasing data sets from other vendors, or acquiring other companies with their own indexes.
The alternative is to assume they’re at least somewhat honest about scraping content. If you use a noindex robots directive in your markup and HTTP headers but allow crawling, their crawlers will visit but won’t index your site no matter what user-agent or data set they use. Check their webmaster documentation to double-check their support for these features.
POSSE note from https://seirdy.one/notes/2023/07/06/blocking-certain-bots/
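For the noindex approach described in this comment, a minimal sketch of sending the directive as an HTTP header from nginx. Placing it in the server block is an assumption; the markup equivalent is a `<meta name="robots" content="noindex">` tag in each page's head.

```nginx
# Inside the server { } block of the site config (placement is an assumption).
# "always" sends the header on error responses too, not just 2xx/3xx.
add_header X-Robots-Tag "noindex" always;
```

Note that nginx only inherits `add_header` from an outer level when the current level defines no `add_header` of its own, so a location block with its own headers would need this line repeated.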