@hans2103
Last active February 18, 2024 11:39
NGINX to block bad bots. (add Twenga|TwengaBot if you want to exclude them too)
if ($http_user_agent ~* (360Spider|80legs.com|Abonti|AcoonBot|Acunetix|adbeat_bot|AddThis.com|adidxbot|ADmantX|AhrefsBot|AngloINFO|Antelope|Applebot|BaiduSpider|BeetleBot|billigerbot|binlar|bitlybot|BlackWidow|BLP_bbot|BoardReader|Bolt\ 0|BOT\ for\ JCE|Bot\ mailto\:craftbot@yahoo\.com|casper|CazoodleBot|CCBot|checkprivacy|ChinaClaw|chromeframe|Clerkbot|Cliqzbot|clshttp|CommonCrawler|comodo|CPython|crawler4j|Crawlera|CRAZYWEBCRAWLER|Curious|Curl|Custo|CWS_proxy|Default\ Browser\ 0|diavol|DigExt|Digincore|DIIbot|discobot|DISCo|DoCoMo|DotBot|Download\ Demon|DTS.Agent|EasouSpider|eCatch|ecxi|EirGrabber|Elmer|EmailCollector|EmailSiphon|EmailWolf|Exabot|ExaleadCloudView|ExpertSearchSpider|ExpertSearch|Express\ WebPictures|ExtractorPro|extract|EyeNetIE|Ezooms|F2S|FastSeek|feedfinder|FeedlyBot|FHscan|finbot|Flamingo_SearchEngine|FlappyBot|FlashGet|flicky|Flipboard|g00g1e|Genieo|GetRight|GetWeb\!|GigablastOpenSource|GozaikBot|Go\!Zilla|Go\-Ahead\-Got\-It|GrabNet|grab|Grafula|GrapeshotCrawler|GTB5|GT\:\:WWW|Guzzle|harvest|heritrix|HMView|HomePageBot|HTTP\:\:Lite|HTTrack|HubSpot|ia_archiver|icarus6|IDBot|id\-search|IlseBot|Image\ Stripper|Image\ Sucker|Indigonet|Indy\ Library|integromedb|InterGET|InternetSeer\.com|Internet\ Ninja|IRLbot|ISC\ Systems\ iRc\ Search\ 2\.1|jakarta|Java|JetCar|JobdiggerSpider|JOC\ Web\ Spider|Jooblebot|kanagawa|KINGSpider|kmccrew|larbin|LeechFTP|libwww|Lingewoud|LinkChecker|linkdexbot|LinksCrawler|LinksManager\.com_bot|linkwalker|LinqiaRSSBot|LivelapBot|ltx71|LubbersBot|lwp\-trivial|Mail.RU_Bot|masscan|Mass\ Downloader|maverick|Maxthon$|Mediatoolkitbot|MegaIndex|MFC_Tear_Sample|Microsoft\ URL\ Control|microsoft\.url|MIDown\ tool|miner|Missigua\ Locator|Mister\ PiX|mj12bot|Mozilla.*Indy|Mozilla.*NEWT|MSFrontPage|msnbot|Navroad|NearSite|NetAnts|netEstate|NetSpider|NetZIP|Net\ Vampire|NextGenSearchBot|nutch|Octopus|Offline\ Explorer|Offline\ Navigator|OpenindexSpider|OpenWebSpider|OrangeBot|Owlin|PageGrabber|PagesInventory|panopta|panscient\.com|Papa\ Foto|pavuk|pcBrowser|PECL\:\:HTTP|PeoplePal|Photon|PHPCrawl|planetwork|PleaseCrawl|PNAMAIN.EXE|PodcastPartyBot|prijsbest|proximic|psbot|purebot|pycurl|QuerySeekerSpider|R6_CommentReader|R6_FeedFetcher|RealDownload|ReGet|Riddler|Rippers\ 0|rogerbot|RSSingBot|rv\:1.9.1|RyzeCrawler|SafeSearch|SBIder|Scrapy|Screaming|SeaMonkey$|search.goo.ne.jp|SearchmetricsBot|search_robot|SemrushBot|Semrush|SentiBot|SEOkicks|SeznamBot|ShowyouBot|SightupBot|SISTRIX|sitecheck\.internetseer\.com|siteexplorer.info|SiteSnagger|skygrid|Slackbot|Slurp|SmartDownload|Snoopy|Sogou|Sosospider|spaumbot|Steeler|sucker|SuperBot|Superfeedr|SuperHTTP|SurdotlyBot|Surfbot|tAkeOut|Teleport\ Pro|TinEye-bot|TinEye|Toata\ dragostea\ mea\ pentru\ diavola|Toplistbot|trendictionbot|TurnitinBot|turnit|Twitterbot|URI\:\:Fetch|urllib|Vagabondo|vikspider|VoidEYE|VoilaBot|WBSearchBot|webalta|WebAuto|WebBandit|WebCollage|WebCopier|WebFetch|WebGo\ IS|WebLeacher|WebReaper|WebSauger|Website\ eXtractor|Website\ Quester|WebStripper|WebWhacker|WebZIP|Web\ Image\ Collector|Web\ Sucker|Wells\ Search\ II|WEP\ Search|WeSEE|Wget|Widow|WinInet|woobot|woopingbot|worldwebheritage.org|Wotbox|WPScan|WWWOFFLE|WWW\-Mechanize|Xaldon\ WebSpider|XoviBot|yacybot|Yahoo|YandexBot|Yandex|YisouSpider|zermelo|Zeus|zh-CN|ZmEu|ZumBot|ZyBorg) ) {
    return 410;
}
@philippeowagner

Thanks for sharing this. Works like a charm, but I would suggest using HTTP 444 instead of 410.

@kenguish

Thanks. You might want to add libcurl and libwww-perl too.

@extensionsapp

extensionsapp commented Jul 31, 2017

These are good search bots. Why are they on the list?
Yahoo|YandexBot|Yandex|Twitterbot

@imagina

imagina commented Mar 21, 2018

We found another bad bot scanning our servers: trovitBot

@dmitryd

dmitryd commented Feb 19, 2019

@extensionsapp Yandex tends to be too aggressive.

@precogtyrant

Hello,
Thanks for the code. However, the list also contains Yahoo. Does this mean the Yahoo search engine's bot? I would rather not block that one ;)

@Vish-was

Vish-was commented Sep 6, 2019

Still seeing "Mozilla/5.0 (compatible; MJ12bot/v1.4.8; http://mj12bot.com/)" in my access.log

@Vish-was

Vish-was commented Sep 9, 2019

Hi. This won't work in nginx.conf.
Also, I managed to keep some bots away via robots.txt:

User-agent: MJ12bot
User-agent: SemrushBot
User-agent: Yandex
User-agent: YandexBot
User-agent: UptimeRobot
User-agent: AhrefsBot
User-agent: GoogleBot
User-agent: BingBot
Disallow: /

But some are still there, like GoogleBot and BingBot.

@dmhendricks

Still seeing "Mozilla/5.0 (compatible; MJ12bot/v1.4.8; http://mj12bot.com/)" in my access.log

access_log off;
return 444;

@qaisjp

qaisjp commented Apr 29, 2020

Still seeing "Mozilla/5.0 (compatible; MJ12bot/v1.4.8; http://mj12bot.com/)" in my access.log

access_log off;
return 444;

If it says:

nginx: [emerg] "access_log" directive is not allowed here

Put the if block inside your location directive, as per https://nginx.org/en/docs/http/ngx_http_log_module.html#access_log:

Context: http, server, location, if in location, limit_except
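
A minimal sketch of that placement, assuming a plain static site. The server name, document root, and the shortened three-bot pattern are placeholders for illustration, not part of the original gist:

```nginx
server {
    listen 80;
    server_name example.com;          # placeholder host name

    location / {
        # "if in location" is one of the contexts where access_log is allowed
        if ($http_user_agent ~* (MJ12bot|AhrefsBot|SemrushBot)) {
            access_log off;           # do not log the blocked request
            return 444;               # close the connection without a response
        }

        root /var/www/html;           # placeholder document root
    }
}
```

Running `nginx -t` after the change should confirm whether the directive is now accepted. The if block has to be repeated in every location that should drop and not log these requests.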

@Inner-Creator

Hey guys, how about a piece of code that lets the bots listed in the .htaccess file crawl my website and blocks all other bots that are not listed in the file? Is that even possible?

@hans2103
Author

hans2103 commented Jul 4, 2023

@Small-Being Change the logic of the if statement to check if-not-in-list instead of if-in-list.
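
A rough sketch of the if-not-in-list idea in nginx terms (this is not .htaccess syntax). A bare `!~*` test against an allowlist would also reject ordinary browsers, so the sketch assumes that crawlers announce themselves with markers such as `bot`, `crawl`, or `spider`. The allowlist entries and the variable name `$block_unlisted_bot` are hypothetical:

```nginx
# http context
map $http_user_agent $block_unlisted_bot {
    default                        0;   # anything that does not look like a bot
    ~*(googlebot|bingbot)          0;   # hypothetical allowlist, checked first
    ~*(bot|crawl|spider)           1;   # every other self-declared crawler
}

server {
    listen 80;

    if ($block_unlisted_bot) {
        return 444;
    }
}
```

Regular expressions in a map are checked in order of appearance, so the allowlist line has to come before the generic one. Clients that send a browser-like User-Agent are not caught by this approach.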

@Inner-Creator

@hans2103 Thanks for the solution. It would be great if you could post the piece of code I should apply in my WP .htaccess file. TIA 👍

@qaisjp

qaisjp commented Jul 4, 2023

This gist is about nginx. If your WordPress instance makes use of .htaccess files, that's a different technology called Apache HTTP Server, sorry.

@Devastatia

Here are some I block. Some may be duplicates of what you already have.

A lot of homebrew crawlers running on EC2 and other cloud hosts use HeadlessChrome.

SummalyBot, Mastodon, and Misskey are used to create a link preview when a user posts a link on a Mastodon instance. That wouldn't be so bad, except they send 200+ bots at the same time to verify one link.

facebookexternalhit is used for the same thing. I'm banned from Faecesbook, so I block their bot. 🤷

ByteSpider may be a legit search engine crawler, but its operator is also the largest AI firm in China. These firms scrape websites for content to train AIs, which is IP theft IMO.

'HeadlessChrome',
'trendiction.de',
'Bytespider',
'ahrefs',
'okhttp',
'SemrushBot',
'cpp-httplib',
'aiohttp',
'Go-http-client',
'Ruby',
'curl',
'python-requests',
'facebookexternalhit',
'DataForSeoBot',
'Python',
'Mastodon',
'SummalyBot',
'got',
'Misskey',
'IonCrawl',
't3versions',
'Dataprovider.com'
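
One way to keep a list like this maintainable in nginx is a map with one pattern per line in place of a single long alternation. This is a sketch under assumptions: the variable name `$blocked_agent` is made up, only some of the entries above are shown, and anchoring `curl` and `Python` with `^` is a guess at how those clients identify themselves:

```nginx
# http context
map $http_user_agent $blocked_agent {
    default                  0;
    ~*HeadlessChrome         1;
    ~*trendiction\.de        1;
    ~*Bytespider             1;
    ~*SemrushBot             1;
    ~*python-requests        1;
    ~*Go-http-client         1;
    ~*facebookexternalhit    1;
    ~*^curl/                 1;   # assumed prefix form, e.g. "curl/8.x"
    ~*^Python                1;   # assumed prefix form, e.g. "Python-urllib/3.x"
}

server {
    listen 80;

    if ($blocked_agent) {
        return 444;
    }
}
```

The remaining entries would follow the same pattern. Short tokens such as `got` or `Ruby` are worth anchoring or testing against real logs first, since an unanchored substring match can hit unrelated User-Agent strings.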
