Last active April 7, 2024 14:28
NGINX to block bad bots. (add Twenga|TwengaBot if you want to exclude them too)
# Return 410 (Gone) to any client whose User-Agent matches the list (case-insensitive).
if ($http_user_agent ~* (360Spider|Abonti|AcoonBot|Acunetix|adbeat_bot|adidxbot|ADmantX|AhrefsBot|AngloINFO|Antelope|Applebot|BaiduSpider|BeetleBot|billigerbot|binlar|bitlybot|BlackWidow|BLP_bbot|BoardReader|Bolt\ 0|BOT\ for\ JCE|Bot\ mailto\:craftbot@yahoo\.com|casper|CazoodleBot|CCBot|checkprivacy|ChinaClaw|chromeframe|Clerkbot|Cliqzbot|clshttp|CommonCrawler|comodo|CPython|crawler4j|Crawlera|CRAZYWEBCRAWLER|Curious|Curl|Custo|CWS_proxy|Default\ Browser\ 0|diavol|DigExt|Digincore|DIIbot|discobot|DISCo|DoCoMo|DotBot|Download\ Demon|DTS.Agent|EasouSpider|eCatch|ecxi|EirGrabber|Elmer|EmailCollector|EmailSiphon|EmailWolf|Exabot|ExaleadCloudView|ExpertSearchSpider|ExpertSearch|Express\ WebPictures|ExtractorPro|extract|EyeNetIE|Ezooms|F2S|FastSeek|feedfinder|FeedlyBot|FHscan|finbot|Flamingo_SearchEngine|FlappyBot|FlashGet|flicky|Flipboard|g00g1e|Genieo|GetRight|GetWeb\!|GigablastOpenSource|GozaikBot|Go\!Zilla|Go\-Ahead\-Got\-It|GrabNet|grab|Grafula|GrapeshotCrawler|GTB5|GT\:\:WWW|Guzzle|harvest|heritrix|HMView|HomePageBot|HTTP\:\:Lite|HTTrack|HubSpot|ia_archiver|icarus6|IDBot|id\-search|IlseBot|Image\ Stripper|Image\ Sucker|Indigonet|Indy\ Library|integromedb|InterGET|InternetSeer\.com|Internet\ Ninja|IRLbot|ISC\ Systems\ iRc\ Search\ 2\.1|jakarta|Java|JetCar|JobdiggerSpider|JOC\ Web\ Spider|Jooblebot|kanagawa|KINGSpider|kmccrew|larbin|LeechFTP|libwww|Lingewoud|LinkChecker|linkdexbot|LinksCrawler|LinksManager\.com_bot|linkwalker|LinqiaRSSBot|LivelapBot|ltx71|LubbersBot|lwp\-trivial|Mail.RU_Bot|masscan|Mass\ Downloader|maverick|Maxthon$|Mediatoolkitbot|MegaIndex|MFC_Tear_Sample|Microsoft\ URL\ Control|microsoft\.url|MIDown\ tool|miner|Missigua\ Locator|Mister\ PiX|mj12bot|Mozilla.*Indy|Mozilla.*NEWT|MSFrontPage|msnbot|Navroad|NearSite|NetAnts|netEstate|NetSpider|NetZIP|Net\ Vampire|NextGenSearchBot|nutch|Octopus|Offline\ Explorer|Offline\ Navigator|OpenindexSpider|OpenWebSpider|OrangeBot|Owlin|PageGrabber|PagesInventory|panopta|panscient\.com|Papa\ Foto|pavuk|pcBrowser|PECL\:\:HTTP|PeoplePal|Photon|PHPCrawl|planetwork|PleaseCrawl|PNAMAIN.EXE|PodcastPartyBot|prijsbest|proximic|psbot|purebot|pycurl|QuerySeekerSpider|R6_CommentReader|R6_FeedFetcher|RealDownload|ReGet|Riddler|Rippers\ 0|rogerbot|RSSingBot|rv\:1.9.1|RyzeCrawler|SafeSearch|SBIder|Scrapy|Screaming|SeaMonkey$|SearchmetricsBot|search_robot|SemrushBot|Semrush|SentiBot|SEOkicks|SeznamBot|ShowyouBot|SightupBot|SISTRIX|sitecheck\.internetseer\.com|SiteSnagger|skygrid|Slackbot|Slurp|SmartDownload|Snoopy|Sogou|Sosospider|spaumbot|Steeler|sucker|SuperBot|Superfeedr|SuperHTTP|SurdotlyBot|Surfbot|tAkeOut|Teleport\ Pro|TinEye-bot|TinEye|Toata\ dragostea\ mea\ pentru\ diavola|Toplistbot|trendictionbot|TurnitinBot|turnit|Twitterbot|URI\:\:Fetch|urllib|Vagabondo|vikspider|VoidEYE|VoilaBot|WBSearchBot|webalta|WebAuto|WebBandit|WebCollage|WebCopier|WebFetch|WebGo\ IS|WebLeacher|WebReaper|WebSauger|Website\ eXtractor|Website\ Quester|WebStripper|WebWhacker|WebZIP|Web\ Image\ Collector|Web\ Sucker|Wells\ Search\ II|WEP\ Search|WeSEE|Wget|Widow|WinInet|woobot|woopingbot|Wotbox|WPScan|WWWOFFLE|WWW\-Mechanize|Xaldon\ WebSpider|XoviBot|yacybot|Yahoo|YandexBot|Yandex|YisouSpider|zermelo|Zeus|zh-CN|ZmEu|ZumBot|ZyBorg) ) {
    return 410;
}

Thanks for sharing this. Works like a charm, but I would suggest using 444 instead of 410. 444 is a non-standard nginx code that closes the connection without sending any response.

Thanks. You might want to add libcurl and libwww-perl too.

extensionsapp commented Jul 31, 2017

These are good search bots. Why are they on the list?

imagina commented Mar 21, 2018

We found another bad bot scanning our servers: trovitBot

dmitryd commented Feb 19, 2019

@extensionsapp Yandex tends to be too aggressive.

Thanks for the code. However, the list also contains Yahoo. Does this mean the Yahoo search engine's bot? I would rather not block that one ;)

Vish-was commented Sep 6, 2019

Still giving this "Mozilla/5.0 (compatible; MJ12bot/v1.4.8; in my access.log

Vish-was commented Sep 9, 2019

Hi, this won't work in the nginx.conf settings.
Also, I managed to remove some bots via robots.txt:

User-agent: MJ12bot
User-agent: SemrushBot
User-agent: Yandex
User-agent: YandexBot
User-agent: UptimeRobot
User-agent: AhrefsBot
User-agent: GoogleBot
User-agent: BingBot
Disallow: /

but some are still there, like GoogleBot and BingBot.

Still giving this "Mozilla/5.0 (compatible; MJ12bot/v1.4.8; in my access.log

access_log off;
return 444;

qaisjp commented Apr 29, 2020

Still giving this "Mozilla/5.0 (compatible; MJ12bot/v1.4.8; in my access.log

access_log off;
return 444;

If it says:

nginx: [emerg] "access_log" directive is not allowed here

Put the if block inside your location block. The nginx docs list the contexts where access_log is allowed as:

Context: http, server, location, if in location, limit_except
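
A minimal sketch of that placement, assuming a single catch-all location. The shortened pattern, the listen port, and the example.com server name are placeholders and are not taken from the gist; substitute the full list from the top of the page.

```nginx
server {
    listen 80;
    server_name example.com;  # placeholder name

    location / {
        # Shortened pattern for illustration only.
        if ($http_user_agent ~* (MJ12bot|AhrefsBot|SemrushBot)) {
            access_log off;  # allowed here: this is the "if in location" context
            return 444;      # nginx closes the connection without sending a response
        }

        # ... normal request handling ...
    }
}
```

If the server has other location blocks (for example a regex location for PHP), requests handled there would not pass through this if, so the check would probably need repeating in each of them.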

Hey guys, how about a piece of code that lets only the bots listed in the .htaccess file crawl my website and blocks all other bots that are not listed in the file? Is that even possible?

hans2103 commented Jul 4, 2023

@Small-Being change the logic of the if-statement to check if-not-in-list instead of if-in-list.
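
One way this could look in nginx, as a sketch rather than a tested recipe. Simply negating the match with `!~*` would also reject ordinary browsers, since their user agents are not in the list either, so this version assumes bots can be recognised by generic markers such as "bot", "crawl" or "spider". The `$block_bot` variable, the two allowed bots, and the example.com server name are hypothetical choices for illustration.

```nginx
# Goes in the http {} context. Regexes in a map are tested in order of
# appearance, so the allowlist line must come before the generic pattern.
map $http_user_agent $block_bot {
    default                  0;  # anything that does not look like a bot
    ~*(googlebot|bingbot)    0;  # hypothetical allowlist: these may crawl
    ~*(bot|crawl|spider)     1;  # assumed heuristic for "some other bot"
}

server {
    listen 80;
    server_name example.com;  # placeholder name

    # "0" and the empty string count as false in an nginx if.
    if ($block_bot) {
        return 444;
    }
}
```

Bots that do not identify themselves with one of the generic markers would slip through, so the heuristic line would need tuning against real access logs.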

@hans2103 Thanks for the solution. It would be great if you could post the piece of code I should apply in my WP .htaccess file. TIA 👍

qaisjp commented Jul 4, 2023

This gist is about nginx. If your WordPress instance makes use of .htaccess files, that's a different technology called Apache HTTP Server, sorry.

Here are some I block. Some may be duplicates of what you already have.

A lot of homebrew crawlers running on EC2 and other cloud hosts use HeadlessChrome.

SummalyBot, Mastodon, and Misskey are used to create a link preview when a user posts a link on a Mastodon instance. That wouldn't be so bad, except they send 200+ bots at the same time to verify one link.

facebookexternalhit is used for the same thing. I'm banned from Faecesbook, so I block their bot. 🤷

ByteSpider may be a legit search engine crawler, but it also belongs to the largest AI firm in China. These firms scrape websites for content to train AIs, which is IP theft IMO.
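
For reference, the user agents named in this comment could be blocked with the same pattern as the gist. This is a sketch that assumes each bot sends its name in the User-Agent header as spelled here; the choice of 444 follows the earlier suggestion in the thread.

```nginx
# Can sit in a server {} or location {} block.
if ($http_user_agent ~* (HeadlessChrome|SummalyBot|Mastodon|Misskey|facebookexternalhit|Bytespider)) {
    return 444;  # drop the connection without a response
}
```

Blocking Mastodon, Misskey and facebookexternalhit also means links to the site will no longer get previews on those platforms.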

