if ($http_user_agent ~* (360Spider|80legs.com|Abonti|AcoonBot|Acunetix|adbeat_bot|AddThis.com|adidxbot|ADmantX|AhrefsBot|AngloINFO|Antelope|Applebot|BaiduSpider|BeetleBot|billigerbot|binlar|bitlybot|BlackWidow|BLP_bbot|BoardReader|Bolt\ 0|BOT\ for\ JCE|Bot\ mailto\:craftbot@yahoo\.com|casper|CazoodleBot|CCBot|checkprivacy|ChinaClaw|chromeframe|Clerkbot|Cliqzbot|clshttp|CommonCrawler|comodo|CPython|crawler4j|Crawlera|CRAZYWEBCRAWLER|Curious|Curl|Custo|CWS_proxy|Default\ Browser\ 0|diavol|DigExt|Digincore|DIIbot|discobot|DISCo|DoCoMo|DotBot|Download\ Demon|DTS.Agent|EasouSpider|eCatch|ecxi|EirGrabber|Elmer|EmailCollector|EmailSiphon|EmailWolf|Exabot|ExaleadCloudView|ExpertSearchSpider|ExpertSearch|Express\ WebPictures|ExtractorPro|extract|EyeNetIE|Ezooms|F2S|FastSeek|feedfinder|FeedlyBot|FHscan|finbot|Flamingo_SearchEngine|FlappyBot|FlashGet|flicky|Flipboard|g00g1e|Genieo|genieo|GetRight|GetWeb\!|GigablastOpenSource|GozaikBot|Go\!Zilla|Go\-Ahead\-Got\-It|GrabNet|grab|Grafula|GrapeshotCrawler|GTB5|GT\:\:WWW|Guzzle|harvest|heritrix|HMView|HomePageBot|HTTP\:\:Lite|HTTrack|HubSpot|ia_archiver|icarus6|IDBot|id\-search|IlseBot|Image\ Stripper|Image\ Sucker|Indigonet|Indy\ Library|integromedb|InterGET|InternetSeer\.com|Internet\ Ninja|IRLbot|ISC\ Systems\ iRc\ Search\ 2\.1|jakarta|Java|JetCar|JobdiggerSpider|JOC\ Web\ Spider|Jooblebot|kanagawa|KINGSpider|kmccrew|larbin|LeechFTP|libwww|Lingewoud|LinkChecker|linkdexbot|LinksCrawler|LinksManager\.com_bot|linkwalker|LinqiaRSSBot|LivelapBot|ltx71|LubbersBot|lwp\-trivial|Mail.RU_Bot|masscan|Mass\ Downloader|maverick|Maxthon$|Mediatoolkitbot|MegaIndex|MegaIndex|megaindex|MFC_Tear_Sample|Microsoft\ URL\ Control|microsoft\.url|MIDown\ tool|miner|Missigua\ Locator|Mister\ PiX|mj12bot|Mozilla.*Indy|Mozilla.*NEWT|MSFrontPage|msnbot|Navroad|NearSite|NetAnts|netEstate|NetSpider|NetZIP|Net\ Vampire|NextGenSearchBot|nutch|Octopus|Offline\ Explorer|Offline\ Navigator|OpenindexSpider|OpenWebSpider|OrangeBot|Owlin|PageGrabber|PagesInventory|panopta|panscient\.com|Papa\ Foto|pavuk|pcBrowser|PECL\:\:HTTP|PeoplePal|Photon|PHPCrawl|planetwork|PleaseCrawl|PNAMAIN.EXE|PodcastPartyBot|prijsbest|proximic|psbot|purebot|pycurl|QuerySeekerSpider|R6_CommentReader|R6_FeedFetcher|RealDownload|ReGet|Riddler|Rippers\ 0|rogerbot|RSSingBot|rv\:1.9.1|RyzeCrawler|SafeSearch|SBIder|Scrapy|Scrapy|Screaming|SeaMonkey$|search.goo.ne.jp|SearchmetricsBot|search_robot|SemrushBot|Semrush|SentiBot|SEOkicks|SeznamBot|ShowyouBot|SightupBot|SISTRIX|sitecheck\.internetseer\.com|siteexplorer.info|SiteSnagger|skygrid|Slackbot|Slurp|SmartDownload|Snoopy|Sogou|Sosospider|spaumbot|Steeler|sucker|SuperBot|Superfeedr|SuperHTTP|SurdotlyBot|Surfbot|tAkeOut|Teleport\ Pro|TinEye-bot|TinEye|Toata\ dragostea\ mea\ pentru\ diavola|Toplistbot|trendictionbot|TurnitinBot|turnit|Twitterbot|URI\:\:Fetch|urllib|Vagabondo|Vagabondo|vikspider|VoidEYE|VoilaBot|WBSearchBot|webalta|WebAuto|WebBandit|WebCollage|WebCopier|WebFetch|WebGo\ IS|WebLeacher|WebReaper|WebSauger|Website\ eXtractor|Website\ Quester|WebStripper|WebWhacker|WebZIP|Web\ Image\ Collector|Web\ Sucker|Wells\ Search\ II|WEP\ Search|WeSEE|Wget|Widow|WinInet|woobot|woopingbot|worldwebheritage.org|Wotbox|WPScan|WWWOFFLE|WWW\-Mechanize|Xaldon\ WebSpider|XoviBot|yacybot|Yahoo|YandexBot|Yandex|YisouSpider|zermelo|Zeus|zh-CN|ZmEu|ZumBot|ZyBorg) ) {
    return 410;
}
I'm still getting "Mozilla/5.0 (compatible; MJ12bot/v1.4.8; http://mj12bot.com/)" in my access.log.
Hi, this won't work in the nginx.conf settings.
Also, I managed to remove some bots via robots.txt:
User-agent: MJ12bot
User-agent: SemrushBot
User-agent: Yandex
User-agent: YandexBot
User-agent: UptimeRobot
User-agent: AhrefsBot
User-agent: GoogleBot
User-agent: BingBot
Disallow: /
But some are still there, like GoogleBot and BingBot.
Still giving this "Mozilla/5.0 (compatible; MJ12bot/v1.4.8; http://mj12bot.com/)" in my access.log
access_log off;
return 444;
If it says:
nginx: [emerg] "access_log" directive is not allowed here
Put the if block inside your location directive, as per https://nginx.org/en/docs/http/ngx_http_log_module.html#access_log:
Context: http, server, location, if in location, limit_except
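A minimal sketch of that placement, assuming a plain server block nested inside the usual http context. The host name, the document root, and the shortened bot pattern are placeholders for illustration, not part of the gist:

```nginx
server {
    listen 80;
    server_name example.com;  # placeholder host name (assumption)

    location / {
        # Shortened pattern for illustration; substitute the full list from the gist.
        if ($http_user_agent ~* (MJ12bot|SemrushBot|AhrefsBot)) {
            access_log off;  # accepted here because "if in location" is a valid context
            return 444;      # nginx-specific code: close the connection without a response
        }

        # Normal handling for everyone else (assumes static files in this directory).
        root /var/www/html;
    }
}
```

Running `nginx -t` after the change should show whether the `[emerg]` error is gone before you reload.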
Hey guys, how about a piece of code that lets the bots listed in the .htaccess file crawl my website and blocks all other bots that are not listed in the file? Is that even possible?
@Small-Being Change the if statement so that it checks for not-in-list instead of in-list.
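One way to sketch that inversion. A bare `!~*` test against an allow list would also reject ordinary browsers, so this version first checks whether the user agent looks like a bot at all and exempts the allowed ones. The `$block_bot` variable name, the two allowed bots, and the generic `bot|crawl|spider` pattern are assumptions for illustration:

```nginx
# The map block belongs in the http {} context.
map $http_user_agent $block_bot {
    default                  0;  # browsers and anything not bot-like pass through
    "~*(googlebot|bingbot)"  0;  # allowed bots; listed first so this regex matches first
    "~*(bot|crawl|spider)"   1;  # every other self-identified bot
}

server {
    listen 80;
    server_name example.com;  # placeholder host name (assumption)

    # An empty string or "0" counts as false, so only flagged agents are rejected.
    if ($block_bot) {
        return 403;
    }
}
```

Regular expressions in a map are tried in order of appearance, which is why the allow list comes before the generic pattern. Bots that send a browser-like user agent are not caught by this.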
@hans2103 Thanks for the solution. It would be great if you could post the piece of code I should apply in my WP .htaccess file. TIA 👍
This gist is about nginx. If your WordPress instance makes use of .htaccess files, that's a different technology called Apache HTTP Server, sorry.
Here are some I block. Some may be duplicates of what you already have.
A lot of homebrew crawlers running on EC2 and other cloud hosts use HeadlessChrome.
SummalyBot, Mastodon, and Misskey are used to create a link preview when a user posts a link on a Mastodon instance. That wouldn't be so bad, except they send 200+ bots at the same time to verify one link.
facebookexternalhit is used for the same thing. I'm banned from Faecesbook, so I block their bot. 🤷
ByteSpider may belong to a legit search engine, but its operator is also the largest AI firm in China. These firms scrape websites for content to train AIs, which is IP theft IMO.
'HeadlessChrome',
'trendiction.de',
'Bytespider',
'ahrefs',
'okhttp',
'SemrushBot',
'cpp-httplib',
'aiohttp',
'Go-http-client',
'Ruby',
'curl',
'python-requests',
'facebookexternalhit',
'DataForSeoBot',
'Python',
'Mastodon',
'SummalyBot',
'got',
'Misskey',
'IonCrawl',
't3versions',
'Dataprovider.com'
Hello,
Thanks for the code. However, it also contains Yahoo in the list. Does this mean the Yahoo search engine's bot? I would rather not block that one ;)