Apache: Better Blocking with common rules
Following on from other Gists I have posted, this one shows a neat way of using Includes to centralise general blocking rules for bad bots, creepy crawlers and irritating IPs.
See the full post at http://www.blue-bag.com/blog/apache-better-blocking-common-rules
## A list of known problem IPs
# pen test on FCKeditor
SetEnvIfNoCase REMOTE_ADDR "175\.44\.30\.180" BlockedAddress
SetEnvIfNoCase REMOTE_ADDR "175\.44\.29\.92" BlockedAddress
SetEnvIfNoCase REMOTE_ADDR "174\.139\.240\.74" BlockedAddress
# looking for backups
SetEnvIfNoCase REMOTE_ADDR "192\.99\.12\.128" BlockedAddress
# Bad Crawler
SetEnvIfNoCase REMOTE_ADDR "144\.76\.195\.72" BlockedAddress
SetEnvIfNoCase REMOTE_ADDR "54\.189\.47\.213" BlockedAddress
# Java scraper
SetEnvIfNoCase REMOTE_ADDR "62\.116\.110\.111" BlockedAddress
# Big hitter - known spammer
SetEnvIfNoCase REMOTE_ADDR "109\.201\.137\.166" BlockedAddress
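Since SetEnvIfNoCase matches a regular expression against REMOTE_ADDR, adjacent addresses can be folded into one anchored pattern. A sketch covering the two 175.44.x hosts listed above (functionally equivalent to their separate lines):

```apache
# Sketch: one anchored pattern covering several related addresses
# (equivalent to the two separate 175.44.x lines above)
SetEnvIfNoCase REMOTE_ADDR "^175\.44\.(29\.92|30\.180)$" BlockedAddress
```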
# list obtained from 3rd party
# blank user-agents (note: Apache does not allow inline comments after a directive)
SetEnvIfNoCase User-Agent "^$" BlockedAgent
SetEnvIfNoCase User-Agent "Jakarta" BlockedAgent
SetEnvIfNoCase User-Agent "User-Agent" BlockedAgent
SetEnvIfNoCase User-Agent "libwww" BlockedAgent
SetEnvIfNoCase User-Agent "lwp-trivial" BlockedAgent
SetEnvIfNoCase User-Agent "Snoopy" BlockedAgent
SetEnvIfNoCase User-Agent "PHPCrawl" BlockedAgent
SetEnvIfNoCase User-Agent "WEP Search" BlockedAgent
SetEnvIfNoCase User-Agent "Missigua Locator" BlockedAgent
SetEnvIfNoCase User-Agent "ISC Systems iRc" BlockedAgent
SetEnvIfNoCase User-Agent "GbPlugin" BlockedAgent
SetEnvIfNoCase User-Agent "Wget" BlockedAgent
SetEnvIfNoCase User-Agent "EmailSiphon" BlockedAgent
SetEnvIfNoCase User-Agent "EmailWolf" BlockedAgent
SetEnvIfNoCase User-Agent "libwww-perl" BlockedAgent
## end of 3rd party list (note: these could also be disallowed in robots.txt; see the article)
## List derived from actual activity
# Mozilla/5.0 (compatible; BLEXBot/1.0; +http://webmeup-crawler.com/)
SetEnvIfNoCase User-Agent "BLEXBot" BlockedAgent
# Mozilla/5.0 (compatible; 007ac9 Crawler; http://crawler.007ac9.net/)
SetEnvIfNoCase User-Agent "007ac9 Crawler" BlockedAgent
#Mozilla/5.0 (compatible; MJ12bot/v1.4.5; http://www.majestic12.co.uk/bot.php?+)
SetEnvIfNoCase User-Agent "MJ12bot" BlockedAgent
# Fetchbot (https://github.com/PuerkitoBio/fetchbot)
SetEnvIfNoCase User-Agent "Fetchbot" BlockedAgent
#Mozilla/5.0 (compatible; SISTRIX Crawler; http://crawler.sistrix.net/)
SetEnvIfNoCase User-Agent "SISTRIX" BlockedAgent
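mod_setenvif also provides BrowserMatchNoCase, a shorthand for SetEnvIfNoCase against the User-Agent header, which keeps long lists like the one above slightly terser. A sketch rewriting two of the entries:

```apache
# BrowserMatchNoCase "pattern" var  is equivalent to
# SetEnvIfNoCase User-Agent "pattern" var
BrowserMatchNoCase "SISTRIX" BlockedAgent
BrowserMatchNoCase "MJ12bot" BlockedAgent
```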
<VirtualHost *:80>
## Note: this is heavily reduced to show only the relevant lines
## (Expires and security options have been removed).
## Don't paste this verbatim; use it as a reference alongside your own customisations.
ServerName www.example.com
DocumentRoot /var/www/example.com/live/htdocs
<Directory /var/www/example.com/live/htdocs>
Options +FollowSymLinks
# Disable .htaccess files (remember to account for any rules they implement)
AllowOverride None
# Include our blocked lists
Include /etc/apache2/blocked-addresses.conf
Include /etc/apache2/blocked-agents.conf
Order allow,deny
Allow from all
# Deny from our blocked lists
Deny from env=BlockedAddress
Deny from env=BlockedAgent
<IfModule mod_rewrite.c>
RewriteEngine on
# Intercept Microsoft Office Protocol Discovery
# OPTION requests for this were hitting site regularly
RewriteCond %{REQUEST_METHOD} ^OPTIONS
RewriteCond %{HTTP_USER_AGENT} ^Microsoft\ Office\ Protocol\ Discovery [OR]
RewriteCond %{HTTP_USER_AGENT} ^Microsoft\ Office\ Existence\ Discovery [OR]
RewriteCond %{HTTP_USER_AGENT} ^Microsoft\-WebDAV\-MiniRedir.*$
RewriteRule .* - [R=405,L]
##### Security hardening ####
## DENY REQUEST BASED ON REQUEST METHOD ###
RewriteCond %{REQUEST_METHOD} ^(TRACE|TRACK|OPTIONS|HEAD)$ [NC]
RewriteRule ^.*$ - [F]
</IfModule>
</Directory>
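The Order/Allow/Deny directives above use the Apache 2.2 access-control syntax. On Apache 2.4 (mod_authz_core with mod_authz_host), the same effect is achieved with Require; a hedged sketch of an equivalent block:

```apache
<Directory /var/www/example.com/live/htdocs>
    # Apache 2.4 equivalent of Order allow,deny / Allow from all /
    # Deny from env=...: grant everyone, then subtract flagged requests
    <RequireAll>
        Require all granted
        Require not env BlockedAddress
        Require not env BlockedAgent
    </RequireAll>
</Directory>
```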
## The following log configuration shows how to use SetEnvIf
## to include/exclude certain requests (images etc.) from the access log,
## and keeps robots.txt requests logged so robot behaviour can be checked.
## Custom logging for combined logs - filtered to not log images, favicon, css, js etc.
UseCanonicalName On
LogFormat "%V %h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-agent}i\"" vcommon
ErrorLog /var/www/log/customer-error.log
# Possible values include: debug, info, notice, warn, error, crit,
# alert, emerg.
LogLevel warn
## we aren't logging images, css, js etc
## flag robots.txt requests - deliberately set to 0 so the do_not_log
## checks below never match and these requests stay in the log,
## which lets us observe robot behaviour
SetEnvIf Request_URI "^/robots\.txt$" robots-request=0
## flag favicon requests
SetEnvIf Request_URI "^/favicon\.ico$" favicon-request=1
## flag image requests
SetEnvIf Request_URI "(\.gif|\.png|\.jpg)$" image-request=1
## flag Css and JS requests
SetEnvIf Request_URI \.css css-request=1
SetEnvIf Request_URI \.js js-request=1
## set do_not_log if any of the above flags are set
SetEnvIf robots-request 1 do_not_log=1
SetEnvIf favicon-request 1 do_not_log=1
SetEnvIf image-request 1 do_not_log=1
SetEnvIf css-request 1 do_not_log=1
SetEnvIf js-request 1 do_not_log=1
## only log if do_not_log is not set
CustomLog /var/www/log/customer-access.log vcommon env=!do_not_log
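If you would rather have robots.txt traffic in its own file than mixed into the main log, the robots-request flag set earlier can drive a second CustomLog; env= tests only that the variable is set, regardless of its value, so the =0 flag still matches. The log path here is illustrative:

```apache
# Log robots.txt fetches separately to watch crawler behaviour
# (robots-request is set, to 0, by the SetEnvIf above)
CustomLog /var/www/log/customer-robots.log vcommon env=robots-request
```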
</VirtualHost>