Bot-trap - A Bad Web-Robot Blocker
This package enables your web site to automatically ban bad web robots (aka web
spiders) that ignore the robots.txt file.
Googlebot and other well-behaved robots obey robots.txt and are not affected. The software
requires Apache and PHP, but the principles would work with any web server setup.
Three of the most common bad robots are:
- Email harvesters: they want to spam you.
- Corporate tattletales: they report back to corporations if you use their
trademark, criticize them, violate their copyrights, and so on.
- Scrapers: they copy your whole site, then set it up somewhere else and put Adwords
on it. Bot-trap can't protect against these bots because they usually follow
robots.txt.
This package will protect against email-harvesting robots whether they follow
robots.txt or not:
- Exclude your contact page in robots.txt.
- Email-harvesting bots that follow robots.txt won't get your email.
- Email-harvesting bots that don't follow robots.txt will quickly get banned and
won't get your email.
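To make the first case concrete, here is a minimal Python sketch of what a robots.txt-following robot does with such an exclusion. The robots.txt content, the /contact.html path, and the helper name allowed are assumptions for illustration; they are not taken from the bot-trap distribution.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: the contact page and the trap are both excluded.
ROBOTS_TXT = """\
User-agent: *
Disallow: /contact.html
Disallow: /bot-trap/
"""


def allowed(path, agent="ExampleBot"):
    """Return True if a robots.txt-compliant robot may fetch `path`."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, path)


if __name__ == "__main__":
    for path in ("/index.html", "/contact.html", "/bot-trap/index.php"):
        print(path, "allowed" if allowed(path) else "disallowed")
```

A robot that honors these rules never requests the contact page or the trap; a robot that ignores them is the one that ends up banned.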
Update, December 2006:
So many people have started using bot-trap and other tools that ban bad robots that many
email-harvesting robots now appear to be following robots.txt! This means that simply
listing your contact page in robots.txt, as I do, will drastically cut the number of spammers
that get your email, even if you don't run bot-trap, and even if you use a
direct mailto: link. A partial victory!
Demo
To see bot-trap in action, go to the page on my site where bad robots go: http://daniewebb.us/bot-trap/index.php.
You'll be banned for going where you weren't supposed to go (didn't you read the
robots.txt file!?). Then come back to this page with the back button, or type the main page
URL into your browser's address bar, and reload the page; otherwise you'll probably get the
old copy cached by your browser. The 403 page you then see will allow you to unban yourself.
Bad robots shouldn't be going to that link, because my robots.txt forbids it. You were a very
bad robot indeed.
How It Works
- You place a small "web-bug" strategically in your web pages. This bug is just a
tiny image link that points to /bot-trap/index.php. Normal people don't see this
link, but web bots do.
- You create a /robots.txt file that tells web bots not to go to the /bot-trap
directory.
- When the bad robot visits /bot-trap/index.php anyway, /bot-trap/index.php adds the
IP address of the bad bot to a block list in /.htaccess. The bot is blocked from access
to the site from then on. You can also be emailed when this happens.
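The banning step can be sketched in a few lines. This is a Python illustration of the idea, not the package's PHP code: the function name add_ban is hypothetical, and the "Deny from" line format is an assumption (it is the older Apache access-control syntax that pairs with the "Allow from all" mentioned in the installation notes).

```python
import ipaddress

BAN_PREFIX = "Deny from "  # assumed directive format, not confirmed by the package


def add_ban(htaccess_text, ip):
    """Return htaccess_text with a ban line for `ip` appended, at most once."""
    # Normalise and validate first so a forged value cannot smuggle extra
    # directives into .htaccess; ip_address() raises ValueError on junk.
    ip = str(ipaddress.ip_address(ip))
    ban_line = BAN_PREFIX + ip
    lines = htaccess_text.splitlines()
    if ban_line not in lines:
        lines.append(ban_line)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    before = "ErrorDocument 403 /bot-trap/forbid.php\n"
    print(add_ban(before, "192.0.2.7"), end="")
```

Keeping the function pure (text in, text out) makes it easy to check before it is ever pointed at a live /.htaccess.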
Safeguards
It is possible that someone is banned who shouldn't be. Perhaps a previous user of an
IP address in a DHCP pool was a naughty user and ran a bad bot, but now
the new user is banned. Not to worry, the custom "403 Forbidden" page allows any user to
unban themselves by typing a requested word into a form box. Real people can easily do
this, but bots can't!
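A minimal Python sketch of the unban step, under the same caveats: remove_ban is a hypothetical name, the "Deny from" line format is an assumption, and the real forbid.php may check the typed word differently.

```python
def remove_ban(htaccess_text, ip, typed_word, expected_word):
    """Drop the ban line for `ip` if the visitor typed the requested word."""
    if typed_word.strip().lower() != expected_word.lower():
        return htaccess_text  # challenge failed: leave the ban in place
    ban_line = "Deny from " + ip  # assumed ban-line format
    kept = [line for line in htaccess_text.splitlines()
            if line.strip() != ban_line]
    return ("\n".join(kept) + "\n") if kept else ""


if __name__ == "__main__":
    text = "Deny from 192.0.2.7\nDeny from 198.51.100.9\n"
    print(remove_ban(text, "192.0.2.7", "human", "human"), end="")
```

Only the line for the requesting IP is removed, so other bans stay in force.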
Installation
- Unpack the tarball in your web page root directory:
# tar -xzf bot-trap-x.x.tar.gz
- Make the bot-trap directory in your web root directory owned by the
same user or group as the web server (www-data on Debian GNU/Linux).
Either way, the web server user needs read access to the bot-trap
directory, but it doesn't have to have write access to it.
- Either add a line like this to your root .htaccess file:
ErrorDocument 403 /bot-trap/forbid.php
or copy the premade one (bot-trap/htaccess-root-example). Note that once an
IP is banned, it can't access anything under / unless a more specific .htaccess
overrides the ban, so the 403 page should be in /bot-trap, and /bot-trap/.htaccess
should only say "Allow from all". Look at the forbid.php
file in the distribution to see how to do this, or just use it as-is.
- Create the empty file blacklist.dat in your web root directory. The bot-trap
system stores a log of bans here.
- Make blacklist.dat and .htaccess writable by the web server user.
- Make sure .htaccess controls are allowed in your Apache configuration (especially
the "AllowOverride" directive). This allows bot-trap to ban IP addresses using the
htaccess mechanism.
- Edit bot-trap/settings.php to hold the correct email addresses to send alerts
to.
- Add "web-bugs" to your main web page to catch the bad bots. This is the XHTML
code:
<!-- Bad robot trap: Don't go here or your IP will be banned! -->
<a href="/bot-trap/index.php"><img src="/bot-trap/pixel.gif" border="0"
alt=" " width="1" height="1"/></a>
- Add the bot-trap directory to your robots.txt file, or copy the example
robots.txt file (bot-trap/robots.txt.example) to the root directory.
- Make sure /.htaccess and all other files have the correct permissions and
ownership for your site.
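Several of these steps are easy to get wrong, so a quick check can help. The following Python sketch is not part of the package (check_install is a hypothetical helper); it only looks at the files named in the steps above, and the write test is meaningful only when run as the web server user.

```python
import os


def check_install(web_root):
    """Return a list of human-readable problems found under web_root."""
    problems = []
    # bot-trap must be able to rewrite both of these files.
    for name in (".htaccess", "blacklist.dat"):
        path = os.path.join(web_root, name)
        if not os.path.isfile(path):
            problems.append(f"{name} is missing")
        elif not os.access(path, os.W_OK):
            problems.append(f"{name} is not writable by this user")
    # Well-behaved robots need to be told to stay out of the trap.
    try:
        with open(os.path.join(web_root, "robots.txt")) as f:
            if "/bot-trap" not in f.read():
                problems.append("robots.txt does not mention /bot-trap")
    except FileNotFoundError:
        problems.append("robots.txt is missing")
    return problems


if __name__ == "__main__":
    for problem in check_install("."):
        print("PROBLEM:", problem)
```

An empty result does not prove the setup works (it says nothing about AllowOverride, for instance), but a non-empty one points at a step that was skipped.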
WARNING
Don't mess with this unless you have the ability to fix things if it
breaks them! If you mess up /.htaccess, your whole site could go down.
BUGS
The directory for bot-trap is hard-coded as /bot-trap. To change this, you
have to change all the instances of '/bot-trap' to your new directory.
I used the file locking mechanism flock(), but you're still bound to get a race
condition eventually when two processes set the bot trap at the same time or two processes
unban at the same time. Unfortunately, if the comments on the flock() documentation are to
be believed (and I don't know one way or the other), there is no way around this. I figure
that if you have so many bots tripping the trap that you get race collisions, you've got
bigger problems than race collisions.
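For reference, the usual pattern for keeping two writers from clobbering each other is to hold one exclusive lock across the whole read-modify-write cycle. Here is a minimal Python sketch of that pattern on Unix. It is not bot-trap's PHP code, locked_update is a hypothetical name, and flock locks are advisory, so the pattern only helps if every process that writes the file uses it.

```python
import fcntl


def locked_update(path, transform):
    """Rewrite `path` as transform(old_text) while holding an exclusive lock."""
    with open(path, "r+") as f:
        fcntl.flock(f, fcntl.LOCK_EX)  # blocks until no one else holds the lock
        try:
            new_text = transform(f.read())
            f.seek(0)
            f.write(new_text)
            f.truncate()  # drop leftover bytes if the new text is shorter
            f.flush()     # push the data out before the lock is released
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)
    return new_text
```

The point of the design is that the read happens after the lock is taken, so a second process always sees the first one's ban or unban before making its own change.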
LICENSE
bot-trap is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2 of the License, or
(at your option) any later version.
bot-trap is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.