Bot-Trap
Admin: Posted Date: April 4, 2010

Bot-Trap is a package that enables your web site to automatically ban bad web robots (aka web spiders) that ignore the robots.txt file. It is well suited to blocking email harvesters, corporate tattletales, and other robots that disregard robots.txt.

Bot-trap - A Bad Web-Robot Blocker

This package enables your web site to automatically ban bad web robots (aka web spiders) that ignore the robots.txt file. Googlebot and other well-behaved robots are not affected, because they obey robots.txt. The software requirements are Apache and PHP, but the principles would work with any web server setup.

Three of the most common bad robots are:

  1. Email harvesters: they want to spam you.
  2. Corporate tattletales: they report back to corporations if you use their trademark, criticize them, violate their copyrights, and so on.
  3. Scrapers: they copy your whole site, then set it up somewhere else and put Adwords on it. Bot-trap can't protect against these bots because they usually follow robots.txt.

This package will protect against email-harvesting robots whether they follow robots.txt or not:

  1. Exclude your contact page in robots.txt.
  2. Email harvesting bots that follow robots.txt won't get your email.
  3. Email harvesting bots that don't follow robots.txt will quickly get banned and won't get your email.

Update, December 2006:
So many people have started using bot-trap and other bad-robot banners that many email-harvesting robots now appear to follow robots.txt! This means that simply listing your contact page in robots.txt, as I do, will drastically cut the number of spammers that get your email, even if you don't run bot-trap, and even if you use a direct mailto: link. A partial victory!
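
To make the robots.txt side of this concrete, here is a minimal Python sketch (not part of bot-trap, which is written in PHP) that parses a small robots.txt and reports which paths a compliant robot may fetch. The contact page path /contact.html and the robot name ExampleBot are assumptions made up for the example; it uses only the standard library's urllib.robotparser.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt. The contact page path is an assumption;
# /bot-trap/ is the directory name used throughout this article.
ROBOTS_TXT = """\
User-agent: *
Disallow: /contact.html
Disallow: /bot-trap/
"""


def allowed(path, robots_txt=ROBOTS_TXT, agent="ExampleBot"):
    """Return True if a robot that obeys robots_txt may fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


if __name__ == "__main__":
    for path in ("/index.html", "/contact.html", "/bot-trap/index.php"):
        print(path, "allowed" if allowed(path) else "disallowed")
```

A robot that honors these rules never requests the contact page or the trap; one that ignores them is exactly the kind of visitor the trap is meant to catch.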
                                  


Demo

To see bot-trap in action, go to the page on my site where bad robots go: http://daniewebb.us/bot-trap/index.php. You'll be banned for going where you weren't supposed to go (didn't you read the robots.txt file?!). Then return to this page with the back button, or type the main page URL into your browser's URL bar, and reload; otherwise you'll probably see the old copy cached by your browser. The 403 page that appears will let you unban yourself. Bad robots shouldn't follow that link, because my robots.txt forbids it. You were a very bad robot indeed.

How It Works

  1. You place a small "web-bug" strategically in your web pages. This bug is just a tiny image link that points to /bot-trap/index.php. Normal visitors don't see the link, but web bots do.
  2. You create a /robots.txt file that tells web bots not to go to the /bot-trap directory.
  3. When a bad robot visits /bot-trap/index.php anyway, that script adds the robot's IP address to a block list in /.htaccess. The address is blocked from the site from then on. You can also be emailed when this happens.
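
The banning step can be sketched roughly as follows. This is illustrative Python, not the package's PHP code, and the exact line bot-trap writes is not shown here; the sketch assumes an Apache 2.2-style "Deny from" rule, in keeping with the "Allow from all" rule mentioned in the installation notes. The function name add_ban is hypothetical.

```python
import ipaddress


def add_ban(htaccess_text, ip):
    """Return htaccess_text with a 'Deny from <ip>' rule appended once.

    The rule format is an assumption (Apache 2.2-style access control).
    """
    # Validate first so arbitrary text can never be injected into .htaccess;
    # ip_address() raises ValueError for anything that is not an IP address.
    ip = str(ipaddress.ip_address(ip))
    rule = f"Deny from {ip}"
    lines = htaccess_text.splitlines()
    if rule not in lines:  # banning the same address twice is a no-op
        lines.append(rule)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    before = "ErrorDocument 403 /bot-trap/forbid.php\n"
    print(add_ban(before, "203.0.113.7"), end="")
```

Keeping the text manipulation separate from the file I/O makes the rule easy to test without touching a live .htaccess.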

Safeguards

An innocent visitor may occasionally be banned. For example, a previous user of an IP address in a DHCP pool may have run a bad bot, and the address's new user inherits the ban. Not to worry: the custom "403 Forbidden" page lets any user unban themselves by typing a requested word into a form box. Real people can easily do this, but bots can't!
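
A minimal Python sketch of the unban idea, under the same assumptions as before: bans are stored as "Deny from" lines, and the visitor's typed word is compared with the requested word. Both function names are hypothetical, and the real forbid.php may work differently.

```python
def challenge_passed(typed, expected):
    """Return True if the visitor typed the requested word (case-insensitive)."""
    return typed.strip().lower() == expected.lower()


def remove_ban(htaccess_text, ip):
    """Return htaccess_text without the 'Deny from <ip>' rule for ip."""
    rule = f"Deny from {ip}"
    kept = [line for line in htaccess_text.splitlines() if line.strip() != rule]
    return ("\n".join(kept) + "\n") if kept else ""


if __name__ == "__main__":
    bans = "Deny from 203.0.113.7\nDeny from 198.51.100.2\n"
    if challenge_passed(" Human ", "human"):
        print(remove_ban(bans, "203.0.113.7"), end="")
```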

Installation

  1. Unpack the tarball in your web page root directory:
    # tar -xzf bot-trap-x.x.tar.gz
  2. Make the bot-trap directory in your web root owned by the same user or group as the web server (www-data on Debian GNU/Linux). Either way, the web server user needs read access to the bot-trap directory but does not need write access to it.
  3. Either add a line to your root .htaccess file like:
    ErrorDocument 403 /bot-trap/forbid.php
    or copy the premade one (bot-trap/htaccess-root-example). Note that once an IP is banned, it can't access anything in /, so the 403 page must live in /bot-trap, and /bot-trap/.htaccess should contain only "Allow from all". Look at the forbid.php file in the distribution to see how this is done, or just use it as-is.
  4. Create the empty file blacklist.dat in your web root directory. The bot-trap system stores a log of bans here.
  5. Make blacklist.dat and .htaccess writable by the web server user.
  6. Make sure .htaccess controls are allowed in your Apache configuration (especially the "AllowOverride" directive). This allows bot-trap to ban IP addresses using the htaccess mechanism.
  7. Edit bot-trap/settings.php to hold the correct email addresses to send alerts to.
  8. Add "web-bugs" to your main web page to catch the bad bots. This is the XHTML code:
<!-- Bad robot trap: Don't go here or your IP will be banned! -->
<a href="/bot-trap/index.php"><img src="/bot-trap/pixel.gif" border="0"
alt=" " width="1" height="1"/></a>
  9. Add the bot-trap directory to your robots.txt file, or copy the example robots.txt file (bot-trap/robots.txt.example) to the root directory.
  10. Make sure /.htaccess and all other files have the correct permissions and ownership for your site.
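
Before relying on the trap, it may help to double-check the file-related steps above. The following Python sketch is not part of the package, and check_install is a hypothetical helper; it looks for blacklist.dat, .htaccess, and the bot-trap directory under a given web root and reports anything missing or not accessible. It only tells you something useful when run as the web server user, since it tests the permissions of whoever runs it.

```python
import os


def check_install(web_root):
    """Return a list of human-readable problems found under web_root."""
    problems = []
    # Both files must exist and be writable by the web server user.
    for name in ("blacklist.dat", ".htaccess"):
        path = os.path.join(web_root, name)
        if not os.path.isfile(path):
            problems.append(f"{name} is missing")
        elif not os.access(path, os.W_OK):
            problems.append(f"{name} is not writable")
    # The bot-trap directory only needs to be readable (and traversable).
    trap = os.path.join(web_root, "bot-trap")
    if not os.path.isdir(trap):
        problems.append("bot-trap directory is missing")
    elif not os.access(trap, os.R_OK | os.X_OK):
        problems.append("bot-trap directory is not readable")
    return problems


if __name__ == "__main__":
    import sys
    root = sys.argv[1] if len(sys.argv) > 1 else "."
    for problem in check_install(root) or ["no problems found"]:
        print(problem)
```

The check does not cover the Apache side (for example the AllowOverride setting), which still has to be verified in the server configuration.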
                           


WARNING

Warning! Don't mess with this if you don't have the ability to fix things if this breaks them! If you mess up /.htaccess, your whole site could go down.

BUGS

The directory for bot-trap is hard-coded as /bot-trap. To change this, you have to change all the instances of '/bot-trap' to your new directory.

I used the file locking mechanism flock(), but you're still bound to get a race condition eventually when two processes set the bot trap at the same time or two processes unban at the same time. Unfortunately, if the user comments in the flock() documentation are to be believed (and I don't know one way or the other), there is no way around this. I figure that if you have so many bots tripping the trap that you get race collisions, you've got bigger problems than race collisions.
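
For readers unfamiliar with this kind of locking, here is a minimal Python sketch of one common pattern: hold a single exclusive lock across the whole read-modify-write cycle, so that two updates cannot interleave. It is not the package's PHP code, update_locked is a hypothetical name, and it relies on the Unix-only fcntl module. The lock is advisory, so it only helps if every process that writes the file uses the same locking, and whether it avoids the specific problem described above is not something this sketch settles.

```python
import fcntl


def update_locked(path, transform):
    """Rewrite path as transform(old_text) while holding an exclusive lock."""
    # Open without truncating, so nothing is lost before the lock is held.
    with open(path, "r+") as f:
        fcntl.flock(f, fcntl.LOCK_EX)  # blocks while another process holds it
        try:
            new_text = transform(f.read())
            f.seek(0)
            f.write(new_text)
            f.truncate()  # drop leftover bytes if the new text is shorter
            f.flush()     # push the data out before the lock is released
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)


if __name__ == "__main__":
    import os
    import tempfile
    fd, demo = tempfile.mkstemp()
    os.close(fd)
    update_locked(demo, lambda text: text + "Deny from 203.0.113.7\n")
    with open(demo) as f:
        print(f.read(), end="")
    os.remove(demo)
```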

LICENSE

bot-trap is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

bot-trap is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

 

 
 