WARNING: Be very careful editing your server configuration
or .htaccess files. Even a minor typographical error can
make your site unusable! Always make a backup copy of any file before
you edit it, so you can recover quickly.
Using the Mod Rewrite URL Rewriting Engine to Deal with Bad Robots and Pesky Spambots
One of the greatest features of the Apache server is Mod Rewrite
(mod_rewrite). This optional module allows you to control URL access
in an almost unlimited number of ways. Our task at hand, though, is to
protect our server from wasteful accesses that, for a variety of
reasons, can bring the server to its knees.
The problems with many robots and spambots can be broken down into a few
areas:
- They either ignore the robots.txt instructions file, or
  attempt to exploit it to find otherwise unlinked directories.
- Due to programming errors, they can get caught in loops, attempting
  to access files that do not exist.
- If they are what is called multi-threaded, they can
  launch an almost unlimited number of concurrent connections to
  your site, creating a serious system load.
- Do you really feel like paying for bandwidth when all somebody
  is doing is trying to get e-mail addresses out of your pages?
There was a time when I was using browser detection in my
Server Side Includes that would basically spill about 200K of
garbage down the throat of any spambot that came our way. Okay,
I confess that revenge felt good, but when I thought it over I realized
that I was placing more strain on our own server and, by providing a
huge list of bogus e-mail addresses, was also placing a strain on
whatever SMTP server the spammer would eventually hijack. It was then
that I decided to start using the mod_rewrite module.
Any visiting spambot, or any robot I consider a problem, is directed
to a single page, problem.html.
On that page, I explain why they ended up where they did. In the case
of people attempting to capture the site for off-line viewing, I try
to be of assistance. If somebody thinks enough of BNB to save it, I
owe them something in return.
The elegance of this solution is that the offending 'bot never sees
anything but that one small page. No matter what URL they request from
our site, that is the only page they will ever see. It is
handled at the server level, so it cannot be bypassed by requesting
a different URL.
NOTE: In order to use this feature of the Apache server, you must
make sure that the server was compiled with the mod_rewrite module.
This is done by adding the line
AddModule modules/standard/mod_rewrite.o
to the Configuration file before compiling the server.
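If you are not sure whether the module is present, one cautious option
is to wrap your rewrite directives in an IfModule block, so that a
server built without mod_rewrite skips them instead of failing on an
unknown directive. A minimal sketch, assuming the test is made against
the module's source file name, mod_rewrite.c:

```apache
# Only processed when mod_rewrite is compiled in (or loaded).
<IfModule mod_rewrite.c>
    RewriteEngine On
    # Rewrite conditions and rules go here.
</IfModule>
```

The trade-off is that a missing module then goes unnoticed: the rules
are silently ignored and the robots get through.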
THIS SOUNDS GREAT! HOW DO I DO IT?
As of this writing, here is my little rewrite instruction code:
What this code basically says is that if the HTTP_USER_AGENT
string begins with any of the listed values, the request is
redirected to the problem.html page.
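As an illustration only, a rule set of the kind described might look
like the following. The agent names (EmailSiphon, ExampleBot) are
placeholders, and /problem.html is assumed to sit in the document
root; substitute your own values.

```apache
RewriteEngine On

# ^ anchors each pattern to the start of the User-Agent string.
# [OR] joins the conditions, so any one match is enough.
RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon [OR]
RewriteCond %{HTTP_USER_AGENT} ^ExampleBot

# Skip the notice page itself so the rule does not loop.
RewriteCond %{REQUEST_URI} !^/problem\.html$

# Everything else these agents request is answered with the notice page.
RewriteRule .* /problem.html [L]
```

As written, this is an internal rewrite: the 'bot receives the notice
page under whatever URL it asked for. Adding the R flag would send an
external redirect instead.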
There is a performance penalty for placing mod_rewrite directives
in your .htaccess file, but I recommend
doing so for the following reasons.
Since you are most likely not going to be dealing with a lot of
spiders at once, and since they are not going to get anywhere anyway,
what is called the Chicken and Egg Problem is not going
to be much of an issue. And because .htaccess files are read on each
request, you can add new 'bots to the list as you identify them
without having to restart your server.
Note: Do NOT place any links to your site on the page the spiders
or spambots are being redirected to! You can also protect individual
directories by creating an .htaccess file in the directory you would
like to forbid access to.
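For the per-directory case, a sketch of an .htaccess file that simply
refuses a listed agent might look like this. ExampleBot is a
placeholder name, and the F flag (answer with 403 Forbidden) is one
possible choice, not necessarily the setup used on this site.

```apache
# .htaccess placed in the directory to protect
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^ExampleBot
# "-" means no substitution; F answers with 403 Forbidden.
RewriteRule .* - [F]
```

This only takes effect if the main server configuration permits it for
that directory (AllowOverride FileInfo, with Options FollowSymLinks
enabled).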