Robots.txt on 6 search engines
Search engine robots check a special file called "robots.txt"
that can be placed in the root directory of a Web server.
This is a plain text file (without HTML code) that allows
the Web site administrator to define which parts of the site
robots may access and which they may not.
Rumor has it that some search engine robots do not index
Web pages on sites that lack a robots.txt file, because they
cannot tell whether they are allowed to access the site.
In our search engine ranking study, we examined 103,260 Web
pages that ranked in the top 10 on Google, AltaVista,
iWon/Inktomi, AllTheWeb, Teoma and Wisenut. Here are the
results for the robots.txt file:
AllTheWeb: 30.5% have it, 69.5% don't have it
AltaVista: 36.3% have it, 63.7% don't have it
Google: 35.7% have it, 64.3% don't have it
iWon/Inktomi: 32.7% have it, 67.3% don't have it
Teoma: 30.4% have it, 69.6% don't have it
Wisenut: 31.4% have it, 68.6% don't have it
As you can see, on every engine roughly two thirds of the
top 10 Web pages come from sites without a robots.txt file,
and they are still indexed.
Sometimes, it's necessary to include a robots.txt file in
the root directory of your server. For example, you may not
want search engine robots to index your log files, because
then anyone could find your logs in the search engines. In
addition, you may want to keep robots away from dynamically
created pages because they put a heavy load on your server.
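
As a rough illustration of the two cases above, the following Python
sketch feeds a small robots.txt to the standard library's
urllib.robotparser and asks whether a robot may fetch two URLs. The
/logs/ and /cgi-bin/ paths, the robot name "ExampleBot" and the host
www.example.com are assumptions made up for this example; use the
paths that apply to your own server.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content. The /logs/ and /cgi-bin/ paths are
# assumptions chosen for illustration, not taken from a real site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /logs/
Disallow: /cgi-bin/
"""


def build_parser(text: str) -> RobotFileParser:
    """Parse robots.txt text without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# "ExampleBot" is a made-up robot name; it falls under the "*" record.
print(parser.can_fetch("ExampleBot", "http://www.example.com/logs/access.log"))  # False
print(parser.can_fetch("ExampleBot", "http://www.example.com/index.html"))       # True
```

Keep in mind that robots.txt is only a request to well-behaved robots.
It does not protect the files themselves from being accessed.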
If you already have a robots.txt file, it's very important
that you check its syntax. Here's a very good free tool:
http://www.tardis.ed.ac.uk/~sxw/robots/check/
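
If you want a quick local check as well, a minimal Python sketch
might look like the one below. It is not the tool linked above and
it is much less thorough. It assumes that only the User-agent,
Disallow and Allow fields are valid (Allow is a common extension
rather than part of the original protocol) and that every rule must
follow a User-agent line. The function name check_robots_txt is
hypothetical.

```python
# Fields this simplified check accepts. "allow" is a widely used
# extension; treating anything else as an error is an assumption.
KNOWN_FIELDS = {"user-agent", "disallow", "allow"}


def check_robots_txt(text):
    """Return a list of (line_number, message) problems found in text."""
    problems = []
    seen_user_agent = False
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue  # blank or comment-only line
        if ":" not in line:
            problems.append((number, "missing ':' separator"))
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field not in KNOWN_FIELDS:
            problems.append((number, f"unknown field '{field}'"))
        elif field == "user-agent":
            if not value:
                problems.append((number, "empty User-agent value"))
            seen_user_agent = True
        elif not seen_user_agent:
            problems.append((number, "rule appears before any User-agent line"))
    return problems


sample = "Disallow: /logs/\nUser agent *\nFoo: bar\n"
for number, message in check_robots_txt(sample):
    print(f"line {number}: {message}")
```

An empty result only means that none of these simple checks failed.
It does not guarantee that every robot will read the file the way
you intend.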
Martijn Koster, author of the Robots Exclusion Protocol,
has compiled information on robots:
http://www.robotstxt.org/wc/robots.html
Listing of robot names (so that you can recognize them in
your Web server logs):
http://www.jafsoft.com/searchengines/webbots.html
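
One way to spot robots in your own logs is to look for clients that
request /robots.txt. The Python sketch below counts those requests
per user agent. It assumes that your server writes the common
"combined" log format, with the user agent as the last quoted
field; the sample log lines and the robot name "ExampleBot/1.0" are
invented for illustration.

```python
import re
from collections import Counter

# Matches the request, status, size, referrer and user-agent fields of
# a "combined" format log line (an assumption about your log format).
LOG_LINE = re.compile(
    r'"(?:GET|HEAD) (?P<path>\S+)[^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def robots_txt_requesters(lines):
    """Count requests for /robots.txt per user agent string."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if match and match.group("path") == "/robots.txt":
            counts[match.group("agent")] += 1
    return counts


# Invented sample lines; real logs would be read from a file instead.
sample_log = [
    '192.0.2.1 - - [10/Jun/2002:13:55:36 +0000] "GET /robots.txt HTTP/1.0" 200 68 "-" "ExampleBot/1.0"',
    '192.0.2.7 - - [10/Jun/2002:13:56:02 +0000] "GET /index.html HTTP/1.0" 200 2326 "-" "Mozilla/4.0"',
    '192.0.2.1 - - [11/Jun/2002:02:10:11 +0000] "GET /robots.txt HTTP/1.0" 200 68 "-" "ExampleBot/1.0"',
]

for agent, hits in robots_txt_requesters(sample_log).most_common():
    print(f"{hits:4d}  {agent}")
```

You can then compare the user agent strings it reports with the
robot names in the listing above.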
Source of the search engine percentages above: Search
Engine Ranking Studies Q2/2002.