Use ROBOTS.TXT to control search engine indexing

A robot (also called a spider) is an automated software program that scans web pages (as well as newsgroups and other internet structures) looking for particular kinds of information. There are hundreds, if not thousands, of robots tirelessly scanning the internet day and night. The result of their toils is often beneficial (they allow massive search engines like Google to exist). Sometimes their purposes are merely interesting (as with internet mapping robots), and occasionally they are outright malicious (as with email harvesters).

What these robots are looking for depends upon their purpose. 

Usually it's safe to just ignore these robots, although if you have access to your server log files it is always a good idea to keep an eye on their travels through your site. Sometimes, however, there are pages in your site that you simply do not want or need indexed. There can be many different reasons for this.

One good reason to exclude certain directories is to help out the search engines. Think about it - they have a lot of work to do to completely index your entire site. This causes traffic on the internet, on your ISP and your host. Anything that you can do to help reduce this traffic will help the greater good.

In order to aid you in informing robots of your intentions, a set of conventions called the Robots Exclusion Standard has been created. It began as an informal agreement: for most of its history it was not backed by any official internet standards committee (it was only written up as an IETF document, RFC 9309, in 2022), and it is not enforced by anyone, including web server software. Instead, the standard was created by a group of webmasters and made public to help solve the problems that robots create.

So what good is it? Well, many robots, including most of those used by major search engines, have agreed to follow the standard. In fact, it is now considered good form for any beneficial or well-written robot to follow this standard, as well as the ROBOTS metatag. (A robot that does not follow the standard is usually regarded as either malicious or sloppily coded.)

What the standard allows you to do is create a special file called robots.txt. The file always has the same name, and it must reside in the root directory of your web site. Only one file is allowed per web site.
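Because the file always lives at the root, a robot can work out where to look from any page address on your site. Here is a minimal sketch of that step in Python; the function name robots_url and the example.com address are made up for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url):
    """Return the robots.txt address for the site that serves page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host; replace the path and drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("http://www.example.com/images/special/photo.html?x=1"))
# prints http://www.example.com/robots.txt
```

A robots.txt placed in a subdirectory would never be found this way, which is why the root location matters.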

Important

This is critically important. The robots.txt file is a standard which is voluntarily supported by a robot or spider. There is no requirement that it be used. Thus, malicious spiders (such as EmailSiphon and Cherry Picker) will simply ignore this file.
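To see why support is voluntary, it helps to picture where the check happens: inside the robot's own code. The following is only a sketch of what a well-behaved robot might do, using the robots.txt parser from Python's standard library. The polite_fetch function, the "ExampleBot" name and the fetch callback are hypothetical.

```python
from urllib.robotparser import RobotFileParser


def polite_fetch(parser, agent, url, fetch):
    """Call fetch(url) only if the parsed robots.txt permits it for agent."""
    if not parser.can_fetch(agent, url):
        # A well-behaved robot stops here. Nothing forces it to.
        return None
    return fetch(url)


parser = RobotFileParser()
parser.parse(["User-agent: *", "Disallow: /private/"])

fetched = []  # stands in for real page downloads
polite_fetch(parser, "ExampleBot", "/private/a.html", fetched.append)
polite_fetch(parser, "ExampleBot", "/public/a.html", fetched.append)
print(fetched)  # prints ['/public/a.html']
```

A malicious robot simply leaves out the can_fetch test, and nothing on your server will stop it.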

Robots.txt is a simple text file which contains some keywords and file specifications. Each line of the file is either blank or consists of a single keyword and its related information. The keywords are used to tell robots which portions of your web site are NOT to be spidered (we will refer to these as exclusions).

These are the keywords that are allowed:

User-agent - This is the name of the robot or spider. You may also include more than one agent name if the same exclusion is to apply to them all. You do not need to worry about case (in other words, "googlebot" is the same as "GOOGLEBOT" and "GoogleBot").

A "*" indicates this is the "default" record, which applies if no other match is found. For example, if your file has one record for "GoogleBot" and one for "*", then the "*" record applies to every robot other than GoogleBot.
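You can try out the matching rules for User-agent without touching a live site. The sketch below feeds a small made-up robots.txt to Python's standard-library parser; the allowed helper and the "SomeOtherBot" name are just for illustration, and other parsers may differ in the details.

```python
from urllib.robotparser import RobotFileParser

RULES = """\
User-agent: GoogleBot
Disallow: /private/

User-agent: *
Disallow: /
"""


def allowed(agent, path, rules=RULES):
    """Return True if agent may fetch path under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(agent, path)


print(allowed("googlebot", "/index.html"))        # prints True
print(allowed("GOOGLEBOT", "/private/x.html"))    # prints False
print(allowed("SomeOtherBot", "/index.html"))     # prints False
```

The first two calls hit the GoogleBot record regardless of case, and the third falls through to the "*" record, which shuts the robot out completely.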

Disallow - This tells the robot(s) specified in the User-agent which parts of your web site are off-limits. For example, /images tells the robot not to look at any files in the images directory, and any directory below it. Thus, "/images/special/" would not be indexed by the robot.

Note that /se will match any path beginning with "/se" (such as "/search.html" or "/secret/"), while /se/ will only match the directory named "/se/" and its contents.

You can also specify individual files. For example, you could say /mydata/help.html to prevent just that one single file from being spidered.

A value of just / indicates nothing is allowed to be spidered.

You must have at least one Disallow line per User-agent record.
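Disallow values are matched as simple prefixes of the path, which is what makes /se and /se/ behave differently. Here is a small sketch using Python's standard-library parser to check each of the cases described above; the is_allowed helper and the "ExampleBot" name are hypothetical.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(disallow_value, path, agent="ExampleBot"):
    """Check path against a one-rule robots.txt that applies to all robots."""
    parser = RobotFileParser()
    parser.parse(["User-agent: *", f"Disallow: {disallow_value}"])
    return parser.can_fetch(agent, path)


print(is_allowed("/se", "/search.html"))                      # prints False
print(is_allowed("/se/", "/search.html"))                     # prints True
print(is_allowed("/se/", "/se/page.html"))                    # prints False
print(is_allowed("/mydata/help.html", "/mydata/other.html"))  # prints True
print(is_allowed("/", "/anything.html"))                      # prints False
```

False means the robot is shut out. Notice that the short form /se catches /search.html by accident, so the trailing slash is usually what you want for a directory.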

# - Start of a comment. You can include a pound character anywhere in a line to begin a comment.
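Comments are thrown away before the keywords are read, so they do not change what a rule means. This sketch checks that with Python's standard-library parser; the file contents and the "ExampleBot" name are made up.

```python
from urllib.robotparser import RobotFileParser

RULES = """\
# Keep robots out of the image folder
User-agent: *          # applies to every robot
Disallow: /images/     # trailing comment on a rule line
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

image_blocked = not parser.can_fetch("ExampleBot", "/images/logo.gif")
home_allowed = parser.can_fetch("ExampleBot", "/index.html")
print(image_blocked, home_allowed)  # prints True True
```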

The following example disallows certain directories and all files contained within those directories.

User-agent: *
Disallow: /images/
Disallow: /banners/
Disallow: /Forms/
Disallow: /Dictionary/
Disallow: /_borders/
Disallow: /_fpclass/
Disallow: /_overlay/
Disallow: /_private/
Disallow: /_themes/

This example disallows all robots:

User-agent: *
Disallow: /

This file disallows Googlebot from examining a specific web page:

User-agent: GoogleBot
Disallow: /tempindex.htm
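Disallow values are compared against the path part of the address, which always begins with a slash. The sketch below suggests what happens when the slash is left off, at least in Python's standard-library parser (other robots may be more forgiving). The googlebot_allowed helper is made up for illustration.

```python
from urllib.robotparser import RobotFileParser


def googlebot_allowed(disallow_value, path="/tempindex.htm"):
    """Return True if Googlebot may fetch path under a one-rule record."""
    parser = RobotFileParser()
    parser.parse(["User-agent: GoogleBot", f"Disallow: {disallow_value}"])
    return parser.can_fetch("Googlebot", path)


print(googlebot_allowed("/tempindex.htm"))  # prints False
print(googlebot_allowed("tempindex.htm"))   # prints True
```

With the slash, the page is blocked. Without it, the rule never matches and the page stays open to the robot, so it is safest to always write the path starting from the root.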

It is important to remember that the robots.txt file is available to everyone. Thus you never want to specify the names of sensitive files or folders. If you must exclude them, it is better to use password-protected pages, which cannot be reached by search engines at all (they don't have the password!).

Some Cool Tools

There are very few tools to help you with creating your robots.txt file. However, I have found a few which are very useful.

Syntax Checker

This handy online utility works well.

Robogen

Check out a product called Robogen. This is a really nice shareware product (very inexpensive) which allows you to select files and directories within your site (via FTP) to exclude. If you have a complicated robots.txt file this is a great product to use.
