University of Alabama at Birmingham
Information Technology
Questions & Answers

How Do I Create a Robots.txt file?

Search engines look in the root of your domain for a special file named "robots.txt" (http://www.mydomain.com/robots.txt). The file tells the robot (spider) which files it may and may not spider (download). This system is called the Robots Exclusion Standard.

The robots.txt file has its own simple format. It consists of records, and each record has two parts: a User-agent line and one or more Disallow lines. Every line has the form:

<Field> ":" <value>

Create the robots.txt file with UNIX line enders! Most good text editors have a UNIX mode (or your FTP client *should* do the conversion for you). Do not use an HTML editor to create a robots.txt file unless it specifically has a text mode.

User-agent

The User-agent line specifies the robot. For example:

User-agent: googlebot  

You may also use the wildcard character "*" to specify all robots:

User-agent: *  

You can find user agent names in your own logs by checking for requests to robots.txt. Most major search engines have short names for their spiders.
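One way to pull those names out of a log is sketched below in Python. It assumes your server writes the common "combined" log format, where the user agent is the last quoted field. The function name and the sample lines are made up for illustration, so adjust the pattern to your own logs.

```python
import re
from collections import Counter

# Assumption: "combined" log format, e.g.
# host - - [date] "GET /robots.txt HTTP/1.0" 200 68 "referer" "user agent"
LINE_RE = re.compile(
    r'"(?:GET|HEAD) /robots\.txt HTTP/[\d.]+" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

def robots_txt_agents(log_lines):
    """Count the user agents that requested /robots.txt."""
    counts = Counter()
    for line in log_lines:
        match = LINE_RE.search(line)
        if match:
            counts[match.group("agent")] += 1
    return counts

# Hypothetical sample lines, not taken from a real log.
sample = [
    '192.0.2.1 - - [19/Sep/2005:11:52:00 -0500] "GET /robots.txt HTTP/1.0" 200 68 "-" "Googlebot/2.1"',
    '192.0.2.2 - - [19/Sep/2005:11:53:00 -0500] "GET /index.html HTTP/1.0" 200 512 "-" "Mozilla/4.0"',
]
for agent, hits in robots_txt_agents(sample).items():
    print(agent, hits)
```

Only the first sample line is a request for robots.txt, so only its user agent is counted.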

Disallow:

The second part of a record consists of Disallow: directive lines. These lines specify files and/or directories. For example, the following line tells spiders that they may not download email.htm:

Disallow: /email.htm  

You may also specify directories:

Disallow: /cgi-bin/  

This blocks spiders from your cgi-bin directory.

The Disallow value is matched as a prefix, which gives it a wildcard nature. The standard dictates that /bob disallows both /bob.html and /bob/index.html (neither the file bob nor the files in the bob directory will be indexed).

If you leave the Disallow line blank, it indicates that ALL files may be retrieved. For a record to be correct, each User-agent line must be followed by at least one Disallow line. A completely empty robots.txt file is treated the same as a missing one.
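The prefix rule and the blank Disallow rule can be sketched in a few lines of Python. The function name is ours, and this is only an illustration of the matching described above, not the code any particular spider runs.

```python
def is_disallowed(path, disallow_values):
    """Return True if path starts with any non-empty Disallow value."""
    # A blank value ("") never matches, so a blank Disallow allows everything.
    return any(value and path.startswith(value) for value in disallow_values)

print(is_disallowed("/bob.html", ["/bob"]))        # True
print(is_disallowed("/bob/index.html", ["/bob"]))  # True
print(is_disallowed("/alice.html", ["/bob"]))      # False
print(is_disallowed("/bob.html", [""]))            # False
```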

White Space & Comments

Any line in the robots.txt that begins with # is considered to be a comment only. The standard allows for comments at the end of directive lines, but this is really bad style:

Disallow: bob #comment  

Some spiders will not interpret the above line correctly and instead will attempt to disallow "bob#comment". The moral is to place comments on lines by themselves.

White space at the beginning of a line is allowed, but not recommended. 

  Disallow: bob #comment
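For reference, here is roughly how a tolerant parser could handle comments and stray white space. This is a sketch in Python with a function name of our own choosing. It shows what the standard permits, and is no guarantee that every spider is this forgiving.

```python
def parse_line(line):
    """Split one robots.txt line into (field, value).

    Returns None for blank lines, comment-only lines, and lines without a colon.
    """
    line = line.split("#", 1)[0].strip()  # drop any comment, then outer white space
    if not line or ":" not in line:
        return None
    field, value = line.split(":", 1)
    return field.strip().lower(), value.strip()

print(parse_line("Disallow: bob #comment"))  # ('disallow', 'bob')
print(parse_line("  Disallow: /cgi-bin/"))   # ('disallow', '/cgi-bin/')
print(parse_line("# just a comment"))        # None
```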

Examples

The following allows all robots to visit all files because the wildcard "*" specifies all robots.

User-agent: *
Disallow:  

This one keeps all robots out.

User-agent: *
Disallow: /  

The next one bars all robots from the cgi-bin and images directories:

User-agent: *
Disallow: /cgi-bin/
Disallow: /images/  

This one bans Roverdog from all files on the server:

User-agent: Roverdog
Disallow: /  

This one keeps googlebot from getting at the cheese.htm file:

User-agent: googlebot
Disallow: /cheese.htm  
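If you want to check examples like these before uploading them, Python's standard library includes a robots.txt parser, urllib.robotparser. The sketch below feeds it two of the examples as text. The helper name and the robot name "anybot" are ours, and other spiders may not interpret every file exactly as this parser does.

```python
from urllib.robotparser import RobotFileParser

def allowed(robots_txt, agent, path):
    """Parse robots.txt text and report whether agent may fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)

ban_dirs = "User-agent: *\nDisallow: /cgi-bin/\nDisallow: /images/\n"
ban_roverdog = "User-agent: Roverdog\nDisallow: /\n"

print(allowed(ban_dirs, "anybot", "/cgi-bin/search.cgi"))  # False
print(allowed(ban_dirs, "anybot", "/index.html"))          # True
print(allowed(ban_roverdog, "Roverdog", "/index.html"))    # False
print(allowed(ban_roverdog, "googlebot", "/index.html"))   # True
```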

Problems with Robots.txt

Backwards Syntax

One of the most common mistakes is backwards syntax:

User-agent: *
Disallow: scooter

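It should be the following, assuming the goal is to keep scooter out of the whole site (the standard does not treat "*" as a wildcard in Disallow values, so use "/"):

User-agent: scooter
Disallow: /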

Multiple Disallows on one line:

Many people put multiple directories on one line:

Disallow: /css/ /cgi-bin/ /images/

Most spiders will misinterpret that line in a variety of ways. Some will throw out the spaces and try to use /css//cgi-bin//images/, others may use just /images/ or /css/, and others may ignore the whole line.

The correct syntax would be:

Disallow: /css/
Disallow: /cgi-bin/
Disallow: /images/
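A small Python sketch of that repair follows. The function name is ours, and it assumes the line carries no trailing comment, since a comment would be split along with the paths.

```python
def split_disallow(line):
    """Rewrite 'Disallow: /a/ /b/' as one Disallow line per path."""
    field, _, value = line.partition(":")
    if field.strip().lower() != "disallow":
        return [line]          # not a Disallow line, leave it alone
    paths = value.split()
    if len(paths) <= 1:
        return [line]          # already one path (or blank), nothing to fix
    return [f"Disallow: {path}" for path in paths]

for fixed in split_disallow("Disallow: /css/ /cgi-bin/ /images/"):
    print(fixed)
```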

DOS Line Enders:

Another common mistake is editing your robots.txt in DOS mode. Common as it is, it is bad practice. Always edit your robots.txt in UNIX mode and upload it in ASCII mode. Many FTP clients will convert to UNIX line enders for you seamlessly, but some will not. Make sure your text editor is in UNIX mode before editing a robots.txt file.
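If your editor has no UNIX mode, the conversion is easy to do yourself. Below is a minimal Python sketch with a function name of our own. It works on bytes so that Python does not translate the line enders back when you write the file on Windows (open the file in binary mode, "rb" and "wb").

```python
def to_unix_line_endings(data: bytes) -> bytes:
    """Convert DOS (CRLF) and old Mac (CR) line enders to UNIX (LF)."""
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")

dos_text = b"User-agent: *\r\nDisallow: /cgi-bin/\r\n"
print(to_unix_line_endings(dos_text))
```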

Comments at the end of line:

Per standard, this is acceptable:

Disallow: /cgi-bin/ #this bans robots from our cgi-bin

In the past, there have been search engines that would toss out the entire line. We know of no current major search engine that has a problem with it, but can you afford to risk it? Put the comments on a line by themselves.

Leading spaces: 

 Disallow: /cgi-bin/

The standard does not specifically address this, but it is bad style. Again, can you afford to risk it?

404 Redirects that lead to another page:

Quite common is the website without a robots.txt file that quietly redirects the request to another page. Often the redirect happens without a server error status or redirect status message, so it is up to the spider to figure out whether it is looking at a robots.txt file or an HTML file. Although it should not cause you any problems, can you afford to risk it? To fix it without reconfiguring your server, place a blank robots.txt file in your root.
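To see what a spider sees, request /robots.txt from your own site and look at the status code, the content type and the body. The Python sketch below is a rough heuristic of our own for judging such a response. The function name and the specific checks are assumptions, not part of the standard.

```python
def looks_like_robots_txt(status, content_type, body):
    """Heuristic: True if a response plausibly is a real robots.txt file."""
    if status != 200:
        return False
    # Ignore parameters such as "; charset=utf-8" when comparing the type.
    if content_type.split(";")[0].strip().lower() != "text/plain":
        return False
    return "<html" not in body.lower()

print(looks_like_robots_txt(200, "text/plain", "User-agent: *\nDisallow:\n"))
print(looks_like_robots_txt(200, "text/html", "<html><body>Welcome!</body></html>"))
```

The first call describes a normal robots.txt response, the second a redirect that quietly served the home page.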

Conflicting Declarations:

If you were slurp, what would you do?

User-agent: *
Disallow: /
#
User-agent: slurp
Disallow:

Does the allow for slurp override the disallow, or does the disallow override slurp? Per the standard, a robot obeys the record that names it and falls back to the * record only when no record does. In the example, slurp would walk right in and have a go at your site. All others would be banned. Even so, we have little faith in the less complex robots being able to deduce the difference and take the appropriate action.
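The selection rule a well-behaved robot is expected to follow can be sketched in Python as below. The function name and the dictionary layout are ours. The idea is that a record naming the robot takes priority over the "*" record, using a case-insensitive substring match on the robot's name.

```python
def select_record(robot_name, records):
    """Pick the Disallow list for robot_name from {user_agent: [paths]}.

    A record naming the robot wins over the "*" record; with neither
    present, the robot is unrestricted (empty list).
    """
    name = robot_name.lower()
    for agent, paths in records.items():
        if agent != "*" and agent.lower() in name:
            return paths
    return records.get("*", [])

# The conflicting example above, written as a dictionary.
records = {"*": ["/"], "slurp": [""]}
print(select_record("slurp", records))     # [''] : blank Disallow, nothing is off-limits
print(select_record("otherbot", records))  # ['/'] : the whole site is off-limits
```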

Capitalization - More Bad Style

USER-AGENT: EXCITE
DISALLOW:  

Although field names in the standard are not case sensitive, directory and file names are. It is best to follow the examples in the standard and capitalize only the first letter of User-agent and Disallow.

Listing of All Files

Another common style mistake is specifying each and every file in a directory:

Disallow: /AL/Alabama.html
Disallow: /AL/AR.html
Disallow: /Az/AZ.html
Disallow: /Az/bali.html
Disallow: /Az/bed-breakfast.html  

The above could be replaced by using the directory option:

Disallow: /AL/
Disallow: /Az/  

Remember, a trailing slash indicates to the spider that the directory is off-limits. It is a question of style and of size.

Disallow, Not Allow!

The original standard has no Allow field, only Disallow, so a spider that follows it may ignore an Allow line. This is wrong:

User-agent: Spot
Disallow: /john/
allow: /jane/  

This is correct:

User-agent: Spot
Disallow: /john/
Disallow:

No leading Slash

What should a spider do with this?

User-agent: Spot
Disallow: john  

Per standard, it will disallow the file named "john" and the directory named "john". Use leading and trailing slashes on all paths to be sure.
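A quick way to catch paths without a leading slash before you upload is sketched below in Python. The function name is ours. It only flags the values and leaves the decision about what was meant to you.

```python
def paths_missing_leading_slash(robots_txt):
    """Return non-empty Disallow values that do not start with '/'."""
    bad = []
    for line in robots_txt.splitlines():
        # Drop any comment, then split the line at the first colon.
        field, sep, value = line.split("#", 1)[0].partition(":")
        value = value.strip()
        if sep and field.strip().lower() == "disallow" and value and not value.startswith("/"):
            bad.append(value)
    return bad

sample = "User-agent: Spot\nDisallow: john\nDisallow: /jane/\nDisallow:\n"
print(paths_missing_leading_slash(sample))  # ['john']
```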

© 2004 - The University of Alabama at Birmingham - All Rights Reserved
This file was last updated on: September 19, 2005, 11:52 am