Creating a robots.txt File
The robots.txt file is a simple ASCII text file used to indicate web site files and directories that should not be indexed. Many webmasters choose not to revise their robots.txt file because they are uncertain how the changes could impact their rankings. However, a poorly written robots.txt file can cause your complete web site to be indexed, gathering information like passwords, email addresses, hidden links, membership areas, and confidential files. A poorly written robots.txt file could also cause portions of your web site to be ignored by the search engines.
A web robot is a software program that automatically searches web sites to find information. Well-behaved robots follow instructions in the robots.txt file. Specialized robots collect email addresses, create huge databases of URLs with products sold by online stores, detect plagiarism, search for intellectual property, find reciprocal link partners, scrape web site content, and gather icons and images. Spider robots are commonly used for indexing web sites for search engines. The process of using these robots to find patterns in large amounts of data is called data mining. Web robots are also called agents or bots.
ASP member John Fotheringham provides links to each spider's home page, its IP addresses, and more details about each of the spiders at http://www.jafsoft.com/searchengines/webbots.html. Another list of 298 known robots can be found at http://www.robotstxt.org/wc/active/html/index.html.
In general, search engines consider all web pages to be crawlable. Well-behaved robots do not spider your entire site in one visit. They come back periodically, so the web server is not slowed down by a constant bombardment of requests from the robot. Bad robots ignore the instructions in the robots.txt file, or use the information listed in the robots.txt file to search excluded directories and files. Robots that do not care about the instructions in your robots.txt file, such as spam bots, need to be banned by your .htaccess file. While I was researching this article, I found one site that claimed to have identified 135 bad robots among its web site visits.
The robots.txt file does not provide web site security. Any visitor, human or web robot, can view the web site robots.txt file. If you decide to keep confidential files on your web site, consider some type of authorization, or place your files in directories protected by .htaccess. For example, you do not want files in your directories containing passwords and confidential information to be indexed by robots or viewed by snoops. If needed, encrypt your confidential information on your web site.
LIMITING ROBOTS
Some areas of your web site may have content that should not be scanned. For example, you may have private web site pages that are designed only for your clients and staff. You should not risk private web content being stolen, used by your competitors, placed on other web sites, or indexed by search engines.
Areas of your web site that are not ready to go live should be blocked. For example, you would not want an unoptimized page with an incomplete link structure to be indexed. If your web site is not indexed properly, you risk the original web content being stolen, placed on a different web site, and having the copied content indexed by search engines before your original web content is indexed.
Some SEO experts believe allowing access to all files on your site can dilute the relevance of your site content, causing your site's ranking to drop. The robots.txt file can be used to block images and text content unrelated to your web site theme. Some search engines only index a limited number of web pages from a site. It makes no sense to have those search engines index your search, error, and thank you pages.
Your web site may include web pages optimized for different search engines (doorway pages). Your web site may have printer-friendly pages that could be judged as duplicate content. The robots.txt file can be used to prevent indexing that could penalize your web site for duplicate content. Another option is to include an HTML meta tag (<meta name="robots" content="noindex,nofollow">) on either the printer-friendly pages or your regular HTML pages. However, not all search engine robots interpret meta tags.
If you have very limited bandwidth, you may prefer to not let some robots waste any bandwidth. Specific robots can be banned from indexing the entire site, or selected directories and files. For example, you may want to prevent your web site from being archived at http://www.archive.org. If your web site has multilingual pages, the robots.txt file can be used to disallow specific robots from directories and web pages with languages not supported by some search engines.
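As a rough sketch of banning one robot from the whole site while keeping every other robot out of a single directory, the example below feeds a hypothetical robots.txt to Python's standard urllib.robotparser module. The agent name "ia_archiver" (assumed here to be the name used by the archive.org crawler) and the /clients/ path are illustrative assumptions, and the parser only shows how a well-behaved robot could interpret the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt. The agent name "ia_archiver" and the
# /clients/ directory are assumptions made for this illustration.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow: /clients/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# The banned robot is refused everywhere on the site.
print(parser.can_fetch("ia_archiver", "/index.html"))
# Any other robot may fetch ordinary pages...
print(parser.can_fetch("SomeOtherBot", "/index.html"))
# ...but not files under the protected directory.
print(parser.can_fetch("SomeOtherBot", "/clients/report.html"))
```

A bad robot can still ignore these lines, so this kind of record only limits robots that choose to obey the file.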
FOLLOWING THE STANDARDS
The robots.txt file follows informal standards created in 1994. These simple standards are explained at http://www.robotstxt.org. A good tutorial for creating the robots.txt file is at http://www.robotstxt.org/wc/exclusion-admin.html. Errors in the robots.txt file can prevent web robots (search engines) from correctly indexing your site. To have your web site indexed properly by the largest number of search engine spiders, you should follow the generic standards. Well-behaved robots are supposed to ignore instructions they do not understand. Some webmasters claim that if a robot cannot understand your robots.txt file, the robot will leave without indexing your site.
In April 2007, Google, Yahoo, and MSN agreed to use a uniform directive in the robots.txt file to discover a web sitemap. A sitemap is an XML file listing web site URLs, with useful search engine metadata (date of last modification and expected frequency of change). A description of the sitemap XML protocol is at http://www.sitemaps.org/protocol.php. The following line is added to the robots.txt file to allow search engines to crawl your web site more efficiently using your sitemap:
Sitemap: http://www.example.com/sitemap.xml
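To illustrate how a robot might pick up that directive, here is a minimal sketch using Python's standard urllib.robotparser module (its site_maps() method requires Python 3.8 or later). The Disallow line is a made-up companion rule; only the Sitemap line comes from the example above.

```python
from urllib.robotparser import RobotFileParser

# A small robots.txt held in memory. The /cgi-bin/ rule is an
# assumption added so the file looks like a typical one.
lines = [
    "User-agent: *",
    "Disallow: /cgi-bin/",
    "Sitemap: http://www.example.com/sitemap.xml",
]

parser = RobotFileParser()
parser.parse(lines)

# site_maps() returns the listed sitemap URLs, or None if there are none.
sitemaps = parser.site_maps()
print(sitemaps)
```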
The robots.txt file must be located in the root directory of your domain (where your home page is), and its file name is always in lowercase letters. Well-behaved spider robots always request the robots.txt file on each visit before requesting other web site pages. If the spider robots cannot find the robots.txt file, a cluster of 404 errors will be found in your log files. This makes it more difficult to trace real 404 errors.
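Because the file always sits at the root of the domain, a robot can work out where to look from any page address. The helper below is a small sketch of that step; the function name robots_url and the sample page address are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit

def robots_url(page_url):
    """Return the robots.txt address for the site that serves page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, replace the path, and drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_url("http://www.example.com/products/widgets.html?id=7"))
```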
Comments can be included in the robots.txt file by placing a pound sign (#) at the start of the line. The wildcard character (*) can be used in the User-agent fields. Robots use a very simple string matching technique, so always match the capitalization of agent names, files, and directories in the robots.txt file. Multiple paths/URLs can be excluded by using several Disallow lines with each robot.
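The sketch below puts those rules together: a comment line, a wildcard User-agent, and several Disallow lines in one record. All paths and the robot name "ExampleBot" are hypothetical, and Python's standard urllib.robotparser stands in for a well-behaved robot.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
# Keep all robots out of these areas (hypothetical paths)
User-agent: *
Disallow: /cgi-bin/
Disallow: /Private/
Disallow: /thankyou.html
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

paths = [
    "/index.html",
    "/cgi-bin/form.cgi",
    "/Private/notes.html",
    "/private/notes.html",  # different capitalization, so not matched
    "/thankyou.html",
]
checks = {path: parser.can_fetch("ExampleBot", path) for path in paths}
for path, allowed in checks.items():
    print(path, "allowed" if allowed else "blocked")
```

Note that /private/notes.html is not blocked by the /Private/ rule in this parser, which is why the capitalization in the file should match the names on the server.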
My research indicates there is conflicting advice about whether specific User-agents (robots) should be listed before or after the wildcard (*) User-agent in the robots.txt file. Some sites suggest that the wildcard (*) setting for all robots can be overridden by listing specific robots later in the robots.txt file. I disagree. The simple standards at http://www.robotstxt.org/wc/norobots.html do not mention overriding a wildcard setting. The FAQs at http://www.robotstxt.org/wc/faq.html#log use an example that lists specific robots before providing directives to any User-agent (with the * symbol). The standards also mention the wildcard character with the following text: "describes the default access policy that HAS NOT MATCHED any of the other records." This would also indicate specific robots should be listed first. I believe most robots read the robots.txt file until they find something that applies to them. If my assumption is correct, a robot that finds the wildcard applying to all search engines will stop searching for its specific name, which may or may not occur later in the robots.txt file. Based on my research, records for banned robots should always be listed before the record that lists off-limit files and directories for all robots. I believe following the standards is the best advice to have your web site indexed correctly by the spider robots.
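One way to explore the ordering question is to test both layouts against a real parser. The sketch below uses Python's standard urllib.robotparser, which treats the wildcard record as a default that is consulted only after the named records, whichever comes first in the file. That is the behavior of this one library, not proof of how any search engine robot reads the file, so listing specific robots first remains the cautious choice. The robot names and paths are hypothetical.

```python
from urllib.robotparser import RobotFileParser

SPECIFIC_FIRST = """\
User-agent: BadBot
Disallow: /

User-agent: *
Disallow: /cgi-bin/
"""

WILDCARD_FIRST = """\
User-agent: *
Disallow: /cgi-bin/

User-agent: BadBot
Disallow: /
"""

def allowed(robots_txt, agent, path):
    """Parse robots_txt and report whether agent may fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)

for label, text in (("specific first", SPECIFIC_FIRST),
                    ("wildcard first", WILDCARD_FIRST)):
    print(label, "-> BadBot may fetch /index.html:",
          allowed(text, "BadBot", "/index.html"))
```

With this parser both layouts ban BadBot. A robot that stops at the first record that applies to it, as assumed above, would only be banned by the first layout.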
SEARCH ENGINE DIFFERENCES
Some search engines use multiple robots to perform different functions. Google currently uses four user-agents. Googlebot indexes web pages for Google.com searches. Googlebot-Image indexes images for Google's image search at http://www.google.com/imghp. To exclude files in your images directory, both user-agents, Googlebot and Googlebot-Image, would need to be included in your robots.txt file.
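A robots.txt that names both Google user-agents might look like the sketch below, with one record per robot. The /images/ path and the "OtherBot" name are assumptions, and Python's urllib.robotparser is used only to check the file. This parser matches agent names loosely (by substring), so it cannot confirm how Google itself resolves the two names.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /images/

User-agent: Googlebot-Image
Disallow: /images/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

for agent in ("Googlebot", "Googlebot-Image", "OtherBot"):
    # Robots without a matching record (and no wildcard record) are allowed.
    print(agent, parser.can_fetch(agent, "/images/logo.gif"))
```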
Currently the main search engines offer a mixture of supported and non-supported directives. Google supports "Allow", "Archive", and "Cache" directives. The Ask robot also supports "Allow". Ask, MSN, and Yahoo support "Crawl-delay". Google does not. "Request-return" is sometimes found in the robots.txt file, but none of the major search engines supports this directive. Google and Yahoo permit wildcards (*) in directory and file names. Others do not. Avoiding the wildcard character (*) in file and directory names in the robots.txt file allows your web site to be properly spidered by all search engines.
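As an illustration of a non-standard directive, the sketch below gives one robot a Crawl-delay and leaves the wildcard record without one. The agent name "Slurp" (assumed here to be Yahoo's robot), the 10-second value, and the /search/ path are assumptions. Python's urllib.robotparser happens to understand Crawl-delay (its crawl_delay() method requires Python 3.6 or later), which says nothing about which search engines honor it.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: Slurp
Crawl-delay: 10
Disallow: /search/

User-agent: *
Disallow: /search/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# The named robot is asked to wait between requests.
print(parser.crawl_delay("Slurp"))
# Robots covered only by the wildcard record get no delay (None).
print(parser.crawl_delay("Googlebot"))
```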
PROPER SYNTAX
Robots can sometimes get stuck in CGI programs, so your CGI directory should always be disallowed in the robots.txt file. Directory names always end with the slash character. If you disallow a directory, all directories and files below it will also be disallowed. If a directive starts with /private (no trailing slash), all files and directories that start with that path/URL will be off-limits to well-behaved bots.
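The difference the trailing slash makes can be checked with a short sketch. It uses Python's standard urllib.robotparser as a stand-in for a well-behaved robot; the paths and the robot name "ExampleBot" are hypothetical.

```python
from urllib.robotparser import RobotFileParser

def allowed(rule, path):
    """Check path against a one-rule robots.txt that applies to all robots."""
    parser = RobotFileParser()
    parser.parse(["User-agent: *", f"Disallow: {rule}"])
    return parser.can_fetch("ExampleBot", path)

paths = [
    "/private/report.html",
    "/private.html",
    "/private-notes/a.html",
    "/public/index.html",
]

# "/private/" blocks only the directory; "/private" blocks every
# path that begins with those characters.
with_slash = {p: allowed("/private/", p) for p in paths}
without_slash = {p: allowed("/private", p) for p in paths}

for p in paths:
    print(p, "| /private/ allows:", with_slash[p],
          "| /private allows:", without_slash[p])
```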
Simple robots.txt online generators can be found at http://www.mcanerin.com/EN/search-engine/robots-txt.asp and http://www.clockwatchers.com/robots_tool.html. Frank Rietta, ASP member, sells his software, RoboGen, for creating robots.txt files. RoboGen is a visual editor using an FTP-like interface, with a database of over 180 search engine user-agents. The best examples I found for manually creating the robots.txt file are at http://www.searchtools.com/robots/robots-txt.html. The page includes good examples, and points out problems using bad examples.
Most robots (and browsers) identify themselves when they request a web page. That information is written to your web server log files. To catch a bad robot, place a special directory on your web site. Only mention that directory in the robots.txt file. If that directory is accessed, it means that either a bad robot is ignoring the robots.txt file, or somebody is surfing the web searching for your private information.
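The trap described above can be checked by scanning the server log for requests to the special directory. The sketch below assumes an Apache "combined" log format and a hypothetical trap directory named /bot-trap/. The two sample log lines are invented for the illustration and are not real traffic.

```python
import re

TRAP_DIR = "/bot-trap/"  # hypothetical directory mentioned only in robots.txt

# Picks the client address, requested path, and user-agent out of an
# Apache "combined" format log line (an assumed log format).
LOG_PATTERN = re.compile(
    r'^(?P<ip>\S+) .*?"(?:GET|HEAD|POST) (?P<path>\S+) [^"]*" '
    r'\d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

def find_trap_hits(log_lines, trap_dir=TRAP_DIR):
    """Return (ip, user_agent) pairs for requests inside the trap directory."""
    hits = []
    for line in log_lines:
        match = LOG_PATTERN.match(line)
        if match and match.group("path").startswith(trap_dir):
            hits.append((match.group("ip"), match.group("agent")))
    return hits

# Invented sample lines for demonstration only.
SAMPLE_LOG = [
    '192.0.2.10 - - [01/Jan/2008:10:00:00 +0000] '
    '"GET /index.html HTTP/1.1" 200 5120 "-" "Mozilla/5.0"',
    '192.0.2.77 - - [01/Jan/2008:10:00:05 +0000] '
    '"GET /bot-trap/index.html HTTP/1.0" 200 312 "-" "EmailCollector/1.0"',
]

for ip, agent in find_trap_hits(SAMPLE_LOG):
    print("Trap accessed by", ip, "identifying as", agent)
```

The addresses and agent names collected this way are candidates for a ban in the .htaccess file, after checking that the visitor was not simply a curious human.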
To view sample robots.txt files, enter the web site domain followed by /robots.txt in your web browser, for example http://www.google.com/robots.txt or http://www.alexa.com/robots.txt. The largest robots.txt file I found was at http://www.whitehouse.gov/robots.txt. One of my robots.txt files is at http://www.wiscocomputing.com/robots.txt.
Google Sitemap software should not be the only software used to check the syntax of your robots.txt file. Google supports many non-standard directives that are ignored by other search engines. You should also check the robots.txt file with generic syntax checkers. Three free generic syntax checkers are found at http://tool.motoricerca.info/robots-checker.phtml, http://www.sxw.org.uk/computing/robots/check.html, and http://www.searchenginepromotionhelp.com/m/robots-text-tester/robots-checker.php
Terry Jepson, www.wiscocomputing.com