Managing Robots' Access To Your Website
by Vanessa Fox ~ June 4, 2008
Controlling which parts of your content search engines can find and index is crucial for many websites. Fortunately, the major search engines and other well-behaved robots observe the Robots Exclusion Protocol (REP), which has evolved organically since the early 1990s to provide a set of controls over what parts of a website search engine robots can crawl and index.
Capabilities of the REP
The Robots Exclusion Protocol provides controls that can be applied at the site level (robots.txt), at the page level (a META tag or the X-Robots-Tag HTTP header), or at the HTML element level, to control both the crawl of your site and the way it's listed in the search engine results pages (SERPs). The full article, linked at the end, includes a table listing the common scenarios, directives, and which search engines support them.
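For instance, a page-level directive can be written as a META tag in the page's HTML or, for non-HTML files such as PDFs, as an HTTP response header (check each engine's documentation for which headers it supports); a minimal sketch:

<!-- In the page's HTML head: keep this page out of the index, but follow its links. -->
<meta name="robots" content="noindex, follow">

Or the equivalent HTTP response header:

X-Robots-Tag: noindex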
Deciding What Should be Public vs. Private
One of the first steps in managing the robots is knowing what type of content should be public vs. private. Start with the assumption that by default, everything is public, then explicitly identify the items that are private.
If you want search engines to access all the content on your site, you don't need a robots.txt file at all. When a search engine tries to access the robots.txt file on your site and the server can't return one (ideally by returning a 404 HTTP status code), the search engine treats this the same as a robots.txt file that allows access to everything.
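If you prefer to be explicit, the equivalent allow-everything robots.txt is trivial; an empty Disallow value blocks nothing:

# Allow all robots to crawl the entire site.
user-agent: *
disallow: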
Every website and every business has a different set of needs, so there's no blanket rule for what to make private, but some common elements may apply; several are illustrated in a combined sketch after this list.
Private data - You may have content on your site that you don't want searchable in search engines. For instance, you may have private user information (such as addresses) that you don't want surfaced. For this type of content, you may want a more secure approach that keeps all visitors away from the pages (such as password protection). However, some types of content are fine for visitor access, but not search engine access. For instance, you may run a discussion forum that is open for public viewing, but you may not want individual posts to appear in search results for forum members' names.
Non-content content - Some content, like images used for navigation, provides little value to searchers. It's not harmful to include these items in search engine indices, but since search engines allocate limited bandwidth to crawl each site and limited space to store content from each site, it may make sense to block these items to help direct the bots to the content on your site that you do want indexed.
Printer-friendly pages - If you have specific pages (URLs) that are formatted for printing, you may want to block them to avoid duplicate content issues. The drawback to allowing the printer-friendly page to be indexed is that it could potentially be listed in the search results instead of the default version of the page, which wouldn't provide an ideal user experience for a visitor coming to the site through search.
Affiliate links and advertising - If you include advertising on your site, you can keep search engine robots from following the links by pointing them at an intermediate URL that is blocked by robots.txt and redirects visitors on to the destination page. (There are other methods for implementing advertising-based links as well.)
Landing pages - Your site may include multiple variations of entry pages used for advertising purposes. For instance, you may run AdWords campaigns that link to a particular version of a page based on the ad, or you may print different URLs for different print ad campaigns (either for tracking purposes or to provide a custom experience related to the ad). Since these pages are meant to be an extension of the ad, and are generally near duplicates of the default version of the page, you may want to block these landing pages from being indexed.
Experimental pages - As you try new ideas on your site (for instance, using A/B testing), you likely want to block all but the original page from being indexed during the experiment.
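A combined sketch of several of the scenarios above (every path here is hypothetical and would need to match your site's actual URL structure):

user-agent: *
# Forum member profile pages
disallow: /members/
# Printer-friendly versions of pages
disallow: /print/
# Intermediate redirect URLs used for advertising and affiliate links
disallow: /go/
# Campaign-specific landing pages
disallow: /landing/
# Experimental page variants
disallow: /test-b/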
Implementing the REP
REP is flexible and can be implemented in a number of ways. This flexibility lets you easily specify policies for your entire site (or subdomain) and then refine them more granularly at the page or link level as needed.
Site Level Implementation (Robots.txt)
Site-wide directives are stored in a robots.txt file, which must be located in the root directory of each domain or subdomain (e.g., http://janeandrobot.com/robots.txt). Note that robots.txt files only apply to the hostname where they are placed, and do not apply to subdomains. So a robots.txt file located at http://microsoft.com/robots.txt will not apply to the MSDN subdomain http://msdn.microsoft.com. However, the robots.txt file does apply to all subfolders and pages within the specified hostname.
A robots.txt file is a UTF-8 encoded text file containing entries, each of which consists of a user-agent line (identifying the robots the entry is directed at) and one or more directives that specify content those robots are blocked from crawling or indexing. A simple robots.txt file is shown below.
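This example (the /private/ path is a placeholder) allows all robots everywhere except one directory:

# Allow all robots everywhere except the hypothetical /private/ directory.
user-agent: *
disallow: /private/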
user-agent: - Specifies which robots the entry applies to.
* Set this to * to specify that this entry applies to all search engine robots.
* Set this to a specific robot name to provide instructions for just that robot. You can find a complete list of robot names at robotstxt.org.
* If you direct an entry at a particular robot, then it obeys that entry instead of any entries defined for user-agent: * (rather than in addition to those entries).
The major search engines have multiple robots that crawl the web for different types of content (such as images or mobile). They generally begin all of their robots' names with the same prefix, so if you block the primary robot, all robots for that search engine are blocked as well. However, if you want to block only a more specific robot, you can block it directly and still allow web crawl access (as in the sketch after this list).
* Google - The primary search engine robot is Googlebot.
* Yahoo! - The primary search engine robot is Slurp.
* Live Search - The primary search engine robot is MSNbot.
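For example, this sketch blocks only Google's image crawler; Googlebot's web crawl is unaffected:

# Googlebot-Image obeys this entry instead of any user-agent: * entry.
user-agent: Googlebot-Image
disallow: /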
Disallow: - Specifies what content is blocked.
* Must begin with a slash (/).
* Blocks access to any URLs that begin with the characters after the /. For instance, Disallow: /images blocks access to /images/, /images/image1.jpg, and /images10.
You can specify other rules for search engine robots in addition to the standard instructions that block access to content, as described in the full article's section on other robot instructions.
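One sketch of such additional rules (the Sitemap URL is a placeholder, and note that Google ignores crawl-delay):

# Point robots at the site's XML Sitemap.
sitemap: http://www.example.com/sitemap.xml
# Ask Yahoo!'s crawler to wait 5 seconds between requests.
user-agent: Slurp
crawl-delay: 5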
Some things to note about robots.txt implementation:
* The major search engines support pattern matching: the asterisk (*) for wildcard matches and the dollar sign ($) for end-of-URL matches, as described in the full article's section on pattern matching and sketched after this list.
* The robots.txt file is case sensitive, so Disallow: /images would block http://www.example.com/images but not http://www.example.com/Images.
* If conflicts exist in the file, the robot obeys the longest (and therefore generally more specific) line.
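A sketch of those pattern-matching extensions (the URL patterns are hypothetical):

user-agent: *
# Block any URL containing a session ID parameter (* is a wildcard).
disallow: /*sessionid=
# Block any URL ending in .pdf ($ anchors the end of the URL).
disallow: /*.pdf$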
Block all robots - Useful when your site is in pre-launch development and isn't ready for search traffic.
# This keeps out all well-behaved robots.
# Disallow: * is not valid.
user-agent: *
disallow: /
See the rest of the article for graphs and images: http://janeandrobot.com/post/Managing-Robots-Access-To-Your-Website.aspx