Friday, June 6, 2008

Google, Yahoo and Live Search Robots Exclusion Protocol

Wikipedia.org defines it as follows: "The robot exclusion standard, also known as the Robots Exclusion Protocol or robots.txt protocol, is a convention to prevent cooperating web spiders and other web robots from accessing all or part of a website which is otherwise publicly viewable."

In layman's terms, the robot exclusion standard, also known as the Robots Exclusion Protocol or robots.txt protocol, is a way to tell search engine spiders which parts of a website they may or may not access.

Earlier this week, Microsoft announced that, together with Google and Yahoo, it would offer insight into how each of them implements the protocol.

This means that webmasters will be able to reap the benefits of a common implementation of REP across Google, Yahoo and Live Search.

Common REP Directives
The following are the major REP features currently implemented by Google, Microsoft, and Yahoo!.


1. Robots.txt Directives

Directive: Disallow

Impact : Tells a crawler not to crawl your site or parts of your site; your site's robots.txt still needs to be crawled to find this directive, but the disallowed pages will not be crawled.

Use Cases: 'No crawl' pages from a site. In its default syntax, this directive prevents specific path(s) of a site from being crawled.
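For example, a minimal robots.txt (the directory name here is purely illustrative) that keeps all compliant crawlers out of a /private/ folder could look like this:

User-agent: *
Disallow: /private/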

Directive: Allow
Impact : Tells a crawler the specific pages or paths on your site that it may crawl; you can use this in combination with Disallow. If both Disallow and Allow clauses apply to a URL, the most specific rule (the longest rule) applies.

Use Cases: This is particularly useful in conjunction with Disallow clauses, where a large section of a site is disallowed except for a small section within it.
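As a quick sketch (the paths are illustrative), the following disallows everything under /archive/ while the longer, more specific Allow rule re-opens /archive/public/:

User-agent: *
Disallow: /archive/
Allow: /archive/public/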


Directive: $ Wildcard Support

Impact : Tells a crawler to match patterns against the end of a URL, so a large number of files or directories can be covered without specifying specific pages.

Use Cases: 'No Crawl' files with specific patterns, e.g., files of certain types that always have the same extension, say '.pdf'.
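For instance, a rule like the following (the extension is just an example) should block every URL ending in '.pdf':

User-agent: *
Disallow: /*.pdf$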

Directive: * Wildcard Support

Impact : Tells a crawler to match any sequence of characters (available by end of June)

Use Cases: 'No Crawl' URLs with certain patterns, e.g., disallow URLs with session IDs or other extraneous parameters.
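As an illustrative sketch (the parameter name 'sessionid' is hypothetical), the following would block any URL containing that string:

User-agent: *
Disallow: /*sessionid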

Directive: Sitemaps Location

Impact : Tells a crawler where it can find your sitemaps.

Use Cases: Point crawlers to the location(s) of the sitemap feeds that describe the site's content, including feeds hosted elsewhere.
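For example, a single line in robots.txt can advertise where the sitemap lives (the URL is illustrative):

Sitemap: http://www.example.com/sitemap.xml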

2. HTML META Directives

Directive: NOINDEX META Tag

Impact : Tells a crawler not to index a given page

Use Cases: Don't index the page. This allows pages that are crawled to be kept out of the index.
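A minimal example, placed inside the page's <head> section:

<meta name="robots" content="noindex">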

Directive: NOFOLLOW META Tag

Impact : Tells a crawler not to follow a link to other content on a given page

Use Cases: Prevent publicly writable areas from being abused by spammers looking for link credit. With NOFOLLOW, you let the robot know that you are discounting all outgoing links from this page.
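For example, on a page full of user-submitted links:

<meta name="robots" content="nofollow">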

Directive: NOSNIPPET META Tag

Impact : Tells a crawler not to display snippets in the search results for a given page

Use Cases: Present no abstract for the page in search results.
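For example:

<meta name="robots" content="nosnippet">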

Directive: NOARCHIVE / NOCACHE META Tag

Impact : Tells a search engine not to show a "cached" link for a given page

Use Cases: Do not make a copy of the page available to users from the Search Engine cache.
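For example, using the NOARCHIVE form:

<meta name="robots" content="noarchive">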

Directive: NOODP META Tag

Impact : Tells a crawler not to use a title and snippet from the Open Directory Project for a given page

Use Cases: Do not use the ODP (Open Directory Project) title and abstract for this page in Search.
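For example, and note that several of these values can be combined in one tag:

<meta name="robots" content="noodp, noarchive">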

In addition to the above, there are other directives supported only by Google:

UNAVAILABLE_AFTER Meta Tag - Tells a crawler when a page should "expire", i.e., after which date it should not show up in search results.

NOIMAGEINDEX Meta Tag - Tells a crawler not to index images for a given page in search results.

NOTRANSLATE Meta Tag - Tells a crawler not to translate the content on a page into different languages for search results.
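As a rough sketch of how these look (the date shown is purely illustrative, and the tags can be addressed to 'robots' or specifically to 'googlebot'):

<meta name="googlebot" content="unavailable_after: 31-Dec-2008 15:00:00 GMT">
<meta name="googlebot" content="noimageindex">
<meta name="googlebot" content="notranslate">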