Want to control those pesky bots? There are techniques available to help you control what they crawl and what they index. This article covers the basics and points you to some resources for more information.
Robots.txt Overview
The most basic technique is the “robots.txt” file. This file allows you to tell search engine robots which parts of your site they cannot crawl. To start, create a file called robots.txt; it must live in the root directory of your domain. This means that if your site is “www.yourdomain.com”, the robots.txt file must be located at “www.yourdomain.com/robots.txt”. Do not place it anywhere else, because it will have no effect there.
The basic technique is simple. To exclude all bots from your server, structure your robots.txt as follows:
User-agent: *
Disallow: /
You can exclude only certain bots by specifying the bot’s name on the User-agent line instead of using “*” to indicate all bots. You can also protect only certain directories, with a file similar to this one:
User-agent: *
Disallow: /cgi-bin/
Disallow: /php/
The definitive definitions for the robots.txt file can be found at this location.
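If you want to verify how your rules will be interpreted before relying on them, you can test them locally. Here is a minimal sketch using Python’s standard urllib.robotparser module; the bot name “BadBot”, the “www.yourdomain.com” URLs, and the sample rules are purely illustrative assumptions, not taken from any real robots.txt file.

import urllib.robotparser

# Sample rules: shut one named bot out entirely, and keep all other
# bots out of /cgi-bin/ and /php/ (illustrative names only).
rules = """\
User-agent: BadBot
Disallow: /

User-agent: *
Disallow: /cgi-bin/
Disallow: /php/
"""

parser = urllib.robotparser.RobotFileParser()
parser.parse(rules.splitlines())

# The named bot is blocked everywhere; other bots are only kept
# out of the two listed directories.
print(parser.can_fetch("BadBot", "http://www.yourdomain.com/index.html"))      # False
print(parser.can_fetch("OtherBot", "http://www.yourdomain.com/index.html"))    # True
print(parser.can_fetch("OtherBot", "http://www.yourdomain.com/cgi-bin/form"))  # False

This also illustrates the point made above about naming a specific bot on the User-agent line.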
Limitations of Robots.txt
Robots.txt is only obeyed by “well behaved” bots. It will not stop a competitor from crawling your site, or some other party from mounting a malicious attack on your domain. You need to protect yourself from these kinds of problems by other means.
In addition, the fact that a search engine bot is not supposed to crawl your page does not mean it will not index it. It still may. Google, for example, will still index a page it is not supposed to crawl if there is a link to that page from another site. If you look through Google search results you will sometimes see results that show just the URL, with no title or description. That is a sure sign of a page that has been excluded by robots.txt but is linked to from somewhere else.
Robots Metatags
The Robots metatags are implemented within each web page. There are two parameters: Index/Noindex and Follow/Nofollow. Index relates to whether or not the page should be indexed. Follow relates to whether or not the links on the page should be followed. Like all metatags, this one should appear in the <head> section of your web page. Here is the basic syntax:
<meta name="robots" content="noindex,nofollow">
While you can specify “index” or “follow”, there is no need to do so, as these are the defaults for every page on your site; this is what a search engine will do if it finds no “robots” metatag. Here is the beauty of this metatag: search engines are not supposed to index a page marked “noindex”, even if another site links to it (and Google does obey this rule).
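To see how a crawler that honors the metatag might read it, here is a minimal sketch using Python’s standard html.parser module. The RobotsMetaParser class and the sample page are hypothetical, purely for illustration; real search engine crawlers obviously do far more than this.

from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collects the directives from any <meta name="robots"> tag on a page."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        if attrs.get("name", "").lower() == "robots":
            # Split "noindex,nofollow" style values into individual directives.
            self.directives += [d.strip().lower() for d in attrs.get("content", "").split(",")]

# A hypothetical page that asks not to be indexed but allows its links to be followed.
page = '<html><head><meta name="robots" content="noindex,follow"></head><body>...</body></html>'

parser = RobotsMetaParser()
parser.feed(page)

print("noindex" in parser.directives)   # True  - leave this page out of the index
print("nofollow" in parser.directives)  # False - links on the page may still be followed

Note that the crawler has to fetch the page before it can see this tag, which is exactly why the robots.txt interaction described below matters.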
Be careful though, the robots metatag is new, and not all search engines obey it. For example, I learned in a discussion with Matt Cutts (of Google) that if you block crawling of a page in the robots.txt file, the robots metatags on that page are effectively ignored: the crawler never fetches the page, so it never sees the metatags, and the page may still be indexed even if its robots metatag specifies noindex. So if you truly do not want a page indexed by Google, do not mention it in the robots.txt file and rely on the robots metatag only.
The definitive definitions for the robots metatags can be found here. Read this article for more information on metatags and SEO considerations.
Have fun controlling those bots!