The robots.txt file is the first thing search engine "spiders" look for when indexing a website. The robots.txt file tells search engine spiders (robots) which files and/or directories they are NOT allowed to index. This helps to prevent incomplete site indexing as well as prevent exposing the files and directories that you don't want "all over the world wide web". You may disallow, for example, "Google Images" from indexing certain directories and pages on your site, but not block "Google" itself from indexing those same files.
User-agent: * # This is ALL robots (any robot without its own record below)
Disallow: /cgi-bin/ # This means these robots cannot index my "cgi-bin"
Disallow: /myDirectory/secretPage.html # Or my "secretPage.html" inside "myDirectory".

User-agent: Scooter # This is AltaVista's robot
Disallow: /somePage.asp # This means AltaVista is not allowed to index "somePage.asp", but the rest can.

User-agent: Googlebot-Image # This is Google's image search robot.
Disallow: /myImages/ # This means Google Images is not allowed to index my directory "myImages".
Disallow: /myPage-Full-Of-Images.html # Or "myPage-Full-Of-Images.html"

User-agent: WebCrawler # This is WebCrawler's robot
Disallow: / # This means WebCrawler is not allowed to index ANY of my site.
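If you want to check how rules like these are read, here is a minimal sketch using Python's standard urllib.robotparser module. It assumes the image robot's token is spelled "Googlebot-Image" (the name Google documents), and the test URLs such as "/myImages/photo.jpg" are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# The example rules, one record per robot, separated by blank lines.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /myDirectory/secretPage.html

User-agent: Scooter
Disallow: /somePage.asp

User-agent: Googlebot-Image
Disallow: /myImages/
Disallow: /myPage-Full-Of-Images.html

User-agent: WebCrawler
Disallow: /
"""


def build_parser(text):
    """Parse robots.txt text from a string instead of fetching it."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Hypothetical (robot, URL) pairs to test against the rules.
checks = [
    ("Googlebot", "/cgi-bin/script.pl"),
    ("Googlebot", "/myImages/photo.jpg"),
    ("Googlebot-Image", "/myImages/photo.jpg"),
    ("Scooter", "/somePage.asp"),
    ("WebCrawler", "/index.html"),
]
for agent, url in checks:
    verdict = "allowed" if parser.can_fetch(agent, url) else "blocked"
    print(f"{agent:16} {url:24} {verdict}")
```

One thing this sketch makes visible: a robot that has its own record follows only that record. In this parser, Scooter is blocked from "/somePage.asp" but is not bound by the "/cgi-bin/" rule in the "*" record, so repeat any rule you want a named robot to obey inside its own record.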
The above example is heavily commented to help you.
Your generated code will not be. If you want to add comments, however,
make sure you use a # sign in front of each comment. The robots will ignore
everything from the # sign to the end of the line. Make sure there is 1 empty
space in front of and after the # sign.
Personally, I would leave out any comments.
You don't want to make a mistake that could confuse any "spiders" and keep them
from indexing your site altogether.
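If you would rather keep a commented copy for yourself and publish a clean one, a small script can strip the comments for you. This is only a sketch: the function name strip_comments is my own, and it assumes every # starts a comment (a path that really contains a # would have to be written as %23).

```python
def strip_comments(robots_txt):
    """Return robots.txt text with all # comments removed."""
    cleaned = []
    for line in robots_txt.splitlines():
        # Everything from the first '#' to the end of the line is a comment.
        rule = line.split("#", 1)[0].rstrip()
        # Keep rule lines and the blank lines that separate records;
        # drop lines that held nothing but a comment.
        if rule or not line.strip():
            cleaned.append(rule)
    return "\n".join(cleaned)


commented = (
    "User-agent: * # This is ALL robots\n"
    "Disallow: /cgi-bin/ # No robot may index cgi-bin\n"
    "# A comment on a line of its own\n"
    "\n"
    "User-agent: Scooter # This is AltaVista's robot\n"
    "Disallow: /somePage.asp"
)
print(strip_comments(commented))
```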
A robot is a program that automatically traverses the Web's hypertext structure by retrieving a document, and recursively retrieving all documents that are referenced.
Note that "recursive" here doesn't limit the definition to any specific traversal algorithm; even if a robot applies some heuristic to the selection and order of documents to visit and spaces out requests over a long space of time, it is still a robot.
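To make "retrieving a document, and recursively retrieving all documents that are referenced" concrete, here is a toy sketch of a robot. It never touches the network: the PAGES dictionary is a made-up site where each page lists the pages it links to, and the breadth-first order is just one possible traversal, not the one any particular search engine uses.

```python
from collections import deque

# Hypothetical site: each page maps to the pages it links to.
PAGES = {
    "/": ["/about.html", "/products.html"],
    "/about.html": ["/", "/contact.html"],
    "/products.html": ["/products/widget.html", "/about.html"],
    "/products/widget.html": [],
    "/contact.html": ["/"],
}


def crawl(start, fetch_links):
    """Visit start, then every page reachable from it, each exactly once."""
    visited = []          # pages in the order they were retrieved
    seen = {start}        # pages already queued, so loops do not repeat
    queue = deque([start])
    while queue:
        page = queue.popleft()
        visited.append(page)
        for link in fetch_links(page):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


order = crawl("/", lambda page: PAGES.get(page, []))
print(order)
```

A real robot would replace the lambda with a function that downloads the page and extracts its links, and would normally pause between requests.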
Normal Web browsers are not robots, because they are operated by a human, and don't automatically retrieve referenced documents (other than inline images).
Web robots are sometimes referred to as Web Wanderers, Web Crawlers, or Spiders. These names are a bit misleading, as they give the impression that the software itself moves between sites like a virus. This is not the case: a robot simply visits sites by requesting documents from them.
Robots - the generic name.
Spiders - same as robots, but sounds cooler in the press.
Worms - same as robots, although technically a worm is a replicating program, unlike a robot.
Web crawlers - same as robots, but note WebCrawler is a specific robot.
WebAnts - distributed cooperating robots.
User-agent: *
Disallow: /
The "User-agent: *" means this section applies to all robots. The "Disallow: /" tells the robot that it should not visit any pages on the site.
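As a quick check of the two lines above, this sketch feeds them to Python's standard urllib.robotparser and asks whether a few robots may fetch a page. "SomeOtherBot" and "/index.html" are made-up names for illustration.

```python
from urllib.robotparser import RobotFileParser

# The two-line file that shuts every robot out of the whole site.
parser = RobotFileParser()
parser.parse(["User-agent: *", "Disallow: /"])

for agent in ("Googlebot", "Scooter", "SomeOtherBot"):
    allowed = parser.can_fetch(agent, "/index.html")
    print(agent, "allowed" if allowed else "blocked")
```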
There are two important considerations when using /robots.txt:
Robots can ignore your /robots.txt. In particular, malware robots that scan the web for security vulnerabilities, and email address harvesters used by spammers, will pay no attention to it.
The /robots.txt file is a publicly available file. Anyone can see what sections of your server you don't want robots to use.
You can use a special HTML <META> tag to tell robots not to index the content of a page, and/or not scan it for links to follow.
There are two important considerations when using the robots <META> tag:
Robots can ignore your <META> tag. In particular, malware robots that scan the web for security vulnerabilities, and email address harvesters used by spammers, will pay no attention to it.
The NOFOLLOW directive only applies to links on this page. It's entirely likely that a robot will find the same links on some other page without a NOFOLLOW (perhaps on some other site), and so still arrive at your undesired page.
Don't confuse this NOFOLLOW with the rel="nofollow" link attribute.
How to write a Robots Meta Tag
Where to put it - Like any <META> tag it should be placed in the HEAD section of an HTML page, for example <META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">. You should put it in every page on your site, because a robot can encounter a deep link to any page on your site.
What to put into it - The "NAME" attribute must be "ROBOTS".
Valid values for the "CONTENT" attribute are: "INDEX", "NOINDEX", "FOLLOW", "NOFOLLOW". Multiple comma-separated values are allowed, but obviously only some combinations make sense. If there is no robots <META> tag, the default is "INDEX,FOLLOW", so there's no need to spell that out. That leaves three combinations worth writing: "NOINDEX,FOLLOW", "INDEX,NOFOLLOW" and "NOINDEX,NOFOLLOW".
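Here is a rough sketch of how a well-behaved robot might read the robots <META> tag, using Python's standard html.parser module. The class and function names are my own, the sample pages are made up, and the sketch simply falls back to "INDEX,FOLLOW" when a value is missing, as described above.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect INDEX/NOINDEX and FOLLOW/NOFOLLOW from a robots META tag."""

    def __init__(self):
        super().__init__()
        self.index = True    # default when no tag is present
        self.follow = True   # default when no tag is present

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        for value in (attributes.get("content") or "").split(","):
            value = value.strip().upper()
            if value == "NOINDEX":
                self.index = False
            elif value == "INDEX":
                self.index = True
            elif value == "NOFOLLOW":
                self.follow = False
            elif value == "FOLLOW":
                self.follow = True


def robots_directives(html):
    """Return (may_index, may_follow) for one HTML page."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.index, parser.follow


blocked_page = (
    '<html><head><META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">'
    "</head><body>Private</body></html>"
)
plain_page = "<html><head><title>Hi</title></head><body>Public</body></html>"

print(robots_directives(blocked_page))
print(robots_directives(plain_page))
```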
The rel="nofollow" attribute can be set on an HTML <a> link tag. It was invented by Google and adopted by others. Links marked this way won't get any credit when Google ranks websites in the search results, which removes the main incentive behind blog comment spam robots.
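To see the difference between ordinary links and rel="nofollow" links in practice, this sketch sorts the links on a page into two lists with Python's standard html.parser. The class name, the sample comment and the example.com / example.org addresses are all made up for illustration; how a search engine actually weighs each link is up to that engine.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Split a page's links into ordinary ones and rel="nofollow" ones."""

    def __init__(self):
        super().__init__()
        self.followed = []
        self.nofollow = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        href = attributes.get("href")
        if not href:
            return
        # rel can hold several space-separated tokens, e.g. "nofollow ugc".
        rel_tokens = (attributes.get("rel") or "").lower().split()
        if "nofollow" in rel_tokens:
            self.nofollow.append(href)
        else:
            self.followed.append(href)


# Hypothetical blog comment containing one trusted and one untrusted link.
comment_html = (
    '<p>See <a href="https://example.org/docs">the docs</a> and '
    '<a href="https://example.com/spam" rel="nofollow">cheap pills</a>.</p>'
)

collector = LinkCollector()
collector.feed(comment_html)
print("followed:", collector.followed)
print("nofollow:", collector.nofollow)
```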