The robots.txt file is a simple text file, located in your website’s root directory, that tells search engines which pages of your site should be crawled and indexed and which should be ignored. The robots.txt file is important for search engine optimization; however, many websites either have no robots.txt file or have one that contains errors. Such errors can cause a search engine to overlook significant pages on your website, or prevent it from indexing your site entirely. If your site lacks a robots.txt file, some web hosts serve a custom error page at that URL instead of a plain 404 response, which can throw search engines off track.
The URL of your robots.txt file is http://www.yourwebsite.com/robots.txt. The file can be left completely blank, or it can contain a simple text command consisting of two lines:
User-agent: * (The * is a wildcard indicating that the rule applies to all search engine crawlers.)
Disallow: (Here you list any directories or pages that you do not want the search engine to index. Leaving this blank allows all pages to be crawled, while placing a “/” blocks the entire site.)
Things to avoid in a robots.txt file:
The names of directories and pages must be written exactly as they appear on your website, because they are case-sensitive. Avoid stray blank lines within a group of rules, since a blank line signals the start of a new record, and begin any comment with a “#” character. The original standard defines no “Allow” command (although major crawlers such as Googlebot support one), so simply list the files or folders you do not want crawled. Finally, don’t change the order of the commands (“User-agent” first, then its “Disallow” lines) and don’t put more than one directory or page on a single “Disallow” command line.
Working examples of a robots.txt file:
You can use the robots.txt file to block search engines from any private areas of your site that you do not want to appear in search engine results. For example, if you do not want the public to access the technical support pages or contract pages on your site, you would include the following in the robots.txt file:
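For instance (the directory names below are examples; substitute the actual folder names exactly as they appear on your site):

User-agent: *
Disallow: /technical-support/
Disallow: /contracts/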
You can also type in names of specific pages you want to hide from search engines using the following:
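For example (the file name below is a placeholder; replace it with the exact path of the page you want to hide):

User-agent: *
Disallow: /technical-support/contact-form.html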
You can stop all pages of the site from being indexed by using the following:
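User-agent: *
Disallow: /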
You can allow all search engine spiders to index all files and folders of your website using the following:
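User-agent: *
Disallow: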
Sitemap linking in a robots.txt file
There is one final command you can use in a robots.txt file, and it has to do with the sitemap.xml file. The sitemap allows a webmaster to inform search engines of the URLs on a website that are available for crawling, when each was last updated, how often it changes, and how important it is relative to the other URLs on your site.
You can allow the indexing of all pages with all spiders and provide a link to a sitemap file using the following:
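Using the example domain from earlier (adjust the sitemap URL to match your own site):

User-agent: *
Disallow:
Sitemap: http://www.yourwebsite.com/sitemap.xml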