The purpose of a robots.txt file, which implements the robots exclusion protocol, is to give webmasters control over which pages robots (commonly called spiders) can crawl and index on their site. A typical robots.txt file, placed in the root of your site's server, should include your sitemap's URL along with any other directives you wish to put in place.

Before a robot visits a page on your website, it checks your robots.txt file (placed at www.domain.com/robots.txt – the filename is case sensitive, so if you call it Robots.TXT it won't work). Suppose your robots.txt file contains the following exclusion:

User-agent: *

Disallow: /
The ‘User-agent: *’ tells the robot that this rule applies to all robots, not just search engine bots such as Googlebot.

The ‘Disallow: /’ tells the robot that it is not allowed to visit any pages on this domain. Be careful which directives you set when creating your robots.txt file: if it looks like the example above, your website won't be crawled by Google at all!

Note: Some robots will ignore your robots.txt file, as it is only a directive, and will still access pages on your site regardless. These are normally malicious bots that may harvest information from your site. Even if you add a section to your robots.txt file to exclude a specific malicious bot, it is unlikely to work, because these robots usually ignore the file altogether. Blocking the robot's IP address is an option, but as these spammers often use different IP addresses it can be a tiresome process.

Why have a robots.txt file?

Some webmasters think that because they want all robots to be able to crawl their entire site they don't need a robots.txt file, but this is not the case. Your robots.txt file should contain the location of your sitemap so that spiders, especially search engine spiders, can find all the pages on your site more easily. You will also need a robots.txt file if you are developing a new site that is live on your server but that you don't want Google to index yet. If you are using a robots.txt file, be sure that you understand what you are excluding from being crawled, as it only takes one mistake for your entire site not to be crawled!
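
For example, a minimal robots.txt for a live site might declare the sitemap and keep crawlers out of a section that is still being built (the /staging/ folder here is purely hypothetical):

User-agent: *
Disallow: /staging/
Sitemap: http://www.domain.com/sitemap.xml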

Limitations of Robots.txt

  • Security

It is important to remember that using your robots.txt file as a means to protect and hide confidential information is not only bad practice but could also breach the Data Protection Act if the information is stored inappropriately. Your robots.txt file can be accessed by anyone, not only robots, so if you have any information on your site that should only be viewed by the people it is meant for, the most secure approach is to password protect the page or document.

  • The instructions in your robots.txt file are directives only
    The instructions that you declare in your robots.txt file are requests rather than rules: they state which crawlers may and may not access parts of your site, but they cannot enforce that behaviour. Legitimate crawlers such as Googlebot and other search engine crawlers will obey the rules you have stated in your robots.txt file, but other crawlers may simply ignore them or never look at the file at all.
  • The syntax in your robots.txt can be interpreted differently by different crawlers

It is important that when creating your robots.txt file you know the correct syntax for addressing specific web crawlers, as directives that are easy for Googlebot to read may not be understood by other web crawlers, meaning they may not follow the instructions you have put in place (see the per-crawler sketch after this list).

  • The directives in your robots.txt file will not prevent your URL from being referenced on other sites

Google will follow the directives in your robots.txt file, meaning that any files you have disallowed won't be crawled or indexed; however, this will not remove all traces of your URL from Google altogether. References to your site on other sites, such as directory listings and anchor text on other web pages, will still appear in the Google search results, as you cannot make changes on other sites by using your robots.txt. However, to prevent your URL from appearing anywhere in the Google SERPs you can use a combination of URL blocking methods, such as password protection and indexing directive meta tags in your HTML, alongside disallowing crawler access in your robots.txt.
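
As a sketch of how specific crawlers are addressed, each user agent gets its own group of directives, and a crawler follows only the group that matches its name most closely (the folder names below are hypothetical):

User-agent: Googlebot
Disallow: /print-versions/

User-agent: *
Disallow: /print-versions/
Disallow: /search-results/

In this sketch Googlebot follows only the group addressed to it, so it can still crawl /search-results/, while all other compliant crawlers are excluded from both folders.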

Robots.txt Options

You have a series of options when it comes to your robots.txt file and what you want it to contain; below are some examples that may help you create yours!

Case Sensitivity
Robots.txt directives are case sensitive, so if you disallow /logo-image.gif the directive would block http://www.domain.com/logo-image.gif, but http://www.domain.com/Logo-Image.gif would still be accessible to robots.

Allow all robots to crawl your whole site
User-agent: *
Disallow:

Exclude all robots (including Googlebot – remember that malicious bots may ignore this) from your whole site
User-agent: *
Disallow: /

Exclude a specific robot from a specific folder/file on your website
User-agent: Examplebot
Disallow: /no-robots/

Note: You can only have one folder/file per “Disallow:” line; if there is more than one location you want to exclude, you will have to add extra Disallow lines, as in the example below.
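
For example, to keep Examplebot out of two locations (the folder names are purely illustrative) you would list each on its own Disallow line:

User-agent: Examplebot
Disallow: /no-robots/
Disallow: /private-files/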

Allow one specific robot and exclude all other robots
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /

Exclude a specific robot
User-agent: SpamBot
Disallow: /

Declaring your sitemap in your robots.txt file
User-agent: *
Disallow:
Sitemap: http://www.domain.com/sitemap.xml

Note: The sitemap declaration needs to be an absolute URL, not a relative URL.
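
If you have more than one sitemap, such as a separate image sitemap, you can declare each one on its own Sitemap line (the second URL below is hypothetical):

Sitemap: http://www.domain.com/sitemap.xml
Sitemap: http://www.domain.com/image-sitemap.xml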

Exclude all robots from a whole folder apart from one file/image
User-agent: *
Disallow: /my-photos/
Allow: /my-photos/logo.jpg

Robots.txt Wildcard Directive

Search engines such as Google and Bing allow the use of wildcards in robots.txt files so that you don't have to list a multitude of URLs that all contain the same characters.

Disallow: *mobile

The above directive would block crawlers from accessing any URL on your website that contains the term ‘mobile’, such as:

  • /mobile
  • /services/mobile-optimisation
  • /blog/importance-of-mobile-ppc-bidding
  • /images/mobile.jpg
  • /phone/mobile34565.html
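
If you wanted to keep one of those URLs crawlable, you could pair the wildcard with a more specific Allow line, as in the sketch below; Google treats the longer, more specific rule as taking precedence, but support for this varies between crawlers, so it is worth testing:

User-agent: *
Disallow: *mobile
Allow: /services/mobile-optimisation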

Another wildcard character that you can use in your robots.txt is “$”, which matches the end of a URL.

Disallow: *.gif$

The example directive blocks crawlers from accessing any URL that ends with the file type “.gif”. Because the “$” anchors the rule to the end of the URL, it will not match file paths that merely contain “.gif” somewhere in the middle; without the “$”, the rule would. Wildcards can be extremely powerful and should be used carefully, as one broad rule can block far more than you intend.
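
To illustrate with some hypothetical paths, Disallow: *.gif$ would block:

  • /images/logo.gif
  • /my-photos/holiday.gif

but would not block /my-files.gif/blog-posts, because that URL does not end in “.gif”.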

Testing your robots.txt with Webmaster Tools

If you have an account with Webmaster Tools and have verified your URL, you are able to use the robots.txt Tester tool. Using the tool you can test changes to your robots.txt and see their impact before you set the file live. You can also see previous versions of your file and find out which line in your robots.txt is blocking a certain page, which can prevent you from making mistakes and losing traffic or revenue.

You can also enter a URL to check whether it is blocked by a directive in your robots.txt file and easily change it accordingly. The tool can be found in the Crawl dropdown in Webmaster Tools, so check yours now!

Meta Robots Tag

In terms of SEO, if you want to stop Google from indexing a specific page on your website and showing it in its search results pages, then it is best practice to use a meta robots tag: this tells crawlers that they are allowed to access the page but must not show it in the SERPs. Your meta robots tag should look like this and be placed in the <head> section of your website:

<meta name="robots" content="noindex">

If you want to disallow a crawler from indexing the content on your page and prevent it from following any of the links, your meta robots tag would look like this:

<meta name="robots" content="noindex, nofollow">

An overview of the main meta robots tag commands available:

  • Index – All search engines are able to index the content on this webpage
  • Follow – All search engines are able to crawl through the internal links on the webpage
  • Noindex – will prevent the designated page from being included in the index
  • Nofollow – will prevent Google bots from following any links on the page. Note that this is different to the rel="nofollow" link attribute.
  • Noarchive – prevents cached versions of the page from showing in the SERPs
  • Nosnippet – prevents the page being cached and descriptions appearing below the page in the SERPs
  • NOODP – prevents the Open Directory Project description for the page replacing the description manually set for this page
  • Noimageindex – prevents Google indexing of the images on the page
  • Notranslate – prevents the page being translated in the Google SERPs

You can use multiple commands in your meta robots tag. If you want to prevent a page on your website from being cached by all search engines and also prevent Open Directory descriptions replacing your current descriptions, you would use the following commands: noarchive and NOODP. Your meta robots tag would look like this:

<meta name="ROBOTS" content="NOARCHIVE,NOODP">

If you want crawlers not to index a webpage but still to follow the internal links on that page, your meta robots tag would look like this. This is the advised SEO approach, because even though the page itself is not indexed, the link equity from its links still flows through to the rest of the site.

<meta name="robots" content="noindex, follow"/>

Meta Robots tag vs Robots.txt

In general terms, if you want to deindex a page or directory from Google's search results, we suggest using a “noindex” meta tag rather than a robots.txt directive. With this method, the next time your site is crawled the page will be deindexed, meaning that you won't have to send a URL removal request. However, you can still use a robots.txt directive coupled with a Webmaster Tools page removal to accomplish this.

Using a meta robots tag with the ‘follow’ command also ensures that your link equity is not lost.

Robots.txt files are best for disallowing a whole section of a site, such as a category, whereas a meta tag is more efficient at disallowing single files and pages. You could choose to use both a meta robots tag and a robots.txt file, as neither has authority over the other, but “noindex” always has authority over “index” requests.
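
As a hypothetical illustration of combining the two approaches, you might disallow a whole category in your robots.txt file:

User-agent: *
Disallow: /old-category/

and add a meta robots tag to an individual page that you want crawled but not indexed:

<meta name="robots" content="noindex, follow">

Bear in mind that a crawler can only read a meta tag on a page it is allowed to crawl, so the noindex tag belongs on pages that are not disallowed in your robots.txt file.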