robots.txt

A robots.txt file tells search engine bots what to crawl and index from your website (and what to leave alone).

When to use a robots.txt file

Most websites that rely on search engine traffic (e.g. Google, Bing) do not need a robots.txt file.

Certain use cases do call for a robots.txt file. For example:

  • To block the indexing of certain URLs. However, it's recommended to use the noindex meta tag or the X-Robots-Tag response header instead of a robots.txt file (see the examples after this list).

  • To block the indexing of media files or resources, such as PDF documents, PPTX presentations, DOCX documents, etc.

    To instruct bots not to crawl any resources under an /images folder, use the Disallow directive:

      User-agent: *
      Disallow: /images

  • To have a level of control over requests made by bots when they crawl your website. If you experience issues due to the volume of bot requests, a robots.txt file can be useful for controlling how fast or how often a bot crawls the website. You can use the Crawl-delay directive for this purpose (note the Limitations of a robots.txt file below):

      User-agent: *
      Crawl-delay: 10
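
For reference, here is what the recommended noindex alternatives from the first bullet look like. The meta tag goes in the <head> of the page you want kept out of the index:

<meta name="robots" content="noindex">

For non-HTML resources such as PDF documents, the same signal can be sent as an HTTP response header. How you set the header depends on your web server; as one illustration (adapt it to your own setup), an nginx configuration could look like this:

# Send the X-Robots-Tag header for all PDF responses
location ~* \.pdf$ {
  add_header X-Robots-Tag "noindex";
}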

When not to use a robots.txt file

Generally, you shouldn't use a robots.txt file for any of the following use cases:

  • Blocking the indexing of certain URLs. For example, Google may still index a blocked page's URL if that page is linked to from another page with descriptive link text. It's best to use the noindex meta tag or the X-Robots-Tag response header instead (see the examples above).
  • Trying to block bad bots from crawling your website. Bad bots are likely to ignore the rules in your robots.txt file (see the illustration after this list).
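
For illustration, a rule like the one below only keeps out bots that choose to honor it (the "BadBot" user-agent name is made up for this example); a misbehaving crawler will simply ignore the file and request your pages anyway:

User-agent: BadBot
Disallow: /

If you genuinely need to keep a bot out, blocking it at the web server or firewall level (for example by user agent or IP address) is more reliable.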

Example of a robots.txt file

A robots.txt file is very simple.

User-agent: *
Disallow:

In this example, all bots, regardless of their user agent, can crawl and index the website. Because the Disallow value is empty, there are no restrictions for the bots to comply with.

The example below instructs "msnbot" (Microsoft's MSN bot) not to crawl and index the website:

User-agent: msnbot
Disallow: /

The / value in the Disallow directive matches every URL path on the site, meaning all URLs are disallowed from being crawled.
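
Rules are grouped by User-agent, and a single file can contain several groups; a bot follows the most specific group that matches its user agent and ignores the others. As a sketch, the file below combines the two previous examples to lock out msnbot while leaving all other bots unrestricted:

User-agent: msnbot
Disallow: /

User-agent: *
Disallow: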

Common search engine bots

The common search engine bots that comply with robots.txt are listed below.

Company   Type             Bot user-agent
-----------------------------------------
Google    General          Googlebot
Google    Images           Googlebot-Image
Google    Mobile           Googlebot-Mobile
Google    News             Googlebot-News
Google    Video            Googlebot-Video
Google    AdSense          Mediapartners-Google
Google    AdWords          AdsBot-Google
Bing      General          bingbot
Bing      General          msnbot
Bing      Images & Video   msnbot-media
Bing      Ads              adidxbot
Yahoo!    General          slurp
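
The user-agent values from this table can be used to scope rules to a single bot. For example, the following sketch (the /photos path is only an illustration) would keep Google's image crawler out of a photo folder while leaving regular search crawling untouched:

User-agent: Googlebot-Image
Disallow: /photos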

Limitations of a robots.txt file

There are multiple limitations of a robots.txt file:

  • Complying with the rules in robots.txt is optional
  • Rules and directives may not be supported by all bots

For example, the Crawl-delay directive's value is interpreted differently by each major search engine bot.

User-agent: *
Crawl-delay: 10

Google interprets it as how many requests per second Googlebot can make to your website ([link](https://support.google.com/webmasters/answer/48620?hl=en)).

Bing interprets it as the size of a time window during which Bingbot can crawl the website ([link](https://blogs.bing.com/webmaster/2012/05/03/to-crawl-or-not-to-crawl-that-is-bingbots-question/)).
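
Because support and interpretation vary, one option is to scope the directive to a specific bot that documents support for it, such as Bingbot, rather than applying it to every crawler:

User-agent: bingbot
Crawl-delay: 10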

robots.txt FAQ

@TODO