A robots.txt file tells search engine bots which parts of your website they may crawl and index (and which they should not).

When to use a robots.txt file

Most websites do not need a robots.txt file, even if they rely on search engine traffic (e.g. Google, Bing).

Certain use cases, however, call for a robots.txt file. For example:

  • To block the indexing of certain URLs. However, it's recommended to use the noindex meta tag or the X-Robots-Tag response header instead of a robots.txt file.

  • To block the crawling and indexing of media files or resources such as PDF documents, PPTX presentations, DOCX documents, etc.

    To instruct bots not to crawl any resources under an /images folder, use the Disallow directive:

      User-agent: *
      Disallow: /images

  • To have a level of control over the requests bots make when crawling your website. If you experience issues due to the volume of bot requests, a robots.txt file can be useful for controlling how fast or how often a bot crawls the website. You can use the Crawl-delay directive for this purpose (note the Limitations of a robots.txt file):

      User-agent: *
      Crawl-delay: 10
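The rules above can also be checked programmatically. Here is a minimal sketch using Python's standard urllib.robotparser module (the example.com URLs and the "MyCrawler" user-agent are placeholders):

```python
from urllib import robotparser

# Parse the rules from the examples above. In practice you would call
# set_url("https://example.com/robots.txt") and read() to fetch the live file.
rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /images",
    "Crawl-delay: 10",
])

# Anything under /images is disallowed for every user-agent...
print(rp.can_fetch("MyCrawler", "https://example.com/images/logo.png"))  # False
# ...while other paths remain crawlable.
print(rp.can_fetch("MyCrawler", "https://example.com/about"))            # True
# The parser also exposes the Crawl-delay value (in seconds).
print(rp.crawl_delay("MyCrawler"))                                       # 10
```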

When not to use a robots.txt file

Generally, you shouldn't use a robots.txt file for any of the following use cases:

  • Blocking the indexing of certain URLs. For example, Google may still index a page's URL if that page is linked to from another page with descriptive anchor text. It's best to use the noindex meta tag or the X-Robots-Tag response header instead.
  • Blocking bad bots from crawling your website. Bad bots are likely to ignore the rules in your robots.txt file anyway.
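To illustrate the recommended alternatives, here is a minimal sketch of the two noindex mechanisms. The meta tag and the X-Robots-Tag header name are standard; the WSGI app below is a hypothetical stand-in for whatever server framework you actually use:

```python
# 1. A noindex meta tag placed in the page's <head>:
NOINDEX_META = '<meta name="robots" content="noindex">'

# 2. An X-Robots-Tag response header, shown here in a bare WSGI app
#    (a placeholder for your real application code):
def app(environ, start_response):
    start_response("200 OK", [
        ("Content-Type", "text/html"),
        ("X-Robots-Tag", "noindex"),  # compliant bots will keep this URL out of their index
    ])
    body = "<html><head>" + NOINDEX_META + "</head><body>Private page</body></html>"
    return [body.encode("utf-8")]
```

Unlike a robots.txt rule, both mechanisms require the bot to fetch the page, see the noindex signal, and then drop the URL from the index.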

Example of a robots.txt file

A robots.txt file is very simple.

User-agent: *
Disallow:

In this example, all bots, regardless of their user-agent, can crawl and index the website. Because the Disallow value is empty, there are no restrictions for bots to comply with.

The example below instructs "msnbot" (Microsoft's MSN bot) not to crawl or index the website:

User-agent: msnbot
Disallow: /

The / value in the Disallow directive matches every URL path, so all URLs are disallowed from being crawled. (Some bots also treat * as a wildcard, but / is the standard way to block an entire site.)
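Per-user-agent matching can be verified the same way. A sketch with Python's standard urllib.robotparser, written with the standard Disallow: / form (the URLs are placeholders):

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: msnbot",
    "Disallow: /",        # block everything, but only for msnbot
])

# msnbot is blocked from every URL...
print(rp.can_fetch("msnbot", "https://example.com/"))     # False
# ...while bots with other user-agents are unaffected.
print(rp.can_fetch("Googlebot", "https://example.com/"))  # True
```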

Common search engine bots

The common search engine bots that comply with robots.txt rules are listed below.

Company    Type              Bot user-agent
Bing       Images & Video    msnbot-media

Limitations of a robots.txt file

There are multiple limitations of a robots.txt file:

  • Complying with the rules in robots.txt is optional
  • Rules and directives may not be supported by all bots

For example, the Crawl-delay directive's value is handled differently by each major search engine bot.

User-agent: *
Crawl-delay: 10

Google does not support the Crawl-delay directive at all; Googlebot ignores it, and crawl rate is managed through Google Search Console instead.

Bing interprets it as the length of a time window during which Bingbot will crawl the website at most once.
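Whichever interpretation a given search engine uses, a crawler you write yourself can read the value with Python's standard urllib.robotparser and pause between requests. A minimal sketch (the function name, user-agent, and URLs are placeholders):

```python
import time
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Crawl-delay: 10",
])

def polite_fetch(urls, user_agent="MyCrawler"):
    """Crawl-loop sketch that honors Crawl-delay between consecutive requests."""
    delay = rp.crawl_delay(user_agent) or 0   # seconds; None when no directive is set
    allowed = []
    for i, url in enumerate(urls):
        if rp.can_fetch(user_agent, url):
            allowed.append(url)               # a real crawler would issue the HTTP request here
        if i < len(urls) - 1:
            time.sleep(delay)                 # pause between requests
    return allowed
```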

robots.txt FAQ