robots.txt best practices

Heads-up: Compliance with robots.txt is optional

An important heads-up: bots' compliance with the rules and directives in a robots.txt file is optional.

Good bots, such as the major search engine crawlers, are likely to follow the instructions in the robots.txt file, while bad bots will likely ignore it.

Mention user-agents one by one

Group the directives by user-agent, addressing each user-agent separately:

User-agent: Googlebot
Disallow: /*.pdf$

User-agent: Googlebot-Image
Disallow: /*.pdf$
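
If several crawlers should follow exactly the same rules, you can also list multiple User-agent lines above a single group of directives:

User-agent: Googlebot
User-agent: Googlebot-Image
Disallow: /*.pdf$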

Use robots.txt for each origin (domains, subdomains)

A robots.txt file applies to only one origin. Websites with multiple subdomains should use a separate robots.txt file for each subdomain.

The rules in the robots.txt file for subdomain1.domain.com (hosted at subdomain1.domain.com/robots.txt) apply only to subdomain1.domain.com; they don't apply to domain.com.

For example, consider a website with a main domain and two subdomains, subdomain1 and subdomain2. Each of these three origins must have its own robots.txt file:

  • domain.com/robots.txt
  • subdomain1.domain.com/robots.txt
  • subdomain2.domain.com/robots.txt
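
As an illustration (the paths below are made up), each origin serves its own file with independent rules; the # lines are robots.txt comments naming the file:

# domain.com/robots.txt
User-agent: *
Disallow: /private/

# subdomain1.domain.com/robots.txt
User-agent: *
Disallow: /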

Use noindex to block indexing instead of robots.txt

Use noindex to block the indexing of certain URLs instead of relying on the robots.txt file.

For example, Google may still index a URL blocked by robots.txt if other pages link to that URL:

Warning: Don't use a robots.txt file as a means to hide your web pages from Google search results. If other pages point to your page with descriptive text, Google could still index the URL without visiting the page. If you want to block your page from search results, use another method such as password protection or noindex. (Source: https://developers.google.com/search/docs/advanced/robots/intro)

You can use the noindex robots meta tag or the X-Robots-Tag HTTP header.
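
For example, you can place a robots meta tag in the page's HTML head:

<meta name="robots" content="noindex">

or return the equivalent X-Robots-Tag header in the HTTP response for that URL:

X-Robots-Tag: noindex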

Conflicting rules when using Allow and Disallow directives

Pay attention whenever you use the Allow and Disallow directives at the same time:

User-agent: *
Allow: /articles
Disallow: /articles/

For Google and Bing search engine bots, the most specific matching rule, the one with the longest path, takes precedence. In this example, that is the Disallow directive, because /articles/ is longer than /articles.

Other search engines may interpret conflicting rules differently. A bot that follows only the first matching directive would apply the Allow directive in this example.
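
As an illustration (the second path is made up), longest-match behavior resolves the rules above like this:

/articles           allowed (only Allow: /articles matches)
/articles/seo-tips  blocked (Disallow: /articles/ is the longer match)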

Use UTF-8 format

To ensure that most search engine bots can read and interpret the directives in the robots.txt file, follow these instructions:

  • save robots.txt as a plain text file (.txt) encoded in UTF-8
  • separate lines with CR, CR/LF, or LF

Limit the file size to 500 KB

It's recommended to keep the robots.txt file under 500 KB. Google, for example, documents a maximum file size of 500 kibibytes and ignores any content beyond that limit.