robots.txt best practices
Heads-up: Compliance with robots.txt is optional
An important heads-up: bots' compliance with the rules and directives in a robots.txt file is optional.
Well-behaved bots will most likely follow the instructions in your robots.txt file, while bad bots will likely ignore it.
Mention user agents one by one
Group the directives into one block per user agent:
User-agent: Googlebot
Disallow: /*.pdf$
User-agent: Googlebot-Image
Disallow: /*.pdf$
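If several crawlers should share the same rules, the robots.txt specification also allows stacking multiple User-agent lines on a single group, so the example above can be written as:
User-agent: Googlebot
User-agent: Googlebot-Image
Disallow: /*.pdf$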
Use robots.txt for each origin (domains, subdomains)
A robots.txt file applies to only one origin. Websites with multiple subdomains need a separate robots.txt file for each subdomain.
The rules in the robots.txt for subdomain1.domain.com (hosted at subdomain1.domain.com/robots.txt) apply only to subdomain1.domain.com; they don't apply to domain.com.
For example, take a website with a main origin (domain.com) and two subdomains, subdomain1 and subdomain2. Each of these three origins must have its own robots.txt file:
- domain.com/robots.txt
- subdomain1.domain.com/robots.txt
- subdomain2.domain.com/robots.txt
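As a hypothetical sketch (the paths here are made up for illustration), each file carries its own independent rules:
# domain.com/robots.txt
User-agent: *
Disallow: /admin/

# subdomain1.domain.com/robots.txt
User-agent: *
Disallow: /drafts/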
Use noindex to block indexing instead of a robots.txt
Use noindex to block the indexing of certain URLs instead of relying on the robots.txt file.
For example, Google may still index a URL disallowed in robots.txt if other pages point to that URL:
Warning: Don't use a robots.txt file as a means to hide your web pages from Google search results. If other pages point to your page with descriptive text, Google could still index the URL without visiting the page. If you want to block your page from search results, use another method such as password protection or noindex. (Source: https://developers.google.com/search/docs/advanced/robots/intro)
You can use the noindex meta tag or the X-Robots-Tag HTTP header.
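For example, to keep a page out of the index, a noindex meta tag can be placed in the page's <head>:
<meta name="robots" content="noindex">

The same directive can be sent as an HTTP response header, which also works for non-HTML resources such as PDFs:
X-Robots-Tag: noindex

Note that the crawler must be able to fetch the URL to see either signal, so don't also disallow that URL in robots.txt.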
Conflicting rules when using Allow and Disallow directives
Pay attention whenever you use the Allow and Disallow directives at the same time:
User-agent: *
Allow: /articles
Disallow: /articles/
For Google and Bing bots, the matching directive with the most characters takes precedence. In this example, that is the Disallow directive, because /articles/ (10 characters) is longer than /articles (9 characters).
Other search engines may resolve conflicting rules differently. A bot that follows only the first matching directive would instead apply the Allow directive from this example.
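To make the longest-match behavior concrete, here is how two hypothetical URLs resolve against the rules above:
# /articles          -> only "Allow: /articles" matches -> crawling allowed
# /articles/seo-tips -> "Disallow: /articles/" is the longer match -> crawling blocked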
Use UTF-8 format
To ensure that most search engine bots can read and interpret the directives in the robots.txt file, follow these instructions:
- save robots.txt as a plain text file (.txt) encoded in UTF-8
- separate lines with CR, CR/LF, or LF
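To double-check the encoding of a saved file, one option on Unix-like systems (the exact output wording varies by version) is the file command:
file robots.txt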
Limit the size to 500 KB
It's recommended to keep the robots.txt file under 500 KB. Google, for example, enforces a limit of 500 KiB and ignores any content beyond it.
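To verify, you can count the bytes of the local file, or of the live file (using the domain.com placeholder from the earlier example):
wc -c robots.txt
curl -s https://domain.com/robots.txt | wc -c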