robots.txt best practices
Follow these robots.txt best practices to control crawl access, avoid indexing conflicts, and manage search engine and AI bot behavior effectively.
- Use noindex Instead of robots.txt to Block Indexing
- Create a Separate robots.txt File for Each Subdomain
- Keep robots.txt File Size Under 500 KiB
- Group Directives by User-agent in robots.txt
- Use the Allow Directive in robots.txt to Create Exceptions
- Save robots.txt in UTF-8 Encoding
- Include a Sitemap Directive in robots.txt
- Do Not Rely on robots.txt for Security
- Use Google Search Console to Test robots.txt Rules
- Control AI Training Crawlers Separately in robots.txt
- Use Crawl-delay Only for Bots That Support It
Use noindex Instead of robots.txt to Block Indexing
robots.txt controls crawling, not indexing. A page blocked by robots.txt can still appear in Google search results if other pages link to it. Google indexes the URL and displays it without a snippet or cached page. Use the noindex meta robots tag or the X-Robots-Tag HTTP response header to prevent a page from appearing in search results. The page must remain crawlable so the bot can read the noindex directive.
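As a rough illustration of the two noindex signals described above, the following sketch checks a response for either one. The function name has_noindex and the plain-dict representation of headers are assumptions made for this example, and only the generic name="robots" meta tag is handled, not bot-specific variants.

```python
from html.parser import HTMLParser


class _RobotsMetaParser(HTMLParser):
    """Collects the content attribute of every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = {name: (value or "") for name, value in attrs}
        if attr_map.get("name", "").lower() == "robots":
            self.directives.append(attr_map.get("content", "").lower())


def has_noindex(headers, html):
    """Return True if the X-Robots-Tag header or a meta robots tag says noindex."""
    # Header names are case-insensitive, so normalize them before the lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    if "noindex" in normalized.get("x-robots-tag", "").lower():
        return True
    parser = _RobotsMetaParser()
    parser.feed(html)
    return any("noindex" in directive for directive in parser.directives)
```

A check like this only works on pages the bot is allowed to fetch, which is why the page has to stay crawlable.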
Create a Separate robots.txt File for Each Subdomain
robots.txt applies only to the origin (scheme + host + port) where it is hosted. A robots.txt file at example.com/robots.txt does not apply to blog.example.com or shop.example.com. Each subdomain requires its own robots.txt file at its root path. Omitting a robots.txt file for a subdomain means bots assume all paths on that subdomain are crawlable.
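To make the per-origin scope concrete, here is a minimal sketch that derives the robots.txt location governing a given page URL. The helper name robots_txt_url is hypothetical; it simply keeps the scheme, host, and port and replaces everything else with /robots.txt.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url):
    """Return the robots.txt URL for the origin that serves page_url."""
    parts = urlsplit(page_url)
    # netloc carries the host and any explicit port, so the origin is preserved.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))
```

Two URLs on different subdomains map to different robots.txt files, so rules written for one host never reach the other.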
Keep robots.txt File Size Under 500 KiB
RFC 9309 requires crawlers to parse at least 500 kibibytes (512,000 bytes) of a robots.txt file, and Google enforces 500 KiB as its maximum. Google ignores content beyond this limit, so any rules past the cutoff are treated as if they do not exist. Consolidate overlapping rules and remove unused directives to stay within the size limit. Sites with thousands of disallowed paths should group rules using wildcard patterns instead of listing each path individually.
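The size limit is easy to check before publishing. The sketch below measures the encoded byte length, since the limit applies to bytes rather than characters; the function name and the shape of the returned dictionary are illustrative choices, not part of any standard tool.

```python
MAX_ROBOTS_BYTES = 500 * 1024  # 500 KiB = 512,000 bytes


def robots_size_report(content: bytes):
    """Report whether a robots.txt payload fits within the 500 KiB limit."""
    size = len(content)
    return {
        "bytes": size,
        "within_limit": size <= MAX_ROBOTS_BYTES,
        # Bytes past the limit are the part a crawler may never read.
        "ignored_bytes": max(0, size - MAX_ROBOTS_BYTES),
    }
```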
Group Directives by User-agent in robots.txt
Organize robots.txt rules by specifying one User-agent line per group, followed by all Disallow and Allow rules for that bot. Bing requires all rules to appear in the named Bingbot section; directives under User-agent: * are ignored when a specific bingbot section exists.
User-agent: Googlebot
Disallow: /staging/
Disallow: /*.pdf$
User-agent: bingbot
Disallow: /staging/
Disallow: /*.pdf$
User-agent: *
Disallow: /staging/
Repeat shared rules in each bot-specific section. Bots read only the most specific group that matches their User-agent name and ignore the wildcard group when a named group exists.
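The group-selection behavior can be observed with Python's standard urllib.robotparser module. This is only an approximation of how real crawlers behave (the standard library parser does not understand wildcard patterns such as /*.pdf$, so they are left out), and the /private/ path and the OtherBot name are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: /private/ is listed only in the wildcard group.
RULES = """\
User-agent: Googlebot
Disallow: /staging/

User-agent: *
Disallow: /staging/
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

url = "https://example.com/private/report.html"
# Googlebot has its own group, so the wildcard group's /private/ rule is not applied.
googlebot_allowed = parser.can_fetch("Googlebot", url)
# A bot without a named group falls back to the wildcard group.
otherbot_allowed = parser.can_fetch("OtherBot", url)
```

Here the named Googlebot group does not inherit the /private/ rule, which is the reason shared rules need to be repeated in every bot-specific group.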
Use the Allow Directive in robots.txt to Create Exceptions
Combine Allow and Disallow in robots.txt to block a directory while permitting access to specific files within it. Googlebot and Bingbot resolve conflicting rules by following the longest (most specific) matching path.
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
The Allow directive for /wp-admin/admin-ajax.php (24 characters) is more specific than the Disallow for /wp-admin/ (10 characters), so bots follow the Allow rule for that path. Other bots may use first-match resolution instead of longest-match, so place Allow rules before Disallow rules for the same path prefix when targeting non-Google and non-Bing crawlers.
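Longest-match resolution can be sketched in a few lines. This is a simplified model, assuming plain path prefixes with no wildcard support and assuming that Allow wins when two matching rules have the same length; the is_allowed function and the rule tuples are illustrative, not any crawler's actual implementation.

```python
def is_allowed(path, rules):
    """Resolve Allow/Disallow conflicts by the longest matching prefix.

    rules is a list of (directive, prefix) tuples, where directive is
    "allow" or "disallow". Wildcards are not handled in this sketch.
    """
    best_length = -1
    allowed = True  # a path matched by no rule is crawlable
    for directive, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue
        is_allow = directive.lower() == "allow"
        # A longer prefix is more specific; on a tie, prefer Allow.
        if len(prefix) > best_length or (len(prefix) == best_length and is_allow):
            best_length = len(prefix)
            allowed = is_allow
    return allowed


WP_RULES = [
    ("disallow", "/wp-admin/"),
    ("allow", "/wp-admin/admin-ajax.php"),
]
```

Because the outcome depends on prefix length and not on position, reordering WP_RULES does not change the result, unlike with a first-match parser.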
Save robots.txt in UTF-8 Encoding
Save the robots.txt file as a plain text file encoded in UTF-8. RFC 9309 requires UTF-8 encoding for robots.txt. Separate lines with CR, CR/LF, or LF line endings. Non-UTF-8 encoding may cause bots to misinterpret directives, particularly those containing non-ASCII characters in URL paths.
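A quick way to catch encoding problems is to decode the file strictly as UTF-8 and split it on the three accepted line endings. The sketch below does that; the function name read_robots_lines is an assumption for this example, and blank lines are dropped for convenience.

```python
import re


def read_robots_lines(raw: bytes):
    """Decode robots.txt bytes as UTF-8 and split on CR/LF, CR, or LF."""
    # Strict decoding raises UnicodeDecodeError if the file is not valid UTF-8.
    text = raw.decode("utf-8")
    # CR/LF is listed first so it is treated as a single line break.
    lines = re.split(r"\r\n|\r|\n", text)
    return [line for line in lines if line.strip()]
```

A file saved in a legacy encoding such as Latin-1 with a non-ASCII path fails the decode step, which flags the problem before a crawler ever sees it.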
Include a Sitemap Directive in robots.txt
Declare the full URL of the XML sitemap using the Sitemap directive in robots.txt. This directive helps bots discover all indexable pages without depending on link crawling alone. The Sitemap directive is independent of any User-agent group and can appear anywhere in the file.
Sitemap: https://example.com/sitemap.xml
Submitting the sitemap through Google Search Console and Bing Webmaster Tools provides additional monitoring, but the robots.txt Sitemap directive ensures all compliant bots can find it.
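To confirm that a Sitemap line is picked up regardless of where it sits in the file, the following sketch reads it back with Python's standard urllib.robotparser (the site_maps method requires Python 3.8 or later). The rules text is a made-up example.

```python
from urllib.robotparser import RobotFileParser

# The Sitemap line sits outside any User-agent group.
RULES = """\
User-agent: *
Disallow: /staging/

Sitemap: https://example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

# site_maps() returns a list of declared sitemap URLs, or None if there are none.
sitemaps = parser.site_maps()
```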
Do Not Rely on robots.txt for Security
robots.txt is a public file. Anyone can read https://example.com/robots.txt and see which paths are disallowed. Listing sensitive paths in robots.txt reveals their existence to attackers. Compliance with robots.txt is voluntary, and malicious bots ignore it. Protect sensitive content with authentication, IP restrictions, server-side access controls, or a web application firewall (WAF).
Use Google Search Console to Test robots.txt Rules
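Google Search Console provides a robots.txt report, which replaced the standalone robots.txt tester in late 2023. The report shows the robots.txt files Google found, when they were last fetched, and any parsing errors or warnings, and the URL Inspection tool checks whether Googlebot can access a specific URL under the current rules. Test every change to robots.txt before deploying it to production. Syntax errors or overly broad Disallow rules can accidentally block important pages from search indexing and reduce organic traffic.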
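A local check can complement Search Console by catching overly broad rules before a draft goes live. The sketch below uses Python's standard urllib.robotparser, which applies the first matching rule and ignores wildcards, so it only approximates Googlebot's behavior; the blocked_urls helper and the sample URLs are hypothetical.

```python
from urllib.robotparser import RobotFileParser


def blocked_urls(robots_text, user_agent, must_crawl):
    """Return the URLs from must_crawl that the draft rules would block."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return [url for url in must_crawl if not parser.can_fetch(user_agent, url)]


# A draft with an overly broad rule: /p also matches /products/ and /pricing.
DRAFT = "User-agent: *\nDisallow: /p\n"
IMPORTANT = [
    "https://example.com/products/widget",
    "https://example.com/about",
]
problems = blocked_urls(DRAFT, "Googlebot", IMPORTANT)
```

An empty result means none of the listed URLs are blocked by the draft, under the parser's simplified matching.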
Control AI Training Crawlers Separately in robots.txt
Block AI training crawlers in robots.txt without affecting traditional search engine indexing. Target AI bots by their specific User-agent names: GPTBot (OpenAI), ClaudeBot (Anthropic), Google-Extended (Google AI training), Meta-ExternalAgent (Meta), and PerplexityBot (Perplexity). Blocking Google-Extended does not affect Googlebot's ability to crawl and index pages for regular search results.
User-agent: GPTBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: ClaudeBot
Disallow: /
Review server access logs regularly to identify new AI crawlers that may not respect robots.txt. Some AI bots use generic browser User-agent strings to avoid detection, which makes robots.txt ineffective against them. Use server-side rate limiting or a WAF for crawlers that bypass robots.txt rules.
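When the list of AI crawlers grows, generating one group per bot keeps the file consistent. The sketch below builds the groups from the User-agent names mentioned in this section and then checks, with Python's standard urllib.robotparser, that Googlebot is unaffected. The build_ai_block_rules helper is hypothetical, and the list should be kept current by hand.

```python
from urllib.robotparser import RobotFileParser

AI_TRAINING_AGENTS = [
    "GPTBot",
    "ClaudeBot",
    "Google-Extended",
    "Meta-ExternalAgent",
    "PerplexityBot",
]


def build_ai_block_rules(agents):
    """Return robots.txt text with one full-site Disallow group per agent."""
    groups = [f"User-agent: {agent}\nDisallow: /" for agent in agents]
    return "\n\n".join(groups) + "\n"


rules_text = build_ai_block_rules(AI_TRAINING_AGENTS)

parser = RobotFileParser()
parser.parse(rules_text.splitlines())
page = "https://example.com/articles/guide.html"
gptbot_allowed = parser.can_fetch("GPTBot", page)
googlebot_allowed = parser.can_fetch("Googlebot", page)
```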
Use Crawl-delay Only for Bots That Support It
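Google does not support the Crawl-delay directive in robots.txt. Googlebot adjusts its crawl rate automatically based on how the server responds, and the crawl rate limiter formerly available in Google Search Console was retired in January 2024; to slow Googlebot temporarily, return 500, 503, or 429 status codes. Bing, Yahoo, and Yandex interpret Crawl-delay as the minimum number of seconds between consecutive requests from their bots.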
User-agent: bingbot
Crawl-delay: 5
User-agent: YandexBot
Crawl-delay: 10
Place Crawl-delay in the bot-specific group, not under User-agent: *, to avoid applying it to bots that do not recognize the directive.
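The effect of scoping Crawl-delay to named groups can be checked with Python's standard urllib.robotparser, whose crawl_delay method (Python 3.6 or later) returns the value that applies to a given bot. This reflects the standard library's parsing only, not what each search engine actually honors.

```python
from urllib.robotparser import RobotFileParser

RULES = """\
User-agent: bingbot
Crawl-delay: 5

User-agent: YandexBot
Crawl-delay: 10
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

bing_delay = parser.crawl_delay("bingbot")
yandex_delay = parser.crawl_delay("YandexBot")
# No group names Googlebot and there is no wildcard group, so no delay applies.
google_delay = parser.crawl_delay("Googlebot")
```

Because no wildcard group carries a Crawl-delay here, bots outside the two named groups see no delay value at all.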