robots.txt best practices

Follow these robots.txt best practices to control crawl access, avoid indexing conflicts, and manage search engine and AI bot behavior effectively.

Use noindex Instead of robots.txt to Block Indexing

robots.txt controls crawling, not indexing. A page blocked by robots.txt can still appear in Google search results if other pages link to it: Google can index the URL without crawling it and display it without a snippet. To keep a page out of search results, use the noindex meta robots tag or the X-Robots-Tag HTTP response header instead. The page must remain crawlable, so do not also disallow it in robots.txt, or the bot never reads the noindex directive.
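As a minimal sketch of the header-based approach, the following Python WSGI app attaches X-Robots-Tag: noindex to responses under selected paths. The path prefixes and the app are hypothetical and only illustrate the idea; most sites would set this header in their web server or framework configuration.

```python
# Hypothetical path prefixes that should stay crawlable but out of the index.
NOINDEX_PREFIXES = ("/internal-search/", "/print/")


def robots_headers(path):
    """Return extra response headers for the given request path."""
    if path.startswith(NOINDEX_PREFIXES):
        return [("X-Robots-Tag", "noindex")]
    return []


def app(environ, start_response):
    """Tiny WSGI app that adds X-Robots-Tag where needed."""
    path = environ.get("PATH_INFO", "/")
    headers = [("Content-Type", "text/html; charset=utf-8")]
    headers += robots_headers(path)
    start_response("200 OK", headers)
    return [b"<!doctype html><title>Example</title>"]
```

For the header to take effect, the same paths must not be disallowed in robots.txt.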

Create a Separate robots.txt File for Each Subdomain

robots.txt applies only to the origin (scheme + host + port) where it is hosted. A robots.txt file at example.com/robots.txt does not apply to blog.example.com or shop.example.com. Each subdomain requires its own robots.txt file at its root path. Omitting a robots.txt file for a subdomain means bots assume all paths on that subdomain are crawlable.
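The origin rule can be illustrated with a short Python helper that derives the robots.txt location governing any page URL. The function name is an assumption made for this sketch; only the standard library is used.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url):
    """Return the robots.txt URL that governs the given page URL."""
    parts = urlsplit(page_url)
    # Keep scheme and netloc (host plus optional port); drop path, query, fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("https://blog.example.com/posts/1"))
print(robots_txt_url("https://shop.example.com/cart?item=2"))
```

Each subdomain resolves to a different file, which is why rules placed at example.com/robots.txt never reach blog.example.com.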

Keep robots.txt File Size Under 500 KiB

RFC 9309 requires crawlers to parse at least 500 kibibytes (512,000 bytes) of a robots.txt file, and Google enforces 500 KiB as its maximum file size. Google ignores content beyond this limit, so any rules past the cutoff are treated as if they do not exist. Consolidate overlapping rules and remove unused directives to stay within the size limit. Sites with thousands of disallowed paths should group rules using wildcard patterns instead of listing each path individually.
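A size check is easy to automate before deployment. The sketch below measures the encoded byte length of the file against the 500 KiB figure; the function name and the shape of the report are assumptions for illustration.

```python
MAX_ROBOTS_BYTES = 500 * 1024  # 500 KiB = 512,000 bytes


def robots_size_report(content: bytes) -> dict:
    """Report the byte size of a robots.txt body against the 500 KiB limit."""
    size = len(content)
    return {
        "bytes": size,
        "within_limit": size <= MAX_ROBOTS_BYTES,
        "bytes_over": max(0, size - MAX_ROBOTS_BYTES),
    }


sample = "User-agent: *\nDisallow: /staging/\n".encode("utf-8")
print(robots_size_report(sample))
```

The check works on bytes rather than characters because non-ASCII characters occupy more than one byte in UTF-8.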

Group Directives by User-agent in robots.txt

Organize robots.txt rules by specifying one User-agent line per group, followed by all Disallow and Allow rules for that bot. Bing requires all rules to appear in the named Bingbot section; directives under User-agent: * are ignored when a specific bingbot section exists.

User-agent: Googlebot
Disallow: /staging/
Disallow: /*.pdf$

User-agent: bingbot
Disallow: /staging/
Disallow: /*.pdf$

User-agent: *
Disallow: /staging/

Repeat shared rules in each bot-specific section. Bots read only the most specific group that matches their User-agent name and ignore the wildcard group when a named group exists.
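The group selection behavior described above can be sketched in Python. This is a simplified illustration, not any crawler's actual parser: it matches the bot name exactly (case-insensitively) and falls back to the * group only when no named group exists. The function names are assumptions.

```python
def parse_groups(robots_txt):
    """Split robots.txt text into a list of (agents, rules) groups."""
    groups = []
    agents, rules = [], []
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if rules:  # a rule was already seen, so this line starts a new group
                groups.append((agents, rules))
                agents, rules = [], []
            agents.append(value.lower())
        elif field in ("allow", "disallow") and agents:
            rules.append((field, value))
    if agents:
        groups.append((agents, rules))
    return groups


def select_rules(robots_txt, bot_name):
    """Return rules from the bot's named group, else from the * group."""
    bot = bot_name.lower()
    groups = parse_groups(robots_txt)
    if any(bot in agents for agents, _ in groups):
        return [r for agents, rules in groups if bot in agents for r in rules]
    return [r for agents, rules in groups if "*" in agents for r in rules]


ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /staging/
Disallow: /*.pdf$

User-agent: *
Disallow: /staging/
"""
print(select_rules(ROBOTS_TXT, "Googlebot"))
print(select_rules(ROBOTS_TXT, "DuckDuckBot"))
```

Because the named group replaces the wildcard group rather than extending it, a rule that appears only under User-agent: * never reaches Googlebot in this example.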

Use the Allow Directive in robots.txt to Create Exceptions

Combine Allow and Disallow in robots.txt to block a directory while permitting access to specific files within it. Googlebot and Bingbot resolve conflicting rules by following the longest (most specific) matching path.

User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php

The Allow directive for /wp-admin/admin-ajax.php (24 characters) is more specific than the Disallow for /wp-admin/ (10 characters), so bots follow the Allow rule for that path. Other bots may use first-match resolution instead of longest-match, so place Allow rules before Disallow rules for the same path prefix when targeting non-Google and non-Bing crawlers.
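Longest-match resolution can be sketched in a few lines of Python. The sketch assumes the * and trailing $ wildcards and lets Allow win when two matching patterns have equal length; it illustrates the technique and is not a copy of any search engine's matcher. Function names are assumptions.

```python
import re


def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern (* and trailing $) to a regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_allowed(rules, path):
    """Longest matching pattern wins; Allow wins a tie; no match means allowed."""
    best_len, allowed = -1, True
    for directive, pattern in rules:
        if not pattern:  # an empty value matches nothing
            continue
        if pattern_to_regex(pattern).match(path):
            is_allow = directive.lower() == "allow"
            if len(pattern) > best_len or (len(pattern) == best_len and is_allow):
                best_len, allowed = len(pattern), is_allow
    return allowed


RULES = [("Disallow", "/wp-admin/"), ("Allow", "/wp-admin/admin-ajax.php")]
print(is_allowed(RULES, "/wp-admin/admin-ajax.php"))
print(is_allowed(RULES, "/wp-admin/options.php"))
```

A first-match crawler would instead stop at the first rule in file order that matches, which is why rule ordering still matters for other bots.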

Save robots.txt in UTF-8 Encoding

Save the robots.txt file as a plain text file encoded in UTF-8. RFC 9309 requires UTF-8 encoding for robots.txt. Separate lines with CR, CR/LF, or LF line endings. Non-UTF-8 encoding may cause bots to misinterpret directives, particularly those containing non-ASCII characters in URL paths.
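A quick way to catch encoding problems is to decode the file strictly as UTF-8 before publishing it. The helper below is a sketch under that assumption; its name is hypothetical, and stripping a leading byte order mark is a convenience added here rather than something the text above requires.

```python
def decode_robots_txt(raw: bytes) -> str:
    """Decode robots.txt bytes as UTF-8, raising ValueError on bad input."""
    try:
        # utf-8-sig also strips a leading byte order mark if one is present.
        text = raw.decode("utf-8-sig")
    except UnicodeDecodeError as exc:
        raise ValueError(f"robots.txt is not valid UTF-8: {exc}") from exc
    # Normalize CR and CR/LF line endings to LF for further processing.
    return text.replace("\r\n", "\n").replace("\r", "\n")


good = "User-agent: *\r\nDisallow: /café/\r\n".encode("utf-8")
print(decode_robots_txt(good))
```

A file saved as Latin-1 with the same non-ASCII path fails this check, which is the kind of mismatch that can lead bots to misread a rule.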

Include a Sitemap Directive in robots.txt

Declare the full URL of the XML sitemap using the Sitemap directive in robots.txt. This directive helps bots discover all indexable pages without depending on link crawling alone. The Sitemap directive is independent of any User-agent group and can appear anywhere in the file.

Sitemap: https://example.com/sitemap.xml

Submitting the sitemap through Google Search Console and Bing Webmaster Tools provides additional monitoring, but the robots.txt Sitemap directive ensures all compliant bots can find it.
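To confirm that a file declares its sitemap the way a compliant bot would read it, Python's standard urllib.robotparser can list the Sitemap entries. This sketch assumes Python 3.8 or later, where RobotFileParser.site_maps() is available; the sample file content is illustrative.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /staging/

Sitemap: https://example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# site_maps() returns None when the file declares no sitemap.
sitemaps = parser.site_maps() or []
print(sitemaps)
```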

Do Not Rely on robots.txt for Security

robots.txt is a public file. Anyone can read https://example.com/robots.txt and see which paths are disallowed. Listing sensitive paths in robots.txt reveals their existence to attackers. Compliance with robots.txt is voluntary, and malicious bots ignore it. Protect sensitive content with authentication, IP restrictions, server-side access controls, or a web application firewall (WAF).

Use Google Search Console to Test robots.txt Rules

Google Search Console provides a robots.txt report that shows which robots.txt files Google found for a site and flags fetch or parsing problems, and its URL Inspection tool shows whether Googlebot can access a specific URL under the current rules. The legacy standalone robots.txt Tester has been retired. Test every change to robots.txt before deploying it to production. Syntax errors or overly broad Disallow rules can accidentally block important pages from crawling and reduce organic traffic.
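Alongside Search Console, a local regression check can catch overly broad rules before a file ships. The sketch below uses Python's standard urllib.robotparser; the candidate file and URL lists are hypothetical. Note that this parser applies simple prefix matching in file order and does not implement Google's wildcard or longest-match behavior, so it suits plain prefix rules only.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical candidate file and URL expectations for illustration.
CANDIDATE = """\
User-agent: *
Disallow: /staging/
"""
MUST_ALLOW = ["https://example.com/", "https://example.com/products/widget"]
MUST_BLOCK = ["https://example.com/staging/new-home"]


def check_rules(robots_txt, agent="Googlebot"):
    """Return a list of problems; an empty list means all checks pass."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    problems = []
    for url in MUST_ALLOW:
        if not parser.can_fetch(agent, url):
            problems.append(f"blocked but should be crawlable: {url}")
    for url in MUST_BLOCK:
        if parser.can_fetch(agent, url):
            problems.append(f"crawlable but should be blocked: {url}")
    return problems


print(check_rules(CANDIDATE))
```

Running such a check in a deployment pipeline turns an accidental Disallow: / into a failed build instead of a traffic drop.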

Control AI Training Crawlers Separately in robots.txt

Block AI training crawlers in robots.txt without affecting traditional search engine indexing. Target AI bots by their specific User-agent names: GPTBot (OpenAI), ClaudeBot (Anthropic), Google-Extended (Google AI training), Meta-ExternalAgent (Meta), and PerplexityBot (Perplexity). Blocking Google-Extended does not affect Googlebot's ability to crawl and index pages for regular search results.

User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: ClaudeBot
Disallow: /

Review server access logs regularly to identify new AI crawlers that may not respect robots.txt. Some AI bots use generic browser User-agent strings to avoid detection, which makes robots.txt ineffective against them. Use server-side rate limiting or a WAF for crawlers that bypass robots.txt rules.
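A log review can start with a simple count of requests per known AI crawler token. The sketch below assumes access log lines that include the User-agent string, as in the common combined log format; the token list and sample lines are illustrative. Google-Extended is left out because it is a robots.txt control token and does not appear as a separate User-agent string in requests.

```python
from collections import Counter

# User-agent tokens named in this section; extend the tuple as new crawlers appear.
AI_TOKENS = ("GPTBot", "ClaudeBot", "Meta-ExternalAgent", "PerplexityBot")


def count_ai_hits(log_lines):
    """Count requests per AI crawler token found anywhere in a log line."""
    hits = Counter()
    for line in log_lines:
        lowered = line.lower()
        for token in AI_TOKENS:
            if token.lower() in lowered:
                hits[token] += 1
                break  # count each request once
    return hits


# Hypothetical log lines for illustration.
SAMPLE_LOG = [
    '203.0.113.5 - - [01/Jan/2025:00:00:01 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '203.0.113.9 - - [01/Jan/2025:00:00:02 +0000] "GET /a HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; ClaudeBot/1.0)"',
    '198.51.100.7 - - [01/Jan/2025:00:00:03 +0000] "GET /b HTTP/1.1" 200 512 "-" "Mozilla/5.0 (X11; Linux x86_64)"',
]
print(count_ai_hits(SAMPLE_LOG))
```

Token matching only finds crawlers that identify themselves; traffic hiding behind browser User-agent strings needs rate-based or behavioral detection instead.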

Use Crawl-delay Only for Bots That Support It

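Google does not support the Crawl-delay directive in robots.txt. Googlebot adjusts its crawl rate automatically based on how the server responds; the crawl rate limiter that Search Console once offered has been retired, so the way to slow Googlebot temporarily is to return 500, 503, or 429 status codes. Bing, Yahoo, and Yandex interpret Crawl-delay as the minimum number of seconds between consecutive requests from their bots.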

User-agent: bingbot
Crawl-delay: 5

User-agent: YandexBot
Crawl-delay: 10

Place Crawl-delay in the bot-specific group, not under User-agent: *, to avoid applying it to bots that do not recognize the directive.
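The effect of scoping Crawl-delay to named groups can be checked with Python's standard urllib.robotparser, whose crawl_delay() method (Python 3.6 and later) returns the delay that applies to a given bot. This sketch shows how a well-behaved client would read the rules above; it is not a statement about how any particular search engine parses them.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: bingbot
Crawl-delay: 5

User-agent: YandexBot
Crawl-delay: 10
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

bing_delay = parser.crawl_delay("bingbot")
yandex_delay = parser.crawl_delay("YandexBot")
# No group names Googlebot and there is no * group, so no delay applies to it.
google_delay = parser.crawl_delay("Googlebot")
print(bing_delay, yandex_delay, google_delay)
```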