robots.txt
robots.txt is a plain text file that controls how search engine bots and AI crawlers access your website using User-agent, Disallow, Allow, and Sitemap directives.
- What robots.txt Does and When to Use It
- Core Concepts of robots.txt
- User-agent Directive in robots.txt
- Disallow Directive in robots.txt
- Allow Directive in robots.txt
- Sitemap Directive in robots.txt
- Wildcard and Pattern Matching in robots.txt
- robots.txt Syntax Reference
- Common Search Engine and AI Bot User-Agents
- robots.txt Directive Reference
- Common Tasks with robots.txt
- How to Create a robots.txt File
- How to Block a Directory with robots.txt
- How to Block AI Training Crawlers with robots.txt
- Limitations of robots.txt
- Related Tools and Guides
robots.txt (Robots Exclusion Protocol) is a plain text file placed at the root of a website that instructs search engine bots and AI crawlers which URL paths they may or may not access.
What robots.txt Does and When to Use It
The robots.txt file controls crawling, not indexing. Search engine bots such as Googlebot and Bingbot read the robots.txt file before requesting any page on the site. The file tells bots which paths are allowed and which are disallowed. RFC 9309, published by the IETF in September 2022, formalized the Robots Exclusion Protocol as a Proposed Standard on the IETF standards track.
robots.txt applies only to crawling. A disallowed page can still appear in search results if other pages link to it with descriptive anchor text. To prevent a page from appearing in search results, use the noindex meta robots tag or the X-Robots-Tag HTTP header instead.
Use robots.txt when you need to manage crawl budget by blocking low-value pages, prevent bots from accessing resource-heavy paths, block crawling of non-HTML resources such as PDFs or images, declare the location of your XML sitemap, or control access for specific bots like AI training crawlers. Do not use robots.txt to hide sensitive content. Compliance with robots.txt is voluntary, and malicious bots ignore it. Protect sensitive pages with authentication, server-side access rules, or a web application firewall (WAF).
Core Concepts of robots.txt
User-agent Directive in robots.txt
The User-agent directive in robots.txt specifies which bot the following rules apply to. Set User-agent: * to target all bots, or use a specific bot name such as Googlebot or Bingbot to create rules for a single crawler. The User-agent value is case-insensitive according to RFC 9309.
User-agent: Googlebot
Disallow: /private/
Each group of rules in robots.txt starts with one or more User-agent lines followed by one or more Disallow or Allow directives. A robots.txt file can contain multiple groups, each targeting a different bot.
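To see how a bot picks its group, the sketch below feeds a two-group file to Python's standard urllib.robotparser module. The second group, the bot name ExampleBot, and the example.com URLs are assumptions added for illustration. This parser's matching is simpler than RFC 9309, so treat the results as indicative rather than a statement of how any particular crawler behaves.

```python
from urllib.robotparser import RobotFileParser

# Two groups: one for Googlebot, one for every other bot (the second
# group is an assumption added for this illustration).
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /private/

User-agent: *
Disallow: /tmp/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Googlebot is governed only by its own group.
googlebot_private = parser.can_fetch("Googlebot", "https://example.com/private/report.html")
googlebot_tmp = parser.can_fetch("Googlebot", "https://example.com/tmp/cache.html")

# ExampleBot (hypothetical) has no group of its own, so the * group applies.
other_private = parser.can_fetch("ExampleBot", "https://example.com/private/report.html")
other_tmp = parser.can_fetch("ExampleBot", "https://example.com/tmp/cache.html")

print(googlebot_private, googlebot_tmp, other_private, other_tmp)
```

A bot that matches a specific group ignores the * group entirely, which is why Googlebot can still fetch /tmp/ in this sketch.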
Disallow Directive in robots.txt
The Disallow directive in robots.txt blocks access to a URL path. Bots matching the User-agent line must not crawl paths specified by Disallow. An empty Disallow: value means nothing is blocked.
User-agent: *
Disallow: /admin/
Disallow: /staging/
A Disallow: / rule blocks all paths on the site. Use this to prevent a specific bot from crawling the entire website.
Allow Directive in robots.txt
The Allow directive in robots.txt grants access to a URL path within a broader disallowed area. The Allow directive was not part of the original 1994 protocol but is supported by all major search engine bots including Googlebot and Bingbot, and is codified in RFC 9309.
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
When Allow and Disallow rules conflict, Googlebot and Bingbot follow the most specific (longest) matching rule. Other bots may follow the first matching rule instead.
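Longest-match resolution can be sketched in a few lines of Python. The function name is_allowed and the list-of-tuples rule format are assumptions made for this example, not part of any crawler's API. The sketch handles plain path prefixes only (no wildcards) and lets Allow win a tie, which is what RFC 9309 recommends for equivalent rules.

```python
def is_allowed(rules, path):
    """Return True if path may be crawled under longest-match resolution.

    rules is a list of (directive, path_prefix) tuples for one group.
    Wildcards are not handled in this sketch; only plain prefixes are.
    """
    best_length = -1
    allowed = True  # no matching rule means the path is allowed
    for directive, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue  # empty values and non-matching prefixes are skipped
        is_allow = directive.lower() == "allow"
        # A longer prefix wins; on a tie, Allow wins over Disallow.
        if len(prefix) > best_length or (len(prefix) == best_length and is_allow):
            best_length = len(prefix)
            allowed = is_allow
    return allowed


wp_rules = [
    ("Disallow", "/wp-admin/"),
    ("Allow", "/wp-admin/admin-ajax.php"),
]

print(is_allowed(wp_rules, "/wp-admin/admin-ajax.php"))  # Allow rule is longer
print(is_allowed(wp_rules, "/wp-admin/options.php"))     # only Disallow matches
print(is_allowed(wp_rules, "/blog/"))                    # no rule matches
```

Because the decision depends on prefix length rather than position, reordering the rules does not change the outcome. A first-match parser would give different answers for the same file.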
Sitemap Directive in robots.txt
The Sitemap directive in robots.txt declares the full URL of the XML sitemap. Search engine bots use this location to discover pages for crawling and indexing. The Sitemap directive is independent of any User-agent group and can appear anywhere in the file.
Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap-news.xml
Multiple Sitemap lines are valid. List separate sitemaps for news, images, or video content as needed.
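As a quick check that Sitemap lines are read independently of any group, the sketch below parses a small file with Python's standard urllib.robotparser and lists the declared sitemaps. The /admin/ rule is an assumption added to show that the group does not affect the result. The site_maps() method is available from Python 3.8.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/

Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap-news.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# site_maps() returns the listed URLs in file order, or None if there are none.
sitemaps = parser.site_maps()
print(sitemaps)
```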
Wildcard and Pattern Matching in robots.txt
robots.txt supports two special characters for pattern matching. The asterisk * matches zero or more characters in a path. The dollar sign $ anchors the match to the end of the URL.
User-agent: *
Disallow: /*.pdf$
This robots.txt rule blocks crawling of all URLs ending in .pdf regardless of directory. The * matches any path prefix, and the $ ensures the match applies only to URLs ending with .pdf.
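One way to reason about these two characters is to translate a pattern into a regular expression. The helper names pattern_to_regex and matches are hypothetical. The translation is a simplified sketch that ignores percent-encoding, so it approximates how a crawler might match rather than reproducing any specific implementation.

```python
import re


def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regular expression."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape each literal piece, then join the pieces with "any characters".
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + (r"\Z" if anchored else ""))


def matches(pattern, path):
    """Return True if the pattern matches from the start of the path."""
    return pattern_to_regex(pattern).match(path) is not None


print(matches("/*.pdf$", "/docs/guide.pdf"))             # ends in .pdf
print(matches("/*.pdf$", "/docs/guide.pdf?download=1"))  # $ rejects the query string
print(matches("/*.pdf", "/docs/guide.pdf?download=1"))   # without $, a prefix match is enough
```

The third call shows why the $ matters: without it, the rule would also cover URLs that merely contain .pdf somewhere after the leading slash.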
robots.txt Syntax Reference
Common Search Engine and AI Bot User-Agents
Major search engine bots and AI crawlers that comply with robots.txt directives:
| Company | Type | User-agent |
|---|---|---|
| Google | Web search | Googlebot |
| Google | Image search | Googlebot-Image |
| Google | Video search | Googlebot-Video |
| Google | News | Googlebot-News |
| Google | AdSense | Mediapartners-Google |
| Google | Ads crawl | AdsBot-Google |
| Google | AI training | Google-Extended |
| Bing | Web search | bingbot |
| Bing | Ads | adidxbot |
| Yahoo | Web search | Slurp |
| Yandex | Web search | YandexBot |
| OpenAI | AI training | GPTBot |
| OpenAI | Search | OAI-SearchBot |
| Anthropic | AI training | ClaudeBot |
| Perplexity | AI search | PerplexityBot |
| Meta | AI training | Meta-ExternalAgent |
| Apple | Siri / Spotlight | Applebot |
Blocking Google-Extended in robots.txt prevents Google from using your content for AI training (Gemini) but does not affect Googlebot indexing for regular search results. This separation allows sites to maintain search visibility while opting out of AI training data collection.
robots.txt Directive Reference
| Directive | Purpose | Scope | Example |
|---|---|---|---|
| User-agent | Identifies which bot the rules apply to | Per group | User-agent: Googlebot |
| Disallow | Blocks access to a path | Per group | Disallow: /private/ |
| Allow | Grants access to a path within a disallowed area | Per group | Allow: /private/public.html |
| Sitemap | Declares the sitemap URL | Global | Sitemap: https://example.com/sitemap.xml |
| Crawl-delay | Sets seconds between requests (not supported by Google) | Per group | Crawl-delay: 10 |
Google does not support the Crawl-delay directive. Control Googlebot's crawl rate through Google Search Console instead. Bing, Yahoo, and Yandex support Crawl-delay and interpret it as the minimum number of seconds between consecutive requests.
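For a crawler written in Python, the declared delay can be read with the standard urllib.robotparser module. The file contents, the /search/ path, and the bot name ExampleBot below are assumptions for illustration. How a real crawler paces its requests is up to that crawler.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: bingbot
Crawl-delay: 10
Disallow: /search/

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

bing_delay = parser.crawl_delay("bingbot")      # seconds declared for bingbot
other_delay = parser.crawl_delay("ExampleBot")  # None: the * group sets no delay
print(bing_delay, other_delay)
```

A polite client would typically sleep for at least that many seconds between requests when the value is not None.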
Common Tasks with robots.txt
How to Create a robots.txt File
Create a plain text file named robots.txt and place it at the root of the website. The file must be accessible at https://example.com/robots.txt. Save the file in UTF-8 encoding.
User-agent: *
Disallow:
Sitemap: https://example.com/sitemap.xml
This robots.txt file allows all bots to crawl every page on the site and declares the sitemap location. An empty Disallow: value means no paths are restricted.
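If the file is generated by a build or deploy script, the step might look like the sketch below. The function names build_robots_txt and write_robots_txt are hypothetical, and a temporary directory stands in for the real web root, whose location depends on the server setup.

```python
from pathlib import Path
import tempfile


def build_robots_txt(sitemap_url):
    """Return a permissive robots.txt body that only declares a sitemap."""
    lines = [
        "User-agent: *",
        "Disallow:",
        f"Sitemap: {sitemap_url}",
    ]
    return "\n".join(lines) + "\n"


def write_robots_txt(web_root, sitemap_url):
    """Write robots.txt into the web root as UTF-8 and return its path."""
    target = Path(web_root) / "robots.txt"
    target.write_text(build_robots_txt(sitemap_url), encoding="utf-8")
    return target


# Demo against a temporary directory standing in for the real web root.
with tempfile.TemporaryDirectory() as web_root:
    path = write_robots_txt(web_root, "https://example.com/sitemap.xml")
    written = path.read_text(encoding="utf-8")

print(written)
```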
How to Block a Directory with robots.txt
Block a specific directory path to prevent bots from crawling pages under that path.
User-agent: *
Disallow: /admin/
The trailing slash in /admin/ blocks all URLs under the /admin/ directory. Without the trailing slash, Disallow: /admin also blocks URLs like /admin-page because robots.txt uses prefix matching.
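The effect of the trailing slash can be checked with Python's standard urllib.robotparser, which also matches by prefix. The helper can_crawl, the bot name ExampleBot, and the sample paths are assumptions chosen for this comparison.

```python
from urllib.robotparser import RobotFileParser


def can_crawl(disallow_value, path):
    """Check a path against a one-rule file that applies to all bots."""
    parser = RobotFileParser()
    parser.parse(["User-agent: *", f"Disallow: {disallow_value}"])
    return parser.can_fetch("ExampleBot", f"https://example.com{path}")


# Compare both spellings of the rule against the same set of paths.
for rule in ("/admin/", "/admin"):
    for path in ("/admin/users", "/admin-page", "/administrator"):
        print(rule, path, can_crawl(rule, path))
```

With the trailing slash, only /admin/users is blocked. Without it, all three paths are blocked because each one begins with /admin.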
How to Block AI Training Crawlers with robots.txt
Block AI training crawlers while keeping the site visible in traditional search results. Target each AI bot by its specific User-agent name.
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: Meta-ExternalAgent
Disallow: /
User-agent: *
Disallow:
This robots.txt configuration blocks GPTBot (OpenAI), ClaudeBot (Anthropic), Google-Extended (Google AI training), and Meta-ExternalAgent (Meta) from crawling any page. The final User-agent: * block with an empty Disallow allows all other bots, including Googlebot and Bingbot, to crawl the entire site.
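When the list of blocked bots changes often, the same configuration can be generated and sanity-checked in code. The sketch below is one possible approach: build_ai_block_rules is a hypothetical helper, the article URL is a placeholder, and the check uses Python's standard urllib.robotparser rather than any crawler's own parser.

```python
from urllib.robotparser import RobotFileParser

# Bot names taken from the configuration above.
AI_TRAINING_BOTS = ["GPTBot", "ClaudeBot", "Google-Extended", "Meta-ExternalAgent"]


def build_ai_block_rules(bot_names):
    """Return robots.txt text that blocks the named bots and allows the rest."""
    lines = []
    for name in bot_names:
        lines += [f"User-agent: {name}", "Disallow: /", ""]
    lines += ["User-agent: *", "Disallow:"]
    return "\n".join(lines) + "\n"


robots_txt = build_ai_block_rules(AI_TRAINING_BOTS)

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

page = "https://example.com/articles/launch.html"
blocked = {name: not parser.can_fetch(name, page) for name in AI_TRAINING_BOTS}
search_allowed = {name: parser.can_fetch(name, page) for name in ("Googlebot", "Bingbot")}
print(blocked)
print(search_allowed)
```

Every entry in both dictionaries should be True: the named AI bots are blocked, and the search bots fall through to the permissive * group.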
Limitations of robots.txt
robots.txt relies on voluntary compliance. Well-behaved bots such as Googlebot, Bingbot, and most major AI crawlers follow robots.txt directives. Malicious bots, scrapers, and unidentified crawlers may ignore the file entirely.
robots.txt cannot prevent indexing. Google may still index a URL blocked by robots.txt if other pages link to it. The indexed result appears without a snippet or cached version. To remove a page from search results, use the noindex meta tag or the X-Robots-Tag HTTP header on a crawlable page.
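The two noindex mechanisms can be shown together in a small WSGI application. This is a minimal sketch: the name noindex_app and the page content are hypothetical, and a production site would normally set the header in its web server or framework configuration instead.

```python
def noindex_app(environ, start_response):
    """Minimal WSGI app that serves a crawlable page marked noindex."""
    body = (
        b"<!doctype html><html><head>"
        b'<meta name="robots" content="noindex">'
        b"<title>Internal report</title></head>"
        b"<body>Not for search results.</body></html>"
    )
    headers = [
        ("Content-Type", "text/html; charset=utf-8"),
        ("Content-Length", str(len(body))),
        # Header equivalent of the meta tag; also usable for PDFs and images.
        ("X-Robots-Tag", "noindex"),
    ]
    start_response("200 OK", headers)
    return [body]


# Call the app directly with a stub start_response to inspect the output.
captured = {}


def fake_start_response(status, headers):
    captured["status"] = status
    captured["headers"] = dict(headers)


page = b"".join(noindex_app({}, fake_start_response))
print(captured["headers"]["X-Robots-Tag"])
```

To try it in a browser, the app could be served locally with make_server from the standard wsgiref.simple_server module. Either mechanism only works if the URL is not disallowed in robots.txt, since the bot has to fetch the response to see it.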
robots.txt directives may be interpreted differently across bots. The Crawl-delay directive is supported by Bing and Yandex but ignored by Google. Conflicting Allow and Disallow rules use longest-match resolution in Googlebot and Bingbot, but other bots may use first-match resolution.
Related Tools and Guides
The noindex meta robots tag prevents a page from appearing in search results. Unlike robots.txt, the meta tag requires the page to be crawlable so the bot can read the tag. Use noindex when you need to remove a specific page from the search index rather than blocking crawling. See the SEO Bots overview for a comparison of all crawler control tools.
Google Search Console provides a robots.txt report that shows the robots.txt files Google found for a site and flags fetch or parsing problems; it replaced the older robots.txt tester. Bing Webmaster Tools offers a robots.txt tester for checking how Bingbot interprets specific URLs.