robots.txt

robots.txt is a plain text file that controls how search engine bots and AI crawlers access your website using User-agent, Disallow, Allow, and Sitemap directives.

robots.txt (Robots Exclusion Protocol) is a plain text file placed at the root of a website that instructs search engine bots and AI crawlers which URL paths they may or may not access.

What robots.txt Does and When to Use It

The robots.txt file controls crawling, not indexing. Search engine bots such as Googlebot and Bingbot fetch robots.txt before requesting pages on the site and use it to decide which paths are allowed and which are disallowed. RFC 9309, published by the IETF in September 2022, formalized the Robots Exclusion Protocol, a de facto convention since 1994, as a Standards Track RFC (Proposed Standard).

Because robots.txt governs crawling only, a disallowed page can still appear in search results if other pages link to it with descriptive anchor text. To prevent a page from appearing in search results, use the noindex meta robots tag or the X-Robots-Tag HTTP header instead, and leave the page crawlable so bots can read the instruction.
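As a rough sketch of the header-based alternative, the WSGI application below attaches X-Robots-Tag: noindex to a response using only the Python standard library. The names noindex_app and call_app and the page body are placeholders assumed for illustration; a real site would more likely set the header in its web server or framework configuration.

```python
# Minimal WSGI sketch: serve a page that stays crawlable but asks
# search engines not to index it. All names here are hypothetical.

def noindex_app(environ, start_response):
    headers = [
        ("Content-Type", "text/html; charset=utf-8"),
        # Header form of <meta name="robots" content="noindex">.
        ("X-Robots-Tag", "noindex"),
    ]
    start_response("200 OK", headers)
    return [b"<!doctype html><title>Internal report</title>"]


def call_app(app, path="/"):
    """Invoke a WSGI app in-process and return (status, headers, body)."""
    captured = {}

    def start_response(status, headers, exc_info=None):
        captured["status"] = status
        captured["headers"] = dict(headers)

    environ = {"REQUEST_METHOD": "GET", "PATH_INFO": path}
    body = b"".join(app(environ, start_response))
    return captured["status"], captured["headers"], body
```

The path must not be disallowed in robots.txt, otherwise a compliant bot never requests the page and never sees the header.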

Use robots.txt when you need to manage crawl budget by blocking low-value pages, prevent bots from accessing resource-heavy paths, block crawling of non-HTML resources such as PDFs or images, declare the location of your XML sitemap, or control access for specific bots like AI training crawlers. Do not use robots.txt to hide sensitive content. Compliance with robots.txt is voluntary, and malicious bots ignore it. Protect sensitive pages with authentication, server-side access rules, or a web application firewall (WAF).

Core Concepts of robots.txt

User-agent Directive in robots.txt

The User-agent directive in robots.txt specifies which bot the following rules apply to. Set User-agent: * to target all bots, or use a specific bot name such as Googlebot or Bingbot to create rules for a single crawler. The User-agent value is case-insensitive according to RFC 9309.

User-agent: Googlebot
Disallow: /private/

Each group of rules in robots.txt starts with a User-agent line followed by one or more Disallow or Allow directives. A robots.txt file can contain multiple groups, each targeting a different bot.
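To see how groups are selected, the following sketch feeds a small two-group file to Python's built-in urllib.robotparser. The file contents, the ExampleBot name, and the example.com URLs are assumptions made for illustration. Python's parser is only one implementation, so other crawlers may differ in details.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one group for Googlebot, one for all other bots.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /private/

User-agent: *
Disallow: /tmp/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Googlebot is governed by its own group only, not by the * group.
googlebot_private = parser.can_fetch("Googlebot", "https://example.com/private/page.html")
googlebot_tmp = parser.can_fetch("Googlebot", "https://example.com/tmp/file.html")

# The user-agent name is matched case-insensitively.
googlebot_upper = parser.can_fetch("GOOGLEBOT", "https://example.com/private/page.html")

# Any other bot falls through to the * group.
other_private = parser.can_fetch("ExampleBot", "https://example.com/private/page.html")
other_tmp = parser.can_fetch("ExampleBot", "https://example.com/tmp/file.html")
```

In this sketch Googlebot is blocked from /private/ but not from /tmp/, while ExampleBot sees the opposite result, because each bot follows exactly one group.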

Disallow Directive in robots.txt

The Disallow directive in robots.txt blocks access to a URL path. Bots matching the User-agent line must not crawl paths specified by Disallow. An empty Disallow: value means nothing is blocked.

User-agent: *
Disallow: /admin/
Disallow: /staging/

A Disallow: / rule blocks all paths on the site. Use this to prevent a specific bot from crawling the entire website.

Allow Directive in robots.txt

The Allow directive in robots.txt grants access to a URL path within a broader disallowed area. The Allow directive was not part of the original 1994 protocol but is supported by all major search engine bots including Googlebot and Bingbot, and is codified in RFC 9309.

User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php

When Allow and Disallow rules conflict, Googlebot and Bingbot follow the most specific (longest) matching rule, which is also the behavior RFC 9309 specifies. Parsers that predate the RFC may follow the first matching rule in file order instead.
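The longest-match behavior can be sketched in a few lines of Python. The function is_allowed and the list WP_RULES are hypothetical names for this illustration. The sketch handles plain prefixes only, not wildcards, and is not the code any search engine actually runs.

```python
def is_allowed(rules, path):
    """Resolve Allow/Disallow conflicts by longest matching prefix.

    `rules` is a list of (directive, path_prefix) tuples taken from a
    single User-agent group. Wildcards are deliberately not handled.
    """
    best_len = -1
    allowed = True  # no matching rule means the path may be crawled
    for directive, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue  # empty values and non-matching prefixes are ignored
        is_allow = directive.lower() == "allow"
        # The longer match wins; on a tie, Allow is preferred (RFC 9309).
        if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
            best_len = len(prefix)
            allowed = is_allow
    return allowed


WP_RULES = [
    ("Disallow", "/wp-admin/"),
    ("Allow", "/wp-admin/admin-ajax.php"),
]
```

With these rules, /wp-admin/admin-ajax.php is allowed because the Allow prefix is longer, while /wp-admin/options.php stays blocked. Python's own urllib.robotparser is an example of a first-match parser: it applies the first rule that matches in file order, so rule order matters there.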

Sitemap Directive in robots.txt

The Sitemap directive in robots.txt declares the full URL of the XML sitemap. Search engine bots use this location to discover pages for crawling and indexing. The Sitemap directive is independent of any User-agent group and can appear anywhere in the file.

Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap-news.xml

Multiple Sitemap lines are valid. List separate sitemaps for news, images, or video content as needed.
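As an illustration of how a client might read these lines, the sketch below uses the site_maps() method of Python's urllib.robotparser (available since Python 3.8). The file contents are an assumed example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical file: Sitemap lines sit outside any User-agent group.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/

Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap-news.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# site_maps() returns the listed URLs in file order,
# or None when the file declares no sitemap.
sitemaps = parser.site_maps()
```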

Wildcard and Pattern Matching in robots.txt

robots.txt supports two special characters for pattern matching. The asterisk * matches zero or more characters in a path. The dollar sign $ anchors the match to the end of the URL.

User-agent: *
Disallow: /*.pdf$

This robots.txt rule blocks crawling of all URLs ending in .pdf regardless of directory. The * matches any path prefix, and the $ ensures the match applies only to URLs ending with .pdf.
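One way to reason about these patterns is to translate them into regular expressions. The helper names pattern_to_regex and matches below are made up for this sketch, which covers only the * and $ characters described above. Python's urllib.robotparser does not implement these wildcards, which is why the sketch uses the re module directly.

```python
import re


def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex.

    `*` becomes "any run of characters", a trailing `$` anchors the end
    of the URL, and everything else is matched literally as a prefix.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces, then rejoin them with ".*" where * stood.
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def matches(pattern, path):
    # re.match anchors at the start of the string, giving prefix semantics.
    return pattern_to_regex(pattern).match(path) is not None
```

Under this sketch, /*.pdf$ matches /docs/report.pdf but not /docs/report.pdf?download=1, because the query string means the URL no longer ends in .pdf.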

robots.txt Syntax Reference

Common Search Engine and AI Bot User-Agents

Major search engine bots and AI crawlers that comply with robots.txt directives:

Company | Type | User-agent
Google | Web search | Googlebot
Google | Image search | Googlebot-Image
Google | Video search | Googlebot-Video
Google | News | Googlebot-News
Google | AdSense | Mediapartners-Google
Google | Ads crawl | AdsBot-Google
Google | AI training | Google-Extended
Bing | Web search | bingbot
Bing | Ads | adidxbot
Yahoo | Web search | Slurp
Yandex | Web search | YandexBot
OpenAI | AI training | GPTBot
OpenAI | Search | OAI-SearchBot
Anthropic | AI training | ClaudeBot
Perplexity | AI search | PerplexityBot
Meta | AI training | Meta-ExternalAgent
Apple | Siri / Spotlight | Applebot

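Blocking Google-Extended in robots.txt prevents Google from using your content for AI training (Gemini) but does not affect Googlebot indexing for regular search results. Google-Extended is a control token rather than a separate crawler: Google's existing crawlers fetch the pages, and the token only governs how the content may be used. This separation allows sites to maintain search visibility while opting out of AI training data collection.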

robots.txt Directive Reference

Directive | Purpose | Scope | Example
User-agent | Identifies which bot the rules apply to | Per group | User-agent: Googlebot
Disallow | Blocks access to a path | Per group | Disallow: /private/
Allow | Grants access to a path within a disallowed area | Per group | Allow: /private/public.html
Sitemap | Declares the sitemap URL | Global | Sitemap: https://example.com/sitemap.xml
Crawl-delay | Sets seconds between requests (not supported by Google) | Per group | Crawl-delay: 10

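Google does not support the Crawl-delay directive. Search Console formerly offered a crawl rate setting for Googlebot, but Google retired that tool in early 2024; its documentation now recommends temporarily returning 500, 503, or 429 status codes when Googlebot must be slowed down. Bing, Yahoo, and Yandex support Crawl-delay and interpret it as the minimum number of seconds between consecutive requests.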
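For a custom crawler that wants to honor the directive, Python's urllib.robotparser exposes the value through crawl_delay() (Python 3.6+). The file contents and the ExampleBot name below are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical file: only the bingbot group sets a delay.
ROBOTS_TXT = """\
User-agent: bingbot
Crawl-delay: 10
Disallow: /search/

User-agent: *
Disallow: /search/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# The delay is returned as an int number of seconds for the matching group.
bing_delay = parser.crawl_delay("bingbot")

# The * group declares no Crawl-delay, so other bots get None.
default_delay = parser.crawl_delay("ExampleBot")
```

A crawler built on this could sleep for the returned number of seconds between requests and fall back to its own default when the value is None.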

Common Tasks with robots.txt

How to Create a robots.txt File

Create a plain text file named robots.txt and place it at the root of the website. The file must be accessible at https://example.com/robots.txt. Save the file in UTF-8 encoding.

User-agent: *
Disallow:

Sitemap: https://example.com/sitemap.xml

This robots.txt file allows all bots to crawl every page on the site and declares the sitemap location. An empty Disallow: value means no paths are restricted.
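If the file is generated by a build or deployment script, the same content can be produced programmatically. The sketch below is one possible approach: build_robots_txt and write_robots_txt are hypothetical helper names, and the web root directory is whatever your server actually serves from.

```python
from pathlib import Path


def build_robots_txt(disallow_paths=(), sitemap_urls=()):
    """Return robots.txt content with a single group for all bots."""
    lines = ["User-agent: *"]
    if disallow_paths:
        lines += [f"Disallow: {path}" for path in disallow_paths]
    else:
        lines.append("Disallow:")  # empty value: nothing is blocked
    if sitemap_urls:
        lines.append("")  # blank line before the global Sitemap lines
        lines += [f"Sitemap: {url}" for url in sitemap_urls]
    return "\n".join(lines) + "\n"


def write_robots_txt(web_root, content):
    """Write the file as UTF-8 at the top level of the web root."""
    target = Path(web_root) / "robots.txt"
    target.write_text(content, encoding="utf-8")
    return target
```

Calling build_robots_txt(sitemap_urls=["https://example.com/sitemap.xml"]) reproduces the allow-all file shown above.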

How to Block a Directory with robots.txt

Block a specific directory path to prevent bots from crawling pages under that path.

User-agent: *
Disallow: /admin/

The trailing slash in /admin/ blocks all URLs under the /admin/ directory. Without the trailing slash, Disallow: /admin also blocks URLs like /admin-page because robots.txt uses prefix matching.
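The effect of the trailing slash can be checked with Python's urllib.robotparser, which also matches rules by prefix. The helper allowed_paths, the ExampleBot name, and the sample paths are assumptions made for this sketch.

```python
from urllib.robotparser import RobotFileParser


def allowed_paths(disallow_value, paths):
    """Return {path: can_fetch} for a single `User-agent: *` group."""
    parser = RobotFileParser()
    parser.parse(["User-agent: *", f"Disallow: {disallow_value}"])
    return {
        path: parser.can_fetch("ExampleBot", f"https://example.com{path}")
        for path in paths
    }


PATHS = ["/admin", "/admin/", "/admin/users", "/admin-page", "/about"]

# Disallow: /admin/ blocks only URLs inside the directory.
with_slash = allowed_paths("/admin/", PATHS)

# Disallow: /admin blocks every URL whose path starts with "/admin".
without_slash = allowed_paths("/admin", PATHS)
```

In this sketch, /admin-page remains crawlable with the trailing slash and becomes blocked without it. The bare /admin URL is blocked only by the rule without the slash.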

How to Block AI Training Crawlers with robots.txt

Block AI training crawlers while keeping the site visible in traditional search results. Target each AI bot by its specific User-agent name.

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

User-agent: *
Disallow:

This robots.txt configuration blocks GPTBot (OpenAI), ClaudeBot (Anthropic), Google-Extended (Google AI training), and Meta-ExternalAgent (Meta) from crawling any page. The final User-agent: * block with an empty Disallow allows all other bots, including Googlebot and Bingbot, to crawl the entire site.
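Before deploying a configuration like this, it can be sanity-checked offline. The sketch below runs the same rules through Python's urllib.robotparser. The test URL is an assumed example, and the result reflects how Python's parser reads the file, not a guarantee of how each vendor's crawler behaves.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

URL = "https://example.com/articles/launch.html"  # hypothetical page
BOTS = [
    "GPTBot", "ClaudeBot", "Google-Extended", "Meta-ExternalAgent",
    "Googlebot", "bingbot",
]

# Map each bot name to whether it may fetch the URL.
access = {bot: parser.can_fetch(bot, URL) for bot in BOTS}
```

The four AI training tokens map to False, while Googlebot and bingbot map to True through the final User-agent: * group.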

Limitations of robots.txt

robots.txt relies on voluntary compliance. Well-behaved bots such as Googlebot, Bingbot, and most major AI crawlers follow robots.txt directives. Malicious bots, scrapers, and unidentified crawlers may ignore the file entirely.

robots.txt cannot prevent indexing. Google may still index a URL blocked by robots.txt if other pages link to it. The indexed result appears without a snippet or cached version. To remove a page from search results, use the noindex meta tag or the X-Robots-Tag HTTP header on a crawlable page.

robots.txt directives may be interpreted differently across bots. The Crawl-delay directive is supported by Bing and Yandex but ignored by Google. Conflicting Allow and Disallow rules use longest-match resolution in Googlebot and Bingbot, but other bots may use first-match resolution.

The noindex meta robots tag prevents a page from appearing in search results. Unlike robots.txt, the meta tag requires the page to be crawlable so the bot can read the tag. Use noindex when you need to remove a specific page from the search index rather than blocking crawling. See the SEO Bots overview for a comparison of all crawler control tools.

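Google Search Console provides a robots.txt report that lists the robots.txt files Google found for a site, when they were last crawled, and any warnings or errors; this report replaced the older robots.txt tester. To check whether a specific URL is blocked for Googlebot, use the URL Inspection tool. Bing Webmaster Tools offers a robots.txt tester for validating how Bingbot interprets the file.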