The role of robots.txt
A robots.txt file lives at the root of your domain and tells well-behaved crawlers which paths they are allowed to fetch. It is the oldest and simplest tool in the SEO toolbox, dating back to 1994. Despite its age, robots.txt is still the primary way to keep crawlers out of admin pages, draft URLs, search-result pages, and other low-value content that would otherwise dilute your site’s index.
Crucially, robots.txt is not a security mechanism. Bad actors will ignore it. Use it for crawl control, and keep sensitive content behind authentication.
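To make the "well-behaved crawler" idea concrete, here is a minimal sketch of how a compliant client might consult the rules before fetching a URL. It uses Python's standard-library `urllib.robotparser`. The rules, the `ExampleBot` user agent, the `example.com` URLs, and the `is_allowed` helper are all made up for illustration. Nothing forces a crawler to run a check like this, which is exactly why robots.txt cannot protect sensitive content.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for an example site; the paths are illustrative only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /search
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if a compliant crawler with this user agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# A polite crawler asks first; a bad actor simply skips this step.
print(is_allowed(ROBOTS_TXT, "ExampleBot", "https://example.com/blog/post"))
print(is_allowed(ROBOTS_TXT, "ExampleBot", "https://example.com/admin/login"))
```

The first check should come back allowed and the second disallowed, because only the second URL falls under a `Disallow` prefix.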
Modern AI crawler controls
In recent years, a wave of new crawlers has appeared, many of them collecting web pages to train large language models. The robots.txt generator includes presets for the most common ones so you can allow or disallow them with a single toggle: OpenAI GPTBot, Anthropic ClaudeBot and anthropic-ai, Google-Extended (controls use of your content for Gemini, formerly Bard, without affecting Google Search), PerplexityBot, CCBot (Common Crawl), and Bytespider (ByteDance).
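As a rough illustration of what such a toggle could produce, the sketch below emits one rule group per blocked crawler and then checks the result with Python's standard-library parser. This is not the generator's actual implementation. The `ai_crawler_rules` helper and the `example.com` URL are hypothetical, and the user-agent tokens are the ones commonly published for these crawlers (Bytespider being the token used by ByteDance's crawler).

```python
from urllib.robotparser import RobotFileParser

# User-agent tokens for the crawlers named above.
AI_CRAWLERS = [
    "GPTBot",
    "ClaudeBot",
    "anthropic-ai",
    "Google-Extended",
    "PerplexityBot",
    "CCBot",
    "Bytespider",
]


def ai_crawler_rules(blocked):
    """Emit one group per blocked crawler, each disallowing the whole site."""
    groups = []
    for agent in AI_CRAWLERS:
        if agent in blocked:
            groups.append(f"User-agent: {agent}\nDisallow: /\n")
    return "\n".join(groups)


# Baseline group that allows everything, followed by the per-crawler blocks.
rules = "User-agent: *\nDisallow:\n\n" + ai_crawler_rules({"GPTBot", "CCBot"})
print(rules)

parser = RobotFileParser()
parser.parse(rules.splitlines())
print(parser.can_fetch("GPTBot", "https://example.com/article"))
print(parser.can_fetch("Googlebot", "https://example.com/article"))
```

With GPTBot and CCBot toggled off, the parser should report GPTBot as blocked while other crawlers, such as Googlebot, still fall through to the permissive `User-agent: *` group.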
Best practices
- Always start with a global User-agent: * block as a baseline.
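- Use rules such as Disallow: /admin/ to keep crawlers out of directories that should not be crawled. The file is public, so do not rely on it to hide sensitive paths.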
- Add a Sitemap: directive with the full, absolute URL of your sitemap.xml.
- Test the result before deploying, for example with the robots.txt report in Google Search Console, which replaced Google’s older robots.txt Tester.
- Remember that Disallow does not de-index already-indexed pages. Use a noindex meta tag for that, and leave the page crawlable so that crawlers can actually see the tag.
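The checklist above can be sketched as a small script that assembles a file and runs a local sanity check before deployment. The `build_robots_txt` helper, the paths, and the `example.com` sitemap URL are hypothetical. Python's `urllib.robotparser` does simple prefix matching and is less capable than the parsers search engines use (it has no wildcard support, for instance), so treat this as a quick pre-flight check under those assumptions, not a substitute for testing with the search engine's own tools. `site_maps()` needs Python 3.8 or newer.

```python
from urllib.robotparser import RobotFileParser


def build_robots_txt(disallowed_paths, sitemap_url):
    """Build a baseline User-agent: * group plus a Sitemap line."""
    lines = ["User-agent: *"]
    if disallowed_paths:
        lines += [f"Disallow: {path}" for path in disallowed_paths]
    else:
        # An empty Disallow value means "nothing is blocked".
        lines.append("Disallow:")
    lines += ["", f"Sitemap: {sitemap_url}"]
    return "\n".join(lines) + "\n"


robots_txt = build_robots_txt(
    ["/admin/", "/search"],
    "https://example.com/sitemap.xml",  # must be an absolute URL
)
print(robots_txt)

# Local sanity check before the file goes live.
parser = RobotFileParser()
parser.parse(robots_txt.splitlines())
checks = {
    "https://example.com/": True,
    "https://example.com/admin/users": False,
    "https://example.com/search?q=shoes": False,
}
for url, expected in checks.items():
    assert parser.can_fetch("*", url) is expected, url
assert parser.site_maps() == ["https://example.com/sitemap.xml"]
```

Keeping the expected results in a small table like `checks` makes it easy to rerun the same assertions whenever the rules change.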