robots.txt is a plain text file that tells crawlers which parts of your site they should or should not request. It is not a security layer, but it is an important signal for responsible bots. A tester helps you verify whether a given User-Agent can fetch a path based on the longest matching rule. This tool accepts the full file, a URL to check, and a crawler name, then shows the matched group, rule lengths, and the final allow or disallow decision.
Quick start - paste, set, test
Paste your robots.txt content, add a URL from your site, and pick a User-Agent. If your file uses multiple groups, make sure the group you expect to apply includes the User-Agent you entered. Click test. The result lists the group that matched, all rules with their match lengths, and the final decision. Remember that the longest match wins and that an Allow can override a Disallow if they tie on length. Keep a note of edge cases you want to check again after deployment.
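If you prefer to script the same paste-set-test check, Python's standard library ships urllib.robotparser. Below is a minimal sketch with placeholder file content, paths, and a made-up crawler name (ExampleBot); as far as I can tell, this parser uses plain prefix matching applied in file order and does not understand the * and $ wildcards discussed later, so its verdicts can differ from this tester or from Google on edge cases.

```python
from urllib import robotparser

# Hypothetical pasted file, paths, and crawler names - all placeholders.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: ExampleBot
Allow: /drafts/public/
Disallow: /drafts/
"""

parser = robotparser.RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# can_fetch() accepts a full URL or a bare path.
for agent, path in [("ExampleBot", "/drafts/public/post.html"),
                    ("OtherBot", "/private/report.html"),
                    ("OtherBot", "/blog/")]:
    verdict = "allowed" if parser.can_fetch(agent, path) else "disallowed"
    print(f"{agent} -> {path}: {verdict}")
```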
Groups and precedence - who the rules apply to
Groups begin with one or more User-agent lines followed by Allow and Disallow lines. A crawler looks for the most specific group that mentions it. If no specific group exists, it falls back to the * group. When two rules match the same path, the rule with the longer match decides. If both rules match with equal length, Allow wins. This model is consistent with the published standard and common crawler behavior, and it prevents surprising outcomes when you add narrow exceptions under a broad block.
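Here is a minimal sketch of that precedence model, under simplifying assumptions: rules are treated as plain path prefixes (no * or $ handling here) and the group lookup is reduced to exact User-agent name with * as the fallback. The function and group names are illustrative, not part of any library.

```python
def choose_group(groups, user_agent):
    """groups maps a User-agent token to a list of (kind, path) rules."""
    return groups.get(user_agent.lower(), groups.get("*", []))

def decide(rules, path):
    """Longest match wins; Allow wins a tie; no match means allowed."""
    best = ("allow", "")  # (kind, matched rule path)
    for kind, rule_path in rules:
        if path.startswith(rule_path) and (
            len(rule_path) > len(best[1])
            or (len(rule_path) == len(best[1]) and kind == "allow")
        ):
            best = (kind, rule_path)
    return best

# Hypothetical group: a broad block with a narrow exception under it.
groups = {
    "*": [("disallow", "/private/"), ("allow", "/private/help/")],
}

for path in ("/private/help/faq.html", "/private/notes.txt", "/blog/"):
    kind, rule = decide(choose_group(groups, "ExampleBot"), path)
    print(f"{path}: {kind} (matched {rule!r}, length {len(rule)})")
```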
Wildcards and anchors - how patterns behave
Two special tokens matter in common practice. * matches any sequence of characters, and $ anchors a rule to the end of the path. For example, Disallow: /*.pdf$ blocks exactly those paths that end in .pdf, while Disallow: /private/ blocks anything under that folder. Keep rules simple so you can reason about them later; overusing wildcards often creates unintended blocks that are hard to spot on a large site.
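One way to reason about these tokens is to translate a rule into a regular expression: * becomes "match anything", a trailing $ becomes an end anchor, and everything else is literal. The sketch below does just that and skips details such as percent-encoding normalization; the rule strings are examples.

```python
import re

def rule_to_regex(rule_path):
    """Translate a robots.txt path rule into a regex: '*' matches any
    run of characters, a trailing '$' anchors the end of the path."""
    anchored = rule_path.endswith("$")
    body = rule_path[:-1] if anchored else rule_path
    pattern = "".join(".*" if ch == "*" else re.escape(ch) for ch in body)
    return re.compile(pattern + ("$" if anchored else ""))

pdf_rule = rule_to_regex("/*.pdf$")       # Disallow: /*.pdf$
folder_rule = rule_to_regex("/private/")  # Disallow: /private/

for path in ("/docs/manual.pdf", "/docs/manual.pdf?download=1", "/private/a.html"):
    print(path,
          "| pdf rule:", bool(pdf_rule.match(path)),
          "| folder rule:", bool(folder_rule.match(path)))
```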
Sitemaps, crawl delay, and notes
robots.txt can include Sitemap lines to point crawlers to your sitemaps. Crawl-delay still appears in some older files, but support is inconsistent and Google does not honor it; use server rate controls and caching if load is a concern. For the definitive overview, Google's documentation on robots.txt behavior and indexing (Google Search - robots.txt) is the most reliable reference when you implement or troubleshoot. The formal specification published by the IETF, RFC 9309, is also worth bookmarking for the exact language on matching.
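If you want to read the Sitemap lines out of a file programmatically, recent versions of the standard library expose them as well. The sketch below assumes a placeholder file and uses site_maps(), which is available in Python 3.8 and later and returns None when no Sitemap lines are present.

```python
from urllib import robotparser

# Hypothetical file with two Sitemap lines; the URLs are placeholders.
ROBOTS_TXT = """\
User-agent: *
Disallow: /tmp/

Sitemap: https://www.example.com/sitemap.xml
Sitemap: https://www.example.com/sitemap-news.xml
"""

parser = robotparser.RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# site_maps() (Python 3.8+) returns the declared Sitemap URLs, or None.
print(parser.site_maps())
```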
Comparison - CMS plugins vs manual rules
| Aspect | CMS plugin | Manual file |
| --- | --- | --- |
| Setup speed | Fast | Fast for simple sites |
| Granularity | Limited by UI | Exact control |
| Error risk | Lower for basics | Higher without tests |
| Versioning | App managed | Git friendly |
Bullet notes - safe patterns you can trust
- Block private and staging paths by prefix rather than file type where possible.
- Allow assets like CSS and JS needed for rendering so crawlers can fetch them.
- Keep the file small and readable - comments help future maintainers.
- Test the most sensitive paths with your intended User-Agent before and after release - a small checklist sketch follows this list.
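One way to make that last point routine is a small checklist script you run before and after each release. Everything below - the file content, the paths, the ExampleBot name, and the expected outcomes - is a placeholder. Note that urllib.robotparser evaluates rules in file order, which is why the narrow Allow is listed before the broader Disallow so its verdicts line up with the longest-match model described earlier.

```python
from urllib import robotparser

# Hypothetical production robots.txt following the patterns above: prefix
# blocks for private and staging areas, with rendering assets allowed.
ROBOTS_TXT = """\
User-agent: *
Allow: /private/assets/
Disallow: /private/
Disallow: /staging/
"""

# Each entry: (User-Agent, path, expected decision).
CHECKS = [
    ("ExampleBot", "/private/reports/2024.html", False),
    ("ExampleBot", "/private/assets/site.css", True),
    ("ExampleBot", "/blog/launch-post", True),
]

parser = robotparser.RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

for agent, path, expected in CHECKS:
    actual = parser.can_fetch(agent, path)
    status = "OK" if actual == expected else "UNEXPECTED"
    print(f"{status}: {agent} {path} -> {'allowed' if actual else 'blocked'}")
```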
Common pitfalls - avoid silent indexing issues
Blocking crawling of content does not remove it from the index if the URLs are already known. If removal is the goal, serve a 404 or 410, or use noindex on the page while it remains accessible. Blocking assets required for rendering can cause crawlers to misjudge layout or mobile friendliness, so make exceptions for important asset paths. When you mirror production on staging, make sure the staging robots.txt is strict and cannot leak into public search.
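When removal rather than crawl control is the goal, it helps to spot-check what a URL actually serves once crawlers can reach it. Here is a rough standard-library sketch that reports the status code, any X-Robots-Tag header, and whether a robots meta tag containing noindex appears in the body (a crude substring check, not a real HTML parse); the URL is a placeholder.

```python
import urllib.request
import urllib.error

def removal_signals(url):
    """Report status code, X-Robots-Tag header, and a rough noindex check."""
    try:
        with urllib.request.urlopen(url) as resp:
            status = resp.status
            x_robots = resp.headers.get("X-Robots-Tag")
            body = resp.read(200_000).decode("utf-8", errors="replace").lower()
            meta_noindex = 'name="robots"' in body and "noindex" in body
    except urllib.error.HTTPError as err:
        # 404 and 410 land here; they are the removal signals we want to see.
        status = err.code
        x_robots = err.headers.get("X-Robots-Tag")
        meta_noindex = False
    return status, x_robots, meta_noindex

# Hypothetical URL to spot-check after deployment.
print(removal_signals("https://www.example.com/old-report.html"))
```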
Two questions before you ship
First, do your rules express intent in the simplest way possible - clear prefixes and short exceptions rather than overlapping wildcards? Second, if you blocked crawling for sections of the site, are you sure none of those URLs need rendering assets that live on blocked paths? A five-minute test with this tool can prevent days of diagnosis later.
robots.txt is not a cure-all. It is a polite signpost for crawlers and a small guard against wasted bandwidth. Keep your file clean, lean, and tested. Paired with sitemaps, proper canonical tags, and good server responses, it helps search engines understand how to spend time on your site where it matters most.