
Robots and Sitemap Checker Guide

Use this guide to understand what the Robots and Sitemap Checker is validating, how crawl rules and sitemaps work together, and how to avoid conflicting signals that waste crawl effort or suppress the wrong pages.

Overview

Crawlability is strongest when robots directives, sitemap entries, canonical tags, and indexability intent all point in the same direction. The checker surfaces the places where those signals drift apart and create confusion for crawlers.

Signals to review together

  • robots.txt: Rules should match the public/private boundary you actually intend to expose.
  • sitemap.xml: Only publish URLs you want discovered and validated as indexable candidates.
  • canonical and noindex signals: A sitemap entry should not fight the page-level indexing intent.
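The first of these signals can be verified mechanically. As a minimal sketch using only Python's standard library (the rules and URLs are illustrative), `urllib.robotparser` can test whether specific paths match your intended public/private boundary:

```python
from urllib.robotparser import RobotFileParser

# Illustrative rules matching the public/private boundary described above.
robots_txt = """\
User-agent: *
Disallow: /dashboard
Disallow: /admin
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# Private surfaces should be blocked, public ones allowed.
print(rp.can_fetch("*", "https://example.com/dashboard"))  # False
print(rp.can_fetch("*", "https://example.com/blog/post"))  # True
```

Running checks like this against every boundary path catches a robots rule that drifted away from the intended exposure before crawlers see it.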

Common Crawl-Policy Conflicts

  • URL in sitemap but blocked by robots: the URL is advertised and blocked at the same time. Fix: align discovery and access intent.
  • Public guide noindexed accidentally: the page exists but tells crawlers not to keep it. Fix: correct the page-level metadata, or remove the URL from the crawl inventory.
  • Private app routes in sitemap: discovery signals point crawlers at URLs not meant for indexing. Fix: restrict sitemap generation to the intended public surface.
  • Canonical points elsewhere: the sitemap URL and the page canonical disagree. Fix: choose the real canonical and align both outputs.
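The first conflict above is easy to detect automatically: extract every sitemap URL and test it against the robots rules. A hedged sketch using only Python's standard library (the function name and sample URLs are illustrative):

```python
import xml.etree.ElementTree as ET
from urllib.robotparser import RobotFileParser

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def blocked_sitemap_urls(robots_txt, sitemap_xml):
    """Return sitemap URLs that robots.txt blocks for all user agents."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    root = ET.fromstring(sitemap_xml)
    locs = [el.text.strip() for el in root.iter(SITEMAP_NS + "loc")]
    return [url for url in locs if not rp.can_fetch("*", url)]

robots_txt = "User-agent: *\nDisallow: /dashboard\n"
sitemap_xml = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/blog/guide</loc></url>
  <url><loc>https://example.com/dashboard/reports</loc></url>
</urlset>"""

print(blocked_sitemap_urls(robots_txt, sitemap_xml))
# ['https://example.com/dashboard/reports']
```

Any URL this returns is being advertised and blocked at the same time, which is exactly the mixed signal the checker flags.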

Practical File Examples

Basic robots.txt for a mixed public/private app
User-agent: *
# Private, logged-in surfaces: keep out of the crawl
Disallow: /dashboard
Disallow: /admin
Disallow: /settings
# Public surfaces (allowed by default; listed to make intent explicit)
Allow: /tools
Allow: /blog
Allow: /help
Sitemap: https://example.com/sitemap.xml

Simple sitemap entry
<url>
  <loc>https://example.com/tools/guides/csp-checker</loc>
  <changefreq>weekly</changefreq>
  <priority>0.6</priority>
</url>
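For context, a complete sitemap file wraps entries like this in a urlset element with the standard sitemap namespace. A minimal illustrative file (not checker output):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/tools/guides/csp-checker</loc>
    <changefreq>weekly</changefreq>
    <priority>0.6</priority>
  </url>
</urlset>
```

The namespace declaration matters: validators and crawlers match elements against it, so a sitemap without it may be rejected.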

Recommended Remediation Flow

  1. Decide the real public URL inventory. Separate public marketing, guides, tools, blog, and legal pages from account and admin routes.
  2. Align robots, sitemap, and page metadata. A crawl-facing URL should not simultaneously be blocked, noindexed, or canonicalized elsewhere by mistake.
  3. Remove accidental discovery of private surfaces. Make sure sitemaps and crawl rules do not advertise URLs that belong only to logged-in users.
  4. Retest the live outputs. Validate robots.txt, sitemap.xml, and representative pages from the same public edge crawlers will see.
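Steps 1 and 3 amount to generating the sitemap from a single authoritative route inventory rather than from whatever the framework exposes. A minimal sketch, with a hypothetical inventory of (path, is_public) pairs:

```python
# Hypothetical route inventory: (path, is_public) pairs.
ROUTES = [
    ("/blog/launch-post", True),
    ("/tools/guides/csp-checker", True),
    ("/dashboard", False),
    ("/admin/users", False),
]

BASE = "https://example.com"

def sitemap_entries(routes):
    """Emit <url> entries only for routes on the intended public surface."""
    return [
        "<url><loc>{}{}</loc></url>".format(BASE, path)
        for path, is_public in routes
        if is_public
    ]

for entry in sitemap_entries(ROUTES):
    print(entry)
```

Because private routes never enter the generator's input as public, they cannot leak into the sitemap, which removes the "private app routes in sitemap" conflict at its source.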

Troubleshooting Common Issues

A public page is missing from discovery

The usual causes are sitemap omission, accidental robots blocking, or page-level noindex drift.

  • Check sitemap inclusion first.
  • Compare robots rules against the exact public path.
  • Review canonical and robots tags on the final page response.
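The last check above can be scripted: parse the final HTML response for the robots meta tag and the canonical link. A sketch using Python's stdlib `html.parser` (the class name and markup are illustrative):

```python
from html.parser import HTMLParser

class IndexSignals(HTMLParser):
    """Collect the page-level robots meta content and the canonical href."""
    def __init__(self):
        super().__init__()
        self.robots = None
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "meta" and a.get("name", "").lower() == "robots":
            self.robots = a.get("content", "")
        if tag == "link" and a.get("rel", "").lower() == "canonical":
            self.canonical = a.get("href")

page = """<html><head>
<meta name="robots" content="noindex, nofollow">
<link rel="canonical" href="https://example.com/other-page">
</head><body></body></html>"""

p = IndexSignals()
p.feed(page)
print(p.robots)     # noindex, nofollow
print(p.canonical)  # https://example.com/other-page
```

Run this against the rendered response from the public edge, not a local build, since templates and middleware can change the final metadata.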

A private page keeps appearing in crawl outputs

This often happens when the URL leaked into a sitemap or inherited public metadata from a shared template.

  • Remove the URL from the sitemap source.
  • Verify page-level robots and canonical behavior on the final response.
  • Retest through the public edge rather than a local dev path.

Validation Checklist

Post-fix validation

  • Confirm robots.txt references the intended sitemap and does not block the public pages you want crawled.
  • Verify sitemap.xml only contains URLs that belong on the public surface.
  • Check representative public pages for canonical and robots alignment.
  • Run the Robots and Sitemap Checker again and compare the output to the intended crawl policy.

FAQ

Should every public URL appear in the sitemap?

Not always, but every important indexable URL usually should.

  • Include URLs you genuinely want discovered and maintained.
  • Exclude private, logged-in, or duplicate surfaces.
  • Keep the sitemap tied to the authoritative public route inventory.

Does robots.txt prevent indexing on its own?

Not reliably. robots.txt controls crawling, not indexing: a URL blocked from crawling can still be indexed if other pages link to it.

  • Use page-level metadata when you need explicit noindex intent.
  • Do not rely on robots.txt alone for every privacy decision.
  • Align crawl rules with canonical and metadata signals.
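For reference, explicit noindex intent is usually expressed at the page level, either as a meta tag in the document head or as a response header. Illustrative snippets:

```html
<!-- In the <head> of the page that must not be indexed -->
<meta name="robots" content="noindex">
```

The equivalent HTTP response header is X-Robots-Tag: noindex, which is useful for non-HTML resources such as PDFs. In either form, crawlers must be able to fetch the page to see the directive, so do not also block it in robots.txt.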