Search engine optimization (SEO) is not just about keywords and HTML metadata! Those are the most basic things one can do (and they can easily improve site visibility), but there are other tricks that go a bit deeper and are more technical than what ordinary readers ever see.
I found that webmasters can go to Google Search Console and Bing Webmaster Tools to add sitemaps. This continues our previous discussion on Search Engine Architecture.
Sitemap
A sitemap is simply a file (or sometimes a web page) that tells search engines about the pages on our site. It helps in a few ways:
- Better discovery: Search engines won’t have to “guess” at pages.
- Faster indexing: New or updated content gets found more quickly when we update the sitemap.
- Structured hints: Metadata in an XML sitemap gives crawlers extra clues about how often and how important different pages are.
There are two main flavors:
- XML Sitemap (for search engines)
  - It’s an XML-formatted file (usually named `sitemap.xml`) that lives at the website’s root.
  - Inside, it lists all of the website’s important URLs, plus optional metadata like:
    - `<lastmod>` (when the page was last changed)
    - `<changefreq>` (how often it tends to be updated)
    - `<priority>` (a hint about which pages we consider most important)
  - By submitting this file to Google Search Console or Bing Webmaster Tools, we help crawlers discover and index our pages more efficiently — especially useful if we have a very large site, pages that aren’t well linked internally, or lots of media content. A sample file is shown below, and a small generation sketch follows after this list.
```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.example.com/</loc>
    <lastmod>2025-05-20</lastmod>
    <changefreq>daily</changefreq>
    <priority>1.0</priority>
  </url>
  <url>
    <loc>https://www.example.com/blog/post-1</loc>
    <lastmod>2025-05-18</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.8</priority>
  </url>
  <!-- more URLs here -->
</urlset>
```
- HTML Sitemap (for people)
  - It’s just a regular web page on the site that lists links to all pages in a human-readable format.
  - It’s primarily a usability feature, helping visitors (and indirectly search engines) navigate large or complex sites.
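Maintaining `sitemap.xml` by hand gets tedious as a site grows, so the file is usually generated from a list of known pages. Below is a rough sketch using Python's standard library; the page list, dates, and priorities are made up purely for illustration:

```python
import xml.etree.ElementTree as ET

# Hypothetical page data: (URL, last-modified date, change frequency, priority).
pages = [
    ("https://www.example.com/", "2025-05-20", "daily", "1.0"),
    ("https://www.example.com/blog/post-1", "2025-05-18", "weekly", "0.8"),
]

urlset = ET.Element("urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
for loc, lastmod, changefreq, priority in pages:
    url = ET.SubElement(urlset, "url")
    ET.SubElement(url, "loc").text = loc
    ET.SubElement(url, "lastmod").text = lastmod
    ET.SubElement(url, "changefreq").text = changefreq
    ET.SubElement(url, "priority").text = priority

# Write sitemap.xml with an XML declaration, ready to upload to the site root.
ET.ElementTree(urlset).write("sitemap.xml", encoding="UTF-8", xml_declaration=True)
```

Once the file sits at the root, it can be submitted in Google Search Console or Bing Webmaster Tools as described above.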
robots.txt
The `robots.txt` file is another “cheat-sheet” to put at the very root of the website (e.g. https://www.example.com/robots.txt) to tell well-behaved web crawlers which parts of the site they’re welcome to explore, and which parts we’d rather keep off-limits.
- Privacy & security: Keep staging directories, admin panels, or confidential files out of search results.
- Crawl-budget control: On large sites, we can steer crawlers away from low-value pages (like faceted filters), so they focus on important content (see the snippet right after this list).
- Performance: Reduce server load by preventing bots from hammering resource-heavy sections.
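As a concrete (and purely hypothetical) illustration of the crawl-budget point: the `*` wildcard in paths, which Google and Bing both support, lets us fence off endlessly filterable variants of the same pages. The directives themselves are explained in the notes below.

```
# Hypothetical rules: keep crawlers off low-value faceted-filter URLs
User-agent: *
Disallow: /*?filter=
Disallow: /*?sort=
```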
Some notes:
- Location matters
  - Must live at `https://domain.com/robots.txt` (exactly).
  - Crawlers automatically look here first before they begin crawling website pages.
- Basic syntax
  - It’s plain text, with directives grouped by `User-agent` (the crawler’s name).
  - Common directives:
    * `Disallow:` — path (or file) we don’t want crawled
    * `Allow:` — exception to a `Disallow:` (supported by Google, Bing, etc.)
    * `Sitemap:` — URL of the XML sitemap
```
# Block all crawlers from /private/
User-agent: *
Disallow: /private/

# Allow Googlebot to see /private/public-info.html
User-agent: Googlebot
Allow: /private/public-info.html

# Let everyone know where the sitemap lives
Sitemap: https://www.example.com/sitemap.xml
```
- User-agents
  - `*` is the wildcard: applies to every crawler.
  - We can target specific bots (e.g., `User-agent: Googlebot`, `User-agent: Bingbot`) if we need different rules.
- Disallow vs. Allow
  - `Disallow: /` — don’t crawl anything on the site.
  - `Disallow:` (empty) — allow everything.
  - `Allow: /path/to/page.html` — lets a crawler crawl a page that would otherwise be blocked by a broader `Disallow:` (see the quick check after this list).
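These rules can be surprisingly easy to misread, so it helps to check them mechanically. Here's a quick sketch using Python's built-in `urllib.robotparser` with the example rules from above (the URLs are placeholders):

```python
import urllib.robotparser

# The example rules from earlier, as a plain string.
rules = """\
User-agent: *
Disallow: /private/

User-agent: Googlebot
Allow: /private/public-info.html
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# A generic crawler only matches the "*" group, so /private/ is off-limits...
print(rp.can_fetch("SomeBot", "https://www.example.com/private/reports.html"))  # False
# ...while the rest of the site stays crawlable.
print(rp.can_fetch("SomeBot", "https://www.example.com/blog/post-1"))  # True

# Googlebot matches its own group, where this page is explicitly allowed.
print(rp.can_fetch("Googlebot", "https://www.example.com/private/public-info.html"))  # True
```

A crawler-specific group takes precedence over the `*` group, which mirrors how major crawlers decide which rules to obey.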
A few best practices:
- `robots.txt` is publicly visible — don’t use it to hide truly sensitive info (use authentication!).
- Test in Google Search Console’s “robots.txt Tester” or Bing Webmaster Tools.
- Combine with sitemaps: always include a `Sitemap:` line so crawlers can discover all valid URLs easily (a small check script follows below).
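To tie the two files together, we can read `robots.txt`, follow its `Sitemap:` lines, and count the URLs each sitemap declares. A rough sketch (the domain is a placeholder, and `site_maps()` needs Python 3.8+):

```python
import urllib.request
import urllib.robotparser
import xml.etree.ElementTree as ET

SITE = "https://www.example.com"  # placeholder domain

# Read robots.txt and collect any Sitemap: lines it advertises.
rp = urllib.robotparser.RobotFileParser(SITE + "/robots.txt")
rp.read()

ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
for sitemap_url in rp.site_maps() or []:
    # Fetch each listed sitemap and count its <loc> entries.
    with urllib.request.urlopen(sitemap_url) as resp:
        tree = ET.parse(resp)
    locs = tree.findall(".//sm:loc", ns)
    print(f"{sitemap_url}: {len(locs)} URLs listed")
```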
Here is an example from Google: https://www.google.com/robots.txt