How to scrape a sitemap and crawl every page

Updated 2026-06-25 · 6 min read

If you want every page of a site rather than the few you can click to, the sitemap is the list you are looking for. The catch is that /sitemap.xml is often not a flat list of pages. Large sites split their URLs across a sitemap index, where the top file is a list of other sitemap files (<sitemap> entries pointing at sitemap-posts.xml, sitemap-pages.xml, and so on), and each of those holds the actual <url> entries. A parser that expects one flat document reads the index, finds no page URLs, and returns an empty list.

The solution is to fetch the sitemap and detect which kind it is before reading any URLs. We'll build a small script that does three things. It fetches the sitemap and parses the XML so we can branch on the root element. It distinguishes a sitemap index (a list of other sitemap files, used by large sites to stay under the 50,000-URL-per-file limit) from a flat list of pages, and follows each child sitemap when it finds an index. And it walks the collected URLs through a concurrency pool that holds a fixed number of fetches in flight, so we read the pages at a steady rate instead of opening hundreds of sockets at once. It comes to about 70 lines of Node.js (version 18 or later, for the built-in fetch) with one dependency, fast-xml-parser.

The complete script

// crawl-sitemap.mjs
import { XMLParser } from 'fast-xml-parser'

const SITEMAP_URL = 'https://www.scrapingbee.com/sitemap.xml'
const CONCURRENCY = 5      // how many page fetches run at once
const MAX_PAGES = 50       // stop after this many, so a demo run stays small

// isArray forces <sitemap> and <url> to always parse as arrays,
// even when the document holds exactly one of them. without this a
// one-entry sitemap parses to an object and the .map() below throws.
const parser = new XMLParser({
  ignoreAttributes: true,
  isArray: (name) => name === 'sitemap' || name === 'url'
})

// fetch one XML document and parse it. throws on any non-2xx status so a
// missing child sitemap surfaces instead of silently parsing an error page.
async function fetchXml(url) {
  const res = await fetch(url, { headers: { 'User-Agent': 'Mozilla/5.0' } })
  if (!res.ok) throw new Error(`${res.status} ${res.statusText} for ${url}`)
  return parser.parse(await res.text())
}

// return the page URLs in a sitemap. if the document is a sitemap index,
// recurse into each child sitemap and concatenate their URLs.
async function collectUrls(sitemapUrl) {
  const doc = await fetchXml(sitemapUrl)

  if (doc.sitemapindex) {
    const children = doc.sitemapindex.sitemap.map((s) => s.loc)
    const nested = await Promise.all(children.map((child) => collectUrls(child)))
    return nested.flat()
  }

  if (doc.urlset) {
    return doc.urlset.url.map((u) => u.loc)
  }

  return []  // neither root element: not a sitemap we recognize
}

// crawl an array of URLs, keeping `limit` fetches in flight at a time.
// each finished fetch frees a slot and the next URL starts, so the
// number of open sockets never exceeds `limit`.
async function crawl(urls, limit, handle) {
  const queue = [...urls]
  const running = new Set()

  while (queue.length || running.size) {
    while (running.size < limit && queue.length) {
      const url = queue.shift()
      const task = handle(url).finally(() => running.delete(task))
      running.add(task)
    }
    if (running.size) await Promise.race(running)
  }
}

const urls = await collectUrls(SITEMAP_URL)
console.log(`Discovered ${urls.length} URLs`)

const pages = urls.slice(0, MAX_PAGES)
await crawl(pages, CONCURRENCY, async (url) => {
  try {
    const res = await fetch(url, { headers: { 'User-Agent': 'Mozilla/5.0' } })
    const html = await res.text()
    console.log(`${res.status}  ${html.length} bytes  ${url}`)
  } catch (err) {
    console.log(`ERR   ${err.message}  ${url}`)
  }
})

Install the dependency and run the script:

npm install fast-xml-parser
node crawl-sitemap.mjs

How it works

Configure the parser to keep single entries as arrays. fast-xml-parser collapses a repeated element with one occurrence into a plain object, so a sitemap with a single <url> would parse to doc.urlset.url being an object, not an array, and .map() would throw. The isArray callback forces sitemap and url to parse as arrays every time, which lets the same .map() handle a one-entry file and a fifty-thousand-entry file. Setting ignoreAttributes: true drops the XML namespace attributes you do not need, so each entry is the plain text of its child elements.
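To make the failure concrete, here is a minimal sketch of the two shapes involved. The objects are hand-written stand-ins for what the parser would return without the isArray option, the URLs are made up, and asArray and locs are hypothetical helpers rather than part of fast-xml-parser. Normalizing after the parse is an alternative to isArray, not what the script above does.

```javascript
// Hand-written stand-ins for parser output WITHOUT the isArray option.
// A lone <url> collapses to an object; two or more become an array.
const oneEntry = { urlset: { url: { loc: 'https://example.com/only' } } }
const twoEntries = {
  urlset: { url: [{ loc: 'https://example.com/a' }, { loc: 'https://example.com/b' }] }
}

// Hypothetical helper: wrap a lone value, pass arrays through,
// and treat a missing value as an empty list.
const asArray = (value) => (value == null ? [] : Array.isArray(value) ? value : [value])

// Read every <loc> regardless of which shape the parser produced.
const locs = (doc) => asArray(doc.urlset?.url).map((u) => u.loc)

console.log(typeof oneEntry.urlset.url.map) // "undefined", so calling .map() here would throw
console.log(locs(oneEntry))
console.log(locs(twoEntries))
```

The isArray option is the tidier fix because it removes the special case at the source, but a normalizer like this can be useful when you do not control the parser configuration.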

Fetch and throw on a bad status. fetchXml sends a browser-style User-Agent, because some hosts return a block page to the default Node client. It checks res.ok and throws on any status outside the 2xx range, so a child sitemap that 404s raises an error instead of feeding an HTML error page into the XML parser and producing an empty result you have to debug later. One thing this version does not handle is a gzipped sitemap. Many sites publish sitemap.xml.gz, and calling res.text() on a gzip body hands the parser binary bytes rather than XML, so the parse either fails or finds neither root element. When the URL ends in .gz, decompress first with zlib.gunzipSync(Buffer.from(await res.arrayBuffer())) (zlib comes from node:zlib) before parsing.
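One way to add gzip support is sketched below. It makes two assumptions that differ from the text above: it detects gzip by the two magic bytes at the start of the body instead of by the .gz suffix, and it uses the web-standard DecompressionStream global (available in Node 18 and later) in place of zlib.gunzipSync, so the block needs no import. isGzip and sitemapBodyToText are hypothetical names.

```javascript
// Gzip data starts with the two magic bytes 0x1f 0x8b.
function isGzip(bytes) {
  return bytes.length >= 2 && bytes[0] === 0x1f && bytes[1] === 0x8b
}

// Turn a fetched sitemap body (a Uint8Array) into XML text,
// decompressing first when the body is gzip.
async function sitemapBodyToText(bytes) {
  if (!isGzip(bytes)) return new TextDecoder().decode(bytes)
  const stream = new Blob([bytes]).stream().pipeThrough(new DecompressionStream('gzip'))
  return new Response(stream).text()
}

// Hypothetical change to the last line of fetchXml in the script:
//   const bytes = new Uint8Array(await res.arrayBuffer())
//   return parser.parse(await sitemapBodyToText(bytes))
```

Checking the bytes rather than the URL also covers the case where the server sends the file with a Content-Encoding: gzip header and fetch has already decompressed it, since the body then starts with XML and passes straight through.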

Branch on the root element. A sitemap index parses to an object with a sitemapindex key; a normal sitemap parses to one with a urlset key. collectUrls checks for sitemapindex first, maps each <sitemap> entry's <loc> to a child URL, and recurses into each child with Promise.all. When the root is urlset it returns the <loc> of each <url>. A document with neither root returns an empty array rather than throwing.
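The branch can be exercised without a network by injecting the function that loads a document. The sketch below is a variation on collectUrls, not a copy of it: it adds a seen set so a sitemap listed twice (or an index that lists itself) is read only once, and it removes duplicate page URLs at the end. fakeDocs is hand-written to mimic the parsed shape, every URL in it is made up, and collectUrlsGuarded, uniquePageUrls, and loadFake are hypothetical names.

```javascript
// Hand-written stand-ins for parsed sitemap documents, keyed by URL.
const fakeDocs = {
  'https://example.com/sitemap.xml': {
    sitemapindex: {
      sitemap: [
        { loc: 'https://example.com/sitemap-posts.xml' },
        { loc: 'https://example.com/sitemap-pages.xml' },
        { loc: 'https://example.com/sitemap.xml' } // an index that lists itself
      ]
    }
  },
  'https://example.com/sitemap-posts.xml': {
    urlset: { url: [{ loc: 'https://example.com/a' }, { loc: 'https://example.com/b' }] }
  },
  'https://example.com/sitemap-pages.xml': {
    urlset: { url: [{ loc: 'https://example.com/b' }, { loc: 'https://example.com/c' }] }
  }
}

// Stand-in for fetchXml: look the document up instead of fetching it.
const loadFake = async (url) => fakeDocs[url] ?? {}

// Same root-element branch as collectUrls, with `loadDoc` injected and a
// `seen` set that stops any sitemap from being read a second time.
async function collectUrlsGuarded(sitemapUrl, loadDoc, seen = new Set()) {
  if (seen.has(sitemapUrl)) return []
  seen.add(sitemapUrl)

  const doc = await loadDoc(sitemapUrl)
  if (doc.sitemapindex) {
    const children = doc.sitemapindex.sitemap.map((s) => s.loc)
    const nested = await Promise.all(
      children.map((child) => collectUrlsGuarded(child, loadDoc, seen))
    )
    return nested.flat()
  }
  if (doc.urlset) return doc.urlset.url.map((u) => u.loc)
  return []
}

// Collect, then drop page URLs that appear in more than one child sitemap.
async function uniquePageUrls(rootUrl, loadDoc) {
  return [...new Set(await collectUrlsGuarded(rootUrl, loadDoc))]
}

uniquePageUrls('https://example.com/sitemap.xml', loadFake).then((urls) => console.log(urls))
```

Passing fetchXml as loadDoc would run the same logic against a live site. Whether the guard is worth having depends on how much you trust the sitemaps you read.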

Crawl through a concurrency pool. crawl keeps a Set of in-flight promises. The inner loop tops the set up to limit by shifting URLs off the queue, and Promise.race waits for the next one to finish before the loop refills the empty slot. The result is a steady limit requests in flight across the whole list, which is faster than fetching one at a time and far gentler on the target than Promise.all over every URL. Promise.all would open one socket per URL, and pointed at a 5,000-URL sitemap that is likely to earn a 429 or a connection reset, so keep limit small.

Nothing in fetch consults robots.txt, so read it yourself and check each URL before fetching, for example with a library like robots-parser, and honor any Crawl-delay it sets. Sitemaps also go stale, so check res.status in the handler and skip or record anything that is not 200 instead of treating an error body as page content.

A real sitemap can list tens of thousands of pages, so a full run can take hours, and a crash partway loses everything. Keep MAX_PAGES low while developing, persist each result to disk or a database keyed by URL as you go, and on the next run skip the URLs that are already stored. Swap the console.log in the handler for your own extraction, such as parsing the HTML or saving it to disk.
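The pool, the status check, and the skip-on-rerun idea can be tried together without touching a network. In this sketch crawl is repeated from the script so the block runs on its own. Everything else is an assumption for illustration: makeHandler, makeFakeFetcher, and demo are hypothetical names, the fetcher is a timer-based fake, and the store is an in-memory Map standing in for a file or database.

```javascript
// Same pool as in the script, repeated so this sketch is self-contained.
async function crawl(urls, limit, handle) {
  const queue = [...urls]
  const running = new Set()
  while (queue.length || running.size) {
    while (running.size < limit && queue.length) {
      const url = queue.shift()
      const task = handle(url).finally(() => running.delete(task))
      running.add(task)
    }
    if (running.size) await Promise.race(running)
  }
}

// Build a handler that skips URLs already in `store`, records non-200
// responses instead of treating them as content, and never rejects.
function makeHandler(fetchPage, store) {
  return async (url) => {
    if (store.has(url)) return // finished on an earlier run
    try {
      const { status, body } = await fetchPage(url)
      store.set(url, status === 200 ? { ok: true, bytes: body.length } : { ok: false, status })
    } catch (err) {
      store.set(url, { ok: false, error: err.message })
    }
  }
}

// Fake fetcher: resolves after a short delay, answers 404 for any URL
// containing "gone", and tracks how many calls overlap.
function makeFakeFetcher() {
  const stats = { inFlight: 0, peak: 0, calls: 0 }
  async function fetchPage(url) {
    stats.calls++
    stats.inFlight++
    stats.peak = Math.max(stats.peak, stats.inFlight)
    await new Promise((resolve) => setTimeout(resolve, 5))
    stats.inFlight--
    return url.includes('gone')
      ? { status: 404, body: 'Not found' }
      : { status: 200, body: `<html>${url}</html>` }
  }
  return { fetchPage, stats }
}

// Crawl twelve made-up URLs three at a time, then crawl them again.
async function demo() {
  const urls = Array.from({ length: 12 }, (_, i) =>
    `https://example.com/${i === 3 ? 'gone' : 'page-' + i}`)
  const store = new Map()
  const { fetchPage, stats } = makeFakeFetcher()

  await crawl(urls, 3, makeHandler(fetchPage, store))
  const callsAfterFirstRun = stats.calls
  await crawl(urls, 3, makeHandler(fetchPage, store)) // every URL is skipped

  return { peak: stats.peak, callsAfterFirstRun, callsAfterRerun: stats.calls, store }
}

demo().then(({ peak, callsAfterRerun, store }) =>
  console.log(`peak in flight: ${peak}, fetches: ${callsAfterRerun}, stored: ${store.size}`))
```

In a real run the Map would be loaded from and written to durable storage, and fetchPage would wrap fetch and return res.status with the body. The handler catches its own errors for a reason: a rejected task would otherwise reject Promise.race and stop the whole crawl.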

Use this when

The site publishes a sitemap and you want broad coverage of its pages: building a search index, mirroring a docs site, feeding an LLM a whole knowledge base, or auditing a site for broken or changed pages. You want the page list the site already maintains, not one you discover by following links.

Skip this when

The site has no sitemap and you must discover pages by following links (crawl from a seed URL and extract links per page instead); the pages render their content with JavaScript so the fetched HTML is an empty shell (render each page with Puppeteer first); the data lives behind a login (handle the session before crawling); or you only need a single page's content rather than the whole site (fetch and extract that one URL).
