How to scrape a sitemap and crawl every page
If you want every page of a site rather than the few you can click to, the sitemap is the list you are looking for. The catch is that /sitemap.xml is often not a flat list of pages. Large sites split their URLs across a sitemap index, where the top file is a list of other sitemap files (<sitemap> entries pointing at sitemap-posts.xml, sitemap-pages.xml, and so on), and each of those holds the actual <url> entries. A parser that expects one flat document reads the index, finds no page URLs, and hands you back nothing.
The solution is to fetch the sitemap with the built-in fetch, parse the XML with fast-xml-parser, branch on whether the document is a <sitemapindex> or a <urlset>, and when it is an index fetch each child sitemap and collect its URLs. Once you hold the URL list you crawl it through a small concurrency pool that keeps a fixed number of requests in flight, so you read the pages at a steady rate instead of opening hundreds of sockets at once. It comes to about 70 lines of Node.js with one dependency, fast-xml-parser.
Key terms
- Sitemap index. A sitemap whose root is <sitemapindex> and whose entries are other sitemap files, not pages, used by large sites to stay under the 50,000-URL-per-file limit.
- urlset. A sitemap whose root is <urlset> and whose <url> entries each carry a <loc>, the actual page URLs you want.
- <loc>. The element inside both <sitemap> and <url> entries that holds the absolute URL, either of a child sitemap or of a page.
- Concurrency pool. A loop that keeps a fixed number of fetches running at once and starts the next URL only as one finishes, instead of firing every request in parallel.
Here is what the script does:
- Fetch the sitemap with the built-in fetch and parse the XML into an object with fast-xml-parser.
- Detect whether the root is a <sitemapindex> (a list of sitemaps) or a <urlset> (a list of pages).
- When the root is an index, fetch each child sitemap and gather the page URLs from all of them.
- Walk the collected URLs through a concurrency pool that holds a fixed number of fetches in flight at once.
The complete script
// crawl-sitemap.mjs
import { XMLParser } from 'fast-xml-parser'

const SITEMAP_URL = 'https://www.scrapingbee.com/sitemap.xml'
const CONCURRENCY = 5 // how many page fetches run at once
const MAX_PAGES = 50 // stop after this many, so a demo run stays small

// isArray forces <sitemap> and <url> to always parse as arrays,
// even when the document holds exactly one of them. Without this a
// one-entry sitemap parses to an object and the .map() below throws.
const parser = new XMLParser({
  ignoreAttributes: true,
  isArray: (name) => name === 'sitemap' || name === 'url'
})

// Fetch one XML document and parse it. Throws on a non-2xx status so a
// missing child sitemap surfaces instead of silently parsing an error page.
async function fetchXml(url) {
  const res = await fetch(url, { headers: { 'User-Agent': 'Mozilla/5.0' } })
  if (!res.ok) throw new Error(`${res.status} ${res.statusText} for ${url}`)
  return parser.parse(await res.text())
}

// Return the page URLs in a sitemap. If the document is a sitemap index,
// recurse into each child sitemap and concatenate their URLs.
async function collectUrls(sitemapUrl) {
  const doc = await fetchXml(sitemapUrl)
  if (doc.sitemapindex) {
    const children = doc.sitemapindex.sitemap.map((s) => s.loc)
    const nested = await Promise.all(children.map((child) => collectUrls(child)))
    return nested.flat()
  }
  if (doc.urlset) {
    return doc.urlset.url.map((u) => u.loc)
  }
  return [] // neither root element: not a sitemap we recognize
}

// Crawl an array of URLs, keeping `limit` fetches in flight at a time.
// Each finished fetch frees a slot and the next URL starts, so the
// number of open sockets never exceeds `limit`.
async function crawl(urls, limit, handle) {
  const queue = [...urls]
  const running = new Set()
  while (queue.length || running.size) {
    while (running.size < limit && queue.length) {
      const url = queue.shift()
      const task = handle(url).finally(() => running.delete(task))
      running.add(task)
    }
    if (running.size) await Promise.race(running)
  }
}

const urls = await collectUrls(SITEMAP_URL)
console.log(`Discovered ${urls.length} URLs`)
const pages = urls.slice(0, MAX_PAGES)

await crawl(pages, CONCURRENCY, async (url) => {
  try {
    const res = await fetch(url, { headers: { 'User-Agent': 'Mozilla/5.0' } })
    const html = await res.text()
    console.log(`${res.status} ${html.length} bytes ${url}`)
  } catch (err) {
    console.log(`ERR ${err.message} ${url}`)
  }
})

Install the one dependency and run the script:

npm install fast-xml-parser
node crawl-sitemap.mjs

What each step does
Configure the parser to keep single entries as arrays. fast-xml-parser collapses a repeated element with one occurrence into a plain object, so a sitemap with a single <url> would parse to doc.urlset.url being an object, not an array, and .map() would throw. The isArray callback forces sitemap and url to parse as arrays every time, which lets the same .map() handle a one-entry file and a fifty-thousand-entry file. Setting ignoreAttributes: true drops the XML namespace attributes you do not need, so each entry is the plain text of its child elements.
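If you would rather not depend on parser configuration alone, the same normalization can be done after parsing. The sketch below is a minimal illustration, assuming a hypothetical helper named asArray (it is not part of fast-xml-parser or of the script) and hand-written objects standing in for parser output. It also tolerates an entry key that is missing altogether.

```javascript
// Hypothetical helper: normalize "zero, one, or many" into an array.
function asArray(value) {
  if (value === undefined || value === null) return []
  return Array.isArray(value) ? value : [value]
}

// Hand-written stand-ins for parsed sitemaps (not real parser output).
const oneEntry = { urlset: { url: { loc: 'https://example.com/a' } } }
const manyEntries = {
  urlset: {
    url: [{ loc: 'https://example.com/a' }, { loc: 'https://example.com/b' }]
  }
}
const noEntries = { urlset: {} }

// The same .map() now works for all three shapes.
const locs = (doc) => asArray(doc.urlset.url).map((u) => u.loc)

console.log(locs(oneEntry)) // one URL
console.log(locs(manyEntries)) // two URLs
console.log(locs(noEntries)) // empty array
```

Using both the isArray option and a helper like this is belt and braces; either one alone is enough for the single-entry case.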
Fetch and throw on a bad status. fetchXml sends a browser-style User-Agent (Mozilla/5.0), because some hosts return a block page to the default Node client. It checks res.ok and throws on any status outside 200-299, so a child sitemap that 404s raises an error instead of feeding an HTML error page into the XML parser and producing an empty result you have to debug later.
Branch on the root element. A sitemap index parses to an object with a sitemapindex key; a normal sitemap parses to one with a urlset key. collectUrls checks for sitemapindex first, maps each <sitemap> entry's <loc> to a child URL, and recurses into each child with Promise.all. When the root is urlset it returns the <loc> of each <url>. A document with neither root returns an empty array rather than throwing.
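To see the branching on its own, here is a sketch that runs the same logic against hand-written parsed documents instead of the network. The names collectWith, fakeSite, and fakeLoad are hypothetical, the example.com URLs are made up, and the seen set is an addition that the script does not have: it guards against an index that lists itself.

```javascript
// Same branching as collectUrls, but the loader is passed in so the logic
// can be exercised offline. `load` is a hypothetical stand-in for fetchXml:
// it takes a URL and resolves to an already-parsed document.
async function collectWith(load, sitemapUrl, seen = new Set()) {
  if (seen.has(sitemapUrl)) return [] // already visited: avoid looping forever
  seen.add(sitemapUrl)
  const doc = await load(sitemapUrl)
  if (doc.sitemapindex) {
    const children = doc.sitemapindex.sitemap.map((s) => s.loc)
    const nested = await Promise.all(
      children.map((child) => collectWith(load, child, seen))
    )
    return nested.flat()
  }
  if (doc.urlset) return doc.urlset.url.map((u) => u.loc)
  return [] // neither root element
}

// Made-up parsed documents keyed by URL: one index and two child sitemaps.
const fakeSite = {
  'https://example.com/sitemap.xml': {
    sitemapindex: {
      sitemap: [
        { loc: 'https://example.com/sitemap-posts.xml' },
        { loc: 'https://example.com/sitemap-pages.xml' }
      ]
    }
  },
  'https://example.com/sitemap-posts.xml': {
    urlset: {
      url: [
        { loc: 'https://example.com/blog/one' },
        { loc: 'https://example.com/blog/two' }
      ]
    }
  },
  'https://example.com/sitemap-pages.xml': {
    urlset: { url: [{ loc: 'https://example.com/about' }] }
  }
}
const fakeLoad = async (url) => fakeSite[url] ?? {}

collectWith(fakeLoad, 'https://example.com/sitemap.xml').then((urls) => {
  console.log(urls) // the three page URLs, posts first, then pages
})
```

Because Promise.all preserves the order of its inputs, the page URLs come back in the order the index lists its children.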
Crawl through a concurrency pool. crawl keeps a Set of in-flight promises. The inner loop tops the set up to limit by shifting URLs off the queue, and Promise.race waits for the next one to finish before the loop refills the empty slot. The result is at most limit requests in flight at any moment across the whole list, which is gentler on the target than Promise.all over every URL and faster than fetching one at a time. Swap the console.log in the handler for your own extraction, such as parsing the HTML or saving it to disk.
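You can check the pool's behavior without touching a real site. The sketch below copies crawl from the script and feeds it simulated work: each "fetch" is a short timer, and a counter records the most handlers that were ever active at once. The measurePeak function and the example.com URLs are hypothetical, added only for this demonstration.

```javascript
// Copy of the pool from the script, so this demo runs on its own.
async function crawl(urls, limit, handle) {
  const queue = [...urls]
  const running = new Set()
  while (queue.length || running.size) {
    while (running.size < limit && queue.length) {
      const url = queue.shift()
      const task = handle(url).finally(() => running.delete(task))
      running.add(task)
    }
    if (running.size) await Promise.race(running)
  }
}

// Run `count` simulated fetches through the pool and report the highest
// number of handlers that were active at the same moment.
async function measurePeak(count, limit) {
  let active = 0
  let peak = 0
  let finished = 0
  const fakeUrls = Array.from(
    { length: count },
    (_, i) => `https://example.com/page-${i}`
  )
  await crawl(fakeUrls, limit, async () => {
    active += 1
    peak = Math.max(peak, active)
    // Stand-in for network latency: wait 5 to 15 ms.
    await new Promise((resolve) => setTimeout(resolve, 5 + Math.random() * 10))
    active -= 1
    finished += 1
  })
  return { peak, finished }
}

measurePeak(20, 5).then(({ peak, finished }) => {
  // peak should not exceed the limit; finished should equal the URL count
  console.log(`peak in flight: ${peak}, finished: ${finished}`)
})
```

One design point worth knowing: if a handler rejects, Promise.race rejects too and crawl stops with that error. That is why the handler in the script wraps its work in try/catch.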
Gotchas
The sitemap is an index, so a flat parser finds zero pages.
- Issue: doc.urlset is undefined on a large site because the root is <sitemapindex>, a list of other sitemaps, and reading doc.urlset.url throws or returns nothing.
- Fix: branch on the root element, and when it is sitemapindex fetch each child sitemap's <loc> and collect their URLs, as collectUrls does here.
A single-entry sitemap parses to an object, not an array.
- Issue: fast-xml-parser turns a <urlset> with one <url> into doc.urlset.url being an object, so .map() throws "is not a function".
- Fix: pass isArray: (name) => name === 'sitemap' || name === 'url' so those elements always parse as arrays regardless of count.
Firing every URL at once gets you rate limited.
- Issue: await Promise.all(urls.map(fetchPage)) opens one socket per URL, so a 5,000-URL sitemap hits the host with 5,000 simultaneous requests and earns a 429 or a connection reset.
- Fix: run the URLs through the concurrency pool with a small limit (5 here), which holds the request rate to a fixed number in flight.
The sitemap can be gzipped.
- Issue: Many sites publish sitemap.xml.gz, and await res.text() on a gzip body hands the parser binary bytes, which fails with "Non-whitespace before first tag" or similar.
- Fix: when the URL ends in .gz or the Content-Type is application/gzip, decompress first with the built-in zlib.gunzipSync(Buffer.from(await res.arrayBuffer())) before parsing.
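A sketch of the decompression step, with one substitution stated plainly: instead of trusting the URL suffix or the Content-Type, it looks at the first two bytes of the body, since every gzip stream starts with 0x1f 0x8b. The helper name readSitemapText is hypothetical and not part of the script.

```javascript
// Return the XML text of a sitemap response, gunzipping it when the body
// is a gzip file. Detection is by gzip magic bytes (0x1f 0x8b), which is
// an alternative to checking the URL suffix or Content-Type header.
async function readSitemapText(res) {
  // Dynamic import keeps this snippet usable from ESM and CommonJS files.
  const { gunzipSync } = await import('node:zlib')
  const body = Buffer.from(await res.arrayBuffer())
  const isGzip = body.length >= 2 && body[0] === 0x1f && body[1] === 0x8b
  return (isGzip ? gunzipSync(body) : body).toString('utf8')
}

// Usage inside fetchXml, in place of `parser.parse(await res.text())`:
//   return parser.parse(await readSitemapText(res))
```

Sniffing the bytes also covers a server that labels a .gz file with a generic Content-Type, at the cost of always buffering the body before decoding it.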
A robots.txt Disallow does not stop your crawler.
- Issue: Nothing in fetch consults robots.txt, so the script will request paths the site asks crawlers to skip, which is impolite and can get your IP blocked.
- Fix: read robots.txt and check each URL before fetching it, with a library like robots-parser, and honor any Crawl-delay it sets.
The discovered URL count outruns your run.
- Issue: A real sitemap can list tens of thousands of pages, so crawling the whole list in one process can run for hours and lose everything if it crashes partway.
- Fix: cap the run with MAX_PAGES while developing, persist progress as you go (write each result to disk or a database keyed by URL), and resume from the last unfinished URL on the next run.
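One simple way to make a run resumable is an append-only file of finished URLs. The sketch below assumes a hypothetical file name (crawl-done.txt) and hypothetical helpers loadDone, markDone, and remaining; none of them are part of the script, and a database would do the same job for larger crawls.

```javascript
// Hypothetical progress file: one finished URL per line.
const DONE_FILE = 'crawl-done.txt'

// Read the set of URLs finished in earlier runs.
async function loadDone(file = DONE_FILE) {
  // Dynamic import keeps this snippet usable from ESM and CommonJS files.
  const { readFile } = await import('node:fs/promises')
  try {
    const text = await readFile(file, 'utf8')
    return new Set(text.split('\n').filter(Boolean))
  } catch (err) {
    if (err.code === 'ENOENT') return new Set() // first run: nothing done yet
    throw err
  }
}

// Record one finished URL. Appending line by line means earlier progress
// survives if the process dies partway through the list.
async function markDone(url, file = DONE_FILE) {
  const { appendFile } = await import('node:fs/promises')
  await appendFile(file, url + '\n')
}

// Return only the URLs that still need crawling.
async function remaining(urls, file = DONE_FILE) {
  const done = await loadDone(file)
  return urls.filter((u) => !done.has(u))
}
```

To wire it in, pass the result of remaining(urls) to crawl instead of the full list, and call markDone(url) in the handler once a page has been processed and saved.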
A page listed in the sitemap returns a non-200.
- Issue: Sitemaps go stale, so some <loc> entries point at pages that now 404 or 301, and a naive handler treats the error body as page content.
- Fix: check res.status in the handler (the script logs it) and skip or record anything that is not 200 instead of parsing it as a page. Because fetch follows redirects by default, a moved page shows up as res.redirected being true rather than as a 301 status.
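A handler along these lines could look like the sketch below. The names classify and handlePage and the three labels are hypothetical. It relies on two standard Response properties: status, and redirected, which is true when fetch followed a redirect to reach the final URL.

```javascript
// Classify a fetch Response before treating its body as page content.
// The labels are arbitrary; adapt them to whatever you record.
function classify(res) {
  if (res.redirected) return 'redirected' // a 3xx was followed; res.url is the final URL
  if (res.status === 200) return 'ok'
  return 'skip'
}

// Fetch one page and return its HTML, or null when it should not be parsed.
async function handlePage(url) {
  const res = await fetch(url, { headers: { 'User-Agent': 'Mozilla/5.0' } })
  const kind = classify(res)
  if (kind !== 'ok') {
    console.log(`${kind} ${res.status} ${url} -> ${res.url}`)
    await res.body?.cancel() // discard the unread body
    return null
  }
  return res.text()
}
```

Whether a redirected page should be skipped or crawled at its new address depends on your use case; the sketch skips it so that stale sitemap entries are easy to spot in the log.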
Use this when
The site publishes a sitemap and you want broad coverage of its pages: building a search index, mirroring a docs site, feeding an LLM a whole knowledge base, or auditing a site for broken or changed pages. You want the page list the site already maintains, not one you discover by following links.
Skip this when
The site has no sitemap and you must discover pages by following links (crawl from a seed URL and extract links per page instead); the pages render their content with JavaScript so the fetched HTML is an empty shell (render each page with Puppeteer first); the data lives behind a login (handle the session before crawling); or you only need a single page's content rather than the whole site (fetch and extract that one URL).