How to scrape data loaded from an XHR/fetch request
If you've tried to scrape a page where the data appears a beat after load, you have probably found that the HTML you fetch is an empty shell: the product list, the comments, the price table are all missing because the browser pulls them in afterward with an XHR or fetch call. Driving a full headless browser to wait for that data works, but it spends a second or two of Chromium startup and rendering on every single page, which adds up fast across a few thousand requests.
The page is already telling you where the data lives. The solution is to capture the network traffic once with Puppeteer to learn the JSON endpoint the page calls, then request that endpoint directly with a plain HTTP client on every later run, so you get the raw JSON without a browser at all. The discovery script is about 30 lines and the direct fetch is about 10.
Key terms
- XHR / fetch request. A background HTTP call the page's JavaScript makes after the document loads, usually to a JSON endpoint, to fill in content the initial HTML does not contain.
- Response interception. Listening to Puppeteer's response event so your code sees every network response the page receives, including the JSON ones.
- CDP. The Chrome DevTools Protocol, the same channel the Network tab uses; Puppeteer's response event is a high-level wrapper over it.
- node-fetch. A library that brings the browser fetch API to Node, used here to call the endpoint once you know its URL.
Here is what the script does:
- Open the target page in Puppeteer and listen to the response event so it sees every background request the page makes.
- Filter those responses down to the ones that return JSON, and log each endpoint URL with its payload so you can read off the one carrying your data.
- Take the endpoint URL you found and re-request it with node-fetch, sending the headers the page sent, and parse the JSON directly.
- Skip Puppeteer entirely on the second script once you know the URL, so later runs are an HTTP round trip instead of a browser session.
The complete script
// find-xhr-endpoint.mjs
import puppeteer from 'puppeteer'
// We load the page a person visits; it fetches /api/quotes?page=1 in the background.
const humanPage = 'https://quotes.toscrape.com/scroll'
const browser = await puppeteer.launch({ headless: true })
const page = await browser.newPage()
// Collect every JSON response the page receives while it loads.
const jsonResponses = []
page.on('response', async (response) => {
  const headers = response.headers()
  const contentType = headers['content-type'] || ''
  // Keep only responses that actually carry JSON; skip HTML, images, fonts.
  if (!contentType.includes('application/json')) return
  // response.json() throws on empty or non-JSON bodies, so guard it.
  let body
  try {
    body = await response.json()
  } catch {
    return
  }
  jsonResponses.push({
    url: response.url(),
    status: response.status(),
    method: response.request().method(),
    sample: body
  })
})
await page.goto(humanPage, { waitUntil: 'networkidle2' })
await browser.close()
// Read off the endpoint carrying your data, then call it directly (below).
for (const r of jsonResponses) {
  console.log(`${r.status} ${r.method} ${r.url}`)
  console.log(JSON.stringify(r.sample).slice(0, 200))
}

npm install puppeteer
node find-xhr-endpoint.mjs

Once the discovery script prints the endpoint, drop the browser. This second script is what you run on a schedule:
// fetch-xhr-endpoint.mjs
import fetch from 'node-fetch'
// The endpoint the page called, copied from find-xhr-endpoint.mjs output.
const endpoint = 'https://quotes.toscrape.com/api/quotes?page=1'
const res = await fetch(endpoint, {
  headers: {
    // Send what the browser sent: a normal UA and an XHR marker.
    'User-Agent': 'Mozilla/5.0',
    'Accept': 'application/json',
    'X-Requested-With': 'XMLHttpRequest'
  }
})
if (!res.ok) throw new Error(`Endpoint returned ${res.status}`)
const data = await res.json()
console.log(data.quotes.map((q) => q.text))

npm install node-fetch
node fetch-xhr-endpoint.mjs

What each step does
Load the page humans see, not the API. The discovery script opens /scroll, the page a person visits. That page's own JavaScript fires the background request to /api/quotes. You let the page reveal its endpoint instead of guessing the URL.
Listen to the response event before navigating. Register page.on('response', ...) first, then call page.goto(). Responses that arrive during load only reach handlers that were already attached, so a listener added after navigation misses the early requests.
Filter to JSON by content type. Every page load fires dozens of responses for images, fonts, and tracking pixels. Checking content-type for application/json narrows the log to the few responses that could hold your data. The slice(0, 200) keeps each printout short enough to scan.
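The content-type check can be pulled into a small helper. This is a minimal sketch, not part of the script above: the name isJsonContentType is hypothetical, and accepting +json subtypes such as application/vnd.api+json is an assumption that goes beyond the plain application/json test the discovery script uses.

```javascript
// Hypothetical helper: decide whether a content-type header value denotes JSON.
function isJsonContentType(contentType) {
  // Drop parameters such as "; charset=utf-8" and normalise case.
  const mediaType = (contentType || '').split(';')[0].trim().toLowerCase()
  // Assumption: also accept structured suffixes like application/vnd.api+json.
  return mediaType === 'application/json' || mediaType.endsWith('+json')
}

console.log(isJsonContentType('application/json; charset=utf-8')) // true
console.log(isJsonContentType('text/html')) // false
```

Inside the response listener, the early return would then read if (!isJsonContentType(headers['content-type'])) return.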
Re-request with the headers the page sent. Some endpoints return a different response, or a 403, when the request lacks the X-Requested-With: XMLHttpRequest header or a normal User-Agent. The discovery script shows you the request method and URL; add response.request().headers() to its log to see the headers the browser sent, then copy them into the direct fetch so the server treats your call the same way.
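One way to carry headers over is to filter the captured request headers down to a short allowlist. The sketch below assumes the input is the lowercase-keyed object that Puppeteer's response.request().headers() returns; the allowlist REPLAY_HEADERS, the function name pickReplayHeaders, and the sample values are all hypothetical.

```javascript
// Hypothetical allowlist of request headers worth replaying in the direct fetch.
const REPLAY_HEADERS = ['user-agent', 'accept', 'x-requested-with', 'referer']

// Takes a lowercase-keyed header object (as response.request().headers()
// returns) and keeps only the allowlisted entries that are present.
function pickReplayHeaders(requestHeaders) {
  const picked = {}
  for (const name of REPLAY_HEADERS) {
    if (requestHeaders[name] !== undefined) picked[name] = requestHeaders[name]
  }
  return picked
}

// Example input shaped like a captured browser request (values are made up).
const captured = {
  'user-agent': 'Mozilla/5.0',
  'accept': '*/*',
  'x-requested-with': 'XMLHttpRequest',
  'sec-ch-ua-mobile': '?0'
}
console.log(pickReplayHeaders(captured))
```

The result can be passed straight to the headers option of the direct fetch. An allowlist is used rather than copying everything because browser-only headers are unlikely to matter to the endpoint.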
Run the second script alone. Once you have the endpoint, fetch-xhr-endpoint.mjs never launches Chromium. An HTTP round trip takes on the order of 100 ms, while a browser session costs a second or more of startup and rendering; that gap is where the speed-up across a batch comes from.
Gotchas
The endpoint needs an auth token or cookie.
- Issue: the direct fetch returns 401 or 403 because the page set a session cookie or attached a Bearer token from a login step, and your bare request carries neither.
- Fix: read the token from the discovery run. Log response.request().headers() alongside the URL, copy the authorization or cookie value into your fetch headers, and refresh it when it expires.
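The refresh step for an expired token could look roughly like this. It is a sketch under assumptions: getToken is a hypothetical function you would supply (for example one that re-runs the discovery script and returns the current bearer token, taking true to force a refresh), and fetchImpl defaults to the global fetch available in Node 18 and later, so pass node-fetch's fetch explicitly on older versions.

```javascript
// Sketch: call an endpoint with a bearer token and refresh it once on 401/403.
// getToken(forceRefresh) is hypothetical and must be supplied by the caller.
// fetchImpl is injectable so the logic can be exercised without a network.
async function fetchWithAuth(url, getToken, fetchImpl = fetch) {
  let token = await getToken(false)
  let res = await fetchImpl(url, { headers: { Authorization: `Bearer ${token}` } })
  if (res.status === 401 || res.status === 403) {
    // The token has probably expired: ask for a fresh one and retry once.
    token = await getToken(true)
    res = await fetchImpl(url, { headers: { Authorization: `Bearer ${token}` } })
  }
  if (!res.ok) throw new Error(`Endpoint returned ${res.status}`)
  return res.json()
}
```

Retrying only once keeps a permanently rejected request from looping; a second 401 or 403 surfaces as an error instead.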
The endpoint is signed or time-limited.
- Issue: the URL carries a signature, token, or expires query parameter that the page's JavaScript computes per request, so a copied URL works once and then returns 403.
- Fix: keep Puppeteer in the loop for these. Intercept the response on each run rather than re-fetching a stale URL, since the signature cannot be reproduced outside the page.
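A quick check on the discovered URL can flag this case before you commit to the direct fetch. The sketch below only looks for the three parameter names mentioned above; the names SIGNING_PARAMS and findSigningParams are hypothetical, the example.com URL is made up, and a real site may use different parameter names.

```javascript
// Parameter names that suggest a signed or expiring URL (taken from the text above).
const SIGNING_PARAMS = ['signature', 'token', 'expires']

// Returns the signing-related query parameter names present in the URL.
function findSigningParams(endpointUrl) {
  const params = new URL(endpointUrl).searchParams
  return SIGNING_PARAMS.filter((name) => params.has(name))
}

console.log(findSigningParams('https://example.com/api/items?page=1&expires=1700000000&signature=abc'))
// → [ 'signature', 'expires' ]
console.log(findSigningParams('https://quotes.toscrape.com/api/quotes?page=1'))
// → []
```

An empty result does not prove the endpoint is unsigned, since the signature may travel in a header instead; a non-empty one is a hint to keep Puppeteer in the loop.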
response.json() throws on non-JSON bodies.
- Issue: calling .json() on a redirect, an empty 204, or an HTML error page throws and stops the whole listener.
- Fix: wrap the parse in try/catch and return on failure, as the script does, so one bad response does not abort discovery.
The data spans multiple paginated calls.
- Issue: the page loads page 1 from ?page=1, then fires ?page=2 as you scroll, so a single fetch returns only the first slice.
- Fix: read the page parameter from the endpoint and loop, incrementing it until the response returns an empty array or a has_next: false flag.
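The pagination loop might be sketched as follows. It assumes each response has the shape { quotes: [...], has_next: boolean } and that the page number travels in a page query parameter; the function name fetchAllQuotes and the maxPages safety cap are hypothetical. fetchImpl defaults to the global fetch in Node 18 and later, so pass node-fetch's fetch explicitly on older versions.

```javascript
// Sketch: walk a paginated endpoint until it reports no further pages.
// Assumes responses shaped like { quotes: [...], has_next: boolean }.
// fetchImpl is injectable so the loop can be exercised without a network.
async function fetchAllQuotes(baseUrl, fetchImpl = fetch, maxPages = 100) {
  const all = []
  for (let page = 1; page <= maxPages; page++) {
    const url = new URL(baseUrl)
    url.searchParams.set('page', String(page))
    const res = await fetchImpl(url.toString(), { headers: { Accept: 'application/json' } })
    if (!res.ok) throw new Error(`Page ${page} returned ${res.status}`)
    const data = await res.json()
    const items = data.quotes || []
    all.push(...items)
    // Stop on an empty slice or an explicit "no more pages" flag.
    if (items.length === 0 || data.has_next === false) break
  }
  return all
}
```

A call such as fetchAllQuotes('https://quotes.toscrape.com/api/quotes') would collect every page into one array. The maxPages cap is there so that an endpoint which never signals the end cannot loop forever.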
networkidle2 fires before a late request.
- Issue: a request triggered by scrolling or a delayed timer arrives after networkidle2 considers the page settled, so your listener never logs it.
- Fix: scroll the page with page.evaluate(() => window.scrollTo(0, document.body.scrollHeight)) or wait with page.waitForResponse((res) => res.url().includes('/api/quotes')) to hold the browser open until the call fires. Puppeteer's waitForResponse accepts a URL string or a predicate function, not a regular expression.
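When the late call is triggered by scrolling, it helps to start waiting before you scroll, so the response cannot slip past between the two steps. The sketch below assumes page is a Puppeteer Page that has already navigated; the function name captureAfterScroll and the '/api/quotes' URL fragment are hypothetical choices for illustration.

```javascript
// Sketch: register the wait and trigger the scroll together, then parse the body.
// `page` is a Puppeteer Page; urlFragment is an assumed piece of the endpoint URL.
async function captureAfterScroll(page, urlFragment = '/api/quotes') {
  const [response] = await Promise.all([
    // waitForResponse takes a URL string or a predicate function.
    page.waitForResponse((res) => res.url().includes(urlFragment)),
    // Scroll to the bottom so the page fires its next background request.
    page.evaluate(() => window.scrollTo(0, document.body.scrollHeight))
  ])
  return response.json()
}
```

Called after page.goto() and before browser.close(), this returns the parsed payload of the first matching response that arrives after the scroll.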
The response is gzip or chunked and arrives empty.
- Issue: a compressed or chunked response can look as if it has no body. Puppeteer's response.text() and response.json() both wait for the full decoded body (json() parses the result of text()), so an empty string usually means the response really carried no content, such as a 204.
- Fix: prefer response.json(), which parses the full decoded body and throws on an empty one, and check response.ok() before trusting the payload.
Use this when
The page renders its content from a background JSON call you can see in the Network tab, and that endpoint returns the same data to a direct request. This covers most infinite-scroll lists, comment threads, price widgets, and search-results pages built on a public-facing internal API.
Skip this when
The endpoint signs each request with a per-call token (keep Puppeteer in the loop and intercept the response); the data is rendered server-side into the initial HTML (parse it with cheerio instead); the content comes over a WebSocket rather than XHR (listen for the WebSocket frames); or the page draws to a canvas with no JSON behind it (you need screenshot plus OCR, not an endpoint).