Simplescraper

How to scrape data loaded from an XHR/fetch request

Updated 2026-06-25 · 6 min read

If you've tried to scrape a page where the data appears a beat after load, you have probably found that the HTML you fetch is an empty shell: the product list, the comments, the price table are all missing because the browser pulls them in afterward with an XHR or fetch call. Driving a full headless browser to wait for that data works, but it spends a second or two of Chromium startup and rendering on every single page, which adds up fast across a few thousand requests.

The page is already telling you where the data lives. The solution is to capture the network traffic once with Puppeteer to learn the JSON endpoint the page calls, then request that endpoint directly with a plain HTTP client on every later run, so you get the raw JSON without a browser at all. The discovery script is about 30 lines and the direct fetch is about 10.

Key terms

  • XHR / fetch request. A background HTTP call the page's JavaScript makes after the document loads, usually to a JSON endpoint, to fill in content the initial HTML does not contain.
  • Response interception. Listening to Puppeteer's response event so your code sees every network response the page receives, including the JSON ones.
  • CDP. The Chrome DevTools Protocol, the same channel the Network tab uses; Puppeteer's response event is a high-level wrapper over it.
  • node-fetch. A library that brings the browser fetch API to Node, used here to call the endpoint once you know its URL.

Here is what the two scripts do:

  • Open the target page in Puppeteer and listen to the response event so it sees every background request the page makes.
  • Filter those responses down to the ones that return JSON, and log each endpoint URL with its payload so you can read off the one carrying your data.
  • Take the endpoint URL you found and re-request it with node-fetch, sending the headers the page sent, and parse the JSON directly.
  • Skip Puppeteer entirely on the second script once you know the URL, so later runs are an HTTP round trip instead of a browser session.

The complete script

js
// find-xhr-endpoint.mjs
import puppeteer from 'puppeteer'

// The page a person visits; its JavaScript fetches the JSON API in the
// background (on this demo site: https://quotes.toscrape.com/api/quotes?page=1).
const humanPage = 'https://quotes.toscrape.com/scroll'

const browser = await puppeteer.launch({ headless: true })
const page = await browser.newPage()

// Collect every JSON response the page receives while it loads.
const jsonResponses = []
// Body reads are asynchronous; track them so none is cut off by browser.close().
const pendingReads = []

const recordJson = async (response) => {
  const headers = response.headers()
  const contentType = headers['content-type'] || ''

  // Keep only responses that actually carry JSON; skip HTML, images, fonts.
  if (!contentType.includes('application/json')) return

  // response.json() throws on empty or non-JSON bodies, so guard it.
  let body
  try {
    body = await response.json()
  } catch {
    return
  }

  jsonResponses.push({
    url: response.url(),
    status: response.status(),
    method: response.request().method(),
    sample: body
  })
}

page.on('response', (response) => pendingReads.push(recordJson(response)))

await page.goto(humanPage, { waitUntil: 'networkidle2' })
// Let in-flight body reads finish before tearing the browser down.
await Promise.all(pendingReads)
await browser.close()

// Read off the endpoint carrying your data, then call it directly (below).
for (const r of jsonResponses) {
  console.log(`${r.status} ${r.method} ${r.url}`)
  console.log(JSON.stringify(r.sample).slice(0, 200))
}

bash
npm install puppeteer
node find-xhr-endpoint.mjs

Once the discovery script prints the endpoint, drop the browser. This second script is what you run on a schedule:

js
// fetch-xhr-endpoint.mjs
import fetch from 'node-fetch'

// The endpoint the page called, copied from find-xhr-endpoint.mjs output.
const endpoint = 'https://quotes.toscrape.com/api/quotes?page=1'

const res = await fetch(endpoint, {
  headers: {
    // Send what the browser sent: a normal UA and an XHR marker.
    'User-Agent': 'Mozilla/5.0',
    'Accept': 'application/json',
    'X-Requested-With': 'XMLHttpRequest'
  }
})

if (!res.ok) throw new Error(`Endpoint returned ${res.status}`)

const data = await res.json()
console.log(data.quotes.map((q) => q.text))

bash
npm install node-fetch
node fetch-xhr-endpoint.mjs

What each step does

Load the page humans see, not the API. The discovery script opens /scroll, the page a person visits. That page's own JavaScript fires the background request to /api/quotes. You let the page reveal its endpoint instead of guessing the URL.

Listen to the response event before navigating. Register page.on('response', ...) first, then call page.goto(). Responses that arrive during load only reach handlers that were already attached, so a listener added after navigation misses the early requests.

Filter to JSON by content type. Every page load fires dozens of responses for images, fonts, and tracking pixels. Checking content-type for application/json narrows the log to the few responses that could hold your data. The slice(0, 200) keeps each printout short enough to scan.
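
The includes('application/json') check in the script misses endpoints that label their JSON with a suffix type such as application/vnd.api+json. A slightly broader filter might look like the sketch below; the helper name isJsonContentType is hypothetical, and endpoints that serve JSON as text/plain would still slip past it.

```javascript
// Hypothetical helper: decide whether a content-type header denotes JSON.
function isJsonContentType(contentType) {
  if (!contentType) return false
  // Drop parameters such as "; charset=utf-8" and normalise case.
  const mediaType = contentType.split(';')[0].trim().toLowerCase()
  // Accept application/json plus "+json" suffix types
  // (application/ld+json, application/vnd.api+json, ...).
  return mediaType === 'application/json' || mediaType.endsWith('+json')
}

console.log(isJsonContentType('application/json; charset=utf-8')) // true
console.log(isJsonContentType('application/vnd.api+json'))        // true
console.log(isJsonContentType('text/html'))                       // false
```

In the discovery script this would replace the includes() test: if (!isJsonContentType(contentType)) return.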

Re-request with the headers the page sent. Some endpoints return a different response, or a 403, when the request lacks the X-Requested-With: XMLHttpRequest header or a normal User-Agent. The discovery script as written logs only the status, method, and URL, so also log response.request().headers() (or read the request headers in the Network tab) and copy the ones the browser used into the direct fetch so the server treats your call the same way.
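
Not every captured header is worth replaying: connection-level ones are better left to the HTTP client. The sketch below trims a captured header object down to a replayable set. The helper name pickReplayHeaders and the skip list are assumptions for illustration, and the sample values are made up.

```javascript
// Hypothetical helper: reduce captured request headers to the ones worth replaying.
// The skip list is an assumption; adjust it for the endpoint you are calling.
const SKIP = new Set(['host', 'connection', 'content-length', 'accept-encoding'])

function pickReplayHeaders(captured) {
  const out = {}
  for (const [name, value] of Object.entries(captured)) {
    const key = name.toLowerCase()
    // Drop HTTP/2 pseudo-headers (":authority", ":path") if any are present,
    // along with connection-level headers the client sets itself.
    if (key.startsWith(':') || SKIP.has(key)) continue
    out[key] = value
  }
  return out
}

// Example input shaped like the object response.request().headers() returns.
const captured = {
  'user-agent': 'Mozilla/5.0 (X11; Linux x86_64)',
  'accept': 'application/json',
  'x-requested-with': 'XMLHttpRequest',
  'accept-encoding': 'gzip, deflate, br',
  ':authority': 'quotes.toscrape.com'
}
console.log(pickReplayHeaders(captured))
```

The result can be passed straight to the headers option of the direct fetch call.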

Run the second script alone. Once you have the endpoint, fetch-xhr-endpoint.mjs never launches Chromium. An HTTP round trip is on the order of 100ms, against a second or more of startup and rendering for a browser session, and that difference compounds across a batch.
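
To put rough numbers on that, here is a back-of-envelope estimate for a sequential batch. The page count and both per-page costs are illustrative assumptions, not measurements, and estimateMinutes is a hypothetical helper.

```javascript
// Rough sequential-batch estimate; the per-page costs passed in are assumptions.
function estimateMinutes(pages, msPerPage) {
  return (pages * msPerPage) / 60000
}

const pages = 3000
console.log(estimateMinutes(pages, 1500)) // browser session per page: 75 minutes
console.log(estimateMinutes(pages, 100))  // direct HTTP call per page: 5 minutes
```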

Gotchas

  • The endpoint needs an auth token or cookie.

    • Issue: the direct fetch returns 401 or 403 because the page set a session cookie or attached a Bearer token from a login step, and your bare request carries neither.
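    • Fix: read the token from the discovery run. Log response.request().headers() alongside the URL and copy the authorization value into your fetch headers. The cookie header is often absent from that object, so if the endpoint relies on a session cookie, read it with page.cookies() and build the Cookie header yourself. Refresh either credential when it expires.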
  • The endpoint is signed or time-limited.

    • Issue: the URL carries a signature, token, or expires query parameter that the page's JavaScript computes per request, so a copied URL works once and then returns 403.
    • Fix: keep Puppeteer in the loop for these. Intercept the response on each run rather than re-fetching a stale URL, since the signature cannot be reproduced outside the page.
  • response.json() throws on non-JSON bodies.

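    • Issue: calling .json() on a redirect, an empty 204, or an HTML error page throws. Inside an async listener that becomes an unhandled promise rejection, which current Node versions treat as fatal by default, so discovery ends early.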
    • Fix: wrap the parse in try/catch and return on failure, as the script does, so one bad response does not abort discovery.
  • The data spans multiple paginated calls.

    • Issue: the page loads page 1 from ?page=1, then fires ?page=2 as you scroll, so a single fetch returns only the first slice.
    • Fix: read the page parameter from the endpoint and loop, incrementing it until the response returns an empty array or a has_next: false flag.
  • networkidle2 fires before a late request.

    • Issue: a request triggered by scrolling or a delayed timer arrives after networkidle2 considers the page settled, so your listener never logs it.
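    • Fix: scroll the page with page.evaluate(() => window.scrollTo(0, document.body.scrollHeight)), or hold the browser open until the call fires with page.waitForResponse((res) => res.url().includes('/api/quotes')). Puppeteer's waitForResponse takes a URL string or a predicate function, not a regular expression.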
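  • The response body comes back empty or unreadable.

    • Issue: compression and chunked transfer are rarely the cause. Chromium decodes the body before Puppeteer hands it over, and response.json() simply parses the output of response.text(). A body that is empty or throws on read usually had nothing to give (a redirect, a 204, a CORS preflight) or was read after the page navigated away or the browser closed.
    • Fix: read the body inside the response handler while the page is still open, keep the try/catch around the parse, and check response.ok() before trusting the payload.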
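
The pagination gotcha above can be sketched as a loop over the page parameter. This is a minimal version under stated assumptions: the response shape (a quotes array and a has_next flag) matches the demo endpoint used earlier, the helper name fetchAllQuotes and the maxPages safety cap are hypothetical, and the fetch implementation is injectable so the loop can be exercised without a network. The default uses the global fetch available in Node 18 and later; pass node-fetch's fetch instead on older versions.

```javascript
// Hypothetical helper: walk ?page=N until the API reports no further pages.
async function fetchAllQuotes(baseUrl, fetchImpl = fetch, maxPages = 100) {
  const all = []
  for (let page = 1; page <= maxPages; page++) {
    // Set or overwrite the page query parameter on each pass.
    const url = new URL(baseUrl)
    url.searchParams.set('page', String(page))

    const res = await fetchImpl(url.href, { headers: { Accept: 'application/json' } })
    if (!res.ok) throw new Error(`Endpoint returned ${res.status} for ${url.href}`)

    const data = await res.json()
    const quotes = data.quotes ?? []
    all.push(...quotes)

    // Stop on an empty page or an explicit has_next: false flag.
    if (quotes.length === 0 || data.has_next === false) break
  }
  return all
}
```

Called as await fetchAllQuotes('https://quotes.toscrape.com/api/quotes'), it returns the quotes from every page in order. The cap keeps a misbehaving endpoint from looping forever.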

Use this when

The page renders its content from a background JSON call you can see in the Network tab, and that endpoint returns the same data to a direct request. This covers most infinite-scroll lists, comment threads, price widgets, and search-results pages built on a public-facing internal API.

Skip this when

The endpoint signs each request with a per-call token (keep Puppeteer in the loop and intercept the response); the data is rendered server-side into the initial HTML (parse it with cheerio instead); the content comes over a WebSocket rather than XHR (listen for the WebSocket frames); or the page draws to a canvas with no JSON behind it (you need screenshot plus OCR, not an endpoint).
