How to scrape data loaded from an XHR/fetch request

Updated 2026-06-25 · 6 min read

If you've tried to scrape a page where the data appears a beat after load, you have probably found that the HTML you fetch is an empty shell: the product list, the comments, the price table are all missing because the browser pulls them in afterward with an XHR or fetch call. Driving a full headless browser to wait for that data works, but it spends a second or two of Chromium startup and rendering on every single page, which adds up fast across a few thousand requests.

The page is already telling you where the data lives: in the background HTTP call (the XHR or fetch request) its JavaScript makes after the document loads. The solution is to capture the network traffic once with Puppeteer to learn the JSON endpoint the page calls, then request that endpoint directly with a plain HTTP client on every later run. We'll build two scripts. The discovery script opens the page humans see, listens to every background response so the page reveals its own endpoint instead of us guessing the URL, and narrows those responses to the JSON ones so the data endpoint is easy to read off. The second script re-requests that endpoint with node-fetch and the headers the browser sent, skipping the browser entirely on every scheduled run. The discovery script is about 30 lines and the direct fetch is about 10.

The complete script

// find-xhr-endpoint.mjs
import puppeteer from 'puppeteer'

// the page a person visits; its JavaScript fetches /api/quotes in the background.
const humanPage = 'https://quotes.toscrape.com/scroll'

const browser = await puppeteer.launch({ headless: true })
const page = await browser.newPage()

// collect every JSON response the page receives while it loads.
const jsonResponses = []

page.on('response', async (response) => {
  const headers = response.headers()
  const contentType = headers['content-type'] || ''

  // keep only responses that actually carry JSON; skip HTML, images, fonts.
  if (!contentType.includes('application/json')) return

  // response.json() throws on empty or non-JSON bodies, so guard it.
  let body
  try {
    body = await response.json()
  } catch {
    return
  }

  jsonResponses.push({
    url: response.url(),
    status: response.status(),
    method: response.request().method(),
    sample: body
  })
})

await page.goto(humanPage, { waitUntil: 'networkidle2' })
await browser.close()

// read off the endpoint carrying your data, then call it directly (below).
for (const r of jsonResponses) {
  console.log(`${r.status} ${r.method} ${r.url}`)
  console.log(JSON.stringify(r.sample).slice(0, 200))
}

Install Puppeteer and run the discovery script:

npm install puppeteer
node find-xhr-endpoint.mjs

Once the discovery script prints the endpoint, drop the browser. This second script is what you run on a schedule:

// fetch-xhr-endpoint.mjs
import fetch from 'node-fetch'

// the endpoint the page called, copied from find-xhr-endpoint.mjs output.
const endpoint = 'https://quotes.toscrape.com/api/quotes?page=1'

const res = await fetch(endpoint, {
  headers: {
    // send what the browser sent: a normal UA and an XHR marker.
    'User-Agent': 'Mozilla/5.0',
    'Accept': 'application/json',
    'X-Requested-With': 'XMLHttpRequest'
  }
})

if (!res.ok) throw new Error(`Endpoint returned ${res.status}`)

const data = await res.json()
console.log(data.quotes.map((q) => q.text))

Install node-fetch and run the direct fetch:

npm install node-fetch
node fetch-xhr-endpoint.mjs

How it works

Load the page humans see, not the API. The discovery script opens /scroll, the page a person visits. That page's own JavaScript fires the background request to /api/quotes. You let the page reveal its endpoint instead of guessing the URL.

Listen to the response event before navigating. Register page.on('response', ...) first, then call page.goto(). Responses that arrive during load only reach handlers that were already attached, so a listener added after navigation misses the early requests. A request triggered by scrolling or a delayed timer can also arrive after networkidle2 considers the page settled. If your data call comes in late, trigger it with page.evaluate(() => window.scrollTo(0, document.body.scrollHeight)) and hold the browser open with page.waitForResponse((res) => res.url().includes('/api/quotes')). Puppeteer's waitForResponse takes an exact URL string or a predicate function, not a regular expression, so a partial match needs the predicate form. Start the wait before you trigger the scroll so a fast response cannot slip past it.
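For the late-arriving case, a minimal sketch of the scroll-and-wait step might look like the following. The helper name captureAfterScroll and its parameters are hypothetical, and it assumes the data request is triggered by scrolling to the bottom of the page.

```javascript
// Hypothetical helper: scroll to the bottom and capture the response that scroll triggers.
// `page` is a Puppeteer Page; `urlFragment` is a substring of the endpoint URL.
async function captureAfterScroll(page, urlFragment, timeoutMs = 10000) {
  const [response] = await Promise.all([
    // the wait is registered first, so a fast response is not missed.
    page.waitForResponse((res) => res.url().includes(urlFragment), { timeout: timeoutMs }),
    // scroll inside the page to trigger the lazy-loaded request.
    page.evaluate(() => window.scrollTo(0, document.body.scrollHeight))
  ])
  return { url: response.url(), body: await response.json() }
}
```

In the discovery script you would call it after page.goto(), for example const { url, body } = await captureAfterScroll(page, '/api/quotes'). If no matching response arrives within the timeout, the wait rejects, which is a useful signal that scrolling is not what triggers the request.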

Filter to JSON by content type. Every page load fires dozens of responses for images, fonts, and tracking pixels. Checking content-type for application/json narrows the log to the few responses that could hold your data. The slice(0, 200) keeps each printout short enough to scan. Using .json() rather than .text() is a convenience, not a correctness fix: in Puppeteer, response.json() reads the same decoded body that response.text() returns and passes it through JSON.parse, so both wait for the full body and both reject when no body is available. The try/catch around the parse covers the redirects, empty 204s, and HTML error pages mislabeled as JSON that would otherwise reject inside the async handler. On current Node versions, an unhandled rejection like that can terminate the whole process.
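The substring check in the listener is enough for the example site. If your target labels its responses with a "+json" media type such as application/problem+json, a slightly broader test may help. The sketch below is an assumption about what such a check could look like; isJsonContentType is a hypothetical name, not part of Puppeteer.

```javascript
// Hypothetical helper: returns true when a Content-Type value names a JSON media type.
// Handles parameters ("application/json; charset=utf-8") and "+json" suffix types.
function isJsonContentType(contentType) {
  if (!contentType) return false
  // drop parameters such as "; charset=utf-8" and normalize case.
  const mediaType = contentType.split(';')[0].trim().toLowerCase()
  return mediaType === 'application/json' || mediaType.endsWith('+json')
}
```

Inside the listener, the filter line would become if (!isJsonContentType(headers['content-type'])) return.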

Re-request with the headers the page sent. Some endpoints return a different response, or a 403, when the request lacks the X-Requested-With: XMLHttpRequest header or a normal User-Agent. The discovery script shows you the request method and URL; copy the headers the browser used into the direct fetch so the server treats your call the same way. If the direct fetch still comes back 401 or 403, the page is attaching a session cookie or a Bearer token from a login step, so log response.request().headers() in the discovery run, copy the authorization or cookie value across, and refresh it when it expires. When the URL carries a signature, token, or expires parameter the page computes per request, a copied URL works once and then 403s, so keep Puppeteer in the loop and intercept the response on each run instead. And when the data spans paginated calls (?page=1, then ?page=2 as you scroll), read the page parameter off the endpoint and loop until the response returns an empty array or a has_next: false flag.
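As a rough sketch of the header-copying and pagination steps described above, the two helpers below pick a short list of headers from a captured request and then walk the page parameter until the API signals the end. Both function names, the REPLAY_HEADERS list, and the maxPages safety cap are hypothetical. The loop assumes the response shape { quotes: [...], has_next: boolean } that the example endpoint returns, and it uses the global fetch available in Node 18+ unless you pass another implementation such as node-fetch.

```javascript
// Hypothetical list of request headers worth replaying; adjust for your target.
const REPLAY_HEADERS = ['user-agent', 'accept', 'x-requested-with', 'referer', 'authorization', 'cookie']

// `captured` is the object from response.request().headers() (names are lowercase).
function pickReplayHeaders(captured) {
  const picked = {}
  for (const name of REPLAY_HEADERS) {
    if (captured[name]) picked[name] = captured[name]
  }
  return picked
}

// Request ?page=1, ?page=2, ... until the API returns no items or has_next: false.
// fetchImpl is injectable so the loop can be tested without the network.
async function fetchAllPages(baseUrl, { headers = {}, fetchImpl = fetch, maxPages = 100 } = {}) {
  const items = []
  for (let pageNum = 1; pageNum <= maxPages; pageNum++) {
    const url = new URL(baseUrl)
    url.searchParams.set('page', String(pageNum))
    const res = await fetchImpl(url.href, { headers })
    if (!res.ok) throw new Error(`Page ${pageNum} returned ${res.status}`)
    const data = await res.json()
    const quotes = data.quotes ?? []
    if (quotes.length === 0) break // empty page: nothing left to read
    items.push(...quotes)
    if (data.has_next === false) break // API says this was the last page
  }
  return items
}
```

A call might look like await fetchAllPages('https://quotes.toscrape.com/api/quotes', { headers: pickReplayHeaders(capturedHeaders) }), where capturedHeaders is whatever you logged during the discovery run. The maxPages cap is there so a misbehaving endpoint cannot keep the loop running indefinitely.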

Run the second script alone. Once you have the endpoint, fetch-xhr-endpoint.mjs never launches Chromium. An HTTP round trip is on the order of 100ms, against a browser session that costs a second or more of startup and rendering. Across 3,000 sequential pages, that is roughly 5 minutes of requests instead of 50 minutes or more of browser time.

Use this when

The page renders its content from a background JSON call you can see in the Network tab, and that endpoint returns the same data to a direct request. This covers most infinite-scroll lists, comment threads, price widgets, and search-results pages built on a public-facing internal API.

Skip this when

The endpoint signs each request with a per-call token (keep Puppeteer in the loop and intercept the response); the data is rendered server-side into the initial HTML (parse it with cheerio instead); the content comes over a WebSocket rather than XHR (listen for the WebSocket frames); or the page draws to a canvas with no JSON behind it (you need screenshot plus OCR, not an endpoint).
