Simplescraper

How to intercept and read network requests in Puppeteer

Updated 2026-06-25 · 6 min read

If you're scraping a page where the data shows up in the browser but never in the HTML you get back, you have probably already opened the Network tab and watched the real payload arrive as a separate XHR or fetch call returning JSON. The page renders from that response, not from the markup, so parsing the DOM gets you a loading spinner and selectors that resolve to nothing.

The fix is to listen to the browser's own traffic instead of the rendered output. Puppeteer surfaces every request and response through page.on('request') and page.on('response'). For response bodies those events cannot read, a Chrome DevTools Protocol (CDP) Network session fetches them from the browser directly. That gives you the same JSON the page consumes, in about 40 lines of Node.js with one library.

Key terms

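  • Request listener. Puppeteer's page.on('request') event, which fires for each outgoing request and exposes its URL, method, headers, and post body before the response returns. On its own it only observes; pausing, rewriting, or blocking a request (interception proper) also requires page.setRequestInterception(true).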
  • Response listener. Puppeteer's page.on('response') event, which fires when a response arrives and lets you call response.json() or response.text() on supported response types.
  • CDP session. A Chrome DevTools Protocol channel opened with page.createCDPSession(), the same protocol Puppeteer runs on internally, used here to read response bodies the high-level events do not buffer.
  • XHR / fetch. The two browser APIs a single-page app uses to load data after the initial HTML. These are the resource types you filter for to find the payload feeding the page.

Here is what the script does:

  • Launch headless Chrome with Puppeteer and open a fresh page.
  • Attach a page.on('request') listener that logs the method, URL, and resource type of every outgoing request.
  • Attach a page.on('response') listener that reads the JSON body of XHR and fetch responses as they arrive.
  • Optionally, open a CDP Network session for the case where you need a response body the high-level listener cannot read. This step is not part of the script below; the Gotchas section lists the calls.
  • Navigate to the target and let the listeners capture traffic until the network settles.

The complete script

```js
// intercept-network.mjs
import puppeteer from 'puppeteer'

// Swap in a page that loads its data over XHR or fetch. A URL that returns JSON
// as the main document (like this one) has resource type 'document', so the
// xhr/fetch filter below skips it.
const targetUrl = 'https://httpbin.org/anything'

const browser = await puppeteer.launch({ headless: true })
const page = await browser.newPage()

// Collected response payloads: one { url, status, body } entry per JSON response.
const captured = []

// Fires for every outgoing request. Read URL, method, type, and post body here.
page.on('request', request => {
  console.log(`[request] ${request.method()} ${request.resourceType()} ${request.url()}`)
})

// Fires when a response arrives. Read the body for the request types you care about.
page.on('response', async response => {
  const request = response.request()
  const type = request.resourceType()

  // The data feeding a single-page app almost always arrives as xhr or fetch.
  if (type === 'xhr' || type === 'fetch') {
    const contentType = response.headers()['content-type'] || ''

    if (contentType.includes('application/json')) {
      try {
        const body = await response.json()
        captured.push({ url: response.url(), status: response.status(), body })
        console.log(`[json] ${response.status()} ${response.url()}`)
      } catch {
        // A redirect, an unavailable body, or invalid JSON throws here. Skip it.
      }
    }
  }
})

await page.goto(targetUrl, { waitUntil: 'networkidle0' })

console.log(`Captured ${captured.length} JSON responses`)
console.log(JSON.stringify(captured[0]?.body, null, 2))

await browser.close()
```

Install Puppeteer and run the script:

```bash
npm install puppeteer
node intercept-network.mjs
```

What each step does

Launch with headless: true. The default headless mode runs the same Chromium build that Puppeteer drives in headed mode, so the requests the page fires are the requests a browser fires. The network listeners attach to the page, not the launch, so the order here is launch, new page, then wire the listeners before you navigate.

Listen on page.on('request'). This event fires once per outgoing request, before the response comes back. The request object carries method(), url(), resourceType(), headers(), and postData(), which is enough to reconstruct the call as a standalone fetch later. This listener does not block the request, because the script does not call page.setRequestInterception(true); it observes traffic rather than rewriting it.
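
As a rough illustration of reconstructing a captured call as a standalone fetch, the sketch below maps a request object to fetch arguments. The helper name toFetchArgs is hypothetical (it is not part of Puppeteer), and dropping the content-length and host headers is an assumption about what a replaying client should compute for itself.

```js
// Sketch: turn a captured Puppeteer request into arguments for fetch().
// `toFetchArgs` is a hypothetical helper, not a Puppeteer API.
function toFetchArgs(request) {
  const method = request.method()
  const headers = { ...request.headers() }

  // Assumption: let the replaying HTTP client compute these itself.
  delete headers['content-length']
  delete headers['host']

  const init = { method, headers }

  // postData() returns undefined when the request carries no body.
  const body = request.postData()
  if (body !== undefined && method !== 'GET' && method !== 'HEAD') {
    init.body = body
  }

  return [request.url(), init]
}

// Possible usage with the listener from the script:
//   const calls = []
//   page.on('request', request => calls.push(toFetchArgs(request)))
//   ...after the run, outside the browser (Node 18+ ships a global fetch):
//   const [url, init] = calls[0]
//   const res = await fetch(url, init)
```

A replayed call may still fail if the endpoint depends on cookies or short-lived tokens that the browser session held, so treat the output as a starting point rather than a guaranteed working request.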

Listen on page.on('response') and read the body. When a response arrives, response.json() parses a JSON body and response.text() returns the raw string. The filter on resourceType() narrows the flood of image, stylesheet, and font responses down to the xhr and fetch calls that carry a single-page app's data. The try/catch matters because response.json() throws on a redirect or a response whose body has already been consumed.

Wait for networkidle0. Passing waitUntil: 'networkidle0' to page.goto resolves once there have been no network connections for at least 500 ms, which gives the deferred XHR and fetch calls time to fire and land in captured. Without it, goto falls back to its default of waiting for the load event, and the script can close the browser before late data requests have finished.
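
An alternative to the idle wait, when you already know part of the data endpoint's URL, is to wait for that one response. The sketch below is a minimal version of that idea; the function name gotoAndCapture, the urlPart parameter, and the 30 second default timeout are illustrative assumptions, while page.waitForResponse and page.goto are the Puppeteer calls named in this article.

```js
// Sketch: navigate and wait for one specific data response
// instead of waiting for the whole network to go idle.
// `gotoAndCapture` and `urlPart` are hypothetical names.
async function gotoAndCapture(page, targetUrl, urlPart, timeout = 30000) {
  // Start waiting before navigating so an early response is not missed.
  const [response] = await Promise.all([
    page.waitForResponse(
      res => res.url().includes(urlPart) && res.ok(),
      { timeout }
    ),
    page.goto(targetUrl, { waitUntil: 'domcontentloaded' }),
  ])

  // Parse the matched response body as JSON.
  return response.json()
}

// Possible usage:
//   const data = await gotoAndCapture(page, 'https://example.com/app', '/api/data')
```

Both promises are created before either is awaited, so the response wait is already in place when navigation begins.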

Gotchas

  • page.on('response') cannot read every response body.

    • Issue: Calling response.text() on a request that was served from the disk cache, redirected, or already consumed throws an error such as "Could not load body for this request" or "Response body is unavailable for redirect responses", and the await rejects.
    • Fix: Wrap the body read in try/catch and skip the failures, or open a CDP session and call Network.getResponseBody against the request id, which reads from the browser's response buffer: const cdp = await page.createCDPSession(); await cdp.send('Network.enable').
  • setRequestInterception(true) stalls the page if you forget to continue.

    • Issue: Enabling interception to modify or block requests means every request now pauses until you act on it, so a page.on('request') handler that does not call request.continue(), request.abort(), or request.respond() hangs page.goto until it times out.
    • Fix: Only enable interception when you need to rewrite or block requests, and when you do, make every code path in the handler end in exactly one of continue, abort, or respond.
  • The data request fires before your listener is attached.

    • Issue: Wiring page.on('response') after page.goto returns means the early requests fired during navigation never reach your handler, so captured comes back short or empty.
    • Fix: Attach every page.on listener before the await page.goto(...) line, so the handlers are live when the first request leaves the browser.
  • networkidle0 never resolves on a page that polls.

    • Issue: A page with an open WebSocket, an analytics heartbeat, or a polling timer keeps at least one connection busy, so the 500ms idle window in networkidle0 is never reached and goto hangs until its timeout.
    • Fix: Switch to waitUntil: 'networkidle2', which allows up to two open connections, or wait for the specific response with page.waitForResponse(res => res.url().includes('/api/data')) instead of waiting for the whole network to settle.
  • Image, font, and stylesheet responses drown out the payload.

    • Issue: A media-heavy page fires dozens of image, font, and stylesheet responses, and reading or logging all of them buries the one xhr response that carries the data you came for.
    • Fix: Filter on request.resourceType() for xhr and fetch as the script does, or match the URL with response.url().includes('/api/') so the handler only acts on the data endpoint.
  • response.json() on a streamed or chunked response can return partial data.

    • Issue: A response sent with Transfer-Encoding: chunked or one that streams over a long-lived connection may resolve response.json() before the full body is buffered, returning a truncated object.
    • Fix: Use page.waitForResponse to grab the finished response object and read it once it has fully arrived, rather than reading inside the high-volume page.on('response') stream.
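
The CDP route from the first gotcha can be sketched roughly as follows. This is a minimal illustration, not a drop-in replacement for the response listener: captureBodiesViaCdp, matchUrl, and onBody are hypothetical names, while Network.enable, Network.responseReceived, Network.loadingFinished, and Network.getResponseBody are Chrome DevTools Protocol names.

```js
// Sketch: read response bodies through a CDP Network session.
// `captureBodiesViaCdp`, `matchUrl`, and `onBody` are hypothetical names.
async function captureBodiesViaCdp(page, matchUrl, onBody) {
  const cdp = await page.createCDPSession()
  await cdp.send('Network.enable')

  // requestId -> { url, status }, recorded when the response headers arrive.
  const pending = new Map()

  cdp.on('Network.responseReceived', event => {
    if (matchUrl(event.response.url)) {
      pending.set(event.requestId, {
        url: event.response.url,
        status: event.response.status,
      })
    }
  })

  // Ask for the body only after the browser reports loading has finished.
  cdp.on('Network.loadingFinished', async event => {
    const meta = pending.get(event.requestId)
    if (!meta) return
    pending.delete(event.requestId)

    try {
      const { body, base64Encoded } = await cdp.send('Network.getResponseBody', {
        requestId: event.requestId,
      })
      const text = base64Encoded ? Buffer.from(body, 'base64').toString('utf8') : body
      onBody({ ...meta, text })
    } catch {
      // The browser no longer holds this body. Skip it.
    }
  })

  return cdp
}

// Possible usage, before page.goto:
//   await captureBodiesViaCdp(page, url => url.includes('/api/'), entry => {
//     console.log(entry.status, entry.url, entry.text.length)
//   })
```

As with the page.on listeners, attach this before navigating so the session is live when the first request leaves the browser.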

Use this when

You want the raw JSON a single-page app loads behind its UI, you are reverse-engineering a site's internal API to call it directly, or you are auditing which third-party endpoints a page contacts and what it sends them.

Skip this when

The data is already in the served HTML (parse it with cheerio instead); you need to block or rewrite requests rather than read them (enable page.setRequestInterception(true) and act in the handler); the endpoint is reachable without a browser (call it directly with fetch and the headers you captured); or you need a full session recording for replay (capture a HAR with a CDP listener and a HAR writer).
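
For the block-or-rewrite case mentioned above, a handler might look roughly like the sketch below. The function name blockResourceTypes and the default list of blocked types are illustrative assumptions; the point it tries to show is that every path through the handler ends in exactly one of abort or continue, so no request is left paused.

```js
// Sketch: enable interception and resolve every request exactly once.
// `blockResourceTypes` and its default list are hypothetical choices.
async function blockResourceTypes(page, blockedTypes = ['image', 'font', 'stylesheet']) {
  await page.setRequestInterception(true)

  page.on('request', request => {
    // Another handler may already have resolved this request.
    if (request.isInterceptResolutionHandled()) return

    if (blockedTypes.includes(request.resourceType())) {
      // Blocked types never reach the network.
      request.abort().catch(() => {})
    } else {
      // Everything else proceeds unchanged.
      request.continue().catch(() => {})
    }
  })
}

// Possible usage, before page.goto:
//   await blockResourceTypes(page)
```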
