How to intercept and read network requests in Puppeteer
If you're scraping a page where the data shows up in the browser but never in the HTML you get back, you have probably already opened the Network tab and watched the real payload arrive as a separate XHR or fetch call returning JSON. The page renders from that response, not from the markup, so parsing the DOM gets you a loading spinner and selectors that resolve to nothing.
The fix is to listen to the browser's own traffic instead of the rendered output. Puppeteer surfaces every request and response through page.on('request') and page.on('response'), and for the response bodies those events do not hand you directly, a Chrome DevTools Protocol Network session reads them from the browser. That gives you the same JSON the page consumes, in about 40 lines of Node.js with one library.
Key terms
- Request interception. Puppeteer's page.on('request') event, which fires for each outgoing request and exposes its URL, method, headers, and post body before the response returns.
- Response listener. Puppeteer's page.on('response') event, which fires when a response arrives and lets you call response.json() or response.text() on supported response types.
- CDP session. A Chrome DevTools Protocol channel opened with page.createCDPSession(), the same protocol Puppeteer runs on internally, used here to read response bodies the high-level events do not buffer.
- XHR / fetch. The two browser APIs a single-page app uses to load data after the initial HTML, and the request types you filter for to find the payload feeding the page.
Here is what the script does:
- Launch headless Chrome with Puppeteer and open a fresh page.
- Attach a page.on('request') listener that logs the method, URL, and resource type of every outgoing request.
- Attach a page.on('response') listener that reads the JSON body of XHR and fetch responses as they arrive.
- Optionally, open a CDP Network session for the case where you need a response body the high-level listener cannot buffer. The listing below omits this step; the first gotcha shows the calls involved.
- Navigate to the target and let the listeners capture traffic until the network settles.
The complete script
// intercept-network.mjs
import puppeteer from 'puppeteer'

const targetUrl = 'https://httpbin.org/anything'

const browser = await puppeteer.launch({ headless: true })
const page = await browser.newPage()

// Collected response payloads, one entry per JSON response.
const captured = []

// Fires for every outgoing request. Read URL, method, type, and post body here.
page.on('request', request => {
  console.log(`[request] ${request.method()} ${request.resourceType()} ${request.url()}`)
})

// Fires when a response arrives. Read the body for the request types you care about.
page.on('response', async response => {
  const request = response.request()
  const type = request.resourceType()
  // The data feeding a single-page app almost always arrives as xhr or fetch.
  if (type === 'xhr' || type === 'fetch') {
    const contentType = response.headers()['content-type'] || ''
    if (contentType.includes('application/json')) {
      try {
        const body = await response.json()
        captured.push({ url: response.url(), status: response.status(), body })
        console.log(`[json] ${response.status()} ${response.url()}`)
      } catch {
        // A redirect or a body already consumed elsewhere throws here. Skip it.
      }
    }
  }
})

await page.goto(targetUrl, { waitUntil: 'networkidle0' })

console.log(`Captured ${captured.length} JSON responses`)
console.log(JSON.stringify(captured[0]?.body, null, 2))

await browser.close()

npm install puppeteer
node intercept-network.mjs

What each step does
Launch with headless: true. The default headless mode runs the same Chromium build that Puppeteer drives in headed mode, so the requests the page fires are the requests a browser fires. The network listeners attach to the page, not the launch, so the order here is launch, new page, then wire the listeners before you navigate.
Listen on page.on('request'). This event fires once per outgoing request, before the response comes back. The request object carries method(), url(), resourceType(), headers(), and postData(), which is enough to reconstruct the call as a standalone fetch later. This listener does not block the request, because the script does not call page.setRequestInterception(true); it observes traffic rather than rewriting it.
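As a rough illustration of replaying a captured call as a standalone fetch, the sketch below turns a request object into fetch arguments. The helper name toFetchArgs and the list of dropped headers are assumptions made for this example, not part of Puppeteer, and a real replay may also need cookies or tokens that are not visible here.

```javascript
// Hypothetical helper: map a Puppeteer request object to [url, options] for fetch().
function toFetchArgs(request) {
  const method = request.method()
  const headers = { ...request.headers() }

  // Drop headers that the HTTP client computes for itself (illustrative list).
  for (const name of Object.keys(headers)) {
    if (name.startsWith(':') || name === 'content-length' || name === 'host') {
      delete headers[name]
    }
  }

  const options = { method, headers }
  const body = request.postData()

  // fetch() rejects a body on GET and HEAD, so attach it only for other methods.
  if (body !== undefined && method !== 'GET' && method !== 'HEAD') {
    options.body = body
  }
  return [request.url(), options]
}
```

Inside the request listener you could store toFetchArgs(request) for the endpoint you care about, then call fetch(...args) from a plain Node.js script once the browser is closed.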
Listen on page.on('response') and read the body. When a response arrives, response.json() parses a JSON body and response.text() returns the raw string. The filter on resourceType() narrows the flood of image, stylesheet, and font responses down to the xhr and fetch calls that carry a single-page app's data. The try/catch matters because response.json() throws on a redirect or a response whose body has already been consumed.
Wait for networkidle0. Passing waitUntil: 'networkidle0' to page.goto resolves once there have been no network connections for 500ms, which gives the deferred XHR and fetch calls time to fire and land in captured. Without it, goto resolves on the initial document load and the script closes the browser before the data requests run.
Gotchas
page.on('response') cannot read every response body.
- Issue: Calling response.text() on a request that was served from the disk cache, redirected, or already consumed throws "Could not load body for this request" or "Response body is unavailable", and the await rejects.
- Fix: Wrap the body read in try/catch and skip the failures, or open a CDP session and call Network.getResponseBody against the request id, which reads from the browser's response buffer: const cdp = await page.createCDPSession(); await cdp.send('Network.enable').
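The CDP route mentioned in the fix could look roughly like the sketch below. The function name captureBodiesViaCDP and the matches predicate are assumptions for illustration; the CDP method and event names (Network.enable, Network.responseReceived, Network.loadingFinished, Network.getResponseBody) come from the Chrome DevTools Protocol. Treat it as a starting point, not a complete capture layer.

```javascript
// Sketch: collect response bodies through a raw CDP session.
// `matches` is a hypothetical predicate over the response URL.
async function captureBodiesViaCDP(page, matches) {
  const cdp = await page.createCDPSession()
  await cdp.send('Network.enable')

  const pending = new Map() // requestId -> { url, status }
  const bodies = []

  // Remember which request ids belong to responses we care about.
  cdp.on('Network.responseReceived', ({ requestId, response }) => {
    if (matches(response.url)) {
      pending.set(requestId, { url: response.url, status: response.status })
    }
  })

  // Ask for the body only after the browser reports that loading has finished.
  cdp.on('Network.loadingFinished', async ({ requestId }) => {
    const meta = pending.get(requestId)
    if (!meta) return
    pending.delete(requestId)
    try {
      const { body, base64Encoded } = await cdp.send('Network.getResponseBody', { requestId })
      const text = base64Encoded ? Buffer.from(body, 'base64').toString('utf8') : body
      bodies.push({ ...meta, text })
    } catch {
      // The browser may no longer hold this body. Skip it.
    }
  })

  // The array fills in as responses complete.
  return bodies
}
```

Call it before page.goto, for example const bodies = await captureBodiesViaCDP(page, url => url.includes('/api/')), and read bodies after navigation has settled.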
setRequestInterception(true) stalls the page if you forget to continue.
- Issue: Enabling interception to modify or block requests means every request now pauses until you act on it, so a page.on('request') handler that does not call request.continue(), request.abort(), or request.respond() hangs page.goto until it times out.
- Fix: Only enable interception when you need to rewrite or block requests, and when you do, make every code path in the handler end in exactly one of continue, abort, or respond.
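A minimal sketch of a handler where every path resolves the request exactly once is shown below. The function name blockResourceTypes and the default list of blocked types are assumptions chosen for the example; the Puppeteer calls used are setRequestInterception, isInterceptResolutionHandled, abort, and continue.

```javascript
// Sketch: block some resource types and let everything else through.
// `blockedTypes` is a hypothetical default chosen for illustration.
async function blockResourceTypes(page, blockedTypes = ['image', 'font', 'stylesheet']) {
  await page.setRequestInterception(true)

  page.on('request', async request => {
    // Another handler may already have resolved this request. Do not act twice.
    if (request.isInterceptResolutionHandled()) return
    try {
      if (blockedTypes.includes(request.resourceType())) {
        await request.abort()
      } else {
        await request.continue()
      }
    } catch {
      // The page may have closed while the request was paused. Nothing to do.
    }
  })
}
```

The if/else leaves no branch where a paused request is forgotten, which is the property the fix above asks for.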
The data request fires before your listener is attached.
- Issue: Wiring page.on('response') after page.goto returns means the early requests fired during navigation never reach your handler, so captured comes back short or empty.
- Fix: Attach every page.on listener before the await page.goto(...) line, so the handlers are live when the first request leaves the browser.
networkidle0 never resolves on a page that polls.
- Issue: A page with an open WebSocket, an analytics heartbeat, or a polling timer keeps at least one connection busy, so the 500ms idle window in networkidle0 is never reached and goto hangs until its timeout.
- Fix: Switch to waitUntil: 'networkidle2', which allows up to two open connections, or wait for the specific response with page.waitForResponse(res => res.url().includes('/api/data')) instead of waiting for the whole network to settle.
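One way to wait for a single response instead of an idle network is sketched below. The function name gotoAndWaitForJson, the urlFragment parameter, and the 15 second timeout are assumptions for illustration; page.waitForResponse and page.goto are the Puppeteer calls doing the work.

```javascript
// Sketch: navigate and wait for one specific JSON response.
// `urlFragment` and the default timeout are illustrative assumptions.
async function gotoAndWaitForJson(page, targetUrl, urlFragment, timeout = 15000) {
  // Start waiting before navigation begins so an early response is not missed.
  // Promise.all also ensures a failure in either call is handled.
  const [response] = await Promise.all([
    page.waitForResponse(
      res => res.url().includes(urlFragment) && res.status() === 200,
      { timeout }
    ),
    page.goto(targetUrl, { waitUntil: 'domcontentloaded' }),
  ])
  return response.json()
}
```

Because navigation only waits for domcontentloaded here, a WebSocket or polling timer on the page no longer decides when the script moves on.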
Image, font, and stylesheet responses drown out the payload.
- Issue: A media-heavy page fires dozens of image, font, and stylesheet responses, and reading or logging all of them buries the one xhr response that carries the data you came for.
- Fix: Filter on request.resourceType() for xhr and fetch as the script does, or match the URL with response.url().includes('/api/') so the handler only acts on the data endpoint.
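Both filters from the fix can be folded into one predicate, as in the sketch below. The name isDataResponse and the urlFragment parameter are assumptions for this example; an empty fragment matches every URL.

```javascript
// Sketch: decide whether a response is worth reading.
// `urlFragment` is a hypothetical marker for the data endpoint, e.g. '/api/'.
function isDataResponse(response, urlFragment = '') {
  // Keep only the request types that usually carry application data.
  const type = response.request().resourceType()
  if (type !== 'xhr' && type !== 'fetch') return false

  // Skip non-JSON bodies such as HTML fragments or tracking pixels.
  const contentType = response.headers()['content-type'] || ''
  if (!contentType.includes('application/json')) return false

  return response.url().includes(urlFragment)
}
```

In the response listener, an early return such as if (!isDataResponse(response, '/api/')) return keeps the body read and the logging limited to the data endpoint.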
response.json() on a streamed or chunked response can return partial data.
- Issue: A response sent with Transfer-Encoding: chunked or one that streams over a long-lived connection may resolve response.json() before the full body is buffered, returning a truncated object.
- Fix: Use page.waitForResponse to grab the finished response object and read it once it has fully arrived, rather than reading inside the high-volume page.on('response') stream.
Use this when
You want the raw JSON a single-page app loads behind its UI, you are reverse-engineering a site's internal API to call it directly, or you are auditing which third-party endpoints a page contacts and what it sends them.
Skip this when
The data is already in the served HTML (parse it with cheerio instead); you need to block or rewrite requests rather than read them (enable page.setRequestInterception(true) and act in the handler); the endpoint is reachable without a browser (call it directly with fetch and the headers you captured); or you need a full session recording for replay (capture a HAR with a CDP listener and a HAR writer).