How to scrape a single-page app (React/Vue) in Node.js
If you fetched a React or Vue page and logged the HTML, you probably got back a near-empty shell: a <div id="root"></div> or <div id="app"></div>, a couple of bundled script tags, and none of the products, posts, or rows you came for. The data is there when you open the page in a browser, but a plain fetch only sees the server's first response, which arrives before the framework has run a single line of JavaScript.
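A quick way to confirm that diagnosis before reaching for a browser is to test the fetched HTML for an empty mount point. This is a rough sketch: the helper name looksLikeEmptyShell is hypothetical, it assumes the mount element is #root or #app as in the examples above, and the regex is a heuristic, not an HTML parser.

```javascript
// Heuristic check: does this HTML contain an empty SPA mount point?
// Assumes the mount element is <div id="root"> or <div id="app">; adjust for other apps.
function looksLikeEmptyShell(html) {
  // Matches <div ... id="root" ...></div> with nothing but whitespace inside.
  const emptyMount = /<div[^>]*\bid=["'](?:root|app)["'][^>]*>\s*<\/div>/i
  return emptyMount.test(html)
}

// Usage (Node 18+ ships a global fetch):
// const html = await (await fetch(url)).text()
// if (looksLikeEmptyShell(html)) console.log('Client-rendered: use a browser')
```

If the check returns false, the page may already be server-rendered, and a plain fetch plus an HTML parser could be enough.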
The solution is to drive a real Chromium with Puppeteer, let the framework mount and fetch its data, then read the DOM after it has settled. The piece people get wrong is the wait: navigating is not enough, because the markup you want appears a few hundred milliseconds later when an XHR resolves. The script below waits on the rendered content itself, then pulls the records out. It is about 30 lines of Node.js with one library.
Key terms
- Single-page app. A site that ships a minimal HTML shell and builds the page client-side with JavaScript, so the initial response holds almost no content.
- Hydration. The step where React or Vue attaches to server-sent markup and makes it interactive; until it runs, event handlers and client-fetched data are absent.
- waitForFunction. A Puppeteer call that polls a predicate inside the page until it returns truthy, which lets you wait for a specific element count instead of a fixed sleep.
- page.$$eval. A Puppeteer call that runs a function in the page context over every node matching a selector and returns a serializable result to Node.
Here is what the script does:
- Launch headless Chromium with Puppeteer so the page's JavaScript actually executes.
- Navigate and wait for the network to go idle, which covers the framework's initial data fetch.
- Wait with waitForFunction until the rendered items appear, so you read the DOM after React or Vue has painted, not before.
- Pull the fields out of each card with page.$$eval and print them as JSON.
The complete script
// scrape-spa.mjs
import puppeteer from 'puppeteer'

// A Vue-rendered demo store. The initial HTML is an empty #app shell;
// the product grid is fetched and rendered client-side after load.
const url = 'https://vuejs-demo-store.netlify.app/'
const itemSelector = '.product-card'

const browser = await puppeteer.launch({ headless: true })
try {
  const page = await browser.newPage()

  // Headless Chrome's default User-Agent contains "HeadlessChrome", which
  // some sites block; send a stock desktop one instead.
  await page.setUserAgent(
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 ' +
      '(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36'
  )

  // 'networkidle2' resolves once there are <=2 connections for 500ms,
  // which usually means the framework's data XHR has come back.
  await page.goto(url, { waitUntil: 'networkidle2', timeout: 30000 })

  // Network-idle is not the same as content-rendered. Wait for the items
  // themselves, polling until at least one matches, capped at 15s.
  await page.waitForFunction(
    (selector) => document.querySelectorAll(selector).length > 0,
    { timeout: 15000 },
    itemSelector
  )

  // Run in the page context once the DOM has the data, and read each card's fields.
  const products = await page.$$eval(itemSelector, (cards) =>
    cards.map((card) => ({
      name: card.querySelector('.product-name')?.textContent?.trim() ?? null,
      price: card.querySelector('.product-price')?.textContent?.trim() ?? null
    }))
  )

  console.log(`Found ${products.length} products`)
  console.log(JSON.stringify(products, null, 2))
} finally {
  // Close the browser even if a wait times out, so the process exits.
  await browser.close()
}

npm install puppeteer
node scrape-spa.mjs

What each step does
Launch with headless: true. Puppeteer downloads its own browser build on install (Chrome for Testing in recent versions, Chromium in older ones) and runs it without a window. The browser executes the page's bundle the way a user's browser would, which is the whole reason a SPA becomes readable. Wrap everything in try/finally so a timeout still reaches browser.close().
Set a stock desktop User-Agent. Puppeteer's default UA contains HeadlessChrome, which some sites key on to serve a stripped page or a block. Swapping in a regular Chrome-on-macOS string sidesteps the simplest checks. This handles polite servers, not aggressive anti-bot systems, which need more than a header.
Navigate with waitUntil: 'networkidle2'. A SPA's content lands after the framework boots and its first data request returns. networkidle2 holds goto until there have been no more than two open connections for 500ms, which usually lines up with that XHR resolving. The 30-second timeout bounds a page that never quiets down.
Wait for the items with waitForFunction. Network-idle and content-rendered are different moments: idle can fire while the list is still being built into the DOM. The predicate polls inside the page until at least one .product-card exists, so you read after the cards mount. This is the line that separates a reliable SPA scrape from an empty array.
Read fields with page.$$eval. The callback runs in the page, receives every node matching the selector, and returns plain objects back to Node. Optional chaining and a null fallback mean a card missing a price yields null for that field instead of throwing and losing the whole run.
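Because the extraction callback touches nothing outside its arguments, it can be pulled out into a named function and checked in Node with stand-in objects. The sketch below assumes the same .product-name and .product-price selectors as the script above; extractCards and fakeCard are hypothetical names introduced for illustration.

```javascript
// Runs inside the page when passed to page.$$eval, so it must not
// reference anything from the surrounding Node scope.
function extractCards(cards) {
  return cards.map((card) => ({
    name: card.querySelector('.product-name')?.textContent?.trim() ?? null,
    price: card.querySelector('.product-price')?.textContent?.trim() ?? null
  }))
}

// Usage with Puppeteer:
// const products = await page.$$eval('.product-card', extractCards)

// Stand-in for a DOM node: querySelector returns an element-like object or null.
function fakeCard(fields) {
  return { querySelector: (sel) => (sel in fields ? { textContent: fields[sel] } : null) }
}

const sample = extractCards([
  fakeCard({ '.product-name': '  Mug ', '.product-price': '$12' }),
  fakeCard({ '.product-name': 'Poster' }) // no price element on this card
])
console.log(JSON.stringify(sample))
// → [{"name":"Mug","price":"$12"},{"name":"Poster","price":null}]
```

The second card shows the null fallback: a missing price element produces null for that field, and the rest of the run is unaffected.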
Gotchas
The script returns an empty array even though the page clearly has content.
- Issue: You read the DOM right after goto resolves, before React or Vue has rendered the list, so page.$$eval('.product-card', ...) matches zero nodes.
- Fix: Gate the read behind await page.waitForFunction((sel) => document.querySelectorAll(sel).length > 0, {}, itemSelector) so the extraction runs only after the items exist.
networkidle2 hangs until the timeout on chatty pages.
- Issue: Analytics beacons, websockets, and polling keep connections open, so the network never reaches idle and goto waits the full 30 seconds.
- Fix: Switch to waitUntil: 'domcontentloaded' for the navigation and rely on waitForFunction for the real signal, since the content wait is what you actually care about.
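A minimal sketch of that fix, assuming page is a Puppeteer Page you have already opened. The helper name gotoAndWaitForItems and its default timeouts are illustrative, not part of Puppeteer.

```javascript
// Navigate without waiting for network idle, then wait for the content itself.
async function gotoAndWaitForItems(
  page,
  url,
  selector,
  { navTimeout = 30000, itemTimeout = 15000 } = {}
) {
  // Resolves once the HTML shell is parsed, regardless of open connections.
  await page.goto(url, { waitUntil: 'domcontentloaded', timeout: navTimeout })

  // The real readiness signal: at least one matching element exists.
  await page.waitForFunction(
    (sel) => document.querySelectorAll(sel).length > 0,
    { timeout: itemTimeout },
    selector
  )
}

// Usage:
// await gotoAndWaitForItems(page, url, '.product-card')
```

With this shape, a page that never goes network-idle no longer costs the full navigation timeout; only a page that never renders the items does.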
The selector is right in DevTools but matches nothing in Puppeteer.
- Issue: The class name is generated per build (a CSS-modules hash like .product-card__a8f3c), so the literal string you copied does not survive the next deploy.
- Fix: Target a stable attribute instead, for example [data-testid="product-card"] or a structural selector, rather than a hashed class.
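One way to express that fix, as a sketch: the helper below builds a selector list that prefers a data-testid attribute and falls back to a substring match on the readable part of a hashed class. The name stableCardSelector is hypothetical, and it assumes the hash is appended after a double underscore as in the example above; inspect the real markup before relying on either pattern.

```javascript
// Build a selector list that does not depend on the per-build hash.
// Assumptions: the site exposes data-testid="<base>", or its hashed class
// keeps a readable "<base>__" prefix. Neither is guaranteed on a given site.
function stableCardSelector(base) {
  // The substring match can over-match (for example BEM children such as
  // "<base>__title"), so verify the matches in DevTools first.
  return `[data-testid="${base}"], [class*="${base}__"]`
}

console.log(stableCardSelector('product-card'))
// → [data-testid="product-card"], [class*="product-card__"]

// Usage with Puppeteer:
// const cards = await page.$$eval(stableCardSelector('product-card'), (els) => els.length)
```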
page.$$eval throws ReferenceError for a variable from your script.
- Issue: The callback runs in the browser context, not Node, so it cannot see outer variables like itemSelector unless you pass them.
- Fix: Pass extra values as trailing arguments: page.$$eval(sel, (cards, n) => ..., maxItems). Only serializable values cross the boundary.
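To make the argument passing concrete, here is a small sketch. The function firstNames and the variable maxItems are hypothetical; the point is that the limit arrives as a parameter instead of being read from the enclosing Node scope.

```javascript
// Page-side function: receives the matched nodes plus any extra arguments
// that were passed after it to page.$$eval.
function firstNames(cards, limit) {
  return cards.slice(0, limit).map((card) => card.textContent.trim())
}

// Usage with Puppeteer (maxItems lives in Node and is sent as an argument):
// const maxItems = 5
// const names = await page.$$eval(itemSelector, firstNames, maxItems)

// The same function checked in Node with stand-in nodes:
const demo = firstNames(
  [{ textContent: ' Mug ' }, { textContent: 'Poster' }, { textContent: 'Lamp' }],
  2
)
console.log(JSON.stringify(demo))
// → ["Mug","Poster"]
```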
Content loads only after a scroll or a "load more" click.
- Issue: Many SPAs render the first screen, then fetch the rest on scroll, so a single waitForFunction for one card undercounts the list.
- Fix: Scroll in a loop with page.evaluate(() => window.scrollTo(0, document.body.scrollHeight)) and re-check the count until it stops growing. See How to handle "load more" buttons when scraping.
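The scroll loop could look roughly like this, assuming page is a Puppeteer Page. The helper name scrollUntilStable and the maxRounds and pauseMs defaults are illustrative guesses, not tuned values; a slow backend may need a longer pause.

```javascript
// Scroll to the bottom repeatedly until the number of matching items stops growing.
async function scrollUntilStable(page, selector, { maxRounds = 20, pauseMs = 1000 } = {}) {
  let previous = -1
  for (let round = 0; round < maxRounds; round++) {
    const count = await page.$$eval(selector, (els) => els.length)
    if (count === previous) return count // nothing new arrived after the last scroll
    previous = count

    await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight))
    // Give the app time to fetch and render the next batch.
    await new Promise((resolve) => setTimeout(resolve, pauseMs))
  }
  return previous // hit the round cap; the list may still be growing
}

// Usage:
// const total = await scrollUntilStable(page, '.product-card')
// console.log(`Loaded ${total} cards`)
```

The round cap keeps a truly infinite feed from looping forever; run the extraction after the helper returns.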
Headless renders differently from the browser you tested in.
- Issue: Puppeteer's default viewport is 800x600, which can trigger a mobile layout or lazy-render fewer cards than your wide desktop window showed.
- Fix: Set the viewport explicitly with await page.setViewport({ width: 1366, height: 900 }) before goto so the layout matches what you inspected.
Use this when
You need data from a React, Vue, Angular, or other client-rendered page where a plain fetch returns an empty shell, and the content you want lands in the DOM after the framework runs. Product grids, dashboards, infinite feeds, and search results all fit.
Skip this when
The page is server-rendered or static (a plain fetch plus an HTML parser like cheerio is faster and lighter); the data comes from a JSON endpoint you can call directly (hit that XHR and skip the browser entirely); you only need the article text rather than structured fields (run the rendered HTML through Readability and Turndown); the site sits behind aggressive anti-bot defenses (you need a stealth patch and proxies on top of this).