Simplescraper

How to scrape a JavaScript-rendered page in Node.js

Updated 2026-06-25 · 5 min read

If you've pointed fetch or a parser like Cheerio at a modern site and gotten back an empty shell, you're hitting the gap between the HTML the server sends and the HTML the browser ends up showing. Client-rendered React, Vue, and Svelte apps ship a near-empty <div id="root"> and then fill it in with JavaScript after load, and even server-rendered frameworks with client hydration often fetch part of their content only after the page loads. Your request only ever sees the first version, so the list, the prices, or the article text you wanted are not in the bytes you got.
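A quick way to confirm you are hitting this gap is to fetch the raw HTML and count how often a marker you expect appears outside of <script> tags. The sketch below assumes Node 18 or later for the built-in fetch. countOutsideScripts and countInInitialHtml are hypothetical helper names, and the regex is a rough diagnostic, not an HTML parser.

```javascript
// Rough diagnostic: does the server-sent HTML already contain the content?
// Script blocks are stripped first, because client-rendered pages often carry
// their markup or data inside scripts rather than in the document body.
function countOutsideScripts(html, marker) {
  const withoutScripts = html.replace(/<script\b[\s\S]*?<\/script>/gi, '')
  return withoutScripts.split(marker).length - 1
}

// Fetch the raw bytes a non-browser client receives and count the marker.
// (Hypothetical helper; relies on the global fetch available in Node 18+.)
async function countInInitialHtml(url, marker) {
  const res = await fetch(url)
  return countOutsideScripts(await res.text(), marker)
}
```

If a marker you can see in the browser's rendered page comes back with a count of 0 here, the content is most likely injected after load and a headless browser is the right tool.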

The solution is to run the page in a headless browser that executes its JavaScript, wait until the content you want has actually appeared, and only then read the rendered DOM. That is what Puppeteer does: it drives a full Chromium instance over the Chrome DevTools Protocol, so the page renders the way it would for a visitor. It takes about 30 lines of Node.js with one dependency, the puppeteer package.

Key terms

  • Headless browser. A full Chromium instance running without a visible window, driven by code instead of a mouse, so it executes JavaScript exactly as a normal browser does.
  • Hydration. The step where a framework's client-side JavaScript takes a static HTML shell and attaches the live application to it, often replacing or populating the visible content.
  • waitForSelector. A Puppeteer method that pauses until an element matching a CSS selector exists in the DOM, which is how you wait for content the page injects after load.
  • page.evaluate. Runs a function inside the page's own JavaScript context, where document and the rendered DOM are available, and returns the result back to Node.

Here is what the script does:

  • Launch headless Chromium with the puppeteer package and open a new page.
  • Navigate to a page whose content is injected by client-side JavaScript and is not in the initial HTML.
  • Wait for the specific element the page renders so the read does not run against an empty shell.
  • Read the rendered values out of the live DOM with page.evaluate and print them.

The complete script

```js
// scrape-rendered-page.mjs
import puppeteer from 'puppeteer'

// quotes.toscrape.com/js renders its quotes with client-side JavaScript.
// A plain fetch of this URL returns an empty container; the browser fills it in.
const url = 'https://quotes.toscrape.com/js/'

const browser = await puppeteer.launch({ headless: true })

try {
  const page = await browser.newPage()

  // domcontentloaded fires once the HTML is parsed; the JS that injects
  // the quotes has not necessarily run yet, so do not read the DOM here.
  await page.goto(url, { waitUntil: 'domcontentloaded' })

  // Wait for the actual content, not a fixed sleep. This resolves the moment
  // the first .quote element exists, and throws after 10s if it never does.
  await page.waitForSelector('.quote', { timeout: 10_000 })

  // Read values out of the rendered DOM, inside the page's own JS context.
  const quotes = await page.evaluate(() =>
    Array.from(document.querySelectorAll('.quote')).map((el) => ({
      text: el.querySelector('.text')?.textContent ?? '',
      author: el.querySelector('.author')?.textContent ?? ''
    }))
  )

  console.log(`Got ${quotes.length} quotes`)
  console.log(quotes)
} finally {
  // Close the browser even if a selector wait or navigation throws,
  // otherwise the Chromium process leaks and the script hangs.
  await browser.close()
}
```

```bash
npm install puppeteer
node scrape-rendered-page.mjs
```

What each step does

Launch with headless: true. This is the current default and runs Chromium with no visible window. Set it to false while you are debugging so you can watch the page render; switch it back for unattended runs. Do not pass the old 'new' string, which recent Puppeteer versions have dropped.
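One way to flip between the two modes without editing the script is to read the setting from the environment. This is only a sketch: the HEADFUL variable and the launchOptionsFromEnv helper are hypothetical names invented here, while headless and slowMo are existing puppeteer.launch options.

```javascript
// Build puppeteer.launch() options from environment variables.
// HEADFUL=1 is a made-up convention for this sketch, not a Puppeteer feature.
function launchOptionsFromEnv(env = process.env) {
  const debugging = env.HEADFUL === '1'
  return {
    // Show the browser window only while debugging.
    headless: !debugging,
    // slowMo delays each Puppeteer operation by the given number of
    // milliseconds, which makes the render easier to follow by eye.
    slowMo: debugging ? 100 : 0
  }
}
```

With this in place the launch line could become puppeteer.launch(launchOptionsFromEnv()), and HEADFUL=1 node scrape-rendered-page.mjs would open a visible window for that run only.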

Navigate with waitUntil: 'domcontentloaded'. This returns once the initial HTML is parsed, which is fast, but the framework's JavaScript may not have populated the page yet. Do not read the DOM at this point. The alternative networkidle0 waits until there have been no network connections for at least 500 ms, which is heavier and can stall on pages that poll or stream, so a domcontentloaded navigation followed by an explicit element wait is the more predictable pair.

Wait for the rendered element, not a timer. page.waitForSelector('.quote') resolves the instant the first matching element appears, so the read runs against populated content rather than the empty shell. It is bounded by timeout: 10_000, so a page that never renders the selector fails in ten seconds instead of hanging. A fixed setTimeout either wastes time on fast pages or fires too early on slow ones.
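If a missing element should be a handled outcome rather than a crash, the wait can be wrapped so that a timeout becomes a boolean. A minimal sketch: waitForContent is a hypothetical helper, and it assumes Puppeteer's timeout errors carry the name 'TimeoutError', which is worth verifying against the version you have installed.

```javascript
// Resolve to true when the selector appears and false when the wait times out.
// Assumption: the thrown error has name === 'TimeoutError'.
async function waitForContent(page, selector, timeoutMs = 10_000) {
  try {
    await page.waitForSelector(selector, { timeout: timeoutMs })
    return true
  } catch (err) {
    if (err?.name === 'TimeoutError') return false
    // Anything else (closed page, invalid selector) is a real failure.
    throw err
  }
}
```

A caller can then branch on the result, for example logging the URL and moving on when the content never rendered, while unexpected errors still surface.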

Read inside page.evaluate. The callback runs in the page's own context, where document and the hydrated DOM exist. Pull out plain strings and numbers and return them; Puppeteer serializes the result back to Node. DOM nodes themselves cannot cross that boundary, so map them to values before you return.
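The same read can also be written with page.$$eval, which runs querySelectorAll in the page and passes the matched elements to the callback. The sketch below is one possible shape, with extractQuotes as a hypothetical helper name. It trims whitespace and returns only plain objects, since those are what serialize back to Node.

```javascript
// Map each matched element to a plain object inside the page context.
// The .text and .author selectors are the ones used by the script above.
async function extractQuotes(page, selector = '.quote') {
  return page.$$eval(selector, (els) =>
    els.map((el) => ({
      text: el.querySelector('.text')?.textContent?.trim() ?? '',
      author: el.querySelector('.author')?.textContent?.trim() ?? ''
    }))
  )
}
```

The callback must not reference variables from the surrounding Node scope, because it is serialized and executed in the browser; pass any values it needs as extra arguments after the callback.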

Gotchas

  • Reading the DOM right after goto returns an empty result.

    • Issue: await page.goto(url) resolves before the client-side JavaScript has injected the content, so page.evaluate runs against the shell and querySelectorAll('.quote') returns an empty list.
    • Fix: put an await page.waitForSelector('.quote') between the navigation and the read, so the read only runs once the element the page renders exists.
  • A fixed sleep is fragile.

    • Issue: sleeping for a hardcoded interval with await new Promise(r => setTimeout(r, 3000)) either wastes seconds on a fast render or fires before a slow one finishes, and it breaks differently across machines. The older page.waitForTimeout helper had the same problem and has been removed from recent Puppeteer versions.
    • Fix: wait on the thing you actually need with waitForSelector, or page.waitForFunction(() => document.querySelectorAll('.quote').length >= 10) when you need a count rather than a first appearance.
  • networkidle0 hangs on streaming or polling pages.

    • Issue: waitUntil: 'networkidle0' waits for zero in-flight requests, so a page with analytics beacons, a chat widget, or a polling endpoint never reaches idle and the navigation times out.
    • Fix: navigate with domcontentloaded and then wait for a concrete selector, which does not depend on the network ever going fully quiet.
  • The site returns a bot-blocked stub instead of the app.

    • Issue: some sites check the User-Agent or other signals and serve a challenge page to the default headless build, so the selector you expect never renders and the wait times out.
    • Fix: set a stock desktop User-Agent with await page.setUserAgent('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36'); for harder anti-bot fronts see How to patch headless Chrome to avoid detection.
  • The browser process leaks when the script throws.

    • Issue: if waitForSelector or goto throws and browser.close() sits after it without a guard, the Chromium process stays alive and the Node script never exits.
    • Fix: wrap the work in try and call await browser.close() in finally, as the script does, so the browser closes on both the success and the error path.
  • The content is paginated or behind a "load more" button.

    • Issue: waitForSelector('.quote') only confirms the first batch rendered; later items load when you click a control or scroll, and a single read misses them.
    • Fix: drive the interaction first, for example click the "load more" button or scroll to the bottom, wait for the new items to render, and only then read.
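The "load more" case from the last gotcha can be sketched as a loop that clicks the control and waits for the item count to grow, which also shows waitForFunction receiving arguments from Node. This is an illustration under assumptions: loadAllItems is a hypothetical helper, the selectors you pass in depend entirely on the target site, and the loop stops when the button disappears or after maxClicks.

```javascript
// Click a "load more" control until it disappears, waiting after each click
// for more items to exist than before. Returns the final item count.
async function loadAllItems(page, { itemSelector, buttonSelector, maxClicks = 20 }) {
  const countItems = () => page.$$eval(itemSelector, (els) => els.length)

  for (let clicks = 0; clicks < maxClicks; clicks++) {
    const button = await page.$(buttonSelector)
    if (!button) break // control is gone, so there is nothing more to load

    const before = await countItems()
    await button.click()

    // Wait on the count growing instead of sleeping a fixed time.
    // Arguments after the options object are passed to the page callback.
    await page.waitForFunction(
      (sel, n) => document.querySelectorAll(sel).length > n,
      { timeout: 10_000 },
      itemSelector,
      before
    )
  }
  return countItems()
}
```

If a click produces no new items, the waitForFunction call throws after its timeout, so a site that leaves a dead button on the page will surface as an error rather than an endless loop. Run the read step only after this function returns.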

Use this when

A page renders its content with client-side JavaScript, so a plain HTTP fetch comes back as an empty shell and you need the browser to run the page before you can read it. Single-page apps, hydrated server-rendered sites, and pages that fetch their data over XHR after load all fall here.

Skip this when

The HTML already contains the data on the first request, in which case fetch plus Cheerio is faster and lighter; the values come from a JSON API the page calls, where hitting that endpoint directly skips rendering entirely; you only need the cleaned article text, where a readability pass on the rendered HTML is the better tool; or the page sits behind a login, where you handle authentication before any of this.
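To find out whether the JSON API case applies, one option is to let Puppeteer load the page once and record which responses come back as JSON. The sketch below assumes the API labels its responses with an application/json content type; collectJsonResponses is a hypothetical helper name, while page.on('response'), response.headers(), and response.url() are existing Puppeteer calls.

```javascript
// Record the URLs of JSON responses the page receives while it loads.
// Call this before page.goto so that no early responses are missed.
function collectJsonResponses(page) {
  const urls = []
  page.on('response', (response) => {
    // Puppeteer exposes response header names in lower case.
    const contentType = response.headers()['content-type'] ?? ''
    if (contentType.includes('application/json')) urls.push(response.url())
  })
  return urls
}
```

After the selector wait resolves, the returned array holds candidate endpoints. If one of them carries the data you want, requesting it directly with fetch on later runs avoids launching a browser at all.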
