How to scrape a page to PDF in Node.js

Updated 2026-06-24 · 6 min read

If you've tried to turn a scraped web page into a PDF with a library like pdfkit or jsPDF, you've probably ended up with a document that looks nothing like the page. Those tools build a PDF from primitives you draw by hand, so they have no CSS engine and cannot lay out the page's grid, render its web fonts, or run the JavaScript that fills half the content. What is missing is a layout engine, and headless Chrome is one.

The solution is to drive headless Chrome, a full Chrome browser running with no visible window, to the URL, let it render the page exactly as a person would see it, and then ask Chrome's own print engine to paginate that rendered document into a PDF. We'll build a small script that does four things. It launches headless Chrome through Puppeteer, so we get a real layout engine instead of hand-drawn primitives. It loads the URL and waits for the network to go quiet, so late-loading fonts and images are present before the snapshot. It sets the print media type explicitly, so the page renders as it would on paper. Finally, it calls page.pdf() with explicit margins, backgrounds, a title header, and a page-numbered footer, then writes the bytes to disk. The output matches Chrome's print preview because it is the same code path. The whole script is about 40 lines of Node.js with one library, Puppeteer.

The complete script

// scrape-to-pdf.mjs
import puppeteer from 'puppeteer'
import { writeFile } from 'node:fs/promises'

const url = 'https://en.wikipedia.org/wiki/Web_scraping'

const browser = await puppeteer.launch({ headless: true })
const page = await browser.newPage()

// wait until the network has been idle for 500ms, so lazy images and fonts settle.
await page.goto(url, { waitUntil: 'networkidle0', timeout: 60_000 })

// page.pdf() already renders with print media by default; this call makes that explicit.
await page.emulateMediaType('print')

// 9px header/footer templates. Chrome injects values into the .pageNumber,
// .totalPages, .title, and .date spans by class name.
const headerTemplate = `
  <div style="font-size:9px; width:100%; padding:0 1cm; color:#666;">
    <span class="title"></span>
  </div>`

const footerTemplate = `
  <div style="font-size:9px; width:100%; padding:0 1cm; color:#666; text-align:right;">
    Page <span class="pageNumber"></span> of <span class="totalPages"></span>
  </div>`

const pdf = await page.pdf({
  format: 'A4',
  printBackground: true,        // off by default; without it, CSS backgrounds vanish
  displayHeaderFooter: true,    // off by default; required for the templates below
  headerTemplate,
  footerTemplate,
  // top/bottom margins must leave room for the header/footer or they overlap the body.
  margin: { top: '2cm', bottom: '2cm', left: '1.5cm', right: '1.5cm' }
})

await writeFile('page.pdf', pdf)
await browser.close()

console.log(`Wrote page.pdf (${pdf.length} bytes)`)

Install Puppeteer and run the script:

npm install puppeteer
node scrape-to-pdf.mjs

How it works

Launch headless and open a page. puppeteer.launch({ headless: true }) starts a Chrome instance with no visible window. One browser can serve many pages, so if you are converting a batch of URLs, reuse the instance rather than launching Chrome per file. In Docker or on a minimal Linux host the launch can fail, usually because the bundled Chrome is missing shared libraries or because the sandbox cannot start as root. Install Chrome's system dependencies and, when running as root in a trusted container, launch with args: ['--no-sandbox'].
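
As a rough sketch of the batch case, the helper below reuses one browser across a list of URLs and opens a fresh page for each. The names convertBatch and launchOptions are hypothetical and not part of Puppeteer. The browser argument is assumed to be whatever puppeteer.launch() returned, and runningAsRoot is a flag you would set yourself.

```javascript
// Hypothetical helper: convert many URLs with ONE browser instance.
// `browser` is assumed to be the object returned by puppeteer.launch().
async function convertBatch(browser, urls, pdfOptions = { format: 'A4', printBackground: true }) {
  const results = []
  for (const url of urls) {
    const page = await browser.newPage()
    try {
      await page.goto(url, { waitUntil: 'networkidle0', timeout: 60_000 })
      results.push({ url, pdf: await page.pdf(pdfOptions) })
    } catch (error) {
      // Record the failure and keep going so one bad URL does not abort the batch.
      results.push({ url, error })
    } finally {
      // Close the page, not the browser, so the next URL reuses the same Chrome.
      await page.close()
    }
  }
  return results
}

// Hypothetical helper: add --no-sandbox only when the caller says Chrome runs
// as root in a trusted container.
function launchOptions({ runningAsRoot = false } = {}) {
  return runningAsRoot
    ? { headless: true, args: ['--no-sandbox'] }
    : { headless: true }
}

// Usage (assumes `puppeteer` is imported as in the script above):
//   const browser = await puppeteer.launch(launchOptions({ runningAsRoot: true }))
//   const results = await convertBatch(browser, ['https://example.com/a', 'https://example.com/b'])
//   await browser.close()
```

Pages are processed one at a time here to keep memory use predictable. Running several pages in parallel is possible but is a separate tuning decision.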

Wait for networkidle0, not load. The default, load, resolves when the window's load event fires, that is, once the HTML and the stylesheets, scripts, and images referenced in it have loaded. That is too early for pages that stream content over fetch or lazy-load images. networkidle0 waits until there have been zero in-flight network connections for 500ms. That is the difference between a complete PDF and one with blank image boxes. For images that load only on scroll, run a scroll-to-bottom pass in page.evaluate() before calling page.pdf() so they are in the DOM when the snapshot runs.
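
One way to do the scroll-to-bottom pass is sketched below. scrollToBottom is a hypothetical helper name, and the step size, pause, and step cap are arbitrary defaults chosen for illustration rather than tuned values.

```javascript
// Hypothetical helper: scroll the page in steps so scroll-triggered images load.
// `page` is assumed to be a Puppeteer Page that has already navigated.
async function scrollToBottom(page, { step = 400, pauseMs = 100, maxSteps = 200 } = {}) {
  // The callback runs inside the browser, so it only sees the arguments passed in.
  await page.evaluate(async (step, pauseMs, maxSteps) => {
    for (let i = 0; i < maxSteps; i++) {
      const before = window.scrollY
      window.scrollBy(0, step)
      // Give lazy loaders a moment to react to the new scroll position.
      await new Promise((resolve) => setTimeout(resolve, pauseMs))
      // If the position did not change, we have reached the bottom.
      if (window.scrollY === before) break
    }
    // Return to the top so fixed or scroll-dependent elements start from a known state.
    window.scrollTo(0, 0)
  }, step, pauseMs, maxSteps)
}

// Usage:
//   await page.goto(url, { waitUntil: 'networkidle0' })
//   await scrollToBottom(page)
//   const pdf = await page.pdf({ format: 'A4', printBackground: true })
```

Pages with infinite scroll never reach a true bottom, which is why the sketch caps the number of steps. After scrolling, it may also be worth waiting for the newly requested images to finish, for example with page.waitForNetworkIdle().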

Emulate the print media type. Many sites ship a @media print stylesheet that hides nav bars, expands collapsed sections, and switches to black text on white. page.pdf() already renders with print CSS media by default, so page.emulateMediaType('print') states that default explicitly rather than changing it; it matters mainly when earlier code has switched the page to another media type. With print media, the PDF matches what the site's own print button would produce instead of a snapshot of the screen layout. If you do want the screen layout, call page.emulateMediaType('screen') before page.pdf(), and expect sticky nav bars, cookie banners, and dark-mode backgrounds to be baked into the document.
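
To see what the print stylesheet changes on a given site, it can help to render the same page under both media types and compare the results. A minimal sketch, assuming page is an already-loaded Puppeteer page. pdfForMedia is a hypothetical helper name.

```javascript
// Hypothetical helper: render the current page to PDF under a chosen CSS media type.
async function pdfForMedia(page, media, pdfOptions = {}) {
  if (media !== 'print' && media !== 'screen') {
    throw new TypeError(`media must be 'print' or 'screen', got ${media}`)
  }
  // Set the media type explicitly so the result does not depend on earlier calls.
  await page.emulateMediaType(media)
  return page.pdf({ format: 'A4', printBackground: true, ...pdfOptions })
}

// Usage:
//   const printPdf = await pdfForMedia(page, 'print')
//   const screenPdf = await pdfForMedia(page, 'screen')
```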

Set printBackground and displayHeaderFooter. Both default to false. Without printBackground: true, every CSS background-color and background-image is dropped and code blocks, callouts, and dark themes render as bare text. Without displayHeaderFooter: true, your header and footer templates are silently ignored.

Give the header and footer templates an explicit font size. Chrome's default font size inside headerTemplate and footerTemplate is zero, so a template with no font-size renders as an invisible zero-height strip even when displayHeaderFooter is true. Set an explicit size on a wrapping element, as the font-size:9px on each <div> above does.
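
A small builder can make the explicit font size hard to forget. The sketch below mirrors the footer from the script. pageNumberFooter is a hypothetical helper name, and its defaults are simply the values used above.

```javascript
// Hypothetical helper: build a footer template that always carries a font size.
function pageNumberFooter({ fontSizePx = 9, color = '#666', align = 'right' } = {}) {
  if (!(fontSizePx > 0)) {
    throw new RangeError('fontSizePx must be a positive number')
  }
  // Chrome fills in the .pageNumber and .totalPages spans by class name.
  return (
    `<div style="font-size:${fontSizePx}px; width:100%; padding:0 1cm; ` +
    `color:${color}; text-align:${align};">` +
    'Page <span class="pageNumber"></span> of <span class="totalPages"></span>' +
    '</div>'
  )
}

// Usage:
//   await page.pdf({ displayHeaderFooter: true, footerTemplate: pageNumberFooter() })
```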

Reserve margin space for the header and footer. The templates render into the page margin, not the body. If margin.top and margin.bottom are smaller than the template height, the header prints on top of the first lines of content. The 2cm top and bottom margins here leave room for a single-line 9px header and footer. If the page sets its own paper size with @page { size: ... }, passing format overrides it and can clip or rescale the layout, so pass preferCSSPageSize: true and drop the format option to let the page's own @page rule win.
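
The choice between format and preferCSSPageSize can be folded into one options builder. This is a sketch under assumptions: pdfOptionsFor is a hypothetical helper, the caller is expected to know whether the page declares its own @page size, and the margin values are the ones from the script rather than required numbers.

```javascript
// Hypothetical helper: build page.pdf() options with margins sized for the templates.
function pdfOptionsFor({ pageDefinesOwnSize = false, headerFooter = true } = {}) {
  const options = {
    printBackground: true,
    displayHeaderFooter: headerFooter,
    // Taller top/bottom margins only when a header and footer must fit in them.
    margin: headerFooter
      ? { top: '2cm', bottom: '2cm', left: '1.5cm', right: '1.5cm' }
      : { top: '1.5cm', bottom: '1.5cm', left: '1.5cm', right: '1.5cm' }
  }
  if (pageDefinesOwnSize) {
    // Let the page's own @page { size: ... } rule decide the paper size.
    options.preferCSSPageSize = true
  } else {
    options.format = 'A4'
  }
  return options
}

// Usage:
//   const pdf = await page.pdf({
//     ...pdfOptionsFor({ pageDefinesOwnSize: true }),
//     headerTemplate,
//     footerTemplate
//   })
```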

Use this when

You want a faithful, paginated PDF of a rendered web page, for archiving an article, generating an invoice or report from a live page, snapshotting a dashboard, or feeding a print-quality document into a downstream pipeline.

Skip this when

You only need the article text rather than a paginated document (convert to Markdown instead); you are assembling a PDF from structured fields you already scraped rather than a rendered page (use pdfkit to draw it directly); you need a single flat image of the page rather than multi-page paper layout (capture a full-page screenshot); or you are running at high volume and cannot afford a Chrome process per conversion (render the HTML once and reuse one browser across the batch).
