How to scrape a page to PDF in Node.js
If you've tried to turn a scraped web page into a PDF with a library like pdfkit or jsPDF, you've probably ended up with a document that looks nothing like the page. Those tools build a PDF from primitives you draw by hand, so they have no CSS engine and cannot lay out the page's grid, render its web fonts, or run the JavaScript that fills half the content. What is missing is a layout engine, and headless Chrome is one.
The solution is to drive headless Chrome to the URL, let it render the page exactly as a person would see it, and then ask Chrome's own print engine to paginate that rendered document into PDF. You get fidelity equal to Chrome's print preview, because it is the same code path, along with control over margins, backgrounds, and a page-numbered header and footer. It takes about 40 lines of Node.js with one library, Puppeteer.
Key terms
- Headless Chrome. A full Chrome browser running with no visible window, driven over the Chrome DevTools Protocol so it renders pages exactly as the desktop browser would.
- page.pdf(). Puppeteer's wrapper over the protocol's Page.printToPDF command, which runs Chrome's own print engine to paginate the rendered document.
- networkidle0. A waitUntil value that resolves only after there have been zero in-flight network requests for 500ms, so late-loading fonts and images are present before the snapshot.
- Print media type. The CSS media context a page sees when printed; emulating it makes Chrome apply the site's @media print rules instead of its on-screen styles.
- Header and footer templates. Small HTML fragments Chrome renders into the page margins, with .pageNumber, .totalPages, .title, and .date spans it fills in by class name.
Here is what the script does:
- Launch headless Chrome through Puppeteer, the canonical Node.js driver for the Chrome DevTools Protocol.
- Load the target URL and wait for the network to go quiet, so late-loading fonts, images, and web components are in the DOM before the snapshot.
- Force the print media type so Chrome renders the page as it would on paper rather than on screen.
- Call page.pdf() with explicit margins, background printing, and a page-numbered header and footer template.
- Write the PDF bytes to disk.
The complete script
// scrape-to-pdf.mjs
import puppeteer from 'puppeteer'
import { writeFile } from 'node:fs/promises'
const url = 'https://en.wikipedia.org/wiki/Web_scraping'
const browser = await puppeteer.launch({ headless: true })
const page = await browser.newPage()
// Wait until the network has been idle for 500ms, so lazy images and fonts settle.
await page.goto(url, { waitUntil: 'networkidle0', timeout: 60_000 })
// Render with print styles, not screen styles, the same as Chrome's "Print" dialog.
await page.emulateMediaType('print')
// 9px header/footer templates. Chrome injects values into the .pageNumber,
// .totalPages, .title, and .date spans by class name.
const headerTemplate = `
<div style="font-size:9px; width:100%; padding:0 1cm; color:#666;">
<span class="title"></span>
</div>`
const footerTemplate = `
<div style="font-size:9px; width:100%; padding:0 1cm; color:#666; text-align:right;">
Page <span class="pageNumber"></span> of <span class="totalPages"></span>
</div>`
const pdf = await page.pdf({
  format: 'A4',
  printBackground: true, // off by default; without it, CSS backgrounds vanish
  displayHeaderFooter: true, // off by default; required for the templates below
  headerTemplate,
  footerTemplate,
  // Top/bottom margins must leave room for the header/footer or they overlap the body.
  margin: { top: '2cm', bottom: '2cm', left: '1.5cm', right: '1.5cm' }
})
await writeFile('page.pdf', pdf)
await browser.close()
console.log(`Wrote page.pdf (${pdf.length} bytes)`)

Install Puppeteer and run the script:

npm install puppeteer
node scrape-to-pdf.mjs

What each step does
Launch headless and open a page. puppeteer.launch({ headless: true }) starts a Chrome instance with no visible window. One browser can serve many pages, so reuse the instance if you are converting a batch of URLs rather than launching Chrome per file.
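The reuse pattern can be sketched as follows. This is a minimal illustration, not part of the script above: the helper name convertBatch and its return shape are hypothetical, and browser is assumed to be the object returned by puppeteer.launch(). Only browser.newPage(), page.goto(), page.pdf(), and page.close() are Puppeteer calls.

```javascript
// Hypothetical helper: convert several URLs with one shared browser.
// `browser` is assumed to be the result of puppeteer.launch().
async function convertBatch(browser, urls, pdfOptions = { format: 'A4', printBackground: true }) {
  const results = []
  for (const url of urls) {
    const page = await browser.newPage() // one tab per URL, same Chrome process
    try {
      await page.goto(url, { waitUntil: 'networkidle0', timeout: 60_000 })
      results.push({ url, pdf: await page.pdf(pdfOptions) })
    } finally {
      await page.close() // release the tab even if navigation or printing fails
    }
  }
  return results
}
```

Launch Chrome once, call convertBatch(browser, urls), and close the browser after the whole batch. The loop is deliberately sequential so only one tab is open at a time, which keeps memory use predictable.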
Wait for networkidle0, not load. The default load event fires when the initial HTML and its synchronous resources finish, which is too early for pages that stream content over fetch or lazy-load images. networkidle0 waits until there have been zero in-flight network connections for 500ms. That is the difference between a complete PDF and one with blank image boxes.
Emulate the print media type. Many sites ship a @media print stylesheet that hides nav bars, expands collapsed sections, and switches to black text on white. page.pdf() applies print media by default; calling page.emulateMediaType('print') first makes that explicit and guards against an earlier switch to screen media, so your PDF matches what the site's own print button would produce rather than the on-screen layout.
Set printBackground and displayHeaderFooter. Both default to false. Without printBackground: true, every CSS background-color and background-image is dropped and code blocks, callouts, and dark themes render as bare text. Without displayHeaderFooter: true, your header and footer templates are silently ignored.
Reserve margin space for the header and footer. The templates render into the page margin, not the body. If margin.top and margin.bottom are smaller than the template height, the header prints on top of the first lines of content. The 2cm top and bottom margins here leave room for a single-line 9px header and footer.
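A rough back-of-the-envelope check of that margin, as a sketch. It uses the CSS definitions of 96px per inch and 2.54cm per inch, the 9px font size from the script's templates, and an assumed line height of 1.2, which is a typical browser default rather than a value Chrome guarantees for templates.

```javascript
// Convert a CSS centimetre length to CSS pixels (96px per inch, 2.54cm per inch).
const cmToPx = (cm) => (cm / 2.54) * 96

// Rough height of a single-line template: font size times an assumed line height of 1.2.
const singleLineHeightPx = (fontSizePx, lineHeight = 1.2) => fontSizePx * lineHeight

const marginPx = cmToPx(2)               // about 75.6px
const templatePx = singleLineHeightPx(9) // about 10.8px
console.log(marginPx > templatePx)       // true: the margin has room for the template
```

A multi-line header, or one with its own padding, needs the same comparison against its real rendered height.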
Gotchas
The header and footer print blank.
- Issue: Chrome's default font size inside headerTemplate and footerTemplate is zero, so a template with no explicit font-size renders as an invisible zero-height strip even when displayHeaderFooter is true.
- Fix: set an explicit size on a wrapping element, for example <div style="font-size:9px">...</div>, on every header and footer template.
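One way to make the explicit size hard to forget is to build every template through a small wrapper. The sketch below assumes a hypothetical helper named wrapTemplate; the inline styles mirror the ones in the script above, and the fontSizePx and align options are illustrative.

```javascript
// Hypothetical helper: wrap header/footer content in a div with an explicit font size,
// so a template is never left at Chrome's invisible default.
function wrapTemplate(innerHtml, { fontSizePx = 9, align = 'left' } = {}) {
  return `<div style="font-size:${fontSizePx}px; width:100%; padding:0 1cm; color:#666; text-align:${align};">${innerHtml}</div>`
}

const headerTemplate = wrapTemplate('<span class="title"></span>')
const footerTemplate = wrapTemplate(
  'Page <span class="pageNumber"></span> of <span class="totalPages"></span>',
  { align: 'right' }
)
```

Both strings can be passed straight to page.pdf() as headerTemplate and footerTemplate.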
Page backgrounds and colors are missing.
- Issue: printBackground defaults to false, so background-color, background-image, and many color fills are stripped from the PDF, which flattens code blocks and themed sections to plain text.
- Fix: pass printBackground: true in the page.pdf() options.
The header overlaps the first lines of the page.
- Issue: the header and footer draw inside the page margin, so a default or small margin.top lets the header sit on top of the body text.
- Fix: set margin.top and margin.bottom larger than the rendered template height, for example { top: '2cm', bottom: '2cm' }.
Lazy-loaded images and fonts are blank in the PDF.
- Issue: page.goto(url) with the default waitUntil: 'load' resolves before content that streams in over fetch or scrolls into view has arrived, so those regions snapshot empty.
- Fix: use waitUntil: 'networkidle0', and for images that load only on scroll, run a scroll-to-bottom pass in page.evaluate() before calling page.pdf().
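A scroll-to-bottom pass might look like the sketch below. The helper name scrollToBottom and its stepPx, delayMs, and maxSteps defaults are assumptions for illustration; page is assumed to be a Puppeteer Page, and the only Puppeteer call used is page.evaluate().

```javascript
// Hypothetical helper: scroll the page down in steps so scroll-triggered images
// start loading, then return to the top before printing.
// maxSteps caps the loop so an infinite-scroll page cannot run forever.
async function scrollToBottom(page, { stepPx = 400, delayMs = 100, maxSteps = 200 } = {}) {
  await page.evaluate(async (step, delay, limit) => {
    // Everything in this function runs inside the browser, not in Node.
    let scrolled = 0
    for (let i = 0; i < limit && scrolled < document.body.scrollHeight; i++) {
      window.scrollBy(0, step)
      scrolled += step
      await new Promise((resolve) => setTimeout(resolve, delay))
    }
    window.scrollTo(0, 0)
  }, stepPx, delayMs, maxSteps)
}
```

Call await scrollToBottom(page) after page.goto() and before page.pdf(). The images requested during the pass may still be in flight when it returns, so a following wait such as page.waitForNetworkIdle() is a reasonable addition.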
The PDF uses screen styles instead of print styles.
- Issue: page.pdf() uses print media by default, but if earlier code on the same page called page.emulateMediaType('screen'), for example to take a screenshot, the screen stylesheet stays active, so sticky nav bars, cookie banners, and dark-mode backgrounds end up baked into the document.
- Fix: call await page.emulateMediaType('print') before page.pdf() so the site's @media print rules take effect.
CSS @page size is overridden by the format option.
- Issue: if the page defines its own paper size with @page { size: ... }, passing format or width/height to page.pdf() overrides it and can clip or rescale the layout.
- Fix: pass preferCSSPageSize: true to let the page's own @page rule win, and drop the conflicting format option.
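The two cases can be kept apart with a small options builder, sketched here under stated assumptions: buildPdfOptions is a hypothetical helper, and pageDefinesSize is a flag the caller supplies, since this sketch does not try to detect an @page rule itself. The option names printBackground, preferCSSPageSize, and format are the page.pdf() options discussed above.

```javascript
// Hypothetical helper: choose between the page's own @page size and a fixed format.
// `pageDefinesSize` is a flag the caller supplies; this sketch does not detect it.
function buildPdfOptions({ pageDefinesSize = false } = {}) {
  const base = { printBackground: true }
  return pageDefinesSize
    ? { ...base, preferCSSPageSize: true } // let @page { size: ... } win; no format
    : { ...base, format: 'A4' }            // no @page size declared; pick the paper here
}

console.log(buildPdfOptions({ pageDefinesSize: true }))
// { printBackground: true, preferCSSPageSize: true }
```

Keeping format out of the first branch avoids sending two competing size instructions to Chrome.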
Headless Chrome cannot launch in a container.
- Issue: in Docker or a minimal Linux host, puppeteer.launch() throws because the bundled Chrome is missing shared libraries, or the sandbox cannot start as root.
- Fix: install Chrome's system dependencies, and when running as root in a trusted container, launch with args: ['--no-sandbox'].
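To keep the sandbox on everywhere except the one environment that needs it off, the launch options can be built conditionally. In this sketch buildLaunchOptions and its trustedRootContainer flag are hypothetical names; the caller decides when the flag applies, for instance from an environment variable set only in the container image.

```javascript
// Hypothetical helper: build puppeteer.launch() options, adding --no-sandbox only
// when the caller states it is running as root inside a trusted container.
function buildLaunchOptions({ trustedRootContainer = false } = {}) {
  const options = { headless: true, args: [] }
  if (trustedRootContainer) {
    options.args.push('--no-sandbox') // disables Chrome's sandbox; avoid for untrusted pages
  }
  return options
}

// Usage, assuming puppeteer is imported:
// const browser = await puppeteer.launch(buildLaunchOptions({ trustedRootContainer: true }))
```

Because disabling the sandbox removes a layer of isolation, the default leaves it enabled.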
Use this when
You want a faithful, paginated PDF of a rendered web page, for archiving an article, generating an invoice or report from a live page, snapshotting a dashboard, or feeding a print-quality document into a downstream pipeline.
Skip this when
You only need the article text rather than a paginated document (convert to Markdown instead); you are assembling a PDF from structured fields you already scraped rather than a rendered page (use pdfkit to draw it directly); you need a single flat image of the page rather than multi-page paper layout (capture a full-page screenshot); or you are running at high volume and cannot afford a Chrome process per conversion (render the HTML once and reuse one browser across the batch).