How to scrape data from a Shadow DOM
If you've pointed a selector at a page and got nothing back even though the value is sitting right there in the rendered DOM, the data is probably inside a shadow root. A web component attaches its own isolated DOM tree to a host element, and a plain document.querySelector stops at that boundary: the markup you see in DevTools under a #shadow-root (open) node is invisible to ordinary CSS selectors.
The solution is to use Puppeteer's shadow-aware selectors so the query descends through the shadow boundary instead of stopping at it. Puppeteer ships a >>> deep combinator and a pierce/ prefix that cross open shadow roots, and for the cases those cannot reach you walk element.shadowRoot by hand inside page.evaluate. It takes about 35 lines of Node.js and one open-source library.
Key terms
- Shadow DOM. An isolated DOM subtree attached to a host element with attachShadow, scoped so its CSS and structure do not leak into or out of the main document.
- Shadow root. The top node of that subtree, reachable as host.shadowRoot when the host was created with { mode: 'open' } and null when it was created with { mode: 'closed' }.
- Deep combinator (>>>). A Puppeteer selector operator that matches elements at any depth across open shadow roots, the shadow-piercing analog of the descendant space.
- Pierce selector (pierce/). A Puppeteer selector prefix that returns every element matching one CSS selector across all open shadow roots in the document.
Here is what the script does:
- Launch headless Chromium with Puppeteer and load a page that renders its content inside an open shadow root.
- Pull text out with the >>> deep combinator, which Puppeteer resolves across the shadow boundary that plain CSS cannot cross.
- Pull the same fields with a pierce/ selector to show the flat alternative when you want every match in one call.
- Drop into page.evaluate and walk host.shadowRoot directly for nested roots and attribute reads the combinators do not cover.
The complete script
// scrape-shadow-dom.mjs
import puppeteer from 'puppeteer'
/*
This demo builds its own open shadow DOM with setContent so the script is
deterministic and runs anywhere. To scrape a live site, delete the markup
and the setContent call and use: await page.goto('https://example.com').
The selectors below are unchanged against any open shadow root.
*/
const markup = `
<product-card>
<template shadowrootmode="open">
<h2 class="name">Wireless Headphones</h2>
<span class="price" data-cents="8999">$89.99</span>
</template>
</product-card>`
const browser = await puppeteer.launch({ headless: true })
const page = await browser.newPage()
// setContent parses declarative shadow roots, so the <template> becomes a real open shadow root.
await page.setContent(markup)
// 1. Deep combinator: descend from the host into its open shadow root at any depth.
const name = await page.$eval('product-card >>> .name', el => el.textContent.trim())
// 2. Pierce selector: match across every open shadow root in the document in one call.
const price = await page.$eval('pierce/.price', el => el.textContent.trim())
// 3. Manual walk: read shadowRoot by hand when you need an attribute or a nested root.
const cents = await page.evaluate(() => {
  const host = document.querySelector('product-card')
  // host.shadowRoot is the ShadowRoot for an open host, or null for a closed one.
  return host.shadowRoot.querySelector('.price').getAttribute('data-cents')
})
console.log({ name, price, cents })
await browser.close()

npm install puppeteer
node scrape-shadow-dom.mjs

What each step does
Load the page with an open shadow root. Puppeteer drives headless Chromium, which renders the same shadow DOM a browser does. The demo uses page.setContent with a <template shadowrootmode="open"> so it has a real open shadow root without depending on a third-party site; on a live target you swap in page.goto(url) and the selectors below stay the same.
Read text with the deep combinator. product-card >>> .name selects .name at any depth inside the host's open shadow root. The >>> is Puppeteer's deep descendant combinator; a plain product-card .name returns nothing because standard CSS does not descend into a shadow tree. Use >>>> when you want only the host's immediate shadow root and not roots nested deeper.
Read the same field with a pierce selector. pierce/.price matches every .price across all open shadow roots in one query, with no host prefix. It is the flatter option when the field is unique on the page; the deep combinator is the one to reach for when you need to anchor the match under a specific host.
Walk shadowRoot by hand for attributes and nested roots. Inside page.evaluate you have the live DOM, so host.shadowRoot.querySelector('.price').getAttribute('data-cents') reads the data-cents attribute the text selectors skip. This manual path is also how you reach a shadow root nested inside another shadow root, by chaining .shadowRoot once per level.
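When the nesting depth is not known up front, chaining .shadowRoot once per level gets tedious. The sketch below is one way to generalize the manual walk: a hypothetical helper, deepText (not part of Puppeteer), that recurses through every open shadow root it meets and returns the trimmed text of each match. It assumes it runs in the page, where document is defined, so it would be handed to page.evaluate and not called from Node.

```javascript
// Hypothetical helper (not a Puppeteer API): collect the trimmed text of every
// element matching `selector`, descending into open shadow roots at any depth.
// Intended to run in the page, e.g.:
//   const prices = await page.evaluate(deepText, '.price')
function deepText(selector) {
  const found = []
  const visit = (root) => {
    for (const el of root.children) {
      if (el.matches(selector)) found.push(el.textContent.trim())
      // shadowRoot is null for closed roots and for elements without one.
      if (el.shadowRoot) visit(el.shadowRoot)
      // Keep walking the element's own (light-DOM) children too.
      visit(el)
    }
  }
  visit(document)
  return found
}
```

Puppeteer serializes the function before sending it to the page, so the helper keeps everything it needs inside its own body and returns plain strings instead of DOM nodes.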
Gotchas
A closed shadow root returns nothing.
- Issue: when a component is created with attachShadow({ mode: 'closed' }), host.shadowRoot is null and neither >>> nor pierce/ can enter it, so the query comes back empty.
- Fix: check the mode in DevTools first (#shadow-root (closed)). If it is closed, look for the same data in a JSON blob the page already fetched, read it from an XHR response, or override Element.prototype.attachShadow before the component initializes to force open roots.
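The attachShadow override can be sketched roughly as follows. forceOpenShadowRoots is a hypothetical helper name, and the assumption is that it is registered with page.evaluateOnNewDocument before navigation so that it runs ahead of the page's own scripts. Treat it as a last resort, since a component written for a closed root is not guaranteed to behave the same with an open one.

```javascript
// Hypothetical helper: wrap attachShadow so every shadow root is created open.
// Assumed usage, registered before page.goto so it runs first in each document:
//   await page.evaluateOnNewDocument(forceOpenShadowRoots)
function forceOpenShadowRoots() {
  const original = Element.prototype.attachShadow
  Element.prototype.attachShadow = function (init) {
    // Keep the component's other options (delegatesFocus etc.), override only the mode.
    return original.call(this, { ...init, mode: 'open' })
  }
}
```

With the override in place, host.shadowRoot is populated for roots the component asked to close, so the >>> and pierce/ selectors from the script can be tried against them.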
Plain page.$('.name') skips the shadow tree.
- Issue: calling page.$('.name') or page.$$('.name') with an ordinary CSS selector stops at the shadow boundary and returns null or an empty array, because CSS descendant matching does not cross into a shadow root.
- Fix: prefix with the host and a deep combinator (product-card >>> .name) or use the pierce/ form (pierce/.name). Reserve bare CSS for light-DOM elements.
The selector runs before the component upgrades.
- Issue: custom elements register and attach their shadow root after the initial HTML parses, so a query fired too early sees the bare <product-card> host with no shadow content yet.
- Fix: wait for the inner node with await page.waitForSelector('product-card >>> .name') before reading, rather than querying immediately after navigation.
pierce/ returns matches from unrelated roots.
- Issue: pierce/.price collects .price from every open shadow root on the page, so on a listing with many cards page.$$('pierce/.price') mixes results across components with no grouping.
- Fix: iterate the hosts first and scope each read, for example for (const card of await page.$$('product-card')) { await card.$eval(':scope >>> .price', el => el.textContent) }, so each value stays tied to its card.
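Another way to keep fields grouped is to build one record per host inside the page, which costs a single round trip instead of one per card. This is a minimal sketch under stated assumptions: readCards is a hypothetical helper, and the product-card, .name, and .price names are taken from the demo markup, so a live site would need its own selectors.

```javascript
// Hypothetical helper: return one { name, price } record per <product-card>.
// Runs in the page; assumed usage: const cards = await page.evaluate(readCards)
function readCards() {
  return Array.from(document.querySelectorAll('product-card'), (host) => {
    const root = host.shadowRoot
    // null means the root is closed or the component has not upgraded yet.
    if (!root) return null
    const text = (sel) => root.querySelector(sel)?.textContent.trim() ?? null
    return { name: text('.name'), price: text('.price') }
  })
}
```

A null entry marks a host whose shadow root could not be read, and a null field marks a selector that matched nothing inside that card, so gaps stay visible and rows do not silently shift.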
element.shadowRoot is undefined in Node.
- Issue: reaching for host.shadowRoot in the Node.js scope yields undefined, and chaining .querySelector onto it throws a TypeError, because the host is a Puppeteer ElementHandle and the property lives on the browser-side DOM object, not the handle.
- Fix: read it inside page.evaluate(() => document.querySelector('product-card').shadowRoot...), where the callback runs in the page and shadowRoot resolves on the real element.
Slotted light-DOM content is not in the shadow root.
- Issue: text passed into a component as children and projected through a <slot> lives in the host's light DOM, so product-card >>> .label misses it and you wrongly conclude the value is absent.
- Fix: query the slotted node in the light DOM directly (product-card .label), since deep combinators target the shadow tree and slotted content stays where it was authored.
Use this when
You are scraping a site built from web components (Polymer, Lit, Stencil, or hand-rolled custom elements) and the values you want render under an open #shadow-root, where a normal selector returns nothing.
Skip this when
The shadow root is closed (look for the data in an XHR response or an inline JSON payload instead); the content is plain light-DOM markup (use ordinary CSS selectors); the page never runs JavaScript (a static fetch plus a parser is lighter than a browser); or you only need the article body as text (run a readability pass).