Simplescraper

How to scrape data from a Shadow DOM

Updated 2026-06-25 · 6 min read

If you've pointed a selector at a page and got nothing back even though the value is sitting right there in the rendered DOM, the data is probably inside a shadow root. A web component attaches its own isolated DOM tree to a host element, and a plain document.querySelector stops at that boundary: the markup you see in DevTools under a #shadow-root (open) node is invisible to ordinary CSS selectors.
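
One way to confirm that suspicion is to list which elements on the page carry an open shadow root. The sketch below is meant to run in the page, for example via page.evaluate or pasted into the DevTools console. listShadowHosts is a hypothetical helper name, not a DOM or Puppeteer API, and it only inspects the light DOM of the root it is given, so hosts nested inside other shadow roots are not reported.

```javascript
// Hypothetical helper: report the tag names of elements that expose an open shadow root.
// Assumes it runs in the page, where `document` exists (the default is only evaluated there).
function listShadowHosts(root = document) {
  const hosts = []
  for (const el of root.querySelectorAll('*')) {
    // el.shadowRoot is a ShadowRoot for open hosts and null for closed or plain elements.
    if (el.shadowRoot) hosts.push(el.tagName.toLowerCase())
  }
  return hosts
}
```

With Puppeteer this could be called as await page.evaluate(listShadowHosts). If the element that visually contains your value shows up in the result, an ordinary selector will not reach inside it.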

The fix is to use Puppeteer's shadow-aware selectors so the query descends through the shadow boundary instead of stopping at it. Puppeteer ships a >>> deep combinator and a pierce/ prefix that cross open shadow roots, and for the cases those cannot reach you walk element.shadowRoot by hand inside page.evaluate. The whole script is about 40 lines of Node.js and needs one open-source library, Puppeteer.

Key terms

  • Shadow DOM. An isolated DOM subtree attached to a host element with attachShadow, scoped so its CSS and structure do not leak into or out of the main document.
  • Shadow root. The top node of that subtree, reachable as host.shadowRoot when the host was created with { mode: 'open' } and null when it was created with { mode: 'closed' }.
  • Deep combinator (>>>). A Puppeteer selector operator that matches elements at any depth across open shadow roots. It is the shadow-piercing counterpart of the CSS descendant combinator (a space).
  • Pierce selector (pierce/). A Puppeteer selector prefix that returns every element matching one CSS selector across all open shadow roots in the document.

Here is what the script does:

  • Launch headless Chromium with Puppeteer and load a page that renders its content inside an open shadow root.
  • Pull text out with the >>> deep combinator, which Puppeteer resolves across the shadow boundary that plain CSS cannot cross.
  • Pull the same fields with a pierce/ selector to show the flat alternative when you want every match in one call.
  • Drop into page.evaluate and walk host.shadowRoot directly for nested roots and attribute reads the combinators do not cover.

The complete script

```js
// scrape-shadow-dom.mjs
import puppeteer from 'puppeteer'

/*
  This demo builds its own open shadow DOM with setContent so the script is
  deterministic and runs anywhere. To scrape a live site, delete the markup
  and the setContent call and use: await page.goto('https://example.com').
  The selectors below are unchanged against any open shadow root.
*/
const markup = `
  <product-card>
    <template shadowrootmode="open">
      <h2 class="name">Wireless Headphones</h2>
      <span class="price" data-cents="8999">$89.99</span>
    </template>
  </product-card>`

const browser = await puppeteer.launch({ headless: true })
const page = await browser.newPage()

// setContent parses declarative shadow roots, so the <template> becomes a real open shadow root.
await page.setContent(markup)

// 1. Deep combinator: descend from the host into its open shadow root at any depth.
const name = await page.$eval('product-card >>> .name', el => el.textContent.trim())

// 2. Pierce selector: match across every open shadow root in the document in one call.
const price = await page.$eval('pierce/.price', el => el.textContent.trim())

// 3. Manual walk: read shadowRoot by hand when you need an attribute or a nested root.
const cents = await page.evaluate(() => {
  const host = document.querySelector('product-card')
  // host.shadowRoot is the ShadowRoot for an open host, or null for a closed one.
  return host.shadowRoot.querySelector('.price').getAttribute('data-cents')
})

console.log({ name, price, cents })

await browser.close()
```

```bash
npm install puppeteer
node scrape-shadow-dom.mjs
```

What each step does

Load the page with an open shadow root. Puppeteer drives headless Chromium, which renders the same shadow DOM a browser does. The demo uses page.setContent with a <template shadowrootmode="open"> so it has a real open shadow root without depending on a third-party site; on a live target you swap in page.goto(url) and the selectors below stay the same.

Read text with the deep combinator. product-card >>> .name selects .name at any depth inside the host's open shadow root. The >>> is Puppeteer's deep descendant combinator; a plain product-card .name returns nothing because standard CSS does not descend into a shadow tree. Use >>>>, the deep child combinator, when you want only the host's immediate shadow root and not roots nested deeper.
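
To see the three selector forms side by side, a small comparison can help. This is a minimal sketch, not part of the original script: countMatches is a hypothetical helper, and it assumes page is a Puppeteer Page (only page.$$ is used).

```javascript
// Hypothetical helper: count matches for plain CSS, >>> and >>>> under the same host.
// Assumes `page` is a Puppeteer Page; page.$$ resolves to an array of element handles.
async function countMatches(page, host, inner) {
  const selectors = {
    plain: `${host} ${inner}`,          // ordinary CSS descendant, stays in the light DOM
    deep: `${host} >>> ${inner}`,       // any depth across open shadow roots
    deepChild: `${host} >>>> ${inner}`, // only the host's immediate shadow root
  }
  const counts = {}
  for (const [label, selector] of Object.entries(selectors)) {
    counts[label] = (await page.$$(selector)).length
  }
  return counts
}
```

Called as await countMatches(page, 'product-card', '.name') against the demo markup, the plain count should be 0 for the reason described above, while the two deep forms are the ones expected to find the heading.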

Read the same field with a pierce selector. pierce/.price matches every .price across all open shadow roots in one query, with no host prefix. It is the flatter option when the field is unique on the page; the deep combinator is the one to reach for when you need to anchor the match under a specific host.
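
When a page holds many cards, anchoring each read under its own host keeps the fields grouped. The following is a sketch under stated assumptions: scrapeCards is a hypothetical helper, page is a Puppeteer Page, and every card is assumed to contain each requested field (ElementHandle.$eval throws when nothing matches).

```javascript
// Hypothetical helper: read a set of fields from every host, one row per host.
// `fields` maps an output key to a CSS selector inside the host's shadow root.
async function scrapeCards(page, hostSelector, fields) {
  const rows = []
  for (const card of await page.$$(hostSelector)) {
    const row = {}
    for (const [key, selector] of Object.entries(fields)) {
      // :scope anchors the deep combinator at this card, so values never mix across cards.
      row[key] = await card.$eval(`:scope >>> ${selector}`, el => el.textContent.trim())
    }
    rows.push(row)
  }
  return rows
}
```

A call such as await scrapeCards(page, 'product-card', { name: '.name', price: '.price' }) would return one object per card. A flat pierce/ query is shorter, but it gives you no way to tell which price belongs to which name.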

Walk shadowRoot by hand for attributes and nested roots. Inside page.evaluate you have the live DOM, so host.shadowRoot.querySelector('.price').getAttribute('data-cents') reads the data-cents attribute the text selectors skip. This manual path is also how you reach a shadow root nested inside another shadow root, by chaining .shadowRoot once per level.
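
Chaining .shadowRoot once per level can be wrapped in a small loop. This is an illustrative sketch: queryThroughShadow is a hypothetical helper, and the host names in the usage note are made up. It relies only on the standard querySelector and shadowRoot members, and it returns null as soon as a host is missing or its root is closed.

```javascript
// Hypothetical helper: descend through a chain of shadow hosts, then query the leaf.
// hostSelectors lists one selector per nesting level, outermost first.
function queryThroughShadow(root, hostSelectors, leafSelector) {
  let scope = root
  for (const hostSelector of hostSelectors) {
    const host = scope.querySelector(hostSelector)
    // A missing host or a closed root (shadowRoot === null) ends the walk.
    if (!host || !host.shadowRoot) return null
    scope = host.shadowRoot
  }
  return scope.querySelector(leafSelector)
}
```

Functions defined in Node are not visible in the page, so the helper has to be declared inside the page.evaluate callback (or otherwise injected into the page) before calling something like queryThroughShadow(document, ['app-shell', 'product-card'], '.price').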

Gotchas

  • A closed shadow root returns nothing.

    • Issue: when a component is created with attachShadow({ mode: 'closed' }), host.shadowRoot is null and neither >>> nor pierce/ can enter it, so the query comes back empty.
    • Fix: check the mode in DevTools first (#shadow-root (closed)). If it is closed, look for the same data in a JSON blob the page already fetched, read it from an XHR response, or override Element.prototype.attachShadow before the component initializes to force open roots.
  • Plain page.$('.name') skips the shadow tree.

    • Issue: calling page.$('.name') or page.$$('.name') with an ordinary CSS selector stops at the shadow boundary and returns null or an empty array, because CSS descendant matching does not cross into a shadow root.
    • Fix: prefix with the host and a deep combinator (product-card >>> .name) or use the pierce/ form (pierce/.name). Reserve bare CSS for light-DOM elements.
  • The selector runs before the component upgrades.

    • Issue: custom elements register and attach their shadow root after the initial HTML parses, so a query fired too early sees the bare <product-card> host with no shadow content yet.
    • Fix: wait for the inner node with await page.waitForSelector('product-card >>> .name') before reading, rather than querying immediately after navigation.
  • pierce/ returns matches from unrelated roots.

    • Issue: pierce/.price collects .price from every open shadow root on the page, so on a listing with many cards page.$$('pierce/.price') mixes results across components with no grouping.
    • Fix: iterate the hosts first and scope each read, for example for (const card of await page.$$('product-card')) { await card.$eval(':scope >>> .price', el => el.textContent) }, so each value stays tied to its card.
  • element.shadowRoot is undefined in Node.

    • Issue: in the Node.js scope the host is a Puppeteer ElementHandle, not the browser-side DOM object, so host.shadowRoot evaluates to undefined and chaining a call such as host.shadowRoot.querySelector(...) throws a TypeError.
    • Fix: read it inside page.evaluate(() => document.querySelector('product-card').shadowRoot...), where the callback runs in the page and shadowRoot resolves on the real element.
  • Slotted light-DOM content is not in the shadow root.

    • Issue: text passed into a component as children and projected through a <slot> lives in the host's light DOM, so product-card >>> .label misses it and you wrongly conclude the value is absent.
    • Fix: query the slotted node in the light DOM directly (product-card .label), since deep combinators target the shadow tree and slotted content stays where it was authored.
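
For the closed-root gotcha above, the attachShadow override could look roughly like the sketch below. forceOpenShadowRoots is a hypothetical helper name. The patch only affects roots created by a script call to attachShadow after the patch is installed, so it has to run before the component code does. It is assumed not to cover declarative roots written as shadowrootmode="closed" in the HTML, and a site may behave differently once its roots are open.

```javascript
// Hypothetical helper: rewrite every attachShadow call so the root is created open.
// The default argument is only evaluated in the page, where Element exists.
function forceOpenShadowRoots(proto = Element.prototype) {
  const original = proto.attachShadow
  proto.attachShadow = function (init) {
    // Keep the caller's other options (for example delegatesFocus) and override the mode.
    return original.call(this, { ...init, mode: 'open' })
  }
}
```

In Puppeteer, await page.evaluateOnNewDocument(forceOpenShadowRoots) before page.goto(url) installs the patch in each new document ahead of the page's own scripts, after which >>> and host.shadowRoot should work as they do for ordinary open roots.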

Use this when

You are scraping a site built from web components (Polymer, Lit, Stencil, or hand-rolled custom elements) and the values you want render under an open #shadow-root, where a normal selector returns nothing.

Skip this when

The shadow root is closed (look for the data in an XHR response or an inline JSON payload instead); the content is plain light-DOM markup (use ordinary CSS selectors); the page never runs JavaScript (a static fetch plus a parser is lighter than a browser); or you only need the article body as text (run a readability pass).
