
How to scrape an iframe's contents in Puppeteer

Updated 2026-06-18 · 6 min read

If you're scraping a page where the content you want sits inside an iframe, you're probably watching your selectors come back empty even though you can see the text right there in the browser. An iframe is a separate browsing context with its own document, so the parent page's document.querySelector stops at the <iframe> tag and does not reach what's underneath. It gets more complicated when the frame is cross-origin or sandboxed.

The solution is to get a handle to the right frame object first and run your extraction inside it, so the framed document's own DOM is what your code reads. Puppeteer models every iframe as a Frame with the same evaluate and waitForSelector methods as a page, so once you cross into it the work is ordinary scraping. The full script is about 40 lines of Node.js with one dependency, the puppeteer package.

Key terms

  • Browsing context. An independent document and window. The page and each iframe are separate browsing contexts.
  • Frame. Puppeteer's handle to one browsing context. The main page is a frame, and every iframe is a child frame.
  • contentFrame(). Called on an <iframe> element handle, it returns the Frame for the document inside.
  • Cross-origin frame. One whose URL does not share the page's origin. The browser blocks parent-page contentDocument access to it, while Puppeteer can work with an attached Frame handle.

Here is what the script does:

  • Launch Chromium with Puppeteer and navigate to a page that embeds an iframe.
  • Get a handle to the <iframe> element, then call elementHandle.contentFrame() to cross into the frame's own document.
  • Wait for a known selector inside the frame so extraction runs only after the framed content has loaded.
  • Run frame.evaluate() to read text and attributes from inside the frame and return plain JavaScript values.
  • Show the alternative path: walk page.frames() and match on frame.url() when you do not have the <iframe> element handle.

The complete script

```js
// scrape-iframe.mjs
import puppeteer from 'puppeteer'

const browser = await puppeteer.launch({ headless: true })
const page = await browser.newPage()

// MDN's iframe reference page embeds a live demo in an <iframe>.
await page.goto('https://developer.mozilla.org/en-US/docs/Web/HTML/Element/iframe', {
  waitUntil: 'networkidle2'
})

// 1. Get a handle to the <iframe> element in the parent document.
//    The frame may render after navigation, so wait for the element itself.
const frameElement = await page.waitForSelector('iframe.sample-code-frame', { timeout: 15000 })

// 2. Cross from the <iframe> element into its own document.
//    contentFrame() returns a Frame, which has the same query/evaluate API as a Page.
const frame = await frameElement.contentFrame()
if (!frame) throw new Error('iframe has no attached content frame yet')

// 3. Wait for content *inside the frame*. The parent being idle does not mean
//    the framed document has painted. Wait on a selector the frame owns.
await frame.waitForSelector('body', { timeout: 15000 })

// 4. Extract inside the frame. This runs in the frame's context, so its own
//    document.querySelector sees the framed DOM, not the parent's.
const data = await frame.evaluate(() => {
  const links = [...document.querySelectorAll('a')].map(a => ({
    text: a.textContent.trim(),
    href: a.href
  }))
  return {
    title: document.title,
    bodyText: document.body.innerText.slice(0, 200),
    linkCount: links.length,
    firstLinks: links.slice(0, 3)
  }
})

console.log(JSON.stringify(data, null, 2))

await browser.close()
```

```bash
npm install puppeteer
node scrape-iframe.mjs
```

What each step does

Get the iframe element, then contentFrame(). page.waitForSelector('iframe...') returns an ElementHandle for the <iframe> tag in the parent. Calling frameElement.contentFrame() on that handle returns the Frame for the document inside. This is the cleanest path when you can target the iframe with a CSS selector. contentFrame() can return null when the element has no attached frame yet, so guard the result.

Wait for a selector the frame owns. Puppeteer drives Chromium over the DevTools Protocol, and frame attachment is asynchronous. The parent reaching networkidle2 says nothing about whether the framed document has parsed. Calling frame.waitForSelector('body') (or a more specific element you plan to read) blocks until that node exists inside the frame, so your extraction is not racing the load.

Extract with frame.evaluate(). The callback runs inside the frame's JavaScript context. Its document is the framed document, so document.querySelectorAll('a') returns the frame's links, not the parent's. Whatever you return must be serializable, since it crosses the protocol boundary back to Node as JSON. DOM nodes do not survive the trip; return strings, numbers, and plain objects.

The page.frames() fallback. When you cannot select the <iframe> element, for example an ad slot with no stable class, walk every frame on the page and match on URL. page.frames() returns a flat array of all frames including nested ones, so page.frames().find(f => f.url().includes('embed')) finds a frame by its src without ever touching the parent DOM. Use frame.parentFrame() and frame.childFrames() when you need to recover the hierarchy.
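The fallback described above can be sketched as two small helpers. This is a minimal sketch rather than part of the script: findFramesByUrl and frameDepth are hypothetical names, 'embed' is a placeholder URL fragment, and `page` is assumed to be the Puppeteer Page from the script. The helpers rely only on page.frames(), frame.url(), and frame.parentFrame().

```javascript
// Hypothetical helpers; `page` is assumed to be the Puppeteer Page from the script.

// Return every attached frame whose URL contains `fragment`.
// page.frames() is flat, so nested frames are included.
function findFramesByUrl(page, fragment) {
  return page.frames().filter(frame => frame.url().includes(fragment))
}

// Count a frame's ancestors: 0 for the main frame, 1 for a top-level iframe,
// 2 for an iframe inside an iframe, and so on.
function frameDepth(frame) {
  let depth = 0
  for (let f = frame.parentFrame(); f; f = f.parentFrame()) depth += 1
  return depth
}

// Usage sketch (assumes some frame has 'embed' in its URL):
//   const [target] = findFramesByUrl(page, 'embed')
//   if (!target) throw new Error('no frame matched')
//   console.log(target.url(), 'depth', frameDepth(target))
```

Returning all matches instead of the first one keeps the choice explicit when a page embeds the same URL more than once.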

Gotchas

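  • Cross-origin frames are readable through Puppeteer but blocked from parent-page script.

    • Issue: reading an iframe from another origin in parent-page script, for example page.evaluate(() => document.querySelector('iframe').contentDocument), is blocked by the same-origin policy: contentDocument is null for a cross-origin frame.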
    • Fix: Puppeteer's Frame API talks to each context directly over the DevTools Protocol, so go through contentFrame() or page.frames() instead of reaching into contentDocument from the parent script.
  • The frame loads after navigation.

    • Issue: many iframes inject after the parent's load event, so contentFrame() returns null or the frame's URL is still about:blank.
    • Fix: wait for the target frame with await page.waitForFrame(f => f.url().includes('embed'), { timeout: 15000 }). On older Puppeteer versions without waitForFrame, loop on page.frames() with a short delay until the match shows up.
  • The frame detaches mid-extraction.

    • Issue: if the frame navigates or the parent removes it while frame.evaluate() is running, you get Error: Execution context was destroyed or Attempted to use detached Frame.
    • Fix: re-fetch the frame from page.frames() after any action that could reload it, and wrap the evaluate in a try/catch that retries once on detachment.
  • Nested iframes have no hierarchy in the flat frame list.

    • Issue: page.frames() returns every frame in one flat array, so the array alone does not tell you which frame is nested inside which.
    • Fix: match the inner frame's URL in the flat page.frames() list and confirm its position with frame.parentFrame(), or descend from page.mainFrame() with frame.childFrames() one level at a time.
  • srcdoc and sandboxed frames behave differently.

    • Issue: a srcdoc frame has a document but its URL is about:srcdoc, so matching by src misses it, and sandbox flags can change which scripts and origins are available inside the frame.
    • Fix: match srcdoc frames on the about:srcdoc URL. The same-origin policy only blocks parent-page script from reading contentDocument; it does not by itself block Puppeteer's Frame handle. Whether Puppeteer can read a given frame depends on the frame being attached and on its sandbox flags.
  • The same src appears in multiple frames.

    • Issue: if a page embeds the same widget twice, page.frames().find() returns the first match only.
    • Fix: use page.frames().filter(f => f.url().includes('widget')) and index the one you want, or disambiguate by also checking frame.name().
  • waitUntil: 'networkidle2' is too early for lazy frames.

    • Issue: a frame that loads on scroll or click has not started its request when the parent goes idle, so waiting on the parent finds nothing.
    • Fix: trigger the interaction first (page.click, or page.evaluate(() => window.scrollTo(...))), then wait for the frame.
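Two of the fixes above, looping on page.frames() for a late frame and retrying once after a detachment, can be sketched as follows. This is a minimal sketch under stated assumptions: pollForFrame and evaluateWithRetry are hypothetical names, the 250 ms interval and 15000 ms timeout are arbitrary defaults, and the error-message match covers only the two messages quoted in the list.

```javascript
// Hypothetical helpers; names, defaults, and the 'embed' fragment are assumptions.

const sleep = ms => new Promise(resolve => setTimeout(resolve, ms))

// Poll page.frames() until a frame matches `predicate` or the timeout passes.
// Prefer page.waitForFrame() where your Puppeteer version provides it.
async function pollForFrame(page, predicate, { timeout = 15000, interval = 250 } = {}) {
  const deadline = Date.now() + timeout
  while (true) {
    const match = page.frames().find(predicate)
    if (match) return match
    if (Date.now() >= deadline) throw new Error(`no matching frame after ${timeout} ms`)
    await sleep(interval)
  }
}

// Look the frame up, run `fn` inside it, and retry exactly once (with a fresh
// lookup) if the frame detached or its execution context was destroyed.
async function evaluateWithRetry(getFrame, fn) {
  try {
    return await (await getFrame()).evaluate(fn)
  } catch (err) {
    const message = String(err && err.message)
    const detached = /detached Frame|Execution context was destroyed/i.test(message)
    if (!detached) throw err
    return (await getFrame()).evaluate(fn)
  }
}

// Usage sketch:
//   const isEmbed = f => f.url().includes('embed')
//   const title = await evaluateWithRetry(
//     () => pollForFrame(page, isEmbed),
//     () => document.title
//   )
```

Passing a lookup function instead of a Frame means the retry works on a newly fetched frame, not on the detached handle.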

Use this when

You need data that lives inside an embedded document - a third-party comment widget, an embedded form, a payment or video frame, a live code demo, or an ad slot - and the parent page's selectors return nothing because the content is in a separate browsing context.

Skip this when

The content you want is actually in the main document and you only thought it was framed (check page.frames().length first; a value of 1 means only the main frame exists); the data arrives over an XHR or fetch request that the frame makes (intercept the response instead of reading the rendered DOM); the page has no JavaScript-injected frames and a plain HTTP fetch plus an HTML parser would do; or sandbox flags prevent the frame-side script you need, in which case Puppeteer cannot extract that data with this pattern.
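The interception alternative mentioned above might look like the sketch below. It is a rough illustration, not a drop-in replacement: collectJsonResponses is a hypothetical name, '/api/' is a placeholder URL fragment, and it assumes the frame's request is visible to the page-level 'response' event and returns a JSON body.

```javascript
// Hypothetical helper; the name and the '/api/' fragment are assumptions.
// Register it BEFORE the navigation or interaction that triggers the request.
function collectJsonResponses(page, urlFragment) {
  const captured = []

  const onResponse = async response => {
    if (!response.url().includes(urlFragment)) return
    const type = response.request().resourceType()
    if (type !== 'xhr' && type !== 'fetch') return
    try {
      captured.push({ url: response.url(), body: await response.json() })
    } catch {
      // Body was not JSON or was no longer available; skip it.
    }
  }

  page.on('response', onResponse)
  // `stop` removes the listener once you have what you need.
  return { captured, stop: () => page.off('response', onResponse) }
}

// Usage sketch:
//   const { captured, stop } = collectJsonResponses(page, '/api/')
//   await page.goto(url, { waitUntil: 'networkidle2' })
//   stop()
//   console.log(captured)
```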
