How to scrape an iframe's contents in Puppeteer
If you're scraping a page where the content you want sits inside an iframe, you're probably watching your selectors come back empty even though you can see the text right there in the browser. An iframe is a separate browsing context with its own document, so the parent page's document.querySelector stops at the <iframe> tag and does not reach what's underneath. It gets more complicated when the frame is cross-origin or sandboxed.
The solution is to get a handle to the right frame object first and run your extraction inside it, so the framed document's own DOM is what your code reads. Puppeteer models every iframe as a Frame with the same evaluate and waitForSelector methods as a page, so once you cross into it the work is ordinary scraping. That takes about 40 lines of Node.js with one dependency, the puppeteer package.
Key terms
- Browsing context. An independent document and window. The page and each iframe are separate browsing contexts.
- Frame. Puppeteer's handle to one browsing context. The main page is a frame, and every iframe is a child frame.
- contentFrame(). Called on an <iframe> element handle, it returns the Frame for the document inside.
- Cross-origin frame. One whose URL does not share the page's origin. The browser blocks parent-page contentDocument access to it, while Puppeteer can work with an attached Frame handle.
Here is what the script does:
- Launch Chromium with Puppeteer and navigate to a page that embeds an iframe.
- Get a handle to the <iframe> element, then call elementHandle.contentFrame() to cross into the frame's own document.
- Wait for a known selector inside the frame so extraction runs only after the framed content has loaded.
- Run frame.evaluate() to read text and attributes from inside the frame and return plain JavaScript values.
- Show the alternative path: walk page.frames() and match on frame.url() when you do not have the <iframe> element handle.
The complete script
// scrape-iframe.mjs
import puppeteer from 'puppeteer'
const browser = await puppeteer.launch({ headless: true })
const page = await browser.newPage()
// MDN's iframe reference page embeds a live demo in an <iframe>.
await page.goto('https://developer.mozilla.org/en-US/docs/Web/HTML/Element/iframe', {
  waitUntil: 'networkidle2'
})
// 1. Get a handle to the <iframe> element in the parent document.
// The frame may render after navigation, so wait for the element itself.
const frameElement = await page.waitForSelector('iframe.sample-code-frame', { timeout: 15000 })
// 2. Cross from the <iframe> element into its own document.
// contentFrame() returns a Frame, which has the same query/evaluate API as a Page.
const frame = await frameElement.contentFrame()
if (!frame) throw new Error('iframe has no attached content frame yet')
// 3. Wait for content *inside the frame*. The parent being idle does not mean
// the framed document has painted. Wait on a selector the frame owns.
await frame.waitForSelector('body', { timeout: 15000 })
// 4. Extract inside the frame. This runs in the frame's context, so its own
// document.querySelector sees the framed DOM, not the parent's.
const data = await frame.evaluate(() => {
  const links = [...document.querySelectorAll('a')].map(a => ({
    text: a.textContent.trim(),
    href: a.href
  }))
  return {
    title: document.title,
    bodyText: document.body.innerText.slice(0, 200),
    linkCount: links.length,
    firstLinks: links.slice(0, 3)
  }
})
console.log(JSON.stringify(data, null, 2))
await browser.close()
Install the dependency and run the script:
npm install puppeteer
node scrape-iframe.mjs
What each step does
Get the iframe element, then contentFrame(). page.waitForSelector('iframe...') returns an ElementHandle for the <iframe> tag in the parent. Calling frameElement.contentFrame() on that handle returns the Frame for the document inside. This is the cleanest path when you can target the iframe with a CSS selector. contentFrame() returns null when the element has no attached frame yet, so guard against that before using the result.
Wait for a selector the frame owns. Puppeteer drives Chromium over the DevTools Protocol, and frame attachment is asynchronous. The parent reaching networkidle2 says nothing about whether the framed document has parsed. Calling frame.waitForSelector('body') (or a more specific element you plan to read) blocks until that node exists inside the frame, so your extraction is not racing the load.
Extract with frame.evaluate(). The callback runs inside the frame's JavaScript context. Its document is the framed document, so document.querySelectorAll('a') returns the frame's links, not the parent's. Whatever you return must be serializable, since it crosses the protocol boundary back to Node as JSON. DOM nodes do not survive the trip; return strings, numbers, and plain objects.
The page.frames() fallback. When you cannot select the <iframe> element, for example an ad slot with no stable class, walk every frame on the page and match on URL. page.frames() returns a flat array of all frames including nested ones, so page.frames().find(f => f.url().includes('embed')) finds a frame by its src without ever touching the parent DOM. Use frame.parentFrame() and frame.childFrames() when you need to recover the hierarchy.
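The fallback path can be sketched as a small helper. This is a minimal sketch, not part of the script above: the name scrapeFrameByUrl and the URL fragment you pass in are hypothetical, and page is assumed to be a Puppeteer Page that is already navigated.

```javascript
// Hypothetical helper: find a frame by URL fragment and read from it,
// without ever selecting the <iframe> element in the parent DOM.
// `page` is assumed to be a Puppeteer Page.
async function scrapeFrameByUrl(page, fragment) {
  // page.frames() is a flat list that includes nested frames.
  const frame = page.frames().find((f) => f.url().includes(fragment))
  if (!frame) return null

  // Same guard as the main script: wait on something the frame owns.
  await frame.waitForSelector('body', { timeout: 15000 })

  // Runs inside the frame, so `document` is the framed document.
  return frame.evaluate(() => ({
    title: document.title,
    text: document.body.innerText.slice(0, 200)
  }))
}
```

Calling await scrapeFrameByUrl(page, 'embed') after page.goto() would return the frame's title and first 200 characters of text, or null if no frame URL contains the fragment.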
Gotchas
Cross-origin frames are fine in Puppeteer, blocked in the browser.
- Issue: reading an iframe from another origin with page.evaluate(() => iframe.contentDocument) fails on the same-origin policy.
- Fix: Puppeteer's Frame API talks to each context directly over the DevTools Protocol, so go through contentFrame() or page.frames() instead of reaching into contentDocument from the parent script.
The frame loads after navigation.
- Issue: many iframes inject after the parent's load event, so contentFrame() returns null or the frame's URL is still about:blank.
- Fix: poll until the target frame appears with await page.waitForFrame(f => f.url().includes('embed'), { timeout: 15000 }). On older Puppeteer without waitForFrame, loop on page.frames() with a short delay until the match shows up.
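For Puppeteer versions without waitForFrame, the polling loop might look like the sketch below. The helper name pollForFrame and its timeout and interval options are assumptions for illustration, not Puppeteer API.

```javascript
// Hypothetical fallback for Puppeteer versions without page.waitForFrame.
// Polls page.frames() until a frame URL contains `fragment`, or times out.
async function pollForFrame(page, fragment, { timeout = 15000, interval = 100 } = {}) {
  const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms))
  const deadline = Date.now() + timeout

  while (true) {
    const match = page.frames().find((f) => f.url().includes(fragment))
    if (match) return match

    if (Date.now() >= deadline) {
      throw new Error(`No frame with URL containing "${fragment}" after ${timeout} ms`)
    }
    await sleep(interval)
  }
}
```

Matching on the URL fragment also skips frames that are still at about:blank, since their URL does not contain the fragment yet.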
The frame detaches mid-extraction.
- Issue: if the frame navigates or the parent removes it while frame.evaluate() is running, you get Error: Execution context was destroyed or Attempted to use detached Frame.
- Fix: re-fetch the frame from page.frames() after any action that could reload it, and wrap the evaluate in a try/catch that retries once on detachment.
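One way to write the retry-once wrapper is sketched below. The helper names are hypothetical, and detecting detachment by matching the two error messages quoted above is an assumption; the exact wording can vary between Puppeteer versions.

```javascript
// Assumption: detachment is recognised by the two messages quoted in this article.
function isDetachError(err) {
  const message = String(err && err.message)
  return /Execution context was destroyed|detached Frame/i.test(message)
}

// Hypothetical helper: look the frame up fresh on every attempt and retry
// the evaluate once if the first attempt fails because the frame went away.
async function evaluateInFrameWithRetry(page, fragment, fn) {
  for (let attempt = 1; attempt <= 2; attempt++) {
    const frame = page.frames().find((f) => f.url().includes(fragment))
    if (!frame) throw new Error(`No frame with URL containing "${fragment}"`)
    try {
      return await frame.evaluate(fn)
    } catch (err) {
      // Give up on the second failure, or on any error that is not a detach.
      if (attempt === 2 || !isDetachError(err)) throw err
    }
  }
}
```

Re-reading page.frames() inside the loop matters here: retrying on the old Frame object would hit the same detached handle again.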
Nested iframes have no hierarchy in the flat frame list.
- Issue: page.frames() returns every frame in one flat array, so the parent/child structure you need to reach an iframe inside an iframe is not represented by array nesting.
- Fix: match the inner frame's URL in the flat page.frames() list, recover structure with frame.parentFrame() and frame.childFrames(), or descend the tree with frame.childFrames() step by step.
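Descending the tree with childFrames() can be sketched with two small recursive helpers. Both names are hypothetical; the only Puppeteer calls assumed are frame.url() and frame.childFrames().

```javascript
// Hypothetical helper: render the frame hierarchy as indented lines,
// one line per frame, two spaces of indent per nesting level.
function frameTreeLines(frame, depth = 0) {
  const line = `${'  '.repeat(depth)}${frame.url() || '(empty url)'}`
  const childLines = frame.childFrames().flatMap((child) => frameTreeLines(child, depth + 1))
  return [line, ...childLines]
}

// Hypothetical helper: depth-first search below `frame` for the first
// descendant that satisfies `predicate`. Returns null if none does.
function findDescendant(frame, predicate) {
  for (const child of frame.childFrames()) {
    if (predicate(child)) return child
    const deeper = findDescendant(child, predicate)
    if (deeper) return deeper
  }
  return null
}
```

Starting from page.mainFrame(), console.log(frameTreeLines(page.mainFrame()).join('\n')) prints the nesting, and findDescendant(outerFrame, f => f.url().includes('inner')) limits the search to frames inside one particular iframe.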
srcdoc and sandboxed frames behave differently.
- Issue: a srcdoc frame has a document but its URL is about:srcdoc, so matching by src misses it, and sandbox flags can change which scripts and origins are available inside the frame.
- Fix: match srcdoc frames on the about:srcdoc URL. Puppeteer access depends on whether a frame is attached and what sandbox flags are set; same-origin policy blocks parent-page contentDocument access, not Puppeteer's Frame handle by itself.
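A quick way to see which frames a URL match can and cannot reach is to classify them first. This is a rough sketch: classifyFrames is a hypothetical name, and it assumes, as described above, that srcdoc frames report about:srcdoc from frame.url().

```javascript
// Hypothetical helper: label each frame by the kind of URL it reports,
// so srcdoc and still-blank frames are easy to spot before matching on src.
function classifyFrames(page) {
  return page.frames().map((f) => {
    const url = f.url()
    let kind = 'url'
    if (url === 'about:srcdoc') kind = 'srcdoc'
    else if (url === 'about:blank' || url === '') kind = 'blank'
    return { kind, url, frame: f }
  })
}
```

Filtering the result with .filter(entry => entry.kind === 'srcdoc') gives the frames that a src-based match would miss.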
The same src appears in multiple frames.
- Issue: if a page embeds the same widget twice, page.frames().find() returns the first match only.
- Fix: use page.frames().filter(f => f.url().includes('widget')) and index the one you want, or disambiguate by also checking frame.name().
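The filter-then-pick approach could be wrapped as follows. The helper name pickFrame and its name and index options are hypothetical; it relies only on frame.url() and frame.name().

```javascript
// Hypothetical helper: collect every frame whose URL contains `fragment`,
// optionally narrow by frame name, then pick one by position.
function pickFrame(page, fragment, { name, index = 0 } = {}) {
  let matches = page.frames().filter((f) => f.url().includes(fragment))
  if (name !== undefined) {
    matches = matches.filter((f) => f.name() === name)
  }
  return matches[index] ?? null
}
```

For example, pickFrame(page, 'widget', { index: 1 }) returns the second embed, while pickFrame(page, 'widget', { name: 'sidebar' }) picks by name, where 'sidebar' stands in for whatever name attribute the page actually uses.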
waitUntil: 'networkidle2' is too early for lazy frames.
- Issue: a frame that loads on scroll or click has not started its request when the parent goes idle, so waiting on the parent finds nothing.
- Fix: trigger the interaction first (page.click, or page.evaluate(() => window.scrollTo(...))), then wait for the frame.
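For a scroll-triggered frame, the trigger-then-wait order might be sketched like this. scrollThenWaitForFrame is a hypothetical name, and scrolling to the bottom of the page is only one possible trigger; a click-triggered frame would use page.click() in its place.

```javascript
// Hypothetical helper: scroll to the bottom of the page to trigger a
// lazily loaded iframe, then wait for a frame whose URL contains `fragment`.
async function scrollThenWaitForFrame(page, fragment, timeout = 15000) {
  // 1. Trigger the interaction that makes the page inject or load the frame.
  await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight))

  // 2. Only then wait for the frame itself.
  return page.waitForFrame((f) => f.url().includes(fragment), { timeout })
}
```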
Use this when
You need data that lives inside an embedded document - a third-party comment widget, an embedded form, a payment or video frame, a live code demo, or an ad slot - and the parent page's selectors return nothing because the content is in a separate browsing context.
Skip this when
The content you want is actually in the main document and you only thought it was framed (check with page.frames().length first); the data arrives over an XHR or fetch the frame makes (intercept the request instead of reading the rendered DOM); the page has no JavaScript-injected frames and a plain HTTP fetch plus an HTML parser would do; or sandbox flags prevent the frame-side script you need, in which case Puppeteer cannot extract that data with this pattern.
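If the data arrives over an XHR or fetch, intercepting the response could look roughly like the sketch below. The helper name collectJsonResponses and the URL fragment are hypothetical, and the sketch assumes the payload is JSON.

```javascript
// Hypothetical helper: record the JSON body of every response whose URL
// contains `fragment`. Register it before page.goto() so no response is missed.
function collectJsonResponses(page, fragment) {
  const payloads = []
  page.on('response', async (response) => {
    if (!response.url().includes(fragment)) return
    try {
      payloads.push(await response.json())
    } catch {
      // Body was not JSON or was unavailable; this sketch just skips it.
    }
  })
  return payloads
}
```

The returned array fills in as matching responses arrive, so read it after the navigation or interaction that triggers the request has finished. A cheap first check is page.frames().length: if it is 1, only the main frame exists and the content is not framed at all.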