How to scrape an infinite scroll page in Puppeteer
If you're scraping a feed that keeps loading more rows as you scroll, a plain fetch or a single page.content() only ever gives you the first screen of items. The rest are not in the initial HTML at all. They arrive batch by batch over background requests as the viewport reaches the bottom, so until something does the scrolling, most of the list stays out of reach.
The solution is to drive the scrolling yourself from a loop of page.evaluate calls: jump to the bottom, wait for the page height to grow, and stop once the height holds steady across two tries in a row or a scroll ceiling is hit. After that the whole list sits in the DOM and you read it out in one pass. It takes about 55 lines of Node.js with one open-source library.
Key terms
- page.evaluate. A Puppeteer method that runs a function inside the page's own JavaScript context, used here so the scroll happens in the live page where window.scrollTo and document.body.scrollHeight exist.
- scrollHeight. The full pixel height of the page's scrollable content, which grows each time a new batch of items is appended. Comparing it before and after a scroll is how the loop knows whether more loaded.
- Settle delay. The pause after each scroll that gives the background request time to return and the framework time to append the new rows before the height is measured again.
- $$eval. A Puppeteer method that runs a function over every element matching a selector inside the page and returns the result, used here to read the items out of the fully expanded DOM.
Here is what the script does:
- Launch headless Chrome with Puppeteer and open the feed that loads its content as you scroll.
- Scroll to the bottom inside a page.evaluate loop, pausing after each jump for the next batch to load.
- Compare scrollHeight before and after each scroll, stopping once it stops growing across two tries or a scroll ceiling is hit.
- Read the items out of the expanded DOM in one pass with $$eval.
The complete script
```javascript
// infinite-scroll.mjs
import puppeteer from 'puppeteer'

const url = 'https://www.scrapingcourse.com/infinite-scrolling'

const browser = await puppeteer.launch({ headless: true })
const page = await browser.newPage()
await page.goto(url, { waitUntil: 'networkidle2' })

/* Scroll to the bottom repeatedly until the page stops getting taller.
   scrollHeight grows each time a new batch is appended; when it holds
   steady across `maxStaleScrolls` tries, the feed is exhausted. */
async function scrollToBottom({ settleMs = 1500, maxScrolls = 100, maxStaleScrolls = 2 } = {}) {
  let previousHeight = 0
  let staleScrolls = 0
  let scrolls = 0

  while (staleScrolls < maxStaleScrolls && scrolls < maxScrolls) {
    /* Jump to the bottom and read the height from inside the page. */
    const currentHeight = await page.evaluate(() => {
      window.scrollTo(0, document.body.scrollHeight)
      return document.body.scrollHeight
    })

    /* Give the background request time to return and the rows time to render. */
    await new Promise(resolve => setTimeout(resolve, settleMs))

    const newHeight = await page.evaluate(() => document.body.scrollHeight)

    if (newHeight > currentHeight || currentHeight > previousHeight) {
      staleScrolls = 0 /* The page grew, so there is probably more to load. */
    } else {
      staleScrolls++ /* No growth. Could be a slow batch, so retry before stopping. */
    }

    previousHeight = newHeight
    scrolls++
  }
}

await scrollToBottom()

/* The full list is in the DOM now. Read every card in one pass. */
const items = await page.$$eval('.product-item', els =>
  els.map(el => ({
    name: el.querySelector('.product-name')?.textContent.trim() ?? null,
    price: el.querySelector('.product-price')?.textContent.trim() ?? null
  }))
)

console.log(`Loaded ${items.length} items`)
console.log(items.slice(0, 5))

await browser.close()
```

```shell
npm install puppeteer
node infinite-scroll.mjs
```

What each step does
Open the page with networkidle2. Infinite-scroll feeds usually fetch their first batch over a background request after the initial HTML lands. Waiting for the network to settle means the first items are present before the loop starts, so the height measured on the first pass is the real starting height and not the height of an empty shell.
Scroll from inside page.evaluate. The scroll has to happen in the page's own context, because window.scrollTo and document.body.scrollHeight only exist there. Returning scrollHeight from the same call gives you the height the moment after the jump, before the new batch has had time to load, which is the number you compare against later.
Pause for a settle delay after each scroll. The batch request fires when the viewport reaches the bottom, then the framework appends the rows a moment later. The settleMs wait covers both. Set it too low and you measure the height before the new rows render and exit early; 1500ms is a reasonable starting point and you can tune it per site.
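One way to avoid guessing the delay is to wait for the height itself to change, with the delay acting only as an upper bound. The sketch below uses Puppeteer's page.waitForFunction for that. The helper name waitForGrowth and its parameters are illustrative and not part of the script above, and treating every rejection as "no growth" is a simplifying assumption.

```javascript
/* Hypothetical helper: wait until the page is taller than `baselineHeight`,
   giving up after `maxWaitMs`. Resolves to true if the page grew in time. */
async function waitForGrowth(page, baselineHeight, maxWaitMs = 1500) {
  try {
    /* The function runs inside the page; `baselineHeight` is passed in as `height`. */
    await page.waitForFunction(
      height => document.body.scrollHeight > height,
      { timeout: maxWaitMs },
      baselineHeight
    )
    return true
  } catch {
    /* Assumption: any rejection, including the timeout, counts as no growth.
       A stricter version would rethrow errors that are not timeouts. */
    return false
  }
}
```

Inside the loop this could stand in for the fixed setTimeout: pass the height returned by the scroll call as the baseline and count a false result as a stale scroll. A fast batch then costs only as long as it takes to render, while a slow one still gets the full maxWaitMs.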
Stop on a stale-height count, not the first non-growth. A single scroll that adds no height might just be a slow batch. The staleScrolls counter resets whenever the page grows and ends the loop only after two no-growth scrolls in a row, so one slow response does not end the run with half the feed. The maxScrolls ceiling is the backstop for a feed that loads without end.
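To see the stop rule in isolation, the sketch below replays it over a made-up list of heights, one per scroll, with no browser involved. It is a simplification of the script's check (one height per scroll instead of two measurements), and the function name scrollsBeforeStop and the sample heights are invented for illustration.

```javascript
/* Replays the stale-count rule over recorded heights, one per scroll.
   Returns how many scrolls run before the loop gives up. */
function scrollsBeforeStop(heights, maxStaleScrolls = 2) {
  let previousHeight = 0
  let staleScrolls = 0
  let scrolls = 0

  for (const height of heights) {
    if (staleScrolls >= maxStaleScrolls) break
    /* Growth resets the counter; a flat reading increments it. */
    staleScrolls = height > previousHeight ? 0 : staleScrolls + 1
    previousHeight = height
    scrolls++
  }
  return scrolls
}

/* The flat reading at scroll 3 is forgiven because scroll 4 grows again.
   Scrolls 5 and 6 are both flat, so the loop stops before the last entry. */
console.log(scrollsBeforeStop([1000, 2000, 2000, 3000, 3000, 3000, 4000])) // 6
```

With maxStaleScrolls set to 1, the same input stops after scroll 3, which is the early exit the counter is there to prevent.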
Read the items after the loop, not during it. Once scrolling stops, the whole list is in the DOM, so a single $$eval pass over .product-item collects every card. Reading once at the end is simpler than collecting on each scroll and deduplicating, since the appended rows stay in the document.
Gotchas
The settle delay is too short, so the loop exits early.
- Issue: with settleMs set low, the height is measured before the background request returns and the rows render, so the loop sees no growth and stops with only the first batch or two collected.
- Fix: raise settleMs, or replace the fixed wait with page.waitForResponse keyed to the batch request's URL so the wait ends exactly when the data lands instead of after a guess.
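The waitForResponse variant might look like the sketch below. The helper name scrollAndAwaitBatch is hypothetical, and the predicate that recognizes the batch request is something you supply after finding the real endpoint in the browser's network panel; nothing here assumes a particular URL.

```javascript
/* Hypothetical helper: scroll once and wait for the batch response.
   `isBatchResponse` is a caller-supplied predicate over Puppeteer responses.
   Resolves to true if a matching response arrived within `maxWaitMs`. */
async function scrollAndAwaitBatch(page, isBatchResponse, maxWaitMs = 5000) {
  /* Start listening before scrolling so a fast response is not missed. */
  const batchArrived = page
    .waitForResponse(isBatchResponse, { timeout: maxWaitMs })
    .then(() => true, () => false)

  await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight))
  return batchArrived
}
```

A call could look like scrollAndAwaitBatch(page, res => res.url().includes('/api/products') && res.ok()), where '/api/products' is a placeholder path. The response landing does not guarantee the rows have rendered yet, so a short pause or a height check afterwards is still a sensible precaution.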
The page scrolls a container, not the window.
- Issue: some feeds put the scrollbar on an inner <div> with overflow: auto, so window.scrollTo moves nothing and document.body.scrollHeight never changes, and the loop stops after its first few passes with nothing new loaded.
- Fix: scroll the container instead. Select it and set its scrollTop to its scrollHeight inside the evaluate call, and measure that element's scrollHeight rather than the body's.
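A minimal sketch of the container version follows, using Puppeteer's page.$eval to run the scroll against one element. The helper name scrollContainerToBottom is made up for illustration, and the selector is whatever identifies the scrolling element on your page.

```javascript
/* Hypothetical helper: scroll an inner container to its bottom and
   return that container's scrollHeight, measured inside the page. */
async function scrollContainerToBottom(page, containerSelector) {
  /* page.$eval rejects if nothing matches, which surfaces a wrong selector early. */
  return page.$eval(containerSelector, el => {
    el.scrollTop = el.scrollHeight
    return el.scrollHeight
  })
}
```

In the loop, this call would replace the window.scrollTo step, and the second height reading would come from the same element, for example with page.$eval(containerSelector, el => el.scrollHeight).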
The feed triggers on an element near the bottom, not the exact bottom.
- Issue: an IntersectionObserver sentinel sometimes fires a screen early or only when an element scrolls fully into view, so a hard jump to scrollHeight can skip past the trigger without firing it.
- Fix: scroll in steps with window.scrollBy(0, window.innerHeight) and a short pause between steps, which crosses every trigger point on the way down instead of leaping over it.
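Stepping down one viewport at a time could be sketched as below. The helper name scrollDownInSteps, the 250 ms pause, and the 50-step cap are illustrative defaults, not values taken from the script.

```javascript
/* Hypothetical helper: move down one viewport per step until the viewport's
   lower edge reaches the end of the content or `maxSteps` is hit.
   Returns the number of steps taken. */
async function scrollDownInSteps(page, { stepPauseMs = 250, maxSteps = 50 } = {}) {
  let steps = 0
  while (steps < maxSteps) {
    const reachedBottom = await page.evaluate(() => {
      window.scrollBy(0, window.innerHeight)
      /* 1px of slack absorbs fractional scroll positions. */
      return window.scrollY + window.innerHeight >= document.body.scrollHeight - 1
    })
    steps++
    /* Pause so any observer near this position has a chance to fire. */
    await new Promise(resolve => setTimeout(resolve, stepPauseMs))
    if (reachedBottom) break
  }
  return steps
}
```

This covers a single descent to the current bottom. It would take the place of the one-shot jump inside the main loop, with the height comparison and stale counter left as they are.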
Loaded images stay blank because they lazy-load on view.
- Issue: rows that scrolled by quickly keep a placeholder src and a real data-src, so reading img.src returns the placeholder for items the viewport passed too fast.
- Fix: read the data-src attribute, or scroll in smaller steps so each batch sits in view long enough to swap in its image. See How to scrape lazy-loaded images.
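Reading data-src with a fallback to src might look like this. The helper name readImageUrls and the default selector '.product-item img' are assumptions for illustration; sites also use other attribute names for the deferred URL, so check the markup first.

```javascript
/* Hypothetical helper: collect image URLs, preferring the deferred
   data-src value and falling back to src when it is absent. */
async function readImageUrls(page, imageSelector = '.product-item img') {
  return page.$$eval(imageSelector, imgs =>
    imgs.map(img => img.getAttribute('data-src') || img.getAttribute('src'))
  )
}
```

getAttribute returns the raw attribute text, so a relative path stays relative. If you need absolute URLs, resolve each value against the page URL afterwards.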
The feed truly never ends, so the run never stops.
- Issue: a social or search feed can load for as long as you scroll, turning the loop into a scrape that fills memory with DOM nodes and never reaches a stable height.
- Fix: keep the maxScrolls ceiling and lower it to a count that covers the rows you need, or break once items.length passes a target you set for the job.
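A count-based stop could be sketched as follows. The helper name scrollUntilItemCount and its parameters are hypothetical. It counts matching elements in the page after each scroll rather than building the full items array every time.

```javascript
/* Hypothetical helper: scroll until at least `targetCount` elements match
   `itemSelector`, or until `maxScrolls` is reached. Returns the final count. */
async function scrollUntilItemCount(page, itemSelector, targetCount, { settleMs = 1500, maxScrolls = 100 } = {}) {
  const countItems = () => page.$$eval(itemSelector, els => els.length)

  let count = await countItems()
  let scrolls = 0
  while (count < targetCount && scrolls < maxScrolls) {
    await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight))
    /* Same settle delay idea as the main script. */
    await new Promise(resolve => setTimeout(resolve, settleMs))
    count = await countItems()
    scrolls++
  }
  return count
}
```

This sketch has no stale-height check, so a feed that runs out before the target keeps scrolling until maxScrolls. Combining it with the staleScrolls counter from the main script would cover both cases.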
networkidle2 never fires, so goto times out.
- Issue: a page with a persistent socket, a polling request, or an autoplaying video never drops to two or fewer open network connections for the 500 ms that networkidle2 requires, so waitUntil: 'networkidle2' waits until the navigation timeout and throws.
- Fix: switch to waitUntil: 'domcontentloaded' and then await page.waitForSelector('.product-item') so you proceed once the first items exist rather than waiting on the network to go quiet.
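The two-step open described in the fix can be wrapped in a small helper. The name openFeed and the 30000 ms default are illustrative choices, and '.product-item' is the selector from the script above; swap in the one that matches your feed's rows.

```javascript
/* Hypothetical helper: navigate without waiting for network idle,
   then wait until the first feed item exists in the DOM. */
async function openFeed(page, url, { firstItemSelector = '.product-item', timeoutMs = 30000 } = {}) {
  await page.goto(url, { waitUntil: 'domcontentloaded', timeout: timeoutMs })
  /* Rejects with a timeout error if no item appears in time. */
  await page.waitForSelector(firstItemSelector, { timeout: timeoutMs })
}
```

Waiting on a selector ties the start of the loop to the thing you actually need (rows in the DOM), so background traffic that never settles no longer matters.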
Use this when
A feed, search result, or catalog appends more items as you scroll toward the bottom, the new rows are not in the initial HTML, and you want the whole list in one pass.
Skip this when
The page loads more behind a button you click instead of on scroll (drive the button in a loop, see How to handle load more buttons when scraping); the same data is available through a paginated API or a ?page=N URL (fetch the JSON directly, which is lighter than driving a browser); the full list is already in the first HTML response (a plain fetch and a parser cover it); or you only need the items above the fold (no scrolling required).