Simplescraper

How to scrape an infinite scroll page in Puppeteer

Updated 2026-06-25 · 6 min read

If you're scraping a feed that keeps loading more rows as you scroll, a plain fetch or a single page.content() only ever gives you the first screen of items. The rest are not in the initial HTML at all. They arrive batch by batch over background requests as the viewport reaches the bottom, so until something does the scrolling, most of the list stays out of reach.

The fix is to drive the scrolling yourself: a loop calls page.evaluate to jump to the bottom, waits for the page height to grow after each jump, and stops once the height holds steady across two consecutive tries. After that the whole list sits in the DOM and you read it out in one pass. It takes about 55 lines of Node.js and one open-source library, Puppeteer.

Key terms

  • page.evaluate. A Puppeteer method that runs a function inside the page's own JavaScript context, used here so the scroll happens in the live page where window.scrollTo and document.body.scrollHeight exist.
  • scrollHeight. The full pixel height of the page's scrollable content, which grows each time a new batch of items is appended. Comparing it before and after a scroll is how the loop knows whether more loaded.
  • Settle delay. The pause after each scroll that gives the background request time to return and the framework time to append the new rows before the height is measured again.
  • $$eval. A Puppeteer method that runs a function over every element matching a selector inside the page and returns the result, used here to read the items out of the fully expanded DOM.

Here is what the script does:

  • Launch headless Chrome with Puppeteer and open the feed that loads its content as you scroll.
  • Scroll to the bottom inside a page.evaluate loop, pausing after each jump for the next batch to load.
  • Compare scrollHeight before and after each scroll, stopping once it stops growing across two tries or a scroll ceiling is hit.
  • Read the items out of the expanded DOM in one pass with $$eval.

The complete script

js
// infinite-scroll.mjs
import puppeteer from 'puppeteer'

const url = 'https://www.scrapingcourse.com/infinite-scrolling'

const browser = await puppeteer.launch({ headless: true })
const page = await browser.newPage()
await page.goto(url, { waitUntil: 'networkidle2' })

/* Scroll to the bottom repeatedly until the page stops getting taller.
   scrollHeight grows each time a new batch is appended; when it holds
   steady across `maxStaleScrolls` tries, the feed is exhausted. */
async function scrollToBottom({ settleMs = 1500, maxScrolls = 100, maxStaleScrolls = 2 } = {}) {
  let previousHeight = 0
  let staleScrolls = 0
  let scrolls = 0

  while (staleScrolls < maxStaleScrolls && scrolls < maxScrolls) {
    /* Jump to the bottom and read the height from inside the page. */
    const currentHeight = await page.evaluate(() => {
      window.scrollTo(0, document.body.scrollHeight)
      return document.body.scrollHeight
    })

    /* Give the background request time to return and the rows time to render. */
    await new Promise(resolve => setTimeout(resolve, settleMs))

    const newHeight = await page.evaluate(() => document.body.scrollHeight)
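    /* Growth counts if it happened during the settle wait (newHeight > currentHeight)
       or between passes, such as a late batch (currentHeight > previousHeight).
       previousHeight starts at 0, so the first pass always counts as growth. */
    if (newHeight > currentHeight || currentHeight > previousHeight) {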
      staleScrolls = 0 /* The page grew, so there is probably more to load. */
    } else {
      staleScrolls++ /* No growth. Could be a slow batch, so retry before stopping. */
    }

    previousHeight = newHeight
    scrolls++
  }
}

await scrollToBottom()

/* The full list is in the DOM now. Read every card in one pass. */
const items = await page.$$eval('.product-item', els =>
  els.map(el => ({
    name: el.querySelector('.product-name')?.textContent.trim() ?? null,
    price: el.querySelector('.product-price')?.textContent.trim() ?? null
  }))
)

console.log(`Loaded ${items.length} items`)
console.log(items.slice(0, 5))

await browser.close()

bash
npm install puppeteer
node infinite-scroll.mjs

What each step does

Open the page with networkidle2. Infinite-scroll feeds usually fetch their first batch over a background request after the initial HTML lands. Waiting for the network to settle means the first items are present before the loop starts, so the height measured on the first pass is the real starting height and not the height of an empty shell.

Scroll from inside page.evaluate. The scroll has to happen in the page's own context, because window.scrollTo and document.body.scrollHeight only exist there. Returning scrollHeight from the same call gives you the height the moment after the jump, before the new batch has had time to load, which is the number you compare against later.

Pause for a settle delay after each scroll. The batch request fires when the viewport reaches the bottom, then the framework appends the rows a moment later. The settleMs wait covers both. Set it too low and you measure the height before the new rows render and exit early; 1500ms is a reasonable starting point and you can tune it per site.
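One alternative to a fixed pause is to wait for the batch request itself. The sketch below is a minimal illustration, not part of the original script: the helper name `scrollAndWaitForBatch` and the `batchUrlPart` value are hypothetical, and you would need to find the real request URL in the browser's network panel. It relies on Puppeteer's `page.waitForResponse`, which accepts a predicate and a `timeout` option.

```js
/* Hypothetical helper: scroll once and report whether a batch response arrived.
   batchUrlPart is an assumed substring of the feed's background request URL. */
async function scrollAndWaitForBatch(page, { batchUrlPart, timeoutMs = 5000 }) {
  /* Start listening before scrolling so a fast response is not missed. */
  const batchArrived = page
    .waitForResponse(res => res.url().includes(batchUrlPart) && res.ok(), { timeout: timeoutMs })
    .then(() => true, () => false) /* A timeout or any other failure is reported as "no batch". */

  await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight))
  return batchArrived
}
```

A `false` result could be treated like a no-growth scroll in the main loop. The response landing does not guarantee the rows have rendered, so a short extra pause before measuring the height may still be needed.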

Stop on a stale-height count, not the first non-growth. A single scroll that adds no height might just be a slow batch. The staleScrolls counter resets whenever the page grows, and the loop breaks only after two consecutive no-growth scrolls (maxStaleScrolls = 2), so one slow response does not end the run with half the feed. The maxScrolls ceiling is the backstop for a feed that loads without end.
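To see why the counter matters, the stopping rule can be replayed over recorded heights without a browser. This is an illustrative sketch: `replayStopRule` is a hypothetical helper, and the height values are made up. It applies the same comparison as the script to each pass.

```js
/* Hypothetical helper: replay the script's stopping rule over recorded passes.
   Each pass holds the height right after the jump and the height after the settle wait. */
function replayStopRule(passes, { maxStaleScrolls = 2, maxScrolls = 100 } = {}) {
  let previousHeight = 0
  let staleScrolls = 0
  let scrolls = 0

  for (const { currentHeight, newHeight } of passes) {
    if (staleScrolls >= maxStaleScrolls || scrolls >= maxScrolls) break
    if (newHeight > currentHeight || currentHeight > previousHeight) {
      staleScrolls = 0
    } else {
      staleScrolls++
    }
    previousHeight = newHeight
    scrolls++
  }
  return { scrolls, staleScrolls }
}

/* Made-up run: the second pass is a slow batch that adds nothing in time. */
const passes = [
  { currentHeight: 1000, newHeight: 2000 },
  { currentHeight: 2000, newHeight: 2000 },
  { currentHeight: 2000, newHeight: 3000 },
  { currentHeight: 3000, newHeight: 3000 },
  { currentHeight: 3000, newHeight: 3000 },
  { currentHeight: 3000, newHeight: 3000 }
]

console.log(replayStopRule(passes))                         // { scrolls: 5, staleScrolls: 2 }
console.log(replayStopRule(passes, { maxStaleScrolls: 1 })) // { scrolls: 2, staleScrolls: 1 }
```

With a tolerance of one, the run ends at the slow second pass and never sees the growth to 3000. With a tolerance of two, it recovers and stops only after the height has truly stopped changing.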

Read the items after the loop, not during it. Once scrolling stops, the whole list is in the DOM, so a single $$eval pass over .product-item collects every card. Reading once at the end is simpler than collecting on each scroll and deduplicating, since the appended rows stay in the document.

Gotchas

  • The settle delay is too short, so the loop exits early.

    • Issue: with settleMs set low, the height is measured before the background request returns and the rows render, so the loop sees no growth and stops with only the first batch or two collected.
    • Fix: raise settleMs, or replace the fixed wait with page.waitForResponse keyed to the batch request's URL so the wait ends exactly when the data lands instead of after a guess.
  • The page scrolls a container, not the window.

    • Issue: some feeds put the scrollbar on an inner <div> with overflow: auto, so window.scrollTo moves nothing and document.body.scrollHeight never changes, ending the loop on the first pass.
    • Fix: scroll the container instead. Select it and set its scrollTop to its scrollHeight inside the evaluate call, and measure that element's scrollHeight rather than the body's.
  • The feed triggers on an element near the bottom, not the exact bottom.

    • Issue: an IntersectionObserver sentinel sometimes fires a screen early or only when an element scrolls fully into view, so a hard jump to scrollHeight can skip past the trigger without firing it.
    • Fix: scroll in steps with window.scrollBy(0, window.innerHeight) and a short pause between steps, which crosses every trigger point on the way down instead of leaping over it.
  • Loaded images stay blank because they lazy-load on view.

    • Issue: rows that scrolled by quickly keep a placeholder src and a real data-src, so reading img.src returns the placeholder for items the viewport passed too fast.
    • Fix: read the data-src attribute, or scroll in smaller steps so each batch sits in view long enough to swap in its image. See How to scrape lazy-loaded images.
  • The feed truly never ends, so the run never stops.

    • Issue: a social or search feed can load for as long as you scroll, turning the loop into a scrape that fills memory with DOM nodes and never reaches a stable height.
    • Fix: keep the maxScrolls ceiling and lower it to a count that covers the rows you need, or count the matching elements inside the loop and break once that count passes a target you set for the job.
  • networkidle2 never fires, so goto times out.

    • Issue: a page with a persistent socket, a polling request, or an autoplaying video may never drop to two or fewer open network connections for the 500 ms that networkidle2 requires, so waitUntil: 'networkidle2' waits until the navigation timeout and throws.
    • Fix: switch to waitUntil: 'domcontentloaded' and then await page.waitForSelector('.product-item') so you proceed once the first items exist rather than waiting on the network to go quiet.
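Several of the fixes above change only how the page is opened or how a single scroll is performed, so they can be sketched as small helpers. This is a rough sketch under stated assumptions: `openFeed`, `scrollOnce`, and `countItems` are hypothetical names, and any container selector such as `.feed` is a placeholder you would replace after inspecting the page.

```js
/* Hypothetical helpers; names and selectors are illustrative, not from the original script. */

/* Open the page without waiting for the network to go quiet,
   then wait until the first item exists. */
async function openFeed(page, url, itemSelector = '.product-item') {
  await page.goto(url, { waitUntil: 'domcontentloaded' })
  await page.waitForSelector(itemSelector)
}

/* Perform one scroll and return the scrollable height afterwards.
   containerSelector: scroll that element instead of the window.
   stepwise: move one viewport at a time instead of jumping to the bottom. */
async function scrollOnce(page, { containerSelector = null, stepwise = false } = {}) {
  return page.evaluate((selector, step) => {
    if (selector) {
      const box = document.querySelector(selector)
      if (!box) throw new Error(`No element matches ${selector}`)
      box.scrollTop = step ? box.scrollTop + box.clientHeight : box.scrollHeight
      return box.scrollHeight
    }
    if (step) window.scrollBy(0, window.innerHeight)
    else window.scrollTo(0, document.body.scrollHeight)
    return document.body.scrollHeight
  }, containerSelector, stepwise)
}

/* Count matching items so the loop can stop once a target is reached. */
async function countItems(page, itemSelector = '.product-item') {
  return page.$$eval(itemSelector, els => els.length)
}
```

Inside the scroll loop, `scrollOnce(page, { containerSelector: '.feed' })` would stand in for the window.scrollTo call, and a check such as `if (await countItems(page) >= target) break` would add the count-based stop. With stepwise scrolling the height can stay flat for several steps before the bottom is reached, so maxStaleScrolls would likely need raising.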

Use this when

A feed, search result, or catalog appends more items as you scroll toward the bottom, the new rows are not in the initial HTML, and you want the whole list in one pass.

Skip this when

The page loads more behind a button you click instead of on scroll (drive the button in a loop, see How to handle load more buttons when scraping); the same data is available through a paginated API or a ?page=N URL (fetch the JSON directly, which is lighter than driving a browser); the full list is already in the first HTML response (a plain fetch and a parser cover it); or you only need the items above the fold (no scrolling required).
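For the paginated case, a loop over ?page=N can replace the browser entirely. The sketch below is a minimal illustration under assumptions that will not hold for every site: the endpoint is hypothetical, each page is assumed to return a JSON array, and an empty array is assumed to mark the end. `fetchAllPages` and the `fetchImpl` option are names introduced here for illustration; the default relies on the global `fetch` available in Node.js 18 and later.

```js
/* Hypothetical helper: fetch page 1, 2, 3... until a page comes back empty
   or maxPages is reached. Assumes each page responds with a JSON array. */
async function fetchAllPages(baseUrl, { maxPages = 50, fetchImpl = globalThis.fetch } = {}) {
  const items = []

  for (let pageNumber = 1; pageNumber <= maxPages; pageNumber++) {
    const url = new URL(baseUrl)
    url.searchParams.set('page', String(pageNumber))

    const res = await fetchImpl(url)
    if (!res.ok) throw new Error(`Request for page ${pageNumber} failed with status ${res.status}`)

    const batch = await res.json()
    if (!Array.isArray(batch) || batch.length === 0) break /* Assumed end-of-list signal. */
    items.push(...batch)
  }

  return items
}
```

Real APIs often wrap the rows in an object or signal the last page with a field or a cursor, so the end-of-list check and the response parsing would need adjusting to the actual response.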
