How to scrape lazy-loaded images

Updated 2026-06-24 · 6 min read

If img.src comes back as the same 1x1 GIF or base64 blur for every <img> in a gallery, the page is probably loading its images lazily. The candidate URLs sit in data-src or srcset, and the page's own JavaScript copies them into src only when an image scrolls close to the viewport. Images below the fold have not been scrolled into range yet, so they have not loaded, and a plain read returns placeholders. Most modern image grids and feeds behave this way.

The fix is to reproduce the scrolling: move through the page in steps so each image's IntersectionObserver fires and the loader swaps the resolved URL into src before you read it. After the last scroll, wait for the network to settle, then read every <img>, falling back to data-src for any image whose observer has not run yet, and filter out the leftover placeholders. The result is a clean list of resolved image URLs from image elements. It comes to about 55 lines of Node.js with one dependency, Puppeteer.

Key terms

  • Lazy loading. Deferring an image's download until it nears the viewport, which is why an off-screen <img> holds a placeholder instead of its resolved URL.
  • IntersectionObserver. A browser API that fires a callback when an element crosses the viewport boundary, the trigger most lazy-loaders use to swap the candidate URL into src.
  • data-src. A non-rendering attribute where loaders park the image URL until the observer copies it into src, read here as the fallback for images that have not scrolled into view.
  • srcset and currentSrc. srcset lists candidate URLs at different resolutions, and currentSrc is the one the browser actually chose for the current viewport and pixel ratio.
  • Network idle. The condition, checked with waitForNetworkIdle, where no requests have been in flight for a set interval, signalling that scroll-triggered image fetches have finished.
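
To make the srcset entry above concrete, here is a minimal sketch of parsing a srcset string and picking the largest candidate by its descriptor, instead of trusting the order. The function names parseSrcset and largestCandidate and the sample file names are illustrative assumptions, not part of the script. The sketch also assumes the URLs contain no commas, which is not true of every image CDN.

```javascript
// Parse a srcset string into { url, size } candidates.
// Assumes width (w) or density (x) descriptors and no commas inside URLs.
function parseSrcset(srcset) {
  return srcset
    .split(',')
    .map(part => part.trim())
    .filter(Boolean)
    .map(part => {
      const [url, descriptor = '1x'] = part.split(/\s+/)
      // parseFloat('480w') is 480 and parseFloat('2x') is 2.
      return { url, size: parseFloat(descriptor) || 1 }
    })
}

// Return the URL with the largest descriptor, or null for an empty srcset.
function largestCandidate(srcset) {
  const candidates = parseSrcset(srcset)
  if (candidates.length === 0) return null
  return candidates.reduce((best, c) => (c.size > best.size ? c : best)).url
}

const example = 'small.jpg 480w, large.jpg 1600w, medium.jpg 800w'
console.log(largestCandidate(example)) // large.jpg
```

Sorting by descriptor costs a few more lines than taking the last entry, but it does not depend on the site listing candidates small to large.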

Here is what the script does:

  • Launch headless Chromium with Puppeteer and navigate to a page whose gallery loads images on scroll.
  • Scroll the page in fixed steps from top to bottom, pausing between steps so each image's IntersectionObserver callback runs and the browser starts fetching the resolved file.
  • Wait for the network to go quiet after the last scroll, so in-flight image requests finish before the read.
  • Read every <img> in the page, preferring currentSrc and falling back to data-src, data-original, or the last candidate in srcset when the site orders candidates small to large.
  • Filter out the 1x1 and base64 placeholder URLs that lazy-loaders leave behind, then print the clean list of absolute image URLs.

The complete script

js
// scrape-lazy-images.mjs
import puppeteer from 'puppeteer'

const url = 'https://unsplash.com/s/photos/mountain'

const browser = await puppeteer.launch({ headless: true })
const page = await browser.newPage()

// A predictable viewport matters: IntersectionObserver fires relative to it.
await page.setViewport({ width: 1280, height: 900 })
await page.goto(url, { waitUntil: 'domcontentloaded' })

// Scroll top-to-bottom in steps so each image enters the viewport and its
// IntersectionObserver swaps data-src into src. One jump to the bottom is not
// enough: images in the skipped middle may not intersect, so they may not load.
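await page.evaluate(async () => {
  const step = window.innerHeight
  const maxSteps = 50 // cap so an endless feed cannot scroll forever; raise for long pages
  const delay = ms => new Promise(r => setTimeout(r, ms))
  // scrollHeight is re-read each iteration, so content appended mid-scroll is covered.
  for (let y = 0, n = 0; y < document.body.scrollHeight && n < maxSteps; y += step, n++) {
    window.scrollTo(0, y)
    await delay(400)
  }
  window.scrollTo(0, document.body.scrollHeight)
})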

// Let the image requests triggered by the final scroll finish.
await page.waitForNetworkIdle({ idleTime: 1000, timeout: 15000 }).catch(() => {})

const images = await page.evaluate(() => {
  const placeholder = /^data:|1x1|blank\.|spacer\.|placeholder/i
  const fromSrcset = srcset =>
    srcset
      .split(',')
      .map(part => part.trim().split(/\s+/)[0])
      .filter(Boolean)
      .pop() || null

  return [...document.querySelectorAll('img')]
    .map(img => {
      const candidate =
        (img.currentSrc && !placeholder.test(img.currentSrc) && img.currentSrc) ||
        img.getAttribute('data-src') ||
        img.getAttribute('data-original') ||
        (img.getAttribute('srcset') && fromSrcset(img.getAttribute('srcset'))) ||
        (img.src && !placeholder.test(img.src) && img.src) ||
        null
      // Resolve relative URLs (data-src is often a root-relative path).
      return candidate ? new URL(candidate, document.baseURI).href : null
    })
    .filter(src => src && !placeholder.test(src))
})

// Deduplicate first so the reported count matches the printed list.
const unique = [...new Set(images)]
console.log(`Found ${unique.length} images`)
console.log(unique.join('\n'))

await browser.close()

bash
npm install puppeteer
node scrape-lazy-images.mjs

What each step does

Set predictable viewport dimensions. IntersectionObserver measures against the viewport, so fixing its size makes the loader's behavior reproducible from run to run. A misconfigured viewport, including an accidental zero-height one, can prevent the intersections you expect. page.setViewport({ width: 1280, height: 900 }) gives the observer a known box to test against.

Scroll in steps, not one jump. The loop advances by window.innerHeight and pauses 400ms each time. A single scrollTo(0, scrollHeight) skips the middle of the page, so images between the first and last screen may not intersect the viewport and may stay unloaded. Stepping moves the viewport through the page in order.
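
The difference between stepping and jumping can be checked with plain arithmetic. The sketch below is a simplified model, not part of the script: scrollOffsets and isVisited are hypothetical helpers, the 4000px page height is an example value, and the model ignores both the browser clamping the final scroll position and any rootMargin the loader sets on its observer.

```javascript
// List the scroll positions a stepped scroll visits.
// pageHeight and viewportHeight are in CSS pixels; both are example inputs.
function scrollOffsets(pageHeight, viewportHeight) {
  const offsets = []
  for (let y = 0; y < pageHeight; y += viewportHeight) {
    offsets.push(y)
  }
  return offsets
}

// True if an element whose top edge is at `top` falls inside the viewport
// at any of the visited scroll positions.
function isVisited(top, offsets, viewportHeight) {
  return offsets.some(y => top >= y && top < y + viewportHeight)
}

const viewportHeight = 900
const pageHeight = 4000

const stepped = scrollOffsets(pageHeight, viewportHeight)
console.log(stepped) // [ 0, 900, 1800, 2700, 3600 ]

// Initial load at the top, then one jump to the bottom of the page.
const jump = [0, pageHeight - viewportHeight]

console.log(isVisited(2000, stepped, viewportHeight)) // true
console.log(isVisited(2000, jump, viewportHeight)) // false
```

An image 2000px down the page is covered by the stepped scroll but falls between the two positions the single jump ever shows.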

Wait for network idle, not a fixed sleep. waitForNetworkIdle({ idleTime: 1000 }) resolves once no requests have been in flight for one second, which is a good sign that the images triggered by the last scroll have arrived. The .catch(() => {}) swallows the timeout on pages that keep a long-poll or analytics socket open, so a chatty page does not crash the script.
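
An empty .catch(() => {}) hides every failure from that call, not only the timeout. If you would rather let other errors surface, a narrower wrapper is one option. This is a sketch under an assumption: ignoreTimeout is a hypothetical helper, and it relies on the timeout error carrying the name 'TimeoutError', which Puppeteer's TimeoutError class is expected to do. Verify that against the Puppeteer version you run.

```javascript
// Swallow only timeout errors and rethrow everything else.
// Assumption: the rejection for a timeout has name === 'TimeoutError'.
async function ignoreTimeout(promise) {
  try {
    return await promise
  } catch (err) {
    if (err && err.name === 'TimeoutError') return undefined
    throw err
  }
}

// Usage inside the script, where `page` is the Puppeteer page:
// await ignoreTimeout(page.waitForNetworkIdle({ idleTime: 1000, timeout: 15000 }))
```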

Read with a fallback chain. img.currentSrc is what the browser actually chose for the current viewport and device pixel ratio, so it is the first choice. The chain then falls through data-src, data-original, and the last candidate in srcset, which is the largest only when the site orders candidates small to large, and finally src itself if it is not a placeholder. Wrapping the result in new URL(candidate, document.baseURI) turns root-relative paths like /photo/123.jpg into absolute URLs you can download.
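
The URL resolution step runs the same way in plain Node.js, which makes it easy to check outside the browser. In this sketch the toAbsolute helper and the example.com base are illustrative assumptions standing in for document.baseURI. The try/catch is an addition the script does not have: it returns null for a string that cannot be parsed instead of throwing.

```javascript
// Resolve a candidate URL against a base, returning null for unparsable input.
// The base below is a made-up example standing in for document.baseURI.
function toAbsolute(candidate, base) {
  try {
    return new URL(candidate, base).href
  } catch {
    return null
  }
}

const base = 'https://example.com/gallery/mountains'

console.log(toAbsolute('/photo/123.jpg', base))
// https://example.com/photo/123.jpg
console.log(toAbsolute('thumbs/123.jpg', base))
// https://example.com/gallery/thumbs/123.jpg
console.log(toAbsolute('//cdn.example.com/123.jpg', base))
// https://cdn.example.com/123.jpg
console.log(toAbsolute('https://img.example.org/a.png', base))
// https://img.example.org/a.png
```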

Gotchas

  • One jump to the bottom can leave the middle of the page unloaded.

    • Issue: window.scrollTo(0, document.body.scrollHeight) moves the viewport past middle images without stopping on them, so their IntersectionObserver callbacks may not fire and src can stay a placeholder.
    • Fix: scroll in steps of window.innerHeight with a short pause between each, as the loop does, so images along the scroll path pass through the viewport.
  • The page grows as you scroll.

    • Issue: infinite-scroll galleries append more images when you near the bottom, so document.body.scrollHeight measured once is stale and the loop stops before the current end of the page.
    • Fix: re-read scrollHeight each iteration (the loop condition does) and cap total scroll time or image count so an endless feed cannot loop forever.
  • Reading src too early returns the blur placeholder.

    • Issue: a fast page.evaluate right after the scroll catches images mid-fetch, so img.src is still the low-resolution base64 placeholder the loader showed first.
    • Fix: call waitForNetworkIdle before the read, and keep the placeholder regex so any leftover data: URL is dropped rather than collected.
  • currentSrc can be empty or a placeholder for off-screen images.

    • Issue: an image that has not scrolled into view can have an empty currentSrc, or one that still points at the placeholder, so reading only currentSrc silently drops the image or collects the wrong URL.
    • Fix: fall back to data-src, data-original, and srcset so an unloaded image still yields its declared URL.
  • data-src holds a relative path.

    • Issue: many loaders store data-src="/uploads/large/photo.jpg", and saving that string straight to disk gives a broken URL with no host.
    • Fix: resolve every candidate with new URL(candidate, document.baseURI).href so relative and protocol-relative paths become absolute.
  • Background images do not appear in the img list.

    • Issue: some galleries paint photos with CSS background-image on a <div>, so document.querySelectorAll('img') returns nothing for them.
    • Fix: also read getComputedStyle(el).backgroundImage for elements with a known gallery class, and parse the url(...) value out of it.
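
For the background-image case in the last gotcha, the parsing half can be sketched as a small pure function. backgroundUrls is a hypothetical helper, and the .gallery-item selector in the usage comment is a placeholder for whatever class the target site uses. The pattern assumes the URLs themselves contain no closing parenthesis when unquoted.

```javascript
// Extract URLs from a computed background-image value.
// Handles url("..."), url('...'), and unquoted url(...).
// Gradients and "none" contain no url(...) and yield an empty array.
function backgroundUrls(backgroundImage) {
  const urls = []
  const pattern = /url\(\s*(?:"([^"]*)"|'([^']*)'|([^)]*?))\s*\)/g
  for (const match of backgroundImage.matchAll(pattern)) {
    const url = match[1] ?? match[2] ?? match[3]
    if (url) urls.push(url)
  }
  return urls
}

console.log(backgroundUrls('url("https://example.com/a.jpg")'))
// [ 'https://example.com/a.jpg' ]
console.log(backgroundUrls('none'))
// []

// Usage in the page context (the selector is a placeholder):
// [...document.querySelectorAll('.gallery-item')]
//   .flatMap(el => backgroundUrls(getComputedStyle(el).backgroundImage))
```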

Use this when

You need resolved image URLs from <img> elements in a gallery, product grid, or feed that defers loading until scroll, and the candidate URLs live in data-src, srcset, or a swapped src.

Skip this when

  • The images are already in the static HTML with final src values. A plain fetch plus cheerio is faster and needs no browser.
  • The URLs come from a JSON or XHR response. Intercept that request directly.
  • The page sits behind a login. Authenticate first and reuse the session.
  • You only need to download the files rather than list them. Pipe each resolved URL to a write stream.
