How to scrape lazy-loaded images

Updated 2026-06-24 · 6 min read

If a page reads img.src back to you as the same 1x1 GIF or base64 blur for each <img> in a gallery, you're probably scraping a page that loads its images lazily. The candidate URLs are sitting in data-src or srcset, and the page's own JavaScript only copies them into src when an image scrolls close to the viewport. Nothing below the fold has scrolled there yet, so those images have not loaded, and a plain read gives you placeholders. This is how most modern image grids and feeds behave.

The fix is to reproduce the scroll behavior. We'll build a small script that does three things. It drives a headless browser down the page in fixed steps, so each image nears the viewport and the page's own loader swaps the resolved URL into src. It waits for the network to settle, so the scroll-triggered image requests finish before we read. It then reads every <img>, falls back to data-src or srcset for any image whose loader has not run yet, and drops the placeholder URLs that lazy-loaders leave behind. The result is a clean list of resolved image URLs, in about 55 lines of Node.js with one dependency, Puppeteer.

The complete script

// scrape-lazy-images.mjs
import puppeteer from 'puppeteer'

const url = 'https://unsplash.com/s/photos/mountain'

const browser = await puppeteer.launch({ headless: true })
const page = await browser.newPage()

// a predictable viewport matters: IntersectionObserver fires relative to it.
await page.setViewport({ width: 1280, height: 900 })
await page.goto(url, { waitUntil: 'domcontentloaded' })

// scroll top-to-bottom in steps so each image enters the viewport and its
// IntersectionObserver swaps data-src into src. one jump to the bottom is not
// enough: images in the skipped middle may not intersect, so they may not load.
await page.evaluate(async () => {
  const step = window.innerHeight
  const delay = ms => new Promise(r => setTimeout(r, ms))
  for (let y = 0; y < document.body.scrollHeight; y += step) {
    window.scrollTo(0, y)
    await delay(400)
  }
  window.scrollTo(0, document.body.scrollHeight)
})

// let the image requests triggered by the final scroll finish.
await page.waitForNetworkIdle({ idleTime: 1000, timeout: 15000 }).catch(() => {})

const images = await page.evaluate(() => {
  const placeholder = /^data:|1x1|blank\.|spacer\.|placeholder/i
  const fromSrcset = srcset =>
    srcset
      .split(',')
      .map(part => part.trim().split(/\s+/)[0])
      .filter(Boolean)
      .pop() || null

  return [...document.querySelectorAll('img')]
    .map(img => {
      const candidate =
        (img.currentSrc && !placeholder.test(img.currentSrc) && img.currentSrc) ||
        img.getAttribute('data-src') ||
        img.getAttribute('data-original') ||
        (img.getAttribute('srcset') && fromSrcset(img.getAttribute('srcset'))) ||
        (img.src && !placeholder.test(img.src) && img.src) ||
        null
      // resolve relative URLs (data-src is often a root-relative path).
      return candidate ? new URL(candidate, document.baseURI).href : null
    })
    .filter(src => src && !placeholder.test(src))
})

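// dedupe first so the reported count matches the list that is printed.
const unique = [...new Set(images)]
console.log(`Found ${unique.length} unique images`)
console.log(unique.join('\n'))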

await browser.close()

Install the dependency and run the script:

npm install puppeteer
node scrape-lazy-images.mjs

How it works

Set predictable viewport dimensions. IntersectionObserver measures intersections against the viewport, so a fixed viewport size makes its behavior reproducible. A misconfigured viewport, including an accidental zero-height one, can prevent the expected intersections. page.setViewport({ width: 1280, height: 900 }) gives the observer a known box to test against.

Scroll in steps, not one jump. The loop advances by window.innerHeight and pauses 400ms each time. A single scrollTo(0, scrollHeight) skips the middle of the page, so images between the first and last screen may not intersect the viewport and may stay unloaded. Stepping moves the viewport through the page in order. The loop re-reads scrollHeight on each iteration. That matters on infinite-scroll galleries, which append more images as you near the bottom, so a height measured once would go stale. The script as written has no upper bound, so cap total scroll time or image count if the page is an endless feed.
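
One way to add that cap is sketched below. It is a minimal variant of the scroll loop, not part of the script above: scrollWithCap and its options (stepDelay, maxSteps, maxImages) are hypothetical names chosen for illustration, and the default limits are arbitrary. The function is meant to run inside the page, so it relies on the browser's window and document globals.

```javascript
// Hypothetical helper (name and options are illustrative, not a Puppeteer API).
// Runs in the page context: it uses the browser's window and document.
async function scrollWithCap({ stepDelay = 400, maxSteps = 50, maxImages = 500 } = {}) {
  const delay = ms => new Promise(resolve => setTimeout(resolve, ms))
  const step = window.innerHeight
  let steps = 0
  let reason = 'reached-bottom'

  // scrollHeight is re-read on every iteration, so appended content is seen.
  for (let y = 0; y < document.body.scrollHeight; y += step) {
    if (steps >= maxSteps) { reason = 'max-steps'; break }
    if (document.querySelectorAll('img').length >= maxImages) { reason = 'max-images'; break }
    window.scrollTo(0, y)
    steps += 1
    await delay(stepDelay)
  }

  // Only finish with a jump to the bottom when the page actually ended.
  if (reason === 'reached-bottom') window.scrollTo(0, document.body.scrollHeight)
  return { steps, reason }
}

// Usage with the `page` object from the script:
// const result = await page.evaluate(scrollWithCap, { maxSteps: 80 })
// console.log(result.reason)
```

Returning the stop reason lets the caller tell a page that ended from a feed that was cut off, which is useful when deciding whether the image list is complete.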

Wait for network idle, not a fixed sleep. waitForNetworkIdle({ idleTime: 1000 }) resolves once there have been no requests in flight for one second, by which point the images triggered by the last scroll have usually arrived. Reading too early catches images mid-fetch and returns the blur placeholder, which is why the read waits for idle first. The .catch(() => {}) swallows the timeout on pages that keep a long-poll or analytics request open, so a chatty page does not crash the script. Note that it swallows every other rejection too.
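
An empty catch hides any failure of the wait, not only a timeout. The sketch below is one way to ignore only the timeout and rethrow everything else. settledOrTimedOut is a hypothetical helper, and it assumes the rejection carries name 'TimeoutError', which is how Puppeteer's TimeoutError class identifies itself as far as I know; check this against your Puppeteer version.

```javascript
// Hypothetical helper (not a Puppeteer API).
// Resolves to true when the wait finished, false when it timed out.
// Assumption: a timeout rejects with an error whose name is 'TimeoutError'.
async function settledOrTimedOut(waitPromise) {
  try {
    await waitPromise
    return true
  } catch (err) {
    if (err && err.name === 'TimeoutError') return false
    throw err // anything else is a real failure, e.g. a closed page
  }
}

// Usage with the `page` object from the script:
// const idle = await settledOrTimedOut(
//   page.waitForNetworkIdle({ idleTime: 1000, timeout: 15000 })
// )
// if (!idle) console.warn('network never went idle; reading anyway')
```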

Read with a fallback chain. img.currentSrc is the URL the browser actually chose for the current viewport and device pixel ratio, so it is the first choice. For an image whose loader has not run, though, it can be empty or still point at the placeholder, so the chain falls through data-src, data-original, and the last candidate in srcset. The last candidate is the largest one only when the site orders candidates small to large. This way an unloaded image still yields its declared URL. Wrapping the result in new URL(candidate, document.baseURI) turns root-relative paths like /photo/123.jpg into absolute URLs you can download. Galleries that paint photos with CSS background-image on a <div> never appear in the img list at all, so read getComputedStyle(el).backgroundImage and parse its url(...) value for those elements.
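
Two pieces of this step can be sketched as small string helpers: picking the largest srcset candidate by its descriptor instead of by position, and pulling url(...) values out of a computed background-image. Both functions are hypothetical and simplified. largestFromSrcset only understands a single w or x descriptor per candidate, and urlsFromBackgroundImage assumes the URL itself does not contain a closing parenthesis when unquoted. The 'div' selector in the usage comment is also an assumption about the page.

```javascript
// Hypothetical helpers for illustration; neither is part of the script above.

// Return the srcset candidate with the largest width (w) or density (x)
// descriptor. A candidate without a descriptor counts as 1x.
function largestFromSrcset(srcset) {
  const candidate = /(\S+)(?:\s+([\d.]+)[wx])?\s*(?:,|$)/g
  let best = null
  for (const [, url, size = '1'] of srcset.matchAll(candidate)) {
    const value = parseFloat(size)
    if (!best || value > best.value) best = { url, value }
  }
  return best ? best.url : null
}

// Return every url(...) found in a computed background-image value,
// skipping gradients and the keyword `none`.
function urlsFromBackgroundImage(value) {
  return [...value.matchAll(/url\(\s*(["']?)(.*?)\1\s*\)/g)].map(m => m[2])
}

// Usage with the `page` object from the script: collect the raw values in
// the page, then parse them in Node.
// const values = await page.$$eval('div', els =>
//   els.map(el => getComputedStyle(el).backgroundImage))
// const backgrounds = values.flatMap(urlsFromBackgroundImage)
```

Sorting by descriptor removes the dependence on candidate order, at the cost of ignoring less common descriptor combinations.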

Use this when

You need resolved image URLs from <img> elements in a gallery, product grid, or feed that defers loading until scroll, and the candidate URLs live in data-src, srcset, or a swapped src.

Skip this when

The images are already in the static HTML with final src values, in which case a plain fetch plus cheerio is faster and needs no browser; the URLs come from a JSON or XHR response, in which case intercept that request directly; the page sits behind a login, in which case authenticate first and reuse the session; or you only need to download the files rather than list them, in which case pipe each resolved URL to a write stream.
