How to scrape lazy-loaded images
If img.src comes back as the same 1x1 GIF or base64 blur for every <img> in a gallery, the page is probably loading its images lazily. The candidate URLs are sitting in data-src or srcset, and the page's own JavaScript only copies them into src when an image scrolls close to the viewport. Nothing below the fold has scrolled there yet, so those images have not loaded, and a plain read gives you placeholders. Many modern image grids and feeds behave this way.
The fix is to reproduce the scroll: move through the page in steps so each image's IntersectionObserver fires and the loader swaps the resolved URL into src before you read it. After the last scroll, wait for the network to settle, then read every <img>, falling back to data-src for any image whose observer has not run yet, and filter out the leftover placeholders. The result is a clean list of resolved image URLs. It comes to about 55 lines of Node.js with one dependency, Puppeteer.
Key terms
- Lazy loading. Deferring an image's download until it nears the viewport, which is why an off-screen <img> holds a placeholder instead of its resolved URL.
- IntersectionObserver. A browser API that fires a callback when an element crosses the viewport boundary, the trigger most lazy-loaders use to swap the candidate URL into src.
- data-src. A non-rendering attribute where loaders park the image URL until the observer copies it into src, read here as the fallback for images that have not scrolled into view.
- srcset and currentSrc. srcset lists candidate URLs at different resolutions, and currentSrc is the one the browser actually chose for the current viewport and pixel ratio.
- Network idle. The condition, checked with waitForNetworkIdle, where no request has been in flight for a set interval, signalling that scroll-triggered image fetches have finished.
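To make these terms concrete, here is a minimal sketch of the kind of loader a page might ship. It is an assumption about a typical implementation, not any particular site's code: lazyLoad, the 200px rootMargin, and the injectable Observer parameter are all hypothetical, and the last one exists only so the sketch can be exercised outside a browser.

```javascript
// Hypothetical minimal lazy-loader, similar in spirit to what many sites ship.
// Each <img> starts with a placeholder in src and the real URL in data-src.
function lazyLoad(images, Observer = globalThis.IntersectionObserver) {
  const observer = new Observer(entries => {
    for (const entry of entries) {
      if (!entry.isIntersecting) continue // still off-screen: leave the placeholder
      const img = entry.target
      img.src = img.dataset.src           // swap the parked URL into src
      observer.unobserve(img)             // each image only needs loading once
    }
  }, { rootMargin: '200px' })             // assumed margin: start loading a little early

  for (const img of images) observer.observe(img)
  return observer
}

// In a page this might be called as:
//   lazyLoad(document.querySelectorAll('img[data-src]'))
```

Until the callback runs for a given image, a scraper reading src sees only the placeholder, which is why the script has to scroll first.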
Here is what the script does:
- Launch headless Chromium with Puppeteer and navigate to a page whose gallery loads images on scroll.
- Scroll the page in fixed steps from top to bottom, pausing between steps so each image's IntersectionObserver callback runs and the browser starts fetching the resolved file.
- Wait for the network to go quiet after the last scroll, so in-flight image requests finish before the read.
- Read every <img> in the page, preferring currentSrc and falling back to data-src, data-original, or the last candidate in srcset when the site orders candidates small to large.
- Filter out the 1x1 and base64 placeholder URLs that lazy-loaders leave behind, then print the clean list of absolute image URLs.
The complete script
// scrape-lazy-images.mjs
import puppeteer from 'puppeteer'

const url = 'https://unsplash.com/s/photos/mountain'
const browser = await puppeteer.launch({ headless: true })
const page = await browser.newPage()

// A predictable viewport matters: IntersectionObserver fires relative to it.
await page.setViewport({ width: 1280, height: 900 })
await page.goto(url, { waitUntil: 'domcontentloaded' })

// Scroll top-to-bottom in steps so each image enters the viewport and its
// IntersectionObserver swaps data-src into src. One jump to the bottom is not
// enough: images in the skipped middle may not intersect, so they may not load.
await page.evaluate(async () => {
  const step = window.innerHeight
  const maxSteps = 50 // cap so an infinite-scroll feed cannot loop forever
  const delay = ms => new Promise(r => setTimeout(r, ms))
  // scrollHeight is re-read on every pass because the page may grow.
  for (let y = 0, n = 0; y < document.body.scrollHeight && n < maxSteps; y += step, n++) {
    window.scrollTo(0, y)
    await delay(400)
  }
  window.scrollTo(0, document.body.scrollHeight)
})

// Let the image requests triggered by the final scroll finish.
await page.waitForNetworkIdle({ idleTime: 1000, timeout: 15000 }).catch(() => {})

const images = await page.evaluate(() => {
  const placeholder = /^data:|1x1|blank\.|spacer\.|placeholder/i
  // Take the last srcset candidate (assumes candidates run small to large).
  const fromSrcset = srcset =>
    srcset
      .split(',')
      .map(part => part.trim().split(/\s+/)[0])
      .filter(Boolean)
      .pop() || null
  return [...document.querySelectorAll('img')]
    .map(img => {
      const candidate =
        (img.currentSrc && !placeholder.test(img.currentSrc) && img.currentSrc) ||
        img.getAttribute('data-src') ||
        img.getAttribute('data-original') ||
        (img.getAttribute('srcset') && fromSrcset(img.getAttribute('srcset'))) ||
        (img.src && !placeholder.test(img.src) && img.src) ||
        null
      // Resolve relative URLs (data-src is often a root-relative path).
      return candidate ? new URL(candidate, document.baseURI).href : null
    })
    .filter(src => src && !placeholder.test(src))
})

// Deduplicate before counting so the total matches the printed list.
const unique = [...new Set(images)]
console.log(`Found ${unique.length} images`)
console.log(unique.join('\n'))
await browser.close()

Install the dependency and run the script:

npm install puppeteer
node scrape-lazy-images.mjs

What each step does
Set predictable viewport dimensions. IntersectionObserver measures against the viewport, so a predictable viewport makes its behavior reproducible. A misconfigured viewport, including an accidental zero-height viewport, can prevent expected intersections. page.setViewport({ width: 1280, height: 900 }) gives the observer a box to test against.
Scroll in steps, not one jump. The loop advances by window.innerHeight and pauses 400ms each time. A single scrollTo(0, scrollHeight) skips the middle of the page, so images between the first and last screen may not intersect the viewport and may stay unloaded. Stepping moves the viewport through the page in order.
Wait for network idle, not a fixed sleep. waitForNetworkIdle({ idleTime: 1000 }) resolves once no request has been in flight for one second, which is a good sign that the images triggered by the last scroll have arrived. The .catch(() => {}) swallows the timeout on pages that keep a long-poll or analytics connection open, so a chatty page does not crash the script; the read then proceeds with whatever has loaded by that point.
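Swallowing the rejection keeps the script alive, but it also hides the fact that the page never went quiet. A minimal sketch of an alternative, assuming you would rather log that case: settle is a hypothetical helper name, and it treats any rejection from waitForNetworkIdle as "not idle" rather than inspecting the error type.

```javascript
// Hypothetical helper: wait for the network to go quiet and report whether it
// did, instead of discarding the rejection silently.
async function settle(page, options = { idleTime: 1000, timeout: 15000 }) {
  try {
    await page.waitForNetworkIdle(options)
    return true
  } catch (err) {
    // Typically a timeout on pages that never stop making requests.
    console.warn(`Network did not go idle: ${err.message}`)
    return false
  }
}

// Possible usage in the script:
//   const idle = await settle(page)
//   if (!idle) console.warn('Reading images anyway; some may still be placeholders')
```

The read still goes ahead either way; the only difference is that a partial result is now visible in the log.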
Read with a fallback chain. img.currentSrc is what the browser actually chose for the current viewport and device pixel ratio, so it is the first choice. The chain then falls through data-src, data-original, and the last candidate in srcset when the site orders candidates small to large. Wrapping the result in new URL(candidate, document.baseURI) turns root-relative paths like /photo/123.jpg into absolute URLs you can download.
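Taking the last srcset candidate only works when the site lists candidates small to large. Where that ordering is not guaranteed, one option is to compare the descriptors instead. The sketch below is an illustration under that assumption: largestFromSrcset is a hypothetical helper, and like the script it splits on commas, so it will misparse URLs that themselves contain commas.

```javascript
// Hypothetical helper: pick the largest candidate from a srcset string by its
// descriptor ("800w" or "2x") instead of trusting the order of the list.
function largestFromSrcset(srcset) {
  const candidates = srcset
    .split(',')                       // naive split: breaks on URLs containing commas
    .map(part => part.trim())
    .filter(Boolean)
    .map(part => {
      // A candidate with no descriptor counts as 1x.
      const [url, descriptor = '1x'] = part.split(/\s+/)
      return { url, size: parseFloat(descriptor) || 0 }
    })
  if (candidates.length === 0) return null
  // Keep the candidate with the biggest numeric descriptor.
  return candidates.reduce((best, c) => (c.size > best.size ? c : best)).url
}

console.log(largestFromSrcset('a-1600.jpg 1600w, a-400.jpg 400w, a-800.jpg 800w'))
// prints "a-1600.jpg"
```

A valid srcset uses either width or density descriptors, not both, so comparing the bare numbers within one attribute is enough for this purpose.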
Gotchas
One jump to the bottom can leave the middle of the page unloaded.
- Issue: window.scrollTo(0, document.body.scrollHeight) moves the viewport past middle images without stopping on them, so their IntersectionObserver callbacks may not fire and src can stay a placeholder.
- Fix: scroll in steps of window.innerHeight with a short pause between each, as the loop does, so images along the scroll path pass through the viewport.
The page grows as you scroll.
- Issue: infinite-scroll galleries append more images when you near the bottom, so document.body.scrollHeight measured once is stale and the loop stops before the current end of the page.
- Fix: re-read scrollHeight each iteration (the loop condition does) and cap total scroll time or image count so an endless feed cannot loop forever.
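One way to bound the scroll on an endless feed is to combine a step cap with a time budget. The sketch below is a variant of the scroll loop under stated assumptions, not the script's own code: scrollWithBudget, maxSteps, and budgetMs are hypothetical names, and the page is abstracted behind three injected functions so the logic can be exercised outside a browser.

```javascript
// Hypothetical helper: stepped scroll that stops at the end of the page, after
// maxSteps steps, or once budgetMs has elapsed, whichever comes first.
async function scrollWithBudget({ getHeight, scrollTo, delay, step, maxSteps = 50, budgetMs = 30000 }) {
  const started = Date.now()
  let steps = 0
  // getHeight() is called on every pass because the page may have grown.
  for (let y = 0; y < getHeight(); y += step) {
    if (steps >= maxSteps || Date.now() - started > budgetMs) break
    scrollTo(y)
    await delay()
    steps++
  }
  return steps
}

// Inside page.evaluate, the injected functions would be the real DOM calls:
//   await scrollWithBudget({
//     getHeight: () => document.body.scrollHeight,
//     scrollTo: y => window.scrollTo(0, y),
//     delay: () => new Promise(r => setTimeout(r, 400)),
//     step: window.innerHeight,
//   })
```

Because page.evaluate serializes its callback, the helper has to be defined inside that callback rather than passed in from Node.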
Reading src too early returns the blur placeholder.
- Issue: a fast page.evaluate right after the scroll catches images mid-fetch, so img.src is still the low-resolution base64 placeholder the loader showed first.
- Fix: call waitForNetworkIdle before the read, and keep the placeholder regex so any leftover data: URL is dropped rather than collected.
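The placeholder regex is deliberately broad, and that has a cost: the bare 1x1 alternative also matches dimension strings such as 2021x1080 inside a real image URL. The sketch below compares the script's pattern with a stricter variant. The stricter pattern and the example.com URLs are illustrative assumptions, not part of the original script.

```javascript
// The script's pattern, and a hypothetical stricter variant that only accepts
// "1x1" when it is not embedded in a longer run of letters or digits.
const loose = /^data:|1x1|blank\.|spacer\.|placeholder/i
const strict = /^data:|(?:^|[^0-9a-z])1x1(?:[^0-9a-z]|$)|blank\.|spacer\.|placeholder/i

const samples = [
  'data:image/gif;base64,R0lGODlhAQABAAAAACw=',     // inline placeholder
  'https://cdn.example.com/img/1x1.gif',            // tracking-pixel style placeholder
  'https://cdn.example.com/2021x1080/mountain.jpg', // real image with dimensions in the path
  'https://cdn.example.com/photos/mountain.jpg',    // real image
]

for (const url of samples) {
  console.log(`loose=${loose.test(url)} strict=${strict.test(url)} ${url.slice(0, 50)}`)
}
```

The third sample is the one to watch: the loose pattern drops it as a placeholder, while the stricter one keeps it. Whether that trade-off matters depends on how the target site names its files.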
currentSrc is empty for off-screen images.
- Issue: an image that has not scrolled into view can have an empty currentSrc (or one that still points at the placeholder), so reading only currentSrc silently drops it from the results.
- Fix: fall back to data-src, data-original, and srcset so an unloaded image still yields its declared URL.
data-src holds a relative path.
- Issue: many loaders store data-src="/uploads/large/photo.jpg", and using that string as-is gives you a path with no host, which cannot be downloaded.
- Fix: resolve every candidate with new URL(candidate, document.baseURI).href so relative and protocol-relative paths become absolute.
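The resolution rules are easy to check outside a browser, because Node ships the same URL class. A small sketch, assuming example.com addresses and a hypothetical toAbsolute wrapper that returns null instead of throwing on input it cannot parse:

```javascript
// Hypothetical wrapper around the URL constructor used in the script.
// new URL throws a TypeError on input it cannot parse, so return null instead.
function toAbsolute(candidate, baseURI) {
  try {
    return new URL(candidate, baseURI).href
  } catch {
    return null
  }
}

const base = 'https://example.com/gallery/'

console.log(toAbsolute('/uploads/large/photo.jpg', base)) // root-relative
console.log(toAbsolute('//cdn.example.com/a.jpg', base))  // protocol-relative
console.log(toAbsolute('thumbs/b.jpg', base))             // path-relative
console.log(toAbsolute('https://img.example.org/c.jpg', base)) // already absolute
// prints:
//   https://example.com/uploads/large/photo.jpg
//   https://cdn.example.com/a.jpg
//   https://example.com/gallery/thumbs/b.jpg
//   https://img.example.org/c.jpg
```

In the page, document.baseURI plays the role of base, which also respects a <base> element if the site declares one.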
Background images do not appear in the img list.
- Issue: some galleries paint photos with CSS background-image on a <div>, so document.querySelectorAll('img') returns nothing for them.
- Fix: also read computed styles: getComputedStyle(el).backgroundImage and parse the url(...) value for elements with a known gallery class.
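Parsing the computed value takes a little care, because backgroundImage can be none, can hold several comma-separated layers, and may mix url(...) with gradients. A minimal sketch of the parsing step, where urlsFromBackgroundImage and the .gallery-item selector are hypothetical names chosen for illustration:

```javascript
// Hypothetical helper: extract every url(...) target from a computed
// background-image value, with or without quotes around the URL.
function urlsFromBackgroundImage(value) {
  const urls = []
  const pattern = /url\((['"]?)(.*?)\1\)/g
  for (const match of value.matchAll(pattern)) urls.push(match[2])
  return urls
}

console.log(urlsFromBackgroundImage('url("https://example.com/a.jpg")'))
// prints [ 'https://example.com/a.jpg' ]
console.log(urlsFromBackgroundImage('none'))
// prints []

// In the page, assuming tiles carry a known class such as .gallery-item:
//   [...document.querySelectorAll('.gallery-item')]
//     .flatMap(el => urlsFromBackgroundImage(getComputedStyle(el).backgroundImage))
```

The extracted URLs can then go through the same placeholder filter and deduplication as the ones read from image elements.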
Use this when
You need resolved image URLs from <img> elements in a gallery, product grid, or feed that defers loading until scroll, and the candidate URLs live in data-src, srcset, or a swapped src.
Skip this when
- The images are already in the static HTML with final src values. A plain fetch plus cheerio is faster and needs no browser.
- The URLs come from a JSON or XHR response. Intercept that request directly.
- The page sits behind a login. Authenticate first and reuse the session.
- You only need to download the files rather than list them. Fetch each resolved URL and pipe the response to a write stream.