How to scrape paginated results in JavaScript

Updated 2026-06-25 · 5 min read

If the listing you're scraping spreads its results across numbered pages, fetching page one gives you the first 10 or 20 rows and nothing else. The rest sit behind ?page=2, ?page=3, and so on, and you do not know up front how many pages there are, so you cannot just hardcode a range and call it done.

The fix is a loop that increments the page number, fetches each page, collects its rows, and stops the first time a page comes back with zero rows. That empty page marks the end of the data, and using it as the stop condition lets the loop discover the page count instead of guessing it. The script takes about 35 lines of Node.js, using one open-source library (cheerio) plus the built-in fetch, which is available in Node 18 and later.

Here is what the script does:

  • Build each page URL from a template by substituting the current page number, so the loop walks page/1/, page/2/, and onward.
  • Fetch the page HTML with the built-in fetch and a stock desktop browser User-Agent.
  • Parse the HTML with cheerio and pull the rows out with a CSS selector.
  • Stop the loop when a page returns no rows, and keep a page cap so a misbehaving site cannot spin the loop without end.

The complete script

```js
// scrape-paginated.mjs
import * as cheerio from 'cheerio'

// {page} is replaced with the page number on each pass.
// Swap to 'https://example.com/products?page={page}' for query-string pagination.
const urlTemplate = 'https://quotes.toscrape.com/page/{page}/'
const rowSelector = '.quote .text'

// A safety cap so a site that never returns an empty page cannot loop without end.
const maxPages = 50

const results = []

for (let page = 1; page <= maxPages; page++) {
  const url = urlTemplate.replace('{page}', String(page))

  const res = await fetch(url, {
    headers: { 'User-Agent': 'Mozilla/5.0' }
  })

  // A 404 past the last page is also an end-of-data signal on some sites.
  if (res.status === 404) break
  if (!res.ok) throw new Error(`page ${page} returned HTTP ${res.status}`)

  const $ = cheerio.load(await res.text())
  const rows = $(rowSelector).map((_, el) => $(el).text().trim()).get()

  // The stop condition: an empty page means we have walked past the last page.
  if (rows.length === 0) break

  results.push(...rows)
  console.log(`page ${page}: ${rows.length} rows (running total ${results.length})`)
}

console.log(`done: ${results.length} rows across the paginated listing`)
```

```bash
npm install cheerio
node scrape-paginated.mjs
```

What each step does

Build the URL from a template. The page number is the only part of the URL that changes between requests, so the template carries a {page} placeholder and the loop substitutes the current number with replace. The example target paginates with a path segment (/page/2/); a site that paginates with a query string uses ?page={page} instead, and nothing else in the loop changes.
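
If you want the substitution in one place, it can be pulled into a small helper. This is a minimal sketch, not part of the script above: buildPageUrl is a hypothetical name, and the check for a missing placeholder is an added assumption to catch a template that would otherwise request the same URL on every pass.

```js
// Hypothetical helper: fills the {page} placeholder in a URL template.
function buildPageUrl(template, page) {
  if (!template.includes('{page}')) {
    // Without a placeholder every pass would fetch the same URL.
    throw new Error('URL template has no {page} placeholder')
  }
  // replaceAll also covers a template that repeats the placeholder.
  return template.replaceAll('{page}', String(page))
}

// Path-segment pagination, as on the example target.
console.log(buildPageUrl('https://quotes.toscrape.com/page/{page}/', 2))
// https://quotes.toscrape.com/page/2/

// Query-string pagination; only the template changes.
console.log(buildPageUrl('https://example.com/products?page={page}', 2))
// https://example.com/products?page=2
```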

Fetch with a stock desktop User-Agent. A bare fetch() from Node sends node as its User-Agent, which some servers reject. A browser-style Mozilla/5.0 string gets the full page back from most sites. This is politeness, not stealth: a site that blocks bots in earnest checks far more than a header string.

Parse and select the rows. cheerio.load() parses the HTML into a jQuery-style document, and the CSS selector pulls out the elements you want. The .map().get() pair turns the matched nodes into a plain array of strings, one entry per row on that page.

Stop on an empty page. Once a fetched page yields zero rows, the listing is exhausted and the loop breaks. The maxPages cap is the backstop for sites that wrap around to page one or echo the last page instead of returning an empty one, so the loop is bounded even when the empty-page signal never arrives.
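
To see the two stop conditions on their own, the loop can be separated from the fetching so it runs against a stub with no network. This is a sketch under assumptions: collectPages and fetchRows are hypothetical names, and fetchRows stands in for the fetch and cheerio steps of the script.

```js
// Hypothetical helper: fetchRows(page) is any async function that resolves
// to an array of rows for that page.
async function collectPages(fetchRows, { maxPages = 50 } = {}) {
  const results = []
  for (let page = 1; page <= maxPages; page++) {
    const rows = await fetchRows(page)
    // Stop condition: an empty page means we walked past the last page.
    if (rows.length === 0) break
    results.push(...rows)
  }
  return results
}

// Stub listing with three pages of two rows each.
const fakeListing = { 1: ['a', 'b'], 2: ['c', 'd'], 3: ['e', 'f'] }
const fetchFromFake = async (page) => fakeListing[page] ?? []

collectPages(fetchFromFake).then((rows) => {
  console.log(rows.length) // 6
})

// A listing that never runs dry is still bounded by the cap.
collectPages(async () => ['x'], { maxPages: 5 }).then((rows) => {
  console.log(rows.length) // 5
})
```

The first call ends on the empty fourth page; the second never sees an empty page and ends only because of maxPages.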

Gotchas

  • The loop never ends because the last page repeats instead of emptying.

    • Issue: Some sites clamp an out-of-range page to the last valid one, so ?page=999 returns the same rows as the final page and the empty-page check never fires.
    • Fix: keep the maxPages cap, and additionally compare each page's rows to the previous page's. Break when JSON.stringify(rows) === JSON.stringify(prevRows), which catches a page that echoes the one before it.
  • You can detect the last page from the Next button, without fetching an empty page.

    • Issue: Fetching one extra page past the end just to see zero rows wastes a request on large listings.
    • Fix: read the pager instead. On a site whose pager has a li.next > a link, add if ($('li.next > a').length === 0) break after collecting the current page's rows, so the loop stops as soon as that link is absent.
  • Rows load over an XHR call, so the fetched HTML has none of them.

    • Issue: fetch only sees the server's initial HTML, so a listing that renders rows client-side with React or Vue hands back an empty shell and the loop stops at page one.
    • Fix: find the JSON endpoint the page calls in the Network tab and paginate that directly, or render each page with Puppeteer first. See How to scrape a JavaScript-rendered page in Node.js.
  • Firing all the page requests at once gets you rate limited.

    • Issue: Swapping the sequential loop for Promise.all over a page range sends every request in the same instant, which many servers answer with 429 Too Many Requests.
    • Fix: keep the requests sequential as written, or cap concurrency and add backoff. See How to rate-limit requests with backoff in JavaScript.
  • The same record shows up on two pages and lands in the results twice.

    • Issue: When the listing is sorted by a value that changes while you scrape (recently updated, price), a row can shift from page two to page three between fetches and get collected on both.
    • Fix: dedupe by a stable key after the loop rather than trusting page boundaries. See How to deduplicate scraped records in JavaScript.
  • Page numbers start at 0, not 1.

    • Issue: A few APIs and listings treat ?page=0 as the first page, so starting the loop at 1 silently skips the first batch of rows.
    • Fix: check the site's first page in a browser. If ?page=0 is the first page, start the loop at page = 0 and adjust the cap.
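
Two of the fixes above, the repeated-page check and deduplication by a stable key, are small enough to sketch together. Everything named here is hypothetical: stopReason, dedupeBy, and fetchClamped are illustrative helpers, the id field is an assumed stable key, and the stub simulates a site that clamps out-of-range pages to the last valid one.

```js
// Returns a reason string when the loop should stop, or null to keep going.
function stopReason(rows, prevRows) {
  if (rows.length === 0) return 'empty page'
  // A page identical to the previous one means the site is echoing its last page.
  if (prevRows && JSON.stringify(rows) === JSON.stringify(prevRows)) return 'repeated page'
  return null
}

// Keeps the first record seen for each key, preserving order.
function dedupeBy(records, keyOf) {
  const seen = new Map()
  for (const record of records) {
    const key = keyOf(record)
    if (!seen.has(key)) seen.set(key, record)
  }
  return [...seen.values()]
}

// Simulated site: three pages, and any page past the end returns page three again.
const pages = [['a', 'b'], ['c', 'd'], ['e']]
const fetchClamped = async (page) => pages[Math.min(page, pages.length) - 1]

async function run() {
  const results = []
  let prevRows = null
  for (let page = 1; page <= 50; page++) {
    const rows = await fetchClamped(page)
    const reason = stopReason(rows, prevRows)
    if (reason) {
      console.log(`stopping at page ${page}: ${reason}`)
      break
    }
    results.push(...rows)
    prevRows = rows
  }
  return results
}

run().then((results) => console.log(results.length)) // 5

// The same record collected on two pages collapses to one entry.
const scraped = [{ id: 1, price: 10 }, { id: 2, price: 12 }, { id: 1, price: 11 }]
console.log(dedupeBy(scraped, (record) => record.id).length) // 2
```

Comparing only against the previous page catches an echoed last page but not a site that wraps around to page one, which is why the cap stays in place.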

Use this when

The listing paginates with a page number in the URL (?page=2 or /page/2/) and the rows are present in the server-rendered HTML. This covers most server-rendered catalogs, search result pages, forum indexes, and blog archives.

Skip this when

The page loads more rows on scroll with no page number (use an infinite scroll loop); the next batch appears behind a button click (use a load-more loop); the rows are rendered client-side from an XHR call (paginate the JSON endpoint or render with Puppeteer); the pager uses a cursor token rather than a page number (carry the cursor from each response into the next request).
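
For the cursor case, the loop keeps the same shape but carries a token instead of a counter. This is a sketch only: collectByCursor and fetchBatch are hypothetical names, and the { items, nextCursor } response shape is an assumption, since real APIs name and place these fields differently.

```js
// Hypothetical: fetchBatch(cursor) resolves to { items, nextCursor }, where
// nextCursor is null on the last batch. A null cursor asks for the first batch.
async function collectByCursor(fetchBatch, maxRequests = 50) {
  const results = []
  let cursor = null
  // The request cap plays the same backstop role as maxPages.
  for (let i = 0; i < maxRequests; i++) {
    const { items, nextCursor } = await fetchBatch(cursor)
    results.push(...items)
    // No cursor in the response means there is nothing left to request.
    if (!nextCursor) break
    cursor = nextCursor
  }
  return results
}

// Stub API with three batches chained by cursor tokens.
const batches = {
  start: { items: [1, 2], nextCursor: 'abc' },
  abc: { items: [3, 4], nextCursor: 'def' },
  def: { items: [5], nextCursor: null }
}
const fetchStub = async (cursor) => batches[cursor ?? 'start']

collectByCursor(fetchStub).then((items) => console.log(items.length)) // 5
```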
