Simplescraper

How to scrape an API with pagination in JavaScript

Updated 2026-06-25 · 6 min read

If you found the JSON API behind a site and called it once, you have probably noticed the response only carries the first slice of the data: 20 or 50 items, plus a field like next_cursor, total, or has_more that hints there is more behind it. Hardcoding ?page=2, ?page=3 and so on works until the API switches scheme, returns a duplicate page, or runs past the end and you cannot tell where to stop.

The fix is to read the pagination metadata the API already returns and let it drive a loop that requests the next slice until the API says there is no more. The three schemes you will meet are cursor (follow the opaque next_cursor token), offset (advance a numeric offset by the page size), and page-number (increment page until you have the total). Each is a short loop on top of fetch, around 30 lines, no dependencies beyond Node 18+.

Key terms

  • Cursor pagination. The response carries an opaque token (next_cursor, next, after) that you pass back to fetch the following page; the API stops handing you a token when the data is exhausted.
  • Offset pagination. You ask for items starting at a numeric position (offset=40&limit=20) and advance the offset by the page size each request.
  • Page-number pagination. You request page=1, page=2, and so on, and the response tells you the total count or total page count so you know when to stop.
  • Page size. The number of items one request returns, often capped by the API (commonly 20 to 100) and sometimes settable with limit or per_page.
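
As a rough illustration of the terms above, the scheme can often be guessed from which metadata field a response carries. This is only a sketch: detectScheme is a hypothetical helper, and the field names it checks (next_cursor, after, offset, page, total_count, and the pagination / meta wrappers) are assumptions taken from the examples in this guide, so a real API may name or nest them differently.

```javascript
/* Guess the pagination scheme from one parsed response body.
   Hypothetical helper; the field names are assumptions, adjust them
   to the API you are paging. */
function detectScheme(body) {
  if (body === null || typeof body !== 'object') return 'unknown'

  // Some APIs nest the metadata under a wrapper object.
  const wrapper = body.pagination ?? body.meta
  const meta = wrapper !== null && typeof wrapper === 'object' ? wrapper : body

  // A token-like field suggests cursor pagination.
  if ('next_cursor' in meta || 'after' in meta) return 'cursor'
  // An echoed numeric offset suggests offset pagination.
  if (typeof meta.offset === 'number') return 'offset'
  // A page number or a total count suggests page-number pagination.
  if (typeof meta.page === 'number' || typeof meta.total_count === 'number') {
    return 'page-number'
  }
  return 'unknown'
}
```

Calling it on the first raw response is usually enough to decide which loop to wire up; when it returns 'unknown', read the response by hand.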

Here is what the script does:

  • Fetch one page with the global fetch built into Node 18+, parse the JSON body, and collect its items.
  • For a cursor API, read next_cursor from each response and pass it back as ?cursor= until the field comes back empty.
  • For an offset API, advance offset by the page size until a page returns fewer items than the page size.
  • For a page-number API, read the reported total from the first response, compute the page count, and request the remaining pages.
  • Guard every loop with a hard page cap so a malformed has_more or a missing stop signal cannot loop without end.

The complete script

js
// paginate-api.mjs

/* A small helper so every request shares timeout and JSON handling.
   AbortSignal.timeout is built into Node 18+. */
async function getJSON(url) {
  const res = await fetch(url, {
    headers: { 'Accept': 'application/json' },
    signal: AbortSignal.timeout(15000)
  })
  if (!res.ok) throw new Error(`${res.status} ${res.statusText} for ${url}`)
  return res.json()
}

/* 1. CURSOR pagination.
   Follow the opaque token until the API stops returning one.
   GitHub, Stripe, Twitter-style APIs work this way. */
async function paginateByCursor(baseUrl, pageCap = 50) {
  const all = []
  let cursor = null
  let page = 0

  while (page < pageCap) {
    const url = cursor
      ? `${baseUrl}?limit=100&cursor=${encodeURIComponent(cursor)}`
      : `${baseUrl}?limit=100`
    const body = await getJSON(url)

    all.push(...body.items)
    page++

    /* The API hands back a fresh token while there is more data,
       and omits it (null / undefined / '') on the final page. */
    cursor = body.next_cursor
    if (!cursor) break
  }

  return all
}

/* 2. OFFSET pagination.
   Advance a numeric offset by the page size. Stop on a short page:
   a page smaller than the size we asked for is the last one. */
async function paginateByOffset(baseUrl, pageSize = 50, pageCap = 200) {
  const all = []
  let offset = 0

  for (let page = 0; page < pageCap; page++) {
    const url = `${baseUrl}?limit=${pageSize}&offset=${offset}`
    const body = await getJSON(url)

    all.push(...body.items)

    /* A full page might still have more behind it, so keep going.
       A short or empty page means we have reached the end. */
    if (body.items.length < pageSize) break
    offset += pageSize
  }

  return all
}

/* 3. PAGE-NUMBER pagination.
   Read the reported total once, then request pages until we have it.
   WordPress, many REST APIs, and search endpoints work this way. */
async function paginateByPageNumber(baseUrl, perPage = 50, pageCap = 200) {
  const all = []
  const first = await getJSON(`${baseUrl}?per_page=${perPage}&page=1`)
  all.push(...first.results)

  /* total_count is the number of records across all pages.
     Math.ceil gives the page count; we already have page 1. */
  const totalPages = Math.min(
    Math.ceil(first.total_count / perPage),
    pageCap
  )

  for (let page = 2; page <= totalPages; page++) {
    const body = await getJSON(`${baseUrl}?per_page=${perPage}&page=${page}`)
    all.push(...body.results)
  }

  return all
}

/* Swap in the API you are paging and call the matching function. */
const records = await paginateByCursor('https://api.example.com/v1/orders')
console.log(`Collected ${records.length} records`)

bash
node paginate-api.mjs

What each step does

Share one request helper. getJSON sets Accept: application/json, wraps every call in a 15-second AbortSignal.timeout, and throws on any non-2xx status so a 429 or 500 surfaces instead of being parsed as JSON. Reuse it across all three loops so the timeout and error handling stay identical.

Cursor: follow the token, stop when it goes empty. Read body.next_cursor after each request and pass it back as ?cursor=, URL-encoding it because cursor tokens are opaque and often contain characters like = or +. The API supplies a token while more data exists and omits it on the last page, so if (!cursor) break is the natural stop. This is the scheme to prefer when offered, because the token stays valid even as new records arrive during the scrape.
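
A variant of the cursor loop that also survives an API which keeps repeating its last token might look like the sketch below. It makes a few assumptions: fetchPage is a hypothetical callback that takes a URL string and resolves to the parsed body (the getJSON helper fits), the cursor and limit parameter names match the script, and the URL is built with URLSearchParams, which percent-encodes the token for you.

```javascript
/* Cursor loop with a repeated-token guard.
   fetchPage is a hypothetical callback: URL string in, parsed JSON body out. */
async function paginateByCursorSafe(baseUrl, fetchPage, pageCap = 50) {
  const all = []
  const seen = new Set() // every token we have already followed
  let cursor = null

  for (let page = 0; page < pageCap; page++) {
    const url = new URL(baseUrl)
    url.searchParams.set('limit', '100')
    // searchParams percent-encodes the token, including + / and =
    if (cursor) url.searchParams.set('cursor', cursor)

    const body = await fetchPage(url.toString())
    all.push(...(body.items ?? []))

    const next = body.next_cursor
    // Stop on a missing token, or on one we have already followed.
    if (!next || seen.has(next)) break
    seen.add(next)
    cursor = next
  }

  return all
}
```

With the helper from the script, a call would be paginateByCursorSafe('https://api.example.com/v1/orders', getJSON). Tracking every token in a Set, rather than only the previous one, also catches an API that cycles between two tokens.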

Offset: advance by the page size, stop on a short page. Request ?limit=50&offset=0, then offset=50, offset=100, and so on. A page that returns fewer items than pageSize is the last one, which if (body.items.length < pageSize) break catches without needing a total. Offset pagination can skip or repeat records if rows are inserted mid-scrape, so it suits stable datasets better than fast-changing feeds.
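
If the dataset may change while you page, one mitigation is to key the collected records by an identifier so a row that shifts into a later window is stored only once. The sketch below assumes each record has a unique id field and that fetchPage is a hypothetical callback from URL string to parsed body (for example getJSON). It removes repeats; it cannot recover rows that were skipped.

```javascript
/* Offset loop that de-duplicates by record id.
   fetchPage and the `id` field are assumptions for illustration. */
async function paginateByOffsetDedup(baseUrl, fetchPage, pageSize = 50, pageCap = 200) {
  const byId = new Map()
  let offset = 0

  for (let page = 0; page < pageCap; page++) {
    const url = new URL(baseUrl)
    url.searchParams.set('limit', String(pageSize))
    url.searchParams.set('offset', String(offset))

    const body = await fetchPage(url.toString())
    const items = body.items ?? []

    // A row that shifted into this window was seen before; keep one copy.
    for (const item of items) byId.set(item.id, item)

    // A short or empty page is the last one.
    if (items.length < pageSize) break
    offset += pageSize
  }

  return [...byId.values()]
}
```

Comparing the number of records fetched with byId.size afterwards gives a cheap signal that the window moved during the run.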

Page-number: read the total, compute the page count. The first request returns total_count, and Math.ceil(total_count / perPage) gives how many pages cover it. The loop starts at page 2 because page 1 is already collected. If the API does not return a total, total_count is undefined, the page count evaluates to NaN, and the loop exits silently after page 1; in that case fall back to the offset stop condition and keep requesting until a page comes back shorter than perPage.
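
The fallback for an API without a total can be folded into the same loop. The following is a sketch under assumptions: fetchPage is a hypothetical callback from URL string to parsed body, and results and total_count are the field names used in the script. When a numeric total is present it drives the loop; when it is missing, a short page ends it.

```javascript
/* Page-number loop that works with or without a reported total.
   fetchPage, `results`, and `total_count` are assumed names. */
async function paginateByPageNumberSafe(baseUrl, fetchPage, perPage = 50, pageCap = 200) {
  const all = []
  let totalPages = pageCap // unknown until a response reports a total

  for (let page = 1; page <= Math.min(totalPages, pageCap); page++) {
    const url = new URL(baseUrl)
    url.searchParams.set('per_page', String(perPage))
    url.searchParams.set('page', String(page))

    const body = await fetchPage(url.toString())
    const results = body.results ?? []
    all.push(...results)

    if (Number.isFinite(body.total_count)) {
      // Total known: derive the last page number from it.
      totalPages = Math.ceil(body.total_count / perPage)
    } else if (results.length < perPage) {
      // Total missing: fall back to the short-page stop condition.
      break
    }
  }

  return all
}
```

Checking Number.isFinite guards against the NaN page count that a missing total would otherwise produce.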

Cap every loop. pageCap bounds each function regardless of what the API reports. A next_cursor that points at itself, a total_count that disagrees with reality, or a has_more that never flips would otherwise spin forever; the cap turns that into a bounded request count you can reason about.
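
A cap is more useful when the loop also reports why it stopped. The sketch below shows one way to do that with a generic loop; collectWithCap and its fetchNext callback (page index in, { items, done } out, where done stands for the API's own stop signal) are hypothetical names introduced for illustration.

```javascript
/* Generic capped loop that records whether the API or the cap ended it.
   fetchNext is a hypothetical callback resolving to { items, done }. */
async function collectWithCap(fetchNext, pageCap = 50) {
  const all = []
  let stoppedBy = 'cap'

  for (let page = 0; page < pageCap; page++) {
    const { items, done } = await fetchNext(page)
    all.push(...items)

    if (done) {
      stoppedBy = 'api'
      break
    }
  }

  if (stoppedBy === 'cap') {
    // Reaching the cap means the data may be incomplete, so say so.
    console.warn(`Stopped at pageCap=${pageCap}; the data may be incomplete`)
  }

  return { items: all, stoppedBy }
}
```

A caller can then treat stoppedBy === 'cap' as a reason to raise the cap and rerun rather than trusting the record count.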

Gotchas

  • The loop never stops because the cursor stops changing.

    • Issue: Some APIs return the same next_cursor on the final page instead of dropping it, so if (!cursor) break never fires and the loop reissues the same request until the page cap.
    • Fix: compare the token in the response with the one you just sent and break when they match: if (body.next_cursor === cursor) break, placed before cursor = body.next_cursor.
  • Offset paging silently skips or duplicates rows.

    • Issue: When records are inserted or deleted while you page by offset, the window shifts under you, so row 50 on page 2 may be a row you already saw on page 1 or one you skip entirely.
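    • Fix: prefer cursor pagination for live data; if only offset is offered, sort by a stable, immutable key (?order_by=id&offset=...) so that, with ascending ids, new rows land at the end instead of shifting earlier pages, and de-duplicate by id afterwards because deletions can still move the window.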
  • The field names do not match your code.

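    • Issue: The script reads body.items, body.next_cursor, body.results, and body.total_count, but APIs name these data, next, meta.cursor, total, or nest them under pagination. A wrong path evaluates to undefined: spreading a missing items or results array throws a TypeError, while a missing next_cursor or total_count ends the loop after the first page without any error.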
    • Fix: log the first raw response and read its real shape before wiring the loop: console.log(JSON.stringify(body, null, 2).slice(0, 2000)).
  • You get rate limited partway through.

    • Issue: Firing pages back to back trips the API's per-minute limit, and getJSON throws on the resulting 429, losing every page collected so far.
    • Fix: add a short await new Promise(r => setTimeout(r, 250)) between requests, and on a 429 read the Retry-After header and wait that many seconds before retrying the same page. See How to rate-limit requests with backoff in JavaScript.
  • The page cap hides real data instead of just bounding runaway loops.

    • Issue: A dataset with 60,000 records at 100 per page needs 600 requests, well past the default pageCap of 50 or 200, so the script stops early and you assume you have everything.
    • Fix: size pageCap to the dataset (Math.ceil(expectedTotal / pageSize) + a margin) rather than leaving the default, and log when a loop exits on the cap rather than on the API's stop signal.
  • Cursor tokens break when passed in the URL unencoded.

    • Issue: A token containing +, /, or = gets mangled by the query string, so the API rejects it or returns the wrong page.
    • Fix: always wrap it in encodeURIComponent(cursor) as the script does; for very long tokens that some servers reject in a query string, send the cursor in a request body with a POST instead.
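
The rate-limit fix in the list above can be sketched as a retrying variant of getJSON. Several things here are assumptions rather than part of the original script: the name getJSONWithRetry, the injectable fetchImpl and sleep options (present so the function can be exercised without a network), the maxRetries default of 3, and the 1-second fallback wait. Only the seconds form of Retry-After is handled; a header carrying an HTTP date falls back to the default wait.

```javascript
/* getJSON variant that waits and retries the same URL on HTTP 429.
   fetchImpl, sleep, and maxRetries are illustrative options. */
async function getJSONWithRetry(url, {
  fetchImpl = fetch,
  maxRetries = 3,
  sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms))
} = {}) {
  for (let attempt = 0; ; attempt++) {
    const res = await fetchImpl(url, {
      headers: { Accept: 'application/json' },
      signal: AbortSignal.timeout(15000)
    })

    if (res.status === 429 && attempt < maxRetries) {
      // Retry-After in seconds; fall back to 1 s if absent or not numeric.
      const seconds = Number(res.headers.get('Retry-After'))
      await sleep(Number.isFinite(seconds) && seconds > 0 ? seconds * 1000 : 1000)
      continue
    }

    if (!res.ok) throw new Error(`${res.status} ${res.statusText} for ${url}`)
    return res.json()
  }
}
```

Because it takes a URL and resolves to the parsed body, it can stand in for getJSON in any of the three loops, so a 429 pauses the run instead of discarding the pages collected so far.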

Use this when

You have located a JSON API (often by watching the Network tab) that returns data in pages and you want every record, whether it pages by cursor, offset, or page number. This is the right tool for REST and JSON endpoints that hand back structured data plus pagination metadata.

Skip this when

The data is rendered into HTML rather than served as JSON (parse the HTML with cheerio instead); the "next page" is a button that fires more requests without changing the URL (drive it with Puppeteer, see the infinite-scroll guide); the API exposes a bulk export or a single ?limit=all endpoint (use that and skip paging); or each page needs an auth token that expires mid-scrape (refresh the token inside the loop before it lapses).
