Simplescraper

How to retry failed scrapes with exponential backoff


Updated 2026-06-24 · 5 min read

If you've watched a scrape die on a single transient hiccup, you've probably lost a whole job to a momentary 429, a 503 during a deploy, or one dropped connection that would have worked on a second try. A flat retry loop on a fixed delay does not help much: three tries one second apart hit a struggling server inside three seconds and push it further down, and a fleet of workers all looping on the same delay arrives in synchronized waves that look like an attack. This is a common failure mode at scale, and a spaced-out retry handles it.

The solution is to wrap the fetch in exponential backoff with jitter and retry only the errors worth retrying, so a momentarily overloaded server gets widening room to recover while a permanent 404 or 403 fails fast instead of burning the budget. That takes about 40 lines of Node.js with one open-source library, p-retry.

Key terms

  • Exponential backoff. Spacing each retry on a widening curve (1s, 2s, 4s, 8s) so every attempt gives the server meaningfully more room than the last.
  • Jitter. Randomizing each backoff delay by a fraction so parallel workers spread their retries across the window instead of arriving in lockstep.
  • Transient failure. A momentary error like a 429, a 5xx, or a dropped connection that a retry can plausibly succeed past, as opposed to a permanent 404 or 403.
  • AbortError. An error class exported by p-retry; throwing it stops the loop immediately, which short-circuits retries on a permanent failure.
  • Idempotent. A request you can safely repeat without duplicating a side effect, which is why retries are safe on reads but risky on writes.

Here is what the script does:

  • Fetch a page and treat HTTP 429 and 5xx responses as retryable failures by throwing on them.
  • Retry with p-retry, which spaces attempts on an exponential curve (1s, 2s, 4s, 8s) so a momentarily overloaded server gets time to recover.
  • Add jitter so a hundred parallel scrapers do not all retry on the same tick and hammer the server in lockstep.
  • Filter the errors: retry timeouts, 429s, and transient 5xx responses, but give up immediately on a 404 or a 403, because retrying those just wastes time.

The complete script

```js
// retry-scrape.mjs
import pRetry, { AbortError } from 'p-retry'

const url = 'https://httpbin.org/status/200'

// HTTP statuses that are worth retrying. A 429 or a 5xx is the server
// asking us to back off; a 404 or 403 will never become a 200 on retry.
const RETRYABLE_STATUS = new Set([408, 425, 429, 500, 502, 503, 504])

async function scrapeOnce(url) {
  const res = await fetch(url, {
    headers: { 'User-Agent': 'Mozilla/5.0' },
    // Cap each attempt so a hung connection fails fast instead of
    // burning the whole retry budget on one stalled request.
    signal: AbortSignal.timeout(15_000)
  })

  if (!res.ok) {
    // AbortError tells p-retry to stop immediately, no more attempts.
    if (!RETRYABLE_STATUS.has(res.status)) {
      throw new AbortError(`Permanent HTTP ${res.status} for ${url}`)
    }
    throw new Error(`Retryable HTTP ${res.status} for ${url}`)
  }

  return res.text()
}

const html = await pRetry(() => scrapeOnce(url), {
  retries: 4,           // up to 4 retries after the first attempt
  factor: 2,            // double the wait each time: 1s, 2s, 4s, 8s
  minTimeout: 1_000,    // first backoff is 1 second
  maxTimeout: 30_000,   // never wait more than 30 seconds between tries
  randomize: true,      // jitter the delays so parallel workers desync
  onFailedAttempt: ({ error, attemptNumber, retriesLeft }) => {
    // p-retry passes a context object: the thrown error plus attempt metadata.
    console.log(`attempt ${attemptNumber} failed: ${error.message} (${retriesLeft} left)`)
  }
})

console.log(`Got ${html.length} bytes`)
```

```bash
npm install p-retry
node retry-scrape.mjs
```

What each step does

Define what is retryable. RETRYABLE_STATUS is a Set of the status codes where a retry can plausibly succeed: request timeout (408), too early (425), rate limited (429), and the transient server errors (500, 502, 503, 504). Everything else is treated as final. Using a lookup set rather than a range check keeps the policy in one place and easy to audit.
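To see why an explicit set can be easier to audit than a range check, here is a small comparison. The retryableByRange helper is hypothetical and exists only for contrast; the claim that a 501 or 505 is unlikely to recover on retry is a general assumption about those statuses, not something measured here.

```js
const RETRYABLE_STATUS = new Set([408, 425, 429, 500, 502, 503, 504])

// Policy from the script: only listed statuses are retried.
const retryableBySet = (status) => RETRYABLE_STATUS.has(status)

// Hypothetical range-based alternative, shown only for comparison.
const retryableByRange = (status) => status === 429 || status >= 500

for (const status of [429, 501, 503, 505]) {
  // The two policies disagree on 501 and 505: the range check retries them,
  // the set treats them as final.
  console.log(status, retryableBySet(status), retryableByRange(status))
}
```

With the set, changing the policy means editing one line, and every retried status is visible at a glance.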

Throw to signal a retry. p-retry decides whether to retry by whether your function rejects. A resolved promise means success and stops the loop, so scrapeOnce must throw on a bad status rather than returning the failed response. A fetch that gets a 503 resolves normally with res.ok === false, so you have to inspect the status yourself.
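The point about res.ok can be checked without touching the network. This sketch builds a Response object locally (assuming Node 18 or later, where Response is a global) and uses a hypothetical ensureOk helper to turn a bad status into a thrown error.

```js
// Build a Response locally so the check runs without a real request.
const overloaded = new Response('Service Unavailable', { status: 503 })
console.log(overloaded.ok, overloaded.status) // false 503

// Hypothetical helper: convert a bad status into a thrown error, which is
// what a retry wrapper needs in order to see the attempt as failed.
function ensureOk(res) {
  if (!res.ok) throw new Error(`HTTP ${res.status}`)
  return res
}

let message = null
try {
  ensureOk(overloaded)
} catch (error) {
  message = error.message
}
console.log(message) // HTTP 503
```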

Abort on permanent failures. Throwing a plain Error tells p-retry to retry. Throwing an AbortError tells it to stop now and reject with that error. A 404 or 403 goes through the AbortError branch so the loop exits on the first attempt instead of waiting out the full backoff schedule.
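The difference between the two throws is easiest to see in a stripped-down loop. This is a simplified stand-in for the control flow, not p-retry's actual implementation: StopRetrying and retryLoop are hypothetical names, and the backoff sleep is left out so the sketch finishes instantly.

```js
// Hypothetical stand-in for p-retry's AbortError, defined here only so the
// sketch runs without installing anything.
class StopRetrying extends Error {}

async function retryLoop(task, retries) {
  for (let attempt = 1; ; attempt++) {
    try {
      return await task(attempt)
    } catch (error) {
      // Permanent failure or budget exhausted: surface the error now.
      if (error instanceof StopRetrying || attempt > retries) throw error
      // A real loop would sleep for the backoff delay here.
    }
  }
}

let permanentCalls = 0
retryLoop(() => {
  permanentCalls++
  throw new StopRetrying('Permanent HTTP 404')
}, 4).catch((error) => {
  console.log(`${error.message} after ${permanentCalls} attempt`)
})

let transientCalls = 0
retryLoop(() => {
  transientCalls++
  throw new Error('Retryable HTTP 503')
}, 4).catch((error) => {
  console.log(`${error.message} after ${transientCalls} attempts`)
})
```

The permanent error surfaces after one attempt, while the plain Error runs through all five (the first try plus four retries) before it is rethrown.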

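Tune the backoff curve. factor: 2 with minTimeout: 1000 produces a base schedule of 1s, 2s, 4s, 8s. maxTimeout: 30000 clamps the upper end so a high retry count cannot schedule a multi-minute sleep. randomize: true applies the jitter by multiplying each delay by a random factor between 1 and 2, so the actual waits land between 1s and 2s, 2s and 4s, and so on. retries: 4 means five total attempts, the first try plus four retries.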
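The schedule can be worked out by hand. The sketch below assumes the formula documented by the retry package, min(jitter × minTimeout × factor^attempt, maxTimeout), with jitter drawn from 1 to 2 when randomize is on; treat it as an approximation of what p-retry does rather than its source. backoffDelays and the rng parameter are hypothetical names introduced for illustration.

```js
// Sketch of a backoff schedule. The formula is an assumption based on the
// `retry` package docs: min(jitter * minTimeout * factor^attempt, maxTimeout).
function backoffDelays({ retries, factor, minTimeout, maxTimeout, randomize = false, rng = Math.random }) {
  const delays = []
  for (let attempt = 0; attempt < retries; attempt++) {
    // With randomize on, each delay is stretched by a factor in [1, 2).
    const jitter = randomize ? 1 + rng() : 1
    const raw = Math.round(jitter * minTimeout * factor ** attempt)
    // Clamp so a long schedule never exceeds maxTimeout.
    delays.push(Math.min(raw, maxTimeout))
  }
  return delays
}

const base = { factor: 2, minTimeout: 1_000, maxTimeout: 30_000 }

console.log(backoffDelays({ ...base, retries: 4 }).join(', '))
// 1000, 2000, 4000, 8000

console.log(backoffDelays({ ...base, retries: 7 }).join(', '))
// 1000, 2000, 4000, 8000, 16000, 30000, 30000

// Varies per run; each entry falls between 1x and 2x its base delay.
console.log(backoffDelays({ ...base, retries: 4, randomize: true }).join(', '))
```

The seven-retry run shows the clamp at work: without maxTimeout the last two waits would be 32 and 64 seconds.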

Watch each failure. onFailedAttempt fires after every failed try with a context object carrying the thrown error, the attemptNumber, and retriesLeft, which is where logging belongs. Throwing inside this callback aborts the whole retry, so keep it to observation.

Gotchas

  • A fetch that resolves with a 500 status is treated as success.

    • Issue: fetch only rejects on a network-level failure, so await fetch(url) resolves normally for a 500 or a 429 and p-retry never sees a reason to retry.
    • Fix: check res.ok (or res.status) inside the retried function and throw on the statuses you want retried, as scrapeOnce does.
  • Retrying a 404 or 403 wastes the whole backoff budget.

    • Issue: treating every non-2xx the same way means a permanent 404 gets retried four times, adding the full 1s + 2s + 4s + 8s of sleep before it fails anyway.
    • Fix: throw an AbortError for non-retryable statuses so p-retry stops on the first attempt instead of running the schedule.
  • No jitter means parallel workers retry in lockstep.

    • Issue: with randomize off, a hundred workers that all failed at the same moment back off by the identical 1s, 2s, 4s and arrive in synchronized bursts that keep the server pinned.
    • Fix: set randomize: true so each worker's delay is spread across the backoff window and the load smooths out.
  • A 429 with a Retry-After header is ignored.

    • Issue: the exponential curve picks its own delays, so when a server explicitly says Retry-After: 30, retrying after only 1s just earns another 429.
    • Fix: read res.headers.get('retry-after') in the failed branch and sleep that long before throwing, or short-circuit to a longer minTimeout for rate-limit responses.
  • One slow request eats the entire retry budget.

    • Issue: without a per-attempt timeout, a single hung connection can stall for the platform default (often 300 seconds), so the remaining retries never get a chance to fire.
    • Fix: pass signal: AbortSignal.timeout(15_000) to each fetch so a stalled attempt aborts and becomes a retryable failure quickly.
  • Non-idempotent writes get duplicated.

    • Issue: wrapping a POST that submits a form or records a job in retry logic can run the side effect more than once if the first call succeeded on the server but the response was lost.
    • Fix: only retry idempotent reads, or make the write idempotent with a request key the server deduplicates on before adding retries.
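For the Retry-After gotcha above, the header value has to be parsed before it can be used as a sleep. A minimal sketch, assuming the two forms the HTTP specification allows (a number of seconds or an HTTP date); retryAfterMs is a hypothetical helper name, not part of p-retry.

```js
// Convert a Retry-After header value into a wait in milliseconds.
// Returns null when the header is missing or cannot be parsed.
function retryAfterMs(value, now = Date.now()) {
  if (value == null) return null
  const trimmed = value.trim()
  // Form 1: a delay in whole seconds, e.g. "30".
  if (/^\d+$/.test(trimmed)) return Number(trimmed) * 1000
  // Form 2: an HTTP date, e.g. "Wed, 21 Oct 2015 07:28:00 GMT".
  const when = Date.parse(trimmed)
  if (Number.isNaN(when)) return null
  // A date in the past means there is nothing left to wait for.
  return Math.max(0, when - now)
}

console.log(retryAfterMs('30')) // 30000
console.log(retryAfterMs(null)) // null

const now = Date.parse('Wed, 21 Oct 2015 07:27:30 GMT')
console.log(retryAfterMs('Wed, 21 Oct 2015 07:28:00 GMT', now)) // 30000
```

In the script, this would be called with res.headers.get('retry-after') in the non-ok branch. One option is to sleep for the returned value (for example with setTimeout from node:timers/promises) before throwing the retryable error, ideally capped so a hostile or mistaken header cannot stall the worker for hours.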

Use this when

You are scraping at scale and want a single page fetch to survive a transient hiccup (a momentary 429, a 503 during a deploy, or a dropped connection) without failing the whole job or hand-rolling a retry loop per call site.

Skip this when

The failure is permanent and retrying cannot fix it, such as a 404 or a 403 (fix the URL or the auth); you need to cap total parallel load rather than handle one call's failures (use a rate limiter like p-throttle); a host keeps failing and you want to stop sending it traffic entirely (use a circuit breaker like cockatiel or opossum); or the work must persist across a process restart (use a job queue like BullMQ).
