How to resume a scrape after a crash

Updated 2026-06-24 · 6 min read

If you've ever had a long scrape die at hour two and leave you starting over from the first URL, you know exactly how this feels. A run that crawls thousands of pages is likely to get interrupted at some point, by a dropped connection, an out-of-memory kill, or a deploy that restarts the machine, and if the job kept no record of what it finished, all of that progress is gone.

The fix is to checkpoint each URL's outcome to a local SQLite file the moment it completes, so when the process dies at URL 4,000 of 10,000 it picks back up at 4,001 instead of the top. We'll build a small script that does four things. It records every finished URL and its status in a checkpoint file, so a restart knows exactly what is already done. It seeds the full worklist once without ever resetting finished rows. It queries only the URLs that did not finish, so the loop skips completed work. And it writes each result and its done status together, so a crash never leaves a finished row without its data. It comes out to about 60 lines of Node.js and one native package, better-sqlite3.

The complete script

// resumable-scrape.mjs
import Database from 'better-sqlite3'

const DB_PATH = 'scrape-checkpoint.db'

// the worklist. on a real run this comes from a sitemap, a CSV, or a seed crawl.
const urls = [
  'https://example.com/products/1',
  'https://example.com/products/2',
  'https://example.com/products/3',
  'https://example.com/products/4',
  'https://example.com/products/5'
]

const db = new Database(DB_PATH)

// WAL mode appends each committed transaction to a write-ahead log. With
// synchronous = NORMAL the file stays consistent after any crash; an OS crash or
// power loss may roll back the last few commits, which only means re-scraping those URLs.
db.pragma('journal_mode = WAL')
db.pragma('synchronous = NORMAL')

db.exec(`
  CREATE TABLE IF NOT EXISTS scrape_jobs (
    url        TEXT PRIMARY KEY,
    status     TEXT NOT NULL DEFAULT 'pending',
    result     TEXT,
    error      TEXT,
    updated_at INTEGER
  )
`)

// seed the worklist once. INSERT OR IGNORE keeps existing rows and their status
// untouched, so re-running after a crash never resets finished work back to pending.
const seed = db.prepare(
  `INSERT OR IGNORE INTO scrape_jobs (url, status, updated_at) VALUES (?, 'pending', ?)`
)
const seedAll = db.transaction((list) => {
  for (const url of list) seed.run(url, Date.now())
})
seedAll(urls)

// one row, written whole: result plus the new status in a single statement.
const checkpoint = db.prepare(`
  UPDATE scrape_jobs
  SET status = ?, result = ?, error = ?, updated_at = ?
  WHERE url = ?
`)

// pick up only what did not finish. 'done' rows are skipped on every restart.
const pending = db
  .prepare(`SELECT url FROM scrape_jobs WHERE status IN ('pending', 'error') ORDER BY url`)
  .all()

const total = db.prepare(`SELECT COUNT(*) AS n FROM scrape_jobs`).get().n
console.log(`[resume] ${total - pending.length} of ${total} already done, ${pending.length} to go`)

for (const { url } of pending) {
  try {
    const res = await fetch(url, { headers: { 'User-Agent': 'Mozilla/5.0' } })
    if (!res.ok) throw new Error(`HTTP ${res.status}`)
    const body = await res.text()

    // the scrape itself. swap this line for your real extraction.
    const result = JSON.stringify({ length: body.length, title: extractTitle(body) })

    // result and 'done' status land together, after the network work succeeded.
    checkpoint.run('done', result, null, Date.now(), url)
    console.log(`[ok]    ${url}`)
  } catch (err) {
    // record the failure as 'error' so the next run retries this URL, not the whole list.
    checkpoint.run('error', null, String(err.message), Date.now(), url)
    console.error(`[fail]  ${url}: ${err.message}`)
  }
}

console.log('[resume] worklist drained')
db.close()

function extractTitle(html) {
  const match = html.match(/<title[^>]*>([^<]*)<\/title>/i)
  return match ? match[1].trim() : null
}

Install the dependency and run the script:

npm install better-sqlite3
node resumable-scrape.mjs

How it works

Open the database in WAL mode. journal_mode = WAL makes SQLite write changes to a separate log, which reduces commit overhead for this checkpoint-heavy workload. The file persists between runs, so it is the entire memory of the job. SQLite folds the -wal sidecar back into the main database automatically once it passes about 1,000 pages, and again when the last connection closes cleanly, so call db.close() at the end of a normal run. After a hard kill the sidecar stays on disk and SQLite replays it on the next open, so committed checkpoints are not lost. If the sidecar still grows large on a very long job, run db.pragma('wal_checkpoint(TRUNCATE)') periodically to fold the log back in and shrink the file.

Seed the worklist with INSERT OR IGNORE. The first run inserts every URL as pending. Every later run executes the same seed, but OR IGNORE skips any URL already in the table, so a restart never overwrites a done row back to pending. A plain INSERT or INSERT OR REPLACE here would wipe finished rows and re-scrape the whole list, which is the trap OR IGNORE avoids. Wrapping the inserts in db.transaction(...) commits all of them as one unit, which is also far faster than 10,000 separate auto-commits.

Select only unfinished rows. The resume query asks for pending and error rows and orders them, so the loop body never sees a URL that already succeeded. Including error means a transient failure (a timeout, a 429) gets retried on the next run rather than abandoned. If you want to stop retrying after N attempts, add an attempts column and filter on it. If you ever add a doing status that is set before the fetch, a hard kill leaves those rows stuck as doing and the query above never picks them up, so include it in the filter (status IN ('pending', 'error', 'doing')). The script above skips the in-progress state entirely and lets pending cover it.

Checkpoint after the network call, not before. The UPDATE runs only once fetch has resolved and the body has parsed without throwing. Because the result string and the done status are set in the same statement, SQLite commits them atomically, so any committed done row includes the stored result. Mark the row done before the scrape instead and a crash between the two leaves a done row with no result that the resume query then skips forever.

This stays single-process. SQLite allows one writer at a time, so a second scraper on the same .db file waits on the write lock and throws SQLITE_BUSY if the wait times out. It would also select the same pending rows at startup and fetch them twice. Parallelize inside one process with a promise pool, and move to a server-backed queue when you need multiple machines.
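A promise pool for the single-process case can be sketched in a few lines. runPool and fakeScrape are hypothetical names, and fakeScrape is a stand-in that sleeps instead of fetching; in the real script the worker would be the body of the for loop, ending in checkpoint.run(...).

```javascript
// run worker(item) over items with at most `limit` calls in flight
async function runPool(items, limit, worker) {
  const queue = items.entries() // shared iterator: each lane pulls the next [index, item]
  const results = new Array(items.length)

  async function lane() {
    for (const [i, item] of queue) {
      try {
        results[i] = { ok: true, value: await worker(item) }
      } catch (err) {
        // one failed item must not stop the lane
        results[i] = { ok: false, error: err }
      }
    }
  }

  const lanes = Array.from({ length: Math.min(limit, items.length) }, lane)
  await Promise.all(lanes)
  return results
}

// stand-in for fetch + checkpoint.run(...): sleeps, then returns a value
let inFlight = 0
let peak = 0
const fakeScrape = async (url) => {
  inFlight++
  peak = Math.max(peak, inFlight)
  await new Promise((resolve) => setTimeout(resolve, 10))
  inFlight--
  return url.length
}

const demoUrls = ['a', 'bb', 'ccc', 'dddd', 'eeeee']
const poolDone = runPool(demoUrls, 2, fakeScrape)
poolDone.then((results) => console.log(peak, results.map((r) => r.value)))
```

Because better-sqlite3 calls are synchronous, checkpoint writes made from different lanes never interleave, so the one-row-written-whole guarantee still holds with several fetches in flight.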

Use this when

You run a long batch scrape from a single process and want it to survive crashes, restarts, deploys, and Ctrl-C without redoing finished work. A local checkpoint file is enough for thousands to low millions of URLs on one machine.

Skip this when

Multiple machines or processes need to pull from one shared worklist, in which case use a server-backed queue such as pg-boss on Postgres or BullMQ on Redis. Skip it for short scrapes that finish in seconds, where re-running from the top is cheaper than maintaining a checkpoint. Skip it when you only need to avoid re-scraping the same URL rather than resume an interrupted run, where a Bloom filter or dedupe set is the lighter tool.
