Simplescraper
Skip to content

How to resume a scrape after a crash

How to resume a scrape after a crash

Updated 2026-06-24 · 6 min read

If you've ever had a long scrape die at hour two and leave you starting over from the first URL, you know exactly how this feels. A run that crawls thousands of pages is likely to get interrupted at some point, by a dropped connection, an out-of-memory kill, or a deploy that restarts the machine, and if the job kept no record of what it finished, all of that progress is gone.

The fix is to remember your progress as you go. The solution is to checkpoint each URL's outcome to a local SQLite file the moment it completes, so when the process dies at URL 4,000 of 10,000 it picks back up at 4,001 instead of the top. It takes about 60 lines of Node.js and one native package, better-sqlite3.

Key terms

  • Checkpoint. Recording each URL's outcome and status to the database the moment it finishes, so a restart knows exactly what is already done.
  • WAL mode. A SQLite journal mode (journal_mode = WAL) that writes changes to a separate append-only log, lowering commit overhead and letting a reader inspect progress while the writer runs.
  • synchronous pragma. Controls how aggressively SQLite flushes to disk; NORMAL favors throughput and survives crashes, FULL adds the durability needed for abrupt power loss.
  • Atomic write. Setting the result and the done status in one statement so they commit together, never leaving a done row without its data.
  • INSERT OR IGNORE. A SQLite insert that skips any row whose primary key already exists, so re-seeding the worklist never resets finished rows back to pending.

Here is what the script does:

  • Open a SQLite checkpoint file with better-sqlite3 in WAL mode, so each completed URL is recorded in the checkpoint table.
  • Seed a scrape_jobs table with every URL on the first run, marking each one pending.
  • Query only the pending and error rows, so a restart picks up exactly the URLs that did not finish.
  • Scrape each URL, then write its result and a done status in one transaction, which either lands whole or not at all.
  • Print a resume summary on every start so you can see how many URLs are left before any work happens.

The complete script

js
// resumable-scrape.mjs
import Database from 'better-sqlite3'

const DB_PATH = 'scrape-checkpoint.db'

// The worklist. On a real run this comes from a sitemap, a CSV, or a seed crawl.
const urls = [
  'https://example.com/products/1',
  'https://example.com/products/2',
  'https://example.com/products/3',
  'https://example.com/products/4',
  'https://example.com/products/5'
]

const db = new Database(DB_PATH)

// WAL mode commits each transaction to an append-only log. NORMAL synchronous
// favors scraper throughput while still recovering from application and OS crashes.
db.pragma('journal_mode = WAL')
db.pragma('synchronous = NORMAL')

db.exec(`
  CREATE TABLE IF NOT EXISTS scrape_jobs (
    url        TEXT PRIMARY KEY,
    status     TEXT NOT NULL DEFAULT 'pending',
    result     TEXT,
    error      TEXT,
    updated_at INTEGER
  )
`)

// Seed the worklist once. INSERT OR IGNORE keeps existing rows and their status
// untouched, so re-running after a crash never resets finished work back to pending.
const seed = db.prepare(
  `INSERT OR IGNORE INTO scrape_jobs (url, status, updated_at) VALUES (?, 'pending', ?)`
)
const seedAll = db.transaction((list) => {
  for (const url of list) seed.run(url, Date.now())
})
seedAll(urls)

// One row, written whole: result plus the new status in a single statement.
const checkpoint = db.prepare(`
  UPDATE scrape_jobs
  SET status = ?, result = ?, error = ?, updated_at = ?
  WHERE url = ?
`)

// Pick up only what did not finish. 'done' rows are skipped on every restart.
const pending = db
  .prepare(`SELECT url FROM scrape_jobs WHERE status IN ('pending', 'error') ORDER BY url`)
  .all()

const total = db.prepare(`SELECT COUNT(*) AS n FROM scrape_jobs`).get().n
console.log(`[resume] ${total - pending.length} of ${total} already done, ${pending.length} to go`)

for (const { url } of pending) {
  try {
    const res = await fetch(url, { headers: { 'User-Agent': 'Mozilla/5.0' } })
    if (!res.ok) throw new Error(`HTTP ${res.status}`)
    const body = await res.text()

    // The scrape itself. Swap this line for your real extraction.
    const result = JSON.stringify({ length: body.length, title: extractTitle(body) })

    // Result and 'done' status land together, after the network work succeeded.
    checkpoint.run('done', result, null, Date.now(), url)
    console.log(`[ok]    ${url}`)
  } catch (err) {
    // Record the failure as 'error' so the next run retries this URL, not the whole list.
    checkpoint.run('error', null, String(err.message), Date.now(), url)
    console.error(`[fail]  ${url}: ${err.message}`)
  }
}

console.log('[resume] worklist drained')
db.close()

function extractTitle(html) {
  const match = html.match(/<title[^>]*>([^<]*)<\/title>/i)
  return match ? match[1].trim() : null
}
bash
npm install better-sqlite3
node resumable-scrape.mjs

What each step does

Open the database in WAL mode. journal_mode = WAL makes SQLite write changes to a separate log, which reduces commit overhead for this checkpoint-heavy workload. The file persists between runs, so it is the entire memory of the job.

Seed the worklist with INSERT OR IGNORE. The first run inserts every URL as pending. Every later run executes the same seed, but OR IGNORE skips any URL already in the table, so a restart never overwrites a done row back to pending. Wrapping the inserts in db.transaction(...) commits all of them as one unit, which is also far faster than 10,000 separate auto-commits.

Select only unfinished rows. The resume query asks for pending and error rows and orders them, so the loop body never sees a URL that already succeeded. Including error means a transient failure (a timeout, a 429) gets retried on the next run rather than abandoned. If you want to stop retrying after N attempts, add an attempts column and filter on it.

Checkpoint after the network call, not before. The UPDATE runs only once fetch resolved and the body parsed without throwing. Because the result string and the done status are set in the same statement, better-sqlite3 commits them together, so any committed done row includes the stored result.

Gotchas

  • Writing the checkpoint before the scrape loses data on a crash.

    • Issue: marking a row done first and scraping second means a crash between the two leaves a done row with no result, and the resume query then skips that URL forever.
    • Fix: do the network work first and call checkpoint.run('done', result, ...) only after it succeeds, so the status and the data commit in one transaction.
  • Default rollback-journal mode is slower for this workload.

    • Issue: rollback journal is crash-recoverable, but it has higher commit overhead and weaker read/write concurrency than WAL on a checkpoint-heavy scrape.
    • Fix: set db.pragma('journal_mode = WAL') once after opening so frequent checkpoints stay cheap and readers can inspect progress while the writer keeps running.
  • Re-seeding resets finished rows back to pending.

    • Issue: using a plain INSERT or an INSERT OR REPLACE to seed the worklist overwrites every existing row, wiping the done status so the next run re-scrapes the entire list.
    • Fix: use INSERT OR IGNORE so seeding only adds URLs that are not already tracked, and leaves existing statuses alone.
  • A SIGKILL leaves rows stuck in an in-progress state if you add one.

    • Issue: if you introduce a doing status set before the fetch, a hard kill leaves those rows as doing forever, and neither the done filter nor the pending filter picks them up on restart.
    • Fix: include doing in the resume query (status IN ('pending', 'error', 'doing')), or skip the in-progress state entirely as the script above does and let pending cover it.
  • WAL files need a clean close or they grow without bound.

    • Issue: killing the process repeatedly without db.close() leaves the -wal sidecar file growing across runs, since it only checkpoints back into the main database on a clean shutdown.
    • Fix: call db.close() at the end of a normal run, and run db.pragma('wal_checkpoint(TRUNCATE)') periodically on very long jobs to fold the log back in.
  • Concurrent workers on one SQLite file serialize on writes.

    • Issue: pointing several scraper processes at the same .db file makes them block each other, because SQLite allows one writer at a time and a busy writer throws SQLITE_BUSY.
    • Fix: keep checkpointing single-process and parallelize inside it with a promise pool, or move the worklist to a Postgres or Redis queue once you need multiple machines.

Use this when

You run a long batch scrape from a single process and want it to survive crashes, restarts, deploys, and Ctrl-C without redoing finished work. A local checkpoint file is enough for thousands to low millions of URLs on one machine.

Skip this when

Multiple machines or processes need to pull from one shared worklist, in which case use a server-backed queue such as pg-boss on Postgres or BullMQ on Redis. Skip it for short scrapes that finish in seconds, where re-running from the top is cheaper than maintaining a checkpoint. Skip it when you only need to avoid re-scraping the same URL rather than resume an interrupted run, where a Bloom filter or dedupe set is the lighter tool.

Skip the code, just get the data

Simplescraper turns any website into structured data in seconds.