How to resume a scrape after a crash
If you've ever had a long scrape die at hour two and leave you starting over from the first URL, you know exactly how this feels. A run that crawls thousands of pages is likely to get interrupted at some point, by a dropped connection, an out-of-memory kill, or a deploy that restarts the machine, and if the job kept no record of what it finished, all of that progress is gone.
The fix is to remember your progress as you go. The solution is to checkpoint each URL's outcome to a local SQLite file the moment it completes, so when the process dies at URL 4,000 of 10,000 it picks back up at 4,001 instead of the top. It takes about 60 lines of Node.js and one native package, better-sqlite3.
Key terms
- Checkpoint. Recording each URL's outcome and status to the database the moment it finishes, so a restart knows exactly what is already done.
- WAL mode. A SQLite journal mode (
journal_mode = WAL) that writes changes to a separate append-only log, lowering commit overhead and letting a reader inspect progress while the writer runs. synchronouspragma. Controls how aggressively SQLite flushes to disk;NORMALfavors throughput and survives crashes,FULLadds the durability needed for abrupt power loss.- Atomic write. Setting the result and the
donestatus in one statement so they commit together, never leaving adonerow without its data. INSERT OR IGNORE. A SQLite insert that skips any row whose primary key already exists, so re-seeding the worklist never resets finished rows back topending.
Here is what the script does:
- Open a SQLite checkpoint file with better-sqlite3 in WAL mode, so each completed URL is recorded in the checkpoint table.
- Seed a
scrape_jobstable with every URL on the first run, marking each onepending. - Query only the
pendinganderrorrows, so a restart picks up exactly the URLs that did not finish. - Scrape each URL, then write its result and a
donestatus in one transaction, which either lands whole or not at all. - Print a resume summary on every start so you can see how many URLs are left before any work happens.
The complete script
// resumable-scrape.mjs
import Database from 'better-sqlite3'
const DB_PATH = 'scrape-checkpoint.db'
// The worklist. On a real run this comes from a sitemap, a CSV, or a seed crawl.
const urls = [
'https://example.com/products/1',
'https://example.com/products/2',
'https://example.com/products/3',
'https://example.com/products/4',
'https://example.com/products/5'
]
const db = new Database(DB_PATH)
// WAL mode commits each transaction to an append-only log. NORMAL synchronous
// favors scraper throughput while still recovering from application and OS crashes.
db.pragma('journal_mode = WAL')
db.pragma('synchronous = NORMAL')
db.exec(`
CREATE TABLE IF NOT EXISTS scrape_jobs (
url TEXT PRIMARY KEY,
status TEXT NOT NULL DEFAULT 'pending',
result TEXT,
error TEXT,
updated_at INTEGER
)
`)
// Seed the worklist once. INSERT OR IGNORE keeps existing rows and their status
// untouched, so re-running after a crash never resets finished work back to pending.
const seed = db.prepare(
`INSERT OR IGNORE INTO scrape_jobs (url, status, updated_at) VALUES (?, 'pending', ?)`
)
const seedAll = db.transaction((list) => {
for (const url of list) seed.run(url, Date.now())
})
seedAll(urls)
// One row, written whole: result plus the new status in a single statement.
const checkpoint = db.prepare(`
UPDATE scrape_jobs
SET status = ?, result = ?, error = ?, updated_at = ?
WHERE url = ?
`)
// Pick up only what did not finish. 'done' rows are skipped on every restart.
const pending = db
.prepare(`SELECT url FROM scrape_jobs WHERE status IN ('pending', 'error') ORDER BY url`)
.all()
const total = db.prepare(`SELECT COUNT(*) AS n FROM scrape_jobs`).get().n
console.log(`[resume] ${total - pending.length} of ${total} already done, ${pending.length} to go`)
for (const { url } of pending) {
try {
const res = await fetch(url, { headers: { 'User-Agent': 'Mozilla/5.0' } })
if (!res.ok) throw new Error(`HTTP ${res.status}`)
const body = await res.text()
// The scrape itself. Swap this line for your real extraction.
const result = JSON.stringify({ length: body.length, title: extractTitle(body) })
// Result and 'done' status land together, after the network work succeeded.
checkpoint.run('done', result, null, Date.now(), url)
console.log(`[ok] ${url}`)
} catch (err) {
// Record the failure as 'error' so the next run retries this URL, not the whole list.
checkpoint.run('error', null, String(err.message), Date.now(), url)
console.error(`[fail] ${url}: ${err.message}`)
}
}
console.log('[resume] worklist drained')
db.close()
function extractTitle(html) {
const match = html.match(/<title[^>]*>([^<]*)<\/title>/i)
return match ? match[1].trim() : null
}npm install better-sqlite3
node resumable-scrape.mjsWhat each step does
Open the database in WAL mode. journal_mode = WAL makes SQLite write changes to a separate log, which reduces commit overhead for this checkpoint-heavy workload. The file persists between runs, so it is the entire memory of the job.
Seed the worklist with INSERT OR IGNORE. The first run inserts every URL as pending. Every later run executes the same seed, but OR IGNORE skips any URL already in the table, so a restart never overwrites a done row back to pending. Wrapping the inserts in db.transaction(...) commits all of them as one unit, which is also far faster than 10,000 separate auto-commits.
Select only unfinished rows. The resume query asks for pending and error rows and orders them, so the loop body never sees a URL that already succeeded. Including error means a transient failure (a timeout, a 429) gets retried on the next run rather than abandoned. If you want to stop retrying after N attempts, add an attempts column and filter on it.
Checkpoint after the network call, not before. The UPDATE runs only once fetch resolved and the body parsed without throwing. Because the result string and the done status are set in the same statement, better-sqlite3 commits them together, so any committed done row includes the stored result.
Gotchas
Writing the checkpoint before the scrape loses data on a crash.
- Issue: marking a row
donefirst and scraping second means a crash between the two leaves adonerow with no result, and the resume query then skips that URL forever. - Fix: do the network work first and call
checkpoint.run('done', result, ...)only after it succeeds, so the status and the data commit in one transaction.
- Issue: marking a row
Default rollback-journal mode is slower for this workload.
- Issue: rollback journal is crash-recoverable, but it has higher commit overhead and weaker read/write concurrency than WAL on a checkpoint-heavy scrape.
- Fix: set
db.pragma('journal_mode = WAL')once after opening so frequent checkpoints stay cheap and readers can inspect progress while the writer keeps running.
Re-seeding resets finished rows back to pending.
- Issue: using a plain
INSERTor anINSERT OR REPLACEto seed the worklist overwrites every existing row, wiping thedonestatus so the next run re-scrapes the entire list. - Fix: use
INSERT OR IGNOREso seeding only adds URLs that are not already tracked, and leaves existing statuses alone.
- Issue: using a plain
A SIGKILL leaves rows stuck in an in-progress state if you add one.
- Issue: if you introduce a
doingstatus set before the fetch, a hard kill leaves those rows asdoingforever, and neither thedonefilter nor thependingfilter picks them up on restart. - Fix: include
doingin the resume query (status IN ('pending', 'error', 'doing')), or skip the in-progress state entirely as the script above does and letpendingcover it.
- Issue: if you introduce a
WAL files need a clean close or they grow without bound.
- Issue: killing the process repeatedly without
db.close()leaves the-walsidecar file growing across runs, since it only checkpoints back into the main database on a clean shutdown. - Fix: call
db.close()at the end of a normal run, and rundb.pragma('wal_checkpoint(TRUNCATE)')periodically on very long jobs to fold the log back in.
- Issue: killing the process repeatedly without
Concurrent workers on one SQLite file serialize on writes.
- Issue: pointing several scraper processes at the same
.dbfile makes them block each other, because SQLite allows one writer at a time and a busy writer throwsSQLITE_BUSY. - Fix: keep checkpointing single-process and parallelize inside it with a promise pool, or move the worklist to a Postgres or Redis queue once you need multiple machines.
- Issue: pointing several scraper processes at the same
Use this when
You run a long batch scrape from a single process and want it to survive crashes, restarts, deploys, and Ctrl-C without redoing finished work. A local checkpoint file is enough for thousands to low millions of URLs on one machine.
Skip this when
Multiple machines or processes need to pull from one shared worklist, in which case use a server-backed queue such as pg-boss on Postgres or BullMQ on Redis. Skip it for short scrapes that finish in seconds, where re-running from the top is cheaper than maintaining a checkpoint. Skip it when you only need to avoid re-scraping the same URL rather than resume an interrupted run, where a Bloom filter or dedupe set is the lighter tool.