How to run a scraper on a schedule with Node.js
If you have a scraper you want to run on a cadence, you have probably reached for setInterval and watched it drift off the clock, or worse, seen two runs collide and clobber each other's output. A scheduled job corrupts itself in two ways: a tick fires while the previous run is still going, or a second copy of the process starts from a redeploy or a manual test. Both become common once a scrape outlives a single hand-run.
The solution is to fire the scrape on a cron expression pinned to a named timezone and guard the work with a cross-process lock, so the cadence stays true to the clock and only one run ever touches the data at a time. That gives you a schedule that lives in code next to the scraper, in about 60 lines of Node.js with node-cron and proper-lockfile.
Key terms
- Cron expression. A compact schedule string like '*/15 * * * *' that node-cron evaluates against the clock to fire the job on a fixed cadence.
- Cross-process lock. A lock held as a real file on disk, so a second process can see it and skip rather than run the same job twice.
- stale timeout. The age after which proper-lockfile treats a lock as abandoned, so a run that crashed without releasing does not block the schedule forever.
- noOverlap. A node-cron option that skips a tick when the previous run is still going, guarding against overlap inside a single process.
- Named timezone. An explicit zone like Etc/UTC passed to the scheduler so the same expression fires at the same real moment on every machine.
Here is what the script does:
- Register a cron expression with node-cron so the scrape function runs on a fixed cadence, in a timezone you name explicitly.
- Hold an on-disk lock with proper-lockfile so a second process, a manual run, or a slow previous tick cannot run the same job twice at once.
- Skip the tick instead of stacking it when the previous run is still going, using node-cron's noOverlap option plus the lock as a second layer.
- Catch every error inside the job so one failed scrape never kills the scheduler and stops every future run.
The complete script
```javascript
// scheduled-scraper.mjs
import cron from 'node-cron'
import lockfile from 'proper-lockfile'
import { writeFile, mkdir } from 'node:fs/promises'

const LOCK_TARGET = './data'      // the directory the lock protects
const CRON_EXPR = '*/15 * * * *'  // every 15 minutes
const TIMEZONE = 'Etc/UTC'        // name it; never trust the host's local time

await mkdir(LOCK_TARGET, { recursive: true })

/* The actual scrape. Real work goes here; this fetches one page and saves it. */
async function scrapeOnce() {
  const res = await fetch('https://news.ycombinator.com/', {
    headers: { 'User-Agent': 'Mozilla/5.0' }
  })
  if (!res.ok) {
    throw new Error('upstream returned ' + res.status)
  }
  const html = await res.text()
  const stamp = new Date().toISOString().replace(/[:.]/g, '-')
  await writeFile(LOCK_TARGET + '/hn-' + stamp + '.html', html)
  console.log('[scrape] saved ' + html.length + ' bytes at ' + stamp)
}

/* Wrap the scrape in a cross-process lock so only one run touches the data dir. */
async function runGuarded() {
  let release
  try {
    release = await lockfile.lock(LOCK_TARGET, { stale: 10 * 60 * 1000 })
  } catch (err) {
    if (err.code === 'ELOCKED') {
      console.log('[scrape] another run holds the lock, skipping this tick')
    } else {
      /* Any other failure is a real problem, so do not report it as a held lock. */
      console.error('[scrape] could not take the lock: ' + err.message)
    }
    return
  }
  try {
    await scrapeOnce()
  } catch (err) {
    /* Swallow the error here. Throwing out of a cron callback would not stop
       the scheduler, but logging it keeps the failure visible. */
    console.error('[scrape] run failed: ' + err.message)
  } finally {
    await release()
  }
}

const task = cron.schedule(CRON_EXPR, runGuarded, {
  name: 'hn-scrape',
  timezone: TIMEZONE,
  noOverlap: true // node-cron skips a tick if the last run is still going
})

console.log('[scheduler] hn-scrape armed: ' + CRON_EXPR + ' (' + TIMEZONE + ')')
console.log('[scheduler] next run: ' + task.getNextRun()?.toISOString())

/* Stop cleanly on Ctrl+C so an in-flight run can release its lock. */
process.on('SIGINT', async () => {
  console.log('[scheduler] stopping')
  await task.stop()
  process.exit(0)
})
```

```shell
npm install node-cron proper-lockfile
node scheduled-scraper.mjs
```

What each step does
Name the timezone, do not infer it. cron.schedule reads the host's local time unless you pass timezone. A server in UTC and a laptop in Berlin run the same expression at different real moments. Passing 'Etc/UTC' (or whatever you actually mean) makes the schedule reproducible across machines.
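To see why the zone has to be named, the sketch below formats one fixed instant as wall-clock time in two zones. It uses only the built-in Intl.DateTimeFormat; wallClock is a hypothetical helper name, and Europe/Berlin and the sample date are picked purely for illustration.

```javascript
// Format one instant as HH:MM wall-clock time in a given IANA zone.
// wallClock is an illustrative helper, not part of node-cron.
function wallClock(instant, timeZone) {
  const parts = new Intl.DateTimeFormat('en-US', {
    timeZone,
    hour: '2-digit',
    minute: '2-digit',
    hourCycle: 'h23'
  }).formatToParts(instant)
  const get = (type) => parts.find((p) => p.type === type).value
  return get('hour') + ':' + get('minute')
}

const instant = new Date('2024-01-15T09:00:00Z')
console.log(wallClock(instant, 'Etc/UTC'))       // 09:00
console.log(wallClock(instant, 'Europe/Berlin')) // 10:00 (Berlin is UTC+1 in January)
```

An expression such as '0 9 * * *' therefore names two different real moments depending on which of these clocks the scheduler reads.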
Take the lock before scraping, release it in finally. proper-lockfile.lock creates a lock directory next to the target and resolves only if no live lock exists. The stale: 10 * 60 * 1000 option means a lock that has gone ten minutes without being refreshed is treated as abandoned, so a crashed run does not block the schedule forever. While the holder is alive, proper-lockfile normally keeps refreshing the lock in the background, so a slow but healthy run is not mistaken for a dead one. The release() in the finally block runs whether the scrape succeeds or throws.
Skip, do not queue, on a held lock. When lock() rejects because another run holds it, the catch logs and returns. The tick is dropped, not buffered. For a scraper you want the freshest data on the next clean tick, not a backlog of stale runs piling up.
Keep errors inside the callback. A throw that escapes the cron callback does not crash node-cron, but it does go unlogged and the failure is invisible. Catching it, logging err.message, and moving on means one 500 from the target site does not silently break the next run.
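The containment pattern can be looked at in isolation. This is a minimal sketch, assuming the job is any async function; runContained and its log parameter are hypothetical names introduced here, not part of node-cron.

```javascript
// Run a job and turn any throw into a logged failure instead of a rejection.
// runContained and the log parameter are illustrative names.
async function runContained(job, log = console.error) {
  try {
    await job()
    return true
  } catch (err) {
    // A thrown value is not always an Error, so fall back to String(err).
    const message = err instanceof Error ? err.message : String(err)
    log('[scrape] run failed: ' + message)
    return false
  }
}

// Usage: a failing job resolves to false instead of rejecting.
runContained(async () => { throw new Error('upstream returned 500') })
  .then((ok) => console.log('succeeded:', ok)) // logs "succeeded: false"
```

Because the wrapper always resolves, the caller never has to handle a rejected promise, and the log line is the only trace a failure leaves.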
Stop on SIGINT. task.stop() lets an in-flight run finish and release its lock before the process exits. Killing the process mid-scrape leaves the lock on disk until the stale window expires.
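Supervisors such as systemd and container runtimes usually stop a process with SIGTERM rather than SIGINT, so it may be worth routing both signals through one handler. A minimal sketch, assuming the caller passes in its own stop and exit functions; onShutdown is a hypothetical helper, not a node-cron API.

```javascript
// Register one shutdown routine for both SIGINT (Ctrl+C) and SIGTERM.
// stop and exit are injected so the helper does not depend on any scheduler.
function onShutdown(stop, exit = (code) => process.exit(code)) {
  let stopping = false
  const handler = async (signal) => {
    if (stopping) return // ignore a repeated signal while shutting down
    stopping = true
    console.log('[scheduler] ' + signal + ' received, stopping')
    try {
      await stop()
      exit(0)
    } catch (err) {
      console.error('[scheduler] stop failed: ' + err.message)
      exit(1)
    }
  }
  process.on('SIGINT', handler)
  process.on('SIGTERM', handler)
  return handler
}

// In the script above this could be wired up as: onShutdown(() => task.stop())
```

The stopping flag matters because an impatient second Ctrl+C would otherwise start a second shutdown while the first is still waiting.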
Gotchas
setInterval drifts off the schedule.
- Issue: setInterval(fn, 900000) counts elapsed milliseconds from when the timer was started, not wall-clock boundaries. Ticks are never aligned to the quarter hour, any delay in firing carries into later ticks so the job slowly desyncs from the wall clock, and the timer does not wait for an async callback to finish.
- Fix: use a cron expression. node-cron computes the next fire time from the clock, so a slow run does not push the following one.
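The difference between counting elapsed time and reading the clock can be made concrete. The helper below is a simplified illustration of clock-aligned scheduling, not how node-cron computes its fire times; msUntilNextBoundary is a hypothetical name.

```javascript
// Milliseconds from nowMs until the next multiple of intervalMs on the clock.
// Boundaries are multiples of the interval since the Unix epoch, so a
// 15-minute interval lands on :00, :15, :30 and :45 UTC.
function msUntilNextBoundary(nowMs, intervalMs) {
  return intervalMs - (nowMs % intervalMs)
}

const QUARTER_HOUR = 15 * 60 * 1000
const sampleNow = Date.UTC(2024, 0, 1, 12, 7, 30) // 12:07:30 UTC
console.log(msUntilNextBoundary(sampleNow, QUARTER_HOUR)) // 450000, i.e. 7.5 minutes to 12:15:00
```

When the delay is recomputed from the current time before every run, a late or slow run does not shift the one after it, which is the property a fixed setInterval delay lacks.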
A long run overlaps the next tick.
- Issue: if a scrape takes longer than the interval, the next tick starts a second run while the first is still writing, and both clobber the same files.
- Fix: pass noOverlap: true to cron.schedule. node-cron skips the new tick while the previous run is unfinished instead of stacking it.
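The skip-instead-of-stack behaviour can be sketched without any scheduler. The wrapper below illustrates the idea only and is not node-cron's implementation; skipWhileRunning is a hypothetical name.

```javascript
// Wrap an async job so a call made while a previous call is still running
// is dropped instead of started. Illustration only; not node-cron's code.
function skipWhileRunning(job) {
  let running = false
  return async function guarded() {
    if (running) return 'skipped'
    running = true
    try {
      await job()
      return 'ran'
    } finally {
      running = false // always clear the flag, even when the job throws
    }
  }
}

// Usage: the second call arrives while the first is still in flight.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms))
const tick = skipWhileRunning(() => sleep(50))
Promise.all([tick(), tick()]).then((results) => console.log(results))
// logs [ 'ran', 'skipped' ]
```

The flag lives in one process's memory, which is why this kind of guard cannot see a run started by a different process.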
A second process runs the same job.
- Issue: noOverlap only guards one process. A redeploy that leaves the old process alive, or a manual node scheduled-scraper.mjs to test, gives you two schedulers firing the same scrape at once.
- Fix: hold a proper-lockfile lock around the work. The lock is a file on disk, so any process can see it, and the second run's lock() rejects and skips.
A crashed run leaves the lock stuck.
- Issue: if the process dies mid-scrape (OOM, kill -9, power loss), the lock stays on disk and every tick skips for as long as that lock is considered live.
- Fix: set a stale timeout, here 10 * 60 * 1000. Once the lock has gone ten minutes without a refresh, proper-lockfile treats it as abandoned and the next tick reclaims it.
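The staleness rule comes down to comparing the lock's last-modified time with the current time. The function below is a simplified model of that check for illustration, not proper-lockfile's actual code; isLockStale and its parameters are hypothetical names.

```javascript
const STALE_MS = 10 * 60 * 1000 // same ten-minute window as the script

// A lock counts as stale when it was last touched more than staleMs ago.
function isLockStale(lockMtimeMs, nowMs, staleMs = STALE_MS) {
  return nowMs - lockMtimeMs > staleMs
}

const noon = Date.UTC(2024, 0, 1, 12, 0, 0)
console.log(isLockStale(noon - 3 * 60 * 1000, noon))  // false: touched 3 minutes ago
console.log(isLockStale(noon - 11 * 60 * 1000, noon)) // true: untouched for 11 minutes
```

Picking the window is a trade-off: too short and a stalled but living run may lose its lock, too long and a crash blocks the schedule for that whole period.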
The schedule jumps an hour on a daylight-saving change.
- Issue: a job set for '0 2 * * *' in a DST zone can run twice or not at all on the two switch days, because that local hour is skipped when clocks go forward and, in some zones, repeated when they go back.
- Fix: schedule in 'Etc/UTC', which has no DST, and convert to local time only for display. Reserve a named local timezone for jobs that genuinely must track local business hours.
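The skipped and repeated hour can be checked directly with the built-in Intl API. The sketch below counts how many real instants show a given local time on a given date; countLocalOccurrences is a hypothetical helper, and Europe/Berlin with its 2024 switch dates is used only as an example zone.

```javascript
// Count how many real instants show the given local date and time in a zone.
// Scans minute by minute across the surrounding days using built-in Intl only.
function countLocalOccurrences(timeZone, localDate, localTime) {
  const fmt = new Intl.DateTimeFormat('en-US', {
    timeZone,
    year: 'numeric', month: '2-digit', day: '2-digit',
    hour: '2-digit', minute: '2-digit', hourCycle: 'h23'
  })
  const [y, m, d] = localDate.split('-').map(Number)
  const [hh, mm] = localTime.split(':').map(Number)
  const start = Date.UTC(y, m - 1, d) - 24 * 60 * 60 * 1000 // one day early
  let count = 0
  for (let i = 0; i < 3 * 24 * 60; i++) { // three days of minutes
    const p = {}
    for (const { type, value } of fmt.formatToParts(new Date(start + i * 60000))) {
      p[type] = Number(value) // literal separators become NaN and are ignored
    }
    const sameDate = p.year === y && p.month === m && p.day === d
    const sameTime = p.hour === hh && p.minute === mm
    if (sameDate && sameTime) count++
  }
  return count
}

console.log(countLocalOccurrences('Europe/Berlin', '2024-03-31', '02:00')) // 0: clocks jump from 02:00 to 03:00
console.log(countLocalOccurrences('Europe/Berlin', '2024-10-27', '02:00')) // 2: 02:00 occurs in CEST and again in CET
console.log(countLocalOccurrences('Etc/UTC', '2024-03-31', '02:00'))       // 1
```

A count of 0 or 2 is exactly the case where a local-time schedule has to either drop a run or decide which of two instants it means.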
The whole schedule dies with the process.
- Issue: node-cron lives inside one Node process. If the machine reboots or the process is killed, nothing restarts it, and the scrape silently stops running.
- Fix: supervise the process with systemd, pm2, or a container restart policy so it comes back after a crash or reboot. node-cron schedules; it does not keep itself alive.
Per-second crons silently drop most ticks.
- Issue: a six-field expression like '*/30 * * * * *' (every 30 seconds) is parsed, but very tight intervals collide with scrape duration and the lock skip drops most ticks.
- Fix: keep the cadence well above the time one scrape takes. If you need sub-minute work, measure the run first and size the interval to leave headroom.
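One way to turn "leave headroom" into a check is to compare the slowest observed run against the interval. This is a rough sketch: hasHeadroom is a hypothetical helper, the sample durations are made up for illustration, and the factor of 3 is an arbitrary margin rather than a recommendation from either library.

```javascript
// Decide whether an interval leaves room for the slowest observed run.
// headroomFactor = 3 is an arbitrary illustrative margin.
function hasHeadroom(runDurationsMs, intervalMs, headroomFactor = 3) {
  const slowest = Math.max(...runDurationsMs)
  return slowest * headroomFactor <= intervalMs
}

const observed = [4200, 5100, 9800] // hypothetical run times in milliseconds
console.log(hasHeadroom(observed, 30 * 1000)) // true: 9.8 s * 3 = 29.4 s fits in 30 s
console.log(hasHeadroom(observed, 15 * 1000)) // false: 29.4 s does not fit in 15 s
```

Sizing against the slowest run rather than the average is the conservative choice, since a single slow run is enough to make the next tick skip.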
Use this when
You have one Node.js process that should scrape a site on a cadence, the job is idempotent enough that skipping an occasional tick is fine, and you want the schedule to live in code next to the scraper rather than in a separate system.
Skip this when
A plain OS cron entry or a hosted scheduler is the better tool. Reach for crontab, a systemd timer, or a GitHub Actions schedule when the scrape is a short one-shot script rather than a daemon you keep running. Use a queue-backed runner like BullMQ or pg-boss when ticks must persist across restarts and survive a missed window. Move to a managed scheduler such as Cloud Scheduler or EventBridge when you are scaling past one machine and need the cadence outside any single process.