Simplescraper

How to run a scraper on a schedule with Node.js

Updated 2026-06-24 · 6 min read

If you have a scraper you want to run on a cadence, you have probably reached for setInterval and watched it drift off the clock, or worse, seen two runs collide and clobber each other's output. A scheduled job corrupts its own output in two ways: a tick fires while the previous run is still going, or a second copy of the process starts after a redeploy or a manual test. Both become common once a scrape outlives a single hand-run.

The solution is to fire the scrape on a cron expression pinned to a named timezone and guard the work with a cross-process lock, so the cadence stays true to the clock and only one run ever touches the data at a time. That gives you a schedule that lives in code next to the scraper, in about 60 lines of Node.js with node-cron and proper-lockfile.

Key terms

  • Cron expression. A compact schedule string like */15 * * * * that node-cron evaluates against the clock to fire the job on a fixed cadence.
  • Cross-process lock. A lock held as a real file on disk, so a second process can see it and skip rather than run the same job twice.
  • Stale timeout. The age after which proper-lockfile treats a lock as abandoned, so a run that crashed without releasing does not block the schedule forever.
  • noOverlap. A node-cron option that skips a tick when the previous run is still going, guarding against overlap inside a single process.
  • Named timezone. An explicit zone like Etc/UTC passed to the scheduler so the same expression fires at the same real moment on every machine.
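To make the first term concrete, here is a minimal sketch of how a step field such as */15 could be matched against the current minute. The helper name matchesMinuteField is hypothetical and it covers only *, */n, and a single number. node-cron's real parser handles far more (ranges, lists, the other fields), so read this as an illustration of the idea, not as the library's implementation.

```javascript
// Hypothetical helper: does a cron minute field match a given minute (0-59)?
// Supports only "*", "*/n", and a plain number such as "30".
function matchesMinuteField(field, minute) {
  if (field === '*') return true
  if (field.startsWith('*/')) {
    const step = Number(field.slice(2))
    if (!Number.isInteger(step) || step <= 0) {
      throw new Error('invalid step in field: ' + field)
    }
    return minute % step === 0
  }
  return Number(field) === minute
}

// Collect the minutes of the hour on which "*/15" fires.
const fires = []
for (let m = 0; m < 60; m++) {
  if (matchesMinuteField('*/15', m)) fires.push(m)
}
console.log(fires.join(', ')) // 0, 15, 30, 45
```

The point to take away is that the expression is evaluated against the clock: the job fires at :00, :15, :30, and :45 regardless of when the process started.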

Here is what the script does:

  • Register a cron expression with node-cron so the scrape function runs on a fixed cadence, in a timezone you name explicitly.
  • Hold an on-disk lock with proper-lockfile so a second process, a manual run, or a slow previous tick cannot run the same job twice at once.
  • Skip the tick instead of stacking it when the previous run is still going, using node-cron's noOverlap option plus the lock as a second layer.
  • Catch every error inside the job so a failed scrape is logged, the lock is released, and the next tick starts clean.

The complete script

js
// scheduled-scraper.mjs
import cron from 'node-cron'
import lockfile from 'proper-lockfile'
import { writeFile, mkdir } from 'node:fs/promises'

const LOCK_TARGET = './data'        // the directory the lock protects
const CRON_EXPR = '*/15 * * * *'    // every 15 minutes
const TIMEZONE = 'Etc/UTC'          // name it; never trust the host's local time

await mkdir(LOCK_TARGET, { recursive: true })

/* The actual scrape. Real work goes here; this fetches one page and saves it. */
async function scrapeOnce() {
  const res = await fetch('https://news.ycombinator.com/', {
    headers: { 'User-Agent': 'Mozilla/5.0' }
  })

  if (!res.ok) {
    throw new Error('upstream returned ' + res.status)
  }

  const html = await res.text()
  const stamp = new Date().toISOString().replace(/[:.]/g, '-')
  await writeFile(LOCK_TARGET + '/hn-' + stamp + '.html', html)
  console.log('[scrape] saved ' + html.length + ' bytes at ' + stamp)
}

/* Wrap the scrape in a cross-process lock so only one run touches the data dir. */
async function runGuarded() {
  let release
  try {
    release = await lockfile.lock(LOCK_TARGET, { stale: 10 * 60 * 1000 })
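  } catch (err) {
    /* ELOCKED means another run holds the lock; anything else is a real fault. */
    if (err.code === 'ELOCKED') {
      console.log('[scrape] another run holds the lock, skipping this tick')
    } else {
      console.error('[scrape] could not take the lock: ' + err.message)
    }
    return
  }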

  try {
    await scrapeOnce()
  } catch (err) {
    /* Log the failure and carry on. The finally block below still releases
       the lock, so one failed scrape leaves the next tick a clean start. */
    console.error('[scrape] run failed: ' + err.message)
  } finally {
    await release()
  }
}

const task = cron.schedule(CRON_EXPR, runGuarded, {
  name: 'hn-scrape',
  timezone: TIMEZONE,
  noOverlap: true            // node-cron skips a tick if the last run is still going
})

console.log('[scheduler] hn-scrape armed: ' + CRON_EXPR + ' (' + TIMEZONE + ')')
console.log('[scheduler] next run: ' + task.getNextRun()?.toISOString())

/* Stop cleanly on Ctrl+C so an in-flight run can release its lock. */
process.on('SIGINT', async () => {
  console.log('[scheduler] stopping')
  await task.stop()
  process.exit(0)
})

bash
npm install node-cron proper-lockfile
node scheduled-scraper.mjs

What each step does

Name the timezone, do not infer it. cron.schedule reads the host's local time unless you pass timezone. A server in UTC and a laptop in Berlin run the same expression at different real moments. Passing 'Etc/UTC' (or whatever you actually mean) makes the schedule reproducible across machines.
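A quick way to see the problem is to read the same instant in two zones. The sketch below uses only the built-in Intl.DateTimeFormat; the helper name hourIn and the example date are assumptions chosen for illustration.

```javascript
// Read the wall-clock hour of one instant in a given IANA timezone.
function hourIn(instant, timeZone) {
  const parts = new Intl.DateTimeFormat('en-US', {
    timeZone,
    hour: 'numeric',
    hourCycle: 'h23' // 0-23, so midnight reads as 0
  }).formatToParts(instant)
  return Number(parts.find((p) => p.type === 'hour').value)
}

// One fixed instant: 02:00 UTC on 15 January 2026 (an arbitrary example date).
const instant = new Date('2026-01-15T02:00:00Z')

console.log(hourIn(instant, 'Etc/UTC'))       // 2
console.log(hourIn(instant, 'Europe/Berlin')) // 3 (CET is UTC+1 in January)
```

An expression that means "02:00" therefore matches this instant on a UTC host and a different instant, one hour earlier, on a Berlin host, unless both are told which zone to use.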

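Take the lock before scraping, release it in finally. lockfile.lock creates a lock directory next to the target (./data.lock here) and resolves only if no live lock exists. The stale: 10 * 60 * 1000 option means a lock whose timestamp is more than ten minutes old is treated as abandoned, so a crashed run does not block the schedule forever. While the holder is alive, proper-lockfile refreshes that timestamp in the background (every stale / 2 by default, through its update option), so a healthy run that takes longer than ten minutes keeps its lock. The release() in the finally block runs whether the scrape succeeds or throws.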
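The staleness rule is easier to reason about as a small model. The sketch below is not proper-lockfile's code; isStale and the timestamps are hypothetical, and it assumes the documented behaviour that the lock's modification time is what gets compared against the stale window and that a live holder refreshes it periodically.

```javascript
const STALE_MS = 10 * 60 * 1000 // same ten-minute window as the script

// Hypothetical model of the staleness rule: a lock counts as abandoned once
// its last-modified time is older than the stale window.
function isStale(lockMtimeMs, nowMs, staleMs = STALE_MS) {
  return nowMs - lockMtimeMs > staleMs
}

const acquiredAt = Date.parse('2026-01-15T02:00:00Z')
const minutes = (n) => n * 60 * 1000

// A holder that crashed right after acquiring never touches the lock again.
console.log(isStale(acquiredAt, acquiredAt + minutes(9)))  // false
console.log(isStale(acquiredAt, acquiredAt + minutes(11))) // true

// A live holder that refreshed the timestamp at minute 5 is still fresh at minute 11.
const refreshedAt = acquiredAt + minutes(5)
console.log(isStale(refreshedAt, acquiredAt + minutes(11))) // false
```

So the stale value bounds how long a dead run can block the schedule. It is not a limit on how long a healthy run may take.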

Skip, do not queue, on a held lock. When lock() rejects because another run holds it, the catch logs and returns. The tick is dropped, not buffered. For a scraper you want the freshest data on the next clean tick, not a backlog of stale runs piling up.
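The same skip-rather-than-queue policy can be sketched inside a single process. skipWhileRunning below is a hypothetical helper, an in-process analogue of what the lock does across processes, and not how node-cron or proper-lockfile implement it.

```javascript
// Hypothetical in-process guard: wraps an async job so that a call made while
// a previous call is still running is dropped rather than queued.
function skipWhileRunning(job) {
  let running = false
  return async function guarded() {
    if (running) return 'skipped'
    running = true
    try {
      await job()
      return 'ran'
    } finally {
      running = false // always reopen the gate, even if the job throws
    }
  }
}

// Usage: a slow job "ticked" twice in quick succession.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms))
const tick = skipWhileRunning(() => sleep(50))

const outcomes = Promise.all([tick(), tick()])
outcomes.then((results) => console.log(results)) // [ 'ran', 'skipped' ]
```

The second call returns immediately instead of waiting its turn, which is the behaviour you want from a scraper: no backlog builds up behind a slow run.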

Keep errors inside the callback. A throw that escapes the cron callback does not crash node-cron, but it does go unlogged and the failure is invisible. Catching it, logging err.message, and moving on means one 500 from the target site does not silently break the next run.

Stop on SIGINT. task.stop() lets an in-flight run finish and release its lock before the process exits. Killing the process mid-scrape leaves the lock on disk until the stale window expires.
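If you want the wait for an in-flight run to be explicit in your own code, one option is to track the current run's promise and await it in the signal handler. This is a sketch under assumptions: createRunTracker, track, and drain are hypothetical names, not part of node-cron.

```javascript
// Hypothetical tracker: remembers the promise of the run in progress so a
// shutdown handler can wait for it before exiting.
function createRunTracker() {
  let inFlight = null
  return {
    // Start a run and remember it until it settles.
    track(job) {
      const run = Promise.resolve().then(job)
      inFlight = run
      const clear = () => { if (inFlight === run) inFlight = null }
      run.then(clear, clear)
      return run
    },
    // Resolve once the current run (if any) has settled, success or failure.
    async drain() {
      if (inFlight) await inFlight.catch(() => {})
    }
  }
}

// Usage: start a 50 ms run, then wait for it the way a SIGINT handler would.
const tracker = createRunTracker()
const events = []
tracker.track(async () => {
  await new Promise((resolve) => setTimeout(resolve, 50))
  events.push('run finished')
})
const drained = tracker.drain().then(() => {
  events.push('drained')
  console.log(events) // [ 'run finished', 'drained' ]
})

// In the script this could sit in the handler, for example:
// process.on('SIGINT', async () => { await tracker.drain(); process.exit(0) })
```

In the scheduled script, the cron callback would call tracker.track(runGuarded), so the handler always knows whether a scrape is still holding the lock.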

Gotchas

  • setInterval drifts off the schedule.

    • Issue: setInterval(fn, 900000) counts from the previous firing rather than from the wall clock, so the first run lands wherever the process happened to start, every small timer delay carries into all later runs, and the job slowly desyncs from the wall clock.
    • Fix: use a cron expression. node-cron computes the next fire time from the clock, so a slow run does not push the following one.
  • A long run overlaps the next tick.

    • Issue: if a scrape takes longer than the interval, the next tick starts a second run while the first is still writing, and both clobber the same files.
    • Fix: pass noOverlap: true to cron.schedule. node-cron skips the new tick while the previous run is unfinished instead of stacking it.
  • A second process runs the same job.

    • Issue: noOverlap only guards one process. A redeploy that leaves the old process alive, or a manual node scheduled-scraper.mjs to test, gives you two schedulers firing the same scrape at once.
    • Fix: hold a proper-lockfile lock around the work. The lock is a file on disk, so any process can see it, and the second run's lock() rejects and skips.
  • A crashed run leaves the lock stuck.

    • Issue: if the process dies mid-scrape (OOM, kill -9, power loss), the lock file stays on disk and every future tick skips forever.
    • Fix: set a stale timeout, here 10 * 60 * 1000. After ten minutes proper-lockfile treats the lock as abandoned and the next tick reclaims it.
  • The schedule jumps an hour on a daylight-saving change.

    • Issue: a job set for a local time inside the changeover hour, such as 0 2 * * * in Europe/Berlin, can run twice or not at all on the two switch days, because that local time happens twice in autumn and never in spring.
    • Fix: schedule in 'Etc/UTC', which has no DST, and convert to local time only for display. Reserve a named local timezone for jobs that genuinely must track local business hours.
  • The whole schedule dies with the process.

    • Issue: node-cron lives inside one Node process. If the machine reboots or the process is killed, nothing restarts it, and the scrape silently stops running.
    • Fix: supervise the process with systemd, pm2, or a container restart policy so it comes back after a crash or reboot. node-cron schedules; it does not keep itself alive.
  • Sub-minute crons drop most of their ticks.

    • Issue: a six-field expression like */30 * * * * * (every 30 seconds) is parsed and fires, but at intervals that tight a scrape tends to outlast the gap, so noOverlap and the lock skip most ticks.
    • Fix: keep the cadence well above the time one scrape takes. If you need sub-minute work, measure the run first and size the interval to leave headroom.
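The first gotcha in this list, drift, can be shown with plain arithmetic. The sketch below is a simplified model, not a measurement: nextQuarterHour is a hypothetical helper, and the 200 ms of lag per firing is an assumed figure chosen to make the effect visible.

```javascript
const QUARTER_HOUR_MS = 15 * 60 * 1000

// Next wall-clock quarter hour strictly after `nowMs` (epoch milliseconds).
// Works on UTC epoch time, where quarter hours are exact multiples of 15 min.
function nextQuarterHour(nowMs) {
  return (Math.floor(nowMs / QUARTER_HOUR_MS) + 1) * QUARTER_HOUR_MS
}

// Assumed scenario: the process starts at 02:07:30 UTC and every timer
// firing is delayed by 200 ms of event-loop lag.
const start = Date.parse('2026-01-15T02:07:30Z')
const lagMs = 200

let intervalFire = start
let clockFire = start
for (let i = 0; i < 3; i++) {
  intervalFire += QUARTER_HOUR_MS + lagMs // counted from the previous firing
  clockFire = nextQuarterHour(clockFire)  // recomputed from the clock
}

console.log(new Date(intervalFire).toISOString()) // 2026-01-15T02:52:30.600Z
console.log(new Date(clockFire).toISOString())    // 2026-01-15T02:45:00.000Z
```

The interval-style schedule inherits its start offset and keeps adding lag, while the clock-style schedule lands on :15, :30, and :45 no matter when the process started.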

Use this when

You have one Node.js process that should scrape a site on a cadence, the job can tolerate an occasional skipped tick, and you want the schedule to live in code next to the scraper rather than in a separate system.

Skip this when

A plain OS cron entry or a hosted scheduler is the better tool. Reach for crontab, a systemd timer, or a GitHub Actions schedule when the scrape is a short one-shot script rather than a daemon you keep running. Use a queue-backed runner like BullMQ or pg-boss when ticks must persist across restarts and survive a missed window. Move to a managed scheduler such as Cloud Scheduler or EventBridge when you are scaling past one machine and need the cadence outside any single process.
