How to run a scraper on a schedule with Node.js

Updated 2026-06-24 · 6 min read

If you have a scraper you want to run on a cadence, you have probably reached for setInterval and watched it drift off the clock, or worse, seen two runs collide and clobber each other's output. A scheduled job corrupts itself in two ways: a tick fires while the previous run is still going, and a second copy of the process starts from a redeploy or a manual test. Both are common once a scrape outlives a single hand-run, and a named-timezone cron plus a cross-process lock closes off both.

The fix is to fire the scrape from a cron expression, a schedule string like */15 * * * * that node-cron evaluates against the clock, and to pin that expression to a named timezone so the cadence is the same on every machine. We'll build a small script that does four things: it registers the cron so the scrape runs on a fixed cadence; it holds a cross-process lock on disk so a second process can't run the job at the same time; it skips a tick, rather than stacking it, when the last run is still going; and it catches every error inside the job so one failed scrape never kills the scheduler. It comes out to about 60 lines of Node.js with node-cron and proper-lockfile.
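To make the schedule string less opaque, here is a minimal sketch of how a minute field such as */15 expands and how a next fire time can be derived from the clock rather than from the previous run. expandField and nextFire are hypothetical helpers written for illustration; they only cover the minute field in UTC and are not node-cron's parser.

```javascript
// Expand one cron field such as "*/15", "0,30", "10-12" or "5" into the
// sorted list of values it matches. Hypothetical helper, not node-cron's code.
function expandField(field, min, max) {
  const values = new Set()
  for (const part of field.split(',')) {
    const pieces = part.split('/')
    const range = pieces[0]
    const step = pieces[1] === undefined ? 1 : Number(pieces[1])
    let lo
    let hi
    if (range === '*') {
      lo = min
      hi = max
    } else if (range.includes('-')) {
      const bounds = range.split('-').map(Number)
      lo = bounds[0]
      hi = bounds[1]
    } else {
      lo = Number(range)
      hi = pieces[1] === undefined ? lo : max
    }
    if (!Number.isInteger(lo) || !Number.isInteger(hi) || !Number.isInteger(step) || step < 1) {
      throw new Error('bad cron field: ' + field)
    }
    for (let v = lo; v <= hi; v += step) values.add(v)
  }
  return [...values].sort((a, b) => a - b)
}

// Next fire time strictly after `from` for "<minuteField> * * * *", in UTC.
// It depends only on the clock, never on when the last run finished.
function nextFire(minuteField, from) {
  const minutes = expandField(minuteField, 0, 59)
  const next = new Date(from.getTime())
  next.setUTCSeconds(0, 0)
  do {
    next.setUTCMinutes(next.getUTCMinutes() + 1)
  } while (!minutes.includes(next.getUTCMinutes()))
  return next
}

console.log(expandField('*/15', 0, 59)) // [ 0, 15, 30, 45 ]
console.log(nextFire('*/15', new Date('2026-01-01T12:07:30Z')).toISOString())
// 2026-01-01T12:15:00.000Z
```

However long a run takes, the next fire time lands on the same quarter-hour boundary, which is what a cron schedule gives you over setInterval.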

The complete script

// scheduled-scraper.mjs
import cron from 'node-cron'
import lockfile from 'proper-lockfile'
import { writeFile, mkdir } from 'node:fs/promises'

const LOCK_TARGET = './data'        // the directory the lock protects
const CRON_EXPR = '*/15 * * * *'    // every 15 minutes
const TIMEZONE = 'Etc/UTC'          // name it; never trust the host's local time

await mkdir(LOCK_TARGET, { recursive: true })

/* the actual scrape. real work goes here; this fetches one page and saves it. */
async function scrapeOnce() {
  const res = await fetch('https://news.ycombinator.com/', {
    headers: { 'User-Agent': 'Mozilla/5.0' }
  })

  if (!res.ok) {
    throw new Error('upstream returned ' + res.status)
  }

  const html = await res.text()
  const stamp = new Date().toISOString().replace(/[:.]/g, '-')
  await writeFile(LOCK_TARGET + '/hn-' + stamp + '.html', html)
  console.log('[scrape] saved ' + html.length + ' bytes at ' + stamp)
}

/* wrap the scrape in a cross-process lock so only one run touches the data dir. */
async function runGuarded() {
  let release
  try {
    release = await lockfile.lock(LOCK_TARGET, { stale: 10 * 60 * 1000 })
  } catch (err) {
    if (err.code === 'ELOCKED') {
      console.log('[scrape] another run holds the lock, skipping this tick')
    } else {
      /* anything else (permissions, missing directory) is a real failure,
         so report it as one instead of calling it a skip. */
      console.error('[scrape] could not take the lock: ' + err.message)
    }
    return
  }

  try {
    await scrapeOnce()
  } catch (err) {
    /* swallow the error here. throwing out of a cron callback would not stop
       the scheduler, but logging it keeps the next tick clean and visible. */
    console.error('[scrape] run failed: ' + err.message)
  } finally {
    await release()
  }
}

const task = cron.schedule(CRON_EXPR, runGuarded, {
  name: 'hn-scrape',
  timezone: TIMEZONE,
  noOverlap: true            // node-cron skips a tick if the last run is still going
})

console.log('[scheduler] hn-scrape armed: ' + CRON_EXPR + ' (' + TIMEZONE + ')')
console.log('[scheduler] next run: ' + task.getNextRun()?.toISOString())

/* stop cleanly on Ctrl+C so an in-flight run can release its lock. */
process.on('SIGINT', async () => {
  console.log('[scheduler] stopping')
  await task.stop()
  process.exit(0)
})

Install the dependencies and run it:

npm install node-cron proper-lockfile
node scheduled-scraper.mjs

How it works

Name the timezone, do not infer it. cron.schedule reads the host's local time unless you pass timezone, so a server in UTC and a laptop in Berlin run the same expression at different real moments. Passing 'Etc/UTC' (or whatever zone you actually mean) makes the schedule reproducible across machines. It also sidesteps daylight saving: in a zone whose clocks change around 2am, a job set for 0 2 * * * can run twice or not at all on the two switch days, because that wall-clock time happens twice or never. Schedule in 'Etc/UTC' and convert to local time only for display. A cron expression also beats setInterval here: node-cron computes the next fire time from the clock instead of from when the last callback returned, so a slow run never drifts the schedule off the wall clock.
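The daylight-saving problem is easy to check for yourself. The sketch below counts how many real instants show a given wall-clock time on a given local date, using only the built-in Intl API. countWallClockHits is a hypothetical helper, and the 2025 dates are the EU switch days for Europe/Berlin, picked as an example.

```javascript
// Count how many real instants display the wall-clock time `hhmm` on the
// local date `ymd` in `zone`. 0 means the time is skipped that day,
// 2 means it repeats. Hypothetical helper for illustration.
function countWallClockHits(zone, ymd, hhmm) {
  const fmt = new Intl.DateTimeFormat('en-US', {
    timeZone: zone,
    year: 'numeric', month: '2-digit', day: '2-digit',
    hour: '2-digit', minute: '2-digit', hourCycle: 'h23'
  })
  const nums = ymd.split('-').map(Number)
  // Scan minute by minute from one day before to two days after UTC midnight,
  // which covers every possible UTC offset.
  const start = Date.UTC(nums[0], nums[1] - 1, nums[2]) - 24 * 3600 * 1000
  let hits = 0
  for (let i = 0; i < 72 * 60; i++) {
    const parts = {}
    for (const p of fmt.formatToParts(new Date(start + i * 60000))) {
      parts[p.type] = p.value
    }
    const localDate = parts.year + '-' + parts.month + '-' + parts.day
    const localTime = parts.hour + ':' + parts.minute
    if (localDate === ymd && localTime === hhmm) hits++
  }
  return hits
}

console.log(countWallClockHits('Europe/Berlin', '2025-03-30', '02:00')) // 0
console.log(countWallClockHits('Europe/Berlin', '2025-10-26', '02:00')) // 2
console.log(countWallClockHits('Etc/UTC', '2025-10-26', '02:00'))       // 1
```

In UTC every wall-clock minute exists exactly once, which is why the script pins the schedule there.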

Take the lock before scraping, release it in finally. lockfile.lock creates a lock directory next to the target and resolves only if no live lock exists. This is the layer that stops a second process from running the same job. noOverlap guards a single process, but a redeploy that leaves the old process alive, or a manual node scheduled-scraper.mjs, gives you two schedulers. The lock lives on disk where every process can see it, so the second run's lock() rejects and that run skips. While the holder is alive, proper-lockfile keeps refreshing the lock's modification time. The stale: 10 * 60 * 1000 option means a lock that has gone ten minutes without a refresh is treated as abandoned, so a run that dies mid-scrape (OOM, kill -9, power loss) does not leave the lock stuck and block every future tick. The release() in the finally block runs whether the scrape succeeds or throws.
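The stale window is easier to reason about with a toy model. The sketch below simulates a lock holder that refreshes the lock's timestamp at a fixed interval until it dies, then reports when another process would first consider the lock abandoned. It is a simplified model, not proper-lockfile's code; isStale and firstStaleAt are hypothetical names, and the refresh interval of half the stale window is an assumption based on the library's documented default.

```javascript
// A lock counts as stale once it has gone longer than `staleMs`
// without a refresh.
function isStale(lastRefreshMs, nowMs, staleMs) {
  return nowMs - lastRefreshMs > staleMs
}

// Simulate a holder that refreshes every `updateMs` until it dies at
// `diesAtMs`. Returns the first time (ms, checked once per second) at which
// the lock looks stale, or null if that never happens before `horizonMs`.
function firstStaleAt(diesAtMs, staleMs, updateMs, horizonMs) {
  let lastRefresh = 0
  for (let now = 0; now <= horizonMs; now += 1000) {
    if (now < diesAtMs && now % updateMs === 0) lastRefresh = now
    if (isStale(lastRefresh, now, staleMs)) return now
  }
  return null
}

const STALE = 10 * 60 * 1000   // same value as the script
const UPDATE = STALE / 2       // assumed refresh interval
const HORIZON = 60 * 60 * 1000 // simulate one hour

// Holder killed 7 minutes in: last refresh was at 5 minutes, so the lock
// becomes reclaimable just after the 15-minute mark.
console.log(firstStaleAt(7 * 60 * 1000, STALE, UPDATE, HORIZON)) // 901000

// Holder alive for the whole hour: the lock never goes stale, even though
// the run is far longer than the stale window.
console.log(firstStaleAt(Infinity, STALE, UPDATE, HORIZON)) // null
```

Under this model a crashed run blocks at most the next tick or two at a 15-minute cadence, and a long but healthy run is never evicted.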

Skip, do not queue, on a held lock. When lock() rejects because another run holds it, the catch logs and returns. The tick is dropped, not buffered. For a scraper you want the freshest data on the next clean tick, not a backlog of stale runs piling up. Keep the cadence well above the time one scrape takes, since a tight sub-minute interval collides with scrape duration and the lock skip drops most ticks; measure the run first and size the interval to leave headroom.
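The skip-instead-of-queue rule can be shown without any library. This is a minimal in-process sketch of the idea; skipWhileRunning is a hypothetical helper, not how node-cron or proper-lockfile implement it, and it only covers a single process.

```javascript
// Wrap an async job so a call that arrives while a run is in flight is
// dropped, not queued. Each call resolves to 'ran' or 'skipped'.
function skipWhileRunning(job) {
  let running = false
  return async function tick() {
    if (running) return 'skipped'
    running = true
    try {
      await job()
      return 'ran'
    } finally {
      running = false // always clear the flag, even if the job throws
    }
  }
}

// A pretend scrape that takes 50 ms.
const slowScrape = () => new Promise((resolve) => setTimeout(resolve, 50))
const tick = skipWhileRunning(slowScrape)

// Two ticks fired back to back: the second arrives mid-run and is dropped.
const demo = Promise.all([tick(), tick()]).then((results) => {
  console.log(results) // [ 'ran', 'skipped' ]
  return results
})
```

Nothing is buffered, so once the slow run finishes the next tick starts from a clean state instead of working through a backlog.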

Keep errors inside the callback. A throw that escapes the cron callback does not crash node-cron, but it does go unlogged and the failure is invisible. Catching it, logging err.message, and moving on means one 500 from the target site does not silently break the next run.
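The same containment can be pulled out into a small wrapper if you schedule more than one job. This is a sketch under that assumption; contain is a hypothetical helper, and the log prefix is borrowed from the script above.

```javascript
// Wrap an async job so it never rejects: failures are logged and reported
// as `false`, successes as `true`. Hypothetical helper for illustration.
function contain(job, log = console.error) {
  return async function guarded() {
    try {
      await job()
      return true
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err)
      log('[scrape] run failed: ' + message)
      return false
    }
  }
}

// One failing run followed by a healthy one.
let calls = 0
const flaky = async () => {
  calls++
  if (calls === 1) throw new Error('upstream returned 500')
}

const logged = []
const safe = contain(flaky, (line) => logged.push(line))

const outcome = safe().then(async (first) => [first, await safe()])
outcome.then((pair) => console.log(pair, logged))
// [ false, true ] [ '[scrape] run failed: upstream returned 500' ]
```

The first run fails visibly in the log and the second run proceeds as if nothing happened.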

Stop on SIGINT. task.stop() lets an in-flight run finish and release its lock before the process exits. Killing the process mid-scrape leaves the lock on disk until the stale window expires. node-cron only schedules inside one Node process and does not keep itself alive, so supervise the process with systemd, pm2, or a container restart policy to bring the schedule back after a reboot or crash.
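If you would rather not depend on how task.stop() treats a run that is already executing, one option is to keep a reference to the in-flight promise and wait for it yourself before exiting. The sketch below shows the idea; trackInFlight is a hypothetical helper, not part of node-cron, and it assumes runs never overlap.

```javascript
// Track the run that is currently executing so a shutdown handler can wait
// for it. Assumes at most one run at a time (noOverlap plus the lock).
function trackInFlight(job) {
  let current = null

  function run() {
    current = Promise.resolve()
      .then(job)
      .finally(() => { current = null })
    return current
  }

  // Resolves once no run is executing; never rejects, even if the run failed.
  function idle() {
    return current ? current.then(() => {}, () => {}) : Promise.resolve()
  }

  return { run, idle }
}

// Usage sketch inside the script above (names taken from that script):
//   const tracked = trackInFlight(runGuarded)
//   const task = cron.schedule(CRON_EXPR, tracked.run, { timezone: TIMEZONE })
//   process.on('SIGINT', async () => {
//     await task.stop()
//     await tracked.idle()   // let the in-flight scrape release its lock
//     process.exit(0)
//   })

// Stand-alone demo: idle() resolves only after the 30 ms job has finished.
let finished = false
const tracked = trackInFlight(
  () => new Promise((resolve) => setTimeout(() => { finished = true; resolve() }, 30))
)
tracked.run()
const waited = tracked.idle().then(() => finished)
waited.then((value) => console.log(value)) // true
```

A hard kill still skips all of this, which is the case the stale window exists for.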

Use this when

You have one Node.js process that should scrape a site on a cadence, the job is idempotent enough that skipping an occasional tick is fine, and you want the schedule to live in code next to the scraper rather than in a separate system.

Skip this when

A plain OS cron entry or a hosted scheduler is the better tool. Reach for crontab, a systemd timer, or a GitHub Actions schedule when the scrape is a short one-shot script rather than a daemon you keep running. Use a queue-backed runner like BullMQ or pg-boss when ticks must persist across restarts and survive a missed window. Move to a managed scheduler such as Cloud Scheduler or EventBridge when you are scaling past one machine and need the cadence outside any single process.
