How to scrape concurrently with a promise pool in Node.js

Updated 2026-06-25 · 6 min read

If you have a few thousand URLs to scrape, you have probably tried Promise.all(urls.map((url) => fetch(url))) and watched it either get rate limited within the first few hundred requests or exhaust the machine's sockets and start throwing ECONNRESET. Firing every request at once is the obvious first move, and it breaks once the list grows past a handful of URLs, because nothing caps the number of open connections.

The fix is to keep a fixed number of fetches in flight at a time (the concurrency), independent of the total number of URLs. We'll build a small script that does four things: it walks the URL list with a cursor so a request starts only when a slot is free; it holds a bounded set of in-flight promises and tops it back up to the concurrency limit as each one finishes; it wakes the moment any single fetch settles instead of waiting for the whole batch; and it records each result or error against its input index, so one failing URL doesn't reject the run and the output stays in input order. It takes about 50 lines of plain Node.js, with nothing to install.

The complete script

// promise-pool.mjs

/* run `worker` over every item in `items`, keeping at most `concurrency`
   calls in flight at once. results come back in input order. a worker that
   throws is recorded as { ok: false, error } rather than rejecting the run,
   so one bad URL does not abort the batch. */
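async function promisePool(items, concurrency, worker) {
  /* a ceiling below 1 would never start a task, and the outer loop would
     spin forever without reaching an await. */
  if (!Number.isInteger(concurrency) || concurrency < 1) {
    throw new RangeError('concurrency must be a positive integer')
  }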
  const results = new Array(items.length)
  const running = new Set()
  let cursor = 0

  while (cursor < items.length || running.size > 0) {
    /* top the set up to the concurrency ceiling. each task captures its own
       index so the result lands in the right slot, and removes itself from
       the running set as it settles so the next pass can refill that slot. */
    while (cursor < items.length && running.size < concurrency) {
      const index = cursor++
      const task = (async () => {
        try {
          results[index] = { ok: true, value: await worker(items[index], index) }
        } catch (error) {
          results[index] = { ok: false, error }
        }
      })()
      running.add(task)
      task.finally(() => running.delete(task))
    }

    /* wait for the next task to settle before refilling. without this the
       while-loop would spin. Promise.race wakes on the first settle. */
    if (running.size > 0) await Promise.race(running)
  }

  return results
}

const urls = Array.from({ length: 50 }, (_, i) => `https://httpbin.org/anything?n=${i}`)

const results = await promisePool(urls, 8, async (url) => {
  const response = await fetch(url, { headers: { 'User-Agent': 'Mozilla/5.0' } })
  if (!response.ok) throw new Error(`${response.status} for ${url}`)
  const body = await response.text()
  /* body.length counts UTF-16 code units, so measure the encoded size instead */
  return { url, status: response.status, bytes: Buffer.byteLength(body) }
})

const ok = results.filter(r => r.ok)
const failed = results.filter(r => !r.ok)
console.log(`Done. ${ok.length} ok, ${failed.length} failed`)
for (const r of failed) console.log(`[fail] ${r.error.message}`)

Run it:

node promise-pool.mjs

How it works

Walk the list with a cursor, not .map(). Mapping the whole array starts every request immediately, which is the failure mode this page exists to avoid: a 5,000-URL list starts 5,000 requests in the same tick, tripping server rate limits and exhausting local sockets or file descriptors, which surfaces as ECONNRESET or EMFILE errors. The cursor advances only when the inner loop has room under the concurrency ceiling, so a request begins only when a slot is free.
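To make the difference concrete, here is a minimal sketch that swaps the real fetch for a timer-based stand-in and counts how many calls are open at the same moment. The names makeCountingWorker, runAllAtOnce, runWithCursor and comparePeaks are invented for this illustration and are not part of the script above.

```javascript
/* hypothetical stand-in for fetch: waits a few ms and tracks how many
   calls are open at the same moment. */
function makeCountingWorker(delayMs) {
  const stats = { inFlight: 0, peak: 0 }
  const worker = async (item) => {
    stats.inFlight++
    stats.peak = Math.max(stats.peak, stats.inFlight)
    await new Promise((resolve) => setTimeout(resolve, delayMs))
    stats.inFlight--
    return item
  }
  return { worker, stats }
}

/* .map() calls the worker for every item before any of them settles. */
async function runAllAtOnce(items, worker) {
  return Promise.all(items.map((item) => worker(item)))
}

/* the cursor starts a new call only while the running set has room. */
async function runWithCursor(items, concurrency, worker) {
  const running = new Set()
  let cursor = 0
  while (cursor < items.length || running.size > 0) {
    while (cursor < items.length && running.size < concurrency) {
      const task = worker(items[cursor++])
      running.add(task)
      task.finally(() => running.delete(task))
    }
    if (running.size > 0) await Promise.race(running)
  }
}

/* run the same 100 items both ways and report the peak open calls. */
async function comparePeaks() {
  const items = Array.from({ length: 100 }, (_, i) => i)
  const mapped = makeCountingWorker(5)
  await runAllAtOnce(items, mapped.worker)
  const pooled = makeCountingWorker(5)
  await runWithCursor(items, 8, pooled.worker)
  return { mappedPeak: mapped.stats.peak, pooledPeak: pooled.stats.peak }
}
```

With 100 items and a ceiling of 8, comparePeaks() should report a peak of 100 for the mapped run and 8 for the cursor run. With real requests, that gap is the number of connections the server and your machine have to absorb.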

Capture the index before the task starts. const index = cursor++ reads and increments in one step, so the async task closes over its own fixed index. Without this, every task would close over the same shared cursor and write to the wrong slot, which is the classic loop-variable-in-a-closure bug. The result lands in results[index], keeping the output in input order even though the requests finish out of order.
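The closure bug is easier to see side by side. This is a small sketch, not code from the script: fillWithSharedCursor and fillWithCapturedIndex are hypothetical names, and a short timer stands in for the network call.

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms))

/* buggy on purpose: the task reads the shared `cursor` after its await,
   by which time the loop has already advanced it past the last item. */
async function fillWithSharedCursor(items) {
  const results = new Array(items.length)
  const tasks = []
  let cursor = 0
  while (cursor < items.length) {
    const item = items[cursor]
    tasks.push((async () => {
      await sleep(1)
      results[cursor] = item.toUpperCase()
    })())
    cursor++
  }
  await Promise.all(tasks)
  return results
}

/* fixed: each task closes over its own `index`, captured before it starts. */
async function fillWithCapturedIndex(items) {
  const results = new Array(items.length)
  const tasks = []
  let cursor = 0
  while (cursor < items.length) {
    const index = cursor++
    tasks.push((async () => {
      /* later items finish first, to show that order is still preserved */
      await sleep((items.length - index) * 2)
      results[index] = items[index].toUpperCase()
    })())
  }
  await Promise.all(tasks)
  return results
}
```

Given ['a', 'b', 'c'], the shared-cursor version leaves slots 0 to 2 empty because every task writes to index 3, one past the end. The captured-index version fills the slots in input order even though the last item finishes first.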

Add to the set, then remove on settle. running.add(task) puts the live promise in the pool, and task.finally(() => running.delete(task)) removes it the instant it settles. The Set is the live count of open requests, so running.size < concurrency is the gate that decides whether to start another. Note that this caps how many requests run at once, not how many run per second. A concurrency of 8 against an endpoint that answers in 50ms still sends roughly 160 requests a second (8 / 0.05s) and can trip a per-second quota. To bound the rate as well, add a token-bucket limiter inside the worker, covered in How to rate-limit requests with backoff in JavaScript.
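As a rough illustration of bounding the rate inside the worker, the sketch below enforces a fixed minimum gap between request starts. That is a simpler stand-in for a token bucket, not an implementation of one, and createSpacer, paced and the injectable now parameter are hypothetical names chosen for this example.

```javascript
/* returns a `reserve()` function that hands out start times at least
   `minGapMs` apart and reports how long the caller should wait. `now` is
   injectable so the schedule can be checked without real timers. */
function createSpacer(minGapMs, now = Date.now) {
  let nextFree = 0
  return function reserve() {
    const current = now()
    const startAt = Math.max(current, nextFree)
    nextFree = startAt + minGapMs
    return startAt - current
  }
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms))

/* wait for the reserved start time, then run the real work. */
async function paced(reserve, fn) {
  await sleep(reserve())
  return fn()
}

/* 1000 ms / 8 = 125 ms between starts, so roughly 8 request starts a second
   at most, however quickly the endpoint answers. */
const reserve = createSpacer(125)
```

Inside the pool's worker this would wrap the request, for example paced(reserve, () => fetch(url)). The pool still caps how many requests are open, and the spacer caps how often a new one begins.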

Race the running set to refill. await Promise.race(running) resolves as soon as any one in-flight task settles, so the loop wakes, the just-settled task is already gone from the set via its finally, and the next pass tops the set back up to the ceiling. Racing the set is what makes the pool refill one slot at a time instead of waiting for the whole batch to drain. A request to a hung endpoint holds its slot until it settles, so pass fetch a signal from AbortSignal.timeout(ms) to reject a stalled request and free its slot before a few stuck URLs starve the pool.
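One way to wire in the timeout is a small wrapper around the worker's fetch call. This is a sketch under assumptions: fetchWithTimeout, timeoutMs and the injectable fetchImpl parameter are hypothetical names, and the right timeout value depends on the sites you are scraping.

```javascript
/* hypothetical wrapper: `timeoutMs` and `fetchImpl` are illustration-only
   parameters. AbortSignal.timeout(ms) aborts the signal after `ms`
   milliseconds, and a fetch given that signal rejects when it fires. */
async function fetchWithTimeout(url, timeoutMs, fetchImpl = fetch) {
  const response = await fetchImpl(url, {
    headers: { 'User-Agent': 'Mozilla/5.0' },
    signal: AbortSignal.timeout(timeoutMs),
  })
  if (!response.ok) throw new Error(`${response.status} for ${url}`)
  const body = await response.text()
  return { url, status: response.status, body }
}
```

Passed to the pool as promisePool(urls, 8, (url) => fetchWithTimeout(url, 10000)), a stalled request rejects with a TimeoutError after ten seconds. The pool records it as { ok: false, error } and the slot goes back to the next URL.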

Record errors instead of throwing. The worker is wrapped in a try/catch that writes { ok: false, error } to the slot, so a 500 or a dropped connection on one URL does not reject the pool and lose the other forty-nine results. The caller reads r.ok to split successes from failures at the end.
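Because the results come back in input order, each one can be paired with its input by index, which is handy for building a retry list. A minimal sketch follows, with splitResults as a hypothetical helper and made-up sample data standing in for a real run.

```javascript
/* pair each result with the input at the same index. this relies on the
   pool returning results in input order. */
function splitResults(items, results) {
  const succeeded = []
  const failed = []
  results.forEach((result, index) => {
    if (result.ok) succeeded.push({ item: items[index], value: result.value })
    else failed.push({ item: items[index], error: result.error })
  })
  return { succeeded, failed }
}

/* made-up sample data in the shape the pool returns */
const sampleUrls = ['https://example.com/a', 'https://example.com/b', 'https://example.com/c']
const sampleResults = [
  { ok: true, value: { status: 200 } },
  { ok: false, error: new Error('500 for https://example.com/b') },
  { ok: true, value: { status: 200 } },
]

const { succeeded, failed } = splitResults(sampleUrls, sampleResults)
/* the inputs that failed, ready to feed back into the pool */
const retryUrls = failed.map((entry) => entry.item)
```

Here retryUrls holds only https://example.com/b, so a second pass through the pool can target just the failures.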

Use this when

You have a known list of URLs or jobs to scrape and you want a fixed number running at once, in plain Node with nothing to install, while keeping the results in input order and the failures isolated per item.

Skip this when

You want the same pattern without writing it yourself, where p-limit wraps each call in a limiter and p-map maps an iterable at a concurrency cap; you need a requests-per-second limit rather than a concurrency cap, where a token-bucket rate limiter is the right tool; the work must survive a process restart, where a Redis-backed queue like BullMQ persists the jobs; or the bottleneck is CPU-bound parsing rather than network waiting, where a worker-thread pool like piscina moves the work off the main loop.
