How to scrape and download files in Node.js

Updated 2026-06-24 · 6 min read

If you've ever pulled a file URL off a page and saved it with Buffer.from(await res.arrayBuffer()), you've probably seen it work fine on small images and then die on the first big PDF or video. That approach loads the whole file into memory before a single byte reaches disk, so on a small VM or a Lambda the process runs out of memory and gets killed, and nothing is saved. The ceiling is the memory available to the process: any file larger than that takes it down.

The solution is to stream the response body straight to the filesystem one chunk at a time, instead of buffering the whole file in memory first. We'll build a small script that does three things. It streams each chunk to disk, so memory stays flat no matter how large the file is, and a dropped connection surfaces as an error you can clean up after instead of a silently truncated file. It checks the response status before touching the body, so a failed request never gets saved as if it were the file. And it reads the filename and extension from the response headers, so the file lands on disk under the name the server intended. It comes to about 40 lines of Node.js with native fetch and two small helpers, content-disposition and mime-types.

The complete script

// download-file.mjs
import { createWriteStream } from 'node:fs'
import { mkdir } from 'node:fs/promises'
import { Readable } from 'node:stream'
import { pipeline } from 'node:stream/promises'
import { basename, extname, join } from 'node:path'
import contentDisposition from 'content-disposition'
import mime from 'mime-types'

const url = 'https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf'
const outDir = './downloads'

const res = await fetch(url, {
  headers: { 'User-Agent': 'Mozilla/5.0' },
  redirect: 'follow'
})

// a 403 or 404 error page still has a body, so reject anything non-2xx up front.
if (!res.ok) {
  throw new Error(`Download failed: ${res.status} ${res.statusText} for ${url}`)
}

// prefer the server's stated filename, then the URL path, then a generic fallback.
const disposition = res.headers.get('content-disposition')
let filename = disposition ? contentDisposition.parse(disposition).parameters.filename : null
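if (!filename) filename = basename(new URL(res.url).pathname) || 'download'
// the name is untrusted input: keep only its last path segment so it cannot escape outDir.
filename = basename(filename) || 'download'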

// if the filename has no extension, derive one from the real Content-Type.
if (!extname(filename)) {
  const ext = mime.extension(res.headers.get('content-type') || '')
  if (ext) filename = `${filename}.${ext}`
}

await mkdir(outDir, { recursive: true })
const outPath = join(outDir, filename)

// stream the web ReadableStream to disk. the bytes are never all held in memory.
await pipeline(Readable.fromWeb(res.body), createWriteStream(outPath))

console.log(`Saved ${outPath}`)

Install the two helpers and run the script:

npm install content-disposition mime-types
node download-file.mjs

How it works

Fetch with a browser User-Agent and follow redirects. A bare fetch from Node identifies itself with a Node-flavored User-Agent such as node or undici, which some CDNs reject. redirect: 'follow' is the default, but stating it is a reminder that download links routinely 302 to a signed storage URL, and res.url then holds the final address you want to name the file from. fetch also sets no request deadline of its own, so a server that accepts the connection and then sends nothing can leave the download hanging for a long time; pass signal: AbortSignal.timeout(30000) to abort after 30 seconds. The signal stays attached while the body streams, so that limit covers the whole transfer, not just the wait for headers: size it for the largest file you expect.
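The request options described above can be wrapped in one helper. This is a minimal sketch, not part of the original script: fetchWithDeadline and its fetchImpl parameter are hypothetical names, the 30-second default is only an example, and the TimeoutError name is what current Node releases report when an AbortSignal.timeout signal fires.

```javascript
// Hypothetical helper: browser-like User-Agent, redirects followed, hard deadline.
// fetchImpl is injectable only so the helper can be exercised without a network.
async function fetchWithDeadline(url, timeoutMs = 30_000, fetchImpl = fetch) {
  try {
    return await fetchImpl(url, {
      headers: { 'User-Agent': 'Mozilla/5.0' },
      redirect: 'follow',
      // The signal also aborts the body stream, so timeoutMs has to cover the
      // full download, not only the time until the headers arrive.
      signal: AbortSignal.timeout(timeoutMs)
    })
  } catch (err) {
    // A signal created by AbortSignal.timeout aborts with a "TimeoutError".
    if (err?.name === 'TimeoutError') {
      throw new Error(`Timed out after ${timeoutMs} ms: ${url}`, { cause: err })
    }
    throw err
  }
}
```

A call such as const res = await fetchWithDeadline(url, 120_000) would stand in for the bare fetch. A timeout that fires after the headers have arrived rejects the later pipeline call rather than this helper, so that call needs its own error handling.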

Reject non-2xx before reading the body. A 403 or a "file not found" page still returns a readable body. Without the res.ok guard you stream an HTML error page to report.pdf and only discover it when something downstream tries to open it. For the trickier 200-with-HTML case, verify res.headers.get('content-type') matches what you expect before saving.
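The Content-Type check for the 200-with-HTML case might look like the sketch below. assertContentType is a hypothetical helper, not part of the script above, and the list of accepted types is whatever your scraper expects for that link.

```javascript
// Hypothetical helper: throw unless the response declares one of the expected
// media types. Parameters such as "; charset=utf-8" are ignored.
function assertContentType(res, expectedTypes) {
  const header = res.headers.get('content-type') || ''
  const mediaType = header.split(';')[0].trim().toLowerCase()
  if (!expectedTypes.includes(mediaType)) {
    const got = mediaType || 'no Content-Type'
    throw new Error(`Expected ${expectedTypes.join(' or ')} but got ${got} from ${res.url}`)
  }
  return mediaType
}
```

Calling assertContentType(res, ['application/pdf']) right after the res.ok guard stops an HTML page from being written to a .pdf file. Some servers label every download application/octet-stream; adding that type to the list keeps those working at the cost of a weaker check.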

Parse Content-Disposition for the server-stated filename. When a server sends attachment; filename="Q3-report.pdf", that is the name a browser would use. The content-disposition library handles the quoting and the filename* UTF-8 extended form correctly, which a hand-rolled split on = does not. When the header is absent, basename(new URL(res.url).pathname) is the next best source: going through URL drops the query string and resolves ../ segments, and basename keeps only the last path segment. The result stays percent-encoded (my%20file.pdf), so decode it with decodeURIComponent if you want the readable form. A filename taken from the header gets none of that treatment, so pass it through basename as well before joining it to the output directory. If many files share a generic server name like download.pdf, prefix the saved name with a short hash of the URL so a second download does not silently overwrite the first.
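The hash prefix suggested above could be sketched like this. uniqueName is a hypothetical helper, and the choice of SHA-256 truncated to 8 hex characters is an assumption made for illustration; any stable hash of the URL would do.

```javascript
import { createHash } from 'node:crypto'
import { basename } from 'node:path'

// Hypothetical helper: "download.pdf" fetched from two different URLs gets two
// different names, each prefixed with 8 hex characters of the URL's SHA-256.
function uniqueName(url, filename) {
  const prefix = createHash('sha256').update(url).digest('hex').slice(0, 8)
  // basename() keeps only the last path segment, so a header value such as
  // "../../etc/passwd" cannot point outside the output directory.
  const safe = basename(filename) || 'download'
  return `${prefix}-${safe}`
}
```

Because the prefix depends only on the URL, downloading the same URL twice still overwrites the same file, while two URLs that both report download.pdf no longer collide.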

Backfill the extension from Content-Type. Plenty of download URLs end in /download or a bare ID with no extension. mime.extension('application/pdf') returns pdf, so the saved file opens in the right application instead of as an unknown blob.

Stream with pipeline, not pipe. pipeline from node:stream/promises returns a promise that resolves when the write completes and rejects on any error in the chain, cleaning up every stream it touches. The older readable.pipe(writable) does not forward errors or close the destination on failure, which leaves a truncated file and an open descriptor when a connection drops mid-transfer. pipeline closes the streams but does not remove the bytes already written, so on a rejection, delete the partial file in a catch with unlink(outPath).
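The catch-and-unlink cleanup mentioned above might be wrapped as follows. This is a sketch under the assumption that body is the web ReadableStream from a fetch response; saveBody is a hypothetical name and is not part of the script above.

```javascript
import { createWriteStream } from 'node:fs'
import { unlink } from 'node:fs/promises'
import { Readable } from 'node:stream'
import { pipeline } from 'node:stream/promises'

// Hypothetical helper: stream a fetch response body to outPath and remove the
// partial file if anything in the chain fails.
async function saveBody(body, outPath) {
  try {
    await pipeline(Readable.fromWeb(body), createWriteStream(outPath))
  } catch (err) {
    // pipeline has already destroyed both streams; only the file is left over.
    await unlink(outPath).catch(() => {}) // the file may never have been created
    throw err
  }
}
```

In the script, await saveBody(res.body, outPath) would take the place of the bare pipeline call. The original error is rethrown, so the caller still sees why the download failed.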

Use this when

You have a direct file URL scraped from a page (a PDF, image, CSV, ZIP, or media asset) and you want it on disk with the right name and extension, including files large enough that buffering them in memory is not an option.

Skip this when

The link only appears after a client-side click or JavaScript build of a blob URL (drive the download through Puppeteer instead); the file sits behind a login or signed-URL flow that needs a full browser session (capture the session first); you want the page's article text rather than a binary asset (convert the HTML to Markdown); or you need thousands of files at once where a concurrency-limited queue with retries matters more than the single-file mechanics shown here.
