How to scrape and download files in Node.js
If you've ever pulled a file URL off a page and saved it with Buffer.from(await res.arrayBuffer()), you have probably watched it work fine on small images and then die on the first big PDF or video. That approach loads the whole file into memory before a single byte reaches disk, so on a small VM or a Lambda the process runs out of memory and gets killed, and nothing is saved. The ceiling is the available memory: any file larger than that takes the process down.
The solution is to stream the response body straight to the filesystem one chunk at a time, so memory stays flat no matter how large the file is, and a dropped connection surfaces as a rejected promise you can clean up after instead of silently leaving a half-written file. You also check the status and read the filename and extension from the response headers before touching the body, so an error response is rejected early instead of being saved as a .pdf. It comes to about 40 lines of Node.js with native fetch and two small helpers, content-disposition and mime-types.
Key terms
- ReadableStream. The web-standard stream that fetch exposes as res.body, delivering the response in chunks rather than as one in-memory blob.
- pipeline. The node:stream/promises helper that pumps one stream into another, awaits completion, and destroys every stream in the chain if any link errors.
- Backpressure. The flow-control signal a slow destination sends upstream so the source pauses, which is what keeps a streamed download from outrunning the disk.
- Content-Disposition. A response header that states the server's intended filename, including a filename* UTF-8 extended form, which the script reads for the server-stated download name.
- MIME type. The Content-Type value such as application/pdf that names the file's format, used here to backfill an extension when the filename has none.
Here is what the script does:
- Fetch the file URL with native fetch and a normal browser User-Agent, then check the response status before reading a single byte.
- Read the filename from the Content-Disposition header with the content-disposition library, falling back to the URL path when the server does not send one.
- Backfill a missing extension from the server's Content-Type using mime-types, so a URL that ends in /download still saves with a usable extension.
- Stream the response body to disk with Node's stream/promises pipeline, which handles backpressure and closes the file handle on error.
The complete script
// download-file.mjs (needs Node 18+ for native fetch and Readable.fromWeb)
import { createWriteStream } from 'node:fs'
import { mkdir } from 'node:fs/promises'
import { Readable } from 'node:stream'
import { pipeline } from 'node:stream/promises'
import { basename, extname, join } from 'node:path'
import contentDisposition from 'content-disposition'
import mime from 'mime-types'

const url = 'https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf'
const outDir = './downloads'

const res = await fetch(url, {
  headers: { 'User-Agent': 'Mozilla/5.0' },
  redirect: 'follow'
})

// An error response (403, 404, ...) still has a body, so reject anything non-2xx up front.
if (!res.ok) {
  throw new Error(`Download failed: ${res.status} ${res.statusText} for ${url}`)
}

// Prefer the server's stated filename, then the URL path, then a generic fallback.
// basename() strips any directory components a header value might carry.
const disposition = res.headers.get('content-disposition')
let filename = disposition
  ? basename(contentDisposition.parse(disposition).parameters.filename ?? '')
  : ''
if (!filename) filename = basename(new URL(res.url).pathname) || 'download'

// If the filename has no extension, derive one from the real Content-Type.
if (!extname(filename)) {
  const ext = mime.extension(res.headers.get('content-type') || '')
  if (ext) filename = `${filename}.${ext}`
}

await mkdir(outDir, { recursive: true })
const outPath = join(outDir, filename)

// Stream the web ReadableStream to disk. The bytes are never all held in memory.
await pipeline(Readable.fromWeb(res.body), createWriteStream(outPath))
console.log(`Saved ${outPath}`)

Install the two dependencies and run the script:

npm install content-disposition mime-types
node download-file.mjs

What each step does
Fetch with a browser User-Agent and follow redirects. Node's built-in fetch identifies itself with a Node-specific User-Agent rather than a browser one, and some CDNs reject it. redirect: 'follow' is the default, but stating it is a reminder that download links routinely 302 to a signed storage URL, and res.url then holds the final address you want to name the file from.
Reject non-2xx before reading the body. A 403 or a "file not found" page still returns a readable body. Without the res.ok guard you stream an HTML error page to report.pdf and only discover it when something downstream tries to open it.
Parse Content-Disposition for the server-stated filename. When a server sends attachment; filename="Q3-report.pdf", that is the name a browser would use. The content-disposition library handles the quoting and the filename* UTF-8 extended form correctly, which a hand-rolled split on = does not. When the header is absent, basename of the URL path is the next best source.
Backfill the extension from Content-Type. Plenty of download URLs end in /download or a bare ID with no extension. mime.extension('application/pdf') returns pdf, so the saved file opens in the right application instead of as an unknown blob.
Stream with pipeline, not pipe. pipeline from node:stream/promises returns a promise that resolves when the write completes and rejects on any error in the chain, cleaning up every stream it touches. The older readable.pipe(writable) does not forward errors or close the destination on failure, which is how partial files and leaked descriptors happen.
Gotchas
Buffering the whole response crashes on large files.
- Issue: Buffer.from(await res.arrayBuffer()) loads the entire file into memory, so a multi-hundred-megabyte download triggers an out-of-memory kill or a RangeError: Array buffer allocation failed.
- Fix: stream the body with pipeline(Readable.fromWeb(res.body), createWriteStream(outPath)), which holds only one chunk at a time.
A failed request still writes a file.
- Issue: Servers answer with a 403, or answer a missing-file request with a 200 and an HTML page, so streaming the body unconditionally saves that error page under your intended filename.
- Fix: check if (!res.ok) throw ... before touching res.body, and for the 200-with-HTML case verify res.headers.get('content-type') matches what you expect.
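Both checks can live in one guard that runs before the body is read. This is a minimal sketch, assuming you know which media type you expect; the helper name assertDownloadable and its expectedType parameter are illustrative and not part of the script above.

```javascript
// Hypothetical helper: throw unless the response is 2xx AND carries the
// media type the caller expects (for example 'application/pdf').
function assertDownloadable(res, expectedType) {
  if (!res.ok) {
    throw new Error(`Download failed: ${res.status} ${res.statusText}`)
  }
  // Drop parameters such as "; charset=utf-8" before comparing.
  const actualType = (res.headers.get('content-type') || '')
    .split(';')[0]
    .trim()
    .toLowerCase()
  if (actualType !== expectedType) {
    throw new Error(
      `Expected ${expectedType} but the server sent ${actualType || 'no Content-Type'}`
    )
  }
}

// Usage (the URL is a placeholder):
// const res = await fetch('https://example.com/report.pdf')
// assertDownloadable(res, 'application/pdf')
```

A strict equality check is deliberately conservative. Some servers label every download application/octet-stream, so you may need to accept a short list of types rather than exactly one.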
The filename from the URL is wrong or unsafe.
- Issue: URLs carry query strings (file.pdf?token=abc) and path traversal sequences (../../etc/passwd), so naming the file from a raw URL string produces a broken or dangerous path.
- Fix: read the name from Content-Disposition first, and when falling back to the URL use basename(new URL(res.url).pathname), which drops the query and strips directory components.
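The URL fallback can be wrapped in a small function so the query string, directory components, and percent-encoding are handled in one place. This is a sketch under the assumption that the last path segment is the name you want; filenameFromUrl and its fallback parameter are hypothetical names.

```javascript
import { basename } from 'node:path'

// Hypothetical helper: derive a safe local filename from a URL string.
function filenameFromUrl(urlString, fallback = 'download') {
  // URL parsing separates the query string and resolves "../" path segments.
  const { pathname } = new URL(urlString)
  let name = basename(pathname)
  try {
    // Decode percent-escapes, then take basename again in case the decoded
    // text contains an encoded slash such as %2F.
    name = basename(decodeURIComponent(name))
  } catch {
    // Malformed percent-encoding: keep the raw, still-encoded name.
  }
  return name && name !== '.' && name !== '..' ? name : fallback
}

console.log(filenameFromUrl('https://example.com/files/file.pdf?token=abc')) // file.pdf
console.log(filenameFromUrl('https://example.com/a/../../etc/passwd'))       // passwd
console.log(filenameFromUrl('https://example.com/'))                         // download
```

The second basename call matters: a segment like ..%2F..%2Fetc%2Fpasswd survives URL parsing as a single segment and only turns into a path after decoding.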
Two downloads collide on the same name.
- Issue: Many files share a server filename like download.pdf or image.jpg, so a second download silently overwrites the first.
- Fix: prefix the saved name with a counter or a short hash of the URL, for example `${createHash('sha1').update(url).digest('hex').slice(0, 8)}-${filename}`.
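The hash-prefix expression from the fix needs createHash from node:crypto, which the main script does not import. A minimal sketch, with uniqueName as a hypothetical helper name and placeholder URLs:

```javascript
import { createHash } from 'node:crypto'

// Hypothetical helper: prefix the filename with the first 8 hex characters
// of the SHA-1 of the source URL, so the same URL always maps to the same name.
function uniqueName(url, filename) {
  const prefix = createHash('sha1').update(url).digest('hex').slice(0, 8)
  return `${prefix}-${filename}`
}

const first = uniqueName('https://example.com/a/download.pdf', 'download.pdf')
const second = uniqueName('https://example.com/b/download.pdf', 'download.pdf')
console.log(first)
console.log(second)
```

SHA-1 is used here only to spread names apart, not for security. Eight hex characters give 32 bits, which is plenty for a few thousand files; for much larger crawls a longer slice would be a safer choice.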
pipe leaves a partial file on a dropped connection.
- Issue: With Readable.fromWeb(res.body).pipe(createWriteStream(outPath)), a network error mid-transfer rejects nothing and leaves a truncated file plus an open file descriptor.
- Fix: use pipeline from node:stream/promises, which rejects on the error and destroys both streams, then delete the partial file in a catch with unlink(outPath).
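The pipeline-plus-unlink cleanup from the fix can be sketched as one wrapper. The helper name saveStream is an assumption for illustration; it accepts any Node Readable, so for a fetch response you would pass Readable.fromWeb(res.body).

```javascript
import { createWriteStream } from 'node:fs'
import { unlink } from 'node:fs/promises'
import { Readable } from 'node:stream'
import { pipeline } from 'node:stream/promises'

// Hypothetical helper: stream a Node Readable to outPath and remove the
// partial file if the transfer fails partway through.
async function saveStream(source, outPath) {
  try {
    await pipeline(source, createWriteStream(outPath))
  } catch (err) {
    // pipeline destroys both streams on error; remove whatever was written.
    // A failed unlink (for example, the file was never created) is ignored.
    await unlink(outPath).catch(() => {})
    throw err
  }
}

// Usage with fetch (assumes res is a successful Response):
// await saveStream(Readable.fromWeb(res.body), outPath)
```

Rethrowing after the cleanup keeps the failure visible to the caller, which is what a retry loop or a job queue would need.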
A slow or stalled server hangs the process forever.
- Issue: fetch has no default timeout, so a server that accepts the connection and then sends nothing leaves the download pending indefinitely.
- Fix: pass signal: AbortSignal.timeout(30000) to fetch so the request aborts after 30 seconds and pipeline rejects.
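A sketch of the timeout as a reusable wrapper follows. The name fetchWithTimeout and the reworded error message are assumptions for illustration; the part taken from the fix is passing AbortSignal.timeout(ms) as the signal.

```javascript
// Hypothetical helper: fetch with an overall deadline in milliseconds.
async function fetchWithTimeout(url, ms = 30000) {
  try {
    return await fetch(url, { signal: AbortSignal.timeout(ms) })
  } catch (err) {
    // A deadline expiry is reported as an error named 'TimeoutError'.
    if (err.name === 'TimeoutError') {
      throw new Error(`No response from ${url} within ${ms} ms`, { cause: err })
    }
    throw err
  }
}

// Usage (the URL is a placeholder):
// const res = await fetchWithTimeout('https://example.com/big.zip', 30000)
```

The signal stays attached to the response, so the deadline also cuts off a body that is still streaming when it expires. For large files, pick a budget that covers the whole transfer rather than just the time to first byte.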
The file is behind a login or a referer check.
- Issue: Download links on authenticated pages return the login HTML when the request lacks the session cookie or the originating Referer, so you save the wrong content.
- Fix: send the cookies and Referer you captured from the page in the fetch headers, or drive the download through Puppeteer where the browser session already holds them.
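For the header-replay route, the request options might be assembled like this. It is only a sketch: sessionRequestInit is a hypothetical helper, and the cookie and referer values are placeholders for whatever you captured from the page yourself.

```javascript
// Hypothetical helper: build fetch options that replay a captured session.
// cookie is a raw Cookie header string; referer is the URL of the page the
// download link was found on.
function sessionRequestInit({ cookie, referer } = {}) {
  const headers = { 'User-Agent': 'Mozilla/5.0' }
  if (cookie) headers.Cookie = cookie
  if (referer) headers.Referer = referer
  return { headers, redirect: 'follow' }
}

// Usage (all values are placeholders):
// const res = await fetch('https://example.com/files/42/download',
//   sessionRequestInit({ cookie: 'sid=abc123', referer: 'https://example.com/files/42' }))
```

If the server still answers with the login page, the session probably depends on more than these two headers, and the Puppeteer route is the more reliable option.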
Use this when
You have a direct file URL scraped from a page (a PDF, image, CSV, ZIP, or media asset) and you want it on disk with the right name and extension, including files large enough that buffering them in memory is not an option.
Skip this when
The link only appears after a client-side click or JavaScript build of a blob URL (drive the download through Puppeteer instead); the file sits behind a login or signed-URL flow that needs a full browser session (capture the session first); you want the page's article text rather than a binary asset (convert the HTML to Markdown); or you need thousands of files at once where a concurrency-limited queue with retries matters more than the single-file mechanics shown here.