Simplescraper

How to scrape and download files in Node.js

Updated 2026-06-24 · 6 min read

If you've ever pulled a file URL off a page and saved it with Buffer.from(await res.arrayBuffer()), you're probably watching it work fine on small images and then die on the first big PDF or video. That approach loads the whole file into memory before a single byte reaches disk, so on a small VM or a Lambda the process runs out of heap and gets killed, and nothing is saved. The ceiling is the available heap: any file larger than that takes the process down.

The solution is to stream the response body straight to the filesystem one chunk at a time, so memory stays flat no matter how large the file is and a dropped connection rejects with the file handle closed instead of leaking a descriptor. You also check the status and read the filename from the response headers before touching the body, so an error response is not saved under a .pdf name, and you backfill a missing extension from the Content-Type. It comes to about 40 lines of Node.js with native fetch and two small helpers, content-disposition and mime-types.

Key terms

  • ReadableStream. The web-standard stream that fetch exposes as res.body, delivering the response in chunks rather than as one in-memory blob.
  • pipeline. The node:stream/promises helper that pumps one stream into another, awaits completion, and destroys every stream in the chain if any link errors.
  • Backpressure. The flow-control signal a slow destination sends upstream so the source pauses, which is what keeps a streamed download from outrunning the disk.
  • Content-Disposition. A response header that states the server's intended filename, including a filename* UTF-8 extended form, which the script reads for the server-stated download name.
  • MIME type. The Content-Type value such as application/pdf that names the file's format, used here to backfill an extension when the filename has none.
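
To make backpressure concrete, here is a minimal sketch using a deliberately slow Writable with a tiny buffer. The slowDisk name and the 4-byte highWaterMark are illustrative assumptions, not part of the download script. The point is only that write() returns false once the buffer is full and 'drain' fires when it empties, which is the signal pipeline acts on for you.

```js
import { Writable } from 'node:stream'
import { once } from 'node:events'

// A stand-in for a slow disk: buffers at most 4 bytes and
// completes each write on a later turn of the event loop.
const slowDisk = new Writable({
  highWaterMark: 4,
  write(chunk, _encoding, callback) {
    setImmediate(callback)
  }
})

// write() returns false when the buffered bytes reach highWaterMark.
const canContinue = slowDisk.write(Buffer.alloc(8))
console.log('keep writing?', canContinue)

// A well-behaved source pauses here until the destination catches up.
if (!canContinue) {
  await once(slowDisk, 'drain')
}
console.log('buffered bytes after drain:', slowDisk.writableLength)
```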

Here is what the script does:

  • Fetch the file URL with native fetch and a normal browser User-Agent, then check the response status before reading a single byte.
  • Read the filename from the Content-Disposition header with the content-disposition library, falling back to the URL path when the server does not send one.
  • Backfill a missing extension from the server's Content-Type using mime-types, so a URL that ends in /download still produces a file that opens in the right application.
  • Stream the response body to disk with Node's stream/promises pipeline, which handles backpressure and closes the file handle on error.

The complete script

js
// download-file.mjs
import { createWriteStream } from 'node:fs'
import { mkdir } from 'node:fs/promises'
import { Readable } from 'node:stream'
import { pipeline } from 'node:stream/promises'
import { basename, extname, join } from 'node:path'
import contentDisposition from 'content-disposition'
import mime from 'mime-types'

const url = 'https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf'
const outDir = './downloads'

const res = await fetch(url, {
  headers: { 'User-Agent': 'Mozilla/5.0' },
  redirect: 'follow'
})

// An error response (403, 404, ...) still has a body, so reject anything non-2xx up front.
if (!res.ok) {
  throw new Error(`Download failed: ${res.status} ${res.statusText} for ${url}`)
}

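// Prefer the server's stated filename, then the URL path, then a generic fallback.
// The header is untrusted input: parse() throws on a malformed value, and the name
// may carry directory components, so guard the parse and keep only the basename.
const disposition = res.headers.get('content-disposition')
let filename = null
try {
  if (disposition) filename = contentDisposition.parse(disposition).parameters.filename
} catch {
  // Malformed header: fall through to the URL path.
}
filename = basename(filename || new URL(res.url).pathname) || 'download'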

// If the filename has no extension, derive one from the real Content-Type.
if (!extname(filename)) {
  const ext = mime.extension(res.headers.get('content-type') || '')
  if (ext) filename = `${filename}.${ext}`
}

await mkdir(outDir, { recursive: true })
const outPath = join(outDir, filename)

// Stream the web ReadableStream to disk. The bytes are never all held in memory.
await pipeline(Readable.fromWeb(res.body), createWriteStream(outPath))

console.log(`Saved ${outPath}`)
bash
npm install content-disposition mime-types
node download-file.mjs

What each step does

Fetch with a browser User-Agent and follow redirects. A bare fetch from Node identifies itself with a Node-specific User-Agent rather than a browser one, and some CDNs reject it. redirect: 'follow' is the default, but stating it is a reminder that download links routinely 302 to a signed storage URL, and res.url then holds the final address you want to name the file from.

Reject non-2xx before reading the body. A 403 or a "file not found" page still returns a readable body. Without the res.ok guard you stream an HTML error page to report.pdf and only discover it when something downstream tries to open it.
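
The guard can be pulled into a small function that also covers the case where a server returns 200 with an HTML page. This is a minimal sketch: assertDownloadable and its expectedType parameter are hypothetical names, and the in-memory Response objects stand in for real fetch results so the behavior is visible without a network call.

```js
// Hypothetical helper: throw unless the response is 2xx and, when an
// expected type is given, the Content-Type header starts with it.
function assertDownloadable(res, expectedType) {
  if (!res.ok) {
    throw new Error(`Download failed: ${res.status} ${res.statusText}`)
  }
  const type = (res.headers.get('content-type') || '').toLowerCase()
  if (expectedType && !type.startsWith(expectedType)) {
    throw new Error(`Expected ${expectedType} but the server sent "${type || 'no Content-Type'}"`)
  }
  return res
}

// In-memory responses standing in for what fetch would return.
const forbidden = new Response('<h1>Forbidden</h1>', { status: 403 })
const softError = new Response('<h1>Not found</h1>', {
  status: 200,
  headers: { 'content-type': 'text/html; charset=utf-8' }
})
const realPdf = new Response('%PDF-1.7', {
  status: 200,
  headers: { 'content-type': 'application/pdf' }
})

for (const res of [forbidden, softError, realPdf]) {
  try {
    assertDownloadable(res, 'application/pdf')
    console.log('ok to stream')
  } catch (err) {
    console.log('rejected:', err.message)
  }
}
```

Passing the expected type is optional in this sketch, since a general-purpose downloader often does not know the format in advance.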

Parse Content-Disposition for the server-stated filename. When a server sends attachment; filename="Q3-report.pdf", that is the name a browser would use. The content-disposition library handles the quoting and the filename* UTF-8 extended form correctly, which a hand-rolled split on = does not. When the header is absent, basename of the URL path is the next best source.
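
Whichever source the name comes from, it is worth reducing it to a bare filename before joining it to the output directory. The sketch below uses only node:path. safeName and nameFromUrl are hypothetical helpers, and the example.com URLs are placeholders.

```js
import { posix } from 'node:path'

// Hypothetical helper: reduce any server- or URL-supplied name to a bare filename.
function safeName(raw, fallback = 'download') {
  // Treat backslashes as separators too, so Windows-style paths are stripped on any OS.
  const name = posix.basename(String(raw || '').replaceAll('\\', '/'))
  return name && name !== '.' && name !== '..' ? name : fallback
}

// Hypothetical helper: derive a name from the final URL when no header is present.
function nameFromUrl(url) {
  let path = new URL(url).pathname // the query string and fragment are already excluded
  try {
    path = decodeURIComponent(path) // %20 becomes a space, %2F becomes a slash
  } catch {
    // Malformed percent-encoding: keep the raw pathname.
  }
  return safeName(path)
}

console.log(safeName('../../etc/passwd'))
console.log(nameFromUrl('https://example.com/files/file.pdf?token=abc'))
console.log(nameFromUrl('https://example.com/'))
```

Decoding before taking the basename matters because an encoded slash only becomes a path separator after decoding.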

Backfill the extension from Content-Type. Plenty of download URLs end in /download or a bare ID with no extension. mime.extension('application/pdf') returns pdf, so the saved file opens in the right application instead of as an unknown blob.

Stream with pipeline, not pipe. pipeline from node:stream/promises returns a promise that resolves when the write completes and rejects on any error in the chain, cleaning up every stream it touches. The older readable.pipe(writable) does not forward errors or close the destination on failure, which is how partial files and leaked descriptors happen.
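
pipeline closes the streams on failure, but whatever bytes were already written stay on disk. A minimal sketch of a wrapper that also removes the partial file follows. saveBody is a hypothetical name, and it accepts any fetch Response.

```js
import { createWriteStream } from 'node:fs'
import { unlink } from 'node:fs/promises'
import { Readable } from 'node:stream'
import { pipeline } from 'node:stream/promises'

// Hypothetical helper: stream a fetch Response body to outPath and
// remove the partial file if the transfer fails midway.
async function saveBody(res, outPath) {
  if (!res.body) {
    throw new Error('Response has no body to save')
  }
  try {
    await pipeline(Readable.fromWeb(res.body), createWriteStream(outPath))
  } catch (err) {
    // Best-effort cleanup; ignore the case where the file was never created.
    await unlink(outPath).catch(() => {})
    throw err
  }
  return outPath
}
```

In the script above, this would replace the bare pipeline call with await saveBody(res, outPath). The error is rethrown, so the caller still sees the failure.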

Gotchas

  • Buffering the whole response crashes on large files.

    • Issue: Buffer.from(await res.arrayBuffer()) loads the entire file into memory, so a multi-hundred-megabyte download triggers an out-of-memory kill or a RangeError: Array buffer allocation failed.
    • Fix: stream the body with pipeline(Readable.fromWeb(res.body), createWriteStream(outPath)), which holds only one chunk at a time.
  • A failed request still writes a file.

    • Issue: Some servers answer a forbidden or missing-file request with a 200 status and an HTML page, so the res.ok check passes and streaming the body saves that error page under your intended filename.
    • Fix: check if (!res.ok) throw ... before touching res.body, and for the 200-with-HTML case verify res.headers.get('content-type') matches what you expect.
  • The filename from the URL is wrong or unsafe.

    • Issue: URLs carry query strings (file.pdf?token=abc) and path traversal sequences (../../etc/passwd), so naming the file from a raw URL string produces a broken or dangerous path.
    • Fix: read the name from Content-Disposition first, and when falling back to the URL use basename(new URL(res.url).pathname), which drops the query and strips directory components. Pass the header's filename through basename as well, since the server controls that value too.
  • Two downloads collide on the same name.

    • Issue: Many files share a server filename like download.pdf or image.jpg, so a second download silently overwrites the first.
    • Fix: prefix the saved name with a counter or a short hash of the URL, for example `${createHash('sha1').update(url).digest('hex').slice(0, 8)}-${filename}`, with createHash imported from node:crypto.
  • pipe leaves a partial file on a dropped connection.

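    • Issue: With Readable.fromWeb(res.body).pipe(createWriteStream(outPath)), a network error mid-transfer gives you no promise to reject, is not forwarded to the write stream, and leaves a truncated file plus an open file descriptor. (The web ReadableStream in res.body has no pipe method of its own, so it has to be wrapped first either way.)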
    • Fix: use pipeline from node:stream/promises, which rejects on the error and destroys both streams, then delete the partial file in a catch with unlink(outPath).
  • A slow or stalled server hangs the process forever.

    • Issue: fetch has no default timeout, so a server that accepts the connection and then sends nothing leaves the download pending indefinitely.
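    • Fix: pass signal: AbortSignal.timeout(30000) to fetch so the request aborts after 30 seconds and pipeline rejects. The timer covers the whole transfer, body included, so size it for the largest file you expect rather than for the time to first byte.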
  • The file is behind a login or a referer check.

    • Issue: Download links on authenticated pages return the login HTML when the request lacks the session cookie or the originating Referer, so you save the wrong content.
    • Fix: send the cookies and Referer you captured from the page in the fetch headers, or drive the download through Puppeteer where the browser session already holds them.
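
For the name-collision gotcha, the hash prefix can be wrapped in a one-line helper. This is a sketch under assumptions: uniqueName is a hypothetical name and the example.com URLs are placeholders.

```js
import { createHash } from 'node:crypto'

// Hypothetical helper: prefix a filename with the first 8 hex characters
// of the SHA-1 of the URL it was downloaded from.
function uniqueName(url, filename) {
  const prefix = createHash('sha1').update(url).digest('hex').slice(0, 8)
  return `${prefix}-${filename}`
}

// Two different URLs that share the server filename download.pdf.
const first = uniqueName('https://example.com/a/download.pdf', 'download.pdf')
const second = uniqueName('https://example.com/b/download.pdf', 'download.pdf')
console.log(first)
console.log(second)
```

Because the prefix depends only on the URL, downloading the same URL again overwrites its earlier copy instead of piling up duplicates. A counter would keep every copy.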

Use this when

You have a direct file URL scraped from a page (a PDF, image, CSV, ZIP, or media asset) and you want it on disk with the right name and extension, including files large enough that buffering them in memory is not an option.

Skip this when

The link only appears after a client-side click or JavaScript build of a blob URL (drive the download through Puppeteer instead); the file sits behind a login or signed-URL flow that needs a full browser session (capture the session first); you want the page's article text rather than a binary asset (convert the HTML to Markdown); or you need thousands of files at once where a concurrency-limited queue with retries matters more than the single-file mechanics shown here.
