How to scrape behind a proxy in Puppeteer

Updated 2026-06-25 · 5 min read

If you're scraping the same site on a loop from one machine, you have probably already watched it start working and then quietly stop: the first dozen requests return real HTML, then the responses turn into 403s, a captcha interstitial, or an empty body. The site has tied the pattern to your IP and started rate limiting or blocking it, and every retry from that address makes the block stickier.

The fix is to spread your requests across different IPs by routing the browser through a proxy, a relay that forwards your request so the site sees its address instead of yours. We'll build a small script that does four things: it reads each proxy's host, port, and credentials from environment variables so no secrets sit in the source; launches Puppeteer pointed at one proxy from a pool; answers the proxy's authentication challenge before the first navigation; and confirms the exit IP, then moves the next page onto the next proxy so traffic spreads across many addresses instead of one. It takes about 40 lines of Node.js, not counting comments, with Puppeteer as the only dependency.

The complete script

// scrape-behind-proxy.mjs
import puppeteer from 'puppeteer'

/* the proxy pool. in production read these from env or a secrets store; the
   values below are placeholders so nothing real is committed. each entry is
   one upstream proxy: host:port plus the credentials it expects. */
const proxyPool = [
  {
    server: process.env.PROXY_1_SERVER ?? 'proxy-a.example.com:8000',
    username: process.env.PROXY_1_USER ?? 'PROXY_USERNAME',
    password: process.env.PROXY_1_PASS ?? 'PROXY_PASSWORD'
  },
  {
    server: process.env.PROXY_2_SERVER ?? 'proxy-b.example.com:8000',
    username: process.env.PROXY_2_USER ?? 'PROXY_USERNAME',
    password: process.env.PROXY_2_PASS ?? 'PROXY_PASSWORD'
  }
]

/* the pages we want, one proxy assigned per page by index. */
const targets = [
  'https://httpbin.org/ip',
  'https://httpbin.org/ip'
]

for (let i = 0; i < targets.length; i++) {
  const url = targets[i]
  /* pick the next proxy in the pool, wrapping around if there are more
     targets than proxies. */
  const proxy = proxyPool[i % proxyPool.length]

  /* the host:port goes on the launch flag. credentials must NOT go here:
     Chromium ignores user:pass in --proxy-server, which is why the next
     step exists. */
  const browser = await puppeteer.launch({
    headless: true,
    args: [`--proxy-server=${proxy.server}`]
  })

  try {
    const page = await browser.newPage()

    /* answer the proxy's 407 challenge. this must run before goto(), or the
       first navigation fails with ERR_INVALID_AUTH_CREDENTIALS. */
    await page.authenticate({
      username: proxy.username,
      password: proxy.password
    })

    const response = await page.goto(url, {
      waitUntil: 'domcontentloaded',
      timeout: 30000
    })

    /* httpbin.org/ip returns the caller's IP as JSON. reading it back proves
       the request left through the proxy and not your own address. */
    const body = await response.text()
    console.log(`[proxy] ${proxy.server} -> ${body.trim()}`)
  } catch (err) {
    /* a dead proxy throws ERR_PROXY_CONNECTION_FAILED or times out here. log
       it and let the loop move on to the next target and proxy. */
    console.error(`[proxy] ${proxy.server} failed: ${err.message}`)
  } finally {
    await browser.close()
  }
}

Install and run:

npm install puppeteer
node scrape-behind-proxy.mjs

How it works

Keep credentials out of the source. The pool reads each proxy's host, port, username, and password from environment variables and falls back to obvious placeholders. Set PROXY_1_SERVER, PROXY_1_USER, and the rest in your shell or a .env loader before running, so the real values never land in the file or a commit.
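
As a stricter variant of the same idea, the pool could be built by a loader that refuses to start when a variable is missing, instead of falling back to placeholders. This is a minimal sketch: loadProxyPool and its count argument are hypothetical names, not part of the script above, and the variable names assume the PROXY_n_SERVER, PROXY_n_USER, PROXY_n_PASS scheme the script already uses.

```javascript
// Hypothetical helper (not part of the script above): build the proxy pool
// from PROXY_<n>_SERVER / PROXY_<n>_USER / PROXY_<n>_PASS.
// It throws instead of falling back to placeholders, so a missing secret is
// caught at startup rather than as a failed proxy connection later.
function loadProxyPool(env, count) {
  const pool = []
  for (let n = 1; n <= count; n++) {
    const server = env[`PROXY_${n}_SERVER`]
    const username = env[`PROXY_${n}_USER`]
    const password = env[`PROXY_${n}_PASS`]
    if (!server || !username || !password) {
      throw new Error(
        `PROXY_${n}_SERVER, PROXY_${n}_USER and PROXY_${n}_PASS must all be set`
      )
    }
    pool.push({ server, username, password })
  }
  return pool
}

// Usage: const proxyPool = loadProxyPool(process.env, 2)
```

Passing env in as an argument rather than reading process.env inside the function keeps the loader easy to exercise with a plain object.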

Pass the host and port on the launch flag. args: ['--proxy-server=host:port'] points every request the browser makes at that proxy. The credentials are deliberately absent here. Chromium drops a user:pass@ prefix in the proxy URL, so writing --proxy-server=http://user:pass@host:8000 looks right but sends unauthenticated requests that the proxy answers with 407, which is why you authenticate in the next step instead. A SOCKS endpoint is the one case where the flag needs more than host:port: a bare address is treated as HTTP, so pass --proxy-server=socks5://host:port, and since Chromium cannot authenticate to SOCKS at all, run a local authenticated forwarder and point Chromium at that when the SOCKS proxy needs credentials.
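
If your provider hands you a single proxy URL with the credentials embedded, one way to keep them off the flag is to split the URL before launching. The sketch below assumes that input shape; splitProxyUrl is a hypothetical helper name, and it relies only on the standard URL class built into Node.js.

```javascript
// Hypothetical helper: split a proxy URL into the launch flag and the
// credentials, so user:pass never ends up inside --proxy-server.
function splitProxyUrl(proxyUrl) {
  // A bare host:port has no scheme; treat it as HTTP, as Chromium does.
  const withScheme = proxyUrl.includes('://') ? proxyUrl : `http://${proxyUrl}`
  const parsed = new URL(withScheme)
  const scheme = parsed.protocol.replace(':', '')

  // Chromium cannot authenticate to SOCKS, so credentials here cannot work.
  if (scheme.startsWith('socks') && parsed.username) {
    throw new Error('SOCKS proxy with credentials: use a local authenticated forwarder')
  }

  return {
    arg: `--proxy-server=${scheme}://${parsed.host}`,
    // URL keeps credentials percent-encoded; decode them for authenticate().
    username: decodeURIComponent(parsed.username),
    password: decodeURIComponent(parsed.password)
  }
}

// Usage:
//   const { arg, username, password } = splitProxyUrl(process.env.PROXY_URL)
//   const browser = await puppeteer.launch({ args: [arg] })
```

PROXY_URL in the usage comment is an assumed variable name. The returned username and password are what you would hand to page.authenticate().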

Authenticate before the first navigation. page.authenticate({ username, password }) registers the credentials Puppeteer replies with when the proxy returns HTTP 407 Proxy Authentication Required. Call it on the page before goto(). If the first navigation runs first, it fails with ERR_INVALID_AUTH_CREDENTIALS and the page never loads. Authentication is per page, so repeat it on every page you open.
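
Because the call has to be repeated per page, it is easy to forget on the second or third page you open. One option, sketched here with a hypothetical helper name, is to wrap newPage() and authenticate() together so no page can reach goto() without credentials registered. The browser argument is assumed to be a Puppeteer Browser and proxy one entry from the pool.

```javascript
// Hypothetical helper: open a page and register the proxy credentials on it
// before the caller gets a chance to navigate.
// `browser` is a Puppeteer Browser; `proxy` is { username, password, ... }.
async function openAuthenticatedPage(browser, proxy) {
  const page = await browser.newPage()
  await page.authenticate({
    username: proxy.username,
    password: proxy.password
  })
  return page
}

// Usage:
//   const page = await openAuthenticatedPage(browser, proxy)
//   await page.goto('https://httpbin.org/ip')
```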

Verify the exit IP, then rotate. Loading httpbin.org/ip echoes back the address the request came from, which confirms traffic left through the proxy. The loop assigns proxyPool[i % proxyPool.length] so each page launches on the next proxy and the requests spread across the pool instead of one IP. Each target gets its own browser, and the page work runs inside try/catch/finally, so a dead proxy that throws ERR_PROXY_CONNECTION_FAILED or times out is logged and skipped rather than aborting the whole run, and the browser is closed either way.
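
Printing the body shows the exit IP, but a script can also check it. The sketch below parses the httpbin.org/ip body, which is JSON with an origin field, and compares it with your own address. checkExitIp is a hypothetical helper, and ownIp is an assumption: your real public IP, looked up once without a proxy.

```javascript
// Hypothetical helper: read the exit IP from an httpbin.org/ip response body
// and confirm the machine's own address is not what the site saw.
// `ownIp` is assumed to be your real public IP, fetched once without a proxy.
function checkExitIp(body, ownIp) {
  let origin
  try {
    origin = JSON.parse(body).origin
  } catch (err) {
    // A captcha or proxy error page is HTML, not JSON.
    return { ok: false, exitIp: null, reason: 'response was not JSON' }
  }
  if (typeof origin !== 'string' || origin.length === 0) {
    return { ok: false, exitIp: null, reason: 'no origin field' }
  }
  // Assumption: if origin lists several comma-separated addresses, treat
  // your own IP appearing anywhere in the list as a leak.
  const addresses = origin.split(',').map((ip) => ip.trim())
  if (addresses.includes(ownIp)) {
    return { ok: false, exitIp: origin, reason: 'own IP is visible' }
  }
  return { ok: true, exitIp: origin, reason: null }
}

// Usage: const result = checkExitIp(await response.text(), process.env.OWN_IP)
```

OWN_IP in the usage comment is an assumed variable name. Returning a reason alongside the flag keeps the log line useful when a proxy silently passes your address through.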

Watch for the leaks the proxy does not cover. The proxy carries the page's HTTP traffic, but a site's JavaScript can still ask the browser for its real addresses over WebRTC, which bypasses the proxy. When the target fingerprints that way, launch with --force-webrtc-ip-handling-policy=disable_non_proxied_udp or block the WebRTC APIs in an init script. The other cost is speed: puppeteer.launch() takes roughly 300-700 ms each time, so a fresh browser per page dominates the runtime at thousands of pages. Keep one browser per proxy and open many pages on it, or run a small set of browsers concurrently with a promise pool sized to your proxy count.
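
The one-browser-per-proxy layout can be sketched by grouping the targets first, keeping the same round-robin assignment as the loop in the script. groupTargetsByProxy is a hypothetical helper, and the launch loop in the usage comment is an outline under that assumption rather than a tuned implementation.

```javascript
// Hypothetical helper: assign targets to proxies round-robin, then group them
// so each proxy's browser is launched once and reused for all of its pages.
function groupTargetsByProxy(targets, proxyPool) {
  if (proxyPool.length === 0) throw new Error('proxy pool is empty')
  const groups = new Map()
  targets.forEach((url, i) => {
    const proxy = proxyPool[i % proxyPool.length]
    if (!groups.has(proxy)) groups.set(proxy, [])
    groups.get(proxy).push(url)
  })
  return groups
}

// Usage sketch (inside an ES module with puppeteer imported):
//   for (const [proxy, urls] of groupTargetsByProxy(targets, proxyPool)) {
//     const browser = await puppeteer.launch({
//       args: [`--proxy-server=${proxy.server}`]
//     })
//     try {
//       for (const url of urls) {
//         const page = await browser.newPage()
//         await page.authenticate({ username: proxy.username, password: proxy.password })
//         await page.goto(url, { waitUntil: 'domcontentloaded' })
//         await page.close()
//       }
//     } finally {
//       await browser.close()
//     }
//   }
```

With this layout the launch cost is paid once per proxy instead of once per page, at the price of every page in a group sharing one exit IP.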

Use this when

You are running Puppeteer against a site that rate limits or blocks by IP, and you have a pool of authenticated HTTP proxies you want to rotate across so requests spread over many addresses.

Skip this when

You only need plain HTTP requests rather than a browser (route fetch or undici through the proxy with an agent instead); the block is fingerprint-based rather than IP-based (patch the headless browser first); the site sits behind a Cloudflare interactive challenge (a proxy alone will not clear it); or you need a different IP for individual requests within one page rather than per page (intercept and re-route at the request level).
