How to scrape behind a proxy in Puppeteer
If you're scraping the same site on a loop from one machine, you have probably already watched it start working and then quietly stop: the first dozen requests return real HTML, then the responses turn into 403s, a captcha interstitial, or an empty body. The site has tied the pattern to your IP and started rate limiting or blocking it, and every retry from that address makes the block stickier.
The fix is to send each request from a different IP by routing the browser through a proxy, and to rotate across a pool of them so no single address carries the whole load. Most paid proxies require a username and password, and Chromium cannot take those credentials in the proxy URL, so you pass the host and port with the --proxy-server launch flag and answer the auth challenge with page.authenticate(). It takes about 40 lines of Node.js with Puppeteer and nothing else.
Key terms
- Proxy server. A relay that forwards your request to the target and returns the response, so the site sees the proxy's IP instead of yours.
- --proxy-server flag. The Chromium launch argument that points the whole browser at one proxy host and port for every request it makes.
- page.authenticate(). A Puppeteer method that supplies a username and password when the proxy answers a request with HTTP 407 Proxy Authentication Required.
- Rotating pool. A list of proxies the script cycles through, one per page, so requests spread across many IPs instead of hammering the target from one.
Here is what the script does:
- Read the proxy host, port, and credentials from environment variables, so no secrets sit in the source.
- Launch Puppeteer with --proxy-server pointed at one proxy from the pool.
- Call page.authenticate() with the username and password before the first navigation, so the proxy's 407 challenge is answered automatically.
- Confirm the exit IP by loading an IP-echo endpoint, then cycle the next page onto the next proxy in the pool.
The complete script
// scrape-behind-proxy.mjs
import puppeteer from 'puppeteer'

/* The proxy pool. In production read these from env or a secrets store; the
   values below are placeholders so nothing real is committed. Each entry is
   one upstream proxy: host:port plus the credentials it expects. */
const proxyPool = [
  {
    server: process.env.PROXY_1_SERVER ?? 'proxy-a.example.com:8000',
    username: process.env.PROXY_1_USER ?? 'PROXY_USERNAME',
    password: process.env.PROXY_1_PASS ?? 'PROXY_PASSWORD'
  },
  {
    server: process.env.PROXY_2_SERVER ?? 'proxy-b.example.com:8000',
    username: process.env.PROXY_2_USER ?? 'PROXY_USERNAME',
    password: process.env.PROXY_2_PASS ?? 'PROXY_PASSWORD'
  }
]

/* The pages we want, one proxy assigned per page by index. */
const targets = [
  'https://httpbin.org/ip',
  'https://httpbin.org/ip'
]

for (let i = 0; i < targets.length; i++) {
  const url = targets[i]

  /* Pick the next proxy in the pool, wrapping around if there are more
     targets than proxies. */
  const proxy = proxyPool[i % proxyPool.length]

  /* The host:port goes on the launch flag. Credentials must NOT go here:
     Chromium ignores user:pass in --proxy-server, which is why the next
     step exists. */
  const browser = await puppeteer.launch({
    headless: true,
    args: [`--proxy-server=${proxy.server}`]
  })

  try {
    const page = await browser.newPage()

    /* Answer the proxy's 407 challenge. This must run before goto(), or the
       first navigation fails with ERR_INVALID_AUTH_CREDENTIALS. */
    await page.authenticate({
      username: proxy.username,
      password: proxy.password
    })

    const response = await page.goto(url, {
      waitUntil: 'domcontentloaded',
      timeout: 30000
    })

    /* httpbin.org/ip returns the caller's IP as JSON. Reading it back proves
       the request left through the proxy and not your own address. */
    const body = await response.text()
    console.log(`[proxy] ${proxy.server} -> ${body.trim()}`)
  } catch (err) {
    /* A dead proxy throws ERR_PROXY_CONNECTION_FAILED or times out here. Log
       it and let the loop move on to the next target and proxy. */
    console.error(`[proxy] ${proxy.server} failed: ${err.message}`)
  } finally {
    await browser.close()
  }
}

npm install puppeteer
node scrape-behind-proxy.mjs

What each step does
Keep credentials out of the source. The pool reads each proxy's host, port, username, and password from environment variables and falls back to obvious placeholders. Set PROXY_1_SERVER, PROXY_1_USER, and the rest in your shell or a .env loader before running, so the real values never land in the file or a commit.
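If you would rather fail fast than fall back to placeholders, the pool can be built from the environment in a loop. This is a minimal sketch, not part of the script above: loadProxyPool is a hypothetical helper name, and it assumes the same PROXY_n_SERVER, PROXY_n_USER, and PROXY_n_PASS naming, numbered from 1 with no gaps.

```javascript
/* Hypothetical helper: collects PROXY_1_*, PROXY_2_*, ... from an env object
   and stops at the first missing PROXY_n_SERVER. It throws instead of
   silently falling back to placeholder values. */
function loadProxyPool(env = process.env) {
  const pool = []
  for (let n = 1; env[`PROXY_${n}_SERVER`]; n++) {
    const server = env[`PROXY_${n}_SERVER`]
    const username = env[`PROXY_${n}_USER`]
    const password = env[`PROXY_${n}_PASS`]
    if (!username || !password) {
      throw new Error(`PROXY_${n}_USER and PROXY_${n}_PASS must both be set`)
    }
    pool.push({ server, username, password })
  }
  if (pool.length === 0) {
    throw new Error('No proxies configured: set PROXY_1_SERVER, PROXY_1_USER, PROXY_1_PASS')
  }
  return pool
}
```

Calling loadProxyPool() with no argument reads process.env. Passing a plain object makes the helper easy to exercise without touching your real environment.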
Pass the host and port on the launch flag. args: ['--proxy-server=host:port'] points every request the browser makes at that proxy. The credentials are deliberately absent here. Chromium drops a user:pass@ prefix in the proxy URL, so authenticating through the flag does not work and you have to use the next step.
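Many providers hand you a single URL with the credentials embedded. One way to adapt it, sketched here with Node's built-in URL class, is to split it into the part that belongs on the flag and the part that belongs in page.authenticate(). splitProxyUrl is a hypothetical helper name, and the sketch assumes any special characters in the credentials are percent-encoded in the URL.

```javascript
/* Hypothetical helper: separates a credentialed proxy URL into the value for
   --proxy-server and the username/password for page.authenticate(). */
function splitProxyUrl(proxyUrl) {
  const url = new URL(proxyUrl)
  return {
    /* url.host is host:port with the user:pass@ part already removed. */
    server: `${url.protocol}//${url.host}`,
    username: decodeURIComponent(url.username),
    password: decodeURIComponent(url.password)
  }
}

const proxy = splitProxyUrl('http://PROXY_USERNAME:PROXY_PASSWORD@proxy-a.example.com:8000')
const launchArgs = [`--proxy-server=${proxy.server}`]
console.log(launchArgs)
```

The resulting launchArgs array carries no credentials, so it can be passed as args to puppeteer.launch() as-is.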
Authenticate before the first navigation. page.authenticate({ username, password }) registers the credentials Puppeteer replies with when the proxy returns HTTP 407. Call it on the page before goto(). If the first navigation runs first, it fails with ERR_INVALID_AUTH_CREDENTIALS and the page never loads.
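Because the ordering matters, it can help to make it impossible to get wrong. A minimal sketch, assuming browser is a Puppeteer Browser and proxy is one entry from the pool; openAuthenticatedPage is a hypothetical helper name, not a Puppeteer API.

```javascript
/* Hypothetical helper: returns a page that already has proxy credentials
   registered, so callers cannot navigate before authenticating. */
async function openAuthenticatedPage(browser, proxy) {
  const page = await browser.newPage()
  /* Register credentials before anything navigates on this page. */
  await page.authenticate({
    username: proxy.username,
    password: proxy.password
  })
  return page
}
```

In the script you would then write const page = await openAuthenticatedPage(browser, proxy) and call page.goto(url) on the result.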
Verify the exit IP, then rotate. Loading httpbin.org/ip echoes back the address the request came from, which confirms traffic left through the proxy. The loop assigns proxyPool[i % proxyPool.length] so each page launches on the next proxy and the requests spread across the pool instead of one IP.
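Printing the body is enough for a manual check. For an automated one, you can parse the response and compare it with your own address. The sketch below assumes httpbin.org/ip answers with a JSON object whose origin field holds the caller's IP, and it splits on commas defensively in case more than one address is reported. directIp is an assumption: your own public IP, looked up once without a proxy. The function names are hypothetical.

```javascript
/* Parses the {"origin": "..."} body returned by httpbin.org/ip into a list
   of addresses. */
function exitIpsFromBody(body) {
  const { origin } = JSON.parse(body)
  if (typeof origin !== 'string' || origin.trim() === '') {
    throw new Error('Response has no "origin" field')
  }
  return origin.split(',').map((ip) => ip.trim())
}

/* True only when your own address does not appear in the echoed list. */
function leftThroughProxy(body, directIp) {
  return !exitIpsFromBody(body).includes(directIp)
}

/* Same rotation rule as the script: wrap around when targets outnumber proxies. */
function proxyForTarget(pool, targetIndex) {
  return pool[targetIndex % pool.length]
}
```

With a two-proxy pool, proxyForTarget assigns targets 0, 1, 2 to proxies 0, 1, 0, which is the same round-robin order the loop in the script produces.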
Gotchas
Credentials in the proxy URL are silently ignored.
- Issue: Writing --proxy-server=http://user:pass@host:8000 looks right, but Chromium strips the user:pass@ part and sends unauthenticated requests, so the proxy answers every one with 407 and pages never load.
- Fix: Put only host:port on the flag and supply username and password through page.authenticate().
Authentication after the first goto fails.
- Issue: Calling page.goto() before page.authenticate() triggers the proxy's 407 with no handler registered, and the navigation rejects with ERR_INVALID_AUTH_CREDENTIALS.
- Fix: Call page.authenticate() on the new page first, then navigate. Authentication is per page, so repeat it on every page you open.
One bad proxy aborts the whole run.
- Issue: A dead or overloaded proxy makes page.goto() throw ERR_PROXY_CONNECTION_FAILED or hit the timeout, and an unhandled throw stops the loop before the remaining targets run.
- Fix: Wrap the per-target body in try/catch, log the failure, and continue to the next proxy. The example isolates each target in its own browser so a crash does not poison later iterations.
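The script above logs a failure and moves on to the next target. If you would rather retry the same target on the next proxy, one possible shape is a small failover wrapper. This goes beyond what the script does: withProxyFailover is a hypothetical helper, and task(proxy) stands for whatever async function performs the launch, authenticate, and goto for one target.

```javascript
/* Hypothetical helper: tries `task` on each proxy in turn, starting at
   `startIndex`, and returns the first successful result. It throws only when
   every proxy in the pool has failed. */
async function withProxyFailover(pool, startIndex, task) {
  const errors = []
  for (let attempt = 0; attempt < pool.length; attempt++) {
    const proxy = pool[(startIndex + attempt) % pool.length]
    try {
      return await task(proxy)
    } catch (err) {
      errors.push(`${proxy.server}: ${err.message}`)
    }
  }
  throw new Error(`All proxies failed -> ${errors.join('; ')}`)
}
```

Each attempt should still launch and close its own browser inside task, so a failed proxy leaves nothing behind for the next attempt.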
Pages can still leak your real IP through WebRTC.
- Issue: The proxy carries the browser's HTTP and HTTPS traffic, but a site's JavaScript can ask the browser for its local and public addresses over WebRTC, whose UDP traffic bypasses the proxy and exposes the host IP.
- Fix: Launch with --force-webrtc-ip-handling-policy=disable_non_proxied_udp, or block the WebRTC APIs in an init script, when the target fingerprints with WebRTC.
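Both halves of that fix can be sketched in a few lines. buildLaunchArgs and blockWebRtc are hypothetical helper names. The init script uses Puppeteer's page.evaluateOnNewDocument(), and the list of globals it hides is an assumption about which WebRTC entry points matter for your target.

```javascript
/* Hypothetical helper: launch args with the proxy plus the WebRTC policy flag. */
function buildLaunchArgs(proxyServer) {
  return [
    `--proxy-server=${proxyServer}`,
    '--force-webrtc-ip-handling-policy=disable_non_proxied_udp'
  ]
}

/* Hypothetical helper: hides the WebRTC constructors before any page script
   runs. Call it once per page, before the first goto(). */
async function blockWebRtc(page) {
  await page.evaluateOnNewDocument(() => {
    for (const name of ['RTCPeerConnection', 'webkitRTCPeerConnection', 'RTCDataChannel']) {
      Object.defineProperty(window, name, { value: undefined, configurable: false })
    }
  })
}
```

Removing the API outright is a blunt tool: a missing RTCPeerConnection can itself be detected, so the launch flag alone may be the better first choice.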
Relaunching a browser per page is slow at scale.
- Issue: puppeteer.launch() costs roughly 300-700ms each time, so a fresh browser per target dominates the runtime once you are processing thousands of pages.
- Fix: Keep one browser per proxy and open many pages on it, or run a small set of browsers concurrently with a promise pool sized to your proxy count.
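A promise pool does not need a library. The sketch below is one minimal way to cap concurrency. runWithLimit is a hypothetical helper, worker stands for your per-target scrape function, and limit is assumed to be a positive integer such as the number of proxies in the pool.

```javascript
/* Hypothetical helper: runs worker(item, index) over items with at most
   `limit` calls in flight. Results keep the input order, and a rejected item
   is recorded instead of aborting the run. */
async function runWithLimit(items, limit, worker) {
  const results = new Array(items.length)
  let next = 0

  /* Each lane pulls the next unclaimed index until none are left. */
  async function lane() {
    while (next < items.length) {
      const index = next++
      try {
        results[index] = { ok: true, value: await worker(items[index], index) }
      } catch (err) {
        results[index] = { ok: false, error: err }
      }
    }
  }

  const lanes = Array.from({ length: Math.min(limit, items.length) }, lane)
  await Promise.all(lanes)
  return results
}
```

Called as runWithLimit(targets, proxyPool.length, scrapeOne), where scrapeOne is your own function, this keeps roughly one target in flight per proxy while still isolating failures per target.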
A SOCKS proxy needs a scheme and cannot authenticate this way.
- Issue: Passing a bare host:port assumes HTTP; a SOCKS5 endpoint without the scheme is treated as HTTP and the connection fails, and Chromium does not support SOCKS proxy authentication at all.
- Fix: Use --proxy-server=socks5://host:port for SOCKS, and for SOCKS that requires credentials, run a local authenticated forwarder and point Chromium at that instead.
Use this when
You are running Puppeteer against a site that rate limits or blocks by IP, and you have a pool of authenticated HTTP proxies you want to rotate across so requests spread over many addresses.
Skip this when
You only need plain HTTP requests rather than a browser (route fetch or undici through the proxy with an agent instead); the block is fingerprint-based rather than IP-based (patch the headless browser first); the site sits behind a Cloudflare interactive challenge (a proxy alone will not clear it); or you need a different IP for individual requests within one page rather than per page (intercept and re-route at the request level).