How to scrape a page behind a login in Playwright

Updated 2026-06-25 · 6 min read

If you're scraping pages that sit behind a sign-in form, you have probably wired the login into the start of your script: type the email, type the password, click submit, wait for the dashboard, then scrape. It works, but it runs the full login on every job, which is slow, hammers the login endpoint, and trips rate limits or a fresh email or SMS challenge after a handful of attempts.

The fix is to authenticate once, capture the resulting session, and replay it on later runs. We'll build two short scripts. The first, which you run by hand one time, logs in, waits for a post-login signal so the session is real before it saves anything, and writes the cookies and localStorage to a JSON file. The second loads that file into a fresh browser context on every later run, so Playwright starts already signed in and goes straight to the protected page, checking an account-only element first to confirm the session still holds. The scripts come to about 30 and 20 lines and use only Playwright. Log in only to your own account or one you are authorized to use; this technique is for sessions you are allowed to access.

The complete script

// save-login.mjs
// run this once, by hand, to capture a logged-in session to disk.
import { chromium } from 'playwright'

const browser = await chromium.launch({ headless: true })
const context = await browser.newContext()
const page = await context.newPage()

// 1. go to the login form. replace with the site you are authorized to scrape.
await page.goto('https://practicetestautomation.com/practice-test-login/')

// 2. fill the credentials. read them from the environment, never hardcode them.
await page.fill('#username', process.env.SCRAPE_USER)
await page.fill('#password', process.env.SCRAPE_PASS)

// 3. submit and wait for a post-login signal before saving anything.
//    waitForURL resolves once the browser is on the logged-in page, which
//    means the session cookies have been set.
await Promise.all([
  page.waitForURL('**/logged-in-successfully/'),
  page.click('#submit')
])

// 4. persist cookies + localStorage to disk. this file IS the session.
await context.storageState({ path: 'storageState.json' })
console.log('Saved session to storageState.json')

await browser.close()

Install Playwright and its Chromium build, then run the script once:

npm install playwright
npx playwright install chromium
SCRAPE_USER='student' SCRAPE_PASS='Password123' node save-login.mjs

Then reuse the saved session on every later run without touching the login form:

// scrape-with-session.mjs
// run this as often as you like. it never logs in; it replays the saved session.
import { chromium } from 'playwright'

const browser = await chromium.launch({ headless: true })

// load the cookies + localStorage captured by save-login.mjs.
// the new context starts already authenticated.
const context = await browser.newContext({ storageState: 'storageState.json' })
const page = await context.newPage()

// go straight to the protected page. no login form involved.
await page.goto('https://practicetestautomation.com/logged-in-successfully/')

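// confirm the session is still valid before trusting the scrape.
// if the element is missing (for example after a redirect to the login form),
// textContent times out, so treat that as "no heading" and fail clearly below.
const heading = await page.textContent('.post-title', { timeout: 5000 }).catch(() => null)
if (!heading || !heading.includes('Logged In Successfully')) {
  throw new Error('Session expired or invalid. Re-run save-login.mjs.')
}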

// scrape the account-only content.
const body = await page.textContent('.post-content')
console.log(body.trim())

await browser.close()

Run it from the shell:

node scrape-with-session.mjs

How it works

Launch headless, but log in headed the first time if there is a challenge. chromium.launch({ headless: true }) is the current default and is fine when the login is a plain form. If the site shows a CAPTCHA or a one-time code on first sign-in, run save-login.mjs once with headless: false, clear the challenge by hand, and let the script reach the save step. After that, scrape-with-session.mjs stays headless because it never logs in.
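
One way to switch modes without editing the script is to read the choice from the environment. This is a minimal sketch, not part of the scripts above; the HEADED variable name and the launchOptionsFromEnv helper are assumptions made up for illustration.

```javascript
// Hypothetical helper: HEADED=1 opens a visible browser window,
// anything else keeps the default headless mode.
function launchOptionsFromEnv(env = process.env) {
  return { headless: env.HEADED !== '1' }
}

// In save-login.mjs this would replace the hardcoded options:
//   const browser = await chromium.launch(launchOptionsFromEnv())
console.log(launchOptionsFromEnv({ HEADED: '1' })) // { headless: false }
console.log(launchOptionsFromEnv({}))              // { headless: true }
```

With this in place, running HEADED=1 node save-login.mjs would open a window for the one-off challenge, and a plain node save-login.mjs would stay headless.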

Read credentials from the environment. process.env.SCRAPE_USER and process.env.SCRAPE_PASS keep the username and password out of the source and out of version control. The example uses the public demo account on practicetestautomation.com so the script runs as written; point the URL and selectors at the account you are authorized to scrape. The #username, #password, and #submit selectors are specific to that demo form, so on another site they match nothing and page.fill times out after the default 30 seconds. Open the target login form, copy the real field selectors, and prefer stable attributes like input[name="email"] or page.getByLabel('Password') over generated class names.
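
If either variable is unset, process.env yields undefined and the fill call fails with an error that may not point at the real cause. A small guard can name the missing variable up front. This is a sketch under that assumption; requireEnv is a hypothetical helper, not a Playwright or Node API.

```javascript
// Hypothetical helper: return the variable's value, or throw an error
// that names the missing variable.
function requireEnv(name, env = process.env) {
  const value = env[name]
  if (typeof value !== 'string' || value === '') {
    throw new Error(`Missing environment variable ${name}`)
  }
  return value
}

// In save-login.mjs the fill calls would become:
//   await page.fill('#username', requireEnv('SCRAPE_USER'))
//   await page.fill('#password', requireEnv('SCRAPE_PASS'))
```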

Wait for a post-login signal before saving. page.waitForURL('**/logged-in-successfully/') resolves once the browser lands on the signed-in page, which is the point where the session cookies exist. Wrapping it with the submit click in Promise.all registers the wait before the click fires, so the navigation cannot be missed. Because waitForURL also resolves when the page is already on a matching URL, awaiting the click and then waitForURL works as well. Saving before this signal captures a logged-out session, which is the most common reason the saved file does not work.
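
The save step can also be guarded so that a failed login stops the script instead of writing a logged-out session. The sketch below awaits the click and then the URL wait in sequence, with a shorter timeout than the default. The submitAndConfirm name and the 15-second timeout are assumptions for illustration; page.click and page.waitForURL are the same Playwright calls the script already uses.

```javascript
// Hypothetical helper. `page` is a Playwright Page; only click() and
// waitForURL() are used.
async function submitAndConfirm(page, submitSelector, successUrl, timeoutMs = 15000) {
  await page.click(submitSelector)
  try {
    // Resolves once the main frame is on a URL matching successUrl.
    await page.waitForURL(successUrl, { timeout: timeoutMs })
  } catch (err) {
    // Bad credentials or a challenge page: stop before storageState() runs.
    throw new Error(`Login never reached ${successUrl}, session not saved: ${err.message}`)
  }
}

// In save-login.mjs, step 3 would become:
//   await submitAndConfirm(page, '#submit', '**/logged-in-successfully/')
//   await context.storageState({ path: 'storageState.json' })
```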

Persist with context.storageState({ path }). This writes a JSON file holding the cookies and each origin's localStorage. Capturing both matters because single-page apps often keep the auth token in localStorage rather than a cookie, so a cookies-only save with context.cookies() reloads without the token and the app treats you as signed out. That file is the entire session. Anyone with it is signed in as that account, so treat it like a password: add it to .gitignore before the first run, keep it off shared machines, and rotate the account password if it was ever pushed.
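
To make the file's contents concrete, here is a sketch of its shape and a way to inspect what was captured without printing any secret values. The cookie and token values are made up, and summarizeState is a hypothetical helper rather than part of Playwright.

```javascript
// Shape of the file written by context.storageState(), with made-up values.
const exampleState = {
  cookies: [
    { name: 'sessionid', value: 'made-up-cookie', domain: 'example.com', path: '/',
      expires: 1767225600, httpOnly: true, secure: true, sameSite: 'Lax' }
  ],
  origins: [
    { origin: 'https://example.com',
      localStorage: [{ name: 'authToken', value: 'made-up-token' }] }
  ]
}

// Hypothetical helper: list cookie names and localStorage keys, never values.
function summarizeState(state) {
  return {
    cookieNames: state.cookies.map((c) => `${c.domain}:${c.name}`),
    localStorageKeys: state.origins.flatMap((o) =>
      o.localStorage.map((item) => `${o.origin}:${item.name}`))
  }
}

console.log(summarizeState(exampleState))
```

Against a real capture, the same function could be fed the parsed contents of storageState.json. An empty cookieNames list and an empty localStorageKeys list together would suggest the save ran before the login completed.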

Reload with the storageState option. browser.newContext({ storageState: 'storageState.json' }) seeds the new context with those cookies and localStorage, so the first navigation is already authenticated. Cookies have a lifetime, and when they lapse the context still loads and the navigation still succeeds. That is why the reuse script goes directly to the protected URL and asserts an account-only element before it trusts the output, rather than trusting a 200 response. Some sites also bind the session to the issuing IP or User-Agent, so replay from the same egress IP where you can and pass a consistent userAgent to newContext so the reused session matches the context that created it.
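
Cookie lifetimes are recorded in the saved file, so a cheap pre-check can flag an obviously stale session before a browser is launched. This is a sketch, assuming the saved state has been parsed into an object; expiredCookies is a hypothetical helper. It complements the element assertion rather than replacing it, because a server can revoke a session whose cookies have not yet expired.

```javascript
// Hypothetical helper: return the cookies in a storage state that have
// already expired. `expires` is Unix time in seconds; -1 marks a session
// cookie with no expiry date, which is skipped here.
function expiredCookies(state, nowSeconds = Date.now() / 1000) {
  return state.cookies.filter((c) => c.expires !== -1 && c.expires <= nowSeconds)
}

// Made-up state for illustration.
const state = {
  cookies: [
    { name: 'old', expires: 1000 },
    { name: 'fresh', expires: 4102444800 }, // 2100-01-01 UTC
    { name: 'session-only', expires: -1 }
  ],
  origins: []
}
console.log(expiredCookies(state).map((c) => c.name)) // [ 'old' ]
```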

Use this when

You scrape an account-gated page on your own or an authorized account and want to run the scrape repeatedly without re-submitting the login form, for a dashboard export, a private feed, or an internal tool that has no API.

Skip this when

The site offers an official API with a token (call the API instead of driving a browser); the content is public and needs no auth (a plain page.goto is enough); the login depends on a CAPTCHA or one-time code on every attempt (solve the challenge interactively and lean on storageState reuse to keep re-logins rare); or you need many concurrent accounts (give each its own storageState file and context rather than one shared session).
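
For the many-accounts case, the one-file-per-account idea can be sketched as follows. The statePathFor and contextsForAccounts helpers and the state/ directory are assumptions for illustration; browser.newContext({ storageState }) is the same Playwright call used in the reuse script.

```javascript
// Hypothetical helper: map an account id to its own session file,
// e.g. 'alice' -> 'state/alice.json'. Ids are restricted to safe characters
// so an id can never point outside the state directory.
function statePathFor(accountId, dir = 'state') {
  if (!/^[A-Za-z0-9_-]+$/.test(accountId)) {
    throw new Error(`Unsafe account id: ${accountId}`)
  }
  return `${dir}/${accountId}.json`
}

// `browser` is a Playwright Browser; each account gets an isolated context
// seeded from its own saved session.
async function contextsForAccounts(browser, accountIds) {
  const contexts = new Map()
  for (const id of accountIds) {
    contexts.set(id, await browser.newContext({ storageState: statePathFor(id) }))
  }
  return contexts
}
```

Each context keeps its own cookies and localStorage, so one account's session cannot leak into another's pages.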
