How to scrape a page behind a login in Playwright
If you're scraping pages that sit behind a sign-in form, you have probably wired the login into the start of your script: type the email, type the password, click submit, wait for the dashboard, then scrape. It works, but it runs the full login on every job, which is slow, hammers the login endpoint, and can trip rate limits or trigger a fresh email or SMS challenge after a handful of attempts.
The fix is to authenticate once interactively, capture the resulting session, and replay it on later runs. Playwright stores the cookies and localStorage the site uses to keep you signed in, writes them to a JSON file with context.storageState(), and loads them back into a fresh context with the storageState option so the next run starts already logged in. This page splits the work into two short scripts of about 30 and 20 lines, using only Playwright. Log in to your own or an authorized account; this is for sessions you are allowed to access.
Key terms
- storageState. A Playwright JSON snapshot of a browser context's cookies and per-origin localStorage, written by context.storageState({ path }) and reloaded by browser.newContext({ storageState }).
- Browser context. An isolated session inside one browser process, with its own cookies and storage, created by browser.newContext(). Two contexts do not share a login.
- Cookies. The session identifiers most sites set after a successful login; sending them back on later requests is what keeps you signed in.
- localStorage. Per-origin key-value storage some single-page apps use to hold an auth token instead of, or alongside, a cookie. storageState captures it too.
Here is what the two scripts do:
- Run a one-time login with Playwright, fill the email and password fields, and submit the form.
- Wait for a post-login signal (a URL change or an account-only element) so the session cookies are set before you save anything.
- Write the cookies and localStorage to storageState.json with context.storageState({ path }).
- On every later run, create a context with { storageState: 'storageState.json' } so Playwright starts signed in and goes straight to the protected page.
The complete script
// save-login.mjs
// Run this once, by hand, to capture a logged-in session to disk.
import { chromium } from 'playwright'
const browser = await chromium.launch({ headless: true })
const context = await browser.newContext()
const page = await context.newPage()
// 1. Go to the login form. Replace with the site you are authorized to scrape.
await page.goto('https://practicetestautomation.com/practice-test-login/')
// 2. Fill the credentials. Read them from the environment, never hardcode them.
await page.fill('#username', process.env.SCRAPE_USER)
await page.fill('#password', process.env.SCRAPE_PASS)
// 3. Submit and wait for a post-login signal before saving anything.
// waitForURL resolves once the browser is on the logged-in page, which
// means the session cookies have been set.
await Promise.all([
  page.waitForURL('**/logged-in-successfully/'),
  page.click('#submit')
])
// 4. Persist cookies + localStorage to disk. This file IS the session.
await context.storageState({ path: 'storageState.json' })
console.log('Saved session to storageState.json')
await browser.close()

npm install playwright
npx playwright install chromium
SCRAPE_USER='student' SCRAPE_PASS='Password123' node save-login.mjs

Then reuse the saved session on every later run without touching the login form:
// scrape-with-session.mjs
// Run this as often as you like. It never logs in; it replays the saved session.
import { chromium } from 'playwright'
const browser = await chromium.launch({ headless: true })
// Load the cookies + localStorage captured by save-login.mjs.
// The new context starts already authenticated.
const context = await browser.newContext({ storageState: 'storageState.json' })
const page = await context.newPage()
// Go straight to the protected page. No login form involved.
await page.goto('https://practicetestautomation.com/logged-in-successfully/')
// Confirm the session is still valid before trusting the scrape.
const heading = await page.textContent('.post-title')
if (!heading || !heading.includes('Logged In Successfully')) {
  throw new Error('Session expired or invalid. Re-run save-login.mjs.')
}
// Scrape the account-only content.
const body = await page.textContent('.post-content')
console.log(body.trim())
await browser.close()

node scrape-with-session.mjs

What each step does
Launch headless, but log in headed the first time if there is a challenge. chromium.launch({ headless: true }) is the current default and is fine when the login is a plain form. If the site shows a CAPTCHA or a one-time code on first sign-in, run save-login.mjs once with headless: false, clear the challenge by hand, and let the script reach the save step. After that, scrape-with-session.mjs stays headless because it never logs in.
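One way to switch between the headed first run and later headless runs without editing the script is to read the mode from the environment. This is a minimal sketch: the `HEADED` variable name and the `launchOptions` helper are assumptions made for illustration, not Playwright settings.

```javascript
// Hypothetical helper: choose launch options from the environment.
// HEADED is an illustrative variable name, not something Playwright reads.
function launchOptions(env = process.env) {
  // Stay headless unless HEADED=1 is set, e.g. to clear a CAPTCHA by hand.
  return { headless: env.HEADED !== '1' }
}

// Usage in save-login.mjs:
//   const browser = await chromium.launch(launchOptions())
```

Running `HEADED=1 node save-login.mjs` would then open a visible window. When a challenge has to be cleared by hand, the default 30-second wait may be too short; `page.waitForURL` accepts a `timeout` option, and `0` disables the limit.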
Read credentials from the environment. process.env.SCRAPE_USER and process.env.SCRAPE_PASS keep the email and password out of the source and out of version control. The example uses the public demo account on practicetestautomation.com so the script runs as written; point the URL and selectors at the account you are authorized to scrape.
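If a variable is unset, the fill call receives `undefined`, and the resulting error is unlikely to name the variable that was missing. A small guard can fail earlier with a clearer message. This is a sketch; `requireEnv` is a hypothetical helper name.

```javascript
// Hypothetical helper: fail fast when a credential is missing or empty.
function requireEnv(name, env = process.env) {
  const value = env[name]
  if (!value) {
    throw new Error(`Missing environment variable ${name}`)
  }
  return value
}

// Usage in save-login.mjs:
//   await page.fill('#username', requireEnv('SCRAPE_USER'))
//   await page.fill('#password', requireEnv('SCRAPE_PASS'))
```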
Wait for a post-login signal before saving. page.waitForURL('**/logged-in-successfully/') resolves once the browser lands on the signed-in page, which is the point where the session cookies exist. Starting the wait alongside the submit click inside Promise.all registers it before the navigation begins; because waitForURL also resolves when the page is already at a matching URL, awaiting it right after the click works as well. Saving before this signal captures a logged-out session, which is the most common reason the saved file does not work.
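The submit-then-wait ordering can be wrapped in a small helper so the save step cannot run early. This is a sketch under assumptions: `submitAndWait` and its option names are hypothetical, and the optional second check assumes the site shows an element only to signed-in users.

```javascript
// Hypothetical helper: submit the login form and block until a
// post-login signal appears. `page` is a Playwright Page.
async function submitAndWait(page, { submit, loggedInUrl, accountSelector }) {
  await Promise.all([
    page.waitForURL(loggedInUrl),
    page.click(submit),
  ])
  // Optional second signal: an element only signed-in users see.
  if (accountSelector) {
    await page.waitForSelector(accountSelector)
  }
}

// Usage in save-login.mjs, before context.storageState({ path }):
//   await submitAndWait(page, {
//     submit: '#submit',
//     loggedInUrl: '**/logged-in-successfully/',
//   })
```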
Persist with context.storageState({ path }). This writes a JSON file holding the cookies and each origin's localStorage. That file is the entire session. Anyone with it is signed in as that account, so treat it like a password: keep it out of git and off shared machines.
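To see what the file holds without printing secrets, it can help to list only the cookie and localStorage names. The sketch below uses made-up data in the shape Playwright writes (a `cookies` array and an `origins` array); `summarizeState` is a hypothetical helper, and every value in `example` is illustrative.

```javascript
// Summarize a parsed storageState object without printing secret values.
function summarizeState(state) {
  const cookies = state.cookies ?? []
  const origins = state.origins ?? []
  return {
    cookieNames: cookies.map((c) => `${c.domain}:${c.name}`),
    localStorageKeys: origins.flatMap((o) =>
      (o.localStorage ?? []).map((item) => `${o.origin}:${item.name}`)
    ),
  }
}

// Illustrative data in the shape Playwright writes; all values are made up.
const example = {
  cookies: [
    { name: 'sessionid', value: 'REDACTED', domain: 'example.com', path: '/',
      expires: -1, httpOnly: true, secure: true, sameSite: 'Lax' },
  ],
  origins: [
    { origin: 'https://example.com',
      localStorage: [{ name: 'authToken', value: 'REDACTED' }] },
  ],
}

console.log(summarizeState(example))
```

For a real file, the input would come from `JSON.parse` on the contents of storageState.json.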
Reload with the storageState option. browser.newContext({ storageState: 'storageState.json' }) seeds the new context with those cookies and localStorage, so the first navigation is already authenticated. The reuse script goes directly to the protected URL and checks an account-only element to confirm the session held before it trusts the output.
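One detail of that check: `page.textContent` waits for the selector, so on a logged-out page a missing element surfaces as a timeout rather than as the custom error. Here is a sketch of a variant that fails immediately; `assertLoggedIn` is a hypothetical helper, and it assumes the marker element is present in the page as soon as navigation completes.

```javascript
// Hypothetical helper: throw at once if the account-only element is absent.
// `page` is a Playwright Page that has already navigated.
async function assertLoggedIn(page, selector, expectedText) {
  const marker = page.locator(selector)
  // count() does not wait, so a logged-out page fails immediately.
  if ((await marker.count()) === 0) {
    throw new Error('Session expired or invalid. Re-run save-login.mjs.')
  }
  const text = (await marker.first().textContent()) ?? ''
  if (!text.includes(expectedText)) {
    throw new Error('Session expired or invalid. Re-run save-login.mjs.')
  }
}

// Usage in scrape-with-session.mjs, after page.goto(...):
//   await assertLoggedIn(page, '.post-title', 'Logged In Successfully')
```

On pages that render the marker some time after load, waiting on the locator with a short explicit timeout is likely the safer choice.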
Gotchas
The saved file is logged out because you saved too early.
- Issue: Calling context.storageState() right after page.click('#submit') often runs before the navigation finishes, so the cookies are not set yet and the JSON holds an anonymous session.
- Fix: Gate the save on a real post-login signal, await page.waitForURL('**/logged-in-successfully/') or await page.waitForSelector('.account-menu'), and only then call context.storageState({ path }).
The session expires and later runs silently scrape the logged-out page.
- Issue: Cookies have a lifetime. When they lapse, newContext({ storageState }) still loads, the navigation still succeeds, and you scrape the public or login-redirect version without an error.
- Fix: Assert an account-only element after navigation, as the reuse script does with .post-title, and throw so you know to re-run save-login.mjs when it is missing, rather than trusting a 200 response.
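A cheap pre-flight check can catch the obvious case before a browser is launched. The sketch below reads the `expires` field Playwright stores per cookie (Unix seconds, with -1 for session cookies); `expiredCookies` is a hypothetical helper. A server can revoke a session earlier than the cookie's expiry, so passing this check does not replace the in-page assertion.

```javascript
// List cookies in a parsed storageState object that have already expired.
// `expires` is Unix time in seconds; -1 marks a session cookie with no expiry.
function expiredCookies(state, nowSeconds = Date.now() / 1000) {
  return (state.cookies ?? []).filter(
    (c) => c.expires !== -1 && c.expires <= nowSeconds
  )
}

// Usage before launching the browser, with `state` parsed from storageState.json:
//   const stale = expiredCookies(state)
//   if (stale.length > 0) {
//     throw new Error(`Expired cookies: ${stale.map((c) => c.name).join(', ')}`)
//   }
```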
A token in localStorage is missed if you only save cookies.
- Issue: Single-page apps often keep the auth token in localStorage, not a cookie, so a cookies-only save (for example context.cookies()) reloads without the token and the app treats you as signed out.
- Fix: Use context.storageState(), which captures cookies and per-origin localStorage together, instead of saving cookies on their own.
storageState.json lands in your git history.
- Issue: The file contains live session credentials, so committing it hands account access to anyone who reads the repository.
- Fix: Add storageState.json to .gitignore before the first run, and rotate the account password if it was ever pushed.
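If the capture script should enforce this itself, one option is to make sure the entry exists before writing the session file. This is a sketch of the text handling only; `withIgnoreEntry` is a hypothetical helper, and reading and writing .gitignore is left to the caller.

```javascript
// Return .gitignore text that is guaranteed to contain `entry` on its own line.
function withIgnoreEntry(gitignoreText, entry) {
  const lines = gitignoreText.split(/\r?\n/).map((line) => line.trim())
  if (lines.includes(entry)) return gitignoreText
  const needsNewline = gitignoreText !== '' && !gitignoreText.endsWith('\n')
  return `${gitignoreText}${needsNewline ? '\n' : ''}${entry}\n`
}

// Usage with Node's fs/promises, before context.storageState({ path }):
//   const current = await readFile('.gitignore', 'utf8').catch(() => '')
//   await writeFile('.gitignore', withIgnoreEntry(current, 'storageState.json'))
```

Appending the line by hand from a shell or an editor works just as well; the helper only avoids duplicate entries.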
Selectors aimed at the demo form do not match the target site.
- Issue: #username, #password, and #submit are specific to practicetestautomation.com; on another site they select nothing and page.fill times out after 30 seconds.
- Fix: Open the target login form, copy the real field selectors, and prefer stable attributes like input[name="email"] or getByLabel('Password') over generated class names.
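A label- and role-based version of the fill step might look like the sketch below. The label and button text ('Email', 'Password', 'Sign in') are assumptions for illustration, as is the `fillLogin` name; read the real text from the target form.

```javascript
// Hypothetical login helper using label- and role-based locators.
// 'Email', 'Password', and 'Sign in' are assumed texts; replace them with
// what the target form actually shows. `page` is a Playwright Page.
async function fillLogin(page, { user, pass }) {
  await page.getByLabel('Email').fill(user)
  await page.getByLabel('Password').fill(pass)
  await page.getByRole('button', { name: 'Sign in' }).click()
}

// Usage in save-login.mjs, with credentials read from the environment:
//   await fillLogin(page, {
//     user: process.env.SCRAPE_USER,
//     pass: process.env.SCRAPE_PASS,
//   })
```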
Some sites bind the session to the browser fingerprint or IP.
- Issue: A session captured on one machine can be rejected when replayed from a different IP or a context with a different User-Agent, because the site ties the cookie to where it was issued.
- Fix: Replay from the same egress IP where you can, and pass a consistent userAgent to newContext so the reused session matches the context that created it.
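One way to keep the User-Agent consistent is to record it when the session is captured and pass it back on replay. This is a sketch; `replayOptions` is a hypothetical helper, and where the recorded string is stored is left open.

```javascript
// Hypothetical helper: build newContext options that reuse the saved session
// together with the User-Agent recorded when it was captured.
function replayOptions(statePath, userAgent) {
  const options = { storageState: statePath }
  if (userAgent) options.userAgent = userAgent
  return options
}

// At save time (in save-login.mjs), one way to record the User-Agent:
//   const userAgent = await page.evaluate(() => navigator.userAgent)
// At replay time (in scrape-with-session.mjs):
//   const context = await browser.newContext(
//     replayOptions('storageState.json', userAgent)
//   )
```

This matters most when the login ran headed and the replay runs headless, since the two modes can report different User-Agent strings.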
Use this when
You scrape an account-gated page on your own or an authorized account and want to run the scrape repeatedly without re-submitting the login form, for a dashboard export, a private feed, or an internal tool that has no API.
Skip this when
- The site offers an official API with a token. Call the API instead of driving a browser.
- The content is public and needs no auth. A plain page.goto is enough.
- The login depends on a CAPTCHA or one-time code on every attempt. Solve the challenge interactively and lean on storageState reuse to keep re-logins rare.
- You need many concurrent accounts. Give each its own storageState file and context rather than one shared session.
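For the multi-account case, a minimal sketch of one state file and one context per account follows. The `state/` directory, the `statePathFor` and `openContexts` names, and the account labels are all assumptions; each file would first be captured by its own run of the login script.

```javascript
// Hypothetical helper: map an account label to its own state file.
// Labels are assumed to stay unique after sanitizing.
function statePathFor(account) {
  // Keep only filename-safe characters so a label cannot escape the directory.
  const safe = account.replace(/[^a-zA-Z0-9_-]/g, '_')
  return `state/${safe}.json`
}

// Open one isolated context per account. `browser` is a Playwright Browser.
async function openContexts(browser, accounts) {
  const contexts = new Map()
  for (const account of accounts) {
    contexts.set(
      account,
      await browser.newContext({ storageState: statePathFor(account) })
    )
  }
  return contexts
}

// Usage:
//   const contexts = await openContexts(browser, ['alice', 'bob'])
//   const page = await contexts.get('alice').newPage()
```

Because contexts do not share cookies or storage, the accounts stay isolated even though they run in one browser process.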