Simplescraper Learn

Web scraping guides for developers

Practical, runnable code for real scraping problems. Pick a task below.

Dynamic & JavaScript pages

How to scrape data from a Shadow DOM

Read text and attributes out of an open shadow DOM with raw Puppeteer using the >>> deep combinator and pierce/ selectors. No SDK, runnable code.

How to scrape an infinite scroll page in Puppeteer

Scrape an infinite scroll page in Puppeteer by scrolling in a page.evaluate loop until the content stops growing, then read every item out of the expanded DOM. Working code, raw Puppeteer, no API.

How to scrape a JavaScript-rendered page in Node.js

Scrape a JavaScript-rendered page in Node.js with Puppeteer by launching headless Chrome, waiting for the injected content, and reading the rendered DOM. Working code, no SDK lock-in.

How to scrape a page behind a login in Playwright

Log in once with Playwright, save the session to storageState.json, and reuse it on every later run so you skip the login form.

How to scrape a single-page app (React/Vue) in Node.js

Scrape a React or Vue single-page app in Node.js with Puppeteer, waiting for the framework to render before reading the DOM. Working code, no SDK.

How to wait for an element to load before scraping

Wait for a JavaScript-rendered element to appear before scraping it, using Puppeteer's waitForSelector in modern ESM Node.js instead of a fixed sleep.

How to handle load more buttons when scraping

Click a "Load more" button in a loop with Puppeteer until every item is loaded, waiting on the XHR response after each click instead of guessing with a fixed delay.

How to scrape a page that requires scrolling to a specific element

Scroll a headless Chrome page to a specific element with Puppeteer, wait for the lazy content it triggers, then read it. Working code, no SDK lock-in.

How to scrape an iframe's contents in Puppeteer

Scrape text and data from inside an iframe with Puppeteer using page.frames(), contentFrame(), and frame.evaluate(). Working code, handles cross-origin and late-loading frames.

Extracting data

How to extract all links from a page in JavaScript

Extract a page's anchor links in Node.js with cheerio, resolving relative URLs and deduplicating. Includes a jsdom alternative, no SDK.

How to extract structured JSON from messy HTML in Node.js

Turn a product page's tangled HTML into a typed JSON object in Node.js using cheerio selectors mapped to fields, with a JSON-LD fallback.

How to intercept and read network requests in Puppeteer

Capture every request and response a page makes with Puppeteer's page.on('request') and page.on('response'), then read JSON response bodies through a CDP Network session. Working code, no SDK.

How to scrape an API with pagination in JavaScript

Page through a JSON API in Node.js with native fetch, covering cursor, offset, and page-number pagination as three runnable loops that stop when the API is exhausted.
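As a rough illustration of the cursor variant, here is a minimal sketch. It assumes a hypothetical API that accepts a `cursor` query parameter and returns `{ items, nextCursor }`; those names, the `maxPages` safety cap, and the injectable `fetchImpl` argument are illustrative assumptions, not taken from any specific API.

```javascript
// Cursor pagination sketch. The response shape { items, nextCursor } and the
// "cursor" query parameter are assumptions; adjust them to the real API.
async function fetchAllByCursor(baseUrl, fetchImpl = fetch, maxPages = 1000) {
  const all = [];
  let cursor = null;
  // maxPages is a safety cap so a misbehaving API cannot loop forever.
  for (let page = 0; page < maxPages; page++) {
    const url = new URL(baseUrl);
    if (cursor) url.searchParams.set('cursor', cursor);
    const res = await fetchImpl(url);
    if (!res.ok) throw new Error(`HTTP ${res.status} for ${url}`);
    const data = await res.json();
    all.push(...(data.items ?? []));
    cursor = data.nextCursor;
    if (!cursor) break; // no cursor returned: the API is exhausted
  }
  return all;
}
```

Offset and page-number loops follow the same pattern, with the stop condition changed to an empty or short page.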

How to scrape data loaded from an XHR/fetch request

Find the JSON endpoint a page calls with Puppeteer, then fetch it directly with node-fetch so you skip the browser on every later run.

How to scrape paginated results in JavaScript

Loop through page-number pagination in JavaScript with fetch and cheerio, incrementing ?page=N until a page returns no rows. Working code, no SDK lock-in.

How to scrape a sitemap and crawl every page

Read a sitemap.xml (and nested sitemap indexes) and crawl the discovered URLs in Node.js using native fetch and fast-xml-parser, with a concurrency pool. No SDK.

How to parse and scrape RSS feeds in Node.js

Parse RSS 2.0 and Atom feeds into a single clean array of items in Node.js using rss-parser, with custom fields and broken-feed handling.

How to scrape lazy-loaded images

Scrape image URLs from img elements on a lazy-loading page with Puppeteer by scrolling so the IntersectionObserver fires, then reading src with a data-src fallback. Working code, handles placeholders and srcset.

Output & file formats

How to scrape and download files in Node.js

Download files from scraped URLs in Node.js using native fetch and stream pipeline, with content-disposition and mime-types for correct filenames.

How to scrape a page to a full-page screenshot in Puppeteer

Capture a full-page screenshot of any URL in Node.js with Puppeteer, including viewport tuning and a lazy-image warm-up so nothing renders blank.

How to scrape a page to PDF in Node.js

Render any web page to a paginated PDF in Node.js using Puppeteer's page.pdf() (which wraps the CDP Page.printToPDF command), with margins, backgrounds, and a page-numbered header and footer.

How to scrape a page to clean Markdown in Node.js

Convert any web page to clean, LLM-ready Markdown in Node.js using @mozilla/readability + turndown. Working code, no SDK lock-in.

How to scrape a table into CSV in JavaScript

Scrape an HTML table into a clean CSV file in Node.js using cheerio to parse the table and csv-stringify to escape every cell correctly.

Avoiding detection

How to bypass Cloudflare in Puppeteer

Get a Puppeteer scrape past a Cloudflare interstitial in Node.js using rebrowser-puppeteer, then detect the challenge and reuse the cf_clearance cookie. Working code, no SDK lock-in.

How to detect if you've been soft-blocked while scraping

Detect soft blocks in Node.js by combining the HTTP status code, the response body size, and challenge-page content markers, so you catch the 200-OK interstitials a status check alone misses.
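The three signals can be combined in a small pure function. This is a sketch under stated assumptions: the marker patterns, the status codes treated as suspicious, and the `minBodyBytes` threshold are all illustrative and should be calibrated against known-good responses from the target site.

```javascript
// Content markers are illustrative assumptions, not a complete list.
const CHALLENGE_MARKERS = [/just a moment/i, /captcha/i, /access denied/i, /unusual traffic/i];

// minBodyBytes is a hypothetical threshold; measure real pages before relying on it.
function detectSoftBlock({ status, body }, minBodyBytes = 2048) {
  const reasons = [];
  // Signal 1: status codes commonly used for blocking or throttling.
  if (status === 403 || status === 429 || status === 503) reasons.push(`status ${status}`);
  // Signal 2: a body far smaller than a normal page.
  const size = Buffer.byteLength(body, 'utf8');
  if (size < minBodyBytes) reasons.push(`body only ${size} bytes`);
  // Signal 3: challenge-page text inside an otherwise successful response.
  const marker = CHALLENGE_MARKERS.find((re) => re.test(body));
  if (marker) reasons.push(`marker ${marker}`);
  return { blocked: reasons.length > 0, reasons };
}
```

Returning the reasons alongside the boolean makes it easier to see which signal fired and to tune out false positives, such as a legitimate page that mentions a CAPTCHA.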

How to handle cookies and sessions when scraping in Node.js

Persist a login across requests when scraping in Node.js using a tough-cookie jar for fetch, plus saved cookies for Puppeteer and storageState save and restore for Playwright.

How to rotate user agents per request in Node.js

Rotate a pool of user agents in Node.js and send a matching header set per request using intoli/user-agents plus a small lookup, so the User-Agent does not contradict the other headers. Working code, no SDK lock-in.

How to scrape behind a proxy in Puppeteer

Route Puppeteer through an authenticated proxy with the --proxy-server launch flag and page.authenticate(), in plain Puppeteer with no Crawlee. Working code with a rotating pool.

How to solve a Turnstile / reCAPTCHA challenge programmatically

Solve a Cloudflare Turnstile or reCAPTCHA v2 challenge in Node.js by sending the sitekey to a solver service, polling for the token, and injecting it into the page's response field. Working code, no SDK lock-in.

How to spoof a realistic browser fingerprint in Playwright

Generate and inject a consistent browser fingerprint in Playwright with fingerprint-injector, with the viewport, locale, and timezone matched to the User-Agent.

How to add human-like delays and mouse movement in Puppeteer

Add curved mouse paths and randomized timing to Puppeteer with ghost-cursor, so authorized scraping does not trip naive rate or motion throttles. Working code, no SDK lock-in.

How to patch headless Chrome to avoid detection

Patch the leaks that mark a headless Chrome as automated in Node.js using rebrowser-puppeteer plus canvas and WebGL fingerprint overrides. Working code, no SDK lock-in.

Reliability & scaling

How to scrape concurrently with a promise pool in Node.js

Scrape many URLs at a fixed concurrency in plain Node.js using a Promise.race pool that keeps N fetches in flight and tops up as each finishes, in about 50 lines with nothing to install.
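One way such a pool could look is sketched below. The function name `promisePool` and the `{ ok, value | error }` result shape are assumptions for illustration; the `worker` callback stands in for whatever fetch-and-parse step a scraper would run per URL.

```javascript
// Runs worker(item, index) over items with at most `limit` promises in flight.
// Failures are recorded per item instead of aborting the whole run.
async function promisePool(items, limit, worker) {
  const results = new Array(items.length);
  const inFlight = new Set();
  for (const [i, item] of items.entries()) {
    const p = Promise.resolve()
      .then(() => worker(item, i))
      .then(
        (value) => { results[i] = { ok: true, value }; },
        (error) => { results[i] = { ok: false, error }; },
      )
      .finally(() => inFlight.delete(p)); // free the slot when this task settles
    inFlight.add(p);
    // Pool is full: wait for the first task to settle before topping up.
    if (inFlight.size >= limit) await Promise.race(inFlight);
  }
  await Promise.all(inFlight); // drain whatever is still running
  return results;
}
```

A call such as `promisePool(urls, 5, (url) => fetch(url).then((r) => r.text()))` would keep five requests in flight, with results returned in input order.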

How to deduplicate scraped records in JavaScript

Deduplicate scraped records in JavaScript with a Set and a hash of the normalized title and body as a secondary key, so the same item reached from different URLs collapses to one entry.

How to rate-limit requests with backoff in JavaScript

Cap your outgoing request rate and back off when a server returns 429, using a token-bucket limiter plus exponential backoff with full jitter, in about 70 lines of plain Node.js with no dependencies.
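The backoff half of that description might be sketched as follows; the token-bucket limiter is left out to keep the example short. The function names, default delays, and the injectable `fetchImpl`, `wait`, and `random` parameters are assumptions made for illustration.

```javascript
// Full jitter: wait a random time in [0, min(capMs, baseMs * 2^attempt)).
function fullJitterDelay(attempt, baseMs = 500, capMs = 30_000, random = Math.random) {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(random() * ceiling);
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Retries only on 429. A numeric Retry-After header (seconds) takes priority
// over the computed delay; the last response is returned once retries run out.
async function fetchWithBackoff(url, { maxRetries = 5, fetchImpl = fetch, wait = sleep } = {}) {
  for (let attempt = 0; ; attempt++) {
    const res = await fetchImpl(url);
    if (res.status !== 429 || attempt >= maxRetries) return res;
    const retryAfter = Number(res.headers.get('retry-after'));
    const delay = Number.isFinite(retryAfter) && retryAfter > 0
      ? retryAfter * 1000
      : fullJitterDelay(attempt);
    await wait(delay);
  }
}
```

Full jitter spreads retries across the whole window, so many clients that were throttled at the same moment do not all retry at the same moment.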

How to resume a scrape after a crash

Make a long scrape resumable in Node.js by checkpointing each URL's status to SQLite with better-sqlite3, so a crash restarts from the last finished row instead of the top.

How to retry failed scrapes with exponential backoff

Retry transient scrape failures in Node.js with exponential backoff and jitter using p-retry, retrying only the errors worth retrying. Working code, no SDK lock-in.

How to run a scraper on a schedule with Node.js

Run a Node.js scraper on a cron schedule with node-cron, with noOverlap and a lockfile so two runs never collide. Working code, no service to host.
