How to scrape a page to clean Markdown in Node.js
If you've tried to feed a web page to an LLM or a search index, you have probably watched most of what you send turn out to be noise: the nav bar, a cookie banner, ad slots, share buttons, a few analytics scripts. The article you actually wanted is a small slice of the bytes, buried in markup the model then has to read around.
Stripping all of that away is a solved problem. The solution is to run the page through a readability pass that keeps only the article body, and then convert that clean HTML to Markdown. It takes about 25 lines of Node.js with two open-source libraries.
Key terms
- Readability. Mozilla's article-extraction algorithm, the engine behind Firefox Reader View, which scores DOM nodes and returns only the article body.
- Turndown. A library that converts an HTML DOM into Markdown, configurable for heading and code-block style.
- JSDOM. A Node implementation of the browser DOM, so server-side code can parse and query HTML the way a browser does.
- User-Agent. The header a client sends to identify itself, which many servers check before deciding whether to return the full page or block the request.
Here is what the script does:
- Fetch the page's HTML with a normal browser User-Agent, so the server returns the full page instead of a bot-blocked stub.
- Strip the navigation, ads, and boilerplate with Mozilla's Readability, the engine behind Firefox Reader View.
- Convert the cleaned-up article body to Markdown with Turndown.
- Check the result and handle the common edge cases: JavaScript-rendered pages, tables, and fenced code blocks.
The complete script
// scrape-to-markdown.mjs
import { Readability } from '@mozilla/readability'
import { JSDOM } from 'jsdom'
import TurndownService from 'turndown'

const url = 'https://en.wikipedia.org/wiki/Web_scraping'

// Send a browser-style User-Agent; Node's default is often blocked.
const html = await fetch(url, {
  headers: { 'User-Agent': 'Mozilla/5.0' }
}).then(r => r.text())

// Pass the page URL so relative links can be resolved.
const dom = new JSDOM(html, { url })
const article = new Readability(dom.window.document).parse()

// parse() returns null when no article body is found.
if (!article) {
  console.error(`No article content found at ${url}`)
  process.exit(1)
}

const turndown = new TurndownService({
  headingStyle: 'atx',
  codeBlockStyle: 'fenced'
})

console.log(article.title)
console.log(turndown.turndown(article.content))

Install the dependencies and run it:

npm install @mozilla/readability jsdom turndown
node scrape-to-markdown.mjs

What each step does
Set a normal browser User-Agent. A bare fetch() from Node sends node as its User-Agent. Plenty of sites 403 on that. Pasting a normal Mozilla string fixes most of them. This is politeness, not stealth - sites that actually want to block bots block harder than a UA string.
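A blocked request still comes back with a body, so it can help to check the status before parsing anything. The sketch below wraps the fetch step in a hypothetical fetchHtml helper; the name, the fetchImpl parameter, and the error message are illustrative assumptions, not part of the original script. It sends the same User-Agent and throws on a non-2xx response such as a 403.

```javascript
// Hypothetical helper: fetch a page with a browser-style User-Agent and
// fail loudly on error statuses instead of parsing a block page.
// fetchImpl defaults to Node's built-in fetch (Node 18+); it is a parameter
// only so the function can be exercised without network access.
async function fetchHtml(url, fetchImpl = fetch) {
  const response = await fetchImpl(url, {
    headers: { 'User-Agent': 'Mozilla/5.0' }
  })
  if (!response.ok) {
    throw new Error(`Request for ${url} failed with status ${response.status}`)
  }
  return response.text()
}
```

In the script, const html = await fetchHtml(url) would replace the bare fetch call.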
Parse with JSDOM, pass the URL. Readability needs a DOM, and it needs the page's URL so relative links resolve. The { url } option is not optional. Drop it and the internal links in your Markdown point nowhere.
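The effect of the base URL can be seen with the standard URL constructor, which follows the same resolution rules a DOM applies to links. This only illustrates the rule and is not Readability's internal code; the relative paths are made-up examples.

```javascript
// Resolve relative hrefs against the page URL, as a DOM with a known
// base URL would.
const pageUrl = 'https://en.wikipedia.org/wiki/Web_scraping'

const absolute = new URL('/wiki/Data_scraping', pageUrl).href
// absolute is 'https://en.wikipedia.org/wiki/Data_scraping'

const fragment = new URL('#History', pageUrl).href
// fragment is 'https://en.wikipedia.org/wiki/Web_scraping#History'

// Without a base there is nothing to resolve against.
let unresolved
try {
  unresolved = new URL('/wiki/Data_scraping').href
} catch (err) {
  unresolved = null // the URL constructor throws a TypeError here
}

console.log({ absolute, fragment, unresolved })
```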
Extract with Readability. new Readability(doc).parse() returns { title, byline, excerpt, content, ... }. The content field is cleaned-up HTML of the article body. Listing pages, splash pages, and paywalls return null - check before you use it.
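One way to make the null check hard to forget is to route every parse result through a small guard. The articleToMarkdown name is hypothetical, and convert stands in for whatever turns HTML into Markdown (for example html => turndown.turndown(html)); both are assumptions made for illustration.

```javascript
// Hypothetical guard around the result of Readability's parse().
// Returns null when nothing was extracted, otherwise the title and Markdown.
function articleToMarkdown(article, convert) {
  if (!article || !article.content) {
    return null // listing page, splash page, paywall, or similar
  }
  return {
    title: article.title || '',
    markdown: convert(article.content)
  }
}
```

A caller would pass in the parse() result and skip the page whenever the helper returns null.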
Configure Turndown once. Defaults give you indented code blocks and a setext-and-atx heading mix. Pass headingStyle: 'atx' and codeBlockStyle: 'fenced' at construction time. Reuse the instance for every page you scrape.
Gotchas
JavaScript-rendered sites return a stub.
- Issue: fetch only sees the server's initial HTML, so React, Vue, and Next-with-client-data hand back an empty shell with none of the article in it.
- Fix: render with Puppeteer or Playwright first, then pass page.content() to JSDOM. See How to scrape a JavaScript-rendered page in Node.js.
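A rendering step might look like the sketch below. It assumes a Puppeteer-style API (launch(), newPage(), goto(), content(), close()); the renderHtml name and the injected launchBrowser parameter are illustrative, chosen so the function does not hard-code which browser library is installed. The 'networkidle0' wait condition is Puppeteer's; Playwright's equivalent is 'networkidle'.

```javascript
// Hypothetical helper: load a page in a headless browser and return the
// HTML after client-side scripts have run.
// launchBrowser is expected to behave like () => puppeteer.launch().
async function renderHtml(url, launchBrowser, waitUntil = 'networkidle0') {
  const browser = await launchBrowser()
  try {
    const page = await browser.newPage()
    await page.goto(url, { waitUntil })
    return await page.content() // serialized DOM after rendering
  } finally {
    await browser.close() // always release the browser process
  }
}
```

With Puppeteer installed, const html = await renderHtml(url, () => puppeteer.launch()) would produce the string to hand to new JSDOM(html, { url }).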
Tables get dropped.
- Issue: Turndown's core rule set has no rule for <table>, so the table structure is lost and the cell text comes through as loose paragraphs.
- Fix: install turndown-plugin-gfm and call turndown.use(tables) with the plugin's tables export, or turndown.use(gfm) to enable all of its GFM rules.
Code blocks lose their language hint.
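- Issue: <pre><code class="language-js"> comes out as a plain fenced block, dropping the js that drives syntax highlighting downstream. Turndown's fenced-code rule does read a language-X class, but Readability strips class attributes by default, so the class is usually gone before Turndown sees it.
- Fix: pass { keepClasses: true } to Readability so the class survives. For sites that mark the language some other way, register a custom rule that reads the class and emits ```X instead of a bare ```.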
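A custom rule of that kind could be sketched as follows. It assumes the class attributes are still present when Turndown runs (with Readability, that means passing keepClasses: true). The rule name and the accepted class prefixes (language-X and lang-X, on either <pre> or <code>) are illustrative choices. It makes no attempt to lengthen the fence when the code itself contains a run of three backticks.

```javascript
// Hypothetical Turndown rule: emit fenced blocks with a language tag taken
// from a "language-X" or "lang-X" class on <pre> or its <code> child.
const FENCE = '`'.repeat(3)

const fencedCodeWithLanguage = {
  // Match <pre> elements whose first child is <code>.
  filter: (node) =>
    node.nodeName === 'PRE' &&
    node.firstChild !== null &&
    node.firstChild.nodeName === 'CODE',

  replacement: (content, node) => {
    const code = node.firstChild
    const classes = [node.getAttribute('class'), code.getAttribute('class')]
      .filter(Boolean)
      .join(' ')
    const match = classes.match(/(?:^|\s)lang(?:uage)?-(\S+)/)
    const language = match ? match[1] : ''
    // Use the raw text so code is not Markdown-escaped; drop one trailing newline.
    const body = code.textContent.replace(/\n$/, '')
    return `\n\n${FENCE}${language}\n${body}\n${FENCE}\n\n`
  }
}

// Registration on an existing instance would be:
// turndown.addRule('fencedCodeWithLanguage', fencedCodeWithLanguage)
```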
JSDOM is slow on big batches.
- Issue: JSDOM parses at 100-300ms per heavy page, which dominates the runtime once you are processing thousands of pages.
- Fix: swap JSDOM for linkedom, which is about 5x faster and mostly drop-in.
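Before swapping parsers it is worth measuring on your own pages, since timings like these vary with page size and hardware. The timeParse helper below is a hypothetical sketch; parse stands in for any function that builds a DOM from an HTML string, such as html => new JSDOM(html).

```javascript
// Hypothetical benchmark helper: average wall-clock milliseconds that
// parse(html) takes over a number of runs.
function timeParse(parse, html, runs = 5) {
  const start = performance.now()
  for (let i = 0; i < runs; i++) {
    parse(html)
  }
  return (performance.now() - start) / runs
}
```

Running it once per parser on the same saved HTML gives a like-for-like comparison for your workload.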
Wiki-style pages bring the table of contents along.
- Issue: Readability keeps the inline table of contents on Wikipedia and similar pages, so it lands in your Markdown as a list of section links.
- Fix: strip it from the document before handing it to Readability, where doc is dom.window.document: doc.querySelectorAll('.toc, .vector-toc, #toc').forEach(el => el.remove()).
Use this when
You want the main article body of a page as Markdown - for an LLM context window, a RAG index, a content-syndication pipeline, or a personal read-later workflow. One document of human text per page.
Skip this when
You need every link on a listing page (use a sitemap walker); the page is single-page-app rendered (render first with Puppeteer); the content is behind a login (handle auth first); you need structured fields rather than article text (use schema extraction).