Simplescraper

How to scrape a page to clean Markdown in Node.js

Updated 2026-06-18 · 5 min read

If you've tried to feed a web page to an LLM or a search index, you have probably watched most of what you send turn out to be noise: the nav bar, a cookie banner, ad slots, share buttons, a few analytics scripts. The article you actually wanted is a small slice of the bytes, buried in markup the model then has to read around.

Stripping all of that away is a solved problem. The solution is to run the page through a readability pass that keeps only the article body, and then convert that clean HTML to Markdown. It takes about 25 lines of Node.js with three open-source libraries.

Key terms

  • Readability. Mozilla's article-extraction algorithm, the engine behind Firefox Reader View, which scores DOM nodes and returns only the article body.
  • Turndown. A library that converts an HTML DOM into Markdown, configurable for heading and code-block style.
  • JSDOM. A Node implementation of the browser DOM, so server-side code can parse and query HTML the way a browser does.
  • User-Agent. The header a client sends to identify itself, which many servers check before deciding whether to return the full page or block the request.

Here is what the script does:

  • Fetch the page's HTML with a normal browser User-Agent, so the server returns the full page instead of a bot-blocked stub.
  • Strip the navigation, ads, and boilerplate with Mozilla's Readability, the engine behind Firefox Reader View.
  • Convert the cleaned-up article body to Markdown with Turndown.
  • Check the result and handle the common edge cases: JavaScript-rendered pages, tables, and fenced code blocks.

The complete script

js
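// scrape-to-markdown.mjs (needs Node 18+ for the global fetch)
import { Readability } from '@mozilla/readability'
import { JSDOM } from 'jsdom'
import TurndownService from 'turndown'

const url = 'https://en.wikipedia.org/wiki/Web_scraping'

// Fetch the raw HTML with a browser-style User-Agent
const res = await fetch(url, {
  headers: { 'User-Agent': 'Mozilla/5.0' }
})
if (!res.ok) throw new Error(`Fetch failed: HTTP ${res.status}`)
const html = await res.text()

// Build a DOM; the url option lets relative links resolve
const dom = new JSDOM(html, { url })

// parse() returns null when no article body is found
const article = new Readability(dom.window.document).parse()
if (!article) throw new Error(`No article content found at ${url}`)

// Configure the HTML-to-Markdown converter once
const turndown = new TurndownService({
  headingStyle: 'atx',
  codeBlockStyle: 'fenced'
})

console.log(article.title)
console.log(turndown.turndown(article.content))

bash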
npm install @mozilla/readability jsdom turndown
node scrape-to-markdown.mjs

What each step does

Set a normal browser User-Agent. A bare fetch() from Node sends node as its User-Agent, and plenty of sites answer that with a 403. Sending a browser-style Mozilla string fixes most of them. This is politeness, not stealth: sites that actually want to block bots check far more than a UA string.
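
A minimal sketch of the fetch step wrapped in a helper, assuming Node 18 or later for the global fetch. The names BROWSER_UA, requestOptions, and fetchHtml are hypothetical, and the User-Agent value is only an example of a full browser-style string. The helper also rejects non-2xx responses, so a 403 block page is not passed along as if it were the article.

```javascript
// Example browser-style User-Agent (an assumption; substitute any current browser string).
const BROWSER_UA =
  'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36'

// Build the options object passed to fetch().
function requestOptions(userAgent = BROWSER_UA) {
  return { headers: { 'User-Agent': userAgent } }
}

// Fetch a page and return its HTML, throwing on HTTP errors such as 403.
async function fetchHtml(url, options = requestOptions()) {
  const res = await fetch(url, options)
  if (!res.ok) {
    throw new Error(`Request to ${url} failed with HTTP ${res.status}`)
  }
  return res.text()
}
```

In the script, const html = await fetchHtml(url) would stand in for the inline fetch call.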

Parse with JSDOM, pass the URL. Readability needs a DOM, and it needs the page's URL so relative links resolve. The { url } option is not optional. Drop it and the internal links in your Markdown point nowhere.
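
To see what a base URL buys you, here is a small sketch of link resolution using Node's built-in URL class, with no JSDOM involved. It illustrates the standard rule for resolving an href against a base URL, not Readability's own code, and resolveHref is a hypothetical helper name.

```javascript
// Resolve an href against the page URL, the way a browser would.
function resolveHref(href, pageUrl) {
  return new URL(href, pageUrl).href
}

const pageUrl = 'https://en.wikipedia.org/wiki/Web_scraping'

// Root-relative link: resolved against the page's origin.
console.log(resolveHref('/wiki/Data_scraping', pageUrl))
// → https://en.wikipedia.org/wiki/Data_scraping

// Fragment-only link: stays on the same page.
console.log(resolveHref('#History', pageUrl))
// → https://en.wikipedia.org/wiki/Web_scraping#History

// Without a base there is nothing to resolve against, so the constructor throws.
try {
  new URL('/wiki/Data_scraping')
} catch (err) {
  console.log(err instanceof TypeError) // → true
}
```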

Extract with Readability. new Readability(doc).parse() returns { title, byline, excerpt, content, ... }. The content field is the cleaned-up HTML of the article body. For listing pages, splash pages, and paywalls, parse() returns null instead, so check the result before you use it.
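
A sketch of that check, written as a small guard so it can be reused across pages. requireArticle is a hypothetical helper. The only Readability behavior it relies on is that parse() returns null when no article is found. The objects below are stand-ins for parse() results, so the sketch runs without fetching anything.

```javascript
// Guard the result of `new Readability(doc).parse()`.
// Returns the article when extraction worked; throws a descriptive error when it did not.
function requireArticle(article, url) {
  if (article === null || article === undefined) {
    throw new Error(`Readability found no article content at ${url}`)
  }
  return article
}

// Stand-in for a successful parse() result.
const parsed = { title: 'Web scraping', content: '<p>Body</p>' }
console.log(requireArticle(parsed, 'https://example.com/post').title) // → Web scraping

// Stand-in for a listing page or paywall, where parse() returns null.
try {
  requireArticle(null, 'https://example.com/listing')
} catch (err) {
  console.log(err.message)
  // → Readability found no article content at https://example.com/listing
}
```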

Configure Turndown once. The defaults give you indented code blocks and setext headings, which underline h1 and h2 and fall back to # prefixes from h3 down, so the output mixes two heading styles. Pass headingStyle: 'atx' and codeBlockStyle: 'fenced' at construction time, and reuse the instance for every page you scrape.
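
One way to keep a single converter across a batch is to pass it into the function that does the conversion. The sketch below assumes convertAll as a hypothetical helper that accepts anything with a turndown(html) method. In real use that would be the configured TurndownService instance. The stub here is not Turndown and only exists so the example runs without the package installed.

```javascript
// Convert many extracted articles with one shared converter.
function convertAll(articles, converter) {
  return articles.map(({ title, content }) => ({
    title,
    markdown: converter.turndown(content)
  }))
}

// Stub converter (NOT Turndown): strips <p> tags and counts how often it is called.
const stub = {
  calls: 0,
  turndown(html) {
    this.calls += 1
    return html.replace(/<\/?p>/g, '').trim()
  }
}

const pages = [
  { title: 'One', content: '<p>First body</p>' },
  { title: 'Two', content: '<p>Second body</p>' }
]

const results = convertAll(pages, stub)
console.log(results[1].markdown) // → Second body
console.log(stub.calls)          // → 2
```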

Gotchas

  • JavaScript-rendered sites return a stub.

    • Issue: fetch only sees the server's initial HTML, so React, Vue, and Next-with-client-data hand back an empty shell with none of the article in it.
    • Fix: render with Puppeteer or Playwright first, then pass page.content() to JSDOM. See How to scrape a JavaScript-rendered page in Node.js.
  • Tables get dropped.

    • Issue: Turndown's core rules have no handler for <table>, so tables lose their row and column structure and the cell text comes out as loose paragraphs.
    • Fix: install turndown-plugin-gfm, import its tables plugin, and call turndown.use(tables).
  • Code blocks lose their language hint.

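    • Issue: <pre><code class="language-js"> comes out as a plain fenced block, without the js that drives syntax highlighting downstream. Turndown's fenced-code rule does read a language-X class on the <code> element, but Readability strips class attributes by default, so the hint is gone before Turndown runs.
    • Fix: construct Readability with { keepClasses: true } so the class survives. For sites that mark the language some other way, register a custom Turndown rule that reads the hint and emits ```X instead of a bare ```.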
  • JSDOM is slow on big batches.

    • Issue: JSDOM parses at 100-300ms per heavy page, which dominates the runtime once you are processing thousands of pages.
    • Fix: swap JSDOM for linkedom, which is about 5x faster and mostly drop-in.
  • Wiki-style pages bring the table of contents along.

    • Issue: Readability keeps the inline table of contents on Wikipedia and similar pages, so it lands in your Markdown as a list of section links.
    • Fix: strip it from the JSDOM document before handing that document to Readability: dom.window.document.querySelectorAll('.toc, .vector-toc, #toc').forEach(el => el.remove()).
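
For the code-block gotcha, a custom rule could look roughly like the sketch below. The function names are hypothetical, and the lang- prefix is an assumed variant alongside language-. The filter and replacement are written as plain functions over the DOM properties they need (nodeName, firstChild, getAttribute, textContent), so they can be tried without Turndown installed.

```javascript
// Build the fence from single backticks so this listing stays readable.
const FENCE = '`'.repeat(3)

// Pull a language name out of a class attribute such as "language-js" or "hljs lang-python".
function languageFromClass(className) {
  const match = (className || '').match(/(?:^|\s)lang(?:uage)?-([\w+#-]+)/)
  return match ? match[1] : ''
}

// Rule filter: match <pre> elements whose first child is a <code> element.
function isCodeBlock(node) {
  return node.nodeName === 'PRE' && !!node.firstChild && node.firstChild.nodeName === 'CODE'
}

// Rule replacement: emit a fenced block, labelled with the language when one is found.
function fencedWithLanguage(_content, node) {
  const code = node.firstChild
  const lang = languageFromClass(code.getAttribute('class'))
  const text = code.textContent.replace(/\n$/, '')
  return `\n\n${FENCE}${lang}\n${text}\n${FENCE}\n\n`
}

console.log(languageFromClass('language-js'))      // → js
console.log(languageFromClass('hljs lang-python')) // → python
console.log(languageFromClass('no-hint') === '')   // → true
```

With a TurndownService instance, the rule would be registered as turndown.addRule('fencedWithLanguage', { filter: isCodeBlock, replacement: fencedWithLanguage }).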

Use this when

You want the main article body of a page as Markdown - for an LLM context window, a RAG index, a content-syndication pipeline, or a personal read-later workflow. One document of human text per page.

Skip this when

You need every link on a listing page (use a sitemap walker); the page is rendered client-side as a single-page app (render first with Puppeteer or Playwright); the content is behind a login (handle auth first); you need structured fields rather than article text (use schema extraction).
