How to scrape a page to clean Markdown in Node.js

Updated 2026-06-18 · 5 min read

If you've tried to feed a web page to an LLM or a search index, you have probably watched most of what you send turn out to be noise: the nav bar, a cookie banner, ad slots, share buttons, a few analytics scripts. The article you actually wanted is a small slice of the bytes, buried in markup the model then has to read around.

Stripping all of that away is a solved problem: run the page through a readability pass that keeps only the article body, then convert that clean HTML to Markdown. We'll build a small script that does three things. It fetches the page with a browser-style User-Agent header, so the server is more likely to return the real article instead of a bot stub. It runs the HTML through Mozilla's Readability to drop the nav, ads, and boilerplate. And it hands what's left to Turndown for the Markdown. Along the way we'll take a little care with the edge cases that bite in practice: JavaScript-rendered pages, tables, and code blocks. It comes out to about 25 lines and three open-source libraries.

The complete script

// scrape-to-markdown.mjs
import { Readability } from '@mozilla/readability'
import { JSDOM } from 'jsdom'
import TurndownService from 'turndown'

const url = 'https://en.wikipedia.org/wiki/Web_scraping'

const html = await fetch(url, {
  headers: { 'User-Agent': 'Mozilla/5.0' }
}).then(r => r.text())

const dom = new JSDOM(html, { url })
const article = new Readability(dom.window.document).parse()

const turndown = new TurndownService({
  headingStyle: 'atx',
  codeBlockStyle: 'fenced'
})

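// parse() returns null when Readability finds no article body
if (!article) {
  console.error(`No article found at ${url}`)
  process.exit(1)
}

console.log(article.title)
console.log(turndown.turndown(article.content))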

Install the three packages and run the script:

npm install @mozilla/readability jsdom turndown
node scrape-to-markdown.mjs

How it works

Set a normal browser User-Agent. A bare fetch() from Node sends node as its User-Agent, and plenty of sites 403 on that, so pasting a normal Mozilla string fixes most of them. This is politeness, not stealth - sites that actually want to block bots block harder than a UA string.
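
A slightly more defensive version of the fetch step might look like the sketch below. The helper name fetchHtml and its injectable fetchImpl parameter are assumptions made for illustration, not part of the script above; the only additions are a status check, so a 403 page never reaches the parser, and a way to swap in a stub for offline testing.

```javascript
// Hypothetical helper: fetch a page with a browser-style User-Agent and
// fail loudly on non-2xx responses instead of parsing an error page.
// fetchImpl is injectable only so the helper can be exercised offline.
async function fetchHtml(url, fetchImpl = fetch) {
  const response = await fetchImpl(url, {
    headers: { 'User-Agent': 'Mozilla/5.0' }
  })
  if (!response.ok) {
    throw new Error(`Request for ${url} failed with HTTP ${response.status}`)
  }
  return response.text()
}
```

In the script, const html = await fetchHtml(url) would replace the bare fetch call.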

Parse with JSDOM, and pass the URL. Readability needs a DOM, and it needs the page's URL so relative links resolve, so the { url } option is not optional - drop it and the internal links in your Markdown point nowhere. JSDOM takes roughly 100-300 ms to parse a heavy page, so once you are processing thousands of pages, consider swapping it for linkedom, which is about 5x faster and mostly drop-in.
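
To see what the URL buys you, here is a simplified sketch of relative-link resolution using Node's built-in URL class. It is an illustration of the idea, not Readability's own code, and toAbsolute is a hypothetical helper name. The about:blank case stands in for a JSDOM document created without the url option, since that is JSDOM's default document URL.

```javascript
// Simplified sketch of relative-link resolution, not Readability's own code.
// toAbsolute is a hypothetical helper name.
function toAbsolute(href, baseUrl) {
  try {
    return new URL(href, baseUrl).href
  } catch {
    // No usable base: the link stays relative and points nowhere useful.
    return href
  }
}

const pageUrl = 'https://en.wikipedia.org/wiki/Web_scraping'

// With the page URL as the base, a relative link becomes absolute.
console.log(toAbsolute('/wiki/Data_scraping', pageUrl))

// about:blank cannot serve as a base for a path, so the link is left as-is.
console.log(toAbsolute('/wiki/Data_scraping', 'about:blank'))
```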

Extract with Readability. new Readability(doc).parse() returns { title, byline, excerpt, content, ... }, where doc is dom.window.document and content is the cleaned-up HTML of the article body. When Readability finds no article, which is common on listing pages, splash pages, and paywalls, parse() returns null, so check before you use the result. One failure mode dominates: fetch only sees the server's initial HTML, so a React, Vue, or client-rendered Next.js page hands back an empty shell. For those, render the page with Puppeteer or Playwright first and pass page.content() to JSDOM. On Wikipedia-style pages, Readability keeps the inline table of contents, so strip it before parsing with doc.querySelectorAll('.toc, .vector-toc, #toc').forEach(el => el.remove()).
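
One way to decide when the headless-browser detour is needed is to measure how much visible text the fetched HTML actually carries. The sketch below is a crude heuristic, not something Readability provides: the function names and the 200-character threshold are assumptions, and regex-based tag stripping is only good enough for a rough count.

```javascript
// Crude, hypothetical heuristic: does this HTML carry enough visible text
// to be worth parsing, or is it a client-rendered shell?
function visibleTextLength(html) {
  return html
    .replace(/<(script|style|noscript)\b[^>]*>[\s\S]*?<\/\1>/gi, ' ') // drop non-visible blocks
    .replace(/<[^>]+>/g, ' ') // drop the remaining tags
    .replace(/\s+/g, ' ')
    .trim()
    .length
}

// The 200-character threshold is an assumption; tune it on your own pages.
function looksLikeEmptyShell(html, minChars = 200) {
  return visibleTextLength(html) < minChars
}

const emptyShell = '<html><body><div id="root"></div><script src="/app.js"></script></body></html>'
const serverRendered = `<html><body><article><p>${'Real sentence. '.repeat(30)}</p></article></body></html>`

console.log(looksLikeEmptyShell(emptyShell))
console.log(looksLikeEmptyShell(serverRendered))
```

When the check comes back true, that is the cue to render the page with Puppeteer or Playwright and hand page.content() to JSDOM instead.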

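Configure Turndown once. Defaults give you indented code blocks and a setext-and-atx heading mix, so pass headingStyle: 'atx' and codeBlockStyle: 'fenced' at construction time and reuse the instance for every page. Two conversions need help. First, Turndown's core rule set has no rule for <table>, so install turndown-plugin-gfm, import its tables plugin, and call turndown.use(tables) to keep tabular content. Second, the language hint on code blocks tends to get lost. Turndown's fenced rule only reads a language-X class from a <code> that is the first child of a <pre>, and Readability strips class attributes unless you pass it keepClasses: true. So keep the classes, and for markup that puts the class somewhere else, register a custom rule that reads it and emits a language-tagged fence when you need the highlight hint downstream.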
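
A custom rule of the kind described above could look like the following sketch. It assumes the language-X class sits either on the <pre> or on its <code> child; the names languageFromClass and languageFenceRule are hypothetical. The object follows Turndown's rule shape (a filter plus a replacement function), so it can be defined and tested without importing the library.

```javascript
// Hypothetical Turndown rule: emit a fenced block tagged with the language
// found in a language-X class on the <pre> or on its <code> child.
function languageFromClass(className) {
  const match = /(?:^|\s)language-(\S+)/.exec(className || '')
  return match ? match[1] : ''
}

const languageFenceRule = {
  // Only handle <pre> elements whose first child is a <code> element.
  filter: node =>
    node.nodeName === 'PRE' &&
    node.firstChild !== null &&
    node.firstChild.nodeName === 'CODE',
  replacement: (content, node) => {
    const code = node.firstChild
    const language =
      languageFromClass(code.getAttribute('class')) ||
      languageFromClass(node.getAttribute('class'))
    const body = code.textContent.replace(/\n$/, '')
    // Make the fence longer than any backtick run inside the code itself.
    const longestRun = Math.max(0, ...(body.match(/`+/g) || []).map(run => run.length))
    const fence = '`'.repeat(Math.max(3, longestRun + 1))
    return `\n\n${fence}${language}\n${body}\n${fence}\n\n`
  }
}
```

Registering it is one call on the existing instance: turndown.addRule('languageFence', languageFenceRule). Rules added this way take precedence over Turndown's built-in ones.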

Use this when

You want the main article body of a page as Markdown - for an LLM context window, a RAG index, a content-syndication pipeline, or a personal read-later workflow. One document of human text per page.

Skip this when

You need every link on a listing page (use a sitemap walker); the page is single-page-app rendered (render first with Puppeteer); the content is behind a login (handle auth first); you need structured fields rather than article text (use schema extraction).
