How to extract structured JSON from messy HTML in Node.js
If you searched for parsing JSON in Node.js and landed on JSON.parse(), that is the wrong tool for what you have. You are not holding a JSON file. You are holding a product page where the price sits inside a <span class="price-now">$49.99</span>, the rating is a data-rating attribute, and the title is in an <h1> two divs deep. The data is real, but it is wrapped in display markup that changes between templates and carries currency symbols, whitespace, and stray nodes.
The fix is to treat the page as a tree, point a CSS selector at each value you want, and coerce that text into the typed field it belongs in. You write one extraction map from selector to field, and you get back a plain JavaScript object you can JSON.stringify(). It takes about 40 lines of Node.js with cheerio, the jQuery-style server-side HTML parser.
Key terms
- cheerio. A server-side HTML parser with a jQuery-style API, so you can write
$('.price').text()against a string of HTML without a browser. - Selector map. A small object that pairs each output field with the CSS selector and a coercion function, keeping the extraction logic in one editable place.
- Coercion. Turning a selector's raw text into a typed value: trimming whitespace, stripping a currency symbol, and calling
Number()sopriceis a number rather than the string"$49.99". - JSON-LD. A
<script type="application/ld+json">block many sites embed for search engines, holding the same fields already structured, which you read directly when it is present.
Here is what the script does:
- Fetch the page HTML with a stock desktop browser User-Agent so the server returns the full markup.
- Load that HTML into cheerio so each field can be addressed with a CSS selector.
- Walk a selector map that pairs every output field with its selector and a coercion function, building one typed object.
- Read the page's JSON-LD block when it exists, and fill any field the selectors missed from that structured data.
The complete script
// extract-json.mjs
import * as cheerio from 'cheerio'
const url = 'https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html'
const html = await fetch(url, {
headers: { 'User-Agent': 'Mozilla/5.0' }
}).then(r => r.text())
const $ = cheerio.load(html)
/* Each field names its selector and a coercion that turns raw text into a typed value.
Keep the map here so adapting to a new page layout is a one-line edit per field. */
const fieldMap = {
title: {
selector: 'article.product_page h1',
coerce: (text) => text.trim() || null
},
price: {
selector: 'article.product_page p.price_color',
coerce: (text) => {
const digits = text.replace(/[^0-9.]/g, '')
return digits ? Number(digits) : null
}
},
inStock: {
selector: 'article.product_page p.availability',
coerce: (text) => /in stock/i.test(text)
},
description: {
selector: 'article.product_page > p',
coerce: (text) => text.trim() || null
}
}
const record = {}
for (const [field, { selector, coerce }] of Object.entries(fieldMap)) {
/* .first() guards against a selector matching several nodes and concatenating their text. */
const raw = $(selector).first().text()
record[field] = coerce(raw)
}
/* Fallback: many sites ship the same fields as JSON-LD for search engines.
Read it and fill any field the selectors left null. */
const ldText = $('script[type="application/ld+json"]').first().text()
if (ldText) {
try {
const ld = JSON.parse(ldText)
if (record.title === null && typeof ld.name === 'string') record.title = ld.name
if (record.price === null && ld.offers?.price != null) record.price = Number(ld.offers.price)
} catch {
/* Malformed JSON-LD is common; ignore it and keep the selector results. */
}
}
console.log(JSON.stringify(record, null, 2))npm install cheerio
node extract-json.mjsWhat each step does
Set a stock desktop browser User-Agent. A bare fetch() from Node sends node as its User-Agent, and some servers return a blocked stub for that. A plain Mozilla/5.0 string gets the full page from most public sites. This is politeness, not evasion; a site that hardens against bots needs more than a header.
Load the HTML into cheerio. cheerio.load(html) parses the string into a queryable tree and returns a $ function that takes CSS selectors. From here you address any value on the page the same way you would in a browser console, without launching one.
Drive the extraction from a selector map. Each field in fieldMap carries its own selector and a coerce function. The loop reads $(selector).first().text() and passes the raw string through coerce, so price becomes the number 51.77, inStock becomes a boolean, and an empty match becomes null rather than an empty string. Adapting to a different page layout means editing one selector, not rewriting the loop.
Fall back to JSON-LD. Many product, article, and recipe pages embed a <script type="application/ld+json"> block with the same fields already structured for search engines. The script reads it, parses it inside a try, and fills any field the selectors left null. When the visible markup is messy but the page ships clean structured data, this recovers the value.
Gotchas
A selector matches more than one node and concatenates their text.
- Issue:
$('p.price_color').text()returns the joined text of every matching node, so a page with a list price and a sale price hands you"£51.77£45.00"andNumber()producesNaN. - Fix: scope to one node with
$(selector).first(), as the script does, or tighten the selector to the container you mean, for examplearticle.product_page p.price_color.
- Issue:
The number keeps its currency symbol and separators.
- Issue:
Number('$1,299.00')returnsNaNbecause the dollar sign and comma are not numeric characters, so the price lands asnullor breaks downstream math. - Fix: strip non-numeric characters before converting,
text.replace(/[^0-9.]/g, ''), thenNumber(). Watch locales where the comma is the decimal mark.
- Issue:
A missing element returns an empty string, not an error.
- Issue: cheerio's
.text()on a selector that matches nothing returns'', so a renamed class silently yields a blank field and you do not notice until the data is wrong. - Fix: coerce
''tonullin each field,text.trim() || null, so a missing value is visible in the JSON instead of an empty string masquerading as data.
- Issue: cheerio's
The page is rendered by JavaScript and the HTML is a shell.
- Issue:
fetchreturns only the server's initial HTML, so a React, Vue, or Next-with-client-data page hands back an empty container and every selector matches nothing. - Fix: render with Puppeteer or Playwright first, then pass
page.content()tocheerio.load(). See How to scrape a JavaScript-rendered page in Node.js.
- Issue:
JSON-LD is malformed or split across several blocks.
- Issue: a stray trailing comma or an HTML comment inside the script tag makes
JSON.parse()throw, and some pages ship multiple JSON-LD blocks where the one you want is not the first. - Fix: wrap the parse in
try/catchas the script does, and when a page has several blocks, iterate$('script[type="application/ld+json"]')and match on the@typeyou need rather than taking.first().
- Issue: a stray trailing comma or an HTML comment inside the script tag makes
You need the same map to run against many pages of one site.
- Issue: product templates drift between categories, so a selector that works on one page returns
nullon another with a slightly different layout. - Fix: give a field two candidate selectors and take the first non-empty result, and log which pages produced a
nullso you can spot template variants before they corrupt the dataset.
- Issue: product templates drift between categories, so a selector that works on one page returns
Use this when
You have one or more pages of display HTML and you want named, typed fields out of them: a price as a number, a stock flag as a boolean, a title as a trimmed string, ready to write to a database or feed an API.
Skip this when
The page is single-page-app rendered, so render it first with Puppeteer; the source already serves a JSON API you can hit directly, so call that instead of parsing display HTML; you want the article body as prose rather than discrete fields, so use a readability pass to clean Markdown; the rows are tabular, so a table-to-CSV walk fits better than a per-field map.