How to extract structured JSON from messy HTML in Node.js

Updated 2026-06-25 · 5 min read

If you searched for parsing JSON in Node.js and landed on JSON.parse(), that is the wrong tool for what you have. You are not holding a JSON file. You are holding a product page where the price sits inside a <span class="price-now">$49.99</span>, the rating is a data-rating attribute, and the title is in an <h1> two divs deep. The data is real, but it is wrapped in display markup that changes between templates and carries currency symbols, whitespace, and stray nodes.

The fix is to treat the page as a tree, point a CSS selector at each value you want, and coerce that text into the typed field it belongs in. We'll build a small script that does four things: it fetches the page with a normal browser header so the server returns the full markup, loads the HTML into cheerio so each value can be addressed by a CSS selector, walks one selector-to-field map that turns each match into the typed value it should be, and falls back to the page's JSON-LD block to fill anything the selectors missed. You get back a plain JavaScript object you can JSON.stringify(). It takes about 40 lines of Node.js plus cheerio, the jQuery-style server-side HTML parser, which is the only package to install. The built-in fetch() requires Node.js 18 or later.

The complete script

// extract-json.mjs
import * as cheerio from 'cheerio'

const url = 'https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html'

const html = await fetch(url, {
  headers: { 'User-Agent': 'Mozilla/5.0' }
}).then(r => r.text())

const $ = cheerio.load(html)

/* each field names its selector and a coercion that turns raw text into a typed value.
   keep the map here so adapting to a new page layout is a one-line edit per field. */
const fieldMap = {
  title: {
    selector: 'article.product_page h1',
    coerce: (text) => text.trim() || null
  },
  price: {
    selector: 'article.product_page p.price_color',
    coerce: (text) => {
      const digits = text.replace(/[^0-9.]/g, '')
      return digits ? Number(digits) : null
    }
  },
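  inStock: {
    selector: 'article.product_page p.availability',
    /* null when the node is missing, so "not found" is not reported as "out of stock". */
    coerce: (text) => (text.trim() ? /in stock/i.test(text) : null)
  },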
  description: {
    selector: 'article.product_page > p',
    coerce: (text) => text.trim() || null
  }
}

const record = {}
for (const [field, { selector, coerce }] of Object.entries(fieldMap)) {
  /* .first() guards against a selector matching several nodes and concatenating their text. */
  const raw = $(selector).first().text()
  record[field] = coerce(raw)
}

/* fallback: many sites ship the same fields as JSON-LD for search engines.
   read it and fill any field the selectors left null. */
const ldText = $('script[type="application/ld+json"]').first().text()
if (ldText) {
  try {
    const ld = JSON.parse(ldText)
    if (record.title === null && typeof ld.name === 'string') record.title = ld.name
    if (record.price === null && ld.offers?.price != null) record.price = Number(ld.offers.price)
  } catch {
    /* malformed JSON-LD is common; ignore it and keep the selector results. */
  }
}

console.log(JSON.stringify(record, null, 2))

Install cheerio and run the script:

npm install cheerio
node extract-json.mjs

How it works

Set a stock desktop browser User-Agent. A bare fetch() from Node sends node as its User-Agent, and some servers return a blocked stub for that. A plain Mozilla/5.0 string gets the full page from most public sites. This is politeness, not evasion; a site that hardens against bots needs more than a header.
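The script above reads the body without looking at the response status, so a refusal can pass as a page. Here is a minimal sketch of a stricter fetch, assuming a hypothetical helper named fetchHtml and an arbitrary 500-character threshold for flagging a suspiciously short body. Neither is part of the original script.

```javascript
// Hypothetical helper, not part of the script above.
// fetchImpl defaults to the global fetch and can be swapped for a stub in tests.
async function fetchHtml(url, fetchImpl = fetch) {
  const res = await fetchImpl(url, {
    headers: { 'User-Agent': 'Mozilla/5.0' }
  })
  // A 403 or 429 usually means the server refused the request outright.
  if (!res.ok) {
    throw new Error(`Request for ${url} failed with status ${res.status}`)
  }
  const html = await res.text()
  // A very short body is a hint (not proof) that a stub page came back.
  // The 500-character threshold is an assumption; tune it per site.
  if (html.length < 500) {
    console.warn(`Short response (${html.length} chars) from ${url}`)
  }
  return html
}

// Usage inside an .mjs file:
// const html = await fetchHtml('https://books.toscrape.com/')
```

Throwing on a non-2xx status keeps a block page from flowing into the selectors, where it would only show up later as a record full of nulls.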

Load the HTML into cheerio. cheerio.load(html) parses the string into a queryable tree and returns a $ function that takes CSS selectors. From here you address any value on the page the same way you would in a browser console, without launching one. If fetch hands back an empty container because the page is rendered by JavaScript, every selector matches nothing; render with Puppeteer or Playwright first and pass page.content() to cheerio.load(). See How to scrape a JavaScript-rendered page in Node.js.
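One way to notice the empty-container case early is to count matches before extracting anything. The sketch below assumes a hypothetical helper, unmatchedFields, that takes the $ function and a map shaped like fieldMap. Treat it as a rough signal, since a static page with a changed template also matches nothing.

```javascript
// Hypothetical helper: lists the fields whose selector matched no node at all.
// `$` is the function returned by cheerio.load(html); `fieldMap` has the
// shape { fieldName: { selector, coerce } } used in the script above.
function unmatchedFields($, fieldMap) {
  return Object.entries(fieldMap)
    .filter(([, { selector }]) => $(selector).length === 0)
    .map(([field]) => field)
}

// Usage, assuming $ and fieldMap from the script above:
// const missing = unmatchedFields($, fieldMap)
// if (missing.length === Object.keys(fieldMap).length) {
//   console.warn('No selector matched; the page may be rendered by JavaScript.')
// }
```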

Drive the extraction from a selector map. Each field in fieldMap carries its own selector and a coerce function. The loop reads $(selector).first().text() and passes the raw string through coerce, so price becomes the number 51.77, inStock becomes a boolean, and an empty match becomes null rather than an empty string. The .first() matters: $('p.price_color').text() joins the text of every matching node, so a page with a list price and a sale price hands you "£51.77£45.00", and Number() produces NaN. Coercion does the rest of the cleanup. Strip non-numeric characters before Number() so a price like $1,299.00 does not return NaN, and watch locales where the comma is the decimal mark, because the same filter silently reads 1.299,00 as 1.299. Coerce an empty string to null so a renamed class shows up as a missing field in the JSON instead of a blank that reads like real data. Adapting to a different page layout means editing one selector, not rewriting the loop. When one map has to run across a whole site whose templates drift between categories, give a field two candidate selectors, take the first non-empty result, and log which pages produced a null so you spot template variants before they corrupt the dataset.
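The candidate-selector idea and the locale caveat can be sketched as two small helpers. Both names, firstNonEmpty and parsePrice, are hypothetical, and the class p.price_sale in the usage comment is made up for illustration. The decimal-mark rule is a heuristic under a stated assumption (the last separator is the decimal mark only when one or two digits follow it), not a full locale parser.

```javascript
// Hypothetical helpers, not part of the script above.

// Returns the trimmed text of the first selector that yields non-empty text.
// `$` is the function returned by cheerio.load(html).
function firstNonEmpty($, selectors) {
  for (const selector of selectors) {
    const text = $(selector).first().text().trim()
    if (text) return text
  }
  return ''
}

// Parses a price while guessing the decimal mark: the last '.' or ','
// counts as the decimal mark only when one or two digits follow it.
// This is a heuristic and will misread formats it was not written for.
function parsePrice(text) {
  // Keep digits and separators, then drop separators left dangling at the end.
  const cleaned = text.replace(/[^0-9.,]/g, '').replace(/[.,]+$/, '')
  if (!/[0-9]/.test(cleaned)) return null
  const decimalIndex = Math.max(cleaned.lastIndexOf('.'), cleaned.lastIndexOf(','))
  const fraction = cleaned.slice(decimalIndex + 1)
  const hasDecimal = decimalIndex !== -1 && /^\d{1,2}$/.test(fraction)
  const whole = hasDecimal ? cleaned.slice(0, decimalIndex) : cleaned
  // Every separator left in the whole part is a thousands separator.
  const normalized = whole.replace(/[.,]/g, '') + (hasDecimal ? '.' + fraction : '')
  const value = Number(normalized)
  return Number.isFinite(value) ? value : null
}

// Usage, assuming $ and url from the script above:
// const priceText = firstNonEmpty($, ['p.price_sale', 'p.price_color'])
// const price = parsePrice(priceText)
// if (price === null) console.warn(`price missing on ${url}`)
```

Logging the URL whenever a field comes back null gives you a list of pages to inspect by hand, which is usually how a second template variant gets discovered.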

Fall back to JSON-LD. Many product, article, and recipe pages embed a <script type="application/ld+json"> block with the same fields already structured for search engines. The script reads it, parses it inside a try because a stray trailing comma or an inline HTML comment makes JSON.parse() throw, and fills any field the selectors left null. When a page ships several JSON-LD blocks and the one you want is not the first, iterate $('script[type="application/ld+json"]') and match on the @type you need rather than taking .first(). When the visible markup is messy but the page ships clean structured data, this recovers the value.
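Matching on @type across several blocks could look like the sketch below. The helper name findLdByType is hypothetical. It takes the raw text of each block as an array of strings, and it also looks inside a top-level array or an @graph array, since JSON-LD permits both, and accepts @type as either a string or an array.

```javascript
// Hypothetical helper: returns the first JSON-LD node whose @type matches.
// `rawBlocks` is an array of strings, one per <script type="application/ld+json">.
function findLdByType(rawBlocks, wantedType) {
  for (const raw of rawBlocks) {
    let parsed
    try {
      parsed = JSON.parse(raw)
    } catch {
      continue // malformed block: skip it and try the next one
    }
    // A block may hold a single node or an array of nodes.
    const topLevel = Array.isArray(parsed) ? parsed : [parsed]
    for (const node of topLevel) {
      if (node === null || typeof node !== 'object') continue
      // Nodes are sometimes nested under "@graph".
      const candidates = Array.isArray(node['@graph']) ? node['@graph'] : [node]
      for (const candidate of candidates) {
        if (candidate === null || typeof candidate !== 'object') continue
        // "@type" may be a single string or an array of strings.
        const types = [].concat(candidate['@type'] ?? [])
        if (types.includes(wantedType)) return candidate
      }
    }
  }
  return null
}

// Usage, assuming $ from the script above:
// const rawBlocks = $('script[type="application/ld+json"]')
//   .toArray()
//   .map((el) => $(el).text())
// const product = findLdByType(rawBlocks, 'Product')
// if (record.title === null && typeof product?.name === 'string') record.title = product.name
```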

Use this when

You have one or more pages of display HTML and you want named, typed fields out of them: a price as a number, a stock flag as a boolean, a title as a trimmed string, ready to write to a database or feed an API.

Skip this when

The page is rendered client-side by a single-page app, so render it first with Puppeteer or Playwright; the source already serves a JSON API you can hit directly, so call that instead of parsing display HTML; you want the article body as prose rather than discrete fields, so use a readability pass that outputs clean Markdown; the rows are tabular, so a table-to-CSV walk fits better than a per-field map.
