How to extract structured JSON from messy HTML in Node.js

Updated 2026-06-25 · 5 min read

If you searched for parsing JSON in Node.js and landed on JSON.parse(), that is the wrong tool for what you have. You are not holding a JSON file. You are holding a product page where the price sits inside a <span class="price-now">$49.99</span>, the rating is a data-rating attribute, and the title is in an <h1> two divs deep. The data is real, but it is wrapped in display markup that changes between templates and carries currency symbols, whitespace, and stray nodes.

The fix is to treat the page as a tree, point a CSS selector at each value you want, and coerce that text into the typed field it belongs in. You write one extraction map from selector to field, and you get back a plain JavaScript object you can JSON.stringify(). It takes about 40 lines of Node.js with cheerio, the jQuery-style server-side HTML parser.

Key terms

  • cheerio. A server-side HTML parser with a jQuery-style API, so you can write $('.price').text() against a string of HTML without a browser.
  • Selector map. A small object that pairs each output field with its CSS selector and a coercion function, keeping the extraction logic in one editable place.
  • Coercion. Turning a selector's raw text into a typed value: trimming whitespace, stripping a currency symbol, and calling Number() so price is a number rather than the string "$49.99".
  • JSON-LD. A <script type="application/ld+json"> block many sites embed for search engines, holding the same fields already structured, which you read directly when it is present.

Here is what the script does:

  • Fetch the page HTML with a stock desktop browser User-Agent so the server returns the full markup.
  • Load that HTML into cheerio so each field can be addressed with a CSS selector.
  • Walk a selector map that pairs every output field with its selector and a coercion function, building one typed object.
  • Read the page's JSON-LD block when it exists, and fill any field the selectors missed from that structured data.

The complete script

js
// extract-json.mjs
import * as cheerio from 'cheerio'

const url = 'https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html'

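const response = await fetch(url, {
  headers: { 'User-Agent': 'Mozilla/5.0' }
})
/* Stop on 4xx/5xx so an error page is not parsed into a record of nulls. */
if (!response.ok) throw new Error(`Request failed: ${response.status} ${url}`)
const html = await response.text()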

const $ = cheerio.load(html)

/* Each field names its selector and a coercion that turns raw text into a typed value.
   Keep the map here so adapting to a new page layout is a one-line edit per field. */
const fieldMap = {
  title: {
    selector: 'article.product_page h1',
    coerce: (text) => text.trim() || null
  },
  price: {
    selector: 'article.product_page p.price_color',
    coerce: (text) => {
      const digits = text.replace(/[^0-9.]/g, '')
      return digits ? Number(digits) : null
    }
  },
  inStock: {
    selector: 'article.product_page p.availability',
    coerce: (text) => /in stock/i.test(text)
  },
  description: {
    selector: 'article.product_page > p',
    coerce: (text) => text.trim() || null
  }
}

const record = {}
for (const [field, { selector, coerce }] of Object.entries(fieldMap)) {
  /* .first() guards against a selector matching several nodes and concatenating their text. */
  const raw = $(selector).first().text()
  record[field] = coerce(raw)
}

/* Fallback: many sites ship the same fields as JSON-LD for search engines.
   Read it and fill any field the selectors left null. */
const ldText = $('script[type="application/ld+json"]').first().text()
if (ldText) {
  try {
    const ld = JSON.parse(ldText)
    if (record.title === null && typeof ld.name === 'string') record.title = ld.name
    if (record.price === null && ld.offers?.price != null) record.price = Number(ld.offers.price)
  } catch {
    /* Malformed JSON-LD is common; ignore it and keep the selector results. */
  }
}

console.log(JSON.stringify(record, null, 2))
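
Install cheerio and run the script. It relies on the global fetch(), so it needs Node.js 18 or later.

bash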
npm install cheerio
node extract-json.mjs

What each step does

Set a stock desktop browser User-Agent. A bare fetch() from Node sends node as its User-Agent, and some servers return a blocked stub for that. A plain Mozilla/5.0 string gets the full page from most public sites. This is politeness, not evasion; a site that hardens against bots needs more than a header.

Load the HTML into cheerio. cheerio.load(html) parses the string into a queryable tree and returns a $ function that takes CSS selectors. From here you address any value on the page the same way you would in a browser console, without launching one.

Drive the extraction from a selector map. Each field in fieldMap carries its own selector and a coerce function. The loop reads $(selector).first().text() and passes the raw string through coerce, so price becomes the number 51.77, inStock becomes a boolean, and an empty match becomes null rather than an empty string. Adapting to a different page layout means editing one selector, not rewriting the loop.
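
The coercions can also live as small pure functions, which makes them easy to check against raw strings without fetching a page. The sketch below is one possible shape, not the script's own code: toText and toPrice are hypothetical helper names, and the decimalMark option assumes the caller already knows which locale the page uses.

```js
// Hypothetical helpers (not part of cheerio): each takes the raw text a selector returned.

// Collapse runs of whitespace and trim; an empty match becomes null.
function toText(raw) {
  const text = raw.replace(/\s+/g, ' ').trim()
  return text || null
}

// Strip currency symbols and grouping separators, then convert to a number.
// decimalMark is an assumption supplied by the caller: '.' (default) or ','.
function toPrice(raw, decimalMark = '.') {
  let digits = raw.replace(/[^0-9.,]/g, '')
  if (decimalMark === ',') {
    // "1.299,00" -> "1299.00": dots group thousands, the comma marks decimals.
    digits = digits.replace(/\./g, '').replace(',', '.')
  } else {
    // "1,299.00" -> "1299.00": commas group thousands.
    digits = digits.replace(/,/g, '')
  }
  const value = Number(digits)
  // Number('') is 0 and Number('51.7745.00') is NaN; neither is a usable price.
  return digits && Number.isFinite(value) ? value : null
}

console.log(toText('  A Light\n   in the Attic '))
console.log(toPrice('£51.77'))
console.log(toPrice('$1,299.00'))
console.log(toPrice('1.299,00 €', ','))
console.log(toPrice(''))
```

In the map, a field would then read coerce: toPrice or coerce: (text) => toPrice(text, ','), depending on the site's number format.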

Fall back to JSON-LD. Many product, article, and recipe pages embed a <script type="application/ld+json"> block with the same fields already structured for search engines. The script reads it, parses it inside a try, and fills any field the selectors left null. When the visible markup is messy but the page ships clean structured data, this recovers the value.
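
JSON-LD in the wild is not always a single object: a block can hold an array of nodes or an @graph wrapper, and offers can be an array. A minimal sketch of a more tolerant reader follows. findLdNode and ldPrice are hypothetical helper names, and the sample strings are invented for illustration rather than taken from a real page.

```js
// Hypothetical helper: takes the text of every JSON-LD script block on a page
// and returns the first node whose @type matches, or null.
function findLdNode(blocks, wantedType) {
  for (const text of blocks) {
    let parsed
    try {
      parsed = JSON.parse(text)
    } catch {
      continue // Malformed block: skip it and try the next one.
    }
    // A block may be a single node, an array of nodes, or a wrapper with @graph.
    const graph = parsed?.['@graph']
    const nodes = Array.isArray(parsed) ? parsed : Array.isArray(graph) ? graph : [parsed]
    for (const node of nodes) {
      // @type may be a string or an array of strings.
      const types = [].concat(node?.['@type'] ?? [])
      if (types.includes(wantedType)) return node
    }
  }
  return null
}

// offers may be one object or an array of them; take the first usable price.
function ldPrice(node) {
  const offers = [].concat(node?.offers ?? [])
  const offer = offers.find((o) => o?.price != null)
  const value = offer ? Number(offer.price) : NaN
  return Number.isFinite(value) ? value : null
}

// Invented sample input: the first block has a trailing comma and fails to parse.
const blocks = [
  '{"@type": "BreadcrumbList",}',
  '{"@graph": [{"@type": "WebPage"}, {"@type": "Product", "name": "Sample", "offers": [{"price": "51.77"}]}]}'
]
const product = findLdNode(blocks, 'Product')
console.log(product?.name, ldPrice(product))
```

With cheerio, the blocks array could be collected with $('script[type="application/ld+json"]').map((_, el) => $(el).text()).get(), then used to fill only the fields the selectors left null.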

Gotchas

  • A selector matches more than one node and concatenates their text.

    • Issue: $('p.price_color').text() returns the joined text of every matching node, so a page with a list price and a sale price hands you "£51.77£45.00" and Number() produces NaN.
    • Fix: scope to one node with $(selector).first(), as the script does, or tighten the selector to the container you mean, for example article.product_page p.price_color.
  • The number keeps its currency symbol and separators.

    • Issue: Number('$1,299.00') returns NaN because the dollar sign and comma are not numeric characters. JSON.stringify() then writes that NaN as null, and any math done before serializing propagates NaN.
    • Fix: strip non-numeric characters before converting, text.replace(/[^0-9.]/g, ''), then Number(). Watch locales where the comma is the decimal mark.
  • A missing element returns an empty string, not an error.

    • Issue: cheerio's .text() on a selector that matches nothing returns '', so a renamed class silently yields a blank field and you do not notice until the data is wrong.
    • Fix: coerce '' to null in each field, text.trim() || null, so a missing value is visible in the JSON instead of an empty string masquerading as data.
  • The page is rendered by JavaScript and the HTML is a shell.

    • Issue: fetch returns only the server's initial HTML, so a React, Vue, or Next-with-client-data page hands back an empty container and every selector matches nothing.
    • Fix: render with Puppeteer or Playwright first, then pass page.content() to cheerio.load(). See How to scrape a JavaScript-rendered page in Node.js.
  • JSON-LD is malformed or split across several blocks.

    • Issue: a stray trailing comma or an HTML comment inside the script tag makes JSON.parse() throw, and some pages ship multiple JSON-LD blocks where the one you want is not the first.
    • Fix: wrap the parse in try/catch as the script does, and when a page has several blocks, iterate $('script[type="application/ld+json"]') and match on the @type you need rather than taking .first().
  • You need the same map to run against many pages of one site.

    • Issue: product templates drift between categories, so a selector that works on one page returns null on another with a slightly different layout.
    • Fix: give a field two candidate selectors and take the first non-empty result, and log which pages produced a null so you can spot template variants before they corrupt the dataset.
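
The last fix above, several candidate selectors per field plus a log of misses, can be sketched as follows. This is one possible arrangement under stated assumptions: firstMatch and extractWithFallbacks are hypothetical names, the selectors array replaces the single selector key used in the script, and readText stands in for (selector) => $(selector).first().text() so the logic runs without a parsed page.

```js
// Return the first non-empty trimmed text among the candidate selectors, else ''.
function firstMatch(selectors, readText) {
  for (const selector of selectors) {
    const text = readText(selector).trim()
    if (text) return text
  }
  return ''
}

// Build one record and report which fields came back null for this page.
function extractWithFallbacks(fieldMap, readText, pageUrl) {
  const record = {}
  const misses = []
  for (const [field, { selectors, coerce }] of Object.entries(fieldMap)) {
    record[field] = coerce(firstMatch(selectors, readText))
    if (record[field] === null) misses.push(field)
  }
  if (misses.length) console.warn(`null fields on ${pageUrl}: ${misses.join(', ')}`)
  return { record, misses }
}

// Stand-in for a parsed page (selector -> text), invented for illustration.
// With cheerio, readText would be (selector) => $(selector).first().text()
const fakePage = new Map([['p.price_sale', ' £45.00 ']])
const readText = (selector) => fakePage.get(selector) ?? ''

const result = extractWithFallbacks(
  {
    price: {
      selectors: ['p.price_color', 'p.price_sale'],
      coerce: (text) => (text ? Number(text.replace(/[^0-9.]/g, '')) : null)
    },
    title: { selectors: ['h1'], coerce: (text) => text || null }
  },
  readText,
  'https://example.com/item/1'
)
console.log(result)
```

Collecting the misses per URL, rather than only logging them, makes it possible to group pages by which fields failed and spot a template variant from the pattern.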

Use this when

You have one or more pages of display HTML and you want named, typed fields out of them: a price as a number, a stock flag as a boolean, a title as a trimmed string, ready to write to a database or feed an API.

Skip this when

  • The page is rendered as a single-page app: render it first with Puppeteer.
  • The source already serves a JSON API: call that directly instead of parsing display HTML.
  • You want the article body as prose rather than discrete fields: use a readability pass that outputs clean Markdown.
  • The rows are tabular: a table-to-CSV walk fits better than a per-field map.
