Simplescraper
Skip to content

How to scrape a table into CSV in JavaScript

How to scrape a table into CSV in JavaScript

Updated 2026-06-18 · 6 min read

If you're turning an HTML table into a CSV by reading each cell and joining it with commas, you're probably about to ship a file that looks correct and quietly breaks later. The moment a cell holds a name like Smith, John, a product with a " in it, or an address that wraps onto two lines, the row gains a column or splits in two, and you won't see it until a downstream import fails. This is one of the most common ways a scrape goes wrong, and it has a settled fix.

The solution is to read each cell as a plain value and hand the rows to a library that knows CSV's escaping rules, so any cell with a comma, a quote, or a newline gets quoted correctly instead of corrupting the line. You get a file that opens the same in every spreadsheet and survives a database import. That comes to about 40 lines of Node.js with two open-source libraries, cheerio to parse the table and csv-stringify to escape every cell.

Key terms

  • cheerio. A server-side HTML parser that lets you query the parsed DOM with jQuery-style selectors like $('table.wikitable').
  • csv-stringify. A library that turns arrays of row values into CSV text, applying the format's quoting and escaping rules for you.
  • CSV escaping. The rule that a field with a comma, double quote, or newline must be wrapped in double quotes, with any literal quote inside it doubled.
  • colspan and rowspan. Attributes that merge a cell across columns or rows, which breaks naive column alignment unless you pad for the span.
  • thead and tbody. The table sections that separate header rows from body rows, which the script reads with distinct selectors.

Here is what the script does:

  • Fetch the page's HTML with a normal browser User-Agent, so the server returns the full server-rendered page instead of a bot-blocked stub.
  • Parse the HTML with cheerio and select the one table you want, by id, by caption, or by index.
  • Walk the header row and the body rows with cheerio, reading each cell's trimmed text.
  • Hand the rows to csv-stringify, which quotes and escapes any cell containing a comma, a quote, or a newline.

The complete script

js
// scrape-table-to-csv.mjs
import { load } from 'cheerio'
import { stringify } from 'csv-stringify/sync'
import { writeFile } from 'node:fs/promises'

const url = 'https://en.wikipedia.org/wiki/List_of_countries_by_population_(United_Nations)'

/* Fetch with a browser User-Agent so the server returns the real page. */
const html = await fetch(url, {
  headers: { 'User-Agent': 'Mozilla/5.0' }
}).then(r => r.text())

const $ = load(html)

/* Select one specific table. `.wikitable` is the data table on this page;
   on your target, use an id (`#prices`), a caption, or an index (see Gotchas). */
const table = $('table.wikitable').first()

/* Read the header row. Wikipedia puts the header in <th> cells inside <thead>;
   fall back to the first <tr> when there is no <thead> (see Gotchas). */
const headerCells = table.find('thead tr').first().find('th, td')
const headers = headerCells.map((i, el) => $(el).text().trim()).get()

/* Read every body row. .text() strips nested markup and returns plain text;
   .trim() drops the whitespace and newlines that HTML indentation leaves behind. */
const rows = table.find('tbody tr').map((i, tr) => {
  const cells = $(tr).find('td, th')
  return [cells.map((j, td) => $(td).text().trim().replace(/\s+/g, ' ')).get()]
}).get().filter(row => row.length > 0)

/* csv-stringify quotes and escapes any cell with a comma, quote, or newline.
   Passing `columns: headers` writes the header row first. */
const csv = stringify(rows, { header: true, columns: headers })

await writeFile('table.csv', csv)
console.log(`Wrote ${rows.length} rows and ${headers.length} columns to table.csv`)
bash
npm install cheerio csv-stringify
node scrape-table-to-csv.mjs

What each step does

Fetch with a browser User-Agent. A bare fetch() from Node sends node as its User-Agent, and plenty of sites return a 403 on that. A normal Mozilla string fixes most of them. This is politeness, not stealth; a site that genuinely blocks bots blocks harder than a header.

Select one table, not all of them. $('table.wikitable').first() pins the script to a single table. A page often has several tables, including layout tables wrapping the content, so a bare $('table') grabs the wrong one. Anchor on an id, a class, a nearby caption, or a known index.

Read the header row separately. The header lives in <th> cells, usually inside <thead>. Reading it on its own keeps the column names out of the data rows and gives csv-stringify the columns list it needs to label the output.

Trim and collapse cell text. .text() returns the concatenated text of a cell and all its descendants, so a cell holding <a>France</a> yields France without the markup. .trim() removes the surrounding whitespace from HTML indentation, and .replace(/\s+/g, ' ') collapses internal runs of spaces and newlines into one space.

Let csv-stringify build the file. Passing the array of row-arrays plus { header: true, columns: headers } emits a header line and one escaped line per row. The /sync import returns a string in one call, which is the right shape for a table that fits in memory.

Gotchas

  • The table has no <thead>.

    • Issue: many hand-written tables put the header in the first <tr> with no <thead> wrapper, so a thead tr selector returns nothing.
    • Fix: fall back to the first row: const headerRow = table.find('thead tr').first(); const header = headerRow.length ? headerRow : table.find('tr').first(), then read body rows from table.find('tr').slice(1) instead of tbody tr.
  • A cell contains a comma, quote, or newline.

    • Issue: this is the whole reason to use csv-stringify rather than join(','). Hand-rolled joining turns Paris, France into two columns and breaks on an embedded quote.
    • Fix: let csv-stringify escape it: Paris, France becomes "Paris, France", and 27" monitor becomes "27"" monitor" with the inner quote doubled. Never string-join CSV.
  • A cell spans columns or rows.

    • Issue: a merged cell breaks column alignment, because one <td colspan="2"> fills two columns but counts as a single element.
    • Fix: read the span and pad: const span = parseInt($(td).attr('colspan')) || 1, then push the value once and span - 1 empty strings after it. rowspan belongs to rows below, so for those tables prefer tabletojson, which tracks row and column spans for you.
  • A table is nested inside a cell.

    • Issue: a <table> inside a <td> makes table.find('tbody tr') pick up the inner table's rows as well as the outer ones.
    • Fix: scope the row search to direct children with table.children('tbody').children('tr'), or select the outer table by an id the inner one does not share.
  • A cell holds markup instead of plain text.

    • Issue: .text() flattens <a>, <span>, and <br> to their text, so you lose link targets and run two <br>-separated values together.
    • Fix: read what you need explicitly. For a link target, $(td).find('a').attr('href'). To keep <br>-separated values apart, $(td).find('br').replaceWith('\n') before reading.
  • More than one table matches the selector.

    • Issue: if table.wikitable matches several tables, .first() takes the wrong one as often as the right one.
    • Fix: pick by index after inspecting the page ($('table.wikitable').eq(2)), or anchor on the <caption> text: $('table').filter((i, el) => $(el).find('caption').text().includes('Population')).first().
  • The table is rendered by JavaScript.

    • Issue: fetch only returns the server's initial HTML, so a table built client-side by React, Vue, or a data-grid library is absent and cheerio finds nothing.
    • Fix: render the page with Puppeteer first, then pass await page.content() to cheerio's load(). See How to scrape a JavaScript-rendered page in Node.js.

Use this when

You have a page with an actual HTML <table> element and you want its contents as a CSV file for a spreadsheet, a database import, or a data-analysis step. This is the right tool for tabular data that exists in the page's server-rendered HTML.

Skip this when

The table is drawn by client-side JavaScript and absent from the initial HTML (render with Puppeteer first, then parse the rendered DOM); the data is in a PDF rather than an HTML table (use a PDF table parser such as pdfplumber); the "table" is a CSS grid of <div> elements with no <table> tag (select the row and cell <div> classes directly instead of td/th); or you need typed JSON objects keyed by column rather than flat CSV (use tabletojson, which also tracks colspan and rowspan).

Skip the code, just get the data

Simplescraper turns any website into structured data in seconds.