How to scrape a table into CSV in JavaScript
If you're turning an HTML table into a CSV by reading each cell and joining it with commas, you're probably about to ship a file that looks correct and quietly breaks later. The moment a cell holds a name like Smith, John, a product with a " in it, or an address that wraps onto two lines, the row gains a column or splits in two, and you won't see it until a downstream import fails. This is one of the most common ways a scrape goes wrong, and it has a settled fix.
The solution is to read each cell as a plain value and hand the rows to a library that knows CSV's escaping rules, so any cell with a comma, a quote, or a newline gets quoted correctly instead of corrupting the line. You get a file that opens the same in every spreadsheet and survives a database import. That comes to about 40 lines of Node.js with two open-source libraries, cheerio to parse the table and csv-stringify to escape every cell.
Key terms
- cheerio. A server-side HTML parser that lets you query the parsed DOM with jQuery-style selectors like
$('table.wikitable'). - csv-stringify. A library that turns arrays of row values into CSV text, applying the format's quoting and escaping rules for you.
- CSV escaping. The rule that a field with a comma, double quote, or newline must be wrapped in double quotes, with any literal quote inside it doubled.
colspanandrowspan. Attributes that merge a cell across columns or rows, which breaks naive column alignment unless you pad for the span.theadandtbody. The table sections that separate header rows from body rows, which the script reads with distinct selectors.
Here is what the script does:
- Fetch the page's HTML with a normal browser User-Agent, so the server returns the full server-rendered page instead of a bot-blocked stub.
- Parse the HTML with cheerio and select the one table you want, by id, by caption, or by index.
- Walk the header row and the body rows with cheerio, reading each cell's trimmed text.
- Hand the rows to csv-stringify, which quotes and escapes any cell containing a comma, a quote, or a newline.
The complete script
// scrape-table-to-csv.mjs
import { load } from 'cheerio'
import { stringify } from 'csv-stringify/sync'
import { writeFile } from 'node:fs/promises'
const url = 'https://en.wikipedia.org/wiki/List_of_countries_by_population_(United_Nations)'
/* Fetch with a browser User-Agent so the server returns the real page. */
const html = await fetch(url, {
headers: { 'User-Agent': 'Mozilla/5.0' }
}).then(r => r.text())
const $ = load(html)
/* Select one specific table. `.wikitable` is the data table on this page;
on your target, use an id (`#prices`), a caption, or an index (see Gotchas). */
const table = $('table.wikitable').first()
/* Read the header row. Wikipedia puts the header in <th> cells inside <thead>;
fall back to the first <tr> when there is no <thead> (see Gotchas). */
const headerCells = table.find('thead tr').first().find('th, td')
const headers = headerCells.map((i, el) => $(el).text().trim()).get()
/* Read every body row. .text() strips nested markup and returns plain text;
.trim() drops the whitespace and newlines that HTML indentation leaves behind. */
const rows = table.find('tbody tr').map((i, tr) => {
const cells = $(tr).find('td, th')
return [cells.map((j, td) => $(td).text().trim().replace(/\s+/g, ' ')).get()]
}).get().filter(row => row.length > 0)
/* csv-stringify quotes and escapes any cell with a comma, quote, or newline.
Passing `columns: headers` writes the header row first. */
const csv = stringify(rows, { header: true, columns: headers })
await writeFile('table.csv', csv)
console.log(`Wrote ${rows.length} rows and ${headers.length} columns to table.csv`)npm install cheerio csv-stringify
node scrape-table-to-csv.mjsWhat each step does
Fetch with a browser User-Agent. A bare fetch() from Node sends node as its User-Agent, and plenty of sites return a 403 on that. A normal Mozilla string fixes most of them. This is politeness, not stealth; a site that genuinely blocks bots blocks harder than a header.
Select one table, not all of them. $('table.wikitable').first() pins the script to a single table. A page often has several tables, including layout tables wrapping the content, so a bare $('table') grabs the wrong one. Anchor on an id, a class, a nearby caption, or a known index.
Read the header row separately. The header lives in <th> cells, usually inside <thead>. Reading it on its own keeps the column names out of the data rows and gives csv-stringify the columns list it needs to label the output.
Trim and collapse cell text. .text() returns the concatenated text of a cell and all its descendants, so a cell holding <a>France</a> yields France without the markup. .trim() removes the surrounding whitespace from HTML indentation, and .replace(/\s+/g, ' ') collapses internal runs of spaces and newlines into one space.
Let csv-stringify build the file. Passing the array of row-arrays plus { header: true, columns: headers } emits a header line and one escaped line per row. The /sync import returns a string in one call, which is the right shape for a table that fits in memory.
Gotchas
The table has no
<thead>.- Issue: many hand-written tables put the header in the first
<tr>with no<thead>wrapper, so athead trselector returns nothing. - Fix: fall back to the first row:
const headerRow = table.find('thead tr').first(); const header = headerRow.length ? headerRow : table.find('tr').first(), then read body rows fromtable.find('tr').slice(1)instead oftbody tr.
- Issue: many hand-written tables put the header in the first
A cell contains a comma, quote, or newline.
- Issue: this is the whole reason to use csv-stringify rather than
join(','). Hand-rolled joining turnsParis, Franceinto two columns and breaks on an embedded quote. - Fix: let csv-stringify escape it:
Paris, Francebecomes"Paris, France", and27" monitorbecomes"27"" monitor"with the inner quote doubled. Never string-join CSV.
- Issue: this is the whole reason to use csv-stringify rather than
A cell spans columns or rows.
- Issue: a merged cell breaks column alignment, because one
<td colspan="2">fills two columns but counts as a single element. - Fix: read the span and pad:
const span = parseInt($(td).attr('colspan')) || 1, then push the value once andspan - 1empty strings after it.rowspanbelongs to rows below, so for those tables prefertabletojson, which tracks row and column spans for you.
- Issue: a merged cell breaks column alignment, because one
A table is nested inside a cell.
- Issue: a
<table>inside a<td>makestable.find('tbody tr')pick up the inner table's rows as well as the outer ones. - Fix: scope the row search to direct children with
table.children('tbody').children('tr'), or select the outer table by an id the inner one does not share.
- Issue: a
A cell holds markup instead of plain text.
- Issue:
.text()flattens<a>,<span>, and<br>to their text, so you lose link targets and run two<br>-separated values together. - Fix: read what you need explicitly. For a link target,
$(td).find('a').attr('href'). To keep<br>-separated values apart,$(td).find('br').replaceWith('\n')before reading.
- Issue:
More than one table matches the selector.
- Issue: if
table.wikitablematches several tables,.first()takes the wrong one as often as the right one. - Fix: pick by index after inspecting the page (
$('table.wikitable').eq(2)), or anchor on the<caption>text:$('table').filter((i, el) => $(el).find('caption').text().includes('Population')).first().
- Issue: if
The table is rendered by JavaScript.
- Issue:
fetchonly returns the server's initial HTML, so a table built client-side by React, Vue, or a data-grid library is absent and cheerio finds nothing. - Fix: render the page with Puppeteer first, then pass
await page.content()to cheerio'sload(). See How to scrape a JavaScript-rendered page in Node.js.
- Issue:
Use this when
You have a page with an actual HTML <table> element and you want its contents as a CSV file for a spreadsheet, a database import, or a data-analysis step. This is the right tool for tabular data that exists in the page's server-rendered HTML.
Skip this when
The table is drawn by client-side JavaScript and absent from the initial HTML (render with Puppeteer first, then parse the rendered DOM); the data is in a PDF rather than an HTML table (use a PDF table parser such as pdfplumber); the "table" is a CSS grid of <div> elements with no <table> tag (select the row and cell <div> classes directly instead of td/th); or you need typed JSON objects keyed by column rather than flat CSV (use tabletojson, which also tracks colspan and rowspan).