How to scrape a table into CSV in JavaScript

Updated 2026-06-18 · 6 min read

If you're turning an HTML table into a CSV by reading each cell and joining it with commas, you're probably about to ship a file that looks correct and quietly breaks later. The moment a cell holds a name like Smith, John, a product with a " in it, or an address that wraps onto two lines, the row gains a column or splits in two, and you won't see it until a downstream import fails. This is one of the most common ways a scrape goes wrong, and it has a settled fix.

The solution is to read each cell as a plain value and hand the rows to a library that knows CSV's escaping rules, so any cell with a comma, a quote, or a newline gets quoted correctly instead of corrupting the line. We'll build a small script that fetches the page with a browser-style User-Agent header so the server returns the real table instead of a bot stub, parses the HTML and selects the one table you want, reads the plain text of each cell in the header row and the body rows, and passes those rows to a CSV writer. You get a file that opens the same in every spreadsheet and survives a database import. It comes out to about 40 lines of Node.js and two open-source libraries: cheerio to query the parsed HTML with jQuery-style selectors, and csv-stringify to apply CSV's quoting and escaping rules for you.

The complete script

// scrape-table-to-csv.mjs
import { load } from 'cheerio'
import { stringify } from 'csv-stringify/sync'
import { writeFile } from 'node:fs/promises'

const url = 'https://en.wikipedia.org/wiki/List_of_countries_by_population_(United_Nations)'

/* fetch with a browser User-Agent so the server returns the real page. */
const html = await fetch(url, {
  headers: { 'User-Agent': 'Mozilla/5.0' }
}).then(r => r.text())

const $ = load(html)

/* select one specific table. `.wikitable` is the data table on this page;
   on your target, use an id (`#prices`), a caption, or an index. */
const table = $('table.wikitable').first()

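/* read the header row. use <thead> when the table has one; otherwise fall
   back to the first <tr>, which is where many tables keep their <th> cells. */
const theadRow = table.find('thead tr').first()
const hasThead = theadRow.length > 0
const headerRow = hasThead ? theadRow : table.find('tr').first()
const headers = headerRow.find('th, td').map((i, el) => $(el).text().trim().replace(/\s+/g, ' ')).get()

/* read every body row, skipping the header row when it sits in the body.
   .text() strips nested markup and returns plain text; .trim() drops the
   surrounding whitespace that HTML indentation leaves behind, and the
   replace collapses internal runs of whitespace into one space. */
const bodyRows = hasThead ? table.find('tbody tr') : table.find('tr').slice(1)
const rows = bodyRows.map((i, tr) => {
  const cells = $(tr).find('td, th')
  /* wrap the row in an array because cheerio's .map() flattens one level */
  return [cells.map((j, td) => $(td).text().trim().replace(/\s+/g, ' ')).get()]
}).get().filter(row => row.length > 0)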

/* csv-stringify quotes and escapes any cell with a comma, quote, or newline.
   `header: true` together with `columns: headers` writes the header row first. */
const csv = stringify(rows, { header: true, columns: headers })

await writeFile('table.csv', csv)
console.log(`Wrote ${rows.length} rows and ${headers.length} columns to table.csv`)

Install the two libraries and run the script:

npm install cheerio csv-stringify
node scrape-table-to-csv.mjs

How it works

Fetch with a browser User-Agent. A bare fetch() from Node sends node as its User-Agent, and plenty of sites return a 403 on that. A normal Mozilla string fixes most of them. This is not stealth: a site that genuinely blocks bots checks far more than one header. One failure mode dominates here. fetch only sees the server's initial HTML, so a table built client-side by React, Vue, or a data-grid library is absent and cheerio finds nothing. In that case, render the page with Puppeteer first and pass await page.content() to load().
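One way to make the 403 case visible is to check the response status before parsing anything. The following is a minimal sketch, assuming Node 18 or later where fetch is a global; fetchHtml and its fetchImpl parameter are illustrative names that do not appear in the script above.

```javascript
// Hypothetical helper: fetch a page with a browser-like User-Agent and
// fail loudly on a non-2xx status instead of parsing an error page.
// `fetchImpl` defaults to the global fetch (Node 18+); it is injectable
// only so the function can be exercised without a network call.
async function fetchHtml(url, fetchImpl = fetch) {
  const res = await fetchImpl(url, {
    headers: { 'User-Agent': 'Mozilla/5.0' }
  })
  if (!res.ok) {
    throw new Error(`Request failed: ${res.status} ${res.statusText} for ${url}`)
  }
  return res.text()
}
```

In the script, const html = await fetchHtml(url) could stand in for the bare fetch(...).then(r => r.text()) call, so a blocked request stops the run with a clear message rather than producing an empty CSV.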

Select one table, not all of them. $('table.wikitable').first() pins the script to a single table. A page often has several tables, including layout tables wrapping the content, so a bare $('table') grabs the wrong one. When one class matches several tables, .first() simply takes whichever comes first in the document, which may not be the one you want. Anchor on an id, a class, a known index ($('table.wikitable').eq(2)), or the caption text: $('table').filter((i, el) => $(el).find('caption').text().includes('Population')).first(). If the table you want is nested inside another table's cell, table.find('tbody tr') picks up the inner rows too, so scope the search to direct children with table.children('tbody').children('tr').

Read the header row separately. The header lives in <th> cells, usually inside <thead>. Reading it on its own keeps the column names out of the data rows and gives csv-stringify the columns list it needs to label the output. Many hand-written tables put the header in the first <tr> with no <thead> wrapper, so a thead tr selector returns nothing; fall back to the first row with table.find('thead tr').first().length ? ... : table.find('tr').first() and read body rows from table.find('tr').slice(1) instead of tbody tr.

Trim and collapse cell text. .text() returns the concatenated text of a cell and all its descendants, so a cell holding <a>France</a> yields France without the markup. .trim() removes the surrounding whitespace from HTML indentation, and .replace(/\s+/g, ' ') collapses internal runs of spaces and newlines into one space. Flattening this way loses link targets and runs <br>-separated values together, so when you need those, read them explicitly: $(td).find('a').attr('href') for a link, or $(td).find('br').replaceWith('\n') before reading to keep <br>-separated values apart. A merged cell also breaks column alignment, since one <td colspan="2"> fills two columns but counts as a single element. Read the span with parseInt($(td).attr('colspan'), 10) || 1 and push that many values. For rowspan, which spills into rows below, reach for tabletojson, which tracks both spans for you.
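The colspan handling can be sketched as a small pure function. It assumes each cell has already been read into a { text, colspan } object, for example from $(td).text() and $(td).attr('colspan'). Both expandColspans and that object shape are illustrative, not part of cheerio, and the sample values are made up. The sketch repeats the cell's text in every column it spans; pushing the text once followed by empty strings is an equally reasonable choice.

```javascript
// Hypothetical helper: expand merged cells so every row has one value per column.
// `cells` is an array of { text, colspan }, where colspan is the raw attribute
// value (a string such as "2", or undefined when the attribute is absent).
function expandColspans(cells) {
  const out = []
  for (const { text, colspan } of cells) {
    // A missing, zero, negative, or non-numeric colspan falls back to 1.
    const span = Math.max(1, parseInt(colspan, 10) || 1)
    for (let i = 0; i < span; i++) out.push(text)
  }
  return out
}

// Made-up sample row: one cell spanning two columns, then a normal cell.
const row = expandColspans([
  { text: 'Total', colspan: '2' },
  { text: '42' }
])
console.log(row) // [ 'Total', 'Total', '42' ]
```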

Let csv-stringify build the file. Passing the array of row-arrays plus { header: true, columns: headers } emits a header line and one escaped line per row, quoting any cell with a comma, quote, or newline: Paris, France becomes "Paris, France" and 27" monitor becomes "27"" monitor" with the inner quote doubled. This escaping is the whole reason to use the library rather than join(','), which would turn that first value into two columns and break on the embedded quote. The /sync import returns a string in one call, which is the right shape for a table that fits in memory.
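To make the quoting rule concrete, here is a hand-rolled sketch of what a CSV writer does to each field. It illustrates the rule only: it is not csv-stringify's implementation, quoteField is a made-up name, and in real code the library remains the better choice.

```javascript
// Illustrative only: quote a field when it contains a comma, a double quote,
// or a line break, and double any embedded quotes.
function quoteField(value) {
  const s = String(value)
  if (/[",\r\n]/.test(s)) {
    return '"' + s.replace(/"/g, '""') + '"'
  }
  return s
}

const cells = ['Paris, France', '27" monitor', 'plain']

// Naive join: the comma inside the first cell creates an extra column.
console.log(cells.join(','))
// Paris, France,27" monitor,plain

// Quoted join: three fields, unambiguous to a CSV parser.
console.log(cells.map(quoteField).join(','))
// "Paris, France","27"" monitor",plain
```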

Use this when

You have a page with an actual HTML <table> element and you want its contents as a CSV file for a spreadsheet, a database import, or a data-analysis step. This is the right tool for tabular data that exists in the page's server-rendered HTML.

Skip this when

The table is drawn by client-side JavaScript and absent from the initial HTML (render with Puppeteer first, then parse the rendered DOM); the data is in a PDF rather than an HTML table (use a PDF table parser such as pdfplumber); the "table" is a CSS grid of <div> elements with no <table> tag (select the row and cell <div> classes directly instead of td/th); or you need typed JSON objects keyed by column rather than flat CSV (use tabletojson, which also tracks colspan and rowspan).
