How to parse and scrape RSS feeds in Node.js

Updated 2026-06-24 · 5 min read

If you've built a feed reader on a generic XML parser, it probably works until the first Atom feed lands, then breaks field by field. Atom has no <channel>, its items are <entry> rather than <item>, its links live in a <link href> attribute instead of element text, and its dates are <updated> rather than <pubDate>. Your reader fills up with if/else branches, and then RSS 1.0 breaks them again. Feeds come in three dialects that look similar and differ in the details, and a parser that already maps them removes the branching.

The solution is to read each feed through rss-parser, which folds the three feed dialects into one shape. We'll build a small script that does three things. It parses any feed URL through a single call, so RSS 2.0, RSS 1.0, and Atom all come back as one array of items with stable keys. It normalizes the fields that differ between the dialects, so downstream code never branches on which format the feed used. And it registers namespaced extras like dc:creator and media:content under plain keys of your choosing. It comes to about 30 lines of Node.js with one library, rss-parser.

The complete script

// parse-rss.mjs
import Parser from 'rss-parser'

// rss-parser skips most namespaced fields by default.
// Register the ones you want and alias them to plain keys.
const parser = new Parser({
  customFields: {
    feed: [['language', 'language']],
    item: [
      ['dc:creator', 'creator'],
      ['media:content', 'media', { keepArray: true }]
    ]
  }
})

const feedUrl = 'https://hnrss.org/frontpage'

const feed = await parser.parseURL(feedUrl)

console.log(`Feed: ${feed.title} (${feed.items.length} items)`)

// RSS uses <pubDate>, Atom uses <updated>/<published>.
// rss-parser exposes both as item.pubDate and item.isoDate.
// isoDate is already a normalized ISO 8601 string, so prefer it.
const items = feed.items.map(item => ({
  title: item.title ?? '(untitled)',
  link: item.link ?? null,
  date: item.isoDate ?? null,
  author: item.creator ?? item.author ?? null,
  summary: item.contentSnippet ?? null
}))

for (const item of items.slice(0, 5)) {
  console.log(`${item.date}  ${item.title}`)
}

Install the library and run the script:

npm install rss-parser
node parse-rss.mjs

How it works

Construct the parser once with custom fields. rss-parser keeps the standard fields (title, link, pubDate, content) but skips most namespaced elements unless you register them. Pass customFields.item to alias dc:creator to a plain creator key, and customFields.feed for feed-level extras like language. The keepArray: true option matters for repeated elements like media:content, where a single item can carry several images; without it you get only the first one. One field pair catches people out: item.content is HTML and may be truncated to the feed's <description>, while the full article body often sits in the namespaced content:encoded. Map that as a custom field (['content:encoded', 'fullContent']) when you need the whole body, and reserve item.contentSnippet, which is the same content with tags stripped, for plain-text previews.
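As a rough sketch of the field choices above, the options object below registers content:encoded and media:content, and two small helpers read them back from a parsed item. The names parserOptions, pickBody, and mediaUrls are illustrative, the sample item is hand-written rather than real parser output, and reading attributes from the `$` key assumes rss-parser's default xml2js settings.

```javascript
// Options intended for `new Parser(parserOptions)` from rss-parser.
// The aliases `fullContent` and `media` are illustrative names.
const parserOptions = {
  customFields: {
    item: [
      ['content:encoded', 'fullContent'],
      ['media:content', 'media', { keepArray: true }]
    ]
  }
}

// Prefer the full article body, then fall back to the
// (possibly truncated) standard content field.
function pickBody(item) {
  return item.fullContent ?? item.content ?? null
}

// Collect the url attribute of every media:content element.
// Assumption: element attributes sit under the `$` key.
function mediaUrls(item) {
  return (item.media ?? [])
    .map(m => m?.$?.url)
    .filter(url => typeof url === 'string')
}

// Hand-written item in the shape the options above are expected to produce.
const sample = {
  content: '<p>Short teaser</p>',
  fullContent: '<p>Short teaser</p><p>The rest of the article.</p>',
  media: [
    { $: { url: 'https://example.com/a.jpg' } },
    { $: { url: 'https://example.com/b.jpg' } }
  ]
}

console.log(pickBody(sample))
console.log(mediaUrls(sample))
```

Keeping the fallback order in one helper means the rest of the script never has to know whether a given feed shipped content:encoded.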

Call parseURL and read feed.items. parseURL fetches and parses in one await. The returned object splits into feed-level metadata (feed.title, feed.description, feed.link) and the per-item array (feed.items). Two failures show up here. First, some hosts block the default User-Agent or return an HTML challenge page, which makes rss-parser throw "Non-whitespace before first tag" or hand back zero items. Set a browser User-Agent (new Parser({ headers: { 'User-Agent': 'Mozilla/5.0' } })), or fetch the XML yourself and pass it to parser.parseString(xml), which is also the path to take when you fetch through a proxy first. Second, the parser is strict: a single unescaped & or stray tag makes parseURL reject, and you lose every valid item. For feeds known to be messy, parse with a lenient library like feedparser that tolerates real-world non-standard markup.
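The fetch-it-yourself path might look something like the sketch below, which sends an explicit User-Agent and rejects responses that do not look like a feed before they reach the XML parser. The helper names fetchFeedXml and looksLikeFeed are hypothetical, the root-element check is only a heuristic, and the default fetch assumes Node 18 or later.

```javascript
// Heuristic: a feed's root element is <rss>, <feed> (Atom), or <rdf:RDF> (RSS 1.0).
// Only the start of the body is inspected.
function looksLikeFeed(body) {
  return /<(rss|feed|rdf:RDF)[\s>]/i.test(body.slice(0, 2000))
}

// Fetch feed XML as a string. `fetchImpl` is injectable so the function can be
// exercised without network access; it defaults to the global fetch.
async function fetchFeedXml(url, fetchImpl = globalThis.fetch) {
  const res = await fetchImpl(url, {
    headers: {
      'User-Agent': 'Mozilla/5.0',
      Accept: 'application/rss+xml, application/atom+xml, application/xml, text/xml'
    }
  })
  if (!res.ok) {
    throw new Error(`Feed request failed with HTTP ${res.status}`)
  }

  // Strip a byte order mark and leading whitespace, which strict parsers reject.
  const body = (await res.text()).replace(/^\uFEFF/, '').trimStart()
  if (!looksLikeFeed(body)) {
    throw new Error('Response is not a feed (possibly an HTML challenge page)')
  }
  return body
}

console.log(looksLikeFeed('<?xml version="1.0"?><rss version="2.0"></rss>'))
console.log(looksLikeFeed('<!doctype html><html><body>Checking your browser</body></html>'))
```

The returned string would then go to the parser, for example `const feed = await parser.parseString(await fetchFeedXml(feedUrl))`, so a blocked request surfaces as a clear error instead of an XML syntax message.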

Prefer isoDate over pubDate. pubDate is the date string as the parser took it from the feed, so for RSS it arrives in whatever timezone and format the publisher chose. isoDate is that value parsed to ISO 8601, populated from whichever date element the feed used (<pubDate> in RSS, <published> or <updated> in Atom), so it has one shape across dialects. It is absent only when an item carries no date the parser can read, which is why the script falls back to null. Sort, store, and compare on isoDate; show pubDate only if you want the publisher's original wording.
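To make the sorting advice concrete, here is a minimal sketch that turns an item's date into a timestamp and orders items newest first, with undated items at the end. The helpers itemTime and sortNewestFirst are illustrative names, and the sample items are made up to stand in for feed.items.

```javascript
// Return a millisecond timestamp for an item, or null when no date parses.
// isoDate is preferred; the raw pubDate is only a fallback.
function itemTime(item) {
  const raw = item.isoDate ?? item.pubDate
  if (!raw) return null
  const t = Date.parse(raw)
  return Number.isNaN(t) ? null : t
}

// Newest first; items without a usable date sink to the end.
// Returns a new array and leaves the input untouched.
function sortNewestFirst(items) {
  return [...items].sort((a, b) => {
    const ta = itemTime(a)
    const tb = itemTime(b)
    if (ta === tb) return 0
    if (ta === null) return 1
    if (tb === null) return -1
    return tb - ta
  })
}

// Made-up items standing in for feed.items.
const sampleItems = [
  { title: 'older', isoDate: '2026-06-01T08:00:00.000Z' },
  { title: 'undated' },
  { title: 'newer', isoDate: '2026-06-20T12:30:00.000Z' },
  { title: 'raw only', pubDate: 'Wed, 10 Jun 2026 09:00:00 GMT' }
]

console.log(sortNewestFirst(sampleItems).map(item => item.title))
// → [ 'newer', 'raw only', 'older', 'undated' ]
```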

Coalesce the fields that go missing. Real feeds omit fields. Use ?? to fall back: item.creator ?? item.author, item.title ?? '(untitled)'. This keeps one item with a missing author from throwing three steps later when you write to a database column. A feed also typically returns only the publisher's most recent items, often 10 to 50, so a one-shot poll misses anything older, and infrequent polls miss items that rotate out between runs. Poll on a schedule, dedupe on item.guid (or item.link when guid is absent), and follow <link rel="next"> pagination when the feed advertises it.
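The dedupe step could be sketched as below: each poll keeps only items whose key has not been seen before. The helpers itemKey and takeNew are hypothetical names, the in-memory Set stands in for whatever store a real job would persist, and the item.id fallback assumes that is where Atom entry ids land.

```javascript
// Stable key for an item: guid when present, then id, then link.
// Assumption: Atom entries expose their <id> as item.id.
function itemKey(item) {
  return item.guid ?? item.id ?? item.link ?? null
}

// Return only the items not seen on earlier polls and record their keys.
// Items with no usable key are skipped in this sketch.
function takeNew(items, seen) {
  const fresh = []
  for (const item of items) {
    const key = itemKey(item)
    if (key === null || seen.has(key)) continue
    seen.add(key)
    fresh.push(item)
  }
  return fresh
}

// Two simulated polls of the same feed; the second overlaps the first.
const seen = new Set()
const firstPoll = [
  { guid: 'post-1', title: 'First' },
  { link: 'https://example.com/2', title: 'Second' }
]
const secondPoll = [
  { link: 'https://example.com/2', title: 'Second' },
  { guid: 'post-3', title: 'Third' }
]

console.log(takeNew(firstPoll, seen).map(item => item.title))
// → [ 'First', 'Second' ]
console.log(takeNew(secondPoll, seen).map(item => item.title))
// → [ 'Third' ]
```

In a scheduled job the Set would be replaced by a unique column or key in the database, so the seen keys survive between runs.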

Use this when

The site publishes a feed and you want its items: a blog reader, a news aggregator, a "new releases" notifier, a podcast index, or an ingest job that pulls fresh posts into a database on a schedule.

Skip this when

The site has no feed and you need to read the HTML itself (scrape the page to Markdown or extract structured fields instead); you want every article's full body rather than the feed summary (follow each item.link and extract the page); the feed is behind authentication (handle the session first); or you are generating a feed from a page rather than reading one (use a feed builder like the Feed library).
