How to parse and scrape RSS feeds in Node.js

Updated 2026-06-24 · 5 min read

If you've built a feed reader on a generic XML parser, it probably works until the first Atom feed arrives and then breaks field by field. Atom has no <channel>, its items are <entry> rather than <item>, its links live in a <link href> attribute instead of element text, and its dates are <updated> rather than <pubDate>. Your reader fills up with if/else branches, and then RSS 1.0 breaks those too. Feeds come in three dialects that look similar and differ in the details, and a parser that already maps them removes the branching.

The solution is to read each feed through rss-parser, which folds RSS 2.0, RSS 1.0, and Atom into one shape, so you get a single array of items with stable keys across all three dialects. It maps Atom <entry> to an item, resolves the link attribute to item.link, and exposes the published date as an already-parsed ISO 8601 string, with custom fields for the namespaced bits like dc:creator and media:content. The whole script comes to about 30 lines of Node.js, and rss-parser is its only dependency.

Key terms

  • Feed dialects. The three similar-but-incompatible XML formats a feed can use, RSS 2.0, RSS 1.0 (RDF), and Atom, which differ in element names like <item> versus <entry>.
  • isoDate. The published date that rss-parser normalizes to an ISO 8601 string, whichever date element the feed actually used, so you can sort and compare without timezone parsing.
  • Custom fields. Parser-config entries that surface and rename non-standard elements, since rss-parser returns only its built-in set of fields plus the ones you explicitly declare.
  • Namespaced fields. Prefixed elements like dc:creator, media:content, and content:encoded that belong to an XML namespace. Those outside the parser's built-in set, media:content for example, are absent from item until registered as custom fields.
  • contentSnippet. The plain-text, tags-stripped version of an item's content, distinct from content (HTML) and the often fuller content:encoded.

Here is what the script does:

  • Fetch and parse a feed URL with rss-parser, which reads RSS 2.0, RSS 1.0, and Atom through the same call.
  • Read the feed-level metadata (title, description, link) and the per-item list in one pass.
  • Normalize the fields that differ between RSS and Atom so downstream code does not have to branch on the feed's dialect.
  • Map custom and namespaced fields (like media:content or dc:creator) that the default parser drops.

The complete script

js
// parse-rss.mjs
import Parser from 'rss-parser'

// rss-parser returns only its built-in set of fields by default.
// Register any extras you want and alias them to plain keys.
const parser = new Parser({
  customFields: {
    feed: [['language', 'language']],
    item: [
      ['dc:creator', 'creator'],
      ['media:content', 'media', { keepArray: true }]
    ]
  }
})

const feedUrl = 'https://hnrss.org/frontpage'

const feed = await parser.parseURL(feedUrl)

console.log(`Feed: ${feed.title} (${feed.items.length} items)`)

// RSS uses <pubDate>, Atom uses <updated>/<published>.
// rss-parser exposes both as item.pubDate and item.isoDate.
// isoDate is already a normalized ISO 8601 string, so prefer it.
const items = feed.items.map(item => ({
  title: item.title ?? '(untitled)',
  link: item.link ?? null,
  date: item.isoDate ?? null,
  author: item.creator ?? item.author ?? null,
  summary: item.contentSnippet ?? null
}))

for (const item of items.slice(0, 5)) {
  console.log(`${item.date}  ${item.title}`)
}

bash
npm install rss-parser
node parse-rss.mjs

What each step does

Construct the parser once with custom fields. rss-parser keeps the standard fields (title, link, pubDate, content) but drops elements it does not recognize, which includes most namespaced extensions. Pass customFields.item to alias dc:creator to a plain creator key, and customFields.feed for feed-level extras like language. The keepArray: true option matters for repeated elements like media:content, where a single item can carry several images: without it you get only the first occurrence.
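
With keepArray: true, the aliased media key holds an array of raw element objects rather than plain strings. Here is a minimal sketch of pulling image URLs out of it. It assumes each element follows the xml2js convention of keeping attributes under a `$` key; mediaUrls is a hypothetical helper, not part of rss-parser, so check the shape against a real item from your feed first.

```javascript
// Hypothetical helper: collect the url attribute from each media:content node.
// Assumes each node looks like { $: { url, medium, ... } } (xml2js attribute
// convention). Returns [] when the item carries no media at all.
function mediaUrls(item) {
  const nodes = Array.isArray(item.media) ? item.media : []
  return nodes
    .map(node => node?.$?.url)
    .filter(url => typeof url === 'string' && url.length > 0)
}

// Hand-built item in the assumed shape, standing in for a parsed feed item.
const sample = {
  title: 'Post with two images',
  media: [
    { $: { url: 'https://example.com/a.jpg', medium: 'image' } },
    { $: { url: 'https://example.com/b.jpg', medium: 'image' } }
  ]
}

console.log(mediaUrls(sample))
```

Returning an empty array for items without media keeps downstream code free of null checks.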

Call parseURL and read feed.items. parseURL fetches and parses in one await. The returned object splits into feed-level metadata (feed.title, feed.description, feed.link) and the per-item array (feed.items). There is also parseString if you already hold the XML, which is the path to take when you fetch the feed yourself through a proxy before parsing.
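
The fetch-it-yourself path can be sketched roughly as follows. fetchAndParse is a hypothetical helper: it takes an rss-parser instance and a fetch implementation (defaulting to the global fetch in Node 18+), so a proxy-aware client could be passed in instead. The Accept header is an illustrative choice, not something rss-parser requires.

```javascript
// Sketch: fetch the XML yourself, then hand the text to parser.parseString.
// `parser` is an rss-parser instance. `fetchImpl` is a parameter so that a
// proxy-aware or instrumented fetch can be swapped in (an assumption of this
// sketch, not an rss-parser feature).
async function fetchAndParse(parser, url, fetchImpl = fetch) {
  const res = await fetchImpl(url, {
    headers: { Accept: 'application/rss+xml, application/atom+xml, application/xml' }
  })

  // Fail loudly on a non-2xx response instead of trying to parse an error page.
  if (!res.ok) {
    throw new Error(`Feed request failed: HTTP ${res.status} for ${url}`)
  }

  const xml = await res.text()
  return parser.parseString(xml)
}
```

In the script above, the call would become `const feed = await fetchAndParse(parser, feedUrl)` in place of `parser.parseURL(feedUrl)`.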

Prefer isoDate over pubDate. pubDate is the raw string the feed shipped, in whatever timezone and format the publisher chose. isoDate is that value parsed to ISO 8601. Sort, store, and compare on isoDate; show pubDate only if you want the publisher's original wording.
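
As an illustration of sorting on isoDate, here is a small sketch. sortNewestFirst is a hypothetical helper, and placing undated items last is a choice made for this example rather than anything the library prescribes.

```javascript
// Sort items newest first on isoDate. Items with a missing or unparseable
// date are treated as oldest, so they end up at the end of the list.
function sortNewestFirst(items) {
  const time = item => {
    const t = Date.parse(item.isoDate ?? '')
    return Number.isNaN(t) ? -Infinity : t
  }
  // Copy before sorting so the caller's array is left untouched.
  return [...items].sort((a, b) => {
    const ta = time(a)
    const tb = time(b)
    if (ta === tb) return 0
    return ta < tb ? 1 : -1
  })
}

const sorted = sortNewestFirst([
  { title: 'old', isoDate: '2024-01-05T10:00:00.000Z' },
  { title: 'undated' },
  { title: 'new', isoDate: '2024-03-01T08:30:00.000Z' }
])

console.log(sorted.map(i => i.title)) // [ 'new', 'old', 'undated' ]
```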

Coalesce the fields that go missing. Real feeds omit fields. Use ?? to fall back: item.creator ?? item.author, item.title ?? '(untitled)'. This keeps one item with a missing author from throwing three steps later when you write to a database column.
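
One caveat: ?? falls back only on null and undefined, so an empty or whitespace-only string passes straight through. If your feeds ship empty elements (an assumption about messy feeds, not documented rss-parser behavior), a stricter fallback might look like this sketch, where firstNonEmpty is a hypothetical helper.

```javascript
// Return the first argument that is a non-blank string, trimmed, or null.
function firstNonEmpty(...values) {
  for (const value of values) {
    if (typeof value === 'string' && value.trim() !== '') return value.trim()
  }
  return null
}

// Hand-built item: blank title, no creator, but an author.
const item = { title: '   ', creator: undefined, author: 'Ada' }

const row = {
  title: firstNonEmpty(item.title) ?? '(untitled)',
  author: firstNonEmpty(item.creator, item.author)
}

console.log(row) // { title: '(untitled)', author: 'Ada' }
```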

Gotchas

  • Atom dates come from different elements than RSS dates.

    • Issue: Atom has no <pubDate>; it uses <updated> and <published>. Code that reads the raw date element, or that expects item.pubDate to hold an RSS-style date string, breaks on Atom feeds.
    • Fix: read item.isoDate, which rss-parser populates from whichever date element the feed used, already normalized to ISO 8601.
  • Namespaced fields silently vanish.

    • Issue: namespaced elements such as media:content are in the XML but absent from item, because the parser surfaces only its built-in field list plus whatever you declare.
    • Fix: register them in customFields.item, for example ['content:encoded', 'fullContent'], and add { keepArray: true } for elements that repeat.
  • A 403 or HTML error page fails to parse or comes back empty.

    • Issue: Some hosts block the default User-Agent or return an HTML challenge page, and rss-parser either throws "Non-whitespace before first tag" or hands back zero items.
    • Fix: set a browser User-Agent on the request, new Parser({ headers: { 'User-Agent': 'Mozilla/5.0' } }), or fetch the XML yourself and pass it to parser.parseString(xml).
  • One malformed entry rejects the whole feed.

    • Issue: rss-parser is strict, so a single unescaped & or a stray tag makes parseURL reject and you lose every valid item in the feed.
    • Fix: for feeds known to be messy, parse with a lenient library like feed-parser, which is built to tolerate real-world non-standard markup.
  • content and contentSnippet are not the same thing.

    • Issue: item.content is HTML and may be truncated to the feed's <description>, while the full article body often sits in the namespaced content:encoded.
    • Fix: map content:encoded as a custom field for the full body, and use item.contentSnippet (tags stripped) only when you want a plain-text preview.
  • Large or paginated feeds give you only the latest items.

    • Issue: A feed carries only the publisher's most recent items, often 10 to 50, so a single poll misses everything older, and polling too rarely misses items that drop off the feed between polls.
    • Fix: poll on a schedule, dedupe on item.guid (or item.link when guid is absent), and follow RFC 5005 <link rel="next"> pagination when the feed advertises it.
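
The dedupe step from the last gotcha can be sketched as below. takeUnseen is a hypothetical helper that keeps seen keys in an in-memory Set; a real polling job would persist them in a file or database between runs. Falling back to item.id assumes that is where an Atom entry's id ends up, so confirm it on your own feeds.

```javascript
// Return only the items whose key has not been seen before, and record
// the new keys in `seen` (a Set that the caller keeps between polls).
function takeUnseen(items, seen) {
  const fresh = []
  for (const item of items) {
    // Prefer guid, then id (assumed Atom entry id), then link.
    const key = item.guid ?? item.id ?? item.link
    // Items with no usable key are skipped in this sketch; decide whether
    // your job should keep them instead.
    if (!key || seen.has(key)) continue
    seen.add(key)
    fresh.push(item)
  }
  return fresh
}

// Two simulated polls of the same feed.
const seen = new Set()
const poll1 = [{ guid: 'a', title: 'A' }, { link: 'https://example.com/b', title: 'B' }]
const poll2 = [{ guid: 'a', title: 'A' }, { guid: 'c', title: 'C' }]

console.log(takeUnseen(poll1, seen).length)            // 2
console.log(takeUnseen(poll2, seen).map(i => i.title)) // [ 'C' ]
```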

Use this when

The site publishes a feed and you want its items: a blog reader, a news aggregator, a "new releases" notifier, a podcast index, or an ingest job that pulls fresh posts into a database on a schedule.

Skip this when

The site has no feed and you need to read the HTML itself (scrape the page to Markdown or extract structured fields instead); you want every article's full body rather than the feed summary (follow each item.link and extract the page); the feed is behind authentication (handle the session first); or you are generating a feed from a page rather than reading one (use a feed builder like the Feed library).
