How to parse and scrape RSS feeds in Node.js
If you've built a feed reader on a generic XML parser, it probably works until the first Atom feed lands and then breaks field by field. Atom has no <channel>, its items are <entry> rather than <item>, its links live in the href attribute of <link> instead of element text, and its dates are <updated> rather than <pubDate>. Your reader fills up with if/else branches, and then RSS 1.0 breaks those too. Feeds come in three dialects that look similar and differ in the details, and a parser that already maps them removes the branching.
The solution is to read each feed through rss-parser, which folds RSS 2.0, RSS 1.0, and Atom into one shape, so you get a single array of items with stable keys across all three dialects. It maps Atom <entry> to item, resolves the link attribute to item.link, and exposes the published date as an already-parsed ISO 8601 string, with custom fields for the namespaced bits like dc:creator and media:content. The whole thing comes to about 30 lines of Node.js and one dependency.
Key terms
- Feed dialects. The three similar-but-incompatible XML formats a feed can use, RSS 2.0, RSS 1.0 (RDF), and Atom, which differ in element names like <item> versus <entry>.
- isoDate. The published date that rss-parser normalizes to an ISO 8601 string, whichever date element the feed actually used, so you can sort and compare without timezone parsing.
- Custom fields. Parser-config entries that surface and rename non-standard elements, since rss-parser drops elements it has no built-in mapping for unless you declare them.
- Namespaced fields. Prefixed elements like dc:creator, media:content, and content:encoded that belong to an XML namespace and can be absent from item until registered as custom fields.
- contentSnippet. The plain-text, tags-stripped version of an item's content, distinct from content (HTML) and the often fuller content:encoded.
Here is what the script does:
- Fetch and parse a feed URL with rss-parser, which reads RSS 2.0, RSS 1.0, and Atom through the same call.
- Read the feed-level metadata (title, description, link) and the per-item list in one pass.
- Normalize the fields that differ between RSS and Atom so downstream code does not have to branch on the feed's dialect.
- Map custom and namespaced fields (like media:content or dc:creator) that the default parser drops.
The complete script
// parse-rss.mjs
import Parser from 'rss-parser'

// rss-parser does not name namespaced fields by default.
// Register the ones you want and alias them to plain keys.
const parser = new Parser({
  customFields: {
    feed: [['language', 'language']],
    item: [
      ['dc:creator', 'creator'],
      ['media:content', 'media', { keepArray: true }]
    ]
  }
})

const feedUrl = 'https://hnrss.org/frontpage'
const feed = await parser.parseURL(feedUrl)
console.log(`Feed: ${feed.title} (${feed.items.length} items)`)

// RSS uses <pubDate>, Atom uses <updated>/<published>.
// rss-parser exposes both as item.pubDate and item.isoDate.
// isoDate is already a normalized ISO 8601 string, so prefer it.
const items = feed.items.map(item => ({
  title: item.title ?? '(untitled)',
  link: item.link ?? null,
  date: item.isoDate ?? null,
  author: item.creator ?? item.author ?? null,
  summary: item.contentSnippet ?? null
}))

for (const item of items.slice(0, 5)) {
  console.log(`${item.date} ${item.title}`)
}

Install the library and run the script:

npm install rss-parser
node parse-rss.mjs

What each step does
Construct the parser once with custom fields. rss-parser keeps the standard fields (title, link, pubDate, content) but drops elements it has no built-in mapping for, which covers many namespaced ones. Pass customFields.item to alias dc:creator to a plain creator key, and customFields.feed for feed-level extras like language. The keepArray: true option matters for repeated elements like media:content, where a single item can carry several images.
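With keepArray: true, the aliased item.media key holds an array even when an item has a single media:content element. The sketch below shows one way to read it. It assumes each element arrives as an xml2js-style object with its XML attributes under a `$` key, which is worth confirming by logging one item from your own feed. The `mediaUrls` helper and the sample item are hypothetical.

```javascript
// Hypothetical helper: collect the URLs from an item's media:content elements.
// Assumes each entry looks like { $: { url, medium, type } }, the shape
// xml2js-style parsers usually give to elements that carry attributes.
function mediaUrls(item) {
  const entries = Array.isArray(item.media) ? item.media : []
  return entries
    .map(entry => entry?.$?.url)
    .filter(url => typeof url === 'string' && url.length > 0)
}

// Made-up item, shaped like one parsed with
// ['media:content', 'media', { keepArray: true }].
const sampleItem = {
  title: 'Example post',
  media: [
    { $: { url: 'https://example.com/a.jpg', medium: 'image' } },
    { $: { url: 'https://example.com/b.jpg', medium: 'image' } },
    { $: {} } // an element without a url attribute is skipped
  ]
}

console.log(mediaUrls(sampleItem))
console.log(mediaUrls({ title: 'No media' })) // empty array, nothing throws
```

Returning an empty array for items without media keeps downstream code free of null checks.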
Call parseURL and read feed.items. parseURL fetches and parses in one await. The returned object splits into feed-level metadata (feed.title, feed.description, feed.link) and the per-item array (feed.items). There is also parseString if you already hold the XML, which is the path to take when you fetch the feed yourself through a proxy before parsing.
Prefer isoDate over pubDate. pubDate is the raw string the feed shipped, in whatever timezone and format the publisher chose. isoDate is that value parsed to ISO 8601. Sort, store, and compare on isoDate; show pubDate only if you want the publisher's original wording.
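Sorting on isoDate can be sketched as a small pure function over the items array. The `sortNewestFirst` name and the sample items are hypothetical, and pushing undated items to the end is a design choice rather than anything rss-parser prescribes.

```javascript
// Hypothetical helper: newest first, items without a parseable isoDate last.
function sortNewestFirst(items) {
  const time = item => {
    const parsed = Date.parse(item.isoDate ?? '')
    return Number.isNaN(parsed) ? Number.NEGATIVE_INFINITY : parsed
  }
  // Copy before sorting so the caller's array keeps its original order.
  return [...items].sort((a, b) => {
    const ta = time(a)
    const tb = time(b)
    if (tb > ta) return 1
    if (tb < ta) return -1
    return 0
  })
}

// Made-up items shaped like rss-parser output.
const sample = [
  { title: 'old', isoDate: '2024-01-05T08:00:00.000Z' },
  { title: 'undated' },
  { title: 'new', isoDate: '2024-03-01T12:30:00.000Z' }
]

console.log(sortNewestFirst(sample).map(item => item.title)) // new, old, undated
```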
Coalesce the fields that go missing. Real feeds omit fields. Use ?? to fall back: item.creator ?? item.author, item.title ?? '(untitled)'. This keeps one item with a missing author from throwing three steps later when you write to a database column.
Gotchas
Atom dates land in a different field than you expect.
- Issue: Reading item.pubDate on an Atom feed can give undefined or a value formatted differently from an RSS date string, because Atom uses <updated> and <published>, not RSS's <pubDate>.
- Fix: read item.isoDate, which rss-parser populates from whichever date element the feed used, already normalized to ISO 8601.
Namespaced fields silently vanish.
- Issue: media:content, dc:creator, and content:encoded can be absent from item even though they are in the XML, because the parser only surfaces fields it maps by default or that you declare.
- Fix: register them in customFields.item, for example ['content:encoded', 'fullContent'], and add { keepArray: true } for elements that repeat.
A 403 or HTML error page parses as an empty feed.
- Issue: Some hosts block the default User-Agent or return an HTML challenge page, and rss-parser either throws "Non-whitespace before first tag" or hands back zero items.
- Fix: set a browser User-Agent on the request, new Parser({ headers: { 'User-Agent': 'Mozilla/5.0' } }), or fetch the XML yourself and pass it to parser.parseString(xml).
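The fetch-it-yourself route could look like the sketch below, which fails loudly on a non-2xx status or an HTML body before the XML parser ever sees it. The `fetchAndParse` helper is hypothetical. It assumes Node 18+ for the global fetch and accepts any object with a parseString method, such as an rss-parser instance. The HTML check is a rough heuristic, not a full content-type negotiation.

```javascript
// Hypothetical helper: fetch the feed ourselves so a block page produces a
// clear error instead of a confusing parse failure or an empty feed.
// `parser` is anything with parseString(xml), e.g. an rss-parser instance.
// `fetchImpl` defaults to the global fetch available in Node 18+.
async function fetchAndParse(url, parser, fetchImpl = fetch) {
  const response = await fetchImpl(url, {
    headers: { 'User-Agent': 'Mozilla/5.0' }
  })
  if (!response.ok) {
    throw new Error(`Feed request failed with HTTP ${response.status}`)
  }
  const body = await response.text()
  // A challenge or error page starts with HTML, not an XML feed root.
  if (/^\s*(<!doctype html|<html)/i.test(body)) {
    throw new Error('Got an HTML page instead of a feed')
  }
  return parser.parseString(body)
}

// Usage with the real library (needs network access):
//   import Parser from 'rss-parser'
//   const feed = await fetchAndParse('https://hnrss.org/frontpage', new Parser())
```

Passing the parser and fetch function in as arguments also makes it straightforward to route the request through a proxy or to stub both in tests.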
One malformed entry rejects the whole feed.
- Issue: rss-parser is strict, so a single unescaped & or a stray tag makes parseURL reject and you lose every valid item in the feed.
- Fix: for feeds known to be messy, parse with a lenient library like feed-parser, which is built to tolerate real-world non-standard markup.
content and contentSnippet are not the same thing.
- Issue: item.content is HTML and may be truncated to the feed's <description>, while the full article body often sits in the namespaced content:encoded.
- Fix: map content:encoded as a custom field for the full body, and use item.contentSnippet (tags stripped) only when you want a plain-text preview.
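Choosing between the body fields can be sketched as a fallback chain. The `pickBody` helper is hypothetical. It assumes content:encoded was registered under the alias fullContent as in the fix above, and it also checks the raw 'content:encoded' key in case the parser exposes the element under its original name. The 200-character preview length is arbitrary.

```javascript
// Hypothetical helper: prefer the full body, then the feed's HTML content,
// and keep a short plain-text preview alongside it.
function pickBody(item) {
  const html = item.fullContent ?? item['content:encoded'] ?? item.content ?? null
  const preview = (item.contentSnippet ?? '').slice(0, 200)
  return { html, preview }
}

// Made-up items shaped like rss-parser output.
const withFullBody = {
  content: '<p>Short teaser</p>',
  contentSnippet: 'Short teaser',
  fullContent: '<p>The whole article body</p>'
}
const summaryOnly = {
  content: '<p>Short teaser</p>',
  contentSnippet: 'Short teaser'
}

console.log(pickBody(withFullBody).html) // the fullContent value
console.log(pickBody(summaryOnly).html)  // falls back to content
```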
Large or paginated feeds give you only the latest items.
- Issue: A feed usually returns only the publisher's most recent items (often 10 to 50), so a single poll misses everything older, and polling too rarely misses items that rotate out between polls.
- Fix: poll on a schedule, dedupe on item.guid (or item.link when guid is absent), and follow RFC 5005 <link rel="next"> pagination when the feed advertises it.
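The dedupe step can be sketched as a filter over a set of keys kept between polls. The `takeNewItems` helper and the sample polls are hypothetical, and the in-memory Set stands in for whatever store a real job would persist to. Skipping items that have neither guid nor link is a choice made here because they cannot be tracked.

```javascript
// Hypothetical helper: return only the items not seen in earlier polls.
// `seen` is a Set of keys the caller keeps between polls.
function takeNewItems(items, seen) {
  const fresh = []
  for (const item of items) {
    const key = item.guid ?? item.link
    // Items with no guid and no link cannot be tracked, so they are skipped.
    if (!key || seen.has(key)) continue
    seen.add(key)
    fresh.push(item)
  }
  return fresh
}

// Made-up polls shaped like feed.items from two consecutive runs.
const seen = new Set()
const firstPoll = [
  { guid: 'a1', title: 'First' },
  { link: 'https://example.com/second', title: 'Second' }
]
const secondPoll = [
  { guid: 'a1', title: 'First' },
  { guid: 'b2', title: 'Third' }
]

console.log(takeNewItems(firstPoll, seen).length) // 2
console.log(takeNewItems(secondPoll, seen).map(item => item.title)) // only Third
```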
Use this when
The site publishes a feed and you want its items: a blog reader, a news aggregator, a "new releases" notifier, a podcast index, or an ingest job that pulls fresh posts into a database on a schedule.
Skip this when
The site has no feed and you need to read the HTML itself (scrape the page to Markdown or extract structured fields instead); you want every article's full body rather than the feed summary (follow each item.link and extract the page); the feed is behind authentication (handle the session first); or you are generating a feed from a page rather than reading one (use a feed builder like the Feed library).