How to deduplicate scraped records in JavaScript

Updated 2026-06-24 · 6 min read

If you've scraped the same site more than once, or followed pagination that overlaps at the edges, you have probably seen the same item show up two or three times in your results: the same product reached from different category pages, or the same article under a URL with a tracking parameter tacked on. Left in, those duplicates inflate your counts, skew any analysis, and break anything that expects one row per item.

The solution is to walk the records once and keep only the first occurrence of each item. We'll build a small script that derives a stable key from the fields that identify a record rather than comparing whole objects. It normalizes each URL into a canonical form so two links to the same page collapse together, adds a content hash of the normalized title and body for cases where the URL is not a reliable identifier, and remembers every key it has kept in a Set so any later copy that repeats either key is skipped. It comes to about 50 lines of plain JavaScript, with nothing to install.

The complete script

// dedupe.mjs

// a batch of scraped records. note the duplicates:
// - records 1 and 4 are the same page, one carries a ?utm_source tag
// - records 2 and 5 are the same article on two different URLs
const records = [
  { url: 'https://example.com/post/a', title: 'Hello World', body: 'First post body.' },
  { url: 'https://example.com/post/b', title: 'Second Post', body: 'Second post body.' },
  { url: 'https://example.com/post/c', title: 'Third Post', body: 'Third post body.' },
  { url: 'https://example.com/post/a?utm_source=newsletter', title: 'Hello World', body: 'First post body.' },
  { url: 'https://syndicated.example.net/2026/second-post', title: 'Second Post', body: 'Second post body.' }
]

// normalize a URL into a stable key: lowercase the host, drop the
// fragment, and strip tracking params that do not change the page.
const TRACKING_PARAMS = new Set(['utm_source', 'utm_medium', 'utm_campaign', 'utm_term', 'utm_content', 'gclid', 'fbclid'])

function urlKey(rawUrl) {
  const u = new URL(rawUrl)
  u.hostname = u.hostname.toLowerCase()
  u.hash = ''
  for (const param of TRACKING_PARAMS) {
    u.searchParams.delete(param)
  }
  u.searchParams.sort() // ?a=1&b=2 and ?b=2&a=1 are the same page
  return u.toString()
}

// when the URL is not a reliable identifier (syndicated content,
// session IDs in the path), key on the content instead.
function contentKey(record) {
  const normalized = (record.title + '\n' + record.body)
    .toLowerCase()
    .replace(/\s+/g, ' ')
    .trim()
  return 'content:' + fnv1a(normalized)
}

// FNV-1a: a tiny, fast, non-cryptographic hash. good enough to bucket
// normalized strings; do not use it where collisions are a security issue.
function fnv1a(str) {
  let hash = 0x811c9dc5
  for (let i = 0; i < str.length; i++) {
    hash ^= str.charCodeAt(i)
    hash = Math.imul(hash, 0x01000193)
  }
  return (hash >>> 0).toString(16)
}

const seen = new Set()
const unique = []

for (const record of records) {
  // check both identity signals. a repeat URL or a repeat normalized body is a duplicate.
  const keys = record.url ? [urlKey(record.url), contentKey(record)] : [contentKey(record)]
  if (keys.some((key) => seen.has(key))) {
    continue
  }
  for (const key of keys) {
    seen.add(key)
  }
  unique.push(record)
}

console.log(`${records.length} records in, ${unique.length} unique out`)
for (const record of unique) {
  console.log(record.url)
}

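Run it with Node.js:

node dedupe.mjs

With the sample batch above, it prints:

5 records in, 3 unique out
https://example.com/post/a
https://example.com/post/b
https://example.com/post/c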

How it works

Build the key from identifying fields, not the object. Comparing whole objects fails in both directions. Two copies of the same item are rarely identical byte for byte, and even when every field matches, new Set() compares objects by reference, so two records with identical fields are different objects and both survive. Instead, pick the fields that identify the item (a URL, an SKU, a title plus author) and derive one string from them. Choose them carefully: a bad key either merges distinct records or lets duplicates through.
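As a rough illustration of that point, the sketch below shows reference comparison letting two identical records through, then collapsing them with a derived key. The fieldKey helper and the sku field are hypothetical, introduced only for this example.

```javascript
// Two records with identical fields are still two distinct objects.
const first = { sku: 'AB-100', title: 'Blue Widget' }
const second = { sku: 'AB-100', title: 'Blue Widget' }
console.log(new Set([first, second]).size) // 2: Set compares objects by reference

// Hypothetical helper: derive one string from the identifying fields.
// JSON.stringify keeps field boundaries unambiguous, so the pairs
// ['ab', 'c'] and ['a', 'bc'] do not produce the same key.
function fieldKey(record, fields) {
  const parts = fields.map((field) => String(record[field] ?? '').trim().toLowerCase())
  return JSON.stringify(parts)
}

const keys = new Set([first, second].map((record) => fieldKey(record, ['sku'])))
console.log(keys.size) // 1: both records share the key '["ab-100"]'
```

Lowercasing and trimming here are assumptions about what counts as the same identifier; drop them if your identifiers are case-sensitive.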

Normalize the URL before you key on it. https://example.com/post/a and https://example.com/post/a?utm_source=newsletter are the same page. Lowercase the host, drop the #fragment, delete known tracking parameters, and sort the remaining query string so parameter order does not matter. Left in, parameters like utm_source, gclid, and fbclid make every share of a URL look unique and store the same article once per referrer. The URL class does the parsing; you only decide the policy.
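To make the split between parsing and policy concrete, here is a small sketch of what the URL class in Node.js normalizes on its own and what it leaves to you. The stripTrailingSlash helper is hypothetical and encodes just one possible policy; whether /post/a and /post/a/ are the same page depends on the site.

```javascript
// Normalized by the parser itself: scheme case, host case, default port.
console.log(new URL('HTTPS://Example.COM:443/post/a').href)
// https://example.com/post/a

// Left untouched: path case and trailing slashes are policy decisions.
console.log(new URL('https://example.com/Post/A').pathname) // /Post/A
console.log(new URL('https://example.com/post/a/').pathname) // /post/a/

// Hypothetical policy helper: treat one trailing slash as insignificant,
// except on the root path, which must stay '/'.
function stripTrailingSlash(rawUrl) {
  const u = new URL(rawUrl)
  if (u.pathname.length > 1 && u.pathname.endsWith('/')) {
    u.pathname = u.pathname.slice(0, -1)
  }
  return u.href
}

console.log(
  stripTrailingSlash('https://example.com/post/a/') ===
    stripTrailingSlash('https://example.com/post/a')
) // true
```

Because the parser already lowercases the host, the explicit toLowerCase() in the script is a harmless safeguard rather than a requirement. Note also that new URL() throws a TypeError on relative or malformed input, so resolve scraped hrefs against the page they came from, for example with new URL(href, pageUrl), before keying on them.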

Hash the content when the URL lies. Syndicated articles, session IDs baked into the path, and CMS permalink changes all break URL-based identity. Concatenate the fields that carry the meaning, lowercase them, collapse whitespace, and run the result through a fast non-cryptographic hash like FNV-1a. Without the normalization, a trailing newline or a capitalized title in one copy would change the hash and let both copies through. Records with the same normalized title and body now share a key regardless of where they were found. A per-record crypto.createHash('sha256') adds measurable overhead across millions of items and buys little here, since these keys are not a security boundary; reserve it for cases where an adversary could craft collisions, and reach for the xxhash package on large bodies.
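A short sketch of why the normalization step matters before hashing. It repeats the script's fnv1a so that it runs on its own; normalizeText is a hypothetical name for the lowercase-and-collapse step, and the two sample strings are made up.

```javascript
// Same FNV-1a as in the script, repeated so this block is self-contained.
function fnv1a(str) {
  let hash = 0x811c9dc5
  for (let i = 0; i < str.length; i++) {
    hash ^= str.charCodeAt(i)
    hash = Math.imul(hash, 0x01000193)
  }
  return (hash >>> 0).toString(16)
}

// Hypothetical helper: lowercase, collapse whitespace runs, trim the ends.
function normalizeText(text) {
  return text.toLowerCase().replace(/\s+/g, ' ').trim()
}

const original = 'Second Post\nSecond post body.'
const syndicated = 'SECOND POST\nSecond  post body.\n'

console.log(original === syndicated) // false: the raw text differs
console.log(normalizeText(original) === normalizeText(syndicated)) // true
console.log(fnv1a(normalizeText(original)) === fnv1a(normalizeText(syndicated))) // true
```

One caveat worth keeping in mind: this hash is only 32 bits wide, so unrelated texts can collide by chance once a run holds many tens of thousands of distinct bodies, and a collision here silently drops a real record. For large corpora a wider hash is the safer choice.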

Register both keys per record. Adding the URL key and the content key to the same Set means a later copy is caught whether it repeats the URL or the normalized title and body. The first record through wins and is pushed to unique; every later record with either key already present is skipped. Keeping the first record per key discards a later duplicate that may have a cleaner title or fuller body, so when records compete, merge instead of skip: keep the key, and on a hit pick the better field values rather than dropping the new record. The seen set also lives only in process memory, so after a crash or in a fresh process it starts empty and the script re-emits everything already written. To avoid that, persist the keys across runs in an embedded store, such as a SQLite table with a UNIQUE column accessed through better-sqlite3, and load them on startup.
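The merge-instead-of-skip idea could be sketched as follows, swapping the Set for a Map from key to the kept record. The names mergeInto and dedupeWithMerge are hypothetical, and "better" is assumed here to mean the longer string, which is only one possible heuristic.

```javascript
// Hypothetical merge policy: for each string field other than url,
// keep whichever value is longer after trimming.
function mergeInto(kept, incoming) {
  for (const [field, value] of Object.entries(incoming)) {
    if (field === 'url' || typeof value !== 'string') continue
    const current = typeof kept[field] === 'string' ? kept[field] : ''
    if (value.trim().length > current.trim().length) kept[field] = value
  }
}

// keysOf(record) returns every identity key for a record,
// for example [urlKey(record.url), contentKey(record)].
function dedupeWithMerge(records, keysOf) {
  const keptByKey = new Map() // key -> the record kept for that key
  const unique = []
  for (const record of records) {
    const keys = keysOf(record)
    const hit = keys.find((key) => keptByKey.has(key))
    // On a miss, keep a copy so the input batch is not mutated.
    const kept = hit !== undefined ? keptByKey.get(hit) : { ...record }
    if (hit !== undefined) {
      mergeInto(kept, record) // enrich the kept record instead of dropping this one
    } else {
      unique.push(kept)
    }
    // Register any keys not seen yet so later copies match on either signal.
    for (const key of keys) {
      if (!keptByKey.has(key)) keptByKey.set(key, kept)
    }
  }
  return unique
}

const merged = dedupeWithMerge(
  [
    { url: 'https://example.com/post/a', title: 'Hello', body: '' },
    { url: 'https://example.com/post/a', title: 'Hello World', body: 'First post body.' }
  ],
  (record) => [record.url]
)
console.log(merged.length) // 1
console.log(merged[0].title) // Hello World
```

This sketch merges into the first matching record only. If one incoming record matches two different kept records (one by URL, another by content), those two stay separate, and you would need an extra step to join them.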

Use this when

You are merging scraped records from multiple runs, pages, or sources and need one entry per item: deduplicating a crawl frontier, collapsing syndicated articles, or cleaning a product feed before it lands in a database. A Set is the right tool while every key fits in memory; in V8, a Set of string keys commonly uses roughly 50-100 bytes per entry before counting the string contents.
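Per-entry cost depends on key length and on the V8 version, so it can be worth measuring with keys shaped like your own. Below is a rough sketch using process.memoryUsage(); the function name and the sample key shape are assumptions, the figure includes the string contents, and garbage-collection timing makes it noisy.

```javascript
// Rough estimate of heap growth per Set entry for URL-like string keys.
function measureSetOverhead(count) {
  const before = process.memoryUsage().heapUsed
  const seen = new Set()
  for (let i = 0; i < count; i++) {
    seen.add('https://example.com/post/' + i) // assumed key shape
  }
  const after = process.memoryUsage().heapUsed
  return { entries: seen.size, bytesPerEntry: (after - before) / count }
}

console.log(measureSetOverhead(200000))
```

Multiply the measured figure by the number of keys you expect to hold to judge whether the set fits comfortably in the heap. Run it a few times, since a garbage collection in the middle of the loop can skew a single reading.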

Skip this when

The keyset is too large for the heap and a missed record is unacceptable, in which case use a persistent UNIQUE index in better-sqlite3 or LMDB; the duplicates are similar rather than matching on the normalized title and body, such as related product titles, where you need similarity scoring or MinHash instead of a single hash; the database already enforces uniqueness, where an upsert on a unique constraint deduplicates at write time; or several workers need exact dedup at a scale where in-memory state will not fit, where a Redis SET shared across them is the right primitive.
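To show why a single hash cannot handle near-duplicates, here is a minimal sketch of similarity scoring with Jaccard similarity over word pairs. The function names and sample titles are made up for illustration, and this is the plain pairwise measure that MinHash approximates, not MinHash itself.

```javascript
// Split text into a set of overlapping word pairs (2-word shingles).
function shingles(text, size = 2) {
  const words = text.toLowerCase().split(/\s+/).filter(Boolean)
  const out = new Set()
  for (let i = 0; i + size <= words.length; i++) {
    out.add(words.slice(i, i + size).join(' '))
  }
  return out
}

// Jaccard similarity: shared shingles divided by total distinct shingles.
function jaccard(a, b) {
  if (a.size === 0 && b.size === 0) return 1
  let shared = 0
  for (const item of a) {
    if (b.has(item)) shared++
  }
  return shared / (a.size + b.size - shared)
}

const left = shingles('Acme Blue Widget 500ml')
const right = shingles('Acme Blue Widget 500ml Bottle')
console.log(jaccard(left, right)) // 0.75: three of four distinct shingles are shared
```

An exact content hash would treat these two titles as unrelated, while a score compared against a threshold you tune can flag them for merging. Comparing every pair grows quadratically with the number of records, which is the cost MinHash is designed to avoid.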
