Simplescraper

How to deduplicate scraped records in JavaScript

Updated 2026-06-24 · 6 min read

If you have scraped the same site more than once, or followed pagination that overlaps at the edges, you have probably seen the same item show up two or three times in your results: the same product reached from different category pages, or the same article under a URL with a tracking parameter tacked on. Left in, those duplicates inflate your counts, skew any analysis, and break anything that expects one row per item.

The solution is to walk the records once and keep only the first occurrence of each item. A Set remembers the keys you have already kept: the normalized URL is the primary key, and a hash of the normalized title and body is a secondary key that catches copies arriving under a different URL. It comes to a few dozen lines of plain JavaScript, with nothing to install.

Key terms

  • Dedup key. A short string derived from the fields that identify a record, used to decide whether two records are the same thing rather than comparing whole objects.
  • Normalization. Reducing a value to a canonical form (lowercase the host, drop the fragment, strip tracking params, collapse whitespace) so cosmetic variants produce the same key.
  • Tracking parameters. Query parameters like utm_source, gclid, and fbclid that change the URL without changing the page, so they are stripped before keying.
  • Content hash. A short fingerprint of the normalized title and body, used as a fallback key when the URL is not a reliable identifier.
  • FNV-1a. A fast non-cryptographic hash used to turn a normalized string into a content key, fine for bucketing but not for security boundaries.

Here is what the script does:

  • Build a stable dedup key for each record from the fields that identify it, not the whole object.
  • Track keys you have already seen in a Set and keep only the first record for each key.
  • Normalize the URL before keying on it (lowercase the host, drop the fragment, strip tracking query parameters, sort what remains) so two URLs that point to the same page collapse together.
  • Register the URL key and a hash of the normalized title and body as a secondary key, so a later record is skipped if either key was already seen.

The complete script

js
// dedupe.mjs

// A batch of scraped records. Note the duplicates:
// - records 1 and 4 are the same page, one carries a ?utm_source tag
// - records 2 and 5 are the same article on two different URLs
const records = [
  { url: 'https://example.com/post/a', title: 'Hello World', body: 'First post body.' },
  { url: 'https://example.com/post/b', title: 'Second Post', body: 'Second post body.' },
  { url: 'https://example.com/post/c', title: 'Third Post', body: 'Third post body.' },
  { url: 'https://example.com/post/a?utm_source=newsletter', title: 'Hello World', body: 'First post body.' },
  { url: 'https://syndicated.example.net/2026/second-post', title: 'Second Post', body: 'Second post body.' }
]

// Normalize a URL into a stable key: lowercase the host, drop the
// fragment, and strip tracking params that do not change the page.
const TRACKING_PARAMS = new Set(['utm_source', 'utm_medium', 'utm_campaign', 'utm_term', 'utm_content', 'gclid', 'fbclid'])

function urlKey(rawUrl) {
  const u = new URL(rawUrl)
  u.hostname = u.hostname.toLowerCase()
  u.hash = ''
  for (const param of TRACKING_PARAMS) {
    u.searchParams.delete(param)
  }
  u.searchParams.sort() // ?a=1&b=2 and ?b=2&a=1 are the same page
  return u.toString()
}

// When the URL is not a reliable identifier (syndicated content,
// session IDs in the path), key on the content instead.
function contentKey(record) {
  const normalized = (record.title + '\n' + record.body)
    .toLowerCase()
    .replace(/\s+/g, ' ')
    .trim()
  return 'content:' + fnv1a(normalized)
}

// FNV-1a: a tiny, fast, non-cryptographic hash. Good enough to bucket
// normalized strings; do not use it where collisions are a security issue.
function fnv1a(str) {
  let hash = 0x811c9dc5
  for (let i = 0; i < str.length; i++) {
    hash ^= str.charCodeAt(i)
    hash = Math.imul(hash, 0x01000193)
  }
  return (hash >>> 0).toString(16)
}

const seen = new Set()
const unique = []

for (const record of records) {
  // Check both identity signals. A repeat URL or a repeat normalized title and body is a duplicate.
  const keys = record.url ? [urlKey(record.url), contentKey(record)] : [contentKey(record)]
  if (keys.some((key) => seen.has(key))) {
    continue
  }
  for (const key of keys) {
    seen.add(key)
  }
  unique.push(record)
}

console.log(`${records.length} records in, ${unique.length} unique out`)
for (const record of unique) {
  console.log(record.url)
}
bash
node dedupe.mjs

What each step does

Build the key from identifying fields, not the object. Two scraped records are rarely identical byte for byte, so comparing whole objects fails. Pick the fields that identify the item (a URL, an SKU, a title plus author) and derive one string from them. A bad key either merges distinct records or lets duplicates through.
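As a rough sketch of that idea for records that have no trustworthy URL, the helper below derives one key string from a SKU when present and from title plus author otherwise. The name identityKey and the fields sku, title, and author are assumptions for illustration, as is treating SKUs as case-insensitive; none of them come from the script above.

```javascript
// identityKey is a hypothetical helper; `sku`, `title`, and `author`
// are assumed field names, so adapt them to your own records.
function identityKey(record) {
  if (record.sku) {
    // Assumption: SKUs are case-insensitive in your catalog.
    return 'sku:' + String(record.sku).trim().toUpperCase()
  }
  const clean = (value) => String(value ?? '').toLowerCase().replace(/\s+/g, ' ').trim()
  // JSON.stringify keeps the field boundary explicit, so ('ab', 'c')
  // and ('a', 'bc') cannot collapse into the same key.
  return 'work:' + JSON.stringify([clean(record.title), clean(record.author)])
}

const first = { title: 'Dune', author: 'Frank Herbert', price: '9.99' }
const second = { title: ' dune ', author: 'Frank  Herbert', price: '$9.99' }

// Comparing objects: a Set holds references, so both records survive.
const byObject = new Set([first, second])
console.log(byObject.size) // 2

// Comparing keys: both records reduce to the same string.
const byKey = new Set([first, second].map(identityKey))
console.log(byKey.size) // 1
```

The price field differs between the two records and is deliberately left out of the key, since it describes the item rather than identifying it.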

Normalize the URL before you key on it. https://example.com/post/a and https://example.com/post/a?utm_source=newsletter are the same page. Lowercase the host, drop the #fragment, delete known tracking parameters, and sort the remaining query string so parameter order does not matter. The URL class does the parsing; you only decide the policy.
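To see that policy at work, here is a small self-contained check that repeats the urlKey function from the script and feeds it five cosmetic variants of one listing page. The example URLs and parameter values are made up for illustration.

```javascript
const TRACKING_PARAMS = new Set(['utm_source', 'utm_medium', 'utm_campaign', 'utm_term', 'utm_content', 'gclid', 'fbclid'])

// Same policy as the script: lowercase host, no fragment,
// no tracking params, remaining params in sorted order.
function urlKey(rawUrl) {
  const u = new URL(rawUrl)
  u.hostname = u.hostname.toLowerCase()
  u.hash = ''
  for (const param of TRACKING_PARAMS) {
    u.searchParams.delete(param)
  }
  u.searchParams.sort()
  return u.toString()
}

// Hypothetical variants of one page: mixed-case host, reordered params,
// tracking params, and a fragment.
const variants = [
  'https://example.com/list?page=2&sort=price',
  'https://EXAMPLE.com/list?sort=price&page=2',
  'https://example.com/list?page=2&sort=price&utm_source=newsletter',
  'https://example.com/list?page=2&sort=price#reviews',
  'https://example.com/list?gclid=abc123&sort=price&page=2'
]

const distinctKeys = new Set(variants.map(urlKey))
console.log(distinctKeys.size) // 1
console.log([...distinctKeys][0]) // https://example.com/list?page=2&sort=price
```

Note that only the host is lowercased. Paths and query values can be case-sensitive, so the sketch leaves them alone.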

Hash the content when the URL lies. Syndicated articles, session IDs baked into the path, and CMS permalink changes all break URL-based identity. Concatenate the fields that carry the meaning, collapse whitespace, lowercase, and run a fast non-cryptographic hash like FNV-1a. Records with the same normalized title and body now share a key regardless of where they were found.
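The following sketch repeats contentKey and fnv1a from the script and checks that a copy differing only in case and whitespace lands on the same key. The sample records are invented for the demonstration.

```javascript
// FNV-1a over UTF-16 code units, as in the script above.
function fnv1a(str) {
  let hash = 0x811c9dc5
  for (let i = 0; i < str.length; i++) {
    hash ^= str.charCodeAt(i)
    hash = Math.imul(hash, 0x01000193)
  }
  return (hash >>> 0).toString(16)
}

function contentKey(record) {
  const normalized = (record.title + '\n' + record.body)
    .toLowerCase()
    .replace(/\s+/g, ' ')
    .trim()
  return 'content:' + fnv1a(normalized)
}

// Hypothetical copies of one article: the syndicated one has a shouted
// title, doubled spaces, and a trailing newline.
const original = { title: 'Second Post', body: 'Second post body.' }
const syndicated = { title: 'SECOND POST ', body: 'Second   post body.\n' }

console.log(contentKey(original) === contentKey(syndicated)) // true
```

A real edit to the body, even a single word, changes the normalized string and therefore almost always the key, which is why this approach only catches copies that match after normalization and not near-duplicates.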

Register both keys per record. Adding the URL key and the content key to the same Set means a later copy is caught whether it repeats the URL or the normalized title and body. The first record through wins and is pushed to unique; every later record with either key already present is skipped.

Gotchas

  • new Set(records) deduplicates nothing.

    • Issue: Passing an array of objects to new Set() compares by reference, so two records with identical fields are different objects and both survive.
    • Fix: build a primitive key string per record and store the key in the Set, not the object: with key set to urlKey(record.url), test seen.has(key) and then call seen.add(key).
  • Tracking parameters split one page into many keys.

    • Issue: ?utm_source, ?gclid, and ?fbclid make every share of a URL look unique, so the same article is stored once per referrer.
    • Fix: strip known tracking params in the key function and searchParams.sort() so parameter order does not create false distinctions.
  • A cryptographic hash on every record is usually overkill.

    • Issue: Reaching for crypto.createHash('sha256') per record adds measurable overhead across millions of items and usually buys little, since dedup keys are not a security boundary.
    • Fix: use a fast non-cryptographic hash (FNV-1a, or the xxhash package for large bodies); reserve SHA-256 for cases where an adversary could craft collisions.
  • The first-seen record may not be the one you want.

    • Issue: Keeping the first record per key means a later duplicate with a cleaner title or a fuller body is thrown away.
    • Fix: when records compete, merge instead of skip: keep the key, and on a hit pick the better field values (longer body, non-null author) rather than discarding the new record outright.
  • A Bloom filter can drop records it has not seen.

    • Issue: filter.has(key) can return true for keys that were not added (a false positive), so a Bloom-filtered run silently loses a fraction of genuinely unique records.
    • Fix: size the filter for the true expected count and accept the stated error rate only where a missed record is recoverable; for exact dedup that has outgrown memory, use a persistent store like better-sqlite3 with a UNIQUE index instead.
  • A restart loses the in-memory Set and re-emits everything.

    • Issue: The seen set lives in heap, so a crash or a fresh process starts empty and every record already written is treated as new.
    • Fix: persist seen keys across runs in an embedded store, such as a UNIQUE column in better-sqlite3, and load it on startup.
  • Whitespace and case differences defeat the content key.

    • Issue: Two copies of the same article differ by a trailing newline or a capitalized title, so their hashes diverge and both pass through.
    • Fix: normalize before hashing: lowercase, collapse runs of whitespace to a single space, and trim, so cosmetic differences do not change the key.
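For the "first-seen record may not be the one you want" gotcha, one possible shape for merge-instead-of-skip is sketched below. It swaps the Set for a Map from key to kept record. The function name dedupeWithMerge, the keyOf parameter, and the author field are assumptions, and for brevity it uses a single key per record rather than the URL-plus-content pair from the main script.

```javascript
// dedupeWithMerge is a hypothetical helper. `keyOf` stands in for
// whichever key function you use (urlKey, contentKey, ...).
function dedupeWithMerge(records, keyOf) {
  const byKey = new Map() // Map keeps insertion order, so first-seen order survives
  for (const record of records) {
    const key = keyOf(record)
    const kept = byKey.get(key)
    if (!kept) {
      byKey.set(key, { ...record }) // copy, so the input is never mutated
      continue
    }
    // On a hit, keep the better field values instead of discarding the record.
    if ((record.body ?? '').length > (kept.body ?? '').length) {
      kept.body = record.body
    }
    if (kept.author == null && record.author != null) {
      kept.author = record.author
    }
  }
  return [...byKey.values()]
}

// Made-up sample: the third record repeats the first URL with a fuller body.
const scraped = [
  { url: 'https://example.com/post/a', title: 'Hello World', body: 'Teaser.', author: null },
  { url: 'https://example.com/post/b', title: 'Second Post', body: 'Second post body.' },
  { url: 'https://example.com/post/a', title: 'Hello World', body: 'The full first post body.', author: 'Ana' }
]

const merged = dedupeWithMerge(scraped, (record) => record.url)
console.log(merged.length) // 2
console.log(merged[0].body) // The full first post body.
console.log(merged[0].author) // Ana
```

"Longer body wins" is only one merge policy; a timestamp or a source-priority list may suit your data better.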

Use this when

You are merging scraped records from multiple runs, pages, or sources and need one entry per item: deduplicating a crawl frontier, collapsing syndicated articles, or cleaning a product feed before it lands in a database. A Set is the right tool while every key fits in memory; in V8, a Set of string keys commonly uses roughly 50-100 bytes per entry before counting the string contents.

Skip this when

  • The keyset is too large for heap and a missed record is unacceptable. Use a persistent store instead, such as a UNIQUE index in better-sqlite3 or an LMDB database.
  • The duplicates are similar rather than matching on the normalized title and body, such as related product titles. A single hash cannot pair them; you need similarity scoring or MinHash.
  • The database already enforces uniqueness. An upsert on a unique constraint deduplicates at write time.
  • Dedup state has to be shared across several workers, or has outgrown a single process's heap. A Redis SET shared across workers is the right primitive.
