How to extract all links from a page in JavaScript

Updated 2026-06-25 · 4 min read

If you have tried to pull the links off a page, you have probably hit the part nobody mentions: half the hrefs come back as /wiki/Main_Page or ../about, not full URLs, and the same destination shows up more than once because one copy has #section on the end. Reading the raw href attribute gives you a list you then have to clean by hand before it is usable.

The fix is to parse the HTML and resolve every anchor against the page URL so relative paths become absolute, then drop the duplicates and anything that is not a web link. We'll build a small script that does three things: it fetches the page with a normal browser User-Agent so the server returns the full markup; it reads the href off every anchor and resolves it to an absolute URL, honoring a <base> tag if the page sets one; and it strips fragments, drops non-web schemes, and dedupes down to a clean same-host list. It takes about 30 lines of Node.js (version 18 or later, for the built-in fetch) with cheerio, a server-side HTML parser with a jQuery-style API; the same approach works with jsdom if you already have a DOM in hand.

The complete script

// extract-links.mjs
import * as cheerio from 'cheerio'

const url = 'https://en.wikipedia.org/wiki/Web_scraping'

// fetch the HTML with a normal browser User-Agent so the server returns the full page.
const html = await fetch(url, {
  headers: { 'User-Agent': 'Mozilla/5.0' }
}).then(r => r.text())

const $ = cheerio.load(html)

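// a <base href> on the page overrides the document URL for relative links. honor it.
// the href can itself be relative (e.g. "/docs/"), so resolve it against the page URL first.
const base = new URL($('base[href]').attr('href') || url, url).href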

const links = new Set()
$('a[href]').each((_, el) => {
  const raw = $(el).attr('href')
  let resolved
  try {
    resolved = new URL(raw, base) // resolves "/wiki/Foo" against the base
  } catch {
    return // skip malformed values the parser rejects
  }
  if (resolved.protocol !== 'http:' && resolved.protocol !== 'https:') return // drop mailto:, tel:, javascript:
  resolved.hash = '' // treat "/page" and "/page#section" as one link
  links.add(resolved.href)
})

// keep only links on the same host as the page you fetched.
const origin = new URL(url).origin
const internal = [...links].filter(link => new URL(link).origin === origin)

console.log('total unique links:', links.size)
console.log('same-host links:', internal.length)
console.log(internal.slice(0, 10).join('\n'))

Install cheerio and run it:

npm install cheerio
node extract-links.mjs

How it works

Set a normal browser User-Agent. A bare fetch() from Node sends node as its User-Agent, and some sites return a blocked stub for that. A plain Mozilla string makes the server hand back the same HTML a browser would get. This is politeness, not stealth.
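
A blocked stub is easy to miss because the script above parses whatever body comes back. The sketch below wraps the fetch so a bad response fails loudly instead. It assumes a blocked request shows up as a non-2xx status or a non-HTML content type, which will not always be true, and the names fetchHtml and fetchImpl are illustrative rather than part of the script.

```javascript
// Sketch only: `fetchHtml` and its `fetchImpl` parameter are hypothetical names.
// `fetchImpl` defaults to the global fetch (Node 18+) and exists so the function
// can be exercised with a stub instead of a real network call.
async function fetchHtml(url, fetchImpl = fetch) {
  const res = await fetchImpl(url, {
    headers: { 'User-Agent': 'Mozilla/5.0' }
  })
  // A non-2xx status usually means the body is an error or block page.
  if (!res.ok) {
    throw new Error(`Request failed: ${res.status} for ${url}`)
  }
  // Guard against JSON, PDFs, and other bodies cheerio would parse into nothing useful.
  const type = res.headers.get('content-type') || ''
  if (!type.includes('html')) {
    throw new Error(`Expected HTML but got "${type}" for ${url}`)
  }
  return res.text()
}
```

In the script, const html = await fetchHtml(url) would take the place of the bare fetch call.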

Read the raw href with cheerio. $(el).attr('href') returns the attribute exactly as written in the markup, so a link to /wiki/Foo comes back as /wiki/Foo, not the full URL. Most people stop at this step, which is why a plain attribute grab gives you a half-relative list. One thing this approach can't see is anything added after load: fetch only gets the server's initial HTML, so a React or Vue page that builds its nav and listings in the browser hands back a near-empty set of <a> tags. For those pages, render with Puppeteer or Playwright first and pass the result of page.content() to cheerio.

Resolve against the base. new URL(raw, base) turns /wiki/Foo into https://en.wikipedia.org/wiki/Foo and handles ../, query-only, and protocol-relative //host paths the way a browser does. If the page carries a <base href> tag, relative links resolve against that instead of the document URL, so the script reads the tag first and falls back to the fetched URL when there is none; ignore it and a whole crawl points at the wrong paths. If you already parse with jsdom elsewhere, skip cheerio and reuse that DOM: jsdom resolves links for you when you pass { url }, so a.href is already absolute and it honors the <base> tag too.
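
To make the resolution rules concrete, here is a small sketch of how new URL(raw, base) treats each kind of href described above. The helper effectiveBase and the example.com URLs are illustrative and not part of the script. The sketch also assumes that a <base href> value can itself be relative, so it resolves that value against the document URL before using it.

```javascript
// Sketch only: `effectiveBase` is a hypothetical helper.
// It returns the URL that relative links on a page should resolve against.
function effectiveBase(documentUrl, baseHref) {
  // No <base href> (or an empty one): fall back to the document URL.
  if (!baseHref) return documentUrl
  // The base href may be relative ("/docs/"), so resolve it against the document first.
  return new URL(baseHref, documentUrl).href
}

const pageUrl = 'https://en.wikipedia.org/wiki/Web_scraping'

const resolved = {
  // "/wiki/Foo" replaces the whole path.
  rootRelative: new URL('/wiki/Foo', pageUrl).href,
  // "../about" climbs out of /wiki/ to the site root.
  parentRelative: new URL('../about', pageUrl).href,
  // "?action=history" keeps the path and swaps the query string.
  queryOnly: new URL('?action=history', pageUrl).href,
  // "//host/path" keeps only the scheme of the base.
  protocolRelative: new URL('//commons.wikimedia.org/wiki/Main_Page', pageUrl).href,
  // With <base href="/docs/">, "setup.html" lands under /docs/, not /guide/.
  withBaseTag: new URL(
    'setup.html',
    effectiveBase('https://example.com/guide/intro', '/docs/')
  ).href
}

console.log(resolved)
```

Every value in the printed object is an absolute URL, and the last one shows why the <base> tag matters: without it, setup.html would resolve to https://example.com/guide/setup.html instead of https://example.com/docs/setup.html.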

Filter, strip, and dedupe. The protocol check drops mailto:, tel:, and javascript: links, which the URL constructor parses without complaint. Clearing hash collapses /page and /page#section into one entry, and it turns a bare href="#" into the page's own URL instead of a separate entry. The Set removes the remaining duplicates. The origin filter keeps only links that share the page's scheme, host, and port, so an http:// link to the same host as an https:// page is dropped along with the external ones. On the Wikipedia example that takes 464 anchors down to 318 unique links and 222 same-host links.
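
The same filtering can be tried on a handful of hrefs without fetching anything. This is a minimal sketch that mirrors the loop in the script; cleanLinks, the sample list, and the example.com URLs are made up for illustration.

```javascript
// Sketch only: `cleanLinks` is a hypothetical helper that mirrors the script's loop.
function cleanLinks(rawHrefs, base) {
  const unique = new Set()
  for (const raw of rawHrefs) {
    let resolved
    try {
      resolved = new URL(raw, base)
    } catch {
      continue // skip values the URL parser rejects
    }
    // Keep web links only; mailto:, tel:, and javascript: parse fine but are not pages.
    if (resolved.protocol !== 'http:' && resolved.protocol !== 'https:') continue
    resolved.hash = '' // "/page" and "/page#section" become the same string
    unique.add(resolved.href)
  }
  return [...unique]
}

const samplePage = 'https://example.com/start'
const sample = [
  '/page',
  '/page#section',            // duplicate of /page once the fragment is cleared
  '#',                        // resolves to the page itself
  'mailto:hi@example.com',    // dropped by the protocol check
  'javascript:void(0)',       // dropped by the protocol check
  'tel:+15550100',            // dropped by the protocol check
  'http://[',                 // malformed host, rejected by the URL constructor
  'https://other.example/x'   // valid, but on a different origin
]

const all = cleanLinks(sample, samplePage)
const sameOrigin = all.filter(
  link => new URL(link).origin === new URL(samplePage).origin
)

console.log(all)        // three unique links
console.log(sameOrigin) // the two on https://example.com
```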

Use this when

You want the anchor links from a server-rendered page as a clean, deduplicated, absolute-URL list, for seeding a crawler, auditing internal links, or building a sitemap from a page that does not publish one.

Skip this when

The page builds its links in the browser (render with Puppeteer first, then parse); you need to follow every page across a whole site rather than one page (use a sitemap walker with a queue); you want URLs mentioned in body text rather than in <a> tags (match them out of the text instead); or you need link previews with titles and thumbnails (fetch each target and read its Open Graph tags).
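
For the body-text case, a regular expression is usually enough. The sketch below is deliberately simple and is not a full URL grammar: the pattern and the helper name urlsInText are assumptions, it only finds http and https URLs, and trimming trailing punctuation will clip URLs that legitimately end in a closing parenthesis.

```javascript
// Sketch only: a simple pattern for http(s) URLs in plain text, not a full URL grammar.
const URL_IN_TEXT = /https?:\/\/[^\s<>"']+/g

// `urlsInText` is a hypothetical helper: it returns the unique URLs found in a string.
function urlsInText(text) {
  const found = new Set()
  for (const match of text.match(URL_IN_TEXT) || []) {
    // Sentence punctuation right after a URL is usually not part of it.
    found.add(match.replace(/[.,;:!?)\]]+$/, ''))
  }
  return [...found]
}

console.log(
  urlsInText('Docs live at https://example.com/docs, mirrored at http://example.org/docs.')
)
```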
