How to extract all links from a page in JavaScript
If you have tried to pull the links off a page, you have probably hit the part nobody mentions: half the hrefs come back as /wiki/Main_Page or ../about, not full URLs, and the same destination shows up more than once because one copy has #section on the end. Reading the raw href attribute gives you a list you then have to clean by hand before it is usable.
The solution is to parse the HTML, read every anchor's href, resolve it against the page URL with the WHATWG URL constructor so relative paths become absolute, then drop the duplicates and the non-link schemes. It takes about 20 lines of Node.js with cheerio, and a jsdom version is included below for when you already have a DOM in hand.
Key terms
- cheerio. A server-side HTML parser with a jQuery-style API, built on parse5 and htmlparser2, which loads a string and lets you query it with CSS selectors.
- href attribute vs href property. The attribute is the raw value in the markup (often relative); the URL-resolved property is the absolute address. cheerio gives you the attribute, so you resolve it yourself.
- base tag. A <base href> element in the page head that changes what relative links resolve against, which silently shifts every URL if you ignore it.
Here is what the script does:
- Fetch the page's HTML with a normal browser User-Agent so the server returns the full page.
- Parse the HTML with cheerio and read the href off every <a> element.
- Resolve each href to an absolute URL with the URL constructor, honoring a <base> tag if the page has one.
- Drop non-http schemes, strip the fragment, and dedupe with a Set, then filter to same-host links.
The complete script
// extract-links.mjs
import * as cheerio from 'cheerio'

const url = 'https://en.wikipedia.org/wiki/Web_scraping'

// Fetch the HTML with a normal browser User-Agent so the server returns the full page.
const html = await fetch(url, {
  headers: { 'User-Agent': 'Mozilla/5.0' }
}).then(r => r.text())

const $ = cheerio.load(html)

// A <base href> on the page overrides the document URL for relative links. Honor it.
// The base href can itself be relative, so resolve it against the page URL first.
const base = new URL($('base[href]').attr('href') || url, url).href

const links = new Set()

$('a[href]').each((_, el) => {
  const raw = $(el).attr('href')
  let resolved
  try {
    resolved = new URL(raw, base) // resolves "/wiki/Foo" against the base
  } catch {
    return // skip malformed values the parser rejects
  }
  if (resolved.protocol !== 'http:' && resolved.protocol !== 'https:') return // drop mailto:, tel:, javascript:
  resolved.hash = '' // treat "/page" and "/page#section" as one link
  links.add(resolved.href)
})

// Keep only links on the same host as the page you fetched.
const origin = new URL(url).origin
const internal = [...links].filter(link => new URL(link).origin === origin)

console.log('total unique links:', links.size)
console.log('same-host links:', internal.length)
console.log(internal.slice(0, 10).join('\n'))

Install the dependency and run the script:

npm install cheerio
node extract-links.mjs

What each step does
Set a normal browser User-Agent. A bare fetch() from Node sends node as its User-Agent, and some sites return a blocked stub for that. A plain Mozilla string makes the server hand back the same HTML a browser would get. This is politeness, not stealth.
Read the raw href with cheerio. $(el).attr('href') returns the attribute exactly as written in the markup, so a link to /wiki/Foo comes back as /wiki/Foo, not the full URL. Resolving that value is the step people skip, and it is why a plain attribute grab gives you a half-relative list.
Resolve against the base. new URL(raw, base) turns /wiki/Foo into https://en.wikipedia.org/wiki/Foo. If the page carries a <base href> tag, relative links resolve against that instead of the document URL, so the script reads the tag first and falls back to the fetched URL when there is none.
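To see the base rule on its own, here is a minimal sketch that uses only the built-in URL class. The helper name resolveHref and the example.com URLs are made up for illustration and are not part of the script. It also assumes you want a relative <base href> resolved against the page URL first, which is how browsers treat it.

```javascript
// Hypothetical helper: resolve one raw href given the page URL and the
// value of the page's <base href> attribute (undefined when there is none).
function resolveHref(raw, pageUrl, baseHref) {
  // A <base href> may itself be relative, so resolve it against the page URL first.
  const base = baseHref ? new URL(baseHref, pageUrl).href : pageUrl
  return new URL(raw, base).href
}

const pageUrl = 'https://example.com/blog/post.html' // assumed example page

// No <base> tag: relative links resolve against the page URL.
console.log(resolveHref('page1', pageUrl))     // https://example.com/blog/page1
console.log(resolveHref('/wiki/Foo', pageUrl)) // https://example.com/wiki/Foo

// Absolute <base href>: relative links resolve against it instead.
console.log(resolveHref('page1', pageUrl, 'https://example.com/docs/')) // https://example.com/docs/page1

// Relative <base href>: resolved against the page URL, then used as the base.
console.log(resolveHref('page1', pageUrl, '/docs/')) // https://example.com/docs/page1
```

The same href, page1, lands in three different places depending on the base, which is why the base has to be settled before any link is resolved.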
Filter, strip, and dedupe. The protocol check drops mailto:, tel:, and javascript: links that the URL constructor parses without complaint. Clearing hash collapses /page and /page#section into one entry. The Set removes the rest of the duplicates, and the origin filter keeps only same-host links. On the Wikipedia example that takes 464 anchors down to 318 unique links and 222 same-host links.
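If you want to try the filter, strip, and dedupe steps without fetching anything, the sketch below runs the same logic over a plain array of strings. The function name normalizeLinks and the sample hrefs are assumptions for illustration, not values from the Wikipedia page.

```javascript
// Hypothetical helper: turn raw href strings into unique absolute http(s) URLs.
function normalizeLinks(rawHrefs, base) {
  const unique = new Set()
  for (const raw of rawHrefs) {
    let u
    try {
      u = new URL(raw, base)
    } catch {
      continue // skip values the URL parser rejects
    }
    if (u.protocol !== 'http:' && u.protocol !== 'https:') continue // drop mailto:, tel:, javascript:
    u.hash = '' // "/page" and "/page#section" become the same string
    unique.add(u.href)
  }
  return [...unique]
}

// Assumed sample input: a duplicate with a fragment, a mailto link,
// an off-site link, and one malformed value.
const sample = ['/page', '/page#section', 'mailto:hi@example.com', 'https://other.example/x', 'http://[bad']

const all = normalizeLinks(sample, 'https://example.com/')
const sameHost = all.filter(link => new URL(link).origin === 'https://example.com')

console.log(all)      // [ 'https://example.com/page', 'https://other.example/x' ]
console.log(sameHost) // [ 'https://example.com/page' ]
```

Five raw values shrink to two unique links and one same-host link, the same kind of reduction the script reports on a real page.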
Gotchas
Relative links come back relative.
- Issue: $(el).attr('href') returns the raw attribute, so a list of /about, ../docs, and ?page=2 strings is what you get, and string concatenation against the origin breaks on ../ and query-only hrefs.
- Fix: resolve every value with new URL(raw, base), which handles ../, ?, and protocol-relative //host paths the way a browser does.
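A quick side-by-side makes the failure concrete. This sketch compares origin-plus-string concatenation with the URL constructor; the helper names naiveJoin and resolve and the example.com page are hypothetical.

```javascript
// Hypothetical helpers for comparison; neither is part of the script.
function naiveJoin(raw, pageUrl) {
  return new URL(pageUrl).origin + raw // plain string concatenation
}

function resolve(raw, pageUrl) {
  return new URL(raw, pageUrl).href // WHATWG resolution, as a browser does it
}

const pageUrl = 'https://example.com/blog/post.html' // assumed example page

for (const raw of ['/about', '../docs', '?page=2', '//cdn.example.com/lib.js']) {
  console.log(raw, '|', naiveJoin(raw, pageUrl), '|', resolve(raw, pageUrl))
}
// /about                   | https://example.com/about                    | https://example.com/about
// ../docs                  | https://example.com../docs                   | https://example.com/docs
// ?page=2                  | https://example.com?page=2                   | https://example.com/blog/post.html?page=2
// //cdn.example.com/lib.js | https://example.com//cdn.example.com/lib.js  | https://cdn.example.com/lib.js
```

Only the root-relative /about survives concatenation; the other three come out as a broken host, a lost path, and the wrong site.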
A <base> tag silently moves every URL.
- Issue: when the page has <base href="https://example.com/docs/">, a browser resolves page1 to /docs/page1, but cheerio code that resolves against the document URL (say, a page at the site root) produces /page1, so a whole crawl points at the wrong paths.
- Fix: read $('base[href]').attr('href') and pass it as the base, which is what the script does. jsdom honors the tag for you when you pass { url }.
Non-link schemes survive the URL parser.
- Issue: new URL('mailto:hi@example.com') and new URL('tel:+123') parse without throwing, so email and phone anchors land in your link list.
- Fix: check resolved.protocol and keep only http: and https:.
Fragment-only links double-count.
- Issue: /page and /page#section are different strings, so a Set keeps both even though they point at the same document, inflating the count.
- Fix: set resolved.hash = '' before adding, which the script does. Skip this step if you actually want the in-page anchors.
JavaScript-rendered pages return few links.
- Issue: fetch only sees the server's initial HTML, so a React or Vue page that builds its nav and listing client-side hands back a near-empty <a> set.
- Fix: render with Puppeteer or Playwright first, then pass the HTML from await page.content() to cheerio. See How to scrape a JavaScript-rendered page in Node.js.
Some hrefs are not real navigations.
- Issue: anchors with href="#" or href="javascript:void(0)" are click handlers, not destinations, and they add noise to the list.
- Fix: the protocol filter already drops javascript:, and clearing the hash turns a bare # into the page's own URL, which the same-host filter keeps; add an explicit skip if you want them gone entirely.
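One way to write that explicit skip is to test the raw attribute before resolving it. This is a sketch under an assumption you may not share: it treats every href that starts with # as an in-page link, not only the bare #. The name isPlaceholderHref is hypothetical.

```javascript
// Hypothetical predicate: true for hrefs that do not lead to another document.
// Assumption: any fragment-only href ("#", "#top") counts as in-page.
function isPlaceholderHref(raw) {
  const value = raw.trim().toLowerCase()
  return value === '' || value.startsWith('#') || value.startsWith('javascript:')
}

// Assumed sample hrefs for illustration.
const hrefs = ['#', '#top', 'javascript:void(0)', '/about', '/page#section']
const kept = hrefs.filter(raw => !isPlaceholderHref(raw))

console.log(kept) // [ '/about', '/page#section' ]
```

Calling this on the raw value, before new URL(raw, base), keeps the page's own URL out of the Set unless some anchor links to it explicitly.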
You already have a jsdom DOM and want to reuse it.
- Issue: pulling in cheerio when the rest of your script parses with jsdom adds a second parser for the same HTML.
- Fix: jsdom resolves links for you when you pass the URL, so a.href is already absolute and it honors any <base> tag. The same filter and dedupe logic applies:

    import { JSDOM } from 'jsdom'

    const { document } = new JSDOM(html, { url }).window
    const links = new Set()
    for (const a of document.querySelectorAll('a[href]')) {
      let u
      try {
        u = new URL(a.href) // a.href is the raw attribute if it could not be parsed
      } catch {
        continue
      }
      if (u.protocol !== 'http:' && u.protocol !== 'https:') continue
      u.hash = ''
      links.add(u.href)
    }

  On the same Wikipedia page this returns the same 318 unique links.
Use this when
You want the anchor links from a server-rendered page as a clean, deduplicated, absolute-URL list, for seeding a crawler, auditing internal links, or building a sitemap from a page that does not publish one.
Skip this when
The page builds its links in the browser (render with Puppeteer first, then parse); you need to follow every page across a whole site rather than one page (use a sitemap walker with a queue); you want URLs mentioned in body text rather than in <a> tags (match them out of the text instead); or you need link previews with titles and thumbnails (fetch each target and read its Open Graph tags).