How to detect if you've been soft-blocked while scraping
If you're checking only the status code to tell whether a scrape worked, you're probably letting blocks slip through as successes. A site that does not want your traffic often does not return a 403. It returns a 200 with a Cloudflare interstitial, a Datadome "Please enable JS" page, or a near-empty shell where the article used to be, and your pipeline stores that garbage as a good result. By the time you notice, you have a few thousand rows of "Just a moment..." in your database.
The fix is to treat detection as three signals read together rather than one status code: the HTTP status, the size of the response body against what a real page should weigh, and a scan for the content fingerprints that challenge pages leave behind. A weak signal never decides the verdict alone: a thin body counts only when a block-leaning status backs it up, which is what keeps a genuinely short page from tripping the check. It runs in about 70 lines of Node.js with no dependencies beyond the built-in fetch.
Key terms
- Soft block. A response that returns content (often with a 200 status) but withholds the data you asked for, usually a challenge page, a JavaScript wall, or a stripped stub.
- Challenge page. The interstitial an anti-bot service serves instead of the real page, such as Cloudflare's "Just a moment..." or Datadome's verification screen, identifiable by stable marker strings in the HTML.
- Content fingerprint. A short, distinctive string that reliably appears in a known block page (a vendor's challenge title, a script src, a ray-id label) and rarely appears in legitimate content.
- Server header. The Server and cf-mitigated response headers, which name the fronting service and, on Cloudflare, flag an active challenge before you even read the body.
Here is what the script does:
- Fetch the URL with fetch and read both the response headers and the full body text.
- Score the status code, treating 401, 403, 429, and 503 as block-leaning and 200 as inconclusive on its own.
- Compare the body length against a floor, since a challenge page is usually far smaller than the article you expected.
- Scan the body and headers for known challenge fingerprints from a lookup table of vendor markers.
- Combine the three signals into a single verdict and the reasons behind it, so a caller can log, retry, or escalate.
The complete script
// detect-soft-block.mjs

/* Known challenge-page fingerprints, keyed by the anti-bot vendor that emits them.
   Each marker is a string that appears in the block page's HTML or headers and
   is unlikely to appear in legitimate article content. Extend per target site. */
const CHALLENGE_FINGERPRINTS = {
  cloudflare: ['cf-chl-', 'Just a moment...', 'Checking your browser before accessing', '/cdn-cgi/challenge-platform/'],
  datadome: ['datadome', 'dd_cookie', 'geo.captcha-delivery.com'],
  perimeterx: ['_px', 'Access to this page has been denied', 'px-captcha'],
  imperva: ['Incapsula incident', '_Incapsula_Resource', 'Request unsuccessful'],
  akamai: ['ak_bmsc', 'Access Denied', 'Reference '],
  generic: ['Attention Required', 'Enable JavaScript and cookies to continue', 'unusual traffic from your computer']
}

/* Status codes that lean toward a block on their own. 200 is deliberately absent:
   a soft block hides behind 200, so a 200 is inconclusive until the body is read. */
const BLOCK_STATUS = new Set([401, 403, 429, 503])

/* Below this many bytes, a page that should carry an article is suspiciously thin.
   Tune per target: a normal article body is tens of kilobytes, a challenge page is a few. */
const MIN_BODY_BYTES = 2000

/* Scan body text and headers for any vendor fingerprint. Returns the matched
   vendor and marker, or null when nothing matches. */
function matchFingerprint(bodyText, headerBlob) {
  const haystack = bodyText + '\n' + headerBlob
  for (const [vendor, markers] of Object.entries(CHALLENGE_FINGERPRINTS)) {
    for (const marker of markers) {
      if (haystack.includes(marker)) return { vendor, marker }
    }
  }
  return null
}

/* Read the three signals and combine them into one verdict.
   blocked is true only when the signals corroborate, not on any single weak hint. */
async function detectSoftBlock(url) {
  const res = await fetch(url, {
    headers: { 'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36' },
    redirect: 'follow'
  })
  const body = await res.text()
  const headerBlob = [...res.headers].map(([k, v]) => k + ': ' + v).join('\n')

  /* Signal 1: status code. A block-leaning code is strong but not sufficient. */
  const statusBlocked = BLOCK_STATUS.has(res.status)

  /* Signal 2: body size. A thin body where you expected an article is a hint, not a verdict. */
  const bodyBytes = Buffer.byteLength(body, 'utf8')
  const bodyThin = bodyBytes < MIN_BODY_BYTES

  /* Signal 3: content fingerprint. A vendor marker in the body or headers is the strongest signal. */
  const fingerprint = matchFingerprint(body, headerBlob)

  /* A Cloudflare-specific header that flags an active challenge before the body is parsed. */
  const cfMitigated = res.headers.get('cf-mitigated') === 'challenge'

  /* Combine: a fingerprint or the cf-mitigated header is decisive on its own;
     otherwise a block-leaning status backed by a thin body counts as a block. */
  const blocked = Boolean(fingerprint) || cfMitigated || (statusBlocked && bodyThin)

  const reasons = []
  if (fingerprint) reasons.push('fingerprint:' + fingerprint.vendor + ':' + fingerprint.marker)
  if (cfMitigated) reasons.push('header:cf-mitigated=challenge')
  if (statusBlocked) reasons.push('status:' + res.status)
  if (bodyThin) reasons.push('body-thin:' + bodyBytes + 'b')

  return { url, blocked, status: res.status, bodyBytes, vendor: fingerprint?.vendor ?? null, reasons }
}

const verdict = await detectSoftBlock('https://en.wikipedia.org/wiki/Web_scraping')
console.log(JSON.stringify(verdict, null, 2))
if (verdict.blocked) process.exitCode = 1

node detect-soft-block.mjs

What each step does
Read the body and the headers together. A status-only check throws away the two signals that catch a 200-OK block. The script awaits res.text() for the body and flattens res.headers into a single string so the fingerprint scan can match against header values like Server: cloudflare and cf-mitigated, not just the HTML.
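One detail worth checking when you search the flattened headers: a fetch Headers object yields lowercased header names when iterated, so a case-sensitive search for Server: cloudflare would miss server: cloudflare. The sketch below shows that behavior with hand-built headers. flattenHeaders and blobIncludes are hypothetical helpers for illustration, not part of the script above.

```javascript
// Hypothetical helper, not part of the script above: flatten a Headers object
// into one searchable string, the same way the script builds headerBlob.
function flattenHeaders(headers) {
  // Iterating a fetch Headers object yields [name, value] pairs with
  // lowercased names, e.g. ['server', 'cloudflare'].
  return [...headers].map(([name, value]) => name + ': ' + value).join('\n')
}

// Hypothetical helper: case-insensitive search, so 'Server: cloudflare'
// still matches the lowercased 'server: cloudflare' line.
function blobIncludes(blob, needle) {
  return blob.toLowerCase().includes(needle.toLowerCase())
}

// Headers built by hand for the demo; a real run would pass res.headers.
const sampleHeaders = new Headers({ Server: 'cloudflare', 'CF-Mitigated': 'challenge' })
const sampleBlob = flattenHeaders(sampleHeaders)

console.log(sampleBlob)
console.log(sampleBlob.includes('Server: cloudflare'))        // case-sensitive search
console.log(blobIncludes(sampleBlob, 'Server: cloudflare'))   // case-insensitive search
```

If you add header-based markers to the fingerprint table, writing them in lowercase (or searching case-insensitively as above) avoids a silent miss.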
Score the status code without trusting it alone. BLOCK_STATUS holds 401, 403, 429, and 503, the codes that lean toward a block. A 200 is deliberately left out of that set, because the soft block this page is about hides behind a 200. A block-leaning status raises suspicion but does not decide the verdict by itself.
Measure the body against a floor. MIN_BODY_BYTES is set to 2000 here. Most challenge pages weigh a few kilobytes while a real article weighs tens, so a body under the floor is a hint that something other than the page came back. Tune the floor to the size of the pages you actually scrape; a 2 KB floor is right for articles and wrong for a sparse API JSON response.
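One way to tune the floor is to derive it from pages you know came back clean. The sketch below takes a list of known-good body sizes and returns a fraction of their median. floorFromSamples is a hypothetical helper, the sample sizes are made up, and dividing the median by three is an assumed rule of thumb rather than a measured threshold.

```javascript
// Hypothetical helper: derive a body-size floor from known-good page sizes.
// Assumption: the median good size divided by 3 is a workable starting floor.
function floorFromSamples(goodSizesInBytes, divisor = 3) {
  if (goodSizesInBytes.length === 0) throw new Error('need at least one sample')
  const sorted = [...goodSizesInBytes].sort((a, b) => a - b)
  const mid = Math.floor(sorted.length / 2)
  // Median: middle value for odd counts, mean of the two middle values for even counts.
  const median = sorted.length % 2 === 1
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2
  return Math.floor(median / divisor)
}

// Sizes below are illustrative, not measurements from a real site.
const articleFloor = floorFromSamples([30000, 45000, 60000])
console.log(articleFloor) // 15000
```

The median is used instead of the mean so that one unusually large page does not drag the floor upward.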
Match against vendor fingerprints. CHALLENGE_FINGERPRINTS is a lookup table keyed by anti-bot vendor, each value a list of marker strings that show up in that vendor's block page. matchFingerprint walks the table and returns the first vendor and marker it finds in the combined body-plus-headers blob, so the verdict can name which service blocked you. Add markers for the sites you scrape as you observe new block pages.
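While you are tuning the table, it can help to see every marker that matched rather than only the first, since a broad marker may also turn up in legitimate pages. The sketch below is a hypothetical variant of matchFingerprint that collects all hits. The two-vendor table is a subset copied from the script so the example stands alone, and the sample HTML, including the script path, is invented for illustration.

```javascript
// Hypothetical variant of matchFingerprint: return every hit, not just the first.
function findAllFingerprints(haystack, table) {
  const hits = []
  for (const [vendor, markers] of Object.entries(table)) {
    for (const marker of markers) {
      if (haystack.includes(marker)) hits.push({ vendor, marker })
    }
  }
  return hits
}

// A subset of the table, repeated here so the sketch is self-contained.
const sampleTable = {
  cloudflare: ['Just a moment...', '/cdn-cgi/challenge-platform/'],
  generic: ['Attention Required']
}

// Invented HTML that imitates a challenge page; the script path is illustrative.
const sampleHtml =
  '<title>Just a moment...</title>' +
  '<script src="/cdn-cgi/challenge-platform/example.js"></script>'

const allHits = findAllFingerprints(sampleHtml, sampleTable)
console.log(allHits)
```

Logging all hits for a batch of known-good pages is a quick way to spot markers that are too generic for the sites you scrape.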
Combine the signals into one verdict. A fingerprint match or a cf-mitigated: challenge header is decisive on its own, since both are specific to block pages. Absent those, the script requires a block-leaning status and a thin body together before it calls the response blocked, which stops a short but legitimate page from being flagged. The returned reasons array records every signal that fired so the caller can log why.
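The combination rule is easier to check as a small truth table. The sketch below restates the rule as a standalone function and runs it over four hand-written cases. combineSignals and the case list are illustrative; the boolean expression is the one the script uses.

```javascript
// Standalone restatement of the script's combination rule, for illustration.
function combineSignals({ fingerprint, cfMitigated, statusBlocked, bodyThin }) {
  return Boolean(fingerprint) || cfMitigated || (statusBlocked && bodyThin)
}

// Hand-written cases; each sets the four signals directly.
const signalCases = [
  { name: '200, thin body, no marker', fingerprint: null, cfMitigated: false, statusBlocked: false, bodyThin: true },
  { name: '403, thin body, no marker', fingerprint: null, cfMitigated: false, statusBlocked: true, bodyThin: true },
  { name: '403, full body, no marker', fingerprint: null, cfMitigated: false, statusBlocked: true, bodyThin: false },
  { name: '200, full body, marker found', fingerprint: { vendor: 'cloudflare', marker: 'Just a moment...' }, cfMitigated: false, statusBlocked: false, bodyThin: false }
]

for (const signalCase of signalCases) {
  console.log(signalCase.name, '->', combineSignals(signalCase))
}
// 200, thin body, no marker -> false
// 403, thin body, no marker -> true
// 403, full body, no marker -> false
// 200, full body, marker found -> true
```

Note the third case: under this rule a 403 with a full-size body and no marker is not reported as blocked, though its status still appears in reasons. If your targets return large custom error pages, you may want a stricter rule for that combination.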
Gotchas
A soft block returns 200, so a status check passes it.
- Issue: Cloudflare's "Just a moment..." interstitial, Datadome's verification page, and many JavaScript walls can return 200 OK, so if (res.ok) treats the block as a success and stores the challenge HTML.
- Fix: read the body on every response and run the fingerprint scan even when res.status === 200; the blocked verdict here never depends on the status alone.
A genuinely short page trips the body-size floor.
- Issue: A sparse listing page, a redirect stub, or a small JSON endpoint can fall under MIN_BODY_BYTES and look thin even though it is the real response.
- Fix: the thin-body signal never decides on its own; it only counts alongside a block-leaning status. Set MIN_BODY_BYTES to roughly a third of a normal page for the site you scrape, or drop the floor to 0 for endpoints whose real size varies.
Fingerprint strings drift as vendors update their pages.
- Issue: Anti-bot vendors change their challenge markup, so a marker like "Checking your browser before accessing" can stop appearing and your scan goes quiet while blocks keep landing.
- Fix: keep CHALLENGE_FINGERPRINTS in one place and add the Server header and the cf-mitigated header as backstops, since the fronting service name in the header changes far less often than the body copy.
fetch follows the redirect to the challenge and hides the original status.
- Issue: With redirect: 'follow', a request that gets redirected to a challenge URL surfaces with the final hop's status, often a 200, so res.status reports the challenge page's code, not the redirect that led there.
- Fix: the fingerprint and cf-mitigated checks catch the challenge regardless of which hop's status you end on; if you need the first status, set redirect: 'manual' and inspect the initial response.
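If you keep redirect: 'follow', the response still tells you that a redirect happened. The sketch below reads the standard Response fields redirected, url, and status and summarizes how the final response relates to the URL you asked for. describeHops is a hypothetical helper, and the plain object and example.com URLs stand in for a real fetch response.

```javascript
// Hypothetical helper: summarize how a fetch response relates to the requested URL.
// `res` needs only the standard Response fields `redirected`, `url`, and `status`.
function describeHops(requestedUrl, res) {
  return {
    requestedUrl,
    finalUrl: res.url,
    finalStatus: res.status,
    // True when fetch followed at least one redirect before this response.
    redirected: res.redirected,
    // True when the final hop sits on a different host than the one requested.
    hostChanged: new URL(res.url).host !== new URL(requestedUrl).host
  }
}

// A plain object standing in for a fetch Response; the URLs are made up.
const fakeResponse = {
  redirected: true,
  url: 'https://challenge.example.com/verify?return=%2Farticle',
  status: 200
}

const hopSummary = describeHops('https://www.example.com/article', fakeResponse)
console.log(hopSummary)
```

Logging this summary next to the verdict makes it easier to tell a 200 served by the target page from a 200 served by wherever the redirect landed.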
A real browser session passes the same URL a bare fetch fails on.
- Issue: This detector runs a plain HTTP request, so a site that gates on TLS fingerprint or a JavaScript challenge will block the script even when a headless Chrome would get through, producing a block verdict that reflects the client, not a ban.
- Fix: when the verdict is blocked with a cloudflare or datadome vendor, retry through Puppeteer or a patched browser before deciding the site is unreachable. See How to patch headless Chrome to avoid detection.
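A caller might route on the verdict along these lines. The sketch below maps a verdict object, shaped like the one detectSoftBlock returns, to a next step. nextAction and the action labels are hypothetical, and treating a 429 as a back-off case before trying a browser is an assumption, not something the script decides.

```javascript
// Hypothetical routing helper: map a detectSoftBlock-style verdict to a next step.
// The action names are illustrative labels, not part of any library.
function nextAction(verdict) {
  if (!verdict.blocked) return 'store'
  // Assumption: a 429 points at rate limiting, so slow down before switching clients.
  if (verdict.status === 429) return 'back-off'
  // Vendors named in the text as worth retrying through a real browser.
  if (verdict.vendor === 'cloudflare' || verdict.vendor === 'datadome') return 'retry-with-browser'
  // Anything else blocked: keep the row out of the dataset for review.
  return 'quarantine'
}

// Hand-written verdicts for the demo.
console.log(nextAction({ blocked: false, status: 200, vendor: null }))        // store
console.log(nextAction({ blocked: true, status: 429, vendor: null }))         // back-off
console.log(nextAction({ blocked: true, status: 200, vendor: 'cloudflare' })) // retry-with-browser
console.log(nextAction({ blocked: true, status: 403, vendor: 'imperva' }))    // quarantine
```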
Per-request detection misses a slow ramp into rate limiting.
- Issue: A site can serve real pages for the first few hundred requests and then start returning 429s, so a one-shot check on a single URL says everything is fine right up until the block lands.
- Fix: call detectSoftBlock on a sample of responses across the run and track the blocked rate over a sliding window, then back off when it climbs. See How to rate-limit requests with backoff in JavaScript.
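The sliding window can be as simple as a bounded array of recent verdicts. The sketch below is one minimal way to track it. BlockRateWindow is a hypothetical class, and the window size of 5 is chosen only to keep the demo short.

```javascript
// Hypothetical tracker: blocked rate over the most recent `size` verdicts.
class BlockRateWindow {
  constructor(size) {
    this.size = size
    this.results = [] // booleans, oldest first
  }

  // Record one verdict and drop the oldest entry once the window is full.
  record(blocked) {
    this.results.push(Boolean(blocked))
    if (this.results.length > this.size) this.results.shift()
  }

  // Fraction of blocked verdicts in the window; 0 when nothing is recorded yet.
  rate() {
    if (this.results.length === 0) return 0
    return this.results.filter(Boolean).length / this.results.length
  }
}

// Three clean responses followed by four blocked ones, as in a slow ramp.
const recentVerdicts = new BlockRateWindow(5)
for (const blocked of [false, false, false, true, true, true, true]) {
  recentVerdicts.record(blocked)
}
console.log(recentVerdicts.rate()) // 0.8
```

A caller could compare rate() against a threshold of its own choosing and pause or slow the crawl when the rate crosses it.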
Use this when
You want a single function that tells you whether a scraped response is the page you asked for or a block dressed up as one, so your pipeline can retry, escalate to a browser, or quarantine the row instead of saving challenge HTML as data.
Skip this when
The site never fronts an anti-bot service and a status check is enough (just read res.status); you need to get past the block rather than name it (render with Puppeteer or a patched browser); you are fighting TLS-level fingerprinting before the body is even returned (use an impersonating HTTP client); or you need to recover automatically after detection (pair this with a backoff-and-retry loop).