Simplescraper
How to detect if you've been soft-blocked while scraping

Updated 2026-06-25 · 5 min read

If you're checking only the status code to tell whether a scrape worked, you're probably letting blocks slip through as successes. A site that does not want your traffic often does not return a 403. It returns a 200 with a Cloudflare interstitial, a Datadome "Please enable JS" page, or a near-empty shell where the article used to be, and your pipeline stores that garbage as a good result. By the time you notice, you have a few thousand rows of "Just a moment..." in your database.

The fix is to treat detection as three signals read together rather than one status code: the HTTP status, the size of the response body against what a real page should weigh, and a scan for the content fingerprints that challenge pages leave behind. A fingerprint match is specific enough to decide the verdict on its own; the two weaker signals, a block-leaning status and a thin body, count only when they agree, which is what keeps a genuinely short page from tripping the check. It runs in about 70 lines of Node.js (version 18 or later) with no dependencies beyond the built-in fetch.

Key terms

  • Soft block. A response that returns content (often with a 200 status) but withholds the data you asked for, usually a challenge page, a JavaScript wall, or a stripped stub.
  • Challenge page. The interstitial an anti-bot service serves instead of the real page, such as Cloudflare's "Just a moment..." or Datadome's verification screen, identifiable by stable marker strings in the HTML.
  • Content fingerprint. A short, distinctive string that reliably appears in a known block page (a vendor's challenge title, a script src, a ray-id label) and rarely appears in legitimate content.
  • Server and cf-mitigated headers. Response headers that name the fronting service (Server) and, on Cloudflare, flag an active challenge (cf-mitigated: challenge) before you even read the body.

Here is what the script does:

  • Fetch the URL with fetch and read both the response headers and the full body text.
  • Score the status code, treating 401, 403, 429, and 503 as block-leaning and 200 as inconclusive on its own.
  • Compare the body length against a floor, since a challenge page is usually far smaller than the article you expected.
  • Scan the body and headers for known challenge fingerprints from a lookup table of vendor markers.
  • Combine the three signals into a single verdict and the reasons behind it, so a caller can log, retry, or escalate.

The complete script

js
// detect-soft-block.mjs

/* Known challenge-page fingerprints, keyed by the anti-bot vendor that emits them.
   Each marker is a string that appears in the block page's HTML or headers and
   is unlikely to appear in legitimate article content. Extend per target site. */
const CHALLENGE_FINGERPRINTS = {
  cloudflare: ['cf-chl-', 'Just a moment...', 'Checking your browser before accessing', '/cdn-cgi/challenge-platform/'],
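  /* Bare cookie and script names such as 'datadome', '_px', and 'ak_bmsc' are left out:
     they can also appear on pages these vendors let through, and a marker here is decisive. */
  datadome: ['dd_cookie', 'geo.captcha-delivery.com'],
  perimeterx: ['Access to this page has been denied', 'px-captcha'],
  imperva: ['Incapsula incident', '_Incapsula_Resource', 'Request unsuccessful'],
  akamai: ['Access Denied', 'Reference #'],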
  generic: ['Attention Required', 'Enable JavaScript and cookies to continue', 'unusual traffic from your computer']
}

/* Status codes that lean toward a block on their own. 200 is deliberately absent:
   a soft block hides behind 200, so a 200 is inconclusive until the body is read. */
const BLOCK_STATUS = new Set([401, 403, 429, 503])

/* Below this many bytes, a page that should carry an article is suspiciously thin.
   Tune per target: a normal article body is tens of kilobytes, a challenge page is a few. */
const MIN_BODY_BYTES = 2000

/* Scan body text and headers for any vendor fingerprint. Returns the matched
   vendor and marker, or null when nothing matches. */
function matchFingerprint(bodyText, headerBlob) {
  const haystack = bodyText + '\n' + headerBlob
  for (const [vendor, markers] of Object.entries(CHALLENGE_FINGERPRINTS)) {
    for (const marker of markers) {
      if (haystack.includes(marker)) return { vendor, marker }
    }
  }
  return null
}

/* Read the three signals and combine them into one verdict.
   blocked is true only when the signals corroborate, not on any single weak hint. */
async function detectSoftBlock(url) {
  const res = await fetch(url, {
    headers: { 'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36' },
    redirect: 'follow'
  })
  const body = await res.text()
  const headerBlob = [...res.headers].map(([k, v]) => k + ': ' + v).join('\n')

  /* Signal 1: status code. A block-leaning code is strong but not sufficient. */
  const statusBlocked = BLOCK_STATUS.has(res.status)

  /* Signal 2: body size. A thin body where you expected an article is a hint, not a verdict. */
  const bodyBytes = Buffer.byteLength(body, 'utf8')
  const bodyThin = bodyBytes < MIN_BODY_BYTES

  /* Signal 3: content fingerprint. A vendor marker in the body or headers is the strongest signal. */
  const fingerprint = matchFingerprint(body, headerBlob)

  /* A Cloudflare-specific header that flags an active challenge before the body is parsed. */
  const cfMitigated = res.headers.get('cf-mitigated') === 'challenge'

  /* Combine: a fingerprint or the cf-mitigated header is decisive on its own;
     otherwise a block-leaning status backed by a thin body counts as a block. */
  const blocked = Boolean(fingerprint) || cfMitigated || (statusBlocked && bodyThin)

  const reasons = []
  if (fingerprint) reasons.push('fingerprint:' + fingerprint.vendor + ':' + fingerprint.marker)
  if (cfMitigated) reasons.push('header:cf-mitigated=challenge')
  if (statusBlocked) reasons.push('status:' + res.status)
  if (bodyThin) reasons.push('body-thin:' + bodyBytes + 'b')

  return { url, blocked, status: res.status, bodyBytes, vendor: fingerprint?.vendor ?? null, reasons }
}

const verdict = await detectSoftBlock('https://en.wikipedia.org/wiki/Web_scraping')
console.log(JSON.stringify(verdict, null, 2))
if (verdict.blocked) process.exitCode = 1

bash
node detect-soft-block.mjs

What each step does

Read the body and the headers together. A status-only check throws away the two signals that catch a 200-OK block. The script awaits res.text() for the body and flattens res.headers into a single string so the fingerprint scan covers header values as well as the HTML. The cf-mitigated header is read separately with res.headers.get, and a bare Server: cloudflare is not a marker, because it also appears on every legitimate page the service fronts.
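
To make the flattening step concrete, here is a minimal sketch that builds a Headers object by hand in place of res.headers. The header values are made up for illustration. It shows one detail worth knowing before you add header markers: iterating a Headers object yields lowercased names, and the scan in the script is case-sensitive.

```js
// Stand-in for res.headers; the values are illustrative, not captured from a real site.
const sampleHeaders = new Headers({
  Server: 'cloudflare',
  'CF-Mitigated': 'challenge',
  'Content-Type': 'text/html; charset=UTF-8'
})

// Same flattening as the script: one "name: value" line per header.
const headerBlob = [...sampleHeaders].map(([k, v]) => k + ': ' + v).join('\n')
console.log(headerBlob)

// Header names come back lowercased, so only the lowercase form matches.
console.log(headerBlob.includes('cf-mitigated: challenge')) // true
console.log(headerBlob.includes('CF-Mitigated: challenge')) // false
```

If you add a header-based marker to the table, write the header name in lowercase or it will never match.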

Score the status code without trusting it alone. BLOCK_STATUS holds 401, 403, 429, and 503, the codes that lean toward a block. A 200 is deliberately left out of that set, because the soft block this page is about hides behind a 200. A block-leaning status raises suspicion but does not decide the verdict by itself.

Measure the body against a floor. MIN_BODY_BYTES is set to 2000 here, which is a conservative floor: a real article weighs tens of kilobytes, so a body under 2 KB is a hint that something other than the page came back. A challenge page that weighs a few kilobytes can still clear this floor, and the fingerprint scan is what catches it. Tune the floor to the size of the pages you actually scrape; a 2 KB floor is right for articles and wrong for a sparse API JSON response.
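
One way to tune the floor is to derive it from pages you know are good. The sketch below is an assumption rather than part of the script: floorFromSamples is a hypothetical helper, the sample sizes are made up, and the divisor of 3 is a rule of thumb you should adjust per site.

```js
// Hypothetical helper: suggest a body-size floor from the byte sizes of known-good pages.
// Uses the median so one unusually large or small page does not skew the result.
function floorFromSamples(goodSizesBytes, divisor = 3) {
  if (goodSizesBytes.length === 0) throw new Error('need at least one sample')
  const sorted = [...goodSizesBytes].sort((a, b) => a - b)
  const mid = Math.floor(sorted.length / 2)
  const median = sorted.length % 2 === 1
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2
  return Math.floor(median / divisor)
}

// Made-up sizes for five article pages on the same site.
const sampleSizes = [48210, 51977, 39504, 60122, 45318]
const suggestedFloor = floorFromSamples(sampleSizes)
console.log(suggestedFloor) // 16070
```

You would then use the suggested value for MIN_BODY_BYTES on that site instead of the generic 2000.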

Match against vendor fingerprints. CHALLENGE_FINGERPRINTS is a lookup table keyed by anti-bot vendor, each value a list of marker strings that show up in that vendor's block page. matchFingerprint walks the table and returns the first vendor and marker it finds in the combined body-plus-headers blob, so the verdict can name which service blocked you. Add markers for the sites you scrape as you observe new block pages.
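
The lookup can be exercised without any network traffic. This sketch uses a trimmed copy of the table and hand-written HTML snippets; the exampleShop entry and its marker are hypothetical, standing in for a per-site marker you might add after seeing a new block page.

```js
// Trimmed copy of the table, plus one hypothetical per-site entry.
const FINGERPRINTS = {
  cloudflare: ['cf-chl-', 'Just a moment...', '/cdn-cgi/challenge-platform/'],
  generic: ['Attention Required', 'Enable JavaScript and cookies to continue'],
  exampleShop: ['Our systems have detected automated requests'] // hypothetical marker
}

// Same walk as the script: first vendor and marker found, in table order.
function matchFingerprint(text) {
  for (const [vendor, markers] of Object.entries(FINGERPRINTS)) {
    for (const marker of markers) {
      if (text.includes(marker)) return { vendor, marker }
    }
  }
  return null
}

// Hand-written samples, not captured responses.
const challengeHtml = '<html><head><title>Just a moment...</title></head>' +
  '<body><script src="/cdn-cgi/challenge-platform/x.js"></script></body></html>'
const articleHtml = '<html><head><title>Web scraping</title></head>' +
  '<body><p>Web scraping is data extraction from websites.</p></body></html>'

console.log(matchFingerprint(challengeHtml)) // { vendor: 'cloudflare', marker: 'Just a moment...' }
console.log(matchFingerprint(articleHtml))   // null
```

The match comes back in table order, not document order, so when a page contains several markers the verdict names whichever one is listed first.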

Combine the signals into one verdict. A fingerprint match or a cf-mitigated: challenge header is decisive on its own, since both are specific to block pages. Absent those, the script requires a block-leaning status and a thin body together before it calls the response blocked, which stops a short but legitimate page from being flagged. The returned reasons array records every signal that fired so the caller can log why.
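
The combination rule is easier to check as a pure function over a few cases. This is a sketch of the same boolean logic the script uses; combineSignals and the sample cases are illustrative names and values, not part of the script.

```js
const BLOCK_STATUS = new Set([401, 403, 429, 503])
const MIN_BODY_BYTES = 2000

// Same rule as the script: a fingerprint or cf-mitigated decides alone;
// otherwise a block-leaning status and a thin body must both be present.
function combineSignals({ status, bodyBytes, fingerprint = null, cfMitigated = false }) {
  const statusBlocked = BLOCK_STATUS.has(status)
  const bodyThin = bodyBytes < MIN_BODY_BYTES
  return Boolean(fingerprint) || cfMitigated || (statusBlocked && bodyThin)
}

// Illustrative cases with made-up sizes.
const cases = [
  { name: '200 + challenge marker', status: 200, bodyBytes: 6200, fingerprint: { vendor: 'cloudflare' } },
  { name: '200 + short real page', status: 200, bodyBytes: 900 },
  { name: '403 + thin body', status: 403, bodyBytes: 300 },
  { name: '403 + full-size body', status: 403, bodyBytes: 40000 },
  { name: '200 + cf-mitigated header', status: 200, bodyBytes: 5000, cfMitigated: true }
]

const results = cases.map((c) => combineSignals(c))
cases.forEach((c, i) => console.log(c.name, '->', results[i]))
// true, false, true, false, true
```

Note the fourth case: a 403 with a full-size body and no fingerprint is not called a block under this rule, though its status still shows up in the reasons array for the caller to act on.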

Gotchas

  • A soft block returns 200, so a status check passes it.

    • Issue: Cloudflare's "Just a moment..." interstitial, Datadome's verification page, and many JavaScript walls return 200 OK, so if (res.ok) treats the block as a success and stores the challenge HTML.
    • Fix: read the body on every response and run the fingerprint scan even when res.status === 200; the blocked verdict here never depends on the status alone.
  • A genuinely short page trips the body-size floor.

    • Issue: A sparse listing page, a redirect stub, or a small JSON endpoint can fall under MIN_BODY_BYTES and look thin even though it is the real response.
    • Fix: the thin-body signal never decides on its own; it only counts alongside a block-leaning status. Set MIN_BODY_BYTES to roughly a third of a normal page for the site you scrape, or drop the floor to 0 for endpoints whose real size varies.
  • Fingerprint strings drift as vendors update their pages.

    • Issue: Anti-bot vendors change their challenge markup, so a marker like Checking your browser before accessing can stop appearing and your scan goes quiet while blocks keep landing.
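    • Fix: keep CHALLENGE_FINGERPRINTS in one place, update it as you observe new block pages, and use the cf-mitigated header as a backstop, since headers change far less often than the body copy. The Server header names the fronting service but also appears on legitimate responses, so log it alongside the verdict rather than treating it as a block signal.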
  • fetch follows the redirect to the challenge and hides the original status.

    • Issue: With redirect: 'follow', a 302 that bounces to a challenge URL surfaces as a 200 from the final hop, so res.status reports the challenge page's code, not the redirect that led there.
    • Fix: the fingerprint and cf-mitigated checks catch the challenge regardless of which hop's status you end on. res.redirected and res.url tell you whether a redirect happened and where it landed; if you need the first status, set redirect: 'manual' and inspect the initial response.
  • A real browser session passes the same URL a bare fetch fails on.

    • Issue: This detector runs a plain HTTP request, so a site that gates on TLS fingerprint or a JavaScript challenge will block the script even when a headless Chrome would get through, producing a block verdict that reflects the client, not a ban.
    • Fix: when the verdict is blocked with a cloudflare or datadome vendor, retry through Puppeteer or a patched browser before deciding the site is unreachable. See How to patch headless Chrome to avoid detection.
  • Per-request detection misses a slow ramp into rate limiting.

    • Issue: A site can serve real pages for the first few hundred requests and then start returning 429s, so a one-shot check on a single URL says everything is fine right up until the block lands.
    • Fix: call detectSoftBlock on a sample of responses across the run and track the blocked rate over a sliding window, then back off when it climbs. See How to rate-limit requests with backoff in JavaScript.
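
The sliding-window idea from the last gotcha can be sketched as a small tracker. createBlockRateTracker, the window size, and the threshold are all assumptions for illustration. In a real run you would feed it verdict.blocked from each sampled response.

```js
// Hypothetical tracker: keeps the last `windowSize` verdicts and reports the blocked rate.
function createBlockRateTracker(windowSize = 50, threshold = 0.2) {
  const recent = []
  const rate = () =>
    recent.length === 0 ? 0 : recent.filter(Boolean).length / recent.length
  return {
    // Record one verdict, dropping the oldest once the window is full.
    record(blocked) {
      recent.push(Boolean(blocked))
      if (recent.length > windowSize) recent.shift()
    },
    rate,
    // Only advise backing off once the window is full, to avoid reacting to one early block.
    shouldBackOff: () => recent.length === windowSize && rate() >= threshold
  }
}

// Simulated run: eight clean responses, then blocks start landing.
const tracker = createBlockRateTracker(10, 0.3)
for (let i = 0; i < 8; i++) tracker.record(false)
tracker.record(true)
tracker.record(true)
console.log(tracker.rate(), tracker.shouldBackOff()) // 0.2 false

tracker.record(true)
tracker.record(true)
console.log(tracker.rate(), tracker.shouldBackOff()) // 0.4 true
```

A smaller window reacts faster but is noisier, so pick the size against how many requests you send per minute.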

Use this when

You want a single function that tells you whether a scraped response is the page you asked for or a block dressed up as one, so your pipeline can retry, escalate to a browser, or quarantine the row instead of saving challenge HTML as data.
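
As a sketch of what a caller might do with the verdict, the function below maps it to one of the follow-up actions mentioned above. decideAction and the action names are hypothetical; the routing rules follow the gotchas (browser retry for Cloudflare or Datadome, backoff for rate-limit statuses) and are a starting point, not a fixed policy.

```js
// Hypothetical router: turn a detectSoftBlock-style verdict into a pipeline action.
function decideAction(verdict) {
  if (!verdict.blocked) return 'store'
  // A challenge from these vendors may pass in a real browser session.
  if (verdict.vendor === 'cloudflare' || verdict.vendor === 'datadome') {
    return 'escalate-to-browser'
  }
  // Rate-limit-style statuses are worth retrying after a pause.
  if (verdict.status === 429 || verdict.status === 503) return 'retry-with-backoff'
  // Anything else blocked: keep the row out of the dataset for review.
  return 'quarantine'
}

// Hand-written verdicts in the shape the script returns.
console.log(decideAction({ blocked: false, status: 200, vendor: null }))        // store
console.log(decideAction({ blocked: true, status: 200, vendor: 'cloudflare' })) // escalate-to-browser
console.log(decideAction({ blocked: true, status: 429, vendor: null }))         // retry-with-backoff
console.log(decideAction({ blocked: true, status: 403, vendor: 'akamai' }))     // quarantine
```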

Skip this when

The site never fronts an anti-bot service and a status check is enough (just read res.status); you need to get past the block rather than name it (render with Puppeteer or a patched browser); you are fighting TLS-level fingerprinting before the body is even returned (use an impersonating HTTP client); or you need to recover automatically after detection (pair this with a backoff-and-retry loop).
