
Extracting Markdown data

Simplescraper lets you extract the text content of a single webpage, or of an entire website, in Markdown format. Markdown retains the page's formatting and is a preferred format when analyzing web data using AI models such as OpenAI's ChatGPT and Anthropic's Claude.

There are a number of ways to extract website data in Markdown format with Simplescraper:

Via Auto Crawl


  • Visit the Simplescraper dashboard and click Get Data in the sidebar, then select Auto Crawl
  • Enter the URL of the website you wish to save as Markdown
  • Under Auto Crawl options, choose the maximum number of pages to scrape and any URL patterns to restrict the crawl to
  • Click the 'test' button to run a quick crawl on two pages. Results will appear after a few seconds. Check that the data looks correct before proceeding
  • Click 'run' and Simplescraper will navigate through the website, saving a Markdown version of each page
  • When the auto crawl is completed, download options for Markdown, JSON, and CSV will appear
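
To illustrate how a page limit and URL patterns narrow a crawl, here is a minimal sketch in JavaScript. It is not Simplescraper's implementation: the `*` wildcard syntax and the names `patternToRegExp` and `selectPages` are assumptions made for illustration, and the pattern syntax Simplescraper actually accepts may differ.

```javascript
// Hypothetical helper: convert a pattern with `*` wildcards into a RegExp.
// The wildcard syntax is an assumption, not Simplescraper's documented format.
function patternToRegExp(pattern) {
  const escaped = pattern
    .split('*')
    .map((part) => part.replace(/[.*+?^${}()|[\]\\]/g, '\\$&'))
    .join('.*');
  return new RegExp(`^${escaped}$`);
}

// Keep only URLs matching at least one pattern, up to maxPages.
// An empty pattern list means "no restriction".
function selectPages(urls, patterns, maxPages) {
  const regexes = patterns.map(patternToRegExp);
  const allowed = urls.filter(
    (url) => regexes.length === 0 || regexes.some((re) => re.test(url))
  );
  return allowed.slice(0, maxPages);
}

const discovered = [
  'https://example.com/blog/post-1',
  'https://example.com/about',
  'https://example.com/blog/post-2',
  'https://example.com/blog/post-3',
];

// Restrict to blog pages and stop after two of them.
const selected = selectPages(discovered, ['https://example.com/blog/*'], 2);
console.log(selected);
```

Running the 'test' crawl first serves the same purpose as inspecting `selected` here: it confirms the restriction picks up the pages you expect before a full run.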

Continuing an Auto Crawl

If you originally set a limit on the number of pages or ran out of credits and want to scrape more:

  • On the Auto Crawl menu, click the 'continue' button next to one of your previous crawls in the history section
  • The website's URL will load. Increase the page limit as needed (for example, if you scraped 1000 pages before and now want to scrape 3000, enter 3000)
  • Click 'run' and Simplescraper will continue to crawl the website, scraping only those pages that were not previously scraped
  • When the task is completed, updated Markdown, JSON and CSV download options will appear
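
The continue behavior can be pictured as a set difference bounded by the new page limit. The sketch below is illustrative only: `pagesToScrape` and its arguments are hypothetical names, and it assumes the page limit is a total that includes previously scraped pages, which is one reading of the 1000 and 3000 example above.

```javascript
// Hypothetical sketch: choose which pages a continued crawl would scrape.
// Assumes pageLimit is a total that includes previously scraped pages.
function pagesToScrape(discoveredUrls, alreadyScraped, pageLimit) {
  const done = new Set(alreadyScraped);
  const remainingBudget = Math.max(0, pageLimit - done.size);
  return discoveredUrls
    .filter((url) => !done.has(url)) // skip pages scraped in the earlier run
    .slice(0, remainingBudget);      // respect the raised limit
}

const previous = ['https://example.com/', 'https://example.com/a'];
const found = [
  'https://example.com/',
  'https://example.com/a',
  'https://example.com/b',
  'https://example.com/c',
  'https://example.com/d',
];

// With a total limit of 4 and two pages already done, two new pages are picked.
const next = pagesToScrape(found, previous, 4);
console.log(next);
```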

Via a scrape recipe

  • When saving a scrape recipe (this guide covers saving recipes), click the Advanced options section and toggle the 'Extract Markdown' button to the on position

  • Run your recipe. The Markdown will appear in its own column, and a 'download Markdown' button will be available

  • Note that if the Markdown is very large (over 10MB), the file will be downloaded as a zip

Via the API

  • When calling the Simplescraper API, include extractMarkdown: true in the body of the request

    • js
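      const apikey = 'ap1k3y'; // replace with your API key
      const recipeId = 'YOUR_RECIPE_ID'; // placeholder: the ID of the recipe to run
      
      const requestBody = {
        extractMarkdown: true, // include Markdown in the results
      };
      
      // fetch returns a Promise, so await it (inside an async function or ES module)
      const response = await fetch(`https://api.simplescraper.io/v1/recipes/${recipeId}/run`, {
        method: 'POST',
        headers: {
          'Authorization': `Bearer ${apikey}`,
          'Content-Type': 'application/json'
        },
        body: JSON.stringify(requestBody)
      });
      const data = await response.json(); // assumes the API responds with JSON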
  • Please read the full API guide for more details on data extraction via the API: https://simplescraper.io/docs/api-guide
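
For reuse, the request above can be wrapped in a small helper. This is a sketch, not part of an official client: `buildRunRequest` and `runRecipe` are hypothetical names, and the code assumes the endpoint returns JSON, whose exact shape is covered in the API guide rather than here.

```javascript
// Hypothetical helper: build the URL and fetch options for running a recipe
// with Markdown extraction enabled. Kept separate so it can be inspected
// without making a network call.
function buildRunRequest(recipeId, apiKey) {
  return {
    url: `https://api.simplescraper.io/v1/recipes/${encodeURIComponent(recipeId)}/run`,
    options: {
      method: 'POST',
      headers: {
        Authorization: `Bearer ${apiKey}`,
        'Content-Type': 'application/json',
      },
      body: JSON.stringify({ extractMarkdown: true }),
    },
  };
}

// Hypothetical wrapper: send the request and fail loudly on HTTP errors.
// Assumes a JSON response; see the API guide for the actual fields.
async function runRecipe(recipeId, apiKey) {
  const { url, options } = buildRunRequest(recipeId, apiKey);
  const response = await fetch(url, options);
  if (!response.ok) {
    throw new Error(`Request failed with status ${response.status}`);
  }
  return response.json();
}

// Inspect the request that would be sent, without calling the API.
const request = buildRunRequest('YOUR_RECIPE_ID', 'ap1k3y');
console.log(request.url);
```

Calling `runRecipe('YOUR_RECIPE_ID', 'ap1k3y')` inside an async function would then resolve to the parsed response, or reject if the server returns an error status.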