Extracting Markdown data
Simplescraper lets you extract the text content of a single webpage, or of an entire website, in Markdown format. Markdown retains the page's formatting and is a preferred format when analyzing web data with AI models such as OpenAI's ChatGPT and Anthropic's Claude.
There are a number of ways to extract website data in Markdown format with Simplescraper:
Via Auto Crawl
- Visit the Simplescraper dashboard and click Get Data in the sidebar, then select Auto Crawl
- Enter the URL of the website you wish to save as Markdown
- Under Auto Crawl options, choose the maximum number of pages to scrape and any URL patterns to restrict the crawl to
- Click the 'test' button to run a quick crawl on two pages. After a few seconds, results will appear - check them to ensure the data looks correct before proceeding
- Click 'run' and Simplescraper will navigate through the website, saving a Markdown version of each page
- When the Auto Crawl is complete, download options for Markdown, JSON, and CSV will appear
Continuing an Auto Crawl
If you originally set a limit on the number of pages or ran out of credits and want to scrape more:
- On the Auto Crawl menu, click the 'continue' button next to one of your previous crawls in the history section
- The website's URL will load. Increase the page limit as needed (for example, if you scraped 1000 pages before and now want to scrape 3000, enter 3000)
- Click 'run' and Simplescraper will continue to crawl the website, scraping only those pages that were not previously scraped
- When the task is completed, updated Markdown, JSON and CSV download options will appear
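The page-limit arithmetic in the steps above can be pictured with a small sketch. This is only a conceptual model that assumes the limit is a total across runs; the helper names are hypothetical and do not describe how Simplescraper is actually implemented.

```javascript
// Conceptual model only: these helpers are hypothetical, not Simplescraper code.

// The page limit is treated as a total, so the number of new pages
// is the new limit minus what was already scraped.
function remainingPages(newPageLimit, alreadyScrapedCount) {
  return Math.max(0, newPageLimit - alreadyScrapedCount);
}

// Keep only URLs that were not scraped before, capped at the remaining budget.
function selectPagesToScrape(discoveredUrls, previouslyScraped, newPageLimit) {
  const seen = new Set(previouslyScraped);
  const budget = remainingPages(newPageLimit, seen.size);
  return discoveredUrls.filter((url) => !seen.has(url)).slice(0, budget);
}

const previous = ['https://example.com/', 'https://example.com/about'];
const discovered = [
  'https://example.com/',
  'https://example.com/about',
  'https://example.com/pricing',
  'https://example.com/blog',
];

// With a limit of 3 and 2 pages already scraped, one new page fits.
console.log(selectPagesToScrape(discovered, previous, 3));
```

Under this model, raising the limit from 1000 to 3000 leaves room for up to 2000 new pages.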
Via a scrape recipe
When saving a scrape recipe (this guide covers saving recipes), click the Advanced options section and toggle the 'Extract Markdown' button to the on position
Run your recipe. The Markdown will appear in its own column, and a 'download Markdown' button will be available
Note that if the Markdown is very large (over 10MB), the file will be downloaded as a zip
Via the API
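When calling the Simplescraper API, include extractMarkdown: true in the body of the request:

```js
const apikey = 'ap1k3y'; // replace with your API key
const recipeId = 'r3c1p31d'; // placeholder: replace with your recipe's ID

const requestBody = {
  extractMarkdown: true,
};

// fetch returns a Promise, so await it (inside an async function or an ES module)
const response = await fetch(`https://api.simplescraper.io/v1/recipes/${recipeId}/run`, {
  method: 'POST',
  headers: {
    'Authorization': `Bearer ${apikey}`,
    'Content-Type': 'application/json'
  },
  body: JSON.stringify(requestBody)
});
```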
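A slightly fuller sketch of the same request, with basic error handling, might look like the following. The helper names buildMarkdownRunRequest and runRecipeWithMarkdown are hypothetical, the key and recipe ID are placeholders, and the code assumes the API responds with JSON; check the API guide for the actual response shape.

```javascript
// Hypothetical helper (not part of the Simplescraper API): builds the URL and
// fetch options for running a recipe with Markdown extraction enabled.
function buildMarkdownRunRequest(apiKey, recipeId) {
  return {
    url: `https://api.simplescraper.io/v1/recipes/${encodeURIComponent(recipeId)}/run`,
    options: {
      method: 'POST',
      headers: {
        'Authorization': `Bearer ${apiKey}`,
        'Content-Type': 'application/json',
      },
      body: JSON.stringify({ extractMarkdown: true }),
    },
  };
}

// Sends the request and throws on a non-2xx status.
// Assumption: the response body is JSON; its fields are not modeled here.
async function runRecipeWithMarkdown(apiKey, recipeId) {
  const { url, options } = buildMarkdownRunRequest(apiKey, recipeId);
  const response = await fetch(url, options);
  if (!response.ok) {
    throw new Error(`Request failed with status ${response.status}`);
  }
  return response.json();
}

// Inspect the request without sending it (placeholder key and recipe ID).
const request = buildMarkdownRunRequest('ap1k3y', 'r3c1p31d');
console.log(request.url);
```

Separating request construction from sending makes it easy to log or inspect the request before any credits are spent. To actually run the recipe, call runRecipeWithMarkdown with a real key and recipe ID and await the result.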
Please read the full API guide for more details on data extraction via the API: https://simplescraper.io/docs/api-guide