How to use Smart Extract

SimpleScraper Smart Extract uses AI to accurately extract data and generate reusable CSS selectors from any website. All that's required is a URL and a data schema that lists the properties you wish to extract.

Smart Extract can be accessed via the dashboard or the API.

Dashboard Usage

  • Navigate to https://simplescraper.io/new
  • In the top input field, enter the URL of the page you wish to extract data from
  • In the bottom input field, enter a data schema (a comma-separated list of the properties you wish to extract)
  • Click the 'Extract Data' button
  • After a few seconds, the data will be returned in CSV and JSON format
  • Click the 'Save as a scrape recipe' button to convert the smart extraction into a regular scrape recipe, allowing you to scrape at scale using Simplescraper

API Usage


Tips on writing your data schema

  • The schema provided should be a short, accurate list of each of the visible data points on the website that you wish to extract.

    • For example, if extracting data from a jobs board: 'Role, salary, location, job type, company, description, experience required' is a good schema.
  • Including a hint of what data is being extracted can increase accuracy.

    • For example, instead of 'name, location, price, size, bedrooms, bathrooms', include a reference to the type of data: 'property name, location, price, size (sqm), bedrooms, bathrooms'.
  • A schema is not a prompt.

    • This works: "title, old price, current price, discount, review count, description, num capsules, rating".
    • This does not: "visit the website and extract everything on the page beginning with A".
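
The tips above can be sketched as a small pre-flight check run before submitting a schema. This is an illustrative sketch only: the helper names `build_schema` and `looks_like_prompt`, the word list, and the four-word threshold are assumptions made for this example and are not part of Simplescraper.

```python
# Hypothetical helpers for preparing a Smart Extract schema string.
# The verb list and word limit below are illustrative assumptions.
PROMPT_VERBS = {"visit", "extract", "find", "get", "scrape", "go", "please"}


def build_schema(properties):
    """Join property names into a comma-separated schema string."""
    # Collapse stray whitespace and drop empty entries.
    cleaned = [" ".join(p.split()) for p in properties if p.strip()]
    return ", ".join(cleaned)


def looks_like_prompt(schema, max_words_per_property=4):
    """Heuristic: flag schemas that read like instructions, not property lists."""
    parts = [p.strip() for p in schema.split(",") if p.strip()]
    for part in parts:
        words = part.lower().split()
        # A single "property" with many words is probably a sentence.
        if len(words) > max_words_per_property:
            return True
        # A property that starts with an imperative verb is probably a prompt.
        if words and words[0] in PROMPT_VERBS:
            return True
    return False


schema = build_schema(["property name", " location ", "price", "size (sqm)"])
print(schema)
print(looks_like_prompt(schema))
print(looks_like_prompt("visit the website and extract everything on the page"))
```

A check like this only catches obvious cases. It does not verify that each property is actually visible on the target page.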

Current limitations of Smart Extract

  • Image URLs are not extracted (will be possible soon)
  • The URL must be publicly available and not behind a login

Examples of using Simplescraper Smart Extract

The following is a list of websites and example schemas that return accurate data. Use similarly styled schemas on the sites you wish to extract data from.

Website | Schema
https://carsandbids.com/ | car name, details, time remaining, bid price, location
https://www.nike.com/gb/t/air-force-1-07-next-nature-shoes-67bFZC/DV3808-107 | price, old price, name, num of colors
https://jobs.careers.microsoft.com/global/en/search | job title, location, remote possible, description
https://www.realestate.com.au/international/id/bali/ | price aud, price us, location, size (m2)
https://x.com/emollick | name, @tag, tagline, joined date, link, number of posts, top tweet text

Notes:

  • SimpleScraper Smart Extract is in beta and may not be 100% accurate. If you encounter any issues or incorrect data, please contact us via chat.