How to use Smart Extract
Simplescraper Smart Extract uses AI to accurately extract data and generate reusable CSS selectors from any website. All that's required is a URL and a data schema that lists the properties you wish to extract.
Smart Extract can be accessed via:
- The dashboard at https://simplescraper.io/new (or the scrape.new shortcut)
- Programmatically via the API. Please see the docs here: https://simplescraper.io/docs/api-guide#post-smart-extract
Dashboard Usage
- Navigate to https://simplescraper.io/new
- In the top input field, enter the URL of the page you wish to extract data from
- In the bottom input field, enter a data schema (a comma-separated list of the properties you wish to extract)
- Click the 'Extract Data' button
- After a few seconds, the data will be returned in CSV and JSON format
- Click the 'Save as a scrape recipe' button to convert the smart extraction into a regular scrape recipe, allowing you to scrape at scale using Simplescraper
API Usage
- API docs can be found here: https://simplescraper.io/docs/api-guide#post-smart-extract.
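As a rough illustration of what a programmatic call might look like, here is a minimal Python sketch that builds (but does not send) a JSON POST request. The endpoint address and the `url` and `schema` field names are assumptions made for illustration, and `prepare_smart_extract_request` is a hypothetical helper. Authentication is left out because this page does not describe it. The API docs linked above are the authority on the real request format.

```python
import json
import urllib.request


def prepare_smart_extract_request(endpoint, url, schema):
    """Build (but do not send) a JSON POST request.

    Assumptions: the endpoint address and the "url" / "schema" field
    names are placeholders; check the API docs for the real ones.
    Authentication is omitted because this page does not describe it.
    """
    payload = json.dumps({"url": url, "schema": schema}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = prepare_smart_extract_request(
    "https://api.example.com/smart-extract",  # placeholder, not the real endpoint
    "https://carsandbids.com/",
    "car name, details, time remaining, bid price, location",
)
print(request.get_method(), request.full_url)
```

Once the placeholder endpoint and any required credentials are replaced with the values from the API docs, the request could be sent with `urllib.request.urlopen(request)`.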
Tips on writing your data schema
The schema should be a short, accurate list of the visible data points on the page that you wish to extract.
- For example, if extracting data from a jobs board: 'Role, salary, location, job type, company, description, experience required' is a good schema.
Including a hint of what data is being extracted can increase accuracy.
- For example, instead of 'name, location, price, size, bedrooms, bathrooms', use a schema that references the type of data: 'property name, location, price, size (sqm), bedrooms, bathrooms'.
A schema is not a prompt.
- This works: "title, old price, current price, discount, review count, description, num capsules, rating".
- This does not: "visit the website and extract everything on the page beginning with A".
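The schema format described above is simple enough to assemble in code. The following is a minimal Python sketch, assuming you keep the properties in a list and want the comma-separated string to paste into the schema field. `build_schema` is a hypothetical helper for illustration, not part of Simplescraper.

```python
def build_schema(properties):
    """Join property names into a comma-separated schema string.

    Hypothetical helper for illustration; not part of Simplescraper.
    """
    # Strip stray whitespace and skip empty entries.
    cleaned = [p.strip() for p in properties if p and p.strip()]
    # Drop duplicates while keeping the original order.
    unique = list(dict.fromkeys(cleaned))
    return ", ".join(unique)


schema = build_schema(
    ["property name", "location", "price", "size (sqm)", "bedrooms", "bathrooms"]
)
print(schema)
```

Keeping the properties as a list makes it easier to reuse or adjust a schema across similar pages, while the output stays a plain list of properties rather than a prompt.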
Current limitations of Smart Extract
- Image URLs are not extracted (this will be possible soon)
- The URL must be publicly available and not behind a login
Examples of using Simplescraper Smart Extract
The following is a list of websites and example schemas that return accurate data. Use a similar style of schema on the sites you wish to extract data from.
Website | Schema |
---|---|
https://carsandbids.com/ | car name, details, time remaining, bid price, location |
https://www.nike.com/gb/t/air-force-1-07-next-nature-shoes-67bFZC/DV3808-107 | price, old price, name, num of colors |
https://jobs.careers.microsoft.com/global/en/search | job title, location, remote possible, description |
https://www.realestate.com.au/international/id/bali/ | price aud, price us, location, size (m2) |
https://x.com/emollick | name, @tag, tagline, joined date, link, number of posts, top tweet text |
Notes:
- Simplescraper Smart Extract is in beta and may not be 100% accurate. If you encounter any issues or incorrect data, please contact us via chat.