Deep scraping URLs

Note: Before diving in, you may want to read the Crawling a List of URLs guide, as this section involves that concept.

Deep Scraping in SimpleScraper lets you scrape a list of links from a main webpage and then collect data from the subpage behind each link. This is useful for extracting data from platforms like job boards or hotel listings, where each listing's details are on an individual subpage.

To accomplish this, you'll use two recipes: one recipe to scrape the links and another recipe to extract the data behind each link.

How to Use Deep Scraping

  1. First Recipe: Create and save a recipe that scrapes the complete list of links you're interested in. For example, if you're scraping job listings, this recipe scrapes the link to each job post.

  2. Second Recipe: Create and save another recipe that scrapes the content behind a single link from this list. For instance, if you're scraping job listings, this second recipe would extract the job information (salary, job description, requirements, etc.) from a single job description page.

  3. Run and Import: Run the first recipe via cloud scraping to get a list of links. Import these links into your second recipe via the second recipe's 'Crawl' tab. Then run the second recipe to scrape the content from each link.
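The two-recipe flow above can be pictured as a two-stage loop: one step that produces links, and one step that is applied to every link. The sketch below is only a conceptual model and not SimpleScraper's implementation; `deep_scrape`, `list_links`, `scrape_page`, and the sample pages are hypothetical names and made-up data used for illustration.

```python
from typing import Callable, Iterable


def deep_scrape(
    list_links: Callable[[], Iterable[str]],
    scrape_page: Callable[[str], dict],
) -> list[dict]:
    """Apply scrape_page to every link that list_links returns."""
    results = []
    for url in list_links():          # stage 1: the "first recipe" yields links
        record = scrape_page(url)     # stage 2: the "second recipe" reads one subpage
        results.append({"url": url, **record})
    return results


# Made-up stand-ins for the two recipes, so the sketch runs without a network.
FAKE_PAGES = {
    "https://example.com/jobs/1": {"title": "Backend Engineer"},
    "https://example.com/jobs/2": {"title": "Data Analyst"},
}


def fake_list_links() -> list[str]:
    return list(FAKE_PAGES)


def fake_scrape_page(url: str) -> dict:
    return FAKE_PAGES[url]


jobs = deep_scrape(fake_list_links, fake_scrape_page)
```

Each entry in `jobs` pairs a link from the first stage with the data extracted from its subpage, which mirrors what the second recipe produces after the links are imported.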

Note: When extracting a property that includes a link, SimpleScraper will automatically save this link to a column named 'propertyname_link'.
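If you export the first recipe's results and want to handle the links yourself, the naming rule in the note above can be applied as in the sketch below. It assumes the results are available as CSV text and that the property is named `title`; the helper `extract_links` and the sample rows are hypothetical, and how SimpleScraper formats property names that contain spaces is not covered here.

```python
import csv
import io


def extract_links(csv_text: str, property_name: str) -> list[str]:
    """Collect unique, non-empty values from the '<property_name>_link' column."""
    column = f"{property_name}_link"
    reader = csv.DictReader(io.StringIO(csv_text))
    if reader.fieldnames is None or column not in reader.fieldnames:
        raise KeyError(f"column {column!r} not found in the export")
    links: list[str] = []
    seen: set[str] = set()
    for row in reader:
        url = (row[column] or "").strip()
        if url and url not in seen:   # skip blanks and repeated links
            seen.add(url)
            links.append(url)
    return links


# Made-up export: a 'title' property and its automatically added link column.
SAMPLE_CSV = (
    "title,title_link\n"
    "Backend Engineer,https://example.com/jobs/1\n"
    "Data Analyst,https://example.com/jobs/2\n"
    "Backend Engineer,https://example.com/jobs/1\n"
)

links = extract_links(SAMPLE_CSV, "title")
```

Removing duplicates here is a design choice for the sketch, so that a link appearing twice in the listing is only visited once.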

Example: Shopify App Store Deep Scraping

To scrape app links and their contact details from the Shopify App Store, you'll use two SimpleScraper recipes:

  • First, create a recipe to scrape the list of app links available on https://apps.shopify.com/.

  • Then create a second recipe focused on extracting contact details from a single app page.

  • Save the first recipe as 'Shopify - app links' and the second as 'Shopify - contact info'.

The video below shows how to select elements for the first recipe.

Once both recipes are saved, start by running 'Shopify - app links'. The result will be a column filled with the extracted links.

Next, switch to the 'Shopify - contact info' recipe, navigate to the 'Crawl' tab, and choose 'Import URLs'.

From the list, select 'Shopify - app links' and the property containing the links (in this example, 'App link'). Confirm the import, and you're set to begin deep scraping.

Now all you need to do is click 'Run recipe' on the second recipe, and the contact details for each app will be scraped.

The video below shows the steps to import the links from the first recipe into the second recipe.

Automatically running an imported recipe

When running a recipe that imports links from another recipe, you may wish to run the imported recipe first so that the most recent links are scraped.

To do this automatically, navigate to the 'Crawl' tab of the recipe that is importing the links (the 'second recipe') and turn on the 'run imported recipe first' option. This ensures the link-collecting recipe runs automatically before the individual pages are scraped, keeping your scraped data current.

Note:

  • The 'run imported recipe first' option only becomes visible on the 'Crawl' tab once a recipe has been imported.
  • 'run imported recipe first' does not apply to requests made via the API. It's best suited for recipes that are scheduled or run manually from the dashboard. To recreate this behavior via the API, call the first recipe and then the second recipe, in that order.
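
The sequential API approach from the note above could be sketched as follows. The actual API request is not shown in this guide, so the sketch takes it as a parameter: `run_recipe` is a hypothetical function that you would implement against the SimpleScraper API, and it is assumed to block until the run has finished. The recipe IDs and `fake_run_recipe` are placeholders for illustration only.

```python
from typing import Callable


def run_recipe_chain(
    run_recipe: Callable[[str], list[dict]],
    links_recipe_id: str,
    details_recipe_id: str,
) -> list[dict]:
    """Run the link-collecting recipe, then the recipe that imports its links."""
    # Step 1: refresh the list of links. The result is not used directly here,
    # because the second recipe already imports links from the first.
    run_recipe(links_recipe_id)
    # Step 2: only after step 1 has completed, scrape the page behind each link.
    return run_recipe(details_recipe_id)


# Stand-in used only to demonstrate the call order; replace with a real API client.
call_log: list[str] = []


def fake_run_recipe(recipe_id: str) -> list[dict]:
    call_log.append(recipe_id)
    return [{"recipe": recipe_id}]


rows = run_recipe_chain(fake_run_recipe, "links-recipe-id", "details-recipe-id")
```

Passing the API call in as a function keeps the ordering logic separate from the request details, so the same wrapper works whatever HTTP client or authentication you use.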