Deep scraping URLs

Note: this section involves crawling a list of URLs, so be sure to also read the Crawling a List of URLs guide

You may find yourself wanting to scrape a list of URLs and then the contents behind each of those individual URLs. For example, a job board or hotel listings website where we aim to extract the content behind every link in a list of results.

Here's how to accomplish this using Simplescraper:

  1. Create a recipe that scrapes the full list of URLs
  2. Create a second recipe that scrapes the contents of one of these URLs. If we wanted to scrape a list of job descriptions, for example, this recipe would contain instructions for scraping a single job description page
  3. Run the first recipe so that its results consist of a list of URLs, then import these URLs into the second recipe via that recipe's crawler page and run the second recipe to scrape each page

So we use two recipes. The first recipe returns a list of URLs and the second recipe scrapes each of those individual URLs.
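The data flow behind this two-recipe pattern can be sketched in plain Python. Everything here is an illustrative stand-in, not a Simplescraper API: real recipes are configured in the UI, and the placeholder `scrape_detail` function simply records the URL it was given rather than fetching a page.

```python
from html.parser import HTMLParser

class LinkRecipe(HTMLParser):
    """Stand-in for recipe 1: collect the href of every link in a results list."""
    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.urls.append(href)

def scrape_listing(html):
    parser = LinkRecipe()
    parser.feed(html)
    return parser.urls

def scrape_detail(url):
    """Stand-in for recipe 2: scrape one detail page. A real recipe would
    fetch the page and extract the selected elements; this placeholder
    just records which URL it was given."""
    return {"url": url, "description": "..."}

listing_html = (
    '<ul><li><a href="/jobs/1">Job 1</a></li>'
    '<li><a href="/jobs/2">Job 2</a></li></ul>'
)
urls = scrape_listing(listing_html)           # step 1: a list of URLs
results = [scrape_detail(u) for u in urls]    # step 2: scrape each URL
```

The first "recipe" yields only URLs; the second never needs to know how that list was built, which is why the two steps stay decoupled.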

A detailed example:

  • Let's say we want to extract the contact details for all of the apps on the Shopify app store
  • We create two recipes: one to scrape the URLs of every app page at https://apps.shopify.com/ and one to scrape the contact details on a single app page. Below you can see us select the elements for each page
  • We save these recipes and name them 'Shopify - app links' and 'Shopify - contact info', respectively
  • Now that we've saved both recipes, we first run 'Shopify - app links' to retrieve the list of URLs
  • Once that's completed, open 'Shopify - contact info', go to the Crawler tab and click 'import URLs'. Select 'Shopify - app links' from the list, choose the property name that contains the URLs ('App link') and confirm that we want to import them
  • The URLs will be imported, and we can now click run to scrape each URL in the crawler, returning the contact details of every app on the Shopify store
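The 'import URLs' step above amounts to selecting one named property out of the first recipe's results. As a sketch, assuming result rows shaped like the ones below (only the 'App link' property name comes from the example; the other values are made up for illustration):

```python
# Hypothetical results from the 'Shopify - app links' recipe.
app_links_results = [
    {"App name": "Example App A", "App link": "https://apps.shopify.com/example-a"},
    {"App name": "Example App B", "App link": "https://apps.shopify.com/example-b"},
]

# Importing with property 'App link' selects that field from every row,
# producing the URL list the second recipe will crawl.
imported_urls = [row["App link"] for row in app_links_results]
```

This is why choosing the correct property name matters: any other column would feed non-URL values into the crawler.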

Automatically running an imported recipe

When running a recipe that imports URLs from another recipe, you may wish to run the imported recipe first so that the most recent URLs are scraped.

To do this automatically, navigate to the Crawler tab of the recipe that is importing the URLs and toggle the 'run imported recipe first' option. Now any time you run that recipe, it will first run the imported recipe so that the latest URLs are always scraped.

Note that this option only becomes visible on the Crawler tab once a recipe has been imported.
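The effect of the toggle can be sketched as a conditional pre-run, assuming recipes are plain callables: `url_recipe` returns a fresh list of URLs and `detail_recipe` scrapes one URL. All names here are illustrative, not Simplescraper APIs.

```python
def run_chained(detail_recipe, url_recipe, cached_urls, run_imported_first):
    # Toggle on: re-run the imported recipe so the latest URLs are scraped.
    # Toggle off: scrape whatever URLs were imported last time.
    urls = url_recipe() if run_imported_first else cached_urls
    return [detail_recipe(u) for u in urls]

# Example: with the toggle on, the stale cached URL is ignored.
fresh = lambda: ["https://example.com/app/new"]
scrape = lambda url: {"url": url}
results = run_chained(scrape, fresh, ["https://example.com/app/old"], True)
```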