Crawling lists of URLs

The crawler allows you to scrape up to 5000 URLs at a time with SimpleScraper. This method is recommended as it is faster than navigating through pages individually.

To use the crawler, save a recipe as normal and then click the 'Crawl' tab and paste the URLs that you wish to scrape into the text area. SimpleScraper will detect that there are URLs in the crawler and will scrape using these URLs instead of the original URL and pagination settings.

If you wish to scrape multiple pages but only have the URL of the first page, navigate to the second page of the website and note the URL. There should be a structured pattern such as 'page=2', 'p=2', 'page/2' etc. This value should be incremented for each page that you intend to scrape. For example, if you're scraping 100 pages then the last URL will contain a value like 'page=100'. To help with this you can use the Generate URLs tool as outlined below.
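To illustrate the idea, here is a minimal Python sketch (not part of SimpleScraper) that builds a list of paginated URLs by incrementing a 'page=' value. The base URL is a hypothetical example:

```python
# Illustrative sketch: build paginated URLs by incrementing the page
# number in the query string. 'example.com' is a placeholder site.
base = "https://example.com/results?page={n}"

# Pages 1 through 100, matching the 'page=100' example above
urls = [base.format(n=n) for n in range(1, 101)]

print(urls[0])   # first page: ...?page=1
print(urls[-1])  # last page:  ...?page=100
```

The same approach works for other patterns such as 'p=2' or 'page/2'; only the template string changes.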

Generating URLs

SimpleScraper includes a URL generator which allows you to easily create a list of URLs for multiple pages on a website, which can then be scraped in one go.

To use it, follow these steps:

  • Identify the base URL that you intend to replicate for multiple pages. In the video example above, we use https://pubmed.ncbi.nlm.nih.gov/?term=zinc

  • Navigate to the second page of that URL so that we can determine the pagination structure of the URL. In the video example we click to the second page and the URL changes to https://pubmed.ncbi.nlm.nih.gov/?term=zinc&page=2. Notice that this page is represented as page=2 in the URL. This tells us that the URL of each subsequent page will follow the format page=3, page=4 etc. Copy this URL.

  • Navigate to the 'Crawl' tab of your scrape recipe and click the 'Generate URLs' button. Enter the URL of the second page and replace the page number with an [x]. So in our example https://pubmed.ncbi.nlm.nih.gov/?term=zinc&page=2 becomes https://pubmed.ncbi.nlm.nih.gov/?term=zinc&page=[x]

  • In the input fields enter the following:

    • Enter the 'start value' (which is usually 1 if you wish to begin scraping from the first page)
    • Enter the 'end value' representing the last page you wish to scrape (ensure the website has this many pages)
    • Enter the 'increment' value which is how much each page number should increase by (this is typically 1 but can be 10, 20, 100 etc depending on the website)
  • Click 'Generate'

  • Once the URLs are generated, copy the list and paste them into the input field on the 'Crawl' tab. You can now click 'Run recipe' to scrape each of the individual URLs.
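The steps above can be sketched in Python as an approximation of what the Generate URLs tool does (this is not SimpleScraper's actual implementation): substitute each page number for the [x] placeholder, stepping from the start value to the end value by the increment.

```python
# Approximate sketch of the 'Generate URLs' tool: replace the [x]
# placeholder with each value from start to end, stepping by increment.
def generate_urls(template: str, start: int, end: int, increment: int = 1) -> list[str]:
    return [template.replace("[x]", str(n)) for n in range(start, end + 1, increment)]

# Using the PubMed example from the steps above, pages 1 to 5
urls = generate_urls("https://pubmed.ncbi.nlm.nih.gov/?term=zinc&page=[x]", 1, 5)
for url in urls:
    print(url)
```

An increment other than 1 is useful for sites whose URLs count results rather than pages, e.g. 'start=0', 'start=20', 'start=40' for 20 results per page.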

If you'd like help with any step, contact us via chat.

Alternatively you can import a list of URLs that you scraped with a different recipe as explained in this Deep Scraping guide.