
Using the Simplescraper API

The Simplescraper API allows you to extract structured data programmatically from web pages. This guide covers how to use the API effectively for your data extraction needs.

If you haven't already created a scrape recipe, please read this guide before continuing.

Authentication

All requests to the API require the inclusion of an API key. The API key should be sent in the Authorization header using the Bearer token format:

js
Authorization: Bearer your_api_key

An API key is provided when you sign up to Simplescraper and can be found on the API tab of each recipe that you create. Code examples of how to include the API key in requests are provided in the sections below.

Request structure


POST Request structure

For POST requests, the API key should be sent in the Authorization header using the Bearer token format and the Content-Type header must be set to application/json.

Here's an example of a POST request to the /recipes/:recipeId/run endpoint:

js
const apikey = 'ap1k3y';
const recipeId = '12345';
const sourceUrl = 'https://example.com';
const url = `https://api.simplescraper.io/v1/recipes/${recipeId}/run`;

const requestBody = {
  sourceUrl: sourceUrl,
  // other optional properties can be included here
  extractMarkdown: false,
  runAsync: false,
};

const response = await fetch(url, {
  method: 'POST',
  headers: {
    'Authorization': `Bearer ${apikey}`,
    'Content-Type': 'application/json'
  },
  body: JSON.stringify(requestBody)
});
// handle response...

Notes:

  • URL encoding of the sourceUrl is not necessary when sent in the request body, as JSON handles special characters automatically.
  • Set the Content-Type header to application/json for POST requests.


GET Request structure

For GET requests, the API key should be sent in the Authorization header using the Bearer token format. Any other properties should be passed as query parameters and URL-encoded where necessary.

As an example, here's a GET request to the /results/:resultsID endpoint:

js
const apikey = 'ap1k3y';
const resultsId = '12345';
const requestUrl = `https://api.simplescraper.io/v1/results/${resultsId}`;

const response = await fetch(requestUrl, {
  headers: {
    'Authorization': `Bearer ${apikey}`
  }
});
// handle response...

Endpoints

The base URL for all API requests is https://api.simplescraper.io/v1, followed by the specific endpoint.

Below is a table of all available endpoints. To use one, append it to the base URL.

| Endpoint | Method | Description |
|---|---|---|
| /recipes/:recipeId | GET | Get information about a recipe |
| /recipes/:recipeId/run | POST | Run the specified scrape recipe and return results or status |
| /recipes/:recipeId/results-latest | GET | Retrieve the most recent result for a specific recipe |
| /recipes/:recipeId/results-history | GET | Retrieve recipe IDs, scrape dates, and number of pages scraped for the last 100 runs of the specified recipe |
| /recipes/:recipeId/batch/urls | POST | Update or replace the batch scraper (crawler) URL list |
| /results/:resultsId | GET | View results or scrape progress for the specified results ID; the status key indicates the progress of the scrape task |
| /smart-extract | POST | Extract data and reusable CSS selectors from any website using AI |

For example, the full URL to run a scrape recipe looks like this:

js
https://api.simplescraper.io/v1/recipes/:recipeId/run

Note: Replace :recipeId and :resultsId with actual IDs when making requests.

Further information about interacting with each endpoint is explained in the 'Endpoint details' section below.

Endpoint details

POST /recipes/:recipeId/run

Send POST requests to this endpoint to initiate a scrape run for a recipe and return data.

If the scrape time exceeds 90 seconds, a JSON object containing a results_id and a status of 'running' is returned. This can then be polled at https://api.simplescraper.io/v1/results/:resultsId.


Request body properties

The following properties should be sent in the request body as JSON:

| Property | Required | Type | Description |
|---|---|---|---|
| sourceUrl | No | URL | The URL of the page to be scraped. This will update the current URL of the recipe. If not included, the existing recipe URL will be used. |
| runAsync | No | Boolean | If true, returns a results ID immediately and runs the scrape task asynchronously. The results ID can then be used to poll the /results/:resultsId endpoint. |
| extractMarkdown | No | Boolean | If true, a markdown version of the page will be extracted in addition to structured data (see https://simplescraper.io/docs/extract-markdown). |
| offset | No | Number | The starting point for results retrieval. |
| useCrawler | No | Boolean | If true, scrape the URLs that have been added to the crawler via the crawler tab of the recipe, instead of the source URL. Default false. |
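
For instance, a request body that sets several of the optional properties might look like this (the values are illustrative):

json
{
  "sourceUrl": "https://example.com/products",
  "runAsync": true,
  "extractMarkdown": true,
  "useCrawler": false
}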

Example POST request
js
async function runRecipe(apikey, recipeId, sourceUrl) {
  
  const url = `https://api.simplescraper.io/v1/recipes/${recipeId}/run`;
  
  const requestBody = {
    sourceUrl: sourceUrl,
    runAsync: false
  };

  try {
    const response = await fetch(url, {
      method: 'POST',
      headers: {
        'Authorization': `Bearer ${apikey}`,
        'Content-Type': 'application/json'
      },
      body: JSON.stringify(requestBody)
    });
    const data = await response.json();
    console.log(data);
  } catch (error) {
    console.error('error:', error);
  }
}

// call function
runRecipe('YOUR_API_KEY', 'YOUR_RECIPE_ID', 'https://example.com');

Notes:

GET requests to /recipes/:recipeId/run are also supported for systems that can't make POST requests; however, POST is recommended whenever possible. When sending a GET request, pass properties as URL parameters and ensure proper URL encoding of the sourceUrl.


Example GET request
js
async function runRecipe(apiKey, recipeId, sourceUrl) {
  const url = `https://api.simplescraper.io/v1/recipes/${recipeId}/run?sourceUrl=${encodeURIComponent(sourceUrl)}`;

  try {
    const response = await fetch(url, {
      headers: {
        'Authorization': `Bearer ${apiKey}`
      }
    });
    const data = await response.json();
    console.log(data);
  } catch (error) {
    console.error('error:', error);
  }
}

// call function
runRecipe('YOUR_API_KEY', 'YOUR_RECIPE_ID', 'https://example.com');

Response structure
| Property | Example Value | Explanation | Options/Types |
|---|---|---|---|
| recipe_id | "rtJjthGverod4EQkt4t4d" | ID of the recipe being scraped | String |
| results_id | "pAioZevQJaqpjod4EQkd" | Unique identifier for the results of the scrape task | String |
| name | "Example recipe" | Name of the recipe | String |
| url | "https://example.com" | Source URL of the recipe | String (valid URL) |
| date_scraped | "2024-08-22T09:41:00.000Z" | Start time of the scrape operation | String (ISO 8601 date format) |
| status | "completed" | Current status of the scrape job | String: "completed", "failed", "running" |
| data | [...] | Main payload of scraped data | Array of objects (structure depends on scrape target) |
| screenshots | [{ "url_uid": 1, "url": "https://...", "screenshot": "https://..." }] | Screenshots of each page scraped | Array of objects |
| errors | [{ "url": "", "page_message": "cannot find element", "response_code": 200, "screenshot": "" }] | Error details for pages that did not return data successfully | Array of objects |

Example response (no timeout)
js
{
  "recipe_id": "rtJjthGverod4EQkt4t4d",
  "results_id": "pAioZevQJaqpjod4EQkd",
  "name": "Example scrape recipe",
  "url": "https://example.com",
  "date_scraped": "2024-08-22T09:41:00.000Z",
  "status": "completed",
  "status_code": 200,
  "data": [...],
  "screenshots": [
    {
      "url_uid": 1,
      "url": "https://...",
      "screenshot": "https://..."
    }
  ],
  "errors": [
    {
      "url_uid": 1,
      "url": "https://...",
      "page_message": "",
      "response_code": "",
      "screenshot": "https://..."
    }
  ]
}

Example response (timeout)
js
{
  "status": "running",
  "results_id": "r4t9iyofr234rtr9j",
  "message": "The task is still running. Please check status at https://api.simplescraper.io/v1/results/r4t9iyofr234rtr9j or await webhook notification if configured."
}



GET /results/:resultsId

Get results for a particular scrape task based on the result ID.

Check the status property for a value of 'completed' to determine whether the task has finished. A status of 'running' indicates the task is still in progress.


Parameters
| Parameter | Required | Description |
|---|---|---|
| apikey | Yes | The API key for user authentication. |
| limit | No | The maximum number of results to return. |
| offset | No | The starting point for results retrieval. |
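
Since this is a GET endpoint, limit and offset are passed as query parameters. For example, to page through a large result set (a sketch using the parameters from the table above):

js
// fetch results 101-200 of a large scrape run
const url = `https://api.simplescraper.io/v1/results/${resultsId}?limit=100&offset=100`;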

Example request
js
async function getResults(apiKey, resultsId) {
  const url = `https://api.simplescraper.io/v1/results/${resultsId}`;

  try {
    const response = await fetch(url, {
      headers: {
        'Authorization': `Bearer ${apiKey}`
      }
    });
    const data = await response.json();
    console.log(data);
  } catch (error) {
    console.error('error:', error);
  }
}

// Usage
getResults('YOUR_API_KEY', 'YOUR_RESULTS_ID');

Example response
js
// same as a successful call to /recipes/:recipeId/run
{
  "recipe_id": "rtJjthGverod4EQkt4t4d",
  "results_id": "pAioZevQJaqpjod4EQkd",
  "name": "Example scrape recipe",
  "url": "https://example.com",
  "date_scraped": "2024-08-22T09:41:00.000Z",
  "status": "completed",
  "data": [...],
  "screenshots": [...],
  "errors": [...]
}


POST /recipes/:recipeId/batch/urls

Update or replace the batch scraper (crawler) URLs. The endpoint allows adding new URLs or replacing existing ones in the batch collection. Use the batch_mode parameter to specify whether to append or replace URLs.

The endpoint returns the result of the operation, including success status, a summary of new and existing URLs, and any invalid URLs with detailed error information.

Note that by default, when running a recipe, the Simplescraper API does not use the batch scraper (crawler) unless the useCrawler flag is specified. See '/recipes/:recipeId/run' endpoint section for more details.
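
As a sketch of how the two endpoints combine, a run that scrapes the batch (crawler) URLs rather than the recipe's source URL might look like this (runAsync is used here because large batches can exceed the 90-second window):

js
const response = await fetch(`https://api.simplescraper.io/v1/recipes/${recipeId}/run`, {
  method: 'POST',
  headers: {
    'Authorization': `Bearer ${apiKey}`,
    'Content-Type': 'application/json'
  },
  // useCrawler: scrape the batch (crawler) URLs instead of the source URL
  // runAsync: return a results_id immediately and poll /results/:resultsId
  body: JSON.stringify({ useCrawler: true, runAsync: true })
});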


Parameters
| Parameter | Required | Description |
|---|---|---|
| apikey | Yes | The API key for user authentication. |
| batch_mode | No | Operation mode: either append (default) or replace. |
| batch_urls | Yes | An array of URLs to be processed in the batch operation. |

Example request
js
async function addBatchUrls(apiKey, recipeId, batchMode, urls) {
  const url = `https://api.simplescraper.io/v1/recipes/${recipeId}/batch/urls`;
  
  try {
    const response = await fetch(url, {
      method: 'POST',
      headers: {
        'Authorization': `Bearer ${apiKey}`,
        'Content-Type': 'application/json'
      },
      body: JSON.stringify({
        batch_mode: batchMode,
        batch_urls: urls
      })
    });
    const data = await response.json();
    console.log(data);
  } catch (error) {
    console.error('error:', error);
  }
}

// Usage
addBatchUrls('YOUR_API_KEY', 'YOUR_RECIPE_ID', 'append', ['https://zombo.com/page1', 'https://zombo.com/page2']);

Example response
js
{
  "success": true, // true, false 'partial'
  "summary": {
    "totalExisting": 150,
    "totalNew": 30,
    "totalErrors": 2
  },
  "data": {
    "newUrls": [
      "https://zombo.com/page1",
      "https://zombo.com/page2"
    ],
    "errorDetails": [
      {
        "url": "https://zombo.com/badurl",
        "message": "Invalid URL format",
        "type": "INVALID_URL_FORMAT"
      },
      {
        "url": "https://zombo.com/anotherbadurl",
        "message": "Invalid URL format",
        "type": "INVALID_URL_FORMAT"
      }
    ]
  }
}

Notes
  • batch_mode:
    • append mode will add new URLs to the existing list, with a cap of 5000 URLs.
    • replace mode will clear existing URLs and replace them with the provided list, enforcing the same limit.
  • Validation: URLs are validated for format, length, protocol, and TLD correctness. Invalid URLs are listed in the response's errorDetails array.
  • Limit: A maximum of 5000 URLs can be stored in a batch at any time. Attempting to exceed this limit will result in trimming of the excess URLs.
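
For example, to discard the current list and start over, the addBatchUrls helper above can be called in replace mode:

js
// clears the existing batch URLs and stores only the new list
addBatchUrls('YOUR_API_KEY', 'YOUR_RECIPE_ID', 'replace', ['https://zombo.com/page1']);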


POST /smart-extract

Smart Extract uses AI to accurately extract data and generate reusable CSS selectors from any website using only a list of the data points you need (data schema). Read more about this feature here: https://simplescraper.io/docs/smart-data-extract.


Request body properties
| Property | Required | Description |
|---|---|---|
| url | Yes | The URL of the page to be scraped. |
| schema | Yes | A comma-separated list of properties to extract. |

Example request
js
async function runSmartExtract(apikey, url, schema) {
  const endpoint = 'https://api.simplescraper.io/v1/smart-extract';
  
  const requestBody = {
    url: url,
    schema: schema
  };

  try {
    const response = await fetch(endpoint, {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        'Authorization': `Bearer ${apikey}`
      },
      body: JSON.stringify(requestBody)
    });
    const data = await response.json();
    console.log(data);
  } catch (error) {
    console.error('error:', error);
  }
}

// usage
let url = 'https://example.com/product-page';
let schema = 'product name, price, description, rating';

runSmartExtract('YOUR_API_KEY', url, schema);

Example response
json
{
  "extract_uid": "12345",
  "results_uid": "54321",
  "date_completed": "2024-01-01T00:00:00.000Z",
  "data": [
    {
      "product_name": "Amazon apples",
      "price": "USD $1,400",
      "description": "A great selection of Gala, Fuji, Honeycrisp, Golden Delicious & more",
      "rating": "4.5",
    },
  ],
  "selectors": [
    {
      "name": "product_name",
      "selector": "div.displayProduct",
      "uid": "e5de-60cd-4386-917b"
    },
    {
      "name": "price",
      "selector": "div.displayListingPrice",
      "uid": "ea74-d519-4508-b5ca"
    },
    {
      "name": "description",
      "selector": "div.description",
      "uid": "4cc4-d406-4dd1-887f"
    },
    {
      "name": "rating",
      "selector": "div.feature-item:nth-child(2)",
      "uid": "c720-a90e-44dd-9eb4"
    }
  ],
  "status": "completed"
}


Request timeouts

Requests to the /recipes/:recipeId/run endpoint will time out after 90 seconds, and a JSON object containing a results_id and a status of 'running' will be returned.

The results_id can then be used to poll the /results/:resultsId endpoint for a status of 'completed'. In general, most scrape requests are completed within 90 seconds.

Error Handling

The API uses standard HTTP response codes to indicate the success or failure of an API request. In general:

  • Codes in the 2xx range indicate success
  • Codes in the 4xx range indicate an error resulting from the provided information or the account (e.g., a required parameter was missing, insufficient credits, etc.)
  • Codes in the 5xx range indicate an error with our servers

In addition to the HTTP status code, all error responses include a JSON object in the response body with an error key containing a human-readable error message.

Error Response Format

All error responses have the following structure:

json
{
    "error": {
        "type": "api-key-not-included",
        "message": "API key was not included in the request."
    }
}

Error Codes

| Error Type | HTTP Status Code | Error Message |
|---|---|---|
| Successful call | 200 | |
| api-key-not-included | 403 | API key was not included in the request. |
| out-of-credits | 402 | Credits expired. |
| out-of-api-reads | 402 | API reads expired. |
| results-not-found | 404 | Results not found. Ensure the results ID is correct and the recipe was run. |
| recipe-not-found | 404 | The recipe was not found. Please make sure it exists and that the recipe ID is correct. |
| user-not-found | 404 | User not found. |
| request-timeout | 408 | The request timed out after 5 minutes. |
| invalid-request | 400 | Invalid request format. |
| invalid-value | 400 | Invalid value provided. |
| rate-limit-exceeded | 429 | You have exceeded the rate limit. |
| method-not-allowed | 405 | This method is not allowed for this endpoint. |
| default | 500 | An unexpected error occurred. |
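
For transient errors such as rate-limit-exceeded (429), one reasonable pattern is to retry with an increasing delay. A minimal sketch (the delay values are illustrative, not documented limits):

js
async function fetchWithRetry(url, options, maxRetries = 3) {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    const response = await fetch(url, options);
    if (response.status !== 429) return response;
    // back off before retrying; the delay doubles on each attempt
    const delayMs = 2000 * 2 ** attempt;
    console.log(`Rate limited, retrying in ${delayMs / 1000}s...`);
    await new Promise(resolve => setTimeout(resolve, delayMs));
  }
  throw new Error('Rate limit still exceeded after retries.');
}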

For persistent errors or issues not covered here, please contact customer support for assistance.

Code examples

Calling multiple URLs and handling timeout/async

js
// function to initiate scraping for a single URL
async function runScrapeForUrl(apiKey, recipeId, sourceUrl) {
    const url = `https://api.simplescraper.io/v1/recipes/${recipeId}/run`;

    const requestBody = {
        sourceUrl: sourceUrl,
        runAsync: false
    };

    try {
        const response = await fetch(url, {
            method: 'POST',
            headers: {
                'Content-Type': 'application/json',
                'Authorization': `Bearer ${apiKey}`
            },
            body: JSON.stringify(requestBody)
        });

        const data = await response.json();

        if (data.status === 'running') {
            return { status: 'running', resultsId: data.results_id };
        } else if (data.status === 'error') {
            return { status: 'error', error: data.error };
        } else {
            return data;
        }
    } catch (error) {
        return { status: 'error', error: error.message };
    }
}

// main function to process multiple URLs
async function main() {
    const apiKey = 'your-api-key';
    const recipeId = 'your-recipe-id';
    const urls = [
        'https://example.com/page1',
        'https://example.com/page2',
        'https://example.com/page3'
    ];

    for (const url of urls) {
        let result = await runScrapeForUrl(apiKey, recipeId, url);

        // if scrape is running asynchronously, poll for results
        if (result.status === 'running') {
            result = await pollForResults(result.resultsId, apiKey); // example covered below
        } else if (result.status === 'error') {
            console.error(`Error scraping ${url}:`, result.error);
        } else if (result.status === 'completed') {
            console.log(`Successfully scraped ${url}`);
            // process successful result here
        }
    }
}

main();

Polling for Results

To handle tasks that are still processing, implement a polling mechanism on the client side. Use a sensible interval of a few seconds to avoid overloading the endpoint.

js
// calling the v1/results/:resultsId endpoint

async function pollForResults(resultsId, apikey, maxAttempts = 10) {
  const url = `https://api.simplescraper.io/v1/results/${resultsId}`;
  
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      const response = await fetch(url, {
        headers: {
          'Authorization': `Bearer ${apikey}`
        }
      });
      const data = await response.json();

      if (data.status === 'completed') {
        console.log('Scraping completed successfully:', data);
        return data;
      }

      console.log('Job still processing, retrying in 5 seconds...');
      await new Promise(resolve => setTimeout(resolve, 5000));
    } catch (error) {
      console.error('Error polling for result:', error);
    }
  }

  console.error('Max polling attempts reached. Please check the job status manually.');
  return null;
}

// call function
async function main() {
  const result = await pollForResults('your-results-id', 'your-api-key');
  if (result) {
    console.log('Final result:', result);
  } else {
    console.log('Failed to retrieve results');
  }
}
main();

Handling Errors

When working with our API, we recommend checking both the HTTP status code and the presence of an error key in the response body. Here's an example of how you might handle errors in your code:

js
async function makeApiRequest(apikey, endpoint) {
  try {
    const response = await fetch(`https://api.simplescraper.io/v1/${endpoint}`, {
      headers: {
        'Authorization': `Bearer ${apikey}`
      }
    });

    if (!response.ok) {
      const errorData = await response.json();
      throw new Error(errorData.error?.message || `HTTP error status: ${response.status}`);
    }

    const data = await response.json();
    return data; // process successful response
  } catch (error) {
    console.error('There was an error:', error.message);
    throw error; // re-throw so the caller can handle it as well
  }
}

// call function
async function main() {
  try {
    const data = await makeApiRequest('your-api-key', 'recipes/123456/run');
    console.log('API response:', data);
  } catch (error) {
    console.error('Error in main:', error.message);
  }
}

main();