Using the Simplescraper API
The Simplescraper API allows you to extract structured data programmatically from web pages. This guide covers how to use the API effectively for your data extraction needs.
If you haven't already created a scrape recipe, please read this guide before continuing.
Authentication
All requests to the API require the inclusion of an API key. The API key should be sent in the `Authorization` header using the `Bearer` token format:
Authorization: Bearer your_api_key
An API key is provided when you sign up to Simplescraper and can be found on the API tab of each recipe that you create. Code examples of how to include the API key in requests are provided in the sections below.
Request structure
POST Request structure
For POST requests, the API key should be sent in the `Authorization` header using the `Bearer` token format, and the `Content-Type` header must be set to `application/json`.
Here's an example of a POST request to the `/recipes/:recipeId/run` endpoint:
const apikey = 'ap1k3y';
const recipeId = '12345';
const sourceUrl = 'https://example.com';
const url = `https://api.simplescraper.io/v1/recipes/${recipeId}/run`;

const requestBody = {
  sourceUrl: sourceUrl,
  // other optional properties can be included here
  extractMarkdown: false,
  runAsync: false,
};

const response = await fetch(url, {
  method: 'POST',
  headers: {
    'Authorization': `Bearer ${apikey}`,
    'Content-Type': 'application/json'
  },
  body: JSON.stringify(requestBody)
});
// handle response...
Notes:
- URL encoding of the `sourceUrl` is not necessary when sent in the request body, as JSON handles special characters automatically.
- Set the `Content-Type` header to `application/json` for POST requests.
GET Request structure
For GET requests, the API key should be sent in the `Authorization` header using the `Bearer` token format. Other properties should be passed as query parameters, and encoded where necessary.
As an example, here's a GET request to the `/results/:resultsId` endpoint:
const apikey = 'ap1k3y';
const resultsId = '12345';
const requestUrl = `https://api.simplescraper.io/v1/results/${resultsId}`;

const response = await fetch(requestUrl, {
  headers: {
    'Authorization': `Bearer ${apikey}`
  }
});
// handle response...
Endpoints
The base URL for all API requests is https://api.simplescraper.io/v1, followed by the specific endpoint. The table below lists all available endpoints; append them to the base URL.
Endpoint | Method | Description |
---|---|---|
/recipes/:recipeId | GET | Get information about a recipe |
/recipes/:recipeId/run | POST | Run the specified scrape recipe and return results or status |
/recipes/:recipeId/results-latest | GET | Retrieve the most recent result for a specific recipe |
/recipes/:recipeId/results-history | GET | Retrieve recipe IDs, scrape dates, and number of pages scraped for the last 100 runs of the specified recipe |
/recipes/:recipeId/batch/urls | POST | Update or replace batch scraper (crawler) URL list |
/results/:resultsId | GET | View results or scrape progress for the specified results ID. The status key indicates the progress of the scrape task |
/smart-extract | POST | Extract data and reusable CSS selectors from any website using AI |
For example, the full URL to run a scrape recipe looks like this:
https://api.simplescraper.io/v1/recipes/:recipeId/run
Note: Replace `:recipeId` and `:resultsId` with actual IDs when making requests.
Further information about interacting with each endpoint is explained in the 'Endpoint details' section below.
Endpoint details
POST /recipes/:recipeId/run
Send POST requests to this endpoint to initiate a scrape run for a recipe and return data.
If the scrape time exceeds 90 seconds, a JSON object containing a `results_id` and a status of `'running'` is returned, which can then be polled at https://api.simplescraper.io/v1/results/:resultsId.
Request body properties
The following properties should be sent in the request body as JSON:
Property | Required | Type | Description |
---|---|---|---|
sourceUrl | No | URL | The URL of the page to be scraped. This will update the current URL of the recipe. If not included, the existing recipe URL is used. |
runAsync | No | Boolean | If true, returns a result ID immediately and runs the scrape task asynchronously. The result ID can then be used to poll the /results/:resultsId endpoint. |
extractMarkdown | No | Boolean | If true, a markdown version of the page will be extracted in addition to structured data (see https://simplescraper.io/docs/extract-markdown) |
offset | No | Number | The starting point for results retrieval. |
useCrawler | No | Boolean | If true, scrape URLs that have been added to the crawler via the crawler tab of a recipe, instead of the source URL. Default false. |
Example POST request
async function runRecipe(apikey, recipeId, sourceUrl) {
const url = `https://api.simplescraper.io/v1/recipes/${recipeId}/run`;
const requestBody = {
sourceUrl: sourceUrl,
runAsync: false
};
try {
const response = await fetch(url, {
method: 'POST',
headers: {
'Authorization': `Bearer ${apikey}`,
'Content-Type': 'application/json'
},
body: JSON.stringify(requestBody)
});
const data = await response.json();
console.log(data);
} catch (error) {
console.error('error:', error);
}
}
// call function
runRecipe('YOUR_API_KEY', 'YOUR_RECIPE_ID', 'https://example.com');
Notes:
GET requests to `/recipes/:recipeId/run` are also supported for systems that can't make POST requests, however POST is recommended whenever possible. When sending a GET request, pass properties as URL parameters and ensure proper URL encoding of the `sourceUrl`.
Example GET request
async function runRecipe(apiKey, recipeId, sourceUrl) {
  const url = `https://api.simplescraper.io/v1/recipes/${recipeId}/run?sourceUrl=${encodeURIComponent(sourceUrl)}`;
  try {
    const response = await fetch(url, {
      headers: {
        'Authorization': `Bearer ${apiKey}`
      }
    });
    const data = await response.json();
    console.log(data);
  } catch (error) {
    console.error('error:', error);
  }
}

// call function
runRecipe('YOUR_API_KEY', 'YOUR_RECIPE_ID', 'https://example.com');
Response structure
Property | Example Value | Explanation | Options/Types |
---|---|---|---|
recipe_id | "rtJjthGverod4EQkt4t4d" | ID of the recipe being scraped | String |
results_id | "pAioZevQJaqpjod4EQkd" | Unique identifier for the results of the scrape task | String |
name | "Example recipe" | Name of the recipe | String |
url | "https://example.com" | source URL of the recipe | String (valid URL) |
date_scraped | "2024-08-22T09:41:00.000Z" | Start time of the scrape operation | String (ISO 8601 date format) |
status | "completed" | Current status of the scrape job | String: "completed", "failed", "running" |
data | [...] | Main payload of scraped data | Array of objects (structure depends on scrape target) |
screenshots | [{ "url_uid": 1, "url": "https://...", "screenshot": "https://" }] | Screenshots of each page scraped | Array of objects |
errors | [ { url: '', page_message: 'cannot find element', response_code: 200, screenshot: '' } ] | Error details for pages that did not return data successfully | Array of objects |
Example response (no timeout)
{
"recipe_id": "rtJjthGverod4EQkt4t4d",
"results_id": "pAioZevQJaqpjod4EQkd",
"name": "Example scrape recipe",
"url": "https://example.com",
"date_scraped": "2024-08-22T09:41:00.000Z",
"status": "completed",
"status_code": 200,
"data": [...],
"screenshots": [
{
"url_uid": 1,
"url": "https://...",
"screenshot": "https://..."
}
],
"errors": [
{
"url_uid": 1,
"url": "https://...",
"page_message": "",
"response_code": "",
"screenshot": "https://..."
}
]
}
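Once a run completes, the `data` array is the main payload, with one object per scraped record. Below is a minimal sketch of consuming a completed response, assuming `result` holds the parsed JSON above; the record fields themselves depend on your recipe.
function processRunResult(result) {
  if (result.status !== 'completed') {
    console.log(`Run not finished yet, status: ${result.status}`);
    return;
  }
  // each item in `data` is one scraped record; its keys depend on the recipe
  for (const record of result.data) {
    console.log(record);
  }
  // surface any per-page failures reported alongside the data
  for (const err of result.errors ?? []) {
    console.warn(`Page ${err.url} failed: ${err.page_message}`);
  }
}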
Example response (timeout)
{
"status": "running",
"results_id": "r4t9iyofr234rtr9j",
"message": "The task is still running. Please check status at https://api.simplescraper.io/v1/results/r4t9iyofr234rtr9j or await webhook notification if configured."
}
GET /results/:resultsId
Full endpoint: https://api.simplescraper.io/v1/results/:resultsId
Get results for a particular scrape task based on the results ID.
Check the `status` property for a value of `completed` to determine whether the task has finished; a status of `running` indicates the task is still in progress.
Parameters
Parameter | Required | Description |
---|---|---|
apikey | Yes | The API key for user authentication. |
limit | No | The maximum number of results to return. |
offset | No | The starting point for results retrieval. |
Example request
async function getResults(apikey, resultsId) {
  const url = `https://api.simplescraper.io/v1/results/${resultsId}`;
  try {
    const response = await fetch(url, {
      headers: {
        'Authorization': `Bearer ${apikey}`
      }
    });
    const data = await response.json();
    console.log(data);
  } catch (error) {
    console.error('error:', error);
  }
}
// Usage
getResults('YOUR_API_KEY', 'YOUR_RESULTS_ID');
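The `limit` and `offset` parameters are passed as query string values. Here's a hedged sketch of paging through a large result set; the page size of 100 is an arbitrary choice for illustration, not an API default.
// fetch one page of results using limit/offset query parameters
async function getResultsPage(apikey, resultsId, limit = 100, offset = 0) {
  const url = `https://api.simplescraper.io/v1/results/${resultsId}?limit=${limit}&offset=${offset}`;
  const response = await fetch(url, {
    headers: {
      'Authorization': `Bearer ${apikey}`
    }
  });
  return response.json();
}

// usage: fetch the second page of 100 results
getResultsPage('YOUR_API_KEY', 'YOUR_RESULTS_ID', 100, 100).then(console.log);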
Example response
// same as a successful call to /recipes/:recipeId/run
{
"recipe_id": "rtJjthGverod4EQkt4t4d",
"results_id": "pAioZevQJaqpjod4EQkd",
"name": "Example scrape recipe",
"url": "https://example.com",
"date_scraped": "2024-08-22T09:41:00.000Z",
"status": "completed",
"data": [...],
"screenshots": [...],
"errors": [...]
}
POST /recipes/:recipeId/batch/urls
Update or replace the batch scraper (crawler) URLs. The endpoint allows adding new URLs or replacing existing ones in the batch collection. Use the `batch_mode` parameter to specify whether to `append` or `replace` URLs.
The endpoint returns the result of the operation, including success status, a summary of new and existing URLs, and any invalid URLs with detailed error information.
Note that by default, when running a recipe, the Simplescraper API does not use the batch scraper (crawler) unless the `useCrawler` flag is specified. See the '/recipes/:recipeId/run' endpoint section for more details.
Parameters
Parameter | Required | Description |
---|---|---|
apikey | Yes | The API key for user authentication. |
batch_mode | No | Operation mode, either `append` (default) or `replace`. |
batch_urls | Yes | An array of URLs to be processed in the batch operation. |
Example request
async function addBatchUrls(apiKey, recipeId, batchMode, urls) {
const url = `https://api.simplescraper.io/v1/recipes/${recipeId}/batch/urls`;
try {
const response = await fetch(url, {
method: 'POST',
headers: {
'Authorization': `Bearer ${apiKey}`,
'Content-Type': 'application/json'
},
body: JSON.stringify({
batch_mode: batchMode,
batch_urls: urls
})
});
const data = await response.json();
console.log(data);
} catch (error) {
console.error('error:', error);
}
}
// Usage
addBatchUrls('YOUR_API_KEY', 'YOUR_RECIPE_ID', 'append', ['https://zombo.com/page1', 'https://zombo.com/page2']);
Example response
{
"success": true, // true, false 'partial'
"summary": {
"totalExisting": 150,
"totalNew": 30,
"totalErrors": 2
},
"data": {
"newUrls": [
"https://zombo.com/page1",
"https://zombo.com/page2"
],
"errorDetails": [
{
"url": "https://zombo.com/badurl",
"message": "Invalid URL format",
"type": "INVALID_URL_FORMAT"
},
{
"url": "https://zombo.com/anotherbadurl",
"message": "Invalid URL format",
"type": "INVALID_URL_FORMAT"
}
]
}
}
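Because `success` can be `'partial'`, it's worth checking the summary and error details rather than treating any non-failure response as complete. A minimal sketch, assuming `result` holds the parsed response above:
// inspect a batch-update response for partial failures
function checkBatchResult(result) {
  if (result.success === true) {
    console.log(`All URLs accepted (${result.summary.totalNew} new).`);
  } else if (result.success === 'partial') {
    console.warn(`${result.summary.totalErrors} URL(s) rejected:`);
    for (const err of result.data.errorDetails) {
      console.warn(`${err.url}: ${err.message} (${err.type})`);
    }
  } else {
    console.error('Batch update failed.');
  }
}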
Notes
- batch_mode: `append` mode will add new URLs to the existing list, with a cap of 5000 URLs. `replace` mode will clear existing URLs and replace them with the provided list, enforcing the same limit.
- Validation: URLs are validated for format, length, protocol, and TLD correctness. Invalid URLs are listed in the response's `errorDetails` array.
- Limit: A maximum of 5000 URLs can be stored in a batch at any time. Attempting to exceed this limit will result in trimming of the excess URLs (see the sketch below).
POST /smart-extract
Full endpoint: https://api.simplescraper.io/v1/smart-extract
Smart Extract uses AI to accurately extract data and generate reusable CSS selectors from any website using only a list of the data points you need (data schema). Read more about this feature here: https://simplescraper.io/docs/smart-data-extract.
Request body properties
Property | Required | Description |
---|---|---|
url | Yes | The URL of the page to be scraped. |
schema | Yes | A comma-separated list of properties to extract. |
Example request
async function runSmartExtract(apikey, url, schema) {
const endpoint = 'https://api.simplescraper.io/v1/smart-extract';
const requestBody = {
url: url,
schema: schema
};
try {
const response = await fetch(endpoint, {
method: 'POST',
headers: {
'Content-Type': 'application/json',
'Authorization': `Bearer ${apikey}`
},
body: JSON.stringify(requestBody)
});
const data = await response.json();
console.log(data);
} catch (error) {
console.error('error:', error);
}
}
// usage
let url = 'https://example.com/product-page';
let schema = 'product name, price, description, rating';
runSmartExtract('YOUR_API_KEY', url, schema);
Example response
{
"extract_uid": "12345",
"results_uid": "54321",
"date_completed": "2024-01-01T00:00:00.000Z",
"data": [
{
"product_name": "Amazon apples",
"price": "USD $1,400",
"description": "A great selection of Gala, Fuji, Honeycrisp, Golden Delicious & more",
"rating": "4.5",
},
],
"selectors": [
{
"name": "product_name",
"selector": "div.displayProduct",
"uid": "e5de-60cd-4386-917b"
},
{
"name": "price",
"selector": "div.displayListingPrice",
"uid": "ea74-d519-4508-b5ca"
},
{
"name": "description",
"selector": "div.description",
"uid": "4cc4-d406-4dd1-887f"
},
{
"name": "rating",
"selector": "div.feature-item:nth-child(2)",
"uid": "c720-a90e-44dd-9eb4"
}
],
"status": "completed"
}
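The `selectors` array is what makes Smart Extract reusable: each entry pairs a field name with a CSS selector that can be stored and re-applied to similar pages without another AI pass. A hedged sketch of applying them in a browser context (assumes the target page's structure still matches the selectors):
// apply returned selectors to the current page, e.g. in a userscript or headless browser
function applySelectors(selectors) {
  const record = {};
  for (const { name, selector } of selectors) {
    const el = document.querySelector(selector);
    record[name] = el ? el.textContent.trim() : null;
  }
  return record;
}

// usage with the response above:
// const record = applySelectors(response.selectors);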
Request timeouts
Requests to the `/recipes/:recipeId/run` endpoint will time out after 90 seconds, at which point a JSON object containing a `results_id` and a status of `'running'` is returned.
The `results_id` can then be used to poll the `/results/:resultsId` endpoint for a status of `'completed'`. In general, most scrape requests complete within 90 seconds.
Error Handling
The API uses standard HTTP response codes to indicate the success or failure of an API request. In general:
- Codes in the 2xx range indicate success
- Codes in the 4xx range indicate an error that resulted from the provided information or the account (e.g., a required parameter was missing, insufficient credits, etc.)
- Codes in the 5xx range indicate an error with our servers
In addition to the HTTP status code, all error responses include a JSON object in the response body with an `error` key containing a human-readable error message.
Error Response Format
All error responses have the following structure:
{
"error": {
"type": "api-key-not-included",
"message": "API key was not included in the request."
}
}
Error Codes
Error Type | HTTP Status Code | Error Message |
---|---|---|
Successful Call | 200 | |
api-key-not-included | 403 | API key was not included in the request. |
out-of-credits | 402 | Credits expired. |
out-of-api-reads | 402 | API reads expired. |
results-not-found | 404 | Results not found. Ensure the results ID is correct and the recipe was run. |
recipe-not-found | 404 | The recipe was not found. Please make sure it exists and that the recipe ID is correct. |
user-not-found | 404 | User not found. |
request-timeout | 408 | The request timed out after 5 minutes. |
invalid-request | 400 | Invalid request format. |
invalid-value | 400 | Invalid value provided. |
rate-limit-exceeded | 429 | You have exceeded the rate limit. |
method-not-allowed | 405 | This method is not allowed for this endpoint. |
default | 500 | An unexpected error occurred. |
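Some of these errors are transient: `rate-limit-exceeded` (429) and `5xx` server errors may succeed on retry, while errors such as `out-of-credits` (402) should fail fast. Here's a hedged sketch of a retry-with-backoff wrapper; the retry count and delays are arbitrary choices, not API requirements.
// retry transient errors (429 and 5xx) with exponential backoff; fail fast otherwise
async function fetchWithRetry(url, options, maxRetries = 3) {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    const response = await fetch(url, options);
    const retryable = response.status === 429 || response.status >= 500;
    if (!retryable || attempt === maxRetries) {
      return response;
    }
    const delayMs = 1000 * 2 ** attempt; // 1s, 2s, 4s...
    console.warn(`Got ${response.status}, retrying in ${delayMs / 1000}s...`);
    await new Promise(resolve => setTimeout(resolve, delayMs));
  }
}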
For persistent errors or issues not covered here, please contact customer support for assistance.
Code examples
Calling multiple URLs and handling timeout/async
// function to initiate scraping for a single URL
async function runScrapeForUrl(apiKey, recipeId, sourceUrl) {
  const url = `https://api.simplescraper.io/v1/recipes/${recipeId}/run`;
  const requestBody = {
    sourceUrl: sourceUrl,
    runAsync: false
  };
  try {
    const response = await fetch(url, {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        'Authorization': `Bearer ${apiKey}`
      },
      body: JSON.stringify(requestBody)
    });
    const data = await response.json();
    if (data.status === 'running') {
      return { status: 'running', resultsId: data.results_id };
    } else if (data.status === 'error') {
      return { status: 'error', error: data.error };
    } else {
      return data;
    }
  } catch (error) {
    return { status: 'error', error: error.message };
  }
}
// main function to process multiple URLs
async function main() {
const apikey = 'your-api-key';
const recipeId = 'your-recipe-id';
const urls = [
'https://example.com/page1',
'https://example.com/page2',
'https://example.com/page3'
];
for (const url of urls) {
let result = await runScrapeForUrl(apikey, recipeId, url);
// if scrape is running asynchronously, poll for results
if (result.status === 'running') {
result = await pollForResults(result.resultsId, apikey); // pollForResults is defined in the next section
} else if (result.status === 'error') {
console.error(`Error scraping ${url}:`, result.error);
} else if (result.status === 'completed') {
console.log(`Successfully scraped ${url}`);
// process successful result here
}
}
}
main();
Polling for Results
To handle tasks that are still processing, implement a polling mechanism on the client side. Use a sensible interval of a few seconds to avoid overloading the endpoint.
// calling the v1/results/:resultsId endpoint
async function pollForResults(resultsId, apikey, maxAttempts = 10) {
const url = `https://api.simplescraper.io/v1/results/${resultsId}`;
for (let attempt = 0; attempt < maxAttempts; attempt++) {
try {
const response = await fetch(url, {
headers: {
'Authorization': `Bearer ${apikey}`
}
});
const data = await response.json();
if (data.status === 'completed') {
console.log('Scraping completed successfully:', data);
return data;
}
console.log('Job still processing, retrying in 5 seconds...');
await new Promise(resolve => setTimeout(resolve, 5000));
} catch (error) {
console.error('Error polling for result:', error);
}
}
console.error('Max polling attempts reached. Please check the job status manually.');
return null;
}
// call function
async function main() {
const result = await pollForResults('your-results-id', 'your-api-key');
if (result) {
console.log('Final result:', result);
} else {
console.log('Failed to retrieve results');
}
}
main();
Handling Errors
When working with our API, we recommend checking both the HTTP status code and the presence of an `error` key in the response body. Here's an example of how you might handle errors in your code:
async function makeApiRequest(apikey, endpoint) {
  try {
    const response = await fetch(`https://api.simplescraper.io/v1/${endpoint}`, {
      headers: {
        'Authorization': `Bearer ${apikey}`
      }
    });
    if (!response.ok) {
      const errorData = await response.json();
      throw new Error(errorData.error?.message || `HTTP error status: ${response.status}`);
    }
    const data = await response.json();
    return data; // process successful response
  } catch (error) {
    console.error('There was an error:', error.message);
    throw error; // rethrow so the caller can handle it
  }
}
// call function
async function main() {
try {
const data = await makeApiRequest('your-api-key', 'recipes/123456/run');
console.log('API response:', data);
} catch (error) {
console.error('Error in main:', error.message);
}
}
main();