Crawler API (1.0.0)
The Crawler API lets you manage and run your crawlers.
The base URL for making requests to the Crawler API is:
https://crawler.algolia.com/api
All requests must use HTTPS.
Access to the Crawler API is available with the Crawler add-on.
To authenticate your API requests, use the basic authentication header:
Authorization: Basic <credentials>
where <credentials> is the base64-encoded string <user-id>:<api-key>.
<user-id>: The Crawler user ID.
<api-key>: The Crawler API key.
You can find both in the Crawler dashboard. The Crawler dashboard and API key are different from the regular Algolia dashboard and API keys.
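As a minimal sketch of the header construction described above, the following Python helper base64-encodes <user-id>:<api-key>. The function name basic_auth_header and the sample credentials are illustrative and not part of the Crawler API.

```python
import base64


def basic_auth_header(user_id: str, api_key: str) -> dict:
    """Build the Authorization header from a Crawler user ID and API key."""
    # Join the two values with a colon, then base64-encode the UTF-8 bytes.
    credentials = f"{user_id}:{api_key}".encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    return {"Authorization": f"Basic {token}"}


# Placeholder credentials; use the values from the Crawler dashboard instead.
headers = basic_auth_header("my-user-id", "my-api-key")
```

Most HTTP clients can build this header for you when given a username and password, so a helper like this is mainly useful for checking what is sent.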
Parameters are passed as query parameters for GET requests, and in the request body for POST and PATCH requests.
Query parameters must be URL-encoded. Non-ASCII characters must be UTF-8 encoded.
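The parameter rules above can be sketched as follows. This example only prepares a request object and doesn't send it. The helper name build_request and the path /1/crawlers are assumptions made for illustration (this page doesn't list endpoint paths), and the authentication header is left out for brevity.

```python
import json
from typing import Optional
from urllib.parse import urlencode
from urllib.request import Request

BASE_URL = "https://crawler.algolia.com/api"


def build_request(method: str, path: str, params: Optional[dict] = None) -> Request:
    """Prepare (but don't send) a request, placing params by HTTP method."""
    url = BASE_URL + path
    headers = {}
    data = None
    if method == "GET":
        if params:
            # urlencode percent-encodes values and uses UTF-8 by default.
            url += "?" + urlencode(params)
    elif method in ("POST", "PATCH"):
        # Request bodies are sent as JSON (application/json).
        data = json.dumps(params or {}).encode("utf-8")
        headers["Content-Type"] = "application/json"
    else:
        raise ValueError(f"unsupported method: {method}")
    return Request(url, data=data, headers=headers, method=method)


# Hypothetical path; "name" and "page" are query parameters of "List crawlers".
request = build_request("GET", "/1/crawlers", {"name": "café", "page": 1})
print(request.full_url)
```

The non-ASCII character in the name is encoded as UTF-8 bytes and then percent-encoded, which is what the rule above asks for.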
The Crawler API returns JSON responses. Since JSON doesn't guarantee any specific ordering, don't rely on the order of attributes in the API response.
Successful responses return a 2xx status. Client errors return a 4xx status. Server errors are indicated by a 5xx status.
Error responses have a message property with more information.
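A client might branch on these status classes roughly as shown here. This is a sketch only: the function names classify_status and error_message and the sample error body are made up for illustration.

```python
import json
from typing import Optional


def classify_status(status: int) -> str:
    """Map an HTTP status code to the response classes described above."""
    if 200 <= status <= 299:
        return "success"
    if 400 <= status <= 499:
        return "client_error"
    if 500 <= status <= 599:
        return "server_error"
    return "other"


def error_message(body: str) -> Optional[str]:
    """Return the message property of an error response, if present."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return None
    if isinstance(payload, dict):
        message = payload.get("message")
        return message if isinstance(message, str) else None
    return None


# Made-up error body for illustration.
print(classify_status(404), error_message('{"message": "Crawler not found"}'))
```

Because the helper reads properties by name, it doesn't depend on the order of attributes in the JSON response.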
The current version of the Crawler API is version 1, as indicated by the /1/ in each endpoint's URL.
Actions change the state of crawlers, such as pausing and unpausing crawl schedules or testing the crawler with specific URLs.
Unpause a crawler
Unpauses the specified crawler. Previously ongoing crawls will be resumed. Otherwise, the crawler waits for its next scheduled run.
path Parameters
id required | string Example: e0f6db8a-24f5-4092-83a4-1b2c6cb6d809 Crawler ID. |
Responses
Response samples
- 200
- 400
{
  "taskId": "98458796-b7bb-4703-8b1b-785c1080b110"
}
Test crawling a URL
Tests a URL with the crawler's configuration and shows the extracted records.
You can override parts of the configuration to test your changes before updating the configuration.
path Parameters
id required | string Example: e0f6db8a-24f5-4092-83a4-1b2c6cb6d809 Crawler ID. |
Request Body schema: application/json
url required | string URL to test. |
object Crawler configuration to update.
You can only update top-level configuration properties.
To update a nested configuration, such as |
Responses
Response samples
- 200
- 400
{
  "startDate": "2024-04-02T15:34:29Z",
  "endDate": "2024-04-02T15:34:29Z",
  "logs": [
    [
      "Processing url 'https://www.algolia.com/blog'"
    ]
  ],
  "records": [
    {
      "indexName": "testIndex",
      "recordsPerExtractor": []
    }
  ],
  "externalData": {
    "externalData1": {
      "data1": "val1",
      "data2": "val2"
    },
    "externalData2": {
      "data1": "val1",
      "data2": "val2"
    }
  },
  "error": {}
}
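To make the shape of the test response easier to work with, the following sketch flattens the nested logs and collects the index names from records. The function name summarize_test_result is illustrative, and treating an empty error object as "no error" is an assumption based on the sample response only.

```python
def summarize_test_result(result: dict) -> dict:
    """Condense a test-crawl response into a few easy-to-inspect fields."""
    # "logs" is a list of lists of strings; flatten it into one list.
    log_lines = [line for group in result.get("logs", []) for line in group]
    # Each entry in "records" names the index it would write to.
    index_names = [entry.get("indexName") for entry in result.get("records", [])]
    return {
        "log_lines": log_lines,
        "index_names": index_names,
        # Assumption: an empty "error" object means the test succeeded.
        "has_error": bool(result.get("error")),
    }


# Trimmed-down copy of the sample response.
sample = {
    "logs": [["Processing url 'https://www.algolia.com/blog'"]],
    "records": [{"indexName": "testIndex", "recordsPerExtractor": []}],
    "error": {},
}
print(summarize_test_result(sample))
```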
Crawl URLs
Crawls the specified URLs, extracts records from them, and adds them to the index.
If a crawl is currently running (the crawler's reindexing
property is true),
the records are added to a temporary index.
path Parameters
id required | string Example: e0f6db8a-24f5-4092-83a4-1b2c6cb6d809 Crawler ID. |
Request Body schema: application/json
urls required | Array of strings URLs to crawl. |
save | boolean Whether the specified URLs should be added to the |
Responses
Response samples
- 200
- 400
{
  "taskId": "98458796-b7bb-4703-8b1b-785c1080b110"
}
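A request body for this endpoint could be assembled as in the sketch below. The helper name crawl_urls_body is illustrative, and rejecting an empty list on the client side is a choice made for this example, not documented API behavior.

```python
import json
from typing import List, Optional


def crawl_urls_body(urls: List[str], save: Optional[bool] = None) -> str:
    """Serialize the JSON body with the required urls and optional save."""
    if not urls:
        # Client-side check chosen for this sketch; urls is a required field.
        raise ValueError("urls must contain at least one URL")
    body = {"urls": list(urls)}
    if save is not None:
        # Only include save when the caller sets it explicitly.
        body["save"] = save
    return json.dumps(body)


print(crawl_urls_body(["https://www.algolia.com/blog"], save=False))
```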
In the Crawler configuration, you specify which URLs to crawl, when to crawl, how to extract records from the crawl, and where to index the extracted records. The configuration is versioned, so you can always restore a previous version. It's easiest to make configuration changes in the Crawler dashboard. The editor has autocomplete and built-in validation, so you can try your configuration changes before committing them.
Update crawler configuration
Updates the configuration of the specified crawler. Every time you update the configuration, a new version is created.
path Parameters
id required | string Example: e0f6db8a-24f5-4092-83a4-1b2c6cb6d809 Crawler ID. |
Request Body schema: application/json
required | Array of objects [ 1 .. 30 ] items Instructions how to process crawled URLs. Each action defines:
A single web page can match multiple actions. In this case, the crawler produces one record for each matched action. |
appId required | string Algolia application ID where the crawler creates and updates indices. The Crawler add-on must be enabled for this application. |
rateLimit required | number [ 1 .. 100 ] Number of concurrent tasks per second. If processing each URL takes n seconds,
your crawler can process Higher numbers mean faster crawls but they also increase your bandwidth and server load. |
apiKey | string Algolia API key for indexing the records. The API key must have the following access control list (ACL) permissions:
|
exclusionPatterns | Array of strings <= 100 items URLs to exclude from crawling. |
externalData | Array of strings <= 10 items References to external data sources for enriching the extracted records. For more information, see Enrich extracted records with external data. |
extraUrls | Array of strings <= 9999 items URLs from where to start crawling. These are the same as |
boolean or Array of strings | |
ignoreNoFollowTo | boolean Whether to ignore the |
ignoreNoIndex | boolean Whether to ignore the |
ignoreQueryParams | Array of strings <= 9999 items Query parameters to ignore while crawling. All URLs with the matching query parameters are treated as identical. This prevents indexing duplicate URLs that differ only by their query parameters. |
ignoreRobotsTxtRules | boolean Whether to ignore rules defined in your |
indexPrefix | string <= 64 characters A prefix for all indices created by this crawler. It's combined with the |
object Initial index settings, one settings object per index. This setting is only applied when the index is first created. Settings are not re-applied. This prevents overriding any settings changes after the index was created. | |
object Function for extracting URLs for links found on crawled pages. | |
fetchRequest (object) or browserRequest (object) or oauthRequest (object) Authorization method and credentials for crawling protected content. | |
maxDepth | number [ 1 .. 100 ] Maximum path depth of crawled URLs.
For example, if |
maxUrls | number [ 1 .. 15000000 ] Maximum number of crawled URLs. Setting |
boolean or Array of strings or object Crawl JavaScript-rendered pages by rendering them with a headless browser. Rendering JavaScript-based pages is slower than crawling regular HTML pages. | |
object Options to add to all HTTP requests made by the crawler. | |
object Safety checks for ensuring data integrity between crawls. | |
saveBackup | boolean Whether to back up your index before the crawler overwrites it with new records. |
schedule | string Schedule for running the crawl, expressed in Later.js syntax. If omitted, you must start crawls manually.
|
sitemaps | Array of strings <= 9999 items Sitemaps with URLs from where to start crawling. |
startUrls | Array of strings <= 9999 items URLs from where to start crawling. |
Responses
Response samples
- 200
- 400
{
  "taskId": "98458796-b7bb-4703-8b1b-785c1080b110"
}
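Some of the limits listed in the request body schema can be checked before sending an update. The sketch below covers only the numeric ranges, item counts, and required fields stated above. The name validate_config is illustrative, and the property name actions is taken from the sample configuration response rather than from the schema table. The API's own validation remains authoritative.

```python
REQUIRED = ("actions", "appId", "rateLimit")
NUMBER_RANGES = {
    "rateLimit": (1, 100),
    "maxDepth": (1, 100),
    "maxUrls": (1, 15_000_000),
}
MAX_ITEMS = {
    "exclusionPatterns": 100,
    "externalData": 10,
    "extraUrls": 9999,
    "ignoreQueryParams": 9999,
    "sitemaps": 9999,
    "startUrls": 9999,
}


def validate_config(config: dict) -> list:
    """Return a list of human-readable problems; empty means none found."""
    problems = [f"{key} is required" for key in REQUIRED if key not in config]
    actions = config.get("actions")
    if actions is not None and not 1 <= len(actions) <= 30:
        problems.append("actions must have 1 to 30 items")
    for key, (low, high) in NUMBER_RANGES.items():
        value = config.get(key)
        if value is not None and not low <= value <= high:
            problems.append(f"{key} must be between {low} and {high}")
    for key, limit in MAX_ITEMS.items():
        if len(config.get(key, [])) > limit:
            problems.append(f"{key} must have at most {limit} items")
    prefix = config.get("indexPrefix")
    if prefix is not None and len(prefix) > 64:
        problems.append("indexPrefix must be at most 64 characters")
    return problems


# Placeholder values for illustration.
print(validate_config({"appId": "YOUR_APP_ID", "rateLimit": 500}))
```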
List configuration versions
Lists previous versions of the specified crawler's configuration, including who authored the change. Every time you update the configuration of a crawler, a new version is added.
path Parameters
id required | string Example: e0f6db8a-24f5-4092-83a4-1b2c6cb6d809 Crawler ID. |
query Parameters
itemsPerPage | integer [ 1 .. 100 ] Default: 20 Number of items per page to retrieve. |
page | integer [ 1 .. 100 ] Default: 1 Page to retrieve. |
Responses
Response samples
- 200
{
  "itemsPerPage": 20,
  "page": 1,
  "total": 100,
  "items": [
    {
      "version": 1,
      "createdAt": "2023-07-04T12:49:15Z",
      "authorId": "7d79f0dd-2dab-4296-8098-957a1fdc0637"
    }
  ]
}
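Paginated responses like this one can be walked page by page. The sketch below assumes that total is the number of items across all pages, which the sample suggests but this page doesn't state. The caller supplies the function that performs the HTTP request; iter_items and fake_fetch are illustrative names.

```python
import math
from typing import Callable, Iterator


def iter_items(fetch_page: Callable[[int, int], dict],
               items_per_page: int = 20) -> Iterator[dict]:
    """Yield items from every page, using page and itemsPerPage."""
    page = 1
    while True:
        response = fetch_page(page, items_per_page)
        yield from response["items"]
        # Assumption: "total" counts items across all pages.
        last_page = math.ceil(response["total"] / response["itemsPerPage"])
        # The page parameter is limited to the range 1..100.
        if page >= last_page or page >= 100:
            return
        page += 1


# In-memory stand-in for the API, for illustration only.
versions = [{"version": n} for n in range(1, 46)]


def fake_fetch(page: int, items_per_page: int) -> dict:
    start = (page - 1) * items_per_page
    return {
        "itemsPerPage": items_per_page,
        "page": page,
        "total": len(versions),
        "items": versions[start:start + items_per_page],
    }


print(len(list(iter_items(fake_fetch))))
```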
Retrieve a configuration version
Retrieves the specified version of the crawler configuration.
You can use this to restore a previous version of the configuration.
path Parameters
id required | string Example: e0f6db8a-24f5-4092-83a4-1b2c6cb6d809 Crawler ID. |
version required | integer The version of the targeted Crawler revision. |
Responses
Response samples
- 200
{- "version": 1,
- "config": {
- "actions": [
- {
- "autoGenerateObjectIDs": true,
- "cache": {
- "enabled": true
}, - "fileTypesToMatch": [
- "html",
- "pdf"
], - "hostnameAliases": {
- "dev.example.com": "example.com"
}, - "indexName": "algolia_website",
- "name": "string",
- "pathAliases": {
- "example.com": {
- "/foo": "/bar"
}
}, - "recordExtractor": {
- "__type": "function",
- "source": "string"
}, - "selectorsToMatch": [
- ".products",
- "!.featured"
]
}
], - "apiKey": "string",
- "appId": "string",
- "exclusionPatterns": [
- "!https://www.example.com/this-one-url"
], - "externalData": [
- "testCSV"
], - "extraUrls": [
- "string"
], - "ignoreCanonicalTo": true,
- "ignoreNoFollowTo": true,
- "ignoreNoIndex": true,
- "ignoreQueryParams": [
- "ref",
- "utm_*"
], - "ignoreRobotsTxtRules": true,
- "indexPrefix": "crawler_",
- "initialIndexSettings": {
- "indexName1": {
- "attributesForFaceting": [
- "author",
- "filterOnly(isbn)",
- "searchable(edition)",
- "afterDistinct(category)",
- "afterDistinct(searchable(publisher))"
], - "replicas": [
- "virtual(prod_products_price_asc)",
- "dev_products_replica"
], - "paginationLimitedTo": 100,
- "unretrievableAttributes": [
- "total_sales"
], - "disableTypoToleranceOnWords": [
- "wheel",
- "1X2BCD"
], - "attributesToTransliterate": [
- "name",
- "description"
], - "camelCaseAttributes": [
- "description"
], - "decompoundedAttributes": {
- "de": [
- "name"
]
}, - "indexLanguages": [
- "ja"
], - "disablePrefixOnAttributes": [
- "sku"
], - "allowCompressionOfIntegerArray": false,
- "numericAttributesForFiltering": [
- "equalOnly(quantity)",
- "popularity"
], - "separatorsToIndex": "+#",
- "searchableAttributes": [
- "title,alternative_title",
- "author",
- "unordered(text)",
- "emails.personal"
], - "userData": {
- "settingID": "f2a7b51e3503acc6a39b3784ffb84300",
- "pluginVersion": "1.6.0"
}, - "customNormalization": {
- "default": {
- "ä": "ae",
- "ü": "ue"
}
}, - "attributeForDistinct": "url",
- "attributesToRetrieve": [
- "author",
- "title",
- "content"
], - "ranking": [
- "typo",
- "geo",
- "words",
- "filters",
- "proximity",
- "attribute",
- "exact",
- "custom"
], - "customRanking": [
- "desc(popularity)",
- "asc(price)"
], - "relevancyStrictness": 90,
- "attributesToHighlight": [
- "author",
- "title",
- "conten",
- "content"
], - "attributesToSnippet": [
- "content:80",
- "description"
], - "highlightPreTag": "<em>",
- "highlightPostTag": "</em>",
- "snippetEllipsisText": "…",
- "restrictHighlightAndSnippetArrays": false,
- "hitsPerPage": 20,
- "minWordSizefor1Typo": 4,
- "minWordSizefor2Typos": 8,
- "typoTolerance": true,
- "allowTyposOnNumericTokens": true,
- "disableTypoToleranceOnAttributes": [
- "sku"
], - "ignorePlurals": [
- "ca",
- "es"
], - "removeStopWords": [
- "ca",
- "es"
], - "keepDiacriticsOnCharacters": "øé",
- "queryLanguages": [
- "es"
], - "decompoundQuery": true,
- "enableRules": true,
- "enablePersonalization": false,
- "queryType": "prefixAll",
- "removeWordsIfNoResults": "firstWords",
- "mode": "keywordSearch",
- "semanticSearch": {
- "eventSources": [
- "string"
]
}, - "advancedSyntax": false,
- "optionalWords": [
- "blue",
- "iphone case"
], - "disableExactOnAttributes": [
- "description"
], - "exactOnSingleWordQuery": "attribute",
- "alternativesAsExact": [
- "ignorePlurals",
- "singleWordSynonym"
], - "advancedSyntaxFeatures": [
- "exactPhrase",
- "excludeWords"
], - "distinct": 1,
- "replaceSynonymsInHighlight": false,
- "minProximity": 1,
- "responseFields": [
- "*"
], - "maxFacetHits": 10,
- "maxValuesPerFacet": 100,
- "sortFacetValuesBy": "count",
- "attributeCriteriaComputedByMinProximity": false,
- "renderingContent": {
- "facetOrdering": {
- "facets": {
- "order": [
- "string"
]
}, - "values": {
- "facet1": {
- "order": [
- "string"
], - "sortRemainingBy": "alpha",
- "hide": [
- "string"
]
}, - "facet2": {
- "order": [
- "string"
], - "sortRemainingBy": "alpha",
- "hide": [
- "string"
]
}
}
}, - "redirect": {
- "url": "string"
}
}, - "enableReRanking": true,
- "reRankingApplyFilter": [
- null
]
}, - "indexName2": {
- "attributesForFaceting": [
- "author",
- "filterOnly(isbn)",
- "searchable(edition)",
- "afterDistinct(category)",
- "afterDistinct(searchable(publisher))"
], - "replicas": [
- "virtual(prod_products_price_asc)",
- "dev_products_replica"
], - "paginationLimitedTo": 100,
- "unretrievableAttributes": [
- "total_sales"
], - "disableTypoToleranceOnWords": [
- "wheel",
- "1X2BCD"
], - "attributesToTransliterate": [
- "name",
- "description"
], - "camelCaseAttributes": [
- "description"
], - "decompoundedAttributes": {
- "de": [
- "name"
]
}, - "indexLanguages": [
- "ja"
], - "disablePrefixOnAttributes": [
- "sku"
], - "allowCompressionOfIntegerArray": false,
- "numericAttributesForFiltering": [
- "equalOnly(quantity)",
- "popularity"
], - "separatorsToIndex": "+#",
- "searchableAttributes": [
- "title,alternative_title",
- "author",
- "unordered(text)",
- "emails.personal"
], - "userData": {
- "settingID": "f2a7b51e3503acc6a39b3784ffb84300",
- "pluginVersion": "1.6.0"
}, - "customNormalization": {
- "default": {
- "ä": "ae",
- "ü": "ue"
}
}, - "attributeForDistinct": "url",
- "attributesToRetrieve": [
- "author",
- "title",
- "content"
], - "ranking": [
- "typo",
- "geo",
- "words",
- "filters",
- "proximity",
- "attribute",
- "exact",
- "custom"
], - "customRanking": [
- "desc(popularity)",
- "asc(price)"
], - "relevancyStrictness": 90,
- "attributesToHighlight": [
- "author",
- "title",
- "conten",
- "content"
], - "attributesToSnippet": [
- "content:80",
- "description"
], - "highlightPreTag": "<em>",
- "highlightPostTag": "</em>",
- "snippetEllipsisText": "…",
- "restrictHighlightAndSnippetArrays": false,
- "hitsPerPage": 20,
- "minWordSizefor1Typo": 4,
- "minWordSizefor2Typos": 8,
- "typoTolerance": true,
- "allowTyposOnNumericTokens": true,
- "disableTypoToleranceOnAttributes": [
- "sku"
], - "ignorePlurals": [
- "ca",
- "es"
], - "removeStopWords": [
- "ca",
- "es"
], - "keepDiacriticsOnCharacters": "øé",
- "queryLanguages": [
- "es"
], - "decompoundQuery": true,
- "enableRules": true,
- "enablePersonalization": false,
- "queryType": "prefixAll",
- "removeWordsIfNoResults": "firstWords",
- "mode": "keywordSearch",
- "semanticSearch": {
- "eventSources": [
- "string"
]
}, - "advancedSyntax": false,
- "optionalWords": [
- "blue",
- "iphone case"
], - "disableExactOnAttributes": [
- "description"
], - "exactOnSingleWordQuery": "attribute",
- "alternativesAsExact": [
- "ignorePlurals",
- "singleWordSynonym"
], - "advancedSyntaxFeatures": [
- "exactPhrase",
- "excludeWords"
], - "distinct": 1,
- "replaceSynonymsInHighlight": false,
- "minProximity": 1,
- "responseFields": [
- "*"
], - "maxFacetHits": 10,
- "maxValuesPerFacet": 100,
- "sortFacetValuesBy": "count",
- "attributeCriteriaComputedByMinProximity": false,
- "renderingContent": {
- "facetOrdering": {
- "facets": {
- "order": [
- "string"
]
}, - "values": {
- "facet1": {
- "order": [
- "string"
], - "sortRemainingBy": "alpha",
- "hide": [
- "string"
]
}, - "facet2": {
- "order": [
- "string"
], - "sortRemainingBy": "alpha",
- "hide": [
- "string"
]
}
}
}, - "redirect": {
- "url": "string"
}
}, - "enableReRanking": true,
- "reRankingApplyFilter": [
- null
]
}
}, - "linkExtractor": {
- "__type": "function",
- "source": "({ $, url, defaultExtractor }) => {\n if (/example.com\\/doc\\//.test(url.href)) {\n // For all pages under `/doc`, only extract the first found URL.\n return defaultExtractor().slice(0, 1)\n }\n // For all other pages, use the default.\n return defaultExtractor()\n}\n"
}, - "login": {
- "requestOptions": {
- "method": "POST",
- "headers": {
- "Accept-Language": "fr-FR",
- "Authorization": "Bearer Aerehdf==",
- "Cookie": "session=1234"
}, - "body": "id=user&password=s3cr3t",
- "timeout": 0
}
}, - "maxDepth": 1,
- "maxUrls": 1,
- "rateLimit": 4,
- "renderJavaScript": true,
- "requestOptions": {
- "proxy": "string",
- "timeout": 30000,
- "retries": 3,
- "headers": {
- "Accept-Language": "fr-FR",
- "Authorization": "Bearer Aerehdf==",
- "Cookie": "session=1234"
}
}, - "safetyChecks": {
- "beforeIndexPublishing": {
- "maxLostRecordsPercentage": 10
}
}, - "saveBackup": true,
- "schedule": "every weekday at 12:00 pm"
}, - "createdAt": "2023-07-04T12:49:15Z",
- "authorId": "7d79f0dd-2dab-4296-8098-957a1fdc0637"
}
A crawler is an object with a name and a configuration. Use these endpoints to create, rename, and delete crawlers.
List crawlers
Lists all your crawlers.
query Parameters
appID | string Algolia application ID for filtering the API response. |
itemsPerPage | integer [ 1 .. 100 ] Default: 20 Number of items per page to retrieve. |
name | string <= 64 characters Example: name=test-crawler Name of the crawler for filtering the API response. |
page | integer [ 1 .. 100 ] Default: 1 Page to retrieve. |
Responses
Response samples
- 200
- 400
{
  "itemsPerPage": 20,
  "page": 1,
  "total": 100,
  "items": [
    {
      "id": "e0f6db8a-24f5-4092-83a4-1b2c6cb6d809",
      "name": "test-crawler"
    }
  ]
}
Create a crawler
Creates a new crawler with the provided configuration.
Request Body schema: application/json
required | object Crawler configuration. |
name required | string <= 64 characters Name of the crawler. |
Responses
Response samples
- 200
- 400
{
  "id": "e0f6db8a-24f5-4092-83a4-1b2c6cb6d809"
}
Retrieve crawler details
Retrieves details about the specified crawler, optionally with its configuration.
path Parameters
id required | string Example: e0f6db8a-24f5-4092-83a4-1b2c6cb6d809 Crawler ID. |
query Parameters
withConfig | boolean Whether the response should include the crawler's configuration. |
Responses
Response samples
- 200
- 400
{
  "name": "test-crawler",
  "createdAt": "2023-07-04T12:49:15Z",
  "updatedAt": "2023-07-04T12:49:15Z",
  "running": true,
  "reindexing": true,
  "blocked": true,
  "blockingError": "Error: Failed to fetch external data for source 'testCSV': 404\n",
  "blockingTaskId": "string",
  "lastReindexStartAt": null,
  "lastReindexEndedAt": null
}
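The status flags in this response can be turned into a short summary, for example to decide whether a blocking task needs attention. This is a sketch under assumptions: the function name describe_crawler and the order in which the flags are checked are choices made for this example, not documented behavior.

```python
def describe_crawler(details: dict) -> str:
    """Summarize the crawler state from the details response."""
    if details.get("blocked"):
        # blockingError may end with a newline, as in the sample response.
        error = (details.get("blockingError") or "").strip()
        return f"blocked by task {details.get('blockingTaskId')}: {error}"
    if details.get("reindexing"):
        return "reindexing"
    if details.get("running"):
        return "running"
    return "not running"


# Values modeled on the sample response; the task ID is a placeholder.
print(describe_crawler({
    "running": True,
    "reindexing": False,
    "blocked": True,
    "blockingError": "Error: Failed to fetch external data for source 'testCSV': 404\n",
    "blockingTaskId": "98458796-b7bb-4703-8b1b-785c1080b110",
}))
```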
Update crawler
Updates the crawler, either its name or its configuration.
Use this endpoint to update the crawler's name. While you can use this endpoint to completely replace the crawler's configuration, you should use the Update crawler configuration endpoint instead.
If you replace the configuration, you must provide the full configuration, including the settings you want to keep. Configuration changes from this endpoint aren't versioned.
path Parameters
id required | string Example: e0f6db8a-24f5-4092-83a4-1b2c6cb6d809 Crawler ID. |
Request Body schema: application/json
object Crawler configuration. | |
name | string <= 64 characters Name of the crawler. |
Responses
Response samples
- 200
- 400
{
  "taskId": "98458796-b7bb-4703-8b1b-785c1080b110"
}
List registered domains
Lists registered domains.
Crawlers will only run if the URLs match any of the registered domains.
query Parameters
appID | string Algolia application ID for filtering the API response. |
itemsPerPage | integer [ 1 .. 100 ] Default: 20 Number of items per page to retrieve. |
page | integer [ 1 .. 100 ] Default: 1 Page to retrieve. |
Responses
Response samples
- 200
- 400
- 403
{
  "itemsPerPage": 20,
  "page": 1,
  "total": 100,
  "items": [
    {
      "appId": "string",
      "domain": "www.algolia.com",
      "validated": true
    }
  ]
}
Retrieve task status
Retrieves the status of the specified task, that is, whether it's pending or completed.
path Parameters
id required | string Example: e0f6db8a-24f5-4092-83a4-1b2c6cb6d809 Crawler ID. |
taskID required | string Example: 98458796-b7bb-4703-8b1b-785c1080b110 Task ID. |
Responses
Response samples
- 200
{
  "pending": true
}
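Because endpoints that change a crawler return a taskId, a client may want to wait until the task is no longer pending. The following sketch polls a caller-supplied function that returns the task status response. The name wait_for_task, the default interval, and the attempt limit are assumptions chosen for illustration.

```python
import time
from typing import Callable


def wait_for_task(fetch_status: Callable[[], dict],
                  interval: float = 1.0,
                  max_attempts: int = 30,
                  sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll until pending is false; return False if attempts run out."""
    for attempt in range(max_attempts):
        if not fetch_status().get("pending", False):
            return True
        # Don't sleep after the final attempt.
        if attempt < max_attempts - 1:
            sleep(interval)
    return False


# Canned responses standing in for real API calls.
responses = iter([{"pending": True}, {"pending": True}, {"pending": False}])
done = wait_for_task(lambda: next(responses), sleep=lambda seconds: None)
print(done)
```

Passing the sleep function as a parameter keeps the helper easy to test without real delays.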
Cancel a blocking task
Cancels a blocking task.
Tasks that ran into an error block the further schedule of your crawler. To unblock the crawler, you can cancel the blocking task.
path Parameters
id required | string Example: e0f6db8a-24f5-4092-83a4-1b2c6cb6d809 Crawler ID. |
taskID required | string Example: 98458796-b7bb-4703-8b1b-785c1080b110 Task ID. |
Responses
Response samples
- 400
{
  "error": {
    "code": "malformed_id"
  }
}