Web crawler (beta) API reference

The Elastic Enterprise Search web crawler is a beta feature. Beta features are subject to change and are not covered by the support SLA of generally available (GA) features. Elastic plans to promote this feature to GA in a future release.

App Search provides API operations for the web crawler. This document is a reference for each of those operations, beginning with concerns shared by all of them.

Shared concerns

All web crawler API operations share the following concerns.

Engine

Most endpoints within the crawler API are scoped to a particular App Search engine, identified by the engine name in the URL of the request. If the engine cannot be found, the API returns an empty HTTP 404 response.

Access

Unless security is disabled, you must provide credentials to access API operations.

See Authentication.
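
The HTTP examples in this document can be reproduced with any client. As a minimal sketch, assuming a locally running Enterprise Search instance and a hypothetical App Search private API key (both placeholders), the key is sent as a bearer token:

import requests

# Placeholder assumptions: adjust the base URL and key for your deployment.
BASE_URL = "http://localhost:3002"
API_KEY = "private-xxxxxxxxxxxxxxxx"  # a hypothetical private API key

session = requests.Session()
session.headers["Authorization"] = f"Bearer {API_KEY}"

# Verify the credentials against an endpoint that is not engine-scoped.
resp = session.get(f"{BASE_URL}/api/as/v0/crawler/user_agent")
resp.raise_for_status()

The Python sketches in the rest of this document reuse this session and BASE_URL.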

Crawler

Returns the domain objects configured for a given App Search engine.

GET /api/as/v0/engines/{ENGINE_NAME}/crawler
# 200 OK
{
  "domains": [
    {
      "id": "{DOMAIN_ID}",
      "name": "{DOMAIN_NAME}",
      "document_count": 0,
      "entry_points": [
        {
          "id": "6087cec06dda9bdfb4a49e39",
          "value": "/"
        }
      ],
      "crawl_rules": [],
      "default_crawl_rule": {
        "id": "-",
        "order": 0,
        "policy": "allow",
        "rule": "regex",
        "pattern": ".*"
      },
      "sitemaps": []
    }
  ]
}
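
For example, a short sketch (reusing the session and BASE_URL from Access, with my-engine as a placeholder engine name) that prints each configured domain and its document count:

# List every domain configured for the engine.
resp = session.get(f"{BASE_URL}/api/as/v0/engines/my-engine/crawler")
resp.raise_for_status()
for domain in resp.json()["domains"]:
    print(domain["name"], domain["document_count"])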

Crawl requests

Each crawl performed by the Enterprise Search web crawler has an associated crawl request object. The crawl requests API allows operators to create new crawl requests and to view and control the state of existing crawl requests.

Get current active crawl request

Returns the crawl request object for the active crawl, or an HTTP 404 response if there is no active crawl for the given App Search engine.

GET /api/as/v0/engines/{ENGINE_NAME}/crawler/crawl_requests/active

For successful calls, the response looks like this:

# 200 OK
{
  "id": "601b21adbeae67679b3b760a",
  "status": "running",
  "created_at": "Wed, 03 Feb 2021 22:20:29 +0000",
  "begun_at": "Wed, 03 Feb 2021 22:20:31 +0000",
  "completed_at": null
}

When there is no active crawl for the given engine, the API responds with a 404 error:

# 404 Not Found
{
  "error": "There are no active crawl requests for this engine"
}
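
A sketch of a client that treats the 404 as a normal outcome rather than an error (placeholders as in the Access section):

# Ask for the active crawl; a 404 simply means no crawl is running.
resp = session.get(
    f"{BASE_URL}/api/as/v0/engines/my-engine/crawler/crawl_requests/active"
)
if resp.status_code == 404:
    print("no active crawl")
else:
    resp.raise_for_status()
    crawl = resp.json()
    print("active crawl:", crawl["id"], crawl["status"])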

Cancel an active crawl

Cancels the active crawl for a given App Search engine, or returns an HTTP 404 response if there is no active crawl.

It may take some time for the crawler to detect the cancellation request and gracefully stop the crawl. During this time, the status of the crawl request remains canceling.

POST /api/as/v0/engines/{ENGINE_NAME}/crawler/crawl_requests/active/cancel

On success, the response contains a single crawl request object in the canceling state:

# 200 OK
{
  "id": "601b21adbeae67679b3b760a",
  "status": "canceling",
  "created_at": "Wed, 03 Feb 2021 22:20:29 +0000",
  "begun_at": "Wed, 03 Feb 2021 22:20:31 +0000",
  "completed_at": null
}

When there is no active crawl for the given engine, the API responds with a 404 error:

# 404 Not Found
{
  "error": "There are no active crawl requests for this engine"
}
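
A sketch of a cancellation call that tolerates the no-active-crawl case (placeholders as above):

# Request cancellation; the crawler stops asynchronously.
resp = session.post(
    f"{BASE_URL}/api/as/v0/engines/my-engine/crawler/crawl_requests/active/cancel"
)
if resp.status_code == 404:
    print("nothing to cancel")
else:
    resp.raise_for_status()
    print("status:", resp.json()["status"])  # "canceling" until the crawl stops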

List crawl requests

Returns a list of the most recent crawl requests for a given engine. The number of items returned (default: 10) can be changed using the limit query parameter.

GET /api/as/v0/engines/{ENGINE_NAME}/crawler/crawl_requests
GET /api/as/v0/engines/{ENGINE_NAME}/crawler/crawl_requests?limit=25
# 200 OK
[
  {
    "id": "601b21adbeae67679b3b760a",
    "status": "running",
    "created_at": "Wed, 03 Feb 2021 22:20:29 +0000",
    "begun_at": "Wed, 03 Feb 2021 22:20:31 +0000",
    "completed_at": null
  },
  {
    "id": "60147e93beae67bf7ef72e86",
    "status": "success",
    "created_at": "Fri, 29 Jan 2021 21:30:59 +0000",
    "begun_at": "Fri, 29 Jan 2021 21:31:00 +0000",
    "completed_at": "Fri, 29 Jan 2021 21:35:20 +0000"
  },
  {
    "id": "60146c07beae67f397300128",
    "status": "canceled",
    "created_at": "Fri, 29 Jan 2021 20:11:51 +0000",
    "begun_at": "Fri, 29 Jan 2021 20:11:52 +0000",
    "completed_at": "Fri, 29 Jan 2021 20:12:51 +0000"
  }
]
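
For example, a sketch that fetches the 25 most recent crawl requests (placeholders as above):

# The limit query parameter raises the default page size of 10.
resp = session.get(
    f"{BASE_URL}/api/as/v0/engines/my-engine/crawler/crawl_requests",
    params={"limit": 25},
)
resp.raise_for_status()
for crawl in resp.json():
    print(crawl["id"], crawl["status"], crawl["completed_at"])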

Create a new crawl request

Requests a new crawl for a given App Search engine. If there is already an active crawl, the request returns an HTTP 400 response with an error message.

POST /api/as/v0/engines/{ENGINE_NAME}/crawler/crawl_requests

On success, the response contains a single crawl request object in the pending state:

# 200 OK
{
  "id": "601b21adbeae67679b3b760a",
  "status": "pending",
  "created_at": "Wed, 03 Feb 2021 22:20:29 +0000",
  "begun_at": null,
  "completed_at": null
}

When there is already an active crawl, the API returns an HTTP 400 response:

# 400 Bad Request
{
  "error": "There is an active crawl for the engine \"your-engine\", please wait for it to finish or abort it before requesting another one"
}
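
A sketch that starts a crawl and reports the already-running case (placeholders as above):

resp = session.post(f"{BASE_URL}/api/as/v0/engines/my-engine/crawler/crawl_requests")
if resp.status_code == 400:
    # An active crawl already exists for this engine.
    print("not started:", resp.json()["error"])
else:
    resp.raise_for_status()
    print("started crawl", resp.json()["id"])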

View details for a crawl request

Returns details of a given crawl request, identified by its unique crawl request ID.

GET /api/as/v0/engines/{ENGINE_NAME}/crawler/crawl_requests/{CRAWL_REQUEST_ID}
# 200 OK
{
  "id": "60147e93beae67bf7ef72e86",
  "status": "success",
  "created_at": "Fri, 29 Jan 2021 21:30:59 +0000",
  "begun_at": "Fri, 29 Jan 2021 21:31:00 +0000",
  "completed_at": "Fri, 29 Jan 2021 21:35:20 +0000"
}
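
Combined with this endpoint, a caller can wait for a crawl to reach a terminal state. A sketch, assuming a hypothetical crawl request ID and an arbitrary 30-second poll interval (other placeholders as above):

import time

crawl_id = "60147e93beae67bf7ef72e86"  # a hypothetical crawl request ID
while True:
    resp = session.get(
        f"{BASE_URL}/api/as/v0/engines/my-engine/crawler/crawl_requests/{crawl_id}"
    )
    resp.raise_for_status()
    crawl = resp.json()
    if crawl["completed_at"] is not None:
        print("crawl finished with status:", crawl["status"])
        break
    time.sleep(30)  # poll interval is an arbitrary choice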

Crawl schedules

Each engine using the Enterprise Search web crawler has an associated crawl schedule object. The crawl schedule API allows operators to specify the frequency at which new crawls are started. If a crawl is already active when a scheduled crawl is due, the scheduled crawl is skipped.

Get current crawl schedule

Returns the crawl schedule for a given App Search engine, or an HTTP 404 response if no crawl schedule exists for the engine.

GET /api/as/v0/engines/{ENGINE_NAME}/crawler/crawl_schedule

For successful calls, the response looks like this:

# 200 OK
{
  "engine": {ENGINE_NAME},
  "frequency": 2,
  "unit": "week"
}

When there is no crawl schedule for the given engine, the API responds with a 404 error:

# 404 Not Found
{
  "errors": ["No crawl schedule found"]
}

Create or update a crawl schedule

Creates or updates (upserts) the crawl schedule for a given App Search engine.

PUT /api/as/v0/engines/{ENGINE_NAME}/crawler/crawl_schedule
{
  "frequency": {INTEGER},
  "unit": {ENUM}
}
frequency (required)
A positive integer.
unit (required)
Should be one of: hour, day, week, month.

On success, the response contains the crawl schedule object:

# 200 OK
{
  "engine": {ENGINE_NAME},
  "frequency": 2,
  "unit": "week"
}

When the parameters are invalid, the API returns an HTTP 400 response:

# 400 Bad Request
{
  "errors": [
    "Crawl schedule frequency must be an integer",
    "Crawl schedule unit must be one of hour, day, week, month"
  ]
}
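
A sketch that upserts a biweekly schedule and surfaces validation errors (placeholders as above):

resp = session.put(
    f"{BASE_URL}/api/as/v0/engines/my-engine/crawler/crawl_schedule",
    json={"frequency": 2, "unit": "week"},  # crawl every two weeks
)
if resp.status_code == 400:
    print("invalid schedule:", resp.json()["errors"])
else:
    resp.raise_for_status()
    print(resp.json())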

Delete a crawl schedule

Deletes a crawl schedule for a given App Search engine.

DELETE /api/as/v0/engines/{ENGINE_NAME}/crawler/crawl_schedule

On success, the response contains an object confirming the crawl schedule has been deleted:

# 200 OK
{
  "deleted": true
}

When there is no crawl schedule for the given engine, the API responds with a 404 error:

# 404 Not Found
{
  "errors": ["No crawl schedule found"]
}

User agent

Returns the User-Agent header value used by the crawler.

GET /api/as/v0/crawler/user_agent
# 200 OK
{
  "user_agent": "Elastic Crawler (0.0.1)"
}

Domains

Create a new domain

Creates a crawler domain for a given App Search engine.

POST /api/as/v0/engines/{ENGINE_NAME}/crawler/domains
{
  "name": {STRING}
}
name (required)
The domain URL.

For successful calls, the response looks like this:

# 200 OK
{
  "id": "{DOMAIN_ID}",
  "name": "{DOMAIN_NAME}",
  "document_count": 0,
  "entry_points": [
    {
      "id": "6087cec06dda9bdfb4a49e39",
      "value": "/"
    }
  ],
  "crawl_rules": [],
  "default_crawl_rule": {
    "id": "-",
    "order": 0,
    "policy": "allow",
    "rule": "regex",
    "pattern": ".*"
  },
  "sitemaps": []
}
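
A sketch that adds a domain and inspects the defaults the crawler creates for it (placeholders as above; the domain URL is an example):

resp = session.post(
    f"{BASE_URL}/api/as/v0/engines/my-engine/crawler/domains",
    json={"name": "https://example.com"},  # the domain URL to crawl
)
resp.raise_for_status()
domain = resp.json()
# A new domain starts with a "/" entry point and the default allow-all rule.
print(domain["id"], len(domain["entry_points"]), domain["default_crawl_rule"]["policy"])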

View details for a domain

Returns the domain object for a given App Search engine.

GET /api/as/v0/engines/{ENGINE_NAME}/crawler/domains/{DOMAIN_ID}

For successful calls, the response looks like this:

# 200 OK
{
  "id": "{DOMAIN_ID}",
  "name": "{DOMAIN_NAME}",
  "document_count": 0,
  "entry_points": [
    {
      "id": "6087cec06dda9bdfb4a49e39",
      "value": "/"
    }
  ],
  "crawl_rules": [],
  "default_crawl_rule": {
    "id": "-",
    "order": 0,
    "policy": "allow",
    "rule": "regex",
    "pattern": ".*"
  },
  "sitemaps": []
}

Update a domain

Updates a domain for a given App Search engine.

PUT /api/as/v0/engines/{ENGINE_NAME}/crawler/domains/{DOMAIN_ID}
{
  "name": {STRING}
}
name
The domain URL.

For successful calls, the response looks like this:

# 200 OK
{
  "id": "{DOMAIN_ID}",
  "name": "{DOMAIN_NAME}",
  "document_count": 0,
  "entry_points": [
    {
      "id": "6087cec06dda9bdfb4a49e39",
      "value": "/"
    }
  ],
  "crawl_rules": [],
  "default_crawl_rule": {
    "id": "-",
    "order": 0,
    "policy": "allow",
    "rule": "regex",
    "pattern": ".*"
  },
  "sitemaps": []
}

Delete a domain

Deletes a domain for a given App Search engine.

DELETE /api/as/v0/engines/{ENGINE_NAME}/crawler/domains/{DOMAIN_ID}

On success, the response contains an object confirming the domain has been deleted:

# 200 OK
{
  "deleted": true
}

Entry points

Create a new entry point

Creates an entry point for a domain.

POST /api/as/v0/engines/{ENGINE_NAME}/crawler/domains/{DOMAIN_ID}/entry_points
{
  "value": {STRING}
}
value (required)
The entry point path.

For successful calls, the response looks like this:

# 200 OK
{
  "id": "{ENTRY_POINT_ID}",
  "value": "/blog"
}
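
A sketch that adds a /blog entry point to a domain (the domain ID below is hypothetical; other placeholders as above):

domain_id = "6087ceb66dda9bdfb4a49e37"  # a hypothetical domain ID
resp = session.post(
    f"{BASE_URL}/api/as/v0/engines/my-engine/crawler/domains/{domain_id}/entry_points",
    json={"value": "/blog"},
)
resp.raise_for_status()
print("created entry point", resp.json()["id"])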

Update an entry point

Updates an entry point for a domain.

PUT /api/as/v0/engines/{ENGINE_NAME}/crawler/domains/{DOMAIN_ID}/entry_points/{ENTRY_POINT_ID}
{
  "value": {STRING}
}
value
The entry point path.

For successful calls, the response looks like this:

# 200 OK
{
  "id": "{ENTRY_POINT_ID}",
  "value": "/blog"
}

Delete an entry point

Deletes an entry point for a domain.

DELETE /api/as/v0/engines/{ENGINE_NAME}/crawler/domains/{DOMAIN_ID}/entry_points/{ENTRY_POINT_ID}

On success, the response contains an object confirming the entry point has been deleted:

# 200 OK
{
  "deleted": true
}

Crawl rules

Create a new crawl rule

Creates a crawl rule for a domain.

POST /api/as/v0/engines/{ENGINE_NAME}/crawler/domains/{DOMAIN_ID}/crawl_rules
{
  "policy": {ENUM},
  "rule": {ENUM},
  "pattern": {STRING},
  "order": {INTEGER}
}
policy (required)
Accepted values are allow and deny.
rule (required)
Accepted values are begins, ends, contains and regex.
pattern (required)
The path pattern to match against.
order (optional)
An integer representing this crawl rule’s position within the list of crawl rules for the domain. The order of crawl rules is significant.

For successful calls, the response looks like this:

# 200 OK
{
  "id": "{CRAWL_RULE_ID}",
  "order": 0,
  "policy": "allow",
  "rule": "begins",
  "pattern": "/ignore"
}
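
Because rule order is significant, a rule inserted at order 0 is evaluated first. A sketch that blocks everything under /private (reusing the hypothetical domain_id from the entry point sketch; other placeholders as above):

resp = session.post(
    f"{BASE_URL}/api/as/v0/engines/my-engine/crawler/domains/{domain_id}/crawl_rules",
    json={
        "policy": "deny",       # exclude matching pages from the crawl
        "rule": "begins",       # match on path prefix
        "pattern": "/private",
        "order": 0,             # evaluated before the domain's other rules
    },
)
resp.raise_for_status()
print("created crawl rule", resp.json()["id"])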

Update a crawl rule

Updates a crawl rule for a domain.

PUT /api/as/v0/engines/{ENGINE_NAME}/crawler/domains/{DOMAIN_ID}/crawl_rules/{CRAWL_RULE_ID}
{
  "policy": {ENUM},
  "rule": {ENUM},
  "pattern": {STRING},
  "order": {INTEGER}
}
policy
Accepted values are allow and deny.
rule
Accepted values are begins, ends, contains and regex.
pattern
The path pattern to match against.
order
An integer representing this crawl rule’s position within the list of crawl rules for the domain. The order of crawl rules is significant.

For successful calls, the response looks like this:

# 200 OK
{
  "id": "{CRAWL_RULE_ID}",
  "order": 0,
  "policy": "allow",
  "rule": "begins",
  "pattern": "/ignore"
}

Delete a crawl rule

Deletes a crawl rule for a domain.

DELETE /api/as/v0/engines/{ENGINE_NAME}/crawler/domains/{DOMAIN_ID}/crawl_rules/{CRAWL_RULE_ID}

On success, the response contains an object confirming the crawl rule has been deleted:

# 200 OK
{
  "deleted": true
}

Sitemaps

Create a new sitemap

Creates a sitemap for a domain.

POST /api/as/v0/engines/{ENGINE_NAME}/crawler/domains/{DOMAIN_ID}/sitemaps
{
  "url": {STRING}
}
url (required)
The sitemap URL.

For successful calls, the response looks like this:

# 200 OK
{
  "id": "{SITEMAP_ID}",
  "url": "https://elastic.co/sitemap2.xml"
}
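
A sketch that registers a sitemap for a domain (reusing the hypothetical domain_id from above; the sitemap URL is an example):

resp = session.post(
    f"{BASE_URL}/api/as/v0/engines/my-engine/crawler/domains/{domain_id}/sitemaps",
    json={"url": "https://example.com/sitemap.xml"},
)
resp.raise_for_status()
print("created sitemap", resp.json()["id"])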

Update a sitemap

Updates a sitemap for a domain.

PUT /api/as/v0/engines/{ENGINE_NAME}/crawler/domains/{DOMAIN_ID}/sitemaps/{SITEMAP_ID}
{
  "url": {STRING}
}
url
The sitemap URL.

For successful calls, the response looks like this:

# 200 OK
{
  "id": "{SITEMAP_ID}",
  "url": "https://elastic.co/sitemap2.xml"
}

Delete a sitemap

Deletes a sitemap for a domain.

DELETE /api/as/v0/engines/{ENGINE_NAME}/crawler/domains/{DOMAIN_ID}/sitemaps/{SITEMAP_ID}

On success, the response contains an object confirming the sitemap has been deleted:

# 200 OK
{
  "deleted": true
}