Elastic web crawler

Looking for the App Search web crawler? See the App Search documentation.

To compare the web crawler with the App Search web crawler, see the reference table on this page.

Use the web crawler to programmatically discover, extract, and index searchable content from websites and knowledge bases. When you ingest data with the web crawler, a search-optimized Elasticsearch index is created to hold and sync webpage content.

The web crawler is a native Elasticsearch solution. It reads and writes directly to Elasticsearch indices in a format that enables developers to build intuitive, relevant search experiences using App Search engines and the Search UI library.

In the Kibana UI, go to Search > Content > Web crawlers to create new web crawlers and to manage and monitor crawls.

Availability and prerequisites

The Elastic web crawler was introduced in Elastic version 8.4.0.

The crawler is available to all Elastic Cloud deployments.

Your deployment must include the Elasticsearch, Kibana, and Enterprise Search services. Your Enterprise Search service should have at least 4 GB RAM per zone. See Infrastructure requirements to learn how to verify and change the RAM for your Enterprise Search service.

The web crawler is also available to self-managed deployments when the subscription requirements are satisfied. View the requirements for this feature under the Elastic Search section of the Elastic Stack subscriptions page.

Web crawler documentation

Version history

The following is a list of significant changes affecting this feature:

  • 8.5.0: The web crawler’s default ingest pipeline changes

    Since version 8.5.0, newly created Elastic web crawler indices use a new default pipeline that indexes extracted binary content into the body field. This differs from the usual body_content field that HTML content is indexed into, and may result in unexpected search results. This change does not affect existing Elastic web crawler indices created prior to 8.5.0.

    The following workarounds may apply:

    • Search experiences that expect content only in the body_content field can be updated to search across the body field as well.
    • You may "Copy and customize" the default pipeline of your crawler index, adding a set processor to copy the body field into the body_content field, or vice versa as needed.
    • If you have App Search engines built on top of an Elastic web crawler index, verify that any boosts and weights applied to the body_content field are also applied to the body field, where applicable.
  • 8.4.0: Web crawler is generally available (GA)
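The first two workarounds for the 8.5.0 pipeline change can be sketched as request bodies. The following is a minimal illustration, not the official configuration: the helper names copy_field_processor and both_fields_query are hypothetical, the search text is made up, and only the body and body_content field names come from the text above. It assumes the standard Elasticsearch set processor options (copy_from, override, if) and the multi_match query. The sketch only builds the JSON; adding the processor to your customized pipeline and sending the query are left to Kibana or your client of choice.

```python
import json


def copy_field_processor(source="body", target="body_content"):
    """Build a `set` processor that copies `source` into `target`.

    Assumption: this processor would be added to a customized copy of the
    crawler index's default pipeline. Swap the arguments to copy in the
    other direction.
    """
    return {
        "set": {
            "field": target,
            "copy_from": source,
            # Assumption: keep any value the target field already holds.
            "override": False,
            # Only run when the source field is present on the document.
            "if": f"ctx['{source}'] != null",
        }
    }


def both_fields_query(text):
    """Build a search body that matches `text` in either content field."""
    return {
        "query": {
            "multi_match": {
                "query": text,
                "fields": ["body_content", "body"],
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(copy_field_processor(), indent=2))
    print(json.dumps(both_fields_query("return policy"), indent=2))
```

With the processor in place, existing queries against body_content would also see extracted binary content. With the query change instead, the index is left as-is and the search side covers both fields. Which one fits depends on whether you can change the search client.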

Web crawler and App Search web crawler feature comparison

Feature                   | App Search web crawler | Web crawler
--------------------------|------------------------|--------------------------
Interface                 | GUI / API              | GUI-only
Binary content extraction | Yes                    | Yes
Search                    | App Search search APIs | Elasticsearch search APIs
Ingest pipelines          | Yes                    | Yes
ML inference pipelines    | No                     | Yes
Monitoring                | Yes                    | Yes
APM                       | Yes                    | Yes
Audit logging             | Yes                    | No
Event logging             | Yes                    | Yes
Public REST API           | Yes                    | No
Content extraction rules  | No                     | Yes