Elastic web crawler
Looking for the App Search web crawler? See the App Search documentation.
To compare the web crawler with the App Search web crawler, see the reference table on this page.
Use the web crawler to programmatically discover, extract, and index searchable content from websites and knowledge bases. When you ingest data with the web crawler, a search-optimized Elasticsearch index is created to hold and sync webpage content.
The web crawler is a native Elasticsearch solution. It reads and writes directly to Elasticsearch indices in a format that enables developers to build intuitive, relevant search experiences using App Search engines and the Search UI library.
Availability and prerequisites
The Elastic web crawler was introduced in Elastic version 8.4.0.
The crawler is available to all Elastic Cloud deployments.
Your deployment must include the Elasticsearch, Kibana, and Enterprise Search services. Your Enterprise Search service should have at least 4 GB RAM per zone. See Infrastructure requirements to learn how to verify and change the RAM for your Enterprise Search service.
The web crawler is also available to self-managed deployments when the subscription requirements are satisfied. View the requirements for this feature under the Elastic Enterprise Search section of the Elastic Stack subscriptions page.
Web crawler documentation
- Website search tutorial: Concrete guide to building a website search experience, using the crawler UI
- Managing crawls: Detailed reference for managing crawls using the Kibana UI. Learn how to:
  - Manage duplicated documents
  - Extract binary content such as PDFs from webpages
  - Schedule automated crawls
- Optimizing web content: Optimize your web content source files for the web crawler, to manage webpage discovery and content extraction. Learn about:
  - Custom field values using ingest pipeline: How to customize crawler field values using an ingest pipeline
  - Custom fields using proxy: How to extract custom fields from webpages using a proxy server
- Troubleshooting crawls: Detailed troubleshooting reference
- Web crawler events logs reference: Detailed web crawler events logs reference
- View web crawler events logs: How to view web crawler events logs in Kibana
Version history
The following is a list of significant changes affecting this feature:
- 8.5.0: The web crawler’s default ingest pipeline changes

  Since version 8.5.0, newly created Elastic web crawler indices use a new default pipeline that indexes extracted binary content into the `body` field. This differs from the usual `body_content` field that HTML content is indexed into, and may result in unexpected search results. This change does not affect existing Elastic web crawler indices created prior to 8.5.0.

  The following workarounds may apply:
  - Search experiences that expect content only in the `body_content` field can be updated to search across the `body` field as well.
  - You may "Copy and customize" the default pipeline of your crawler index, adding a `set` processor to copy the `body` field into the `body_content` field, or vice versa as needed.
  - Any App Search engines that are built on top of an Elastic web crawler index should double check that boosts and weights applied to the `body_content` field have also been applied to the `body` field, where applicable.
- 8.4.0: Web crawler is generally available (GA)
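The first two 8.5.0 workarounds can be sketched as plain request bodies. The following is a minimal, illustrative sketch, not the crawler's own configuration: the helper names `build_search_body` and `build_copy_processor` are hypothetical, and the `ignore_empty_value` and `override` options are choices made for this example rather than settings the note above prescribes. Only the field names `body` and `body_content` come from the version note.

```python
import json

# Field names taken from the 8.5.0 version note.
HTML_FIELD = "body_content"  # where HTML content is indexed
BINARY_FIELD = "body"        # where extracted binary content is indexed since 8.5.0


def build_search_body(user_query: str) -> dict:
    """Workaround 1: search across both fields with a multi_match query."""
    return {
        "query": {
            "multi_match": {
                "query": user_query,
                "fields": [HTML_FIELD, BINARY_FIELD],
            }
        }
    }


def build_copy_processor(source: str = BINARY_FIELD, target: str = HTML_FIELD) -> dict:
    """Workaround 2: a set processor that copies one field into the other.

    Swap `source` and `target` to copy in the opposite direction.
    """
    return {
        "set": {
            "field": target,
            "copy_from": source,
            # Assumption for this sketch: skip documents where the source
            # field is missing or empty instead of failing the pipeline.
            "ignore_empty_value": True,
            # Assumption for this sketch: leave an existing non-null target
            # value untouched rather than overwriting it.
            "override": False,
        }
    }


if __name__ == "__main__":
    # Print the JSON bodies so they can be inspected or pasted elsewhere.
    print(json.dumps(build_search_body("quarterly report"), indent=2))
    print(json.dumps(build_copy_processor(), indent=2))
```

The search body would be sent to the crawler index through the Elasticsearch search API, and the processor definition would be added to the customized copy of the index's pipeline. Whether to copy `body` into `body_content` or the reverse depends on which field your existing search experience already queries.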
Web crawler and App Search web crawler feature comparison
| | App Search web crawler | Web crawler |
|---|---|---|
| Interface | GUI / API | GUI-only |
| Binary content extraction | Yes | Yes |
| Search | App Search | |
| | Yes | Yes |
| | No | Yes |
| Monitoring | Yes | Yes |
| APM | Yes | Yes |
| Audit logging | Yes | No |
| Event logging | Yes | Yes |
| Public REST API | Yes | No |