Elastic web crawler known issues
The Elastic web crawler has the following known issues:
-
The crawler does not crawl pure JavaScript single-page applications (SPAs).
We recommend using dynamic rendering so that the crawler can index your JavaScript websites. Another option is to serve a static HTML version of your JavaScript website, using a solution such as Prerender.
-
The crawler does not support dynamic content.
The crawler does not execute JavaScript, and it only pulls text from HTML elements.
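To illustrate why JavaScript-rendered content is missed, here is a minimal Python sketch of static text extraction. It is not the crawler's implementation: the class `TextExtractor`, the helper `extract_text`, and both sample pages are assumptions made for illustration. The parser reads only the text already present in the HTML source and skips script bodies, so content that a script would insert at runtime never appears.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text from static HTML without running any scripts."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def extract_text(html):
    """Return the text found in the markup itself (hypothetical helper)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical SPA shell: the visible content exists only after the script runs.
SPA_PAGE = """
<html><body>
  <div id="app"></div>
  <script>document.getElementById("app").textContent = "Hello from JS";</script>
</body></html>
"""

# Hypothetical prerendered version of the same page.
STATIC_PAGE = "<html><body><div id='app'>Hello from JS</div></body></html>"

print(repr(extract_text(SPA_PAGE)))     # '' because the script is never executed
print(repr(extract_text(STATIC_PAGE)))  # 'Hello from JS'
```

A prerendered or dynamically rendered page corresponds to the second case, where the text is already in the markup when the crawler fetches it.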
-
The crawler does not support form-based authentication.
The crawler currently supports only basic authentication and authentication headers (for example, bearer tokens).
-
URLs being indexed despite having duplicate content and a canonical URL setting.
Canonical URL link tags are embedded within the HTML source of pages that duplicate the content of other pages. Refer to Duplicate document handling for details. The crawler identifies duplicate content by hashing the content of default deduplication fields derived from the page. These fields are defined by the configuration setting connector.crawler.extraction.default_deduplication_fields. The web crawler checks your index for an existing document with the same content hash. Users have reported cases where a page with a canonical link tag is still indexed as a separate document because its content hash differs from that of the canonical page, even though the content appears identical on inspection.
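The effect of hashing several fields together can be sketched in Python. This is an illustration only, not the crawler's actual hashing code: the function `content_hash`, the choice of SHA-256, and the two sample documents are assumptions. It shows how two pages with the same body text can still produce different hashes when another hashed field, here links, differs.

```python
import hashlib
import json

# Field names taken from the default deduplication fields named in this entry.
DEFAULT_FIELDS = ["body_content", "headings", "links",
                  "meta_description", "meta_keywords", "title"]


def content_hash(doc, fields):
    """Hash only the selected fields of a page (hypothetical helper)."""
    selected = {name: doc.get(name) for name in sorted(fields)}
    payload = json.dumps(selected, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


# Two hypothetical pages: same title and body, different pagination links.
page_a = {
    "url": "https://example.com/shoes",
    "title": "Shoes",
    "body_content": "All our shoes.",
    "links": ["https://example.com/shoes?page=2"],
}
page_b = {
    "url": "https://example.com/shoes?sort=price",
    "title": "Shoes",
    "body_content": "All our shoes.",
    "links": ["https://example.com/shoes?sort=price&page=2"],
}

same_body = (content_hash(page_a, ["body_content"])
             == content_hash(page_b, ["body_content"]))
same_default = (content_hash(page_a, DEFAULT_FIELDS)
                == content_hash(page_b, DEFAULT_FIELDS))

print(same_body)     # True: the body text is identical
print(same_default)  # False: the links field differs, so the hashes differ
```

Comparing the individual field values of two suspected duplicates in this way can help narrow down which field causes the hashes to diverge.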
Use the following workaround:
You can manage which fields the web crawler uses to create the content hash. If your pages all define canonical URLs, you could safely change your deduplication fields setting to include only the url field. Otherwise, you may need more fields to help check for duplicates. By default, the web crawler checks the body_content, headings, links, meta_description, meta_keywords, and title fields.
-
Custom scheduling might break when upgrading from version 8.6 or earlier.
If you encounter the error 'custom_schedule_triggered': undefined method 'each' for nil:NilClass (NoMethodError), it means the custom scheduling feature migration failed. You can use the following manual workaround:
POST /.elastic-connectors/_update/<connector-id>
{
  "doc": {
    "custom_scheduling": {}
  }
}
This error can appear on Connectors or Crawlers that aren’t the cause of the issue. If the error continues, try running the above command for every document in the .elastic-connectors index.
-
The web crawler ignores uppercase noindex tags.
Make sure these tags are lowercase.
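One way to find affected pages is to scan their HTML for robots directives that are not lowercase. The Python sketch below assumes the directive appears in a meta tag named robots, with noindex in its content attribute; the class `RobotsMetaChecker`, the helper `find_uppercase_noindex`, and the sample markup are hypothetical.

```python
from html.parser import HTMLParser


class RobotsMetaChecker(HTMLParser):
    """Collects robots meta contents whose noindex directive is not lowercase."""

    def __init__(self):
        super().__init__()
        self.problems = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, but not attribute values.
        if tag != "meta":
            return
        attr_map = {name: (value or "") for name, value in attrs}
        if attr_map.get("name", "").lower() != "robots":
            return
        content = attr_map.get("content", "")
        for directive in (part.strip() for part in content.split(",")):
            if directive.lower() == "noindex" and directive != "noindex":
                self.problems.append(content)


def find_uppercase_noindex(html):
    """Return the content values that contain a non-lowercase noindex."""
    checker = RobotsMetaChecker()
    checker.feed(html)
    checker.close()
    return checker.problems


bad_page = '<head><meta name="robots" content="NOINDEX, nofollow"></head>'
good_page = '<head><meta name="robots" content="noindex, nofollow"></head>'

print(find_uppercase_noindex(bad_page))   # ['NOINDEX, nofollow']
print(find_uppercase_noindex(good_page))  # []
```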
-
Updates to the default connector.crawler.http.user_agent are not applied.
A workaround is to remove the connector prefix and update the crawler.http_agent setting in your Enterprise Search configuration file.
-
The web crawler uses a non-deterministic method to determine thread pool size, which can lead to unexpected behavior.
To work around this, override the crawler.workers.pool_size.limit value in the elasticsearch.yml file.
-
Entry points should not have leading spaces.
Whitespace is not stripped from entry points, so any leading spaces become part of the URL and cause errors.
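Entry points can be cleaned before they are added to the crawler configuration. The following Python sketch is one possible pre-check, not part of the crawler: the function name `clean_entry_points` and the rule that an entry point must be an absolute http or https URL are assumptions made for this example.

```python
from urllib.parse import urlsplit


def clean_entry_points(entry_points):
    """Strip surrounding whitespace and reject values that are not absolute http(s) URLs."""
    cleaned = []
    for raw in entry_points:
        url = raw.strip()
        parts = urlsplit(url)
        if parts.scheme not in ("http", "https") or not parts.netloc:
            raise ValueError(f"not an absolute http(s) URL: {raw!r}")
        cleaned.append(url)
    return cleaned


print(clean_entry_points(["  https://example.com/", "https://example.com/blog "]))
# ['https://example.com/', 'https://example.com/blog']
```

Raising an error on malformed values, rather than silently dropping them, makes a mistyped entry point visible before a crawl starts.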