App Search web crawler configuration
If you are looking for the Elastic web crawler configuration documentation, see Elastic web crawler in the Enterprise Search configuration documentation. To compare features with the Elastic web crawler, see Elastic web crawler overview.
-
crawler.http.user_agent
-
The User-Agent HTTP header used by the App Search web crawler.
crawler.http.user_agent: Elastic-Crawler (<crawler_version_number>)
When running the App Search web crawler on Elastic Cloud, the default user agent value is Elastic-Crawler Elastic Cloud (https://www.elastic.co/guide/en/cloud/current/ec-get-help.html; <unique identifier>).
-
crawler.http.user_agent_platform
-
The user agent platform used for the App Search web crawler with identifying information. See User-Agent - Syntax in the MDN web docs.
This value is appended as a suffix to crawler.http.user_agent, and the combined string is used as the final User-Agent header. This value is blank by default.
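As a sketch, the two settings might be combined like this in enterprise-search.yml (the version number and platform string here are illustrative, not defaults, and the exact separator between the two values may differ):

```yaml
# enterprise-search.yml — illustrative values only
crawler.http.user_agent: Elastic-Crawler (1.0.0)
crawler.http.user_agent_platform: (ExampleCorp; https://example.com/bot-info)

# Assuming the platform value is appended after a space, requests would
# carry a User-Agent header along the lines of:
#   Elastic-Crawler (1.0.0) (ExampleCorp; https://example.com/bot-info)
```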
-
crawler.workers.pool_size.limit
-
The number of parallel crawls allowed per instance of Enterprise Search. By default, it is set to 2x the number of available logical CPU cores. On Intel CPUs, the default value is 4x the number of physical CPU cores due to hyper-threading. See Hyper-threading on Wikipedia.
crawler.workers.pool_size.limit: N
You cannot set crawler.workers.pool_size.limit to more than 8x the number of physical CPU cores available to the Enterprise Search instance.
Keep in mind that, regardless of this setting, only one crawl request can run per engine at a time.
Per-crawl Resource Limits
These limits guard against infinite loops and other traps common to production web crawlers. If your crawler is hitting these limits, try changing your crawl rules or the content you’re crawling. Adjust these limits only as a last resort.
Advanced Per-crawl Limits
-
crawler.crawl.threads.limit
-
The number of parallel threads to use for each crawl. The main effect of increasing this value is increased throughput for the App Search web crawler, at the expense of higher CPU load on the Enterprise Search and Elasticsearch instances, as well as higher load on the website being crawled.
crawler.crawl.threads.limit: 10
-
crawler.crawl.url_queue.url_count.limit
-
The maximum size of the crawl frontier - the list of URLs the App Search web crawler needs to visit. The list is stored in Elasticsearch, so the limit can be increased as long as the Elasticsearch cluster has enough resources (disk space) to hold the queue index.
crawler.crawl.url_queue.url_count.limit: 100000
Per-Request Resource Limits
App Search web crawler HTTP Security Controls
edit-
crawler.security.ssl.certificate_authorities
-
A list of custom SSL Certificate Authority certificates to be used for all connections made by the App Search web crawler to your websites. These certificates are added to the standard list of CA certificates trusted by the JVM. Each item in this list can be either the path to a certificate file in PEM format or a PEM-formatted certificate provided as a string.
crawler.security.ssl.certificate_authorities: []
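A sketch of both accepted forms (the file path and certificate contents below are placeholders):

```yaml
crawler.security.ssl.certificate_authorities:
  # A path to a CA certificate file in PEM format (placeholder path)
  - /path/to/internal-ca.pem
  # Or a PEM-formatted certificate provided directly as a string
  - |
    -----BEGIN CERTIFICATE-----
    MIIB...placeholder...
    -----END CERTIFICATE-----
```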
-
crawler.security.ssl.verification_mode
-
Control the SSL verification mode used by the App Search web crawler:
-
full
- validate both the SSL certificate and the hostname presented by the server (this is the default and the recommended value)
-
certificate
- only validate the SSL certificate presented by the server
-
none
- disable SSL validation completely (this is very dangerous and should never be used in production deployments)
crawler.security.ssl.verification_mode: full
Setting the verification mode to none could expose your Authorization headers to a man-in-the-middle attack and should never be used in production deployments. See https://en.wikipedia.org/wiki/Man-in-the-middle_attack for more details.
App Search web crawler DNS Security Controls
The settings in this section could make your deployment vulnerable to SSRF attacks (especially in cloud environments) from the owners of any domains you crawl. Do not enable any of the settings here unless you fully control the DNS domains you access with the App Search web crawler. See Server Side Request Forgery on OWASP for more details on the SSRF attack and the risks associated with it.
-
crawler.security.dns.allow_loopback_access
-
Allow the App Search web crawler to access the localhost (127.0.0.0/8 IP namespace).
crawler.security.dns.allow_loopback_access: false
-
crawler.security.dns.allow_private_networks_access
-
Allow the App Search web crawler to access the private IP space: link-local, network-local addresses, etc. See Reserved IP addresses - IPv4 on Wikipedia for more details.
crawler.security.dns.allow_private_networks_access: false
App Search web crawler HTTP proxy settings
If you need the App Search web crawler to send HTTP requests through an HTTP proxy, use the following settings to provide the proxy information to Enterprise Search.
Your proxy connections are subject to the DNS security controls described in App Search web crawler DNS Security Controls. If your proxy server is running on a private address or a loopback address, you will need to explicitly allow the App Search web crawler to connect to it.
-
crawler.http.proxy.host
-
The host of the proxy.
crawler.http.proxy.host: example.com
-
crawler.http.proxy.port
-
The port of the proxy.
crawler.http.proxy.port: 8080
-
crawler.http.proxy.protocol
-
The protocol to be used when connecting to the proxy: http (default) or https.
crawler.http.proxy.protocol: http
-
crawler.http.proxy.username
-
The username portion of the Basic HTTP credentials to be used when connecting to the proxy.
crawler.http.proxy.username: kimchy
-
crawler.http.proxy.password
-
The password portion of the Basic HTTP credentials to be used when connecting to the proxy.
crawler.http.proxy.password: A3renEWhGVxgYFIqfPAV73ncUtPN1b
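Putting these settings together, a hypothetical enterprise-search.yml fragment for a proxy on a private network might look like this (the host, port, and credentials are placeholders):

```yaml
# Illustrative proxy configuration — all values are placeholders
crawler.http.proxy.host: proxy.internal.example.com
crawler.http.proxy.port: 8080
crawler.http.proxy.protocol: http
crawler.http.proxy.username: crawler-user
crawler.http.proxy.password: changeme

# Because this proxy lives on a private address, the crawler must be
# explicitly allowed to reach private networks. Review the DNS security
# controls above before enabling this:
crawler.security.dns.allow_private_networks_access: true
```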