Web crawler FAQ
View frequently asked questions about the Enterprise Search web crawler below.
See Web crawler reference for detailed technical information about the web crawler.
We also welcome your feedback.
What functionality is supported?
- Crawling HTTP/HTTPS websites
  Includes support for both publicly accessible and private/intranet websites. Self-signed SSL certificates and custom Certificate Authorities are supported.
- Support for crawling multiple domains per engine
- Robots meta tag support
- Robots "nofollow" support
  Includes robots meta tags set to "nofollow" and links with rel="nofollow" attributes.
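  For example, a page-level robots meta tag and a link-level rel="nofollow" attribute look like this:

  ```html
  <!-- Page-level directive: do not index this page or follow its links -->
  <meta name="robots" content="noindex, nofollow">

  <!-- Link-level directive: do not follow this specific link -->
  <a href="https://example.com/private" rel="nofollow">Private area</a>
  ```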
- Robots.txt support
  The web crawler honors directives within robots.txt files.
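  For example, a robots.txt file like the following asks crawlers to skip one section of the site and advertises a sitemap:

  ```txt
  User-agent: *
  Disallow: /internal/

  Sitemap: https://example.com/sitemap.xml
  ```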
- Sitemap support
  The web crawler honors XML sitemaps and fetches sitemaps identified within robots.txt files. Additional sitemaps can also be managed for each domain through the domain dashboard.
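  A minimal sitemap in the standard sitemaps.org format:

  ```xml
  <?xml version="1.0" encoding="UTF-8"?>
  <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <url>
      <loc>https://example.com/products</loc>
    </url>
    <url>
      <loc>https://example.com/blog</loc>
    </url>
  </urlset>
  ```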
- Configurable content extraction
  The web crawler extracts a predefined set of fields (URL, body content, and so on) from each page it visits. It also supports extracting dynamic fields from meta tags.
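  For illustration, a custom field exposed to the crawler through a meta tag might look like the following (check the web crawler reference for the exact attribute names; the class="elastic" convention shown here is an assumption based on the reference documentation):

  ```html
  <head>
    <!-- Fields like title and body content are extracted automatically. -->
    <!-- A custom field can be exposed to the crawler through a meta tag: -->
    <meta class="elastic" name="product_price" content="99.99">
  </head>
  ```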
- Entry points
  Entry points allow customers to specify where the web crawler begins crawling each domain.
- Crawl rules
  Crawl rules allow customers to control whether each URL the web crawler encounters will be visited and indexed.
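  As an illustration only (not the crawler's actual implementation), ordered crawl rules can be modeled as a first-match-wins filter over URL paths; the rules and their ordering below are assumptions:

  ```python
  import re

  # Simplified model of ordered crawl rules: each rule pairs a policy
  # ("allow" or "disallow") with a path test; the first matching rule wins.
  RULES = [
      ("disallow", lambda path: path.startswith("/drafts/")),    # "begins with"
      ("disallow", lambda path: re.search(r"/private/", path)),  # "regex"
      ("allow", lambda path: True),                              # default rule
  ]

  def should_crawl(path: str) -> bool:
      for policy, matches in RULES:
          if matches(path):
              return policy == "allow"
      return True

  print(should_crawl("/drafts/new-post"))  # False
  print(should_crawl("/blog/hello"))       # True
  ```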
- Logging of each crawl
  Logs cover an entire crawl, which encompasses all domains in an engine.
- Automatic crawling
  Configure the cadence for new crawls to start automatically if there isn’t an active crawl.
- User interfaces and APIs for managing domains, entry points, and crawl rules
  Crawler configuration can be managed through the App Search dashboard or through a set of public APIs provided by the product.
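  For example, a minimal sketch of reading crawler configuration over the API (the endpoint path, port, and auth scheme below are assumptions based on App Search conventions; consult the web crawler API reference for the exact routes):

  ```python
  import requests

  BASE_URL = "http://localhost:3002"  # Enterprise Search base URL (assumed)
  API_KEY = "private-xxxxxxxx"        # an App Search private API key (placeholder)
  ENGINE = "my-engine"                # engine name (assumed)

  # List the domains configured for this engine's crawler.
  resp = requests.get(
      f"{BASE_URL}/api/as/v1/engines/{ENGINE}/crawler/domains",
      headers={"Authorization": f"Bearer {API_KEY}"},
  )
  resp.raise_for_status()
  print(resp.json())
  ```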
- Crawl persistence
  The web crawler uses Elasticsearch to maintain its state during an active crawl, allowing crawls to migrate between instances if the Enterprise Search instance running a crawl fails or restarts. Each unique URL is visited only once, thanks to a list of seen URLs persisted in Elasticsearch. Crawl-specific indexes are cleaned up automatically after a crawl finishes.
What functionality is not supported?
- Single-page app (SPA) support
  The web crawler cannot currently crawl pages that are pure JavaScript single-page apps. We recommend looking at dynamic rendering to help the crawler properly index your JavaScript websites.
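  As a sketch of the dynamic rendering approach (the user-agent tokens and snapshot paths below are illustrative assumptions), a server can route crawler traffic to prerendered HTML while regular visitors receive the JavaScript app:

  ```python
  from flask import Flask, request, send_file

  app = Flask(__name__)

  # Substrings identifying crawler traffic; illustrative, not an official list.
  BOT_TOKENS = ("Elastic-Crawler", "Googlebot")

  @app.route("/", defaults={"path": "index"})
  @app.route("/<path:path>")
  def serve(path):
      ua = request.headers.get("User-Agent", "")
      if any(token in ua for token in BOT_TOKENS):
          # Crawlers get a prerendered HTML snapshot of the requested route.
          return send_file(f"snapshots/{path}.html")
      # Regular visitors get the single-page app shell.
      return send_file("static/index.html")
  ```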
- Crawling websites behind authentication
  Presently, the web crawler cannot crawl websites that require authentication before accessing web content. The crawler supports custom User-Agent HTTP header values, which in some cases could be used to allow specific crawler requests and skip authentication.
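  For example, a site could recognize a custom User-Agent value and let those requests through without credentials. A sketch with assumed names (note that User-Agent headers are easily spoofed, so pair this with network-level restrictions such as an IP allowlist in practice):

  ```python
  from flask import Flask, abort, request

  app = Flask(__name__)

  # Custom User-Agent value configured on the crawler side (assumed for illustration).
  CRAWLER_UA = "my-company-crawler"

  @app.before_request
  def require_auth():
      if CRAWLER_UA in request.headers.get("User-Agent", ""):
          return  # skip authentication for recognized crawler requests
      if not request.authorization:
          abort(401)  # everyone else must supply credentials

  @app.route("/docs")
  def docs():
      return "Protected content the crawler may index."
  ```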
- Extracting content from files
  Currently, the web crawler only extracts content from HTML pages.