Web crawler FAQ

View frequently asked questions about the Enterprise Search web crawler below.

See Web crawler reference for detailed technical information about the web crawler.

We also welcome your feedback.

What functionality is supported?

  • Crawling HTTP/HTTPS websites

    Includes support for both publicly accessible and private/intranet websites. Self-signed SSL certificates and custom Certificate Authorities are supported; a configuration sketch appears after this list.

  • Support for crawling multiple domains per Engine
  • Robots meta tag support
  • Robots "nofollow" support

    Includes robots meta tags set to "nofollow" and links with rel="nofollow" attributes, as shown in the markup example after this list.

  • Robots.txt support

    The web crawler honors directives within robots.txt files; an example appears after this list.

  • Sitemap support

    The web crawler honors XML sitemaps and fetches sitemaps identified within robots.txt files; a sample sitemap follows this list. Additional sitemaps can also be managed for each domain through the domain dashboard.

  • Configurable content extraction

    The web crawler extracts a predefined set of fields (url, body content, etc.) from each page it visits. The crawler also supports extracting dynamic fields from meta tags; an example appears after this list.

  • "Entry points"

    Entry points allow customers to specify where the web crawler begins crawling each domain.

  • "Crawl rules"

    Crawl rules allow customers to control whether each URL the web crawler encounters will be visited and indexed.

  • Logging of each crawl

    Each log covers an entire crawl, which encompasses all domains in the engine.

  • Automatic crawling

    Configure the cadence for new crawls to start automatically if there isn’t an active crawl.

  • User interfaces and APIs for managing domains, entry points, and crawl rules

    Crawler configuration can be managed via the App Search dashboard UIs or via a set of public APIs provided by the product; an API sketch appears after this list.

  • Crawl persistence

    The crawler uses Elasticsearch to maintain its state during an active crawl, allowing crawls to be migrated between instances if an Enterprise Search instance running a crawl fails or restarts. Each unique URL is visited only once, thanks to the Seen URLs list persisted in Elasticsearch. Crawl-specific indexes are automatically cleaned up after a crawl is finished.
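
For the self-signed certificate and custom Certificate Authority support mentioned above, trust for additional CAs can be configured on the Enterprise Search deployment. The YAML below is a minimal sketch only; the setting names are assumptions and should be confirmed against the Enterprise Search configuration reference before use.

    # enterprise-search.yml -- sketch; confirm setting names in the
    # Enterprise Search configuration reference before relying on them
    crawler.security.ssl.certificate_authorities:
      - /path/to/internal-ca.pem            # custom CA used by an intranet site
    crawler.security.ssl.verification_mode: full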
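
The robots meta tag and "nofollow" support listed above rely on standard HTML markup. A page can opt out of link following for the whole document or for individual links:

    <!-- page-level directive: index this page, but do not follow its links -->
    <meta name="robots" content="index, nofollow">

    <!-- link-level directive: do not follow this specific link -->
    <a href="https://example.com/private" rel="nofollow">Private area</a>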
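
The robots.txt directives the crawler honors are the standard ones served at the root of each domain, for example:

    # https://example.com/robots.txt
    User-agent: *
    Disallow: /admin/
    Allow: /

    # Sitemaps referenced here are also fetched by the crawler
    Sitemap: https://example.com/sitemap.xml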
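
A referenced sitemap is a standard XML document listing the URLs the crawler should discover:

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url>
        <loc>https://example.com/blog/first-post</loc>
      </url>
      <url>
        <loc>https://example.com/blog/second-post</loc>
      </url>
    </urlset>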
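
For the configurable content extraction described above, dynamic fields come from meta tags embedded in the page itself. The snippet below is an illustrative sketch only: the exact attributes the crawler expects (here, class="elastic" with a name/content pair) are assumptions, so check the Web crawler reference for the supported markup.

    <head>
      <!-- hypothetical custom field; verify the expected attributes
           in the Web crawler reference -->
      <meta class="elastic" name="product_price" content="99.99">
    </head>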
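
The domains, entry points, and crawl rules mentioned above can be managed programmatically as well as through the dashboard. The Python sketch below is illustrative only: the endpoint path, payload fields, and crawl-rule values are assumptions, so confirm them against the App Search web crawler API reference before use.

    import requests

    BASE_URL = "http://localhost:3002/api/as/v1"           # assumed App Search base URL
    ENGINE = "my-engine"                                    # hypothetical engine name
    HEADERS = {"Authorization": "Bearer private-xxxxxxxx"}  # private API key

    # Create a domain with one entry point and one crawl rule.
    # Endpoint path and field names are assumptions; consult the API reference.
    domain = {
        "name": "https://example.com",
        "entry_points": [{"value": "/blog"}],
        "crawl_rules": [
            {"policy": "deny", "rule": "begins", "pattern": "/admin"},
        ],
    }

    response = requests.post(
        f"{BASE_URL}/engines/{ENGINE}/crawler/domains",
        json=domain,
        headers=HEADERS,
    )
    response.raise_for_status()
    print(response.json())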

What functionality is not supported?

  • Single-page app (SPA) support

    The crawler cannot currently crawl pages that are pure JavaScript single-page apps. We recommend looking at dynamic rendering to help your crawler properly index your JavaScript websites.

  • Crawling websites behind authentication

    Presently, the crawler cannot crawl websites that require authentication before accessing web content. The crawler supports custom User-Agent HTTP header values, which in some cases can be used to let specific crawler requests through without authentication; a configuration sketch follows this list.

  • Extracting content from files

    Currently, the web crawler will only extract content from HTML pages.
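
If you rely on the custom User-Agent approach mentioned above, the header value is set on the Enterprise Search deployment, and your web server or firewall can then allow requests carrying it while still challenging other unauthenticated traffic. A minimal sketch, assuming the crawler.http.user_agent setting from the configuration reference:

    # enterprise-search.yml -- sketch; confirm the setting name in the
    # Enterprise Search configuration reference
    crawler.http.user_agent: "Elastic-Crawler (internal-allowlist)"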