Web crawler FAQ

Looking for the Elastic web crawler? See the Elastic web crawler documentation.

Find answers to frequently asked questions about the App Search web crawler below.

See the Web crawler reference for detailed technical information about the web crawler.

What functionality is supported?

  • Crawling HTTP/HTTPS websites

    Includes support for both publicly accessible and private/intranet websites. Self-signed SSL certificates and custom Certificate Authorities are supported.

  • Support for crawling multiple domains per engine
  • Robots meta tag support
  • Robots "nofollow" support

    Includes robots meta tags set to "nofollow" and links with rel="nofollow" attributes. See the markup example after this list.

  • Robots.txt support

    The web crawler honors the directives within robots.txt files. See the example robots.txt file after this list.

  • Sitemap support

    The web crawler honors XML sitemaps and fetches sitemaps identified within robots.txt files. Additional sitemaps can also be managed for a domain through the domain dashboard. See the example sitemap after this list.

  • Configurable content extraction

    The web crawler extracts a predefined set of fields (URL, body content, etc.) from each page it visits. The crawler also supports extracting dynamic fields from meta tags. See the example after this list.

  • "Entry points"

    Entry points allow customers to specify where the web crawler begins crawling each domain.

  • "Crawl rules"

    Crawl rules allow customers to control whether each URL the web crawler encounters will be visited and indexed.

  • Logging of each crawl

    Logs cover an entire crawl, which encompasses all domains in an engine.

  • Automatic crawling

    Configure the cadence for new crawls to start automatically if there isn’t an active crawl.

  • User interfaces and APIs for managing domains, entry points, and crawl rules

    Crawler configuration can be managed via the App Search dashboard UIs or via a set of public APIs provided by the product. See the example API request after this list.

  • Crawl persistence

    The web crawler uses Elasticsearch to maintain its state during an active crawl, allowing crawls to be migrated between instances in case of an instance failure or a restart of the Enterprise Search instance running the crawl. Each unique URL is visited only once, thanks to the Seen URLs list persisted in Elasticsearch. Crawl-specific indexes are automatically cleaned up after a crawl is finished.

  • Crawling websites behind authentication

    The web crawler can crawl content protected by HTTP authentication or content sitting behind an HTTP proxy (with or without authentication).
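
The "nofollow" behaviors above map to standard HTML markup. A minimal illustration of the page-level and link-level forms (the URL is a placeholder):

    <!-- Page-level directive: do not follow any links on this page -->
    <meta name="robots" content="nofollow">

    <!-- Link-level directive: do not follow this one link -->
    <a href="https://example.com/private" rel="nofollow">Private area</a>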
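
A robots.txt file lives at the root of a domain and uses standard directives. The snippet below is illustrative; the hostname and paths are placeholders:

    User-agent: *
    Allow: /
    Disallow: /admin

    Sitemap: https://example.com/sitemap.xml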
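
An XML sitemap follows the sitemaps.org protocol. A minimal sitemap listing a single page (the URL is a placeholder):

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url>
        <loc>https://example.com/</loc>
      </url>
    </urlset>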
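
For dynamic fields, the crawler reads specially marked meta tags in a page's head. The sketch below assumes the class="elastic" convention used by Elastic web crawlers, with an illustrative field name and value; see the Web crawler reference for the exact markup:

    <head>
      <!-- Adds a custom "product_price" field to the indexed document -->
      <meta class="elastic" name="product_price" content="99.99">
    </head>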
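
As an illustration of the public APIs, the sketch below lists the domains configured for an engine. The host, engine name, and API key are placeholders, and the exact route should be confirmed against the web crawler API reference:

    GET /api/as/v1/engines/my-engine/crawler/domains HTTP/1.1
    Host: localhost:3002
    Authorization: Bearer private-xxxxxxxxxxxxxxxx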

What functionality is not supported?

  • Single-page app (SPA) support

    The crawler cannot currently crawl pages that are pure JavaScript single-page apps. We recommend looking into dynamic rendering to help the crawler properly index your JavaScript websites.

  • Form-based authentication

    The crawler supports only basic authentication and authentication header (e.g. bearer tokens) methods. See the header examples after this list.
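
Both supported schemes correspond to standard HTTP Authorization headers. In this sketch, the Basic value is the Base64 encoding of "user:password" and the bearer token is a placeholder:

    Authorization: Basic dXNlcjpwYXNzd29yZA==
    Authorization: Bearer my-bearer-token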
