Web crawler events logs reference

See View web crawler events logs to learn how to view web crawler events logs in Kibana.

The web crawler logs many events while discovering, extracting, and indexing web content.

Enterprise Search records these events using Elastic Common Schema (ECS), including a custom field set called crawler.* for crawler-specific data (like crawl_id).

This document provides a reference to these events and their fields.

This reference first describes the fields common to all web crawler events. The remainder of the document describes the different types of web crawler events: crawl lifecycle events, URL lifecycle events, and content ingestion events.

Fields common to all web crawler events

All web crawler events include the following common fields.

Crawler-specific fields

crawler.crawl.id
A unique ID of a specific crawl.

Base fields

@timestamp
A UTC timestamp of the event.
event.id
A unique identifier of the event.
event.action
The type of event. See the sections that follow.
message
A textual description of the event (useful for displaying in a UI for human consumption).

Service fields

service.ephemeral_id
A unique identifier of the crawler process generating the event (changes every time the process is restarted).
service.type
All events will have this set to crawler.
service.version
Current version of the Enterprise Search product.

Process fields

process.pid
The PID of the crawler instance.
process.thread.id
The ID of the thread logging the event.

Host fields

host.name
The host name where the crawler instance is deployed.
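
Taken together, these common fields give every event the same basic shape. The following is a sketch of what a single event might look like, expressed as a Python dict; all values are invented for illustration and trimmed to the fields described in this section.

```python
# A hypothetical crawler event, showing only the common fields above.
common_event = {
    "@timestamp": "2024-05-04T10:23:44.123Z",  # UTC timestamp of the event
    "event": {
        "id": "b6d5dbd5-5bcb-41c7-8d0f-1e2aa9d8e7f1",  # unique event ID
        "action": "crawl-start",  # the event type (see the sections below)
    },
    "message": "Started a crawl with ID 6048a68bdd7bcee11e66ff8b",
    "crawler": {"crawl": {"id": "6048a68bdd7bcee11e66ff8b"}},  # unique crawl ID
    "service": {
        "ephemeral_id": "c4aa1f36-2b21-4ffb-9c77-f1a7e7b6a9f0",
        "type": "crawler",      # always "crawler" for these events
        "version": "8.3.2",     # invented Enterprise Search version
    },
    "process": {"pid": 42, "thread": {"id": 17}},
    "host": {"name": "ent-search-node-1"},
}
```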

Crawl lifecycle events

Each crawl lifecycle event records important checkpoints within the lifecycle of a specific crawl, for example: start, seed, end. Most of the event information is captured in the message field, along with the other common fields described above. The fields below provide additional details.

Each crawl lifecycle event has one of the following values for event.action:

crawl-start
Emitted when a crawl is started. Includes crawl configuration.
crawl-seed
Emitted every time a crawl is seeded with a set of URLs from the outside. Includes the list of URLs submitted to the crawler.
crawl-end
Emitted when a crawl is ended for any reason (finished, canceled, etc).
crawl-status
Periodic events with a snapshot of crawler status metrics used for monitoring an active crawl over time.
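
Because every lifecycle event carries crawler.crawl.id, the lifecycle of a single crawl can be reconstructed by filtering and sorting on these fields. Here is a minimal sketch in Python, assuming the events have been exported as newline-delimited JSON; the file name and crawl ID are hypothetical.

```python
import json

LIFECYCLE_ACTIONS = {"crawl-start", "crawl-seed", "crawl-end", "crawl-status"}
crawl_id = "6048a68bdd7bcee11e66ff8b"  # example crawler.crawl.id value

# "crawler_events.ndjson" is a hypothetical export of the crawler event logs.
with open("crawler_events.ndjson") as f:
    events = [json.loads(line) for line in f]

# Reconstruct the lifecycle of one crawl in chronological order.
lifecycle = sorted(
    (
        e for e in events
        if e.get("event", {}).get("action") in LIFECYCLE_ACTIONS
        and e.get("crawler", {}).get("crawl", {}).get("id") == crawl_id
    ),
    key=lambda e: e["@timestamp"],
)
for e in lifecycle:
    print(e["@timestamp"], e["event"]["action"], e.get("message", ""))
```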

Crawl start events

event.kind
Set to event.
event.type
Set to start.
event.action
Set to crawl-start.
crawler.crawl.config
A serialized version of the crawl config.

Crawl seed events

event.kind
Set to event.
event.type
Set to change.
event.action
Set to crawl-seed.
crawler.crawl.seed_urls
A list of URLs used to seed a crawl.
crawler.url.type

A type of the URLs being added:

  • content for generic content URLs.
  • sitemap for sitemap and sitemap-index URLs.
  • feed for RSS/ATOM feeds.

Crawl end events

event.kind
Set to event.
event.type
Set to end.
event.action
Set to crawl-end.
event.outcome
Set to success or failure depending on how a crawl ended (canceled crawls will be considered failed, etc).

Crawl status events

event.kind
Set to metric.
event.type
Set to info.
event.action
Set to crawl-status.
crawler.status.*
A set of metrics describing the global state of a crawl and crawl-specific stats that may be useful to understand the state of a crawl over time.

URL lifecycle events

Each URL lifecycle event is scoped to a particular URL within a specific crawl. Each event describes what happened to the URL during the crawl, for example: how and when the crawler discovered it, or why the crawler skipped it. These events have enough detail to allow a human operator to understand exactly how the system discovered a specific URL, what decisions were made about it, and what the result of processing the URL was.

Each URL lifecycle event has one of the following values for event.action:

url-seed
URL submitted to the crawl backlog for processing (from a seed list, from within the crawl, via an API, etc).
url-fetch
URL fetch attempt including timing information, server response headers, HTTP code, etc.
url-discover
URL discovery events. Each time the crawler discovers a URL on a page and makes a decision about it, the URL and the decision are logged.
url-extracted
Events logged when we finish content extraction from a URL (possibly with some basic metadata extracted from the page).
url-output
An event marking the end of URL processing.
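
Since all URL lifecycle events for the same URL share the same crawler.url.hash (see the common fields below), the full history of each URL can be rebuilt by grouping on that hash. Here is a rough sketch in Python, again assuming a hypothetical newline-delimited JSON export of the logs:

```python
import json
from collections import defaultdict

URL_ACTIONS = {"url-seed", "url-fetch", "url-discover", "url-extracted", "url-output"}

timelines = defaultdict(list)
with open("crawler_events.ndjson") as f:  # hypothetical log export
    for e in map(json.loads, f):
        if e.get("event", {}).get("action") in URL_ACTIONS:
            url_hash = e.get("crawler", {}).get("url", {}).get("hash")
            timelines[url_hash].append(e)

# Print each URL's event sequence in chronological order, e.g.
# url-seed -> url-fetch -> url-extracted -> url-output
for url_hash, evts in timelines.items():
    evts.sort(key=lambda e: e["@timestamp"])
    print(url_hash, " -> ".join(e["event"]["action"] for e in evts))
```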

Fields common to all URL lifecycle events

All URL lifecycle events include the following common fields:

Identification fields:

crawler.url.hash
A unique identifier (hash) for the URL as it is handled by the crawler. All events for the same URL within a single crawl share the same hash.
crawler.url.source_hash
A unique identifier of the URL that was used to discover this URL (only used for cases when a URL was discovered during a crawl and not submitted as a seed URL).

URL details:

url.full
The full URL string.
url.scheme
Scheme portion of the URL.
url.domain
Domain portion of the URL.
url.port
Port of the URL.
url.path
Path of the URL.
url.query
URL query string. Included when available.
url.fragment
URL fragment. Included when available.
url.username
Username portion of the URL. Included when available.
url.password
Password portion of the URL. Included when available.
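
These url.* fields are the standard ECS decomposition of a URL. The mapping can be illustrated with Python's urllib.parse, though the exact decomposition performed by the crawler may differ in edge cases:

```python
from urllib.parse import urlsplit

# Decompose a sample URL the way the url.* fields above break it down.
parts = urlsplit("https://user:secret@www.example.com:8443/docs/page?q=1#top")
url_fields = {
    "url.full": parts.geturl(),
    "url.scheme": parts.scheme,      # "https"
    "url.domain": parts.hostname,    # "www.example.com"
    "url.port": parts.port,          # 8443
    "url.path": parts.path,          # "/docs/page"
    "url.query": parts.query,        # "q=1"
    "url.fragment": parts.fragment,  # "top"
    "url.username": parts.username,  # "user"
    "url.password": parts.password,  # "secret"
}
```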

URL seed events

These are small events used to track the flow of URLs into the crawler system and are primarily focused on tracking how a specific URL got into the backlog.

event.kind
Set to event.
event.type
Set to start.
event.action
Set to url-seed.
crawler.url.type

A type of the URL being added:

  • content for generic content URLs.
  • sitemap for sitemap and sitemap-index URLs.
  • feed for RSS/ATOM feeds.
crawler.url.source_type

A name of the source used for seeding the crawl:

  • seed-list for seed-list URLs submitted as a part of the crawl configuration.
  • organic for URLs discovered during a crawl by following organic links.
  • redirect for pages discovered by following a redirect.
  • canonical-url for pages discovered via the canonical URL meta tag.
crawler.url.source_url.hash
Set to the hash of the URL the crawler used to discover this page (only for URLs discovered during a crawl and not for entry points).
crawler.url.crawl_depth
A positive number indicating the number of steps the crawler had to take from the set of seed URLs to reach this specific page.
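
Putting these fields together, a url-seed event might look roughly like the following; all values are invented for illustration.

```python
# A hypothetical url-seed event, trimmed to the fields specific to seeding.
seed_event = {
    "event": {"kind": "event", "type": "start", "action": "url-seed"},
    "crawler": {
        "crawl": {"id": "6048a68bdd7bcee11e66ff8b"},
        "url": {
            "hash": "1ff2b4be9d7961fcfcbdd8c9dc02b04a52f3fd8a",
            "type": "content",         # a generic content URL
            "source_type": "organic",  # discovered by following a link
            "source_url": {"hash": "77f0a9d9c4f8e3ab12cd34ef56ab78cd90ef12ab"},
            "crawl_depth": 2,          # two steps away from the seed URLs
        },
    },
    "url": {"full": "https://www.example.com/docs/page"},
}
```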

URL fetch events

These are the primary events used for troubleshooting networking-layer issues with a crawl. They therefore aim to provide enough insight into what happened during a fetch attempt and what the results were.

Each of these events represents a single HTTP request. If the crawler follows redirects, it logs a separate event for each request in the chain, including information about the redirect response, to help with troubleshooting redirect chains.

event.kind
Set to event.
event.type
Set to access.
event.action
Set to url-fetch.

Event timing and outcome details:

event.start
The start of the HTTP request.
event.end
The end of the HTTP request.
event.duration
Response timing for the HTTP request (total time it took to get the full response).
event.outcome

An ECS categorization field. Denotes whether the event represents a success or a failure from the perspective of the crawler:

  • failure - for all 3xx, 4xx and 5xx responses.
  • success - for all 2xx responses.
  • unknown - for network timeouts.

HTTP request details:

http.request.method
The method of the request.

HTTP response details:

http.response.body.bytes
The size of the response body in bytes (for successful responses only).
http.response.status_code
The HTTP status code of the response, represented as a string.

HTTP redirect details:

crawler.url.redirect.location
The content of the Location header for redirect responses.
crawler.url.redirect.count
Number of redirects followed so far in a redirect chain (starts with 1 on the first redirect and is increased on each subsequent redirect until a non-redirect response is received or the maximum number of redirects is reached).
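
One practical use of these fields is summarizing fetch results across a crawl, for example tallying HTTP status codes and event outcomes. Here is a small sketch in Python, assuming a hypothetical newline-delimited JSON export of the logs:

```python
import json
from collections import Counter

with open("crawler_events.ndjson") as f:  # hypothetical log export
    fetches = [
        e for e in map(json.loads, f)
        if e.get("event", {}).get("action") == "url-fetch"
    ]

# Distribution of HTTP status codes and outcomes seen during crawling.
statuses = Counter(
    e.get("http", {}).get("response", {}).get("status_code") for e in fetches
)
outcomes = Counter(e.get("event", {}).get("outcome") for e in fetches)
print(statuses.most_common())
print(outcomes.most_common())
```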

URL discover events

These are small events used to troubleshoot URL discovery within the crawler. Each time the crawler sees a new URL (extracted from a page, from a sitemap or from following a redirect), it logs information about it along with the decision on what will happen to the newly discovered link.

event.kind
Set to event.
event.action
Set to url-discover.
event.type

Depending on the decision regarding the URL, set to one of:

  • allowed if the URL will be added to the backlog for future crawling.
  • denied if the URL will not be followed (the message field will have a human-readable explanation of why the crawler decided not to follow it).
crawler.url.source_type

A type of the source used for discovering the link:

  • organic for URLs discovered during a crawl by following organic links.
  • redirect for pages discovered by following a redirect.
crawler.url.source_url.hash
Set to the hash of the URL the crawler used to discover this page (for URLs discovered during the crawl and not for entry points).
crawler.url.crawl_depth
A positive number indicating the number of steps the crawler had to take from the set of seed URLs to reach this specific page.
crawler.url.deny_reason

A field with a code explaining the reason for skipping a URL during a crawl:

  • already_seen when this exact URL/page has already been processed in this crawl.
  • link_too_deep when we hit a crawl depth limit.
  • link_too_long when we hit a URL length limit.
  • link_with_too_many_params when we hit a limit on the number of URL parameters allowed.
  • link_with_too_many_segments when we hit a limit on the number of URL segments allowed.
  • queue_full when we hit a backlog size limit.
  • sitemap_denied when a URL is prohibited from crawling by a sitemap rule.
  • domain_filter_denied for prohibited cross-domain links.
  • page_already_visited for crawl-scoped URL de-duplication events.
  • incorrect_protocol for non-HTTP links and non-HTTPS links in HTTPS-enforced mode.
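
When a crawl indexes fewer pages than expected, tallying crawler.url.deny_reason across denied url-discover events is a quick way to see which limit or rule is responsible. Here is a sketch in Python, assuming a hypothetical newline-delimited JSON export of the logs:

```python
import json
from collections import Counter

with open("crawler_events.ndjson") as f:  # hypothetical log export
    denied = [
        e for e in map(json.loads, f)
        if e.get("event", {}).get("action") == "url-discover"
        and e.get("event", {}).get("type") == "denied"
    ]

# Count how often each deny reason occurred during the crawl.
reasons = Counter(
    e.get("crawler", {}).get("url", {}).get("deny_reason") for e in denied
)
for reason, count in reasons.most_common():
    print(f"{reason}: {count}")
```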

URL extracted events

These events are focused on the extraction portion of the crawler process and are logged to help an operator troubleshoot the process of content extraction for the pages on their domains. The primary focus here is capturing the details of the extraction process.

Each event represents a single extractor handling a single piece of content.

event.kind
Set to event.
event.action
Set to url-extracted.
event.module
The name of the extractor generating the event (e.g. html).
event.type

Depending on the decision regarding the URL, set to one of:

  • allowed if the URL has been allowed to be indexed.
  • denied if the URL has not been indexed because of a crawl rule, a robots.txt rule, etc (the message field will have a human-readable explanation of what happened).

Event timing and outcome details:

event.start
The start of the extraction process.
event.end
The end of the extraction process.
event.duration
End-to-end timing for the extraction process (total time it took to get the data extracted).
event.outcome

An ECS categorization field. Denotes whether the event represents a success or a failure from the perspective of the crawler:

  • failure if the extraction process failed and we are going to drop the content.
  • success if the extraction process succeeded (or failed in a graceful manner).

Extraction result details:

crawler.extraction.content_type
Content type for the page.
crawler.extraction.content_size.bytes
The size of the page.
crawler.extraction.fields_extracted
The list of fields extracted.
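
A url-extracted event combining these fields might look roughly like the following. All values are invented; the duration value assumes the ECS convention of expressing event.duration in nanoseconds.

```python
# A hypothetical url-extracted event, trimmed to extraction-specific fields.
extracted_event = {
    "event": {
        "kind": "event",
        "action": "url-extracted",
        "module": "html",        # the extractor that handled the content
        "type": "allowed",       # the page was allowed to be indexed
        "start": "2024-05-04T10:23:45.100Z",
        "end": "2024-05-04T10:23:45.230Z",
        "duration": 130000000,   # nanoseconds, per ECS conventions
        "outcome": "success",
    },
    "crawler": {
        "url": {"hash": "1ff2b4be9d7961fcfcbdd8c9dc02b04a52f3fd8a"},
        "extraction": {
            "content_type": "text/html",
            "content_size": {"bytes": 52340},
            "fields_extracted": ["title", "body_content", "links"],
        },
    },
}
```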

URL output events

These events are designed to capture the results of ingestion of a single piece of content into an external system (file, App Search, etc). The main goal here is to capture any data needed to tie a URL fetched and processed by the crawler to the changes performed in the external system as a result of the crawl.

Each event represents a single output module handling a single piece of content.

event.kind
Set to event.
event.type
Set to end.
event.action
Set to url-output.
event.module
The name of the output module generating the event (e.g. file, app-search).

Event timing and outcome details:

event.start
The start of the output ingestion process.
event.end
The end of the output ingestion process.
event.duration
End-to-end timing for the output ingestion process (total time it took to get the data processed by the module).
event.outcome

An ECS categorization field. Denotes whether the event represents a success or a failure from the perspective of the crawler:

  • failure if the output ingestion process failed and we are going to drop the content.
  • success if the output ingestion process succeeded (or failed in a graceful manner).
  • unknown for cases specific to an output module.

Output ingestion results (file module):

crawler.output.file.directory
The directory where the event has been logged.
crawler.output.file.name
The name of the file where the event has been logged (base name without the directory).

Output ingestion results (app-search module):

crawler.output.app-search.engine.id
The id of the engine used to ingest the content.
crawler.output.app-search.engine.name
The name of the engine used to ingest the content.
crawler.output.app-search.document_id
The id of the document within the engine.
crawler.output.app-search.content_hash
The content hash used for de-duplication purposes.
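
For the app-search module, a url-output event carrying these fields might look roughly like the following; all values are hypothetical.

```python
# A hypothetical url-output event from the app-search output module.
output_event = {
    "event": {
        "kind": "event",
        "type": "end",
        "action": "url-output",
        "module": "app-search",
        "outcome": "success",
    },
    "crawler": {
        "url": {"hash": "1ff2b4be9d7961fcfcbdd8c9dc02b04a52f3fd8a"},
        "output": {
            "app-search": {
                "engine": {"id": "5f1e9d8c2a4b3c2d1e0f9a8b", "name": "my-engine"},
                "document_id": "600e5dbc12a4c1b3d2e4f5a6",
                "content_hash": "0a1b2c3d4e5f60718293a4b5c6d7e8f901234567",
            }
        },
    },
}
```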

Content ingestion events

A special kind of event used to troubleshoot the ingestion process. These events are used only by complex output modules and, potentially, are only enabled in debug mode or by using a special crawl config option. The goal of these events is to explain the ingestion process results in more detail than could be captured by a URL output event.

event.kind
Set to event.
event.type
Set to info.
event.action
Set to ingest-progress.
event.module
The name of the output module generating the event (e.g. file, app-search).
message

Details on what is happening with the ingestion process.

App Search logs URL-scoped events that explain how a specific piece of content from the crawler got ingested into the external system. These are important for troubleshooting cases when the crawler discovers and crawls a URL, but due to App Search de-duplication logic the content does not get ingested, etc.

ingest-progress
An event logged by an output module to help an operator troubleshoot the ingestion process. These are pretty generic events using the message field to explain what is happening.

URL identification fields:

These are used to correlate an ingestion event to the rest of the events generated by the crawler for a specific page:

crawler.url.hash
A unique identifier for the URL as it is handled by the crawler. All events for the same URL within a single crawl share the same hash (since it is calculated as a SHA1 hash of the URL itself).
url.full
The full URL string.
url.scheme
Scheme portion of the URL.
url.domain
Domain portion of the URL.
url.port
Port of the URL.
url.path
Path of the URL.
url.query
URL query string. Included when available.
url.fragment
URL fragment. Included when available.
url.username
Username portion of the URL. Included when available.
url.password
Password portion of the URL. Included when available.
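
These identification fields make it possible to pull the ingestion story for a single page alongside the rest of its lifecycle events. Here is a final sketch in Python; the file name and hash are hypothetical.

```python
import json

# Hash of the page to investigate (crawler.url.hash is a SHA1 of the URL).
target_hash = "1ff2b4be9d7961fcfcbdd8c9dc02b04a52f3fd8a"

with open("crawler_events.ndjson") as f:  # hypothetical log export
    for e in map(json.loads, f):
        if (
            e.get("event", {}).get("action") == "ingest-progress"
            and e.get("crawler", {}).get("url", {}).get("hash") == target_hash
        ):
            # Each message explains one step of the ingestion process.
            print(e["@timestamp"], e.get("event", {}).get("module"), e.get("message"))
```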