Web crawler schema

The web crawler indexes search documents using the following schema. All fields are strings or arrays of strings.

additional_urls
The URLs of additional pages with the same content.
body_content
The content of the page’s <body> tag with all HTML tags removed. Truncated to crawler.extraction.body_size.limit.
domains
The domains in which this content appears.
full_html
The full HTML of the page in string form. Extraction of this field is disabled by default; while it is disabled, documents have no full_html field at all.
headings
The text of the page’s HTML headings (h1 through h6 elements). Limited by crawler.extraction.headings_count.limit.
id
The unique identifier for the page.
last_crawled_at
The date and time when the page was last crawled.
links
Links found on the page. Limited by crawler.extraction.indexed_links_count.limit.
meta_description
The page’s description, taken from the <meta name="description"> tag. Truncated to crawler.extraction.description_size.limit.
meta_keywords
The page’s keywords, taken from the <meta name="keywords"> tag. Truncated to crawler.extraction.keywords_size.limit.
title
The title of the page, taken from the <title> tag. Truncated to crawler.extraction.title_size.limit.
url
The URL of the page.
url_host
The hostname or IP address from the page’s URL.
url_path
The full pathname from the page’s URL.
url_path_dir1
The first segment of the pathname from the page’s URL.
url_path_dir2
The second segment of the pathname from the page’s URL.
url_path_dir3
The third segment of the pathname from the page’s URL.
url_port
The port number from the page’s URL (as a string).
url_scheme
The scheme of the page’s URL.
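
To illustrate how the url_* fields relate to one another, the following is a minimal sketch that splits a URL into the parts described above. It is not the crawler's own implementation: the function name split_url_fields is hypothetical, and falling back to the scheme's default port when the URL has no explicit port is an assumption made for the example.

```python
from urllib.parse import urlsplit


def split_url_fields(url: str) -> dict:
    """Split a URL into schema-style url_* fields (illustrative sketch only)."""
    parts = urlsplit(url)

    # Assumption: when the URL has no explicit port, use the scheme's default.
    default_ports = {"http": 80, "https": 443}
    port = parts.port or default_ports.get(parts.scheme)

    fields = {
        "url": url,
        "url_scheme": parts.scheme,
        "url_host": parts.hostname,
        "url_path": parts.path,
    }

    # The schema stores every field as a string, including the port.
    if port is not None:
        fields["url_port"] = str(port)

    # Only the first three non-empty path segments get their own field.
    segments = [segment for segment in parts.path.split("/") if segment]
    for index, segment in enumerate(segments[:3], start=1):
        fields[f"url_path_dir{index}"] = segment

    return fields


if __name__ == "__main__":
    for name, value in split_url_fields("https://example.com/blog/2023/post.html").items():
        print(f"{name}: {value}")
```

In this sketch, a URL with fewer than three path segments simply produces fewer url_path_dir fields.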

In addition to these predefined fields, you can also extract custom fields via meta tags and attributes.