Web crawler schema
The web crawler indexes search documents using the following schema. All fields are strings or arrays of strings.
-
additional_urls
- The URLs of additional pages with the same content.
-
body_content
- The content of the page’s <body> tag with all HTML tags removed. Truncated to crawler.extraction.body_size.limit.
-
domains
- The domains in which this content appears.
-
full_html
- The full HTML of the page in string form. This is disabled by default. If the setting is disabled, the document will not have a full_html field at all.
-
headings
- The text of the page’s HTML headings (h1 through h6 elements). Limited by crawler.extraction.headings_count.limit.
-
id
- The unique identifier for the page.
-
last_crawled_at
- The date and time when the page was last crawled.
-
links
- Links found on the page. Limited by crawler.extraction.indexed_links_count.limit.
-
meta_description
- The page’s description, taken from the <meta name="description"> tag. Truncated to crawler.extraction.description_size.limit.
-
meta_keywords
- The page’s keywords, taken from the <meta name="keywords"> tag. Truncated to crawler.extraction.keywords_size.limit.
-
title
- The title of the page, taken from the <title> tag. Truncated to crawler.extraction.title_size.limit.
-
url
- The URL of the page.
-
url_host
- The hostname or IP from the page’s URL.
-
url_path
- The full pathname from the page’s URL.
-
url_path_dir1
- The first segment of the pathname from the page’s URL.
-
url_path_dir2
- The second segment of the pathname from the page’s URL.
-
url_path_dir3
- The third segment of the pathname from the page’s URL.
-
url_port
- The port number from the page’s URL (as a string).
-
url_scheme
- The scheme of the page’s URL.
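The url_* fields above are all derived from the page’s URL. The following Python sketch shows one way such a decomposition could work, using the standard library's urllib.parse. It is an illustration only, not the crawler's implementation: the function name url_fields, the DEFAULT_PORTS table, the choice to fill in a default port when the URL omits one, and the choice to leave out url_path_dirN when the path has fewer segments are all assumptions.

```python
from urllib.parse import urlsplit

# Assumption: when a URL has no explicit port, fall back to the
# conventional port for its scheme. The crawler may behave differently.
DEFAULT_PORTS = {"http": 80, "https": 443}


def url_fields(url: str) -> dict[str, str]:
    """Split a URL into schema-style fields (hypothetical helper)."""
    parts = urlsplit(url)

    fields = {
        "url": url,
        "url_scheme": parts.scheme,
        "url_host": parts.hostname or "",
        "url_path": parts.path or "/",
    }

    # The schema stores the port as a string, not a number.
    port = parts.port or DEFAULT_PORTS.get(parts.scheme)
    if port is not None:
        fields["url_port"] = str(port)

    # Keep at most the first three non-empty path segments.
    segments = [segment for segment in parts.path.split("/") if segment]
    for index, segment in enumerate(segments[:3], start=1):
        fields[f"url_path_dir{index}"] = segment

    return fields
```

For example, calling url_fields("https://example.com:8443/docs/guide/intro/page.html") yields url_path_dir1, url_path_dir2, and url_path_dir3 values of docs, guide, and intro, with url_port set to the string 8443.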
In addition to these predefined fields, you can also extract custom fields via meta tags and attributes.