Optimizing web content for the web crawler
This documentation explains how to optimize web content for the crawler. To do this, you must be able to access and modify HTML, robots.txt files, or sitemap source files. If you can’t access these files, manage crawls in Kibana.
You can optimize your web content source files for the web crawler.
These techniques are similar to search engine optimization (SEO) techniques used for other web crawlers and robots. For example, you can embed instructions for the web crawler within your HTML content. You can also prevent the crawler from following links or indexing any content for certain webpages. Use these tools to manage webpage discovery and content extraction.
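Discovery concerns which web pages and files from crawled domains get indexed:
- Canonical URL link tags
- Robots meta tags
- Nofollow links
- robots.txt files
- Sitemaps
Extraction concerns how content is indexed and mapped to fields in Elasticsearch documents:
- Data attributes for inclusion and exclusion
- Meta tags and data attributes to extract custom fields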
HTML elements and attributes
The following sections describe crawler instructions you can embed within HTML elements and attributes.
Canonical URL link tags
A canonical URL link tag is an HTML element you can embed within pages that duplicate the content of other pages. The tag specifies the canonical URL for that content. See Duplicate document handling for detailed information about managing duplicate content using the Kibana UI.
The canonical URL is stored on the document in the url field, while the additional_urls field contains all other URLs where the crawler discovered the same content.
If your site contains pages that duplicate the content of other pages, use canonical URL link tags to explicitly manage which URL is stored in the url field of the indexed document.
Template:
<link rel="canonical" href="{CANONICAL_URL}">
Example:
<link rel="canonical" href="https://example.com/categories/dresses/starlet-red-medium">
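To check which canonical URL a page declares before the crawler visits it, a small script can read the tag directly. The following is a minimal sketch using Python's standard html.parser module; the class and function names are illustrative and are not part of the web crawler.

```python
from html.parser import HTMLParser


class CanonicalLinkParser(HTMLParser):
    """Collects the href of the first <link rel="canonical"> tag."""

    def __init__(self):
        super().__init__()
        self.canonical_url = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical_url is not None:
            return
        attributes = dict(attrs)
        # rel may hold several space-separated tokens.
        rel_tokens = (attributes.get("rel") or "").lower().split()
        if "canonical" in rel_tokens:
            self.canonical_url = attributes.get("href")


def find_canonical_url(html):
    """Return the declared canonical URL, or None if the page has none."""
    parser = CanonicalLinkParser()
    parser.feed(html)
    return parser.canonical_url


page = """
<html>
  <head>
    <link rel="canonical" href="https://example.com/categories/dresses/starlet-red-medium">
  </head>
  <body>Starlet dress, red, medium</body>
</html>
"""
print(find_canonical_url(page))
# prints https://example.com/categories/dresses/starlet-red-medium
```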
Robots meta tags
Robots meta tags are HTML elements you can embed within pages to prevent the crawler from following links or indexing content. These tags are related to crawl rules. See Crawl rules for detailed information about crawl rules.
Template:
<meta name="robots" content="{DIRECTIVES}">
Supported directives:
- noindex: The web crawler will not index the page. If you want to index some, but not all, content on a page, see Data attributes for inclusion and exclusion.
- nofollow: The web crawler will not follow links from the page. The web crawler logs a url_discover_denied event for each link. The directive does not prevent the web crawler from indexing the page.
Currently, content deletion (purge or process crawl) does not honor the noindex and nofollow directives. Crawls will not remove previously indexed pages that now have noindex and nofollow directives from the engine at the end of each crawl. To manually remove obsolete content, create the appropriate crawl rules to exclude the pages and run a process crawl.
Examples:
<meta name="robots" content="noindex">
<meta name="robots" content="nofollow">
<meta name="robots" content="noindex, nofollow">
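To audit which directives a page currently carries, the robots meta tags can be read with a short script. This is a minimal sketch built on Python's standard html.parser module; the names are illustrative, and the crawler's own parsing may differ in details such as case handling.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from every <meta name="robots"> tag on a page."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        # content is a comma-separated list such as "noindex, nofollow".
        for directive in (attributes.get("content") or "").split(","):
            directive = directive.strip().lower()
            if directive:
                self.directives.add(directive)


def robots_directives(html):
    """Return the set of robots directives declared in the HTML."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


directives = robots_directives('<meta name="robots" content="noindex, nofollow">')
should_index = "noindex" not in directives
should_follow_links = "nofollow" not in directives
print(should_index, should_follow_links)
# prints False False
```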
Data attributes for inclusion and exclusion
Inject HTML data attributes into your web pages to instruct the web crawler to include or exclude particular sections from extracted content. For example, use this feature to exclude navigation and footer content when crawling, or to exclude sections of content only intended for screen readers.
These attributes work as follows:
- For all pages that contain HTML tags with a data-elastic-exclude attribute, the crawler will ignore these tags and their content.
- The crawler will always extract content from HTML tags that have the data-elastic-include attribute.
- You can nest data-elastic-include attributes inside data-elastic-exclude attributes. This will allow the crawler to extract specific content.
- The web crawler will still crawl any links that appear inside excluded sections as long as the configured crawl rules allow them.
Examples
A simple content exclusion rule example:
<body>
  <p>This is your page content, which will be indexed by the web crawler.</p>
  <div data-elastic-exclude>Content in this div will be excluded from the search index</div>
</body>
In this more complex example with nested exclusion and inclusion rules, the web crawler will only extract "test1 test3 test5 test7" from the page.
<body>
  test1
  <div data-elastic-exclude>
    test2
    <p data-elastic-include>
      test3
      <span data-elastic-exclude>
        test4
        <span data-elastic-include>test5</span>
      </span>
    </p>
    test6
  </div>
  test7
</body>
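One way to reason about the nested example is that the nearest enclosing data-elastic-include or data-elastic-exclude attribute decides whether a piece of text is extracted. The sketch below simulates that model with Python's standard html.parser module. It is an assumption that reproduces the documented result for this example, not the crawler's actual implementation, and all names in it are illustrative.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class IncludeExcludeExtractor(HTMLParser):
    """Keeps text whose nearest data-elastic-* marker is include, or absent."""

    def __init__(self):
        super().__init__()
        self._stack = []  # (tag, included) for every open element
        self.chunks = []

    def _currently_included(self):
        return self._stack[-1][1] if self._stack else True

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        names = {name for name, _ in attrs}
        if "data-elastic-exclude" in names:
            included = False
        elif "data-elastic-include" in names:
            included = True
        else:
            included = self._currently_included()  # inherit from the parent
        self._stack.append((tag, included))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; stray closing tags are ignored.
        for index in range(len(self._stack) - 1, -1, -1):
            if self._stack[index][0] == tag:
                del self._stack[index:]
                break

    def handle_data(self, data):
        if self._currently_included() and data.strip():
            self.chunks.append(" ".join(data.split()))


def extract_text(html):
    extractor = IncludeExcludeExtractor()
    extractor.feed(html)
    extractor.close()
    return " ".join(extractor.chunks)


PAGE = """
<body> test1
  <div data-elastic-exclude> test2
    <p data-elastic-include> test3
      <span data-elastic-exclude> test4
        <span data-elastic-include>test5</span>
      </span>
    </p> test6
  </div> test7
</body>
"""
print(extract_text(PAGE))
# prints test1 test3 test5 test7
```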
Meta tags and data attributes to extract custom fields
The web crawler extracts a predefined set of fields (url, body content, etc.) from each page it visits. View your documents to see the full schema. With meta tags and data attributes, you can extract custom fields from your HTML pages.
Template:
<head>
  <meta class="elastic" name="{FIELD_NAME}" content="{FIELD_VALUE}">
</head>
<body>
  <div data-elastic-name="{FIELD_NAME}">{FIELD_VALUE}</div>
</body>
The crawled document for this example
<head>
  <meta class="elastic" name="product_price" content="99.99">
</head>
<body>
  <h1 data-elastic-name="product_name">Printer</h1>
</body>
will include two additional fields:
{
  "product_price": "99.99",
  "product_name": "Printer"
}
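To preview which custom fields a page would yield, the same two mechanisms can be read with a short script. The following is a rough sketch using Python's standard html.parser module. The names are illustrative, and the sketch makes simplifying assumptions: it does not handle a data-elastic-name element nested inside another one, or the attribute placed on a void element such as img.

```python
from html.parser import HTMLParser


class CustomFieldParser(HTMLParser):
    """Reads <meta class="elastic"> tags and data-elastic-name elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}
        self._capture = None  # state for the data-elastic-name element being read

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if self._capture is not None:
            # Track nesting of the same tag so the right closing tag ends the capture.
            if tag == self._capture["tag"]:
                self._capture["depth"] += 1
            return
        classes = (attributes.get("class") or "").split()
        if tag == "meta" and "elastic" in classes and attributes.get("name"):
            self.fields[attributes["name"]] = attributes.get("content") or ""
        elif attributes.get("data-elastic-name"):
            self._capture = {"name": attributes["data-elastic-name"],
                             "tag": tag, "depth": 1, "chunks": []}

    def handle_endtag(self, tag):
        if self._capture is None or tag != self._capture["tag"]:
            return
        self._capture["depth"] -= 1
        if self._capture["depth"] == 0:
            # Collapse runs of whitespace in the collected text.
            text = " ".join(" ".join(self._capture["chunks"]).split())
            self.fields[self._capture["name"]] = text
            self._capture = None

    def handle_data(self, data):
        if self._capture is not None:
            self._capture["chunks"].append(data)


def extract_custom_fields(html):
    parser = CustomFieldParser()
    parser.feed(html)
    return parser.fields


PRODUCT_PAGE = """
<head>
  <meta class="elastic" name="product_price" content="99.99">
</head>
<body>
  <h1 data-elastic-name="product_name">Printer</h1>
</body>
"""
print(extract_custom_fields(PRODUCT_PAGE))
```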
You can specify multiple class="elastic" and data-elastic-name tags.
Template:
<head>
  <meta class="elastic" name="{FIELD_NAME_1}" content="{FIELD_VALUE_1}">
  <meta class="elastic" name="{FIELD_NAME_2}" content="{FIELD_VALUE_2}">
</head>
<body>
  <div data-elastic-name="{FIELD_NAME_1}">{FIELD_VALUE_1}</div>
  <div data-elastic-name="{FIELD_NAME_2}">{FIELD_VALUE_2}</div>
</body>
{FIELD_NAME} must conform to field name rules:
- Must contain a lowercase letter and may only contain lowercase letters, numbers, and underscores.
- Must not contain whitespace or have a leading underscore.
- Must not contain more than 64 characters.
- Must not be any of the following reserved words:
  - id
  - engine_id
  - search_index_id
  - highlight
  - any
  - all
  - none
  - or
  - and
  - not
  - additional_urls
  - body_content
  - domains
  - headings
  - last_crawled_at
  - links
  - meta_description
  - meta_keywords
  - title
  - url
  - url_host
  - url_path
  - url_path_dir1
  - url_path_dir2
  - url_path_dir3
  - url_port
  - url_scheme
It might not be possible to customize the HTML source code for the webpages you want to crawl. Use Content extraction rules to customize how the crawler extracts content from webpages.
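If you generate field names programmatically, it can help to check them against the field name rules listed above before publishing pages. The following is a minimal sketch of such a check in Python. The function name is illustrative, and the crawler performs its own validation independently of this code.

```python
import re

# Reserved words copied from the field name rules.
RESERVED_FIELD_NAMES = {
    "id", "engine_id", "search_index_id", "highlight", "any", "all", "none",
    "or", "and", "not", "additional_urls", "body_content", "domains",
    "headings", "last_crawled_at", "links", "meta_description",
    "meta_keywords", "title", "url", "url_host", "url_path", "url_path_dir1",
    "url_path_dir2", "url_path_dir3", "url_port", "url_scheme",
}

# Only lowercase letters, digits, and underscores (this also rules out whitespace).
_ALLOWED_CHARACTERS = re.compile(r"[a-z0-9_]+")


def is_valid_field_name(name):
    """Return True if name satisfies every field name rule."""
    if len(name) > 64:
        return False
    if not _ALLOWED_CHARACTERS.fullmatch(name):
        return False
    if name.startswith("_"):
        return False
    if not any("a" <= character <= "z" for character in name):
        return False  # must contain at least one lowercase letter
    return name not in RESERVED_FIELD_NAMES


for candidate in ("product_price", "_hidden", "Product Name", "url"):
    print(candidate, is_valid_field_name(candidate))
```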
Nofollow links
Nofollow links are HTML links that instruct the crawler to not follow the URL.
The web crawler will not follow links that include rel="nofollow" (that is, it will not add those links to the crawl queue). The web crawler logs a url_discover_denied event for each link. The link does not prevent the web crawler from indexing the page in which it appears.
Template:
<a rel="nofollow" href="{LINK_URL}">{LINK_TEXT}</a>
Example:
<a rel="nofollow" href="/admin/categories">Edit this category</a>
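To see which links on a page would be skipped, the anchors can be split by their rel attribute. This is a minimal sketch using Python's standard html.parser module; the class name and the two list names are illustrative, and the crawler also applies crawl rules and robots.txt directives that this sketch ignores.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Separates links that may be followed from rel="nofollow" links."""

    def __init__(self):
        super().__init__()
        self.followed = []
        self.denied = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        href = attributes.get("href")
        if not href:
            return
        # rel may hold several space-separated tokens.
        rel_tokens = (attributes.get("rel") or "").lower().split()
        if "nofollow" in rel_tokens:
            self.denied.append(href)
        else:
            self.followed.append(href)


collector = LinkCollector()
collector.feed("""
<a href="/categories/dresses">Dresses</a>
<a rel="nofollow" href="/admin/categories">Edit this category</a>
""")
print(collector.followed)  # prints ['/categories/dresses']
print(collector.denied)    # prints ['/admin/categories']
```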
robots.txt files
It is impossible to configure the web crawler to ignore or work around a domain’s robots.txt file. Remember this if you’re crawling a domain you don’t control.
A domain may have a robots.txt file. This is a plain text file that provides instructions to web crawlers. The instructions within the file, also called directives, communicate which paths within that domain are disallowed (and allowed) for crawling.
You can also use a robots.txt file to specify sitemaps for a domain. See Sitemaps.
Most web crawlers automatically fetch and parse the robots.txt file for each domain they crawl. If you already publish a robots.txt file for other web crawlers, be aware that the web crawler will fetch this file and honor the directives within it. You may want to add, remove, or update the robots.txt file for each of your domains.
Example: add a robots.txt file to a domain
To add a robots.txt file to the domain https://shop.example.com:
- Determine which paths within the domain you’d like to exclude.
- Create a robots.txt file with the appropriate directives from the Robots exclusion standard. For instance:
  User-agent: *
  Disallow: /cart
  Disallow: /login
  Disallow: /account
- Publish the file, with filename robots.txt, at the root of the domain: https://shop.example.com/robots.txt.
The next time the web crawler visits the domain, it will fetch and parse the robots.txt file.
The web crawler will crawl only those paths that are allowed by the crawl rules for the domain and the directives within the robots.txt file for the domain. See crawl rules for detailed information about crawl rules.
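Before publishing a robots.txt file, you can check how a standards-following parser interprets it. The sketch below uses Python's standard urllib.robotparser module as a stand-in; the Elastic web crawler has its own parser, so treat the result as an approximation. The paths tested are illustrative.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /cart
Disallow: /login
Disallow: /account
"""

parser = RobotFileParser()
# parse() takes the file content as a list of lines, so no network access is needed.
parser.parse(ROBOTS_TXT.splitlines())

allowed = {
    path: parser.can_fetch("*", "https://shop.example.com" + path)
    for path in ("/products/1", "/cart", "/account/orders")
}
print(allowed)
# prints {'/products/1': True, '/cart': False, '/account/orders': False}
```

Disallow rules match by path prefix in this parser, which is why /account/orders is denied by the /account rule.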
Non-standard extensions
The Elastic web crawler does not support all nonstandard extensions to the robots exclusion standard.
Directive | Support
Crawl-delay directive | Not supported
Sitemap directive | Supported
Host directive | Not supported
Sitemaps
A sitemap is an XML file, associated with a domain, that informs web crawlers about pages within that domain. XML elements within the sitemap identify specific URLs that are available for crawling. Each domain may have one or more sitemaps.
If you already publish sitemaps for other web crawlers, the web crawler can use the same sitemaps. To make your sitemaps discoverable, specify them within robots.txt files.
Sitemaps are related to entry points. See entry points. You can choose to submit URLs to the web crawler using sitemaps, entry points, or a combination of both.
You may prefer using sitemaps over entry points for any of the following reasons:
- You have already been publishing sitemaps for other web crawlers.
- You don’t have access to the web crawler UI in Kibana.
- You prefer the sitemap file interface over the Kibana UI.
Use sitemaps to inform the web crawler of pages you think are important, or pages that are isolated and not linked from other pages. However, be aware that the web crawler will visit only those pages from the sitemap that are allowed by the domain’s crawl rules and robots.txt file directives.
Sitemap discovery and management
To add a sitemap to a domain, you can specify it within a robots.txt file. At the start of each crawl, the web crawler fetches and processes each domain’s robots.txt file and each sitemap specified within those robots.txt files.
Sitemap format and technical specification
The sitemaps standard defines the format and technical specification for sitemaps. Refer to the standard for the required and optional elements, character escaping, and other technical considerations and examples.
The web crawler does not process optional metadata defined by the standard. The web crawler extracts a list of URLs from each sitemap and ignores all other information.
There is no guarantee that pages (and their respective linked pages) will be indexed in the order they appear in the sitemap, because crawls are run asynchronously.
Ensure each URL within your sitemap matches the exact domain (here defined as scheme + host + port) for your site. Different subdomains (like www.example.com and blog.example.com), and different schemes (like http://example.com and https://example.com), require separate sitemaps.
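To check that every URL in a sitemap belongs to the same domain in this sense, you can compare scheme, host, and port. The following is a minimal sketch using Python's standard urllib.parse module. The function names are illustrative, and treating a missing port as the scheme's default port (80 for http, 443 for https) is an assumption of this sketch, not documented crawler behavior.

```python
from urllib.parse import urlsplit

# Assumption: a URL without an explicit port uses the scheme's default port.
DEFAULT_PORTS = {"http": 80, "https": 443}


def domain_key(url):
    """Return (scheme, host, port) for a URL."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    port = parts.port or DEFAULT_PORTS.get(scheme)
    return (scheme, parts.hostname, port)


def same_domain(url_a, url_b):
    return domain_key(url_a) == domain_key(url_b)


print(same_domain("https://shop.example.com/sitemap.xml",
                  "https://shop.example.com/products/1/"))  # prints True
print(same_domain("http://example.com", "https://example.com"))  # prints False
print(same_domain("https://www.example.com", "https://blog.example.com"))  # prints False
```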
The web crawler also supports sitemap index files. Refer to Using sitemap index files within the sitemap standard for sitemap index file details and examples.
Manage sitemaps
Example: Add a sitemap via robots.txt
To add a sitemap to the domain https://shop.example.com:
- Determine which pages within the domain you’d like to include. Ensure these paths are allowed by the domain’s crawl rules and the directives within the domain’s robots.txt file.
- Create a sitemap file with the appropriate elements from the sitemap standard. For instance:
  <?xml version="1.0" encoding="UTF-8"?>
  <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <url>
      <loc>https://shop.example.com/products/1/</loc>
    </url>
    <url>
      <loc>https://shop.example.com/products/2/</loc>
    </url>
    <url>
      <loc>https://shop.example.com/products/3/</loc>
    </url>
  </urlset>
- Publish the file on your site, for example, at the root of the domain: https://shop.example.com/sitemap.xml.
- Create or modify the robots.txt file for the domain, located at https://shop.example.com/robots.txt. Anywhere within the file, add a Sitemap directive that provides the location of the sitemap. For instance:
  Sitemap: https://shop.example.com/sitemap.xml
- Publish the new or updated robots.txt file.
The next time the web crawler visits the domain, it will fetch and parse the robots.txt file and the sitemap.
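To verify the published files before the next crawl, you can reproduce the two lookups yourself: read the Sitemap directive from robots.txt, then list the URLs in the sitemap. The sketch below uses Python's standard urllib.robotparser and xml.etree.ElementTree modules on inline copies of the example files. The function names are illustrative, the site_maps() method requires Python 3.8 or later, and the crawler's own processing may differ.

```python
import xml.etree.ElementTree as ET
from urllib.robotparser import RobotFileParser

SITEMAP_NAMESPACE = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

ROBOTS_TXT = """\
User-agent: *
Disallow: /cart
Sitemap: https://shop.example.com/sitemap.xml
"""

# Bytes, as the file would arrive over HTTP.
SITEMAP_XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://shop.example.com/products/1/</loc></url>
  <url><loc>https://shop.example.com/products/2/</loc></url>
  <url><loc>https://shop.example.com/products/3/</loc></url>
</urlset>
"""


def sitemap_locations(robots_txt):
    """Return the sitemap URLs named by Sitemap directives in robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.site_maps() or []  # site_maps() returns None when there are none


def sitemap_urls(sitemap_xml):
    """Return the <loc> values of a sitemap, ignoring all other elements."""
    root = ET.fromstring(sitemap_xml)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NAMESPACE)
        if loc.text
    ]


print(sitemap_locations(ROBOTS_TXT))
print(sitemap_urls(SITEMAP_XML))
```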
Alternatively, you can manage the sitemaps for a domain through the Kibana UI. From here, you can view, add, edit, and delete sitemaps. Use the UI to add custom sitemap definitions that do not live on the domain and are used only by your crawler. See Entry points and sitemaps.