Customize crawler field values using an ingest pipeline
In this guide, you’ll learn how to use ingest pipelines to remove unwanted content from webpages you’ve crawled using the Elastic web crawler. Most websites contain content that is not needed for a search experience, such as navigation links, headers and footers, and other boilerplate.
If you can edit the HTML for the web pages you’re crawling, you can use Meta tags and data attributes to extract custom fields. However, if you don’t control the source HTML, you’ll need a workaround to remove this content from your documents.
You can also extract content using CSS selectors or regular expressions. Check out the Content extraction rules for more information.
In this guide, we’ll show you another option, using ingest pipelines. Ingest pipelines are a native Elasticsearch feature for transforming data. We’ll use a custom ingest pipeline to modify the values of specific fields in our documents, before they are written to Elasticsearch. The advantage of this approach is that we can manage everything from the Kibana UI.
The problem and solution
In this example, imagine we are creating a search experience for a news website. We chose https://www.theskimm.com. We use the Elastic web crawler to crawl the website and index the content. However, the website contains a lot of boilerplate content that we don’t want to include in our search results. We know this because we’ve crawled the website and inspected the fields in the Elasticsearch documents created by the crawler.
To remove this content, we’ll use an ingest pipeline to modify the values of specific fields in our documents.
In this example, we will:
- Create an Elasticsearch index using the web crawler ingestion method
- Configure the crawler to crawl a subset of theskimm.com webpages
- Create a custom ingest pipeline
- Add a set of processors to the pipeline
- Configure the processors to find and remove boilerplate content
- Test the pipeline against sample documents
- Crawl the website using the custom pipeline
Prerequisites
To use Enterprise Search features, you need a subscription, a deployment, and a user. You get all three with an Elastic Cloud deployment. If you’re brand new to Elastic, start a free Elastic Cloud trial.
Within Advanced settings, ensure the Enterprise Search Size per zone is set to at least 4 GB RAM.
Configure your web crawler index
First, we need to create an Elasticsearch index and configure the web crawler to crawl the domain:
- In Kibana, navigate to Enterprise Search → Content → Indices to create your index.
- Add the domain https://www.theskimm.com to your index.
- Add the following crawl rules to crawl a subset of the domain:
  - Policy: Allow, Rule: Begins with, Path pattern: /news/2022
  - Policy: Disallow, Rule: Regex, Path pattern: .*
- Crawl the website.
The crawl shouldn’t take long (less than one minute), because we’ll only be indexing around 250 documents. Once the crawl is complete, take a minute to inspect a few documents. Go to the Documents tab and click on a few documents to inspect the content.
You’ll notice a few things:
- In the body_content field, header and footer content is duplicated across all documents.
  - There’s some boilerplate header content like: News Money Wellness Life Events PFL Midterms | Daily Skimm MORE+ Login search Sign up Menu News ...
  - There’s some boilerplate footer content like: Live Smarter Sign up for the Daily Skimm email newsletter. Delivered to your inbox every morning and prepares you for your day in minutes. ... © 2022 theSkimm, All rights reserved This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.
- All title field values are suffixed with | theSkimm (with leading and trailing spaces).
We don’t want this redundant content in our search results, so we’ll need to remove it. To do this, we’ll create a custom ingest pipeline that uses a set of processors to find and remove patterns from specific fields.
Create a custom ingest pipeline
When you create an index in Enterprise Search, a default ingest pipeline is set up with several processors to optimize your content for search. This pipeline is called ent-search-generic-ingestion.
Enterprise Search also enables you to easily create index-specific ingest pipelines for custom processing.
We’ll create a new pipeline with three processors targeting specific fields in our documents.
To create a custom pipeline:
- In Kibana, go to the Pipelines tab for the index you created in the previous step.
- Select Copy and customize. This creates a new pipeline called <index-name>@custom.
An ingest pipeline is a list of processors that are applied to a document in order. The pipeline we created is empty, so now we’ll need to add some processors.
The @custom pipeline

You should not rename this pipeline.

The <index-name>@custom pipeline is empty by default. We’ll add processors in the Kibana UI.
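If you prefer the Dev Tools console, you can confirm the new pipeline exists there as well. This is a minimal sketch that assumes a hypothetical index named my-crawler-index; substitute your own index name.

```
# Retrieve the (still empty) custom pipeline created by Copy and customize
GET _ingest/pipeline/my-crawler-index@custom
```

The response should show the pipeline with no processors configured yet.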
For this exercise we’ll use the Gsub processor. The Gsub processor allows you to replace (or simply delete) substrings within a field. We’ll use it to remove the header/footer and title boilerplate from our document fields.
We need to configure three Gsub processors:
- To remove the header boilerplate from the body_content field
- To remove the footer boilerplate from the body_content field
- To remove the title suffix from the title field
The following table shows the configuration for each processor. When you configure your three processors, you’ll use the Field and Pattern values from this table.
Table 1. Table of ingest pipeline processors
| Processor | Processor name | Field | Pattern | Replacement | Description |
|---|---|---|---|---|---|
| 1 | Gsub | body_content | News Money Wellness.*?The Story | (blank) | Find and remove boilerplate header content |
| 2 | Gsub | body_content | Live Smarter Sign up for the Daily Skimm email newsletter.*$ | (blank) | Find and remove boilerplate footer content |
| 3 | Gsub | title | \\|.* | (blank) | Find and remove boilerplate suffix from title |
Add and configure processors
Because we’re adding three Gsub processors, you’ll need to repeat these steps three times.
To add a processor:
- In the Content overview for your index, select the Pipelines tab.
- Select Edit pipeline for the <index-name>@custom pipeline. This navigates you to the Stack Management → Ingest Pipelines page.
- Select Manage, then Edit in the modal.
- Select Add processor and scroll through the list of available processors.
- Select the Gsub processor.
Next, we’ll configure each processor to find and remove matched patterns from specific fields. Again, you’ll need to repeat this operation three times, once for each processor.
Follow these steps, copying the values in the table above to configure each processor.
- Under Field, enter the field which contains the text you want to remove.
- In the Pattern field, enter the regex pattern you want to remove. Pay attention to any leading or trailing spaces.
- Leave the Replacement fields blank, because we want to simply remove the matching text.
- Select Ignore missing to prevent processors from failing if the field is missing.
- When you’ve configured the three processors, select Save pipeline to save your pipeline configurations.
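For reference, a pipeline configured this way is roughly equivalent to the following Dev Tools request. This is only an illustrative sketch, assuming a hypothetical index named my-crawler-index and using the Field and Pattern values from the table above; in this guide the processors are added through the UI, so you don’t need to run it.

```
# Illustrative only: the three Gsub processors expressed as a pipeline definition
PUT _ingest/pipeline/my-crawler-index@custom
{
  "description": "Remove boilerplate from crawled webpages",
  "processors": [
    {
      "gsub": {
        "field": "body_content",
        "pattern": "News Money Wellness.*?The Story",
        "replacement": "",
        "ignore_missing": true
      }
    },
    {
      "gsub": {
        "field": "body_content",
        "pattern": "Live Smarter Sign up for the Daily Skimm email newsletter.*$",
        "replacement": "",
        "ignore_missing": true
      }
    },
    {
      "gsub": {
        "field": "title",
        "pattern": "\\|.*",
        "replacement": "",
        "ignore_missing": true
      }
    }
  ]
}
```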
Our pipeline has been configured, but we should test it to make sure the three processors work as expected. This allows us to correct any problems before we start crawling. This is important, particularly if you’re crawling a large number of webpages.
Test your custom ingest pipeline
You can test your custom pipeline in the Kibana UI:
- In the Content overview for your index, click the Pipelines tab.
- Select Edit pipeline for the <index-name>@custom pipeline. This navigates you to the Stack Management → Ingest Pipelines page.
- Select Manage, then Edit in the modal.
- Select Add documents next to Test pipeline in the Processors section.
- Add a document to test your pipeline. Use a document from your index by providing the document’s index and document ID. Find this information in the Documents tab of your index overview page.
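You can also test from the Dev Tools console using the simulate pipeline API. The sketch below assumes the hypothetical pipeline name my-crawler-index@custom and a made-up document; replace the _source values with the title and body_content of a real document from your index.

```
# Run the custom pipeline against an inline sample document
POST _ingest/pipeline/my-crawler-index@custom/_simulate
{
  "docs": [
    {
      "_source": {
        "title": "Example article | theSkimm",
        "body_content": "News Money Wellness Life Events The Story Example article text. Live Smarter Sign up for the Daily Skimm email newsletter. Delivered to your inbox every morning."
      }
    }
  ]
}
```

The response returns the transformed _source for each document, so you can confirm the header, footer, and title suffix are stripped before you start a full crawl.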
Crawl your webpages
Now that we’ve tested our pipeline, we can start crawling our webpages.
- Go to your index in the Kibana UI and launch a crawl.
- Verify the structure of your documents in the Documents tab of the index overview page. Check that the boilerplate content has been removed from the body_content and title fields.
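To spot-check a few documents from the Dev Tools console instead of the UI, a simple search works. This sketch again assumes the hypothetical index name my-crawler-index.

```
# Return a few documents with only the fields we cleaned up
GET my-crawler-index/_search
{
  "size": 3,
  "_source": ["title", "body_content"],
  "query": { "match_all": {} }
}
```

The title values should no longer end in | theSkimm, and body_content should start with the article text rather than the navigation links.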
If you’ve followed along, well done! You’ve successfully configured your Enterprise Search deployment to customize how webpages are indexed. You’ve learned the basics of a powerful set of tools for creating custom search experiences. Use this guide as a blueprint for more complex use cases. There are currently around 40 processors available to use. See the Elasticsearch documentation for the full list.
Learn more
- Dive deeper into ingest pipelines in the Elasticsearch documentation.
- Learn more about ingest pipelines in Enterprise Search.
- The Web crawler schema details exactly how the crawler transforms HTML content into Elasticsearch documents.
- If you can edit the HTML for the web pages you’re crawling, see Optimizing web content for a more direct approach.
- Learn all about the Elastic web crawler.