Custom Analyzers
While Elasticsearch comes with a number of analyzers available out of the box, the real power comes from the ability to create your own custom analyzers by combining character filters, tokenizers, and token filters in a configuration that suits your particular data.
In Analysis and Analyzers, we said that an analyzer is a wrapper that combines three functions into a single package, which are executed in sequence:
- Character filters

  Character filters are used to “tidy up” a string before it is tokenized. For instance, if our text is in HTML format, it will contain HTML tags like <p> or <div> that we don’t want to be indexed. We can use the html_strip character filter to remove all HTML tags and to convert HTML entities like &Aacute; into the corresponding Unicode character Á.

  An analyzer may have zero or more character filters.

- Tokenizers

  An analyzer must have a single tokenizer. The tokenizer breaks up the string into individual terms or tokens. The standard tokenizer, which is used in the standard analyzer, breaks up a string into individual terms on word boundaries, and removes most punctuation, but other tokenizers exist that have different behavior.

  For instance, the keyword tokenizer outputs exactly the same string as it received, without any tokenization. The whitespace tokenizer splits text on whitespace only. The pattern tokenizer can be used to split text on a matching regular expression. (A short example comparing tokenizers follows this list.)

- Token filters

  After tokenization, the resulting token stream is passed through any specified token filters, in the order in which they are specified.

  Token filters may change, add, or remove tokens. We have already mentioned the lowercase and stop token filters, but there are many more available in Elasticsearch. Stemming token filters “stem” words to their root form. The asciifolding filter removes diacritics, converting a term like "très" into "tres". The ngram and edge_ngram token filters can produce tokens suitable for partial matching or autocomplete.
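To get a feel for how these tokenizers differ, you can name one directly in an analyze API request and compare the output for the same text. The following is only a minimal sketch against the 2.x-era analyze API, and the sample sentence is purely illustrative; the parameter names accepted for this kind of ad-hoc analysis have changed across Elasticsearch versions, so check the analyze API documentation for your release:

GET /_analyze
{
    "tokenizer": "whitespace",
    "text":      "The quick, brown fox."
}

The whitespace tokenizer returns the words with their punctuation still attached (quick, keeps its comma and fox. its full stop), swapping in standard strips the punctuation, and keyword returns the entire sentence as a single token.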
In Search in Depth, we discuss examples of where and how to use these tokenizers and filters. But first, we need to explain how to create a custom analyzer.
Creating a Custom Analyzer

In the same way as we configured the es_std analyzer previously, we can configure character filters, tokenizers, and token filters in their respective sections under analysis:
PUT /my_index
{
    "settings": {
        "analysis": {
            "char_filter": { ... custom character filters ... },
            "tokenizer":   { ... custom tokenizers ... },
            "filter":      { ... custom token filters ... },
            "analyzer":    { ... custom analyzers ... }
        }
    }
}
As an example, let’s set up a custom analyzer that will do the following:
- Strip out HTML by using the html_strip character filter.

- Replace & characters with " and ", using a custom mapping character filter:

  "char_filter": {
      "&_to_and": {
          "type":     "mapping",
          "mappings": [ "&=> and "]
      }
  }

- Tokenize words, using the standard tokenizer.

- Lowercase terms, using the lowercase token filter.

- Remove a custom list of stopwords, using a custom stop token filter:

  "filter": {
      "my_stopwords": {
          "type":      "stop",
          "stopwords": [ "the", "a" ]
      }
  }
Our analyzer definition combines the predefined tokenizer and filters with the custom filters that we have configured previously:
"analyzer": { "my_analyzer": { "type": "custom", "char_filter": [ "html_strip", "&_to_and" ], "tokenizer": "standard", "filter": [ "lowercase", "my_stopwords" ] } }
To put it all together, the whole create-index request looks like this:
PUT /my_index
{
    "settings": {
        "analysis": {
            "char_filter": {
                "&_to_and": {
                    "type":     "mapping",
                    "mappings": [ "&=> and "]
                }
            },
            "filter": {
                "my_stopwords": {
                    "type":      "stop",
                    "stopwords": [ "the", "a" ]
                }
            },
            "analyzer": {
                "my_analyzer": {
                    "type":        "custom",
                    "char_filter": [ "html_strip", "&_to_and" ],
                    "tokenizer":   "standard",
                    "filter":      [ "lowercase", "my_stopwords" ]
                }
            }
        }
    }
}
After creating the index, use the analyze API to test the new analyzer:
GET /my_index/_analyze
{
    "text":     "The quick & brown fox",
    "analyzer": "my_analyzer"
}
The following abbreviated results show that our analyzer is working correctly: the & has been replaced by and, and the stopword The has been removed:
{ "tokens" : [ { "token" : "quick", "position" : 2 }, { "token" : "and", "position" : 3 }, { "token" : "brown", "position" : 4 }, { "token" : "fox", "position" : 5 } ] }
The analyzer is not much use unless we tell Elasticsearch where to use it. We can apply it to a string field with a mapping such as the following:
PUT /my_index/_mapping/my_type
{
    "properties": {
        "title": {
            "type":     "string",
            "analyzer": "my_analyzer"
        }
    }
}
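Once this mapping is in place, full-text queries against the title field are analyzed with my_analyzer at search time too. As a quick sketch (the query string is only illustrative), the following match query would be analyzed into the terms quick, and, brown, and fox before being executed:

GET /my_index/_search
{
    "query": {
        "match": {
            "title": "The quick & brown fox"
        }
    }
}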