Tidying Up Input Text
Tokenizers produce the best results when the input text is clean, valid text, where valid means that it follows the punctuation rules that the Unicode algorithm expects. Quite often, though, the text we need to process is anything but clean. Cleaning it up before tokenization improves the quality of the output.
Tokenizing HTML
Passing HTML through the standard tokenizer or the icu_tokenizer produces poor results. These tokenizers just don’t know what to do with the HTML tags. For example:
GET /_analyze?tokenizer=standard
<p>Some d&eacute;j&agrave; vu <a href="http://somedomain.com>">website</a>
The standard tokenizer confuses HTML tags and entities, and emits the following tokens: p, Some, d, eacute, j, agrave, vu, a, href, http, somedomain.com, website, a. Clearly not what was intended!
Character filters can be added to an analyzer to preprocess the text before it is passed to the tokenizer. In this case, we can use the html_strip character filter to remove HTML tags and to decode HTML entities such as &eacute; into the corresponding Unicode characters.
Character filters can be tested out via the analyze API by specifying them in the query string:
GET /_analyze?tokenizer=standard&char_filters=html_strip
<p>Some d&eacute;j&agrave; vu <a href="http://somedomain.com>">website</a>
To use them as part of the analyzer, they should be added to a custom
analyzer definition:
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_html_analyzer": {
          "tokenizer":   "standard",
          "char_filter": [ "html_strip" ]
        }
      }
    }
  }
}
Once created, our new my_html_analyzer can be tested with the analyze API:
GET /my_index/_analyze?analyzer=my_html_analyzer
<p>Some d&eacute;j&agrave; vu <a href="http://somedomain.com>">website</a>
This emits the tokens that we expect: Some, déjà, vu, website.
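Once the analyzer behaves as expected, you would normally apply it to the fields that will contain HTML. The following mapping is only a sketch: the doc type and the body field are illustrative names, and the string field syntax matches the other examples in this book:
PUT /my_index/_mapping/doc
{
  "properties": {
    "body": {
      "type":     "string",
      "analyzer": "my_html_analyzer"
    }
  }
}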
Tidying Up Punctuation
The standard tokenizer and icu_tokenizer both understand that an apostrophe within a word should be treated as part of the word, while single quotes that surround a word should not. Tokenizing the text You're my 'favorite'. would correctly emit the tokens You're, my, favorite.
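You can verify this with the same query-string form of the analyze API used earlier; the standard tokenizer should emit exactly those three tokens:
GET /_analyze?tokenizer=standard
You're my 'favorite'.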
Unfortunately, Unicode lists a few characters that are sometimes used as apostrophes:
- U+0027: Apostrophe (')—the original ASCII character
- U+2018: Left single-quotation mark (‘)—opening quote when single-quoting
- U+2019: Right single-quotation mark (’)—closing quote when single-quoting, but also the preferred character to use as an apostrophe
Both tokenizers treat these three characters as an apostrophe (and thus as part of the word) when they appear within a word. Then there are another three apostrophe-like characters:
- U+201B: Single high-reversed-9 quotation mark (‛)—same as U+2018 but differs in appearance
- U+0091: Left single-quotation mark in ISO-8859-1—should not be used in Unicode
- U+0092: Right single-quotation mark in ISO-8859-1—should not be used in Unicode
Both tokenizers treat these three characters as word boundaries—a place to
break text into tokens. Unfortunately, some publishers use U+201B
as a
stylized way to write names like M‛coy
, and the second two characters may well
be produced by your word processor, depending on its age.
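To see the word-boundary behavior for yourself, run such a name through the analyze API without any character filter; the standard tokenizer should split on the U+201B character and emit M and coy as two separate tokens:
GET /_analyze?tokenizer=standard
M‛coy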
Even when using the “acceptable” quotation marks, a word written with a single right quotation mark—You’re—is not the same as the word written with an apostrophe—You're—which means that a query for one variant will not find the other.
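Here is a minimal sketch of the problem, using a throwaway index whose index, type, and field names are purely illustrative. With the default standard analyzer, the document below is indexed with the U+2019 form, so a match query that uses the plain apostrophe should return no hits:
PUT /test_quotes/doc/1
{ "text": "You’re my favorite" }

GET /test_quotes/_search
{
  "query": {
    "match": {
      "text": "You're"
    }
  }
}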
Fortunately, it is possible to sort out this mess with the mapping character filter, which allows us to replace all instances of one character with another. In this case, we will replace all apostrophe variants with the simple U+0027 apostrophe:
PUT /my_index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "quotes": {
          "type": "mapping",
          "mappings": [
            "\\u0091=>\\u0027",
            "\\u0092=>\\u0027",
            "\\u2018=>\\u0027",
            "\\u2019=>\\u0027",
            "\\u201B=>\\u0027"
          ]
        }
      },
      "analyzer": {
        "quotes_analyzer": {
          "tokenizer":   "standard",
          "char_filter": [ "quotes" ]
        }
      }
    }
  }
}
In this configuration, we define a custom char_filter called quotes that maps every apostrophe variant to the plain U+0027 apostrophe. For clarity, we have used the JSON Unicode escape syntax for each character, but we could just have used the characters themselves, such as "‘=>'". We then use our custom quotes character filter to create a new analyzer called quotes_analyzer.
As always, we test the analyzer after creating it:
GET /my_index/_analyze?analyzer=quotes_analyzer
You’re my ‘favorite’ M‛Coy
This example returns the following tokens, with all of the in-word quotation marks replaced by apostrophes: You're, my, favorite, M'Coy.
The more effort that you put into ensuring that the tokenizer receives good-quality input, the better your search results will be.