WARNING: Version 5.x has passed its EOL date.
This documentation is no longer being maintained and may be removed. If you are running this version, we strongly advise you to upgrade. For the latest information, see the current release documentation.
Testing analyzers
editTesting analyzers
editWhen building your own analyzers, it’s useful to test that the analyzer does what we expect it to. This is where the Analyze API comes in.
Testing in-built analyzers
editTo get started with the Analyze API, we can test to see how a built-in analyzer will analyze a piece of text
var analyzeResponse = client.Analyze(a => a .Analyzer("standard") .Text("F# is THE SUPERIOR language :)") );
This returns the following response from Elasticsearch
{ "tokens": [ { "token": "f", "start_offset": 0, "end_offset": 1, "type": "<ALPHANUM>", "position": 0 }, { "token": "is", "start_offset": 3, "end_offset": 5, "type": "<ALPHANUM>", "position": 1 }, { "token": "the", "start_offset": 6, "end_offset": 9, "type": "<ALPHANUM>", "position": 2 }, { "token": "superior", "start_offset": 10, "end_offset": 18, "type": "<ALPHANUM>", "position": 3 }, { "token": "language", "start_offset": 19, "end_offset": 27, "type": "<ALPHANUM>", "position": 4 } ] }
which is deserialized to an instance of IAnalyzeResponse
by NEST
that we can work with
foreach (var analyzeToken in analyzeResponse.Tokens) { Console.WriteLine($"{analyzeToken.Token}"); }
In testing the standard
analyzer on our text, we’ve noticed that
-
F#
is tokenized as"f"
-
stop word tokens
"is"
and"the"
are included -
"superior"
is included but we’d also like to tokenize"great"
as a synonym for superior
We’ll look at how we can test a combination of built-in analysis components next to build an analyzer to fit our needs.
Testing built-in analysis components
editA transient analyzer can be composed from built-in analysis components to test an analysis configuration
var analyzeResponse = client.Analyze(a => a .Tokenizer("standard") .Filter("lowercase", "stop") .Text("F# is THE SUPERIOR language :)") );
{ "tokens": [ { "token": "f", "start_offset": 0, "end_offset": 1, "type": "<ALPHANUM>", "position": 0 }, { "token": "superior", "start_offset": 10, "end_offset": 18, "type": "<ALPHANUM>", "position": 3 }, { "token": "language", "start_offset": 19, "end_offset": 27, "type": "<ALPHANUM>", "position": 4 } ] }
Great! This has removed stop words, but we still have F#
tokenized as "f"
and no "great"
synonym for "superior"
.
Character and Token filters are applied in the order in which they are specified.
Let’s build a custom analyzer with additional components to solve this.
Testing a custom analyzer in an index
editA custom analyzer can be created within an index, either when creating the index or by updating the settings on an existing index.
When adding to an existing index, it needs to be closed first.
In this example, we’ll add a custom analyzer to an existing index. First, we need to close the index
client.CloseIndex("analysis-index");
Now, we can update the settings to add the analyzer
client.UpdateIndexSettings("analysis-index", i => i .IndexSettings(s => s .Analysis(a => a .CharFilters(cf => cf .Mapping("my_char_filter", m => m .Mappings("F# => FSharp") ) ) .TokenFilters(tf => tf .Synonym("my_synonym", sf => sf .Synonyms("superior, great") ) ) .Analyzers(an => an .Custom("my_analyzer", ca => ca .Tokenizer("standard") .CharFilters("my_char_filter") .Filters("lowercase", "stop", "my_synonym") ) ) ) ) );
And open the index again. Here, we also wait up to five seconds for the status of the index to become green
client.OpenIndex("analysis-index"); client.ClusterHealth(h => h .WaitForStatus(WaitForStatus.Green) .Index("analysis-index") .Timeout(TimeSpan.FromSeconds(5)) );
With the index open and ready, let’s test the analyzer
var analyzeResponse = client.Analyze(a => a .Index("analysis-index") .Analyzer("my_analyzer") .Text("F# is THE SUPERIOR language :)") );
Since we added the custom analyzer to the "analysis-index" index, we need to target this index to test it |
The output now looks like
{ "tokens": [ { "token": "fsharp", "start_offset": 0, "end_offset": 2, "type": "<ALPHANUM>", "position": 0 }, { "token": "superior", "start_offset": 10, "end_offset": 18, "type": "<ALPHANUM>", "position": 3 }, { "token": "great", "start_offset": 10, "end_offset": 18, "type": "SYNONYM", "position": 3 }, { "token": "language", "start_offset": 19, "end_offset": 27, "type": "<ALPHANUM>", "position": 4 } ] }
Exactly what we were after!
Testing an analyzer on a field
editIt’s also possible to test the analyzer for a given field type mapping. Given an index created with the following settings and mappings
client.CreateIndex("project-index", i => i .Settings(s => s .Analysis(a => a .CharFilters(cf => cf .Mapping("my_char_filter", m => m .Mappings("F# => FSharp") ) ) .TokenFilters(tf => tf .Synonym("my_synonym", sf => sf .Synonyms("superior, great") ) ) .Analyzers(an => an .Custom("my_analyzer", ca => ca .Tokenizer("standard") .CharFilters("my_char_filter") .Filters("lowercase", "stop", "my_synonym") ) ) ) ) .Mappings(m => m .Map<Project>(mm => mm .Properties(p => p .String(t => t .Name(n => n.Name) .Analyzer("my_analyzer") ) ) ) ) );
The analyzer on the name
field can be tested with
var analyzeResponse = client.Analyze(a => a .Index("project-index") .Field<Project>(f => f.Name) .Text("F# is THE SUPERIOR language :)") );
Advanced details with Explain
editIt’s possible to get more advanced details about analysis by setting Explain()
on
the request.
For this example, we’ll use Object Initializer syntax instead of the Fluent API; choose whichever one you’re most comfortable with!
var analyzeRequest = new AnalyzeRequest { Analyzer = "standard", Text = new [] { "F# is THE SUPERIOR language :)" }, Explain = true }; var analyzeResponse = client.Analyze(analyzeRequest);
We now get further details back in the response
{ "detail": { "custom_analyzer": false, "analyzer": { "name": "standard", "tokens": [ { "token": "f", "start_offset": 0, "end_offset": 1, "type": "<ALPHANUM>", "position": 0, "bytes": "[66]", "positionLength": 1 }, { "token": "is", "start_offset": 3, "end_offset": 5, "type": "<ALPHANUM>", "position": 1, "bytes": "[69 73]", "positionLength": 1 }, { "token": "the", "start_offset": 6, "end_offset": 9, "type": "<ALPHANUM>", "position": 2, "bytes": "[74 68 65]", "positionLength": 1 }, { "token": "superior", "start_offset": 10, "end_offset": 18, "type": "<ALPHANUM>", "position": 3, "bytes": "[73 75 70 65 72 69 6f 72]", "positionLength": 1 }, { "token": "language", "start_offset": 19, "end_offset": 27, "type": "<ALPHANUM>", "position": 4, "bytes": "[6c 61 6e 67 75 61 67 65]", "positionLength": 1 } ] } } }