Using the annotated-text field
The annotated-text field tokenizes text content in the same way as the more common text
field (see "limitations" below), but also injects any marked-up annotation tokens directly
into the search index:
PUT my_index
{
  "mappings": {
    "properties": {
      "my_field": {
        "type": "annotated_text"
      }
    }
  }
}
Such a mapping allows marked-up text, e.g. Wikipedia articles, to be indexed as both text
and structured tokens. The annotations use a markdown-like syntax: the annotated text
appears in square brackets, followed in parentheses by one or more URL-encoded values
separated by the & symbol.
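To make the syntax concrete, here is a minimal Python sketch of how such markup could be decoded into plain text plus annotation values. It is an illustration only, not the plugin's actual parser: the function name parse_annotations and the regular expression are assumptions made for this example.

```python
import re
from urllib.parse import unquote_plus

# Assumed pattern for [annotated text](value1&value2...); illustrative only.
ANNOTATION = re.compile(r"\[([^\]\[]*)\]\(([^\)\(]*)\)")

def parse_annotations(markup):
    """Return (plain_text, annotations).

    Each annotation is (start, end, values), where start/end are offsets
    into the plain text and values are the URL-decoded annotation values.
    """
    plain_parts = []
    annotations = []
    read_pos = 0   # current read position in the marked-up input
    out_len = 0    # length of the plain text emitted so far
    for m in ANNOTATION.finditer(markup):
        before = markup[read_pos:m.start()]
        plain_parts.append(before)
        out_len += len(before)
        text = m.group(1)
        # Values are separated by "&" and URL-encoded ("+" decodes to a space).
        values = [unquote_plus(v) for v in m.group(2).split("&") if v]
        annotations.append((out_len, out_len + len(text), values))
        plain_parts.append(text)
        out_len += len(text)
        read_pos = m.end()
    plain_parts.append(markup[read_pos:])
    return "".join(plain_parts), annotations
```

For example, parse_annotations("[Jeff Beck](Jeff+Beck&Guitarist) plays a strat") yields the plain text "Jeff Beck plays a strat" and one annotation covering offsets 0 to 9 with the values Jeff Beck and Guitarist.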
We can use the "_analyze" API to test how an example annotation would be stored as tokens in the search index:
GET my_index/_analyze
{
  "field": "my_field",
  "text": "Investors in [Apple](Apple+Inc.) rejoiced."
}
Response:
{
  "tokens": [
    { "token": "investors", "start_offset": 0, "end_offset": 9, "type": "<ALPHANUM>", "position": 0 },
    { "token": "in", "start_offset": 10, "end_offset": 12, "type": "<ALPHANUM>", "position": 1 },
    { "token": "Apple Inc.", "start_offset": 13, "end_offset": 18, "type": "annotation", "position": 2 },
    { "token": "apple", "start_offset": 13, "end_offset": 18, "type": "<ALPHANUM>", "position": 2 },
    { "token": "rejoiced", "start_offset": 19, "end_offset": 27, "type": "<ALPHANUM>", "position": 3 }
  ]
}
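Note the whole annotation token Apple Inc. is placed, unchanged, as a single token in the
token stream and at the same position (position 2) as the text token (apple) it annotates.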
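The layout of the response above can be imitated with a short Python sketch. It is a simplified model under stated assumptions, not the field's real analysis chain: words are found with a plain \w+ pattern and lowercased, and each annotation value is injected once, at the position of the first word it covers. The function name token_stream and both regular expressions are hypothetical.

```python
import re
from urllib.parse import unquote_plus

# Assumed patterns; the real analyzer is more sophisticated than this.
ANNOTATION = re.compile(r"\[([^\]\[]*)\]\(([^\)\(]*)\)")
WORD = re.compile(r"\w+")

def token_stream(markup):
    """Return (token, start_offset, end_offset, type, position) tuples."""
    # Strip the markup, remembering each annotation's span in the plain text.
    plain, spans, read_pos = "", [], 0
    for m in ANNOTATION.finditer(markup):
        plain += markup[read_pos:m.start()]
        start = len(plain)
        plain += m.group(1)
        values = [unquote_plus(v) for v in m.group(2).split("&") if v]
        spans.append((start, len(plain), values))
        read_pos = m.end()
    plain += markup[read_pos:]

    tokens = []
    emitted = set()  # indexes of annotations already injected
    for position, word in enumerate(WORD.finditer(plain)):
        # Inject annotation values, unchanged, at the first word they cover.
        for i, (start, end, values) in enumerate(spans):
            if i not in emitted and start <= word.start() < end:
                emitted.add(i)
                tokens.extend((v, start, end, "annotation", position) for v in values)
        # Plain text words are lowercased, as in the response above.
        tokens.append((word.group().lower(), word.start(), word.end(), "<ALPHANUM>", position))
    return tokens
```

Calling token_stream("Investors in [Apple](Apple+Inc.) rejoiced.") gives five tokens whose offsets and positions mirror the response shown above, with Apple Inc. and apple sharing position 2.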
We can now perform searches for annotations using regular term queries that don’t tokenize
the provided search values. Annotations are a more precise way of matching, as can be seen
in this example where a search for Beck will not match Jeff Beck:
# Example documents
PUT my_index/_doc/1
{
  "my_field": "[Beck](Beck) announced a new tour"
}

PUT my_index/_doc/2
{
  "my_field": "[Jeff Beck](Jeff+Beck&Guitarist) plays a strat"
}

# Example search
GET my_index/_search
{
  "query": {
    "term": {
      "my_field": "Beck"
    }
  }
}
As well as tokenising the plain text into single words, e.g. beck, here we inject the
single token value Beck at the same position as beck in the token stream.
Note annotations can inject multiple tokens at the same position. Here we inject both
the very specific value Jeff Beck and the broader term Guitarist.
A benefit of searching with these carefully defined annotation tokens is that a query for
Beck will not match document 2, which contains the tokens jeff, beck and Jeff Beck.
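The matching behaviour of the term query on the two example documents can be sketched in a few lines of Python. The token sets below are written by hand from the description of the example (lowercased words plus unchanged annotation values); they are an assumption for illustration, not output captured from a cluster, and term_query is a hypothetical helper that mimics exact, unanalyzed matching.

```python
# Hand-written token sets for the two example documents (assumed, not
# captured from Elasticsearch): lowercased words plus annotation values.
INDEXED_TOKENS = {
    1: {"Beck", "beck", "announced", "a", "new", "tour"},
    2: {"Jeff Beck", "Guitarist", "jeff", "beck", "plays", "a", "strat"},
}

def term_query(value):
    """Return ids of documents holding exactly this token (no analysis applied)."""
    return sorted(
        doc_id for doc_id, tokens in INDEXED_TOKENS.items() if value in tokens
    )
```

Under these assumptions, term_query("Beck") matches only document 1, whereas the lowercase text token beck appears in both documents, which is why the annotation gives the more precise match.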
Any use of = signs in annotation values, e.g. [Prince](person=Prince), will cause the
document to be rejected with a parse failure. In future we hope to have a use for the
equals sign, so we actively reject documents that contain it today.
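Because such documents are rejected at index time, it can be convenient to check text on the client before sending it. The following Python sketch shows one way to do that; the function name check_annotation_values and the regular expression are assumptions for this example, and the error raised here is a local ValueError rather than the parse failure Elasticsearch itself returns.

```python
import re

# Assumed pattern for [annotated text](value1&value2...); illustrative only.
ANNOTATION = re.compile(r"\[([^\]\[]*)\]\(([^\)\(]*)\)")

def check_annotation_values(markup):
    """Raise ValueError if any annotation value contains an '=' sign."""
    for m in ANNOTATION.finditer(markup):
        for value in m.group(2).split("&"):
            if "=" in value:
                raise ValueError(
                    "annotation value %r in %r contains '='" % (value, m.group(0))
                )
    # Return the input unchanged so the check can be used inline.
    return markup
```

Text such as "[Prince](person=Prince)" raises a ValueError, while "[Prince](Prince)" passes through unchanged.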