Using the annotated-text field
The annotated-text field tokenizes text content as per the more common text field (see "limitations" below) but also injects any marked-up annotation tokens directly into the search index:
PUT my-index-000001
{
  "mappings": {
    "properties": {
      "my_field": {
        "type": "annotated_text"
      }
    }
  }
}
Such a mapping allows marked-up text, e.g. Wikipedia articles, to be indexed as both text and structured tokens. The annotations use a markdown-like syntax, [text](values), in which the values part holds one or more URL-encoded values separated by the & symbol.
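As a rough illustration of that syntax, the following Python sketch builds annotation markup on the client side before indexing. The helper name annotate is hypothetical and not part of any Elasticsearch client; it only assumes that values are URL-encoded and joined with &, as described above.

```python
from urllib.parse import quote_plus


def annotate(text, *values):
    """Return `[text](v1&v2...)` with every annotation value URL-encoded.

    Hypothetical helper for illustration only.
    """
    if not values:
        raise ValueError("at least one annotation value is required")
    # quote_plus turns spaces into '+' and escapes '&' and '=' inside a value.
    encoded = "&".join(quote_plus(value) for value in values)
    return f"[{text}]({encoded})"


print(annotate("Apple", "Apple Inc."))
print(annotate("Jeff Beck", "Jeff Beck", "Guitarist"))
```

Because each value is encoded before joining, an & that belongs to a value cannot be confused with the separator between values.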
We can use the _analyze API to test how an example annotation would be stored as tokens in the search index:
GET my-index-000001/_analyze
{
  "field": "my_field",
  "text": "Investors in [Apple](Apple+Inc.) rejoiced."
}
Response:
{
  "tokens": [
    { "token": "investors", "start_offset": 0, "end_offset": 9, "type": "<ALPHANUM>", "position": 0 },
    { "token": "in", "start_offset": 10, "end_offset": 12, "type": "<ALPHANUM>", "position": 1 },
    { "token": "Apple Inc.", "start_offset": 13, "end_offset": 18, "type": "annotation", "position": 2 },
    { "token": "apple", "start_offset": 13, "end_offset": 18, "type": "<ALPHANUM>", "position": 2 },
    { "token": "rejoiced", "start_offset": 19, "end_offset": 27, "type": "<ALPHANUM>", "position": 3 }
  ]
}
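Note that the whole annotation token Apple Inc. is placed, unchanged, as a single token in the token stream and at the same position (position 2) as the text token (apple) it annotates.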
We can now search for annotations using regular term queries, which don’t tokenize the provided search values. Annotations are a more precise way of matching, as can be seen in this example where a search for Beck will not match Jeff Beck:
# Example documents
PUT my-index-000001/_doc/1
{
  "my_field": "[Beck](Beck) announced a new tour"
}

PUT my-index-000001/_doc/2
{
  "my_field": "[Jeff Beck](Jeff+Beck&Guitarist) plays a strat"
}

# Example search
GET my-index-000001/_search
{
  "query": {
    "term": {
      "my_field": "Beck"
    }
  }
}
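As well as tokenising the plain text into single words, e.g. beck, here we inject the single token value Beck at the same position as beck in the token stream.
Note that annotations can inject multiple tokens at the same position: here we inject both the very specific value Jeff Beck and the broader term Guitarist.
A benefit of searching with these carefully defined annotation tokens is that a query for Beck will not match document 2, which contains the tokens jeff, beck and Jeff Beck but no Beck token.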
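To make the matching behaviour concrete, here is a toy Python model of the token stream. It is a deliberate simplification and not the plugin's actual analyzer: it assumes words are split on non-word characters and lowercased, and that each annotation value is injected unchanged at the position of the first word it annotates. The names toy_tokens and term_matches are hypothetical.

```python
import re
from urllib.parse import unquote_plus

# Simplified pattern for the markdown-like form [text](values).
ANNOTATION = re.compile(r"\[([^\]\[]*)\]\(([^\)\(]*)\)")
WORD = re.compile(r"\w+")


def toy_tokens(markup):
    """Return (position, token) pairs: lowercased words plus raw annotation values."""
    tokens, position, cursor = [], 0, 0

    def add_words(text):
        nonlocal position
        for word in WORD.findall(text):
            tokens.append((position, word.lower()))
            position += 1

    for match in ANNOTATION.finditer(markup):
        add_words(markup[cursor:match.start()])
        # Annotation values share the position of the first annotated word.
        for value in match.group(2).split("&"):
            tokens.append((position, unquote_plus(value)))
        add_words(match.group(1))
        cursor = match.end()
    add_words(markup[cursor:])
    return tokens


def term_matches(markup, term):
    """Exact, untokenized comparison, in the spirit of a term query."""
    return any(token == term for _, token in toy_tokens(markup))


doc1 = "[Beck](Beck) announced a new tour"
doc2 = "[Jeff Beck](Jeff+Beck&Guitarist) plays a strat"
print(term_matches(doc1, "Beck"))       # True: the annotation token Beck exists
print(term_matches(doc2, "Beck"))       # False: only jeff, beck, Jeff Beck, Guitarist
print(term_matches(doc2, "Guitarist"))  # True: the broader annotation matches
```

The case-sensitive comparison is what keeps the annotation Beck distinct from the lowercased text token beck.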
Any use of = signs in annotation values, e.g. [Prince](person=Prince), will cause the document to be rejected with a parse failure. We hope to have a use for the equals sign in the future, so documents that contain it are actively rejected today.
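Since such documents are rejected at index time, it can be convenient to catch them earlier. The sketch below is a hypothetical client-side check, not part of Elasticsearch; it assumes a simplified regular expression for the [text](values) form and only looks for a raw = sign in the values part.

```python
import re

# Simplified pattern for the markdown-like form [text](values).
ANNOTATION = re.compile(r"\[([^\]\[]*)\]\(([^\)\(]*)\)")


def find_equals_annotations(markup):
    """Return every annotation whose values part contains a raw '=' sign."""
    return [
        match.group(0)
        for match in ANNOTATION.finditer(markup)
        if "=" in match.group(2)
    ]


text = "[Prince](person=Prince) appeared alongside [Apple](Apple+Inc.)"
print(find_equals_annotations(text))
```

An empty result means no annotation in the text uses the reserved = sign; anything returned would need to be rewritten before indexing.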
Synthetic _source
Synthetic _source is Generally Available only for TSDB indices (indices that have index.mode set to time_series). For other indices, synthetic _source is in technical preview. Features in technical preview may be changed or removed in a future release. Elastic will work to fix any issues, but features in technical preview are not subject to the support SLA of official GA features.
annotated_text fields support synthetic _source if they have a keyword sub-field that supports synthetic _source, or if the annotated_text field sets store to true. Either way, the field may not have copy_to.
If a keyword sub-field is used, the values are sorted in the same way as a keyword field’s values are sorted. By default, that means sorted with duplicates removed. So:
PUT idx
{
  "mappings": {
    "_source": { "mode": "synthetic" },
    "properties": {
      "text": {
        "type": "annotated_text",
        "fields": {
          "raw": { "type": "keyword" }
        }
      }
    }
  }
}

PUT idx/_doc/1
{
  "text": [
    "the quick brown fox",
    "the quick brown fox",
    "jumped over the lazy dog"
  ]
}
Will become:
{ "text": [ "jumped over the lazy dog", "the quick brown fox" ] }
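The effect on the returned values can be mimicked in a few lines of Python. This is only an approximation of the default behaviour described above (sorted, duplicates removed), and the function name is hypothetical; it is not how Elasticsearch rebuilds _source internally.

```python
def keyword_style_source(values):
    """Approximate the default rebuilt order: deduplicate, then sort."""
    return sorted(set(values))


original = [
    "the quick brown fox",
    "the quick brown fox",
    "jumped over the lazy dog",
]
print(keyword_style_source(original))
```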
Reordering text fields can have an effect on phrase and span queries. See the discussion about position_increment_gap for more detail. You can avoid this by making sure the slop parameter on the phrase queries is lower than the position_increment_gap. This is the default.
If the annotated_text field sets store to true, then order and duplicates are preserved:
PUT idx
{
  "mappings": {
    "_source": { "mode": "synthetic" },
    "properties": {
      "text": {
        "type": "annotated_text",
        "store": true
      }
    }
  }
}

PUT idx/_doc/1
{
  "text": [
    "the quick brown fox",
    "the quick brown fox",
    "jumped over the lazy dog"
  ]
}
Will become:
{ "text": [ "the quick brown fox", "the quick brown fox", "jumped over the lazy dog" ] }