Data modelling tips

Use structured and unstructured fields

Annotations are normally a way of weaving structured information into unstructured text for higher-precision search.

Entity resolution is a form of document enrichment, undertaken by specialist software or by people, in which references to entities in a document are disambiguated by attaching a canonical ID. The ID is used to resolve any number of aliases, or to distinguish between people with the same name. The hyperlinks connecting Wikipedia’s articles are a good example of resolved entity IDs woven into text.

These IDs can be embedded as annotations in an annotated_text field, but it often makes sense to also include them in dedicated structured fields to support discovery via aggregations:

PUT my_index
{
  "mappings": {
    "properties": {
      "my_unstructured_text_field": {
        "type": "annotated_text"
      },
      "my_twitter_handles": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword"
          }
        }
      }
    }
  }
}

Applications would then typically provide content and discover it as follows:

# Example document
PUT my_index/_doc/1
{
  "my_unstructured_text_field": "[Shay](%40kimchy) created elasticsearch",
  "my_twitter_handles": ["@kimchy"]
}
}

GET my_index/_search
{
  "query": {
    "query_string": {
      "query": "elasticsearch OR logstash OR kibana",
      "default_field": "my_unstructured_text_field"
    }
  },
  "aggregations": {
    "top_people": {
      "significant_terms": {
        "field": "my_twitter_handles.keyword"
      }
    }
  }
}

Note that the my_twitter_handles field contains a list of the annotation values also used in the unstructured text. The annotated_text syntax requires annotation values to be URL-encoded, which is why @kimchy is written as %40kimchy inside the markup. By repeating the annotation values in a structured field, the application ensures that the tokens discovered in the structured field can be used for search and highlighting in the unstructured field.

In this example, the query searches for documents that talk about components of the Elastic Stack.

The aggregation uses the my_twitter_handles field to discover people who are significantly associated with the Elastic Stack.
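
Keeping the annotated text and the structured field in sync is easiest when both are generated from the same source. The following is a minimal Python sketch of that idea, not part of any Elasticsearch client: the helper names annotate and build_doc are hypothetical, and it assumes one annotation value per annotated span. It URL-encodes each value with the standard library and collects the raw values for the structured field.

```python
from urllib.parse import quote


def annotate(text, value):
    """Return [text](value) markup with the annotation value URL-encoded."""
    return f"[{text}]({quote(value, safe='')})"


def build_doc(segments):
    """Build a document body from plain strings and (text, value) tuples.

    Hypothetical helper: the field names match the example mapping above.
    """
    parts, values = [], []
    for segment in segments:
        if isinstance(segment, tuple):
            text, value = segment
            parts.append(annotate(text, value))
            if value not in values:  # keep each handle once, in first-seen order
                values.append(value)
        else:
            parts.append(segment)
    return {
        "my_unstructured_text_field": "".join(parts),
        "my_twitter_handles": values,
    }


doc = build_doc([("Shay", "@kimchy"), " created elasticsearch"])
print(doc["my_unstructured_text_field"])  # [Shay](%40kimchy) created elasticsearch
print(doc["my_twitter_handles"])          # ['@kimchy']
```

The resulting dictionary has the same shape as the example document and could be sent as the body of the PUT request shown earlier.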

Avoiding over-matching annotations

By design, the regular text tokens and the annotation tokens co-exist in the same indexed field, but in rare cases this can lead to some over-matching.

The value of an annotation often denotes a named entity (a person, place or company). The tokens for these named entities are inserted untokenized, and differ from typical text tokens because they are normally:

  • Mixed case, e.g. Madonna
  • Multiple words, e.g. Jeff Beck
  • Punctuated or containing numbers, e.g. Apple Inc. or @kimchy

This means that, for the most part, a search for a named entity in the annotated text field will not produce false positives. For example, when selecting Apple Inc. from an aggregation result, you can drill down to highlight its uses in the text without "over-matching" on text tokens such as the word apple in this context:

the apple was very juicy

However, a problem arises if your named entity happens to be a single, lower-case term, e.g. the company elastic. In this case, a search on the annotated text field for the token elastic may match a document such as this:

they fired an elastic band

To avoid such false matches, consider prefixing annotation values so that they cannot clash with text tokens, e.g.

[elastic](Company_elastic) released version 7.0 of the elastic stack today
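
The prefixing convention can be applied mechanically when documents are prepared. The sketch below is illustrative only: prefixed_annotation and toy_text_tokens are hypothetical helpers, and the toy tokenizer (lower-case, split on whitespace) is a crude stand-in for a real text analyzer, used here just to show why a prefixed value no longer collides with an ordinary word.

```python
from urllib.parse import quote


def prefixed_annotation(text, entity_type, name):
    """Return (markup, value), where value is '<entity_type>_<name>'."""
    value = f"{entity_type}_{name}"
    return f"[{text}]({quote(value, safe='')})", value


def toy_text_tokens(text):
    # Assumption: a simplified model of text analysis, not the real analyzer.
    return set(text.lower().split())


markup, value = prefixed_annotation("elastic", "Company", "elastic")
unrelated = toy_text_tokens("they fired an elastic band")

print(markup)                  # [elastic](Company_elastic)
print("elastic" in unrelated)  # True: the bare value would over-match
print(value in unrelated)      # False: the prefixed value does not
```

Any prefix scheme works as long as the resulting value cannot be produced by analyzing ordinary text; the mixed-case Company_ prefix used here follows the example above.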