Multilingual search using language identification in Elasticsearch
We’re pleased to announce that along with the release of the machine learning inference ingest processor, we are releasing language identification in Elasticsearch 7.6. With this release, we wanted to take the opportunity to describe some use cases and strategies for searching in multilingual corpora, and how language identification plays a part. We’ve covered some of these topics in the past, and we’ll build on these in some of the examples that follow.
Motivation
In today’s highly interconnected world, we find that documents and other sources of information come in a variety of languages. This poses a problem for many search applications. We need to understand the language of these documents as best we can to analyze them properly and provide the best search experience possible. Enter language identification.
Language identification is used to improve the overall search relevance for these multilingual corpora. Given a set of documents where we do not yet know the language(s) they contain, we want to efficiently search over them. The documents may contain a single language or multiple. The former is common in domains such as computer science where English is the predominant language of communication, while the latter is commonly found in biological and medical text where Latin terminology is frequently interspersed with English.
By applying language-specific analysis, we can improve relevance (both precision and recall) by ensuring that document terms are understood, indexed and searched over appropriately. Using a suite of language-specific analyzers in Elasticsearch (both built-in and through additional plugins), we can provide improved tokenization, token filtering and term filtering:
- Stop word and synonym lists
- Word form normalization: stemming and lemmatization
- Decompounding (e.g. German, Dutch, Korean)
For similar reasons, we find language identification in more general natural language processing (NLP) pipelines as one of the first processing steps to make use of highly precise, language-specific algorithms and models. For example, pre-trained NLP models such as Google’s BERT and ALBERT or OpenAI’s GPT-2 are commonly trained on per-language corpora or corpora with a predominant language, and fine tuned for tasks such as document classification, sentiment analysis, named entity recognition (NER), etc.
For the following examples and strategies, unless otherwise specified, we will assume that documents contain either a single or a predominant language.
Benefits of language-specific analysis
To help motivate this further, let’s have a quick look at a few benefits of language-specific analyzers.
Decompounding: In German, nouns are often built by compounding other nouns together to create beautifully long and hard-to-read compound words. A simple example is combining “Jahr” (“year”) with other words to form “Jahrhunderts” (“century”), “Jahreskalender” (“annual calendar”) or “Schuljahr” (“school year”). Without a custom analyzer that can decompound these words, we wouldn’t be able to search for “jahr” and get back documents about school years, “Schuljahr”. Furthermore, German has its own rules for plural and dative forms, meaning that searching for “jahr” should also match “Jahre” (plural) and “Jahren” (plural dative).
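As a minimal sketch of what such an analyzer can look like, the request below defines a German analyzer with a dictionary-based decompounding filter. The index name, analyzer name and tiny word list here are illustrative only; the german_custom analyzer used later in this post is the more complete version found in the demo project (config/mappings/de_analyzer.json).

# sketch of a decompounding German analyzer; the word list is illustrative only
PUT decompound-example
{
  "settings": {
    "analysis": {
      "filter": {
        "german_decompounder": {
          "type": "dictionary_decompounder",
          "word_list": ["jahr", "schul", "kalender", "hundert"]
        },
        "german_stemmer": {
          "type": "stemmer",
          "language": "light_german"
        }
      },
      "analyzer": {
        "german_decompound": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "german_decompounder", "german_stemmer"]
        }
      }
    }
  }
}

# "Schuljahr" now also produces the subword tokens "schul" and "jahr"
GET decompound-example/_analyze
{
  "analyzer": "german_decompound",
  "text": "Schuljahr"
}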
Common term: Some languages also make use of common or domain-specific terminology. For example “computer” is a word that is frequently used in other languages as-is. If we want to search for “computer”, we might also be interested in non-English documents. Being able to search across a known set of languages and still match common terms can be an interesting use-case. Again using German as an example, we might have documents about computer security in multiple languages. In German, that’s “Computersicherheit” (“sicherheit” meaning “security” or “safety”) and only with a German analyzer will searches for “computer” match across English and German.
Non-Latin scripts: The standard analyzer works quite well for most Latin-script (western European) languages. However, it starts to break down rapidly with non-Latin scripts such as Cyrillic or CJK (Chinese/Japanese/Korean). In a previous blog series, we saw how the CJK languages are formed and why they need language-specific analyzers. For example, Korean has postpositions, suffixes added to nouns and pronouns which alter their meaning. Sometimes the standard analyzer matches search terms but does a poor job of scoring the matches, meaning you might have good recall on documents while your precision suffers. In other cases, the standard analyzer won’t match any terms at all, and both your precision and recall suffer.
Let’s look at a working example for “Winter Olympics”. In Korean, that’s “동계올림픽대회는”, which is composed of “동계” meaning “winter season”, “올림픽대회” meaning “Olympics” or “Olympic competition”, and finally “는”, the topic postposition, a suffix that marks the word as the topic of the sentence. Searching for that exact string with the standard analyzer yields a perfect match, but searching for “올림픽대회”, meaning just “Olympics”, returns no results. However, by using the nori Korean analyzer, we get a match because “동계올림픽대회는” / “Winter Olympics” has been tokenized properly at index time.
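If you have the analysis-nori plugin installed, a quick way to see this difference for yourself is to compare how the two analyzers tokenize the compound. These _analyze requests are a hypothetical illustration and not part of the demo project.

# the standard analyzer keeps the whole compound "동계올림픽대회는" as a single token
GET _analyze
{
  "analyzer": "standard",
  "text": "동계올림픽대회는"
}

# the nori analyzer (from the analysis-nori plugin) splits it into component words,
# so a search for "올림픽대회" can match
GET _analyze
{
  "analyzer": "nori",
  "text": "동계올림픽대회는"
}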
Getting started with language identification
Demo project
In order to help illustrate use cases and strategies for language identification in search, we’ve set up a small demo project. It contains all of the examples in this blog post, as well as some tooling to index and search over WiLI-2018, a multilingual corpus, which you can use as a reference and working example for experimenting with multilingual search. It’s useful (but not strictly necessary) to have the demo project up and running, with documents indexed, if you want to follow along with the examples.
For these experiments, you can install Elasticsearch 7.6 locally, or spin up a free trial of Elasticsearch Service.
First experiments
Language identification is a pre-trained model that ships in the default distribution of Elasticsearch. It’s used in conjunction with the inference ingest processor by specifying lang_ident_model_1 as the model_id when setting up your inference processor in an ingest pipeline.
{ "inference": { "model_id": "lang_ident_model_1", "inference_config": {}, "field_mappings": {} } }
The rest of the configuration is the same as with other models, allowing you to specify settings like the number of top classes to output, the output field that will contain the prediction and, most importantly for our use cases, the input field to use. By default, the model expects a field called text to contain the input. In the following example, we use the pipeline _simulate API with some single-field documents. It maps the input contents field to the text field for inference; this mapping does not affect other processors in the pipeline. It then outputs the top three classes for inspection.
# simulate a basic inference setup
POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
      {
        "inference": {
          "model_id": "lang_ident_model_1",
          "inference_config": {
            "classification": {
              "num_top_classes": 3
            }
          },
          "field_mappings": {
            "contents": "text"
          },
          "target_field": "_ml.lang_ident"
        }
      }
    ]
  },
  "docs": [
    { "_source": { "contents": "Das Leben ist kein Ponyhof" } },
    { "_source": { "contents": "The rain in Spain stays mainly in the plains" } },
    { "_source": { "contents": "This is mostly English but has a touch of Latin since we often just say, Carpe diem" } }
  ]
}
The output shows us each document, plus some extra information in the _ml.lang_ident field. This includes the probability of each of the top three languages and the top language, which is stored in _ml.lang_ident.predicted_value.
{ "docs" : [ { "doc" : { "_index" : "_index", "_type" : "_doc", "_id" : "_id", "_source" : { "contents" : "Das leben ist kein Ponyhof", "_ml" : { "lang_ident" : { "top_classes" : [ { "class_name" : "de", "class_probability" : 0.9996006023972855 }, { "class_name" : "el-Latn", "class_probability" : 2.625873919853074E-4 }, { "class_name" : "ru-Latn", "class_probability" : 1.130237050226503E-4 } ], "predicted_value" : "de", "model_id" : "lang_ident_model_1" } } }, "_ingest" : { "timestamp" : "2020-01-21T14:38:13.810179Z" } } }, { "doc" : { "_index" : "_index", "_type" : "_doc", "_id" : "_id", "_source" : { "contents" : "The rain in Spain stays mainly in the plains", "_ml" : { "lang_ident" : { "top_classes" : [ { "class_name" : "en", "class_probability" : 0.9988809847231199 }, { "class_name" : "ga", "class_probability" : 7.764148026288316E-4 }, { "class_name" : "gd", "class_probability" : 7.968926766495827E-5 } ], "predicted_value" : "en", "model_id" : "lang_ident_model_1" } } }, "_ingest" : { "timestamp" : "2020-01-21T14:38:13.810185Z" } } }, { "doc" : { "_index" : "_index", "_type" : "_doc", "_id" : "_id", "_source" : { "contents" : "This is mostly English but has a touch of Latin since we often just say, Carpe diem", "_ml" : { "lang_ident" : { "top_classes" : [ { "class_name" : "en", "class_probability" : 0.9997901768317939 }, { "class_name" : "ja", "class_probability" : 8.756250766054857E-5 }, { "class_name" : "fil", "class_probability" : 1.6980752372837307E-5 } ], "predicted_value" : "en", "model_id" : "lang_ident_model_1" } } }, "_ingest" : { "timestamp" : "2020-01-21T14:38:13.810189Z" } } } ] }
Looking good! We have German identified for the first document and English for the second and third documents, even with the touch of Latin in the third document.
Strategies for language identification in search
Now that we’ve seen a basic example of language identification, it’s time to start putting that into a strategy for indexing and searching.
There are two basic indexing strategies that we’ll be using: language per-field and language per-index. In the language per-field strategy, we’ll create a single index with a set of language-specific fields and use an analyzer tailored to each language. At search time, we can choose to either search over a known language field, or we can search over all language fields and choose the best matching field. In the language per-index strategy, we’ll create a set of language-specific indices with different mappings, where the field indexed has an analyzer for that language. At search time, we can take a similar approach to language per-field and choose to either search over a single language index, or across multiple indices with an index pattern in the search request.
Contrast these two strategies with what you’d otherwise have to do: index the same string multiple times, each into a field or index with a language-specific analyzer. While this approach can work, it causes an awful lot of duplication, leading to slower queries and significantly more storage use than necessary.
Indexing
Let’s break this down and have a look at each of the two indexing strategies, since these dictate the search strategies we can use.
Per-Field
In the language per-field strategy, we’ll use the output of language identification and a series of processors in an ingest pipeline to store the input field in a language-specific field. We’ll support only a finite set of languages (German, English, Korean, Japanese and Chinese) since we need to set up a specific analyzer for each language. Any documents that aren’t in one of our supported languages will get indexed in a default field with the standard analyzer.
A full pipeline definition can be found in the demo project: config/pipelines/lang-per-field.json
A mapping to support this indexing strategy would then look like:
{ "settings": { "index": { "number_of_shards": 1, "number_of_replicas": 0 } }, "mappings": { "dynamic": "strict", "properties": { "contents": { "properties": { "language": { "type": "keyword" }, "supported": { "type": "boolean" }, "default": { "type": "text", "analyzer": "default", "fields": { "icu": { "type": "text", "analyzer": "icu_analyzer" } } }, "en": { "type": "text", "analyzer": "english" }, "de": { "type": "text", "analyzer": "german_custom" }, "ja": { "type": "text", "analyzer": "kuromoji" }, "ko": { "type": "text", "analyzer": "nori" }, "zh": { "type": "text", "analyzer": "smartcn" } } } } } }
(Note that the German analyzer configuration has been elided from the example above for brevity and can be found in: config/mappings/de_analyzer.json)
As with the previous example, we’ll use the pipeline _simulate API to explore:
# simulate a language per-field and output top 3 language classes for inspection
POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
      {
        "inference": {
          "model_id": "lang_ident_model_1",
          "inference_config": {
            "classification": {
              "num_top_classes": 3
            }
          },
          "field_mappings": {
            "contents": "text"
          },
          "target_field": "_ml.lang_ident"
        }
      },
      {
        "rename": {
          "field": "contents",
          "target_field": "contents.default"
        }
      },
      {
        "rename": {
          "field": "_ml.lang_ident.predicted_value",
          "target_field": "contents.language"
        }
      },
      {
        "script": {
          "lang": "painless",
          "source": "ctx.contents.supported = (['de', 'en', 'ja', 'ko', 'zh'].contains(ctx.contents.language))"
        }
      },
      {
        "set": {
          "if": "ctx.contents.supported",
          "field": "contents.{{contents.language}}",
          "value": "{{contents.default}}",
          "override": false
        }
      }
    ]
  },
  "docs": [
    { "_source": { "contents": "Das leben ist kein Ponyhof" } },
    { "_source": { "contents": "The rain in Spain stays mainly in the plains" } },
    { "_source": { "contents": "オリンピック大会" } },
    { "_source": { "contents": "로마는 하루아침에 이루어진 것이 아니다" } },
    { "_source": { "contents": "授人以鱼不如授人以渔" } },
    { "_source": { "contents": "Qui court deux lievres a la fois, n’en prend aucun" } },
    { "_source": { "contents": "Lupus non timet canem latrantem" } },
    { "_source": { "contents": "This is mostly English but has a touch of Latin since we often just say, Carpe diem" } }
  ]
}
And here’s the output with a language per-field:
{ "docs" : [ { "doc" : { "_index" : "_index", "_type" : "_doc", "_id" : "_id", "_source" : { "contents" : { "de" : "Das leben ist kein Ponyhof", "default" : "Das leben ist kein Ponyhof", "language" : "de", "supported" : true }, "_ml" : { "lang_ident" : { "top_classes" : [ { "class_name" : "de", "class_probability" : 0.9996006023972855 }, { "class_name" : "el-Latn", "class_probability" : 2.625873919853074E-4 }, { "class_name" : "ru-Latn", "class_probability" : 1.130237050226503E-4 } ], "model_id" : "lang_ident_model_1" } } }, "_ingest" : { "timestamp" : "2020-01-22T12:40:03.218641Z" } } }, { "doc" : { "_index" : "_index", "_type" : "_doc", "_id" : "_id", "_source" : { "contents" : { "en" : "The rain in Spain stays mainly in the plains", "default" : "The rain in Spain stays mainly in the plains", "language" : "en", "supported" : true }, "_ml" : { "lang_ident" : { "top_classes" : [ { "class_name" : "en", "class_probability" : 0.9988809847231199 }, { "class_name" : "ga", "class_probability" : 7.764148026288316E-4 }, { "class_name" : "gd", "class_probability" : 7.968926766495827E-5 } ], "model_id" : "lang_ident_model_1" } } }, "_ingest" : { "timestamp" : "2020-01-22T12:40:03.218646Z" } } }, { "doc" : { "_index" : "_index", "_type" : "_doc", "_id" : "_id", "_source" : { "contents" : { "default" : "オリンピック大会", "language" : "ja", "ja" : "オリンピック大会", "supported" : true }, "_ml" : { "lang_ident" : { "top_classes" : [ { "class_name" : "ja", "class_probability" : 0.9993823252841599 }, { "class_name" : "el", "class_probability" : 2.6448654791599055E-4 }, { "class_name" : "sd", "class_probability" : 1.4846805271384584E-4 } ], "model_id" : "lang_ident_model_1" } } }, "_ingest" : { "timestamp" : "2020-01-22T12:40:03.218648Z" } } }, { "doc" : { "_index" : "_index", "_type" : "_doc", "_id" : "_id", "_source" : { "contents" : { "default" : "로마는 하루아침에 이루어진 것이 아니다", "language" : "ko", "ko" : "로마는 하루아침에 이루어진 것이 아니다", "supported" : true }, "_ml" : { "lang_ident" : { "top_classes" : [ { "class_name" : "ko", "class_probability" : 0.9999939196272863 }, { "class_name" : "ka", "class_probability" : 3.0431805047662344E-6 }, { "class_name" : "am", "class_probability" : 1.710514725818281E-6 } ], "model_id" : "lang_ident_model_1" } } }, "_ingest" : { "timestamp" : "2020-01-22T12:40:03.218649Z" } } }, { "doc" : { "_index" : "_index", "_type" : "_doc", "_id" : "_id", "_source" : { "contents" : { "default" : "授人以鱼不如授人以渔", "language" : "zh", "zh" : "授人以鱼不如授人以渔", "supported" : true }, "_ml" : { "lang_ident" : { "top_classes" : [ { "class_name" : "zh", "class_probability" : 0.9999810103320087 }, { "class_name" : "ja", "class_probability" : 1.0390454083183788E-5 }, { "class_name" : "ka", "class_probability" : 2.6302271562335787E-6 } ], "model_id" : "lang_ident_model_1" } } }, "_ingest" : { "timestamp" : "2020-01-22T12:40:03.21865Z" } } }, { "doc" : { "_index" : "_index", "_type" : "_doc", "_id" : "_id", "_source" : { "contents" : { "default" : "Qui court deux lievres a la fois, n’en prend aucun", "language" : "fr", "supported" : false }, "_ml" : { "lang_ident" : { "top_classes" : [ { "class_name" : "fr", "class_probability" : 0.9999669852240882 }, { "class_name" : "gd", "class_probability" : 2.3485226102079597E-5 }, { "class_name" : "ht", "class_probability" : 3.536708810360631E-6 } ], "model_id" : "lang_ident_model_1" } } }, "_ingest" : { "timestamp" : "2020-01-22T12:40:03.218652Z" } } }, { "doc" : { "_index" : "_index", "_type" : "_doc", "_id" : "_id", "_source" : { "contents" : { "default" : "Lupus non timet canem latrantem", 
"language" : "la", "supported" : false }, "_ml" : { "lang_ident" : { "top_classes" : [ { "class_name" : "la", "class_probability" : 0.614050940088811 }, { "class_name" : "fr", "class_probability" : 0.32530021315840363 }, { "class_name" : "sq", "class_probability" : 0.03353817054854559 } ], "model_id" : "lang_ident_model_1" } } }, "_ingest" : { "timestamp" : "2020-01-22T12:40:03.218653Z" } } }, { "doc" : { "_index" : "_index", "_type" : "_doc", "_id" : "_id", "_source" : { "contents" : { "en" : "This is mostly English but has a touch of Latin since we often just say, Carpe diem", "default" : "This is mostly English but has a touch of Latin since we often just say, Carpe diem", "language" : "en", "supported" : true }, "_ml" : { "lang_ident" : { "top_classes" : [ { "class_name" : "en", "class_probability" : 0.9997901768317939 }, { "class_name" : "ja", "class_probability" : 8.756250766054857E-5 }, { "class_name" : "fil", "class_probability" : 1.6980752372837307E-5 } ], "model_id" : "lang_ident_model_1" } } }, "_ingest" : { "timestamp" : "2020-01-22T12:40:03.218654Z" } } } ] }
As expected, we get German stored in contents.de, English in contents.en, Korean in contents.ko and so on. Notice that we mixed in a couple of examples of unsupported languages too, French and Latin. They are flagged with supported: false and are available to search on in the default field only. Check out the top predicted classes for the Latin example as well: the model correctly thinks it’s Latin, but it’s uncertain and predicts a strong second place for French.
This is just a basic example of an ingest pipeline with language identification but hopefully it gives you an idea of what is possible. With the flexibility of ingest pipelines, we can accomplish many different scenarios. We’ll dig into a few alternatives at the end of the post. Some of the steps in this example could be combined or omitted in a production pipeline, but remember that a good data processing pipeline is one that can be easily read and understood, not the one with the fewest lines possible.
Per-Index
Our language per-index strategy uses the same basic building blocks as the pipeline for language per-field. The big difference is that instead of storing to a language-specific field, we use a different index. This is possible because at ingest time we can set the _index field of a document, which enables us to override the default value and set it to a language-specific index name. If we don’t support the language, we skip that step and the document will be indexed in the default index. Simple!
A full pipeline definition can be found in the demo project: config/pipelines/lang-per-index.json
A mapping to support this indexing strategy would then look like the following.
{ "settings": { "index": { "number_of_shards": 1, "number_of_replicas": 0 } }, "mappings": { "dynamic": "strict", "properties": { "contents": { "properties": { "language": { "type": "keyword" }, "text": { "type": "text", "analyzer": "default" } } } } } }
Note that in that mapping we have not specified a custom analyzer, and instead use this file as a template. When we create each language-specific index, we set the analyzer for that language.
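For example, a hypothetical request to create the German index from this template could set the built-in german analyzer as the index default. This is only a sketch; the demo project uses its own german_custom analyzer here instead of the built-in one.

# sketch: create the German index from the template mapping and make a German
# analyzer the index default (the demo project uses german_custom instead)
PUT lang-per-index_de
{
  "settings": {
    "index": { "number_of_shards": 1, "number_of_replicas": 0 },
    "analysis": {
      "analyzer": {
        "default": { "type": "german" }
      }
    }
  },
  "mappings": {
    "dynamic": "strict",
    "properties": {
      "contents": {
        "properties": {
          "language": { "type": "keyword" },
          "text": { "type": "text", "analyzer": "default" }
        }
      }
    }
  }
}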
Simulating this pipeline:
# simulate a language per-index and output top 3 language classes for inspection
POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
      {
        "inference": {
          "model_id": "lang_ident_model_1",
          "inference_config": {
            "classification": {
              "num_top_classes": 3
            }
          },
          "field_mappings": {
            "contents": "text"
          },
          "target_field": "_ml.lang_ident"
        }
      },
      {
        "rename": {
          "field": "contents",
          "target_field": "contents.text"
        }
      },
      {
        "rename": {
          "field": "_ml.lang_ident.predicted_value",
          "target_field": "contents.language"
        }
      },
      {
        "set": {
          "if": "['de', 'en', 'ja', 'ko', 'zh'].contains(ctx.contents.language)",
          "field": "_index",
          "value": "{{_index}}_{{contents.language}}",
          "override": true
        }
      }
    ]
  },
  "docs": [
    { "_source": { "contents": "Das leben ist kein Ponyhof" } },
    { "_source": { "contents": "The rain in Spain stays mainly in the plains" } },
    { "_source": { "contents": "オリンピック大会" } },
    { "_source": { "contents": "로마는 하루아침에 이루어진 것이 아니다" } },
    { "_source": { "contents": "授人以鱼不如授人以渔" } },
    { "_source": { "contents": "Qui court deux lievres a la fois, n’en prend aucun" } },
    { "_source": { "contents": "Lupus non timet canem latrantem" } },
    { "_source": { "contents": "This is mostly English but has a touch of Latin since we often just say, Carpe diem" } }
  ]
}
And here’s the output with a language per-index:
{ "docs" : [ { "doc" : { "_index" : "_index_de", "_type" : "_doc", "_id" : "_id", "_source" : { "contents" : { "language" : "de", "text" : "Das leben ist kein Ponyhof" }, "_ml" : { "lang_ident" : { "top_classes" : [ { "class_name" : "de", "class_probability" : 0.9996006023972855 }, { "class_name" : "el-Latn", "class_probability" : 2.625873919853074E-4 }, { "class_name" : "ru-Latn", "class_probability" : 1.130237050226503E-4 } ], "model_id" : "lang_ident_model_1" } } }, "_ingest" : { "timestamp" : "2020-01-21T14:41:48.486009Z" } } }, { "doc" : { "_index" : "_index_en", "_type" : "_doc", "_id" : "_id", "_source" : { "contents" : { "language" : "en", "text" : "The rain in Spain stays mainly in the plains" }, "_ml" : { "lang_ident" : { "top_classes" : [ { "class_name" : "en", "class_probability" : 0.9988809847231199 }, { "class_name" : "ga", "class_probability" : 7.764148026288316E-4 }, { "class_name" : "gd", "class_probability" : 7.968926766495827E-5 } ], "model_id" : "lang_ident_model_1" } } }, "_ingest" : { "timestamp" : "2020-01-21T14:41:48.486037Z" } } }, { "doc" : { "_index" : "_index_ja", "_type" : "_doc", "_id" : "_id", "_source" : { "contents" : { "language" : "ja", "text" : "オリンピック大会" }, "_ml" : { "lang_ident" : { "top_classes" : [ { "class_name" : "ja", "class_probability" : 0.9993823252841599 }, { "class_name" : "el", "class_probability" : 2.6448654791599055E-4 }, { "class_name" : "sd", "class_probability" : 1.4846805271384584E-4 } ], "model_id" : "lang_ident_model_1" } } }, "_ingest" : { "timestamp" : "2020-01-21T14:41:48.486039Z" } } }, { "doc" : { "_index" : "_index_ko", "_type" : "_doc", "_id" : "_id", "_source" : { "contents" : { "language" : "ko", "text" : "로마는 하루아침에 이루어진 것이 아니다" }, "_ml" : { "lang_ident" : { "top_classes" : [ { "class_name" : "ko", "class_probability" : 0.9999939196272863 }, { "class_name" : "ka", "class_probability" : 3.0431805047662344E-6 }, { "class_name" : "am", "class_probability" : 1.710514725818281E-6 } ], "model_id" : "lang_ident_model_1" } } }, "_ingest" : { "timestamp" : "2020-01-21T14:41:48.486041Z" } } }, { "doc" : { "_index" : "_index_zh", "_type" : "_doc", "_id" : "_id", "_source" : { "contents" : { "language" : "zh", "text" : "授人以鱼不如授人以渔" }, "_ml" : { "lang_ident" : { "top_classes" : [ { "class_name" : "zh", "class_probability" : 0.9999810103320087 }, { "class_name" : "ja", "class_probability" : 1.0390454083183788E-5 }, { "class_name" : "ka", "class_probability" : 2.6302271562335787E-6 } ], "model_id" : "lang_ident_model_1" } } }, "_ingest" : { "timestamp" : "2020-01-21T14:41:48.486043Z" } } }, { "doc" : { "_index" : "_index", "_type" : "_doc", "_id" : "_id", "_source" : { "contents" : { "language" : "fr", "text" : "Qui court deux lievres a la fois, n’en prend aucun" }, "_ml" : { "lang_ident" : { "top_classes" : [ { "class_name" : "fr", "class_probability" : 0.9999669852240882 }, { "class_name" : "gd", "class_probability" : 2.3485226102079597E-5 }, { "class_name" : "ht", "class_probability" : 3.536708810360631E-6 } ], "model_id" : "lang_ident_model_1" } } }, "_ingest" : { "timestamp" : "2020-01-21T14:41:48.486044Z" } } }, { "doc" : { "_index" : "_index", "_type" : "_doc", "_id" : "_id", "_source" : { "contents" : { "language" : "la", "text" : "Lupus non timet canem latrantem" }, "_ml" : { "lang_ident" : { "top_classes" : [ { "class_name" : "la", "class_probability" : 0.614050940088811 }, { "class_name" : "fr", "class_probability" : 0.32530021315840363 }, { "class_name" : "sq", "class_probability" : 0.03353817054854559 } ], "model_id" : 
"lang_ident_model_1" } } }, "_ingest" : { "timestamp" : "2020-01-21T14:41:48.486046Z" } } }, { "doc" : { "_index" : "_index_en", "_type" : "_doc", "_id" : "_id", "_source" : { "contents" : { "language" : "en", "text" : "This is mostly English but has a touch of Latin since we often just say, Carpe diem" }, "_ml" : { "lang_ident" : { "top_classes" : [ { "class_name" : "en", "class_probability" : 0.9997901768317939 }, { "class_name" : "ja", "class_probability" : 8.756250766054857E-5 }, { "class_name" : "fil", "class_probability" : 1.6980752372837307E-5 } ], "model_id" : "lang_ident_model_1" } } }, "_ingest" : { "timestamp" : "2020-01-21T14:41:48.48605Z" } } } ] }
As you would expect, the language identification results are the same as with the per-field strategy, the only difference being how we use that information in the pipeline to route a document to the correct index.
Search
Given the two indexing strategies, what’s the best way to search? As mentioned above, we have a couple of options for each of the indexing strategies. One common question is: how do we specify a language-specific analyzer for the query string, so that it matches the analyzer of the indexed field? Don’t worry, you don’t need to specify a special analyzer at search time. Unless you override the analyzer in the query itself (or have set a search_analyzer in the mapping), the query string will be analyzed by the same analyzer as the field being matched on. As in the language per-field examples, if you have fields en and de, the query string will be analyzed with the english analyzer when matching on the en field, and with the german_custom analyzer when matching on the de field.
Query language
Before we dig into search strategies, it’s important to first set some context about language identification on the user’s query string itself. You might be thinking, “ok, now that we know the (predominant) language of the documents indexed, why not just do language identification on the query string and perform a normal search on the corresponding field or index?”. Unfortunately, search queries tend to be short. Like, really short! Way back in 2001, a study [1] of the good old Excite web search engine showed that the average user query contained only 2.4 terms! That was a while ago and although things have changed a lot with conversational search and natural language querying (e.g. “how do I use Elasticsearch to search in multilingual corpora”), search queries tend to still be too short to use in identifying the language. Many language identification algorithms work best with more than 50 characters [2]. To add to this problem, we often have search queries that are proper nouns, entity names or scientific names such as “Justin Trudeau”, “Foo Fighters”, or “plantar fasciitis” respectively. The user might want documents from an arbitrary language but it’s not possible to know that just by analyzing these kinds of query strings.
As such, we don’t recommend using language identification (of any kind) on query strings alone. If you do want to use the user’s query language to select the search field or index, it’s best to consider other approaches that make use of implicit or explicit information about the user. For example, implicit context might be the website domain (e.g. .com or .de) or the app store locale your app was downloaded from (e.g. US store or German store). In most cases however, the best thing to do is to just ask your user! Many sites have a locale selection when a new user first visits. You can also consider using faceting (with a terms aggregation) over the document languages to help the user guide you to the languages they are interested in.
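As a sketch of that last idea, using the language per-field index from the demo project, such a facet is just a terms aggregation on the contents.language keyword field:

# facet on the identified document language
GET lang-per-field/_search
{
  "size": 0,
  "aggs": {
    "languages": {
      "terms": { "field": "contents.language" }
    }
  }
}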
Per-Field
With the per-field strategy, we have multiple language sub-fields, so we need to search over all of them at the same time and pick the top-scoring field. This is relatively straightforward since the indexing pipeline sets only a single language field, so while we are searching over multiple fields, only one of them is actually populated. To do this, we’ll use a multi_match query with type best_fields (the default). This combination is executed as a dis_max query, and we use it since we are interested in all terms matching in a single field, not across fields.
GET lang-per-field/_search
{
  "query": {
    "multi_match": {
      "query": "jahr",
      "type": "best_fields",
      "fields": [
        "contents.de",
        "contents.en",
        "contents.ja",
        "contents.ko",
        "contents.zh"
      ]
    }
  }
}
If we want to search over all languages, we can also add the contents.default field into the multi_match query. One advantage of the per-field strategy is being able to use the identified language to boost documents, such as those that match the user’s language or locale as discussed above. This can improve both precision and recall since it directly influences relevance. Similarly, if we want to search over a single language, such as when we know the user’s query language, we can simply use a match query on the field for that language, e.g. contents.de.
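As a sketch of the boosting idea, assuming the user’s locale tells us they prefer German, we can keep the multi_match for matching and add a should clause that boosts documents identified as German. The boost value of 2.0 is an arbitrary choice for illustration.

# search all language fields, but boost documents whose identified language is German
GET lang-per-field/_search
{
  "query": {
    "bool": {
      "must": {
        "multi_match": {
          "query": "jahr",
          "type": "best_fields",
          "fields": [
            "contents.de",
            "contents.en",
            "contents.ja",
            "contents.ko",
            "contents.zh",
            "contents.default"
          ]
        }
      },
      "should": {
        "term": {
          "contents.language": {
            "value": "de",
            "boost": 2.0
          }
        }
      }
    }
  }
}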
Per-Index
With the per-index strategy, we have multiple language indices, but each index has the same field names. That means we can use a single, simple query, and just specify an index pattern when making the search request:
GET lang-per-index_*/_search
{
  "query": {
    "match": {
      "contents.text": "jahr"
    }
  }
}
If we want to search over all languages, we use an index pattern that also matches the default index: lang-per-index* (note the absence of the underscore). If we want to search over a single language, we simply use the index for that language, e.g. lang-per-index_de.
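For example, using the index names from the demo project:

# all languages, including unsupported ones in the default index (no underscore)
GET lang-per-index*/_search
{
  "query": {
    "match": { "contents.text": "jahr" }
  }
}

# German only
GET lang-per-index_de/_search
{
  "query": {
    "match": { "contents.text": "jahr" }
  }
}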
Examples
Using the same examples we described in the “Motivation” section, we can try searching in our WiLI-2018 corpus. Try these commands out with the demo project and see what happens.
Decompounding:
# only matching exactly on the term "jahr" bin/search --strategy default jahr
# matches: "jahr", "jahre", "jahren", "jahrhunderts", etc. bin/search --strategy per-field jahr
Common term:
# only matching exactly on the term "computer", multiple languages are in the results bin/search --strategy default computer
# matches compound German words as well: "Computersicherheit" (computer security) bin/search --strategy per-field computer
Non-Latin scripts:
# standard analyzer gets poor precision and returns irrelevant/non-matching results with "network"/"internet": "网络"
bin/search --strategy default 网络

# ICU and language-specific analysis gets things right, but note the different scores
bin/search --strategy icu 网络
bin/search --strategy per-field 网络
Comparison
Based on the two strategies, which one should you actually use? Well, it depends. Here are some pros and cons of each approach to help you decide.
| | Pros | Cons |
| --- | --- | --- |
| Per-field | A single index to manage; the identified language is stored with each document and can be used for filtering and boosting; all languages can be searched together with one multi_match query | The mapping grows and becomes more complex as more languages are supported; queries must list every language field |
| Per-index | Simple match queries; index patterns make it easy to search one, several, or all languages; each index has a small, language-specific mapping and its own settings | More indices (and shards) to create and manage; cross-language searches fan out over multiple indices |
If you still can’t decide, we’d recommend trying both and seeing what each strategy looks like with your dataset. If you have a dataset of relevance labels, you can also use the ranking evaluation API to see if there are differences in relevance between the various strategies.
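As a minimal sketch, assuming you have relevance labels for a handful of documents (the document IDs and ratings below are hypothetical placeholders), a ranking evaluation request for the per-field strategy could look like this:

# evaluate precision@10 of the per-field query against hypothetical relevance labels
GET lang-per-field/_rank_eval
{
  "requests": [
    {
      "id": "jahr_query",
      "request": {
        "query": {
          "multi_match": {
            "query": "jahr",
            "type": "best_fields",
            "fields": [ "contents.de", "contents.en", "contents.ja", "contents.ko", "contents.zh" ]
          }
        }
      },
      "ratings": [
        { "_index": "lang-per-field", "_id": "doc-1", "rating": 1 },
        { "_index": "lang-per-field", "_id": "doc-2", "rating": 0 }
      ]
    }
  ],
  "metric": {
    "precision": { "k": 10, "relevant_rating_threshold": 1 }
  }
}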
Additional approaches
We’ve seen two basic strategies for using language identification to index and search a multilingual corpus. With the power of ingest pipelines, we can achieve a wide variety of additional approaches and modifications. Here are a few examples to explore:
- Map script-common languages into a single field, e.g. map Chinese, Japanese, and Korean to a cjk field and use the cjk analyzer, and map en and fr into a latin field with the standard analyzer (see: examples/olympics.txt).
- Map unknown languages or non-Latin scripts into an icu field and use the icu analyzer (see: config/mappings/lang-per-field.json).
- Using a processor conditional or script processor, set multiple top languages above a threshold into a field (for faceting/filtering).
- Concatenate multiple fields of the document into a single field in order to identify the language, and optionally use it to search over (e.g. in an all_contents field) or just continue to follow the "language per-field" strategy after identifying the language (see: examples/simulate-concatenation.txt and examples/simulate-concatenation.out.json).
- Using a script processor, choose the predominant language only if the top class is above a threshold (e.g. 60% or 50%) or significantly greater than the second class predicted (e.g. above 50% and more than 10% higher than the second class); a sketch of this idea follows the list.
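As a hedged sketch of that last idea, here is what such a script processor could look like. The 0.6 threshold and the "unknown" fallback are arbitrary choices, the field names follow the per-field pipeline shown earlier, and it assumes num_top_classes is greater than zero so that top_classes is populated.

# hypothetical script processor: accept the top predicted language only if its
# probability is at least 0.6, otherwise fall back to "unknown"; would take the
# place of the rename of predicted_value in the per-field pipeline
{
  "script": {
    "lang": "painless",
    "source": "def top = ctx._ml.lang_ident.top_classes[0]; ctx.contents.language = top.class_probability >= 0.6 ? top.class_name : 'unknown'"
  }
}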
Wrapping up
Hopefully this blog post gives you a starting point and some ideas on how to use language identification successfully for multilingual search! We’d love to hear from you so please don’t be shy and join our Discuss forum. Let us know if you’re using language identification successfully or if you run into any problems.
References
- Amanda Spink, Dietmar Wolfram, Major B. J. Jansen, Tefko Saracevic. 2001. Searching the Web: The Public and Their Queries. Journal of the American Society for Information Science and Technology. Volume 52, Issue 3, pp 226-234.
- A. Poutsma. 2001. Applying Monte Carlo Techniques to Language Identification.