This Week in Elasticsearch and Apache Lucene - 2018-05-25
Elasticsearch
HTTP Pipelining always enabled in 7.0
HTTP pipelining is enabled by default today. However, it is possible for users to disable HTTP pipelining in Elasticsearch and then send multiple requests to Elasticsearch at the same time from the client (i.e., behave like a pipeline-enabled client). In this situation, Elasticsearch's behavior is undefined. In the interests of preventing confusion and error, we merged a PR to make HTTP pipelining support no longer configurable in Elasticsearch, so that it is always enabled.
Making context mandatory in the context suggester in 7.0
Querying a context-enabled completion field without a context is slow. While this is documented, it is also dangerous. Accordingly, we have decided to deprecate this behavior in 6.x and remove it in 7.0. If querying across all contexts is still necessary, it will remain possible to add a special "match_all" category context to every suggestion at index time. This approach requires reindexing, but it is efficient, unlike the previous approach, which was never designed to deliver the speed expected of completion suggesters.
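As a rough illustration of the workaround, here is a minimal sketch using the Python client; the index name, the "genre" context, and the catch-all "any" category are made up for the example rather than taken from the change itself.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch()

# Completion field with a category context named "genre" (names are illustrative).
es.indices.create(index="songs", body={
    "mappings": {
        "_doc": {
            "properties": {
                "suggest": {
                    "type": "completion",
                    "contexts": [{"name": "genre", "type": "category"}]
                }
            }
        }
    }
})

# At index time, add a catch-all category (here "any") to every suggestion
# in addition to its real contexts.
es.index(index="songs", doc_type="_doc", id=1, body={
    "suggest": {
        "input": ["Nevermind", "Nirvana"],
        "contexts": {"genre": ["rock", "any"]}
    }
})

# Instead of querying without a context (deprecated in 6.x, removed in 7.0),
# query the catch-all category to suggest across all documents.
resp = es.search(index="songs", body={
    "suggest": {
        "song-suggest": {
            "prefix": "nir",
            "completion": {
                "field": "suggest",
                "contexts": {"genre": ["any"]}
            }
        }
    }
})
```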
Keyword splitting on whitespace at query time
In 6.x, Elasticsearch moved from splitting query_string queries on whitespace to running the whole query text through the field's normalizer only. Therefore, as of 6.x, a simple query like q=keyword_field:(new york) creates a single term query "new york" targeting the keyword field. While this is intuitive, some users built functionality on top of the old behavior. We have therefore added an option to keyword field mappings indicating that the query should be split on whitespace at query time. This works with all full-text query parsers and doesn't break the multi-word analysis of text fields.
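A hedged sketch of what the new mapping option looks like in practice; the option name used below (split_queries_on_whitespace) and the index/field names are assumptions for illustration, not quoted from the PR.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch()

# Keyword field that opts back into whitespace splitting at query time.
# "split_queries_on_whitespace" is the assumed option name for this example.
es.indices.create(index="places", body={
    "mappings": {
        "_doc": {
            "properties": {
                "city": {
                    "type": "keyword",
                    "split_queries_on_whitespace": True
                }
            }
        }
    }
})

# With the option enabled, q=city:(new york) is analyzed as the two terms
# "new" and "york" instead of the single term "new york".
resp = es.search(index="places", body={
    "query": {"query_string": {"query": "city:(new york)"}}
})
```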
Plugin Signature Verification
We added support for verifying signatures on official plugins during plugin installation. Today we sign our artifacts with our GPG key, which means that users have a way to validate the integrity and authenticity of our artifacts after downloading them over the Internet. This week we extended that to the plugin installer: any time a user installs an official plugin (e.g., bin/elasticsearch-plugin install analysis-icu) over the Internet on a release or snapshot build (or an internal staged release build), we check that the downloaded bits carry a valid signature from the expected key.
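The plugin installer performs this check automatically, but the equivalent manual verification of a downloaded artifact looks roughly like the sketch below; the file names are hypothetical.

```python
import subprocess

# Hypothetical file names: a downloaded plugin zip, its detached .asc
# signature, and the public signing key.
PLUGIN = "analysis-icu-6.4.0.zip"
SIGNATURE = "analysis-icu-6.4.0.zip.asc"
PUBLIC_KEY = "signing-key.asc"

# Import the expected public key into the local keyring, then check that the
# detached signature matches the downloaded bits. gpg exits non-zero (and
# check=True raises CalledProcessError) if verification fails.
subprocess.run(["gpg", "--import", PUBLIC_KEY], check=True)
subprocess.run(["gpg", "--verify", SIGNATURE, PLUGIN], check=True)
print("signature OK")
```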
Cross-cluster replication benchmarking
The CCR team has reached the point of being able to benchmark our in-development cross-cluster replication feature, transferring 30GB of data between regions in Google Cloud. While much work remains, this is an important milestone as it allows the team to iterate on the default parameters that will be critical to out-of-the-box performance.
Improved authentication handling
We made a change to our authentication layer that prevents a node from making multiple simultaneous authentication requests to external systems (such as LDAP) for the same user. While we already cached successful authentications, a few scenarios (such as many Metricbeat instances connecting to Elasticsearch at the same time) could still cause periodic spikes in load due to these duplicate authentication requests.
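The general idea (not Elasticsearch's actual implementation) can be sketched as coalescing concurrent requests for the same user onto a single in-flight call to the realm:

```python
import threading
from concurrent.futures import Future

class CoalescingAuthenticator:
    """Sketch only: concurrent authentication attempts for the same user
    share a single in-flight call to the external realm (e.g. LDAP)."""

    def __init__(self, realm_authenticate):
        self._realm_authenticate = realm_authenticate  # slow external call
        self._in_flight = {}                           # username -> Future
        self._lock = threading.Lock()

    def authenticate(self, username, password):
        with self._lock:
            future = self._in_flight.get(username)
            is_owner = future is None
            if is_owner:
                future = Future()
                self._in_flight[username] = future
        if is_owner:
            # Only the first caller hits the external system; everyone else
            # waits on the same Future. A real implementation must also check
            # that a waiting request's credentials match before reusing the result.
            try:
                future.set_result(self._realm_authenticate(username, password))
            except Exception as exc:
                future.set_exception(exc)
            finally:
                with self._lock:
                    self._in_flight.pop(username, None)
        return future.result()
```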
Changes
Changes in 5.6:
- Use correct cluster state version for node fault detection #30810
Changes in 6.3:
- Security: fix dynamic mapping updates with aliases #30787
- Move watcher-history version setting to _meta field #30832
- [Security] Include an empty json object in an json array when FLS filters out all fields #30709
- SQL: Preserve scoring in bool queries #30730
- Upgrade to lucene-7.3.1 #30729
Changes in 6.4:
- Modify state of VerifyRepositoryResponse for bwc #30762
- REST high-level client: add put ingest pipeline API #30793
- Use remote client in TransportFieldCapsAction #30838
- Limit user to single concurrent auth per realm #30794
- [Tests] Move templated _rank_eval tests #30679
- Ensure that ip_range aggregations always return bucket keys. #30701
- Force stable file modes for built packages #30823
- Send client headers from TransportClient #30803
- Add support for indexed shape routing in geo_shape query #30760
- Add a format option to docvalue_fields. #29639
- Only ack cluster state updates successfully applied on all nodes #30672
- Replace Request#setHeaders with addHeader #30588
- Reduce CLI scripts to one-liners on Windows #30772
- [Feature] Adding a char_group tokenizer #24186
- Increase the maximum number of filters that may be in the cache. #30655
- Enable installing plugins from snapshots.elastic.co #30765
- Ignore empty completion input #30713
- Add Delete Repository High Level REST API #30666
- Reduce CLI scripts to one-liners #30759
Changes in 7.0:
- BREAKING: Use geohash cell instead of just a corner in geo_bounding_box #30698
- Reintroduce mandatory http pipelining support #30820
- Expose Lucene’s FeatureField. #30618
- Simplify number of shards setting #30783
- Make http pipelining support mandatory #30695
- BREAKING: Scripting: Remove getDate methods from ScriptDocValues #30690
Lucene
Impacts for synonym and phrase queries
The fact that codecs expose raw impacts makes it possible to speed up more queries when the total hit count is not tracked. For instance, merging impacts by taking the sum of the term frequencies of all involved terms allows us to compute upper bounds on the scores of synonym queries, which are typically created by query parsers for terms that occur at the same position. This in turn speeds up synonym queries quite a bit by skipping blocks of documents whose score upper bound is less than the minimum score currently required for a hit to be competitive. We explored doing the same for phrase queries by taking the minimum of the term frequencies of all involved terms, which also looks promising even though the speed-up is less spectacular than for synonym queries.
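To make the reasoning concrete, here is a toy sketch (not Lucene's code) that uses a simplified BM25-style formula and hypothetical (term frequency, length norm) impact pairs to show why summing frequencies bounds a synonym query's score for a block while taking the minimum bounds a phrase query's.

```python
def score(freq, norm, k1=1.2, b=0.75, idf=1.0, avg_len=10.0):
    # Simplified BM25-style saturation curve: increasing in freq,
    # decreasing in norm (norm stands in for document length).
    return idf * freq * (k1 + 1) / (freq + k1 * (1 - b + b * norm / avg_len))

def synonym_upper_bound(per_term_impacts):
    # Synonym query: every occurrence of every term could contribute,
    # so bound with the SUM of the per-term maximum frequencies,
    # paired with the smallest norm seen in the block.
    max_freq = sum(max(f for f, _ in impacts) for impacts in per_term_impacts)
    min_norm = min(min(n for _, n in impacts) for impacts in per_term_impacts)
    return score(max_freq, min_norm)

def phrase_upper_bound(per_term_impacts):
    # Phrase query: the phrase cannot occur more often than its rarest term,
    # so bound with the MINIMUM of the per-term maximum frequencies.
    max_freq = min(max(f for f, _ in impacts) for impacts in per_term_impacts)
    min_norm = min(min(n for _, n in impacts) for impacts in per_term_impacts)
    return score(max_freq, min_norm)

# Hypothetical impacts for two terms within one block of documents,
# as (term_frequency, length_norm) pairs.
impacts = [[(3, 8), (7, 12)], [(1, 8), (2, 15)]]
print(synonym_upper_bound(impacts))  # if below the current minimum competitive
print(phrase_upper_bound(impacts))   # score, the whole block can be skipped
```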
Other
- There was a concurrency issue in the way that we publish deletes for merges.
- A user reported that SmartChineseAnalyzer doesn't deal correctly with surrogate pairs. The team helped fix it so that its name is less embarrassing. :)
- IndexWriter is quite a beast, with complex dependencies on merge policies, merge schedulers, deletion policies, etc. There is a proposal to clean this up a bit by detaching IndexWriter from merge policies.
- TestRandomChains continues to find some corner cases with the new ConditionalTokenFilter.
- We wonder whether we could add a multiplexing token filter, so that a single token stream could feed multiple child token filters with the same tokens.
- There are ongoing discussions about the way we should expose matching terms for the current position in the new matches API.
- Lucene's mock WindowsFS was assuming that the inode of a file would change when a file is moved, but this is not true with HardLinkCopyDirectoryWrapper which copies files using hard links.
- We have a proposal for a new ConcatenateFilter to concatenate the content of all produced tokens.
- Recent changes could leave unreferenced files in the index after a commit; the problem boiled down to a missing checkpoint and is now fixed.
- We are discussing the impact of handling more than 2B documents in a single index. There are concerns about making this number unbounded, but maybe we could go with a higher limit such as 16B.