An Introduction to Elasticsearch SQL with Practical Examples - Part 2
Want to learn more about the differences between the Amazon Elasticsearch Service and our official Elasticsearch Service? Visit our AWS Elasticsearch comparison page.
In Part 1 of this Introduction to Elasticsearch SQL series, we provided a gentle introduction to the new Elasticsearch SQL feature as well as the _translate API. This specific post continues the series by exploring more complex features. After that, we’ll look at some of the current limitations a user is likely to encounter with this initial release of SQL. Finally, we’ll conclude with a teaser of what is on the roadmap!
As a reminder, all readers can execute any examples provided at demo.elastic.co or wait for the dataset to be available with Kibana. Any datasets used in this series have been indexed, with SQL made available through the Kibana console. Links to demo.elastic.co are provided for each example. Alternatively, all queries can be viewed as a single thread in our demo Kibana console.
Complex Examples and Elasticsearch Goodness
Grouping
The aggregation framework of Elasticsearch — capable of summarising potentially billions of data points — represents one of the most powerful and popular features of the stack. From a capability perspective, it has a natural equivalence to the GROUP BY operator in SQL. As well as providing some examples of the GROUP BY capabilities, we will again use the translate API to show the equivalent aggregations.
“Find the average flight time per source destination country for flights to London. Order by the countries alphabetically.”
sql> SELECT AVG(FlightTimeHour) Avg_Flight_Time, OriginCountry FROM flights GROUP BY OriginCountry ORDER BY OriginCountry LIMIT 5; Avg_Flight_Time | OriginCountry ------------------+--------------- 9.342180244924574 |AE 13.49582274385201 |AR 4.704097126921018 |AT 15.081367354940724|AU 7.998943401875511 |CA
Inspecting the DSL for this query shows the use of the Composite Aggregation.
{ "size": 0, "_source": false, "stored_fields": "_none_", "aggregations": { "groupby": { "composite": { "size": 1000, "sources": [ { "3471": { "terms": { "field": "OriginCountry.keyword", "order": "asc" } } } ] }, "aggregations": { "3485": { "avg": { "field": "FlightTimeHour" } } } } } }
We can also group by alias fields defined in the select using a function.
“Find the number and average flight time of flights per month.”
POST _xpack/sql { "query":"SELECT COUNT(*), MONTH_OF_YEAR(timestamp) AS month_of_year, AVG(FlightTimeHour) AS Avg_Flight_Time FROM flights GROUP BY month_of_year" }
COUNT(1) | month_of_year | Avg_Flight_Time ---------------+---------------+----------------- 1303 |5 |8.628949653846158 8893 |6 |8.520481551839334 2863 |7 |8.463433805045094
The use of Composite Aggregation has one principle advantage — it ensures the GROUP BY implementation is scalable for even high cardinality fields, providing a mechanism to stream all buckets of a specific aggregation similarly to what scroll does for documents. This also ensures the implementation does not suffer the same memory limitations that it would if it utilized the terms aggregation. It does, however, mean we are currently unable to sort the GROUP BY a metric. For example, the following would result in an error:
POST _xpack/sql/translate { "query":"SELECT AVG(FlightTimeHour) Avg_Flight_Time, OriginCountry FROM flights GROUP BY OriginCountry ORDER BY Avg_Flight_Time" }
Filtering Groups
In order to filter groups we can utilise the HAVING operator, which can also utilise an alias specified in the SELECT clause. This might be unusual for some SQL experts, as it is commonly not possible in RDBMS based implementations due to the SELECT being executed after the HAVING. Here, the HAVING clause is using an alias that is declared in the execution phase. Our Analyzer is, however, smart enough to do look ahead and pick up the declaration for use in HAVING.
“Find the number, average, and 95th percentile distance of flights from each origin city, where the average distance is between 3000 and 4000 miles.”
sql> SELECT OriginCityName, ROUND(AVG(DistanceKilometers)) avg_distance, COUNT(*) c, ROUND(PERCENTILE(DistanceKilometers,95)) AS percentile_distance FROM flights GROUP BY OriginCityName HAVING avg_distance BETWEEN 3000 AND 4000; OriginCityName | avg_distance | c |percentile_distance ---------------+---------------+---------------+------------------- Verona |3078 |120 |7927 Vienna |3596 |120 |7436 Xi'an |3842 |114 |7964
To implement the HAVING functionality, SQL Elasticsearch utilises the Bucket Selector pipeline aggregation, filtering the values using a parameterized painless script. Note how below, the keyword variant of the OriginCityName field is automatically selected for the aggregation instead of attempting to use the standard text variant which would likely fail due field data not being enabled. The avg and percentile metric aggregations provide equivalent functions to the SQL variants.
{ "size": 0, "_source": false, "stored_fields": "_none_", "aggregations": { "groupby": { "composite": { "size": 1000, "sources": [ { "4992": { "terms": { "field": "OriginCityName.keyword", "order": "asc" } } } ] }, "aggregations": { "4947": { "bucket_selector": { "buckets_path": { "a0": "5010", "a1": "5010" }, "script": { "source": "params.v0 <= params.a0 && params.a1 <= params.v1", "lang": "painless", "params": { "v0": 3000, "v1": 4000 } }, "gap_policy": "skip" } }, "5010": { "avg": { "field": "DistanceKilometers" } }, "5019": { "percentiles": { "field": "DistanceKilometers", "percents": [ 95 ], "keyed": true, "tdigest": { "compression": 100 } } } } } } }
Text Operators and Relevancy
One of the unique capabilities of Elasticsearch as a search engine, compared to a traditional RDBMS, is its ability to score matches beyond a simple yes/no by considering the properties of textual data using a relevancy calculation. Extending the SQL syntax allows us to expose this functionality, and go above and beyond what a traditional RDBMS might provide.
We thus introduce two new operators: QUERY and MATCH. For those familiar with Elasticsearch, these are equivalent to the underlying multi_match and query_string operators. Users of Kibana will be familiar with the behaviour of the query_string operator, as it used to power the default search bar. It provides intelligent parsing capabilities and allows a natural language style of querying. The specifics of these two operators is beyond the scope of this blog, but the definitive guide entry provides an excellent introduction to these concepts.
As an example, consider the following:
“Find all delayed flights to and from Kastrup Airport between the dates 2018-06-06 and 2018-06-17, sorted by date.”
Kastrup Airport is actually in Copenhagen, with the full title "Copenhagen Kastrup Airport". Using the QUERY operator we simply search for Kastrup.
sql> SELECT timestamp, FlightNum, OriginCityName, DestCityName FROM flights WHERE QUERY('Kastrup') AND FlightDelay=true AND timestamp > '2018-06-20' AND timestamp < '2018-06-27' ORDER BY timestamp; timestamp | FlightNum |OriginCityName | DestCityName ------------------------+---------------+---------------+--------------- 2018-06-21T01:46:28.000Z|57SWSLT |Copenhagen |Orlando 2018-06-21T07:28:07.000Z|X43J6GE |Abu Dhabi |Copenhagen 2018-06-21T13:36:31.000Z|7T04SK7 |Copenhagen |Milan 2018-06-22T19:52:15.000Z|NXMN87D |Mumbai |Copenhagen 2018-06-23T08:05:02.000Z|YXHMDKV |Copenhagen |Oslo 2018-06-25T18:21:04.000Z|2R86JEZ |Copenhagen |Shanghai 2018-06-26T22:16:10.000Z|TCE99LO |Copenhagen |Edmonton
Notice there was no requirement to specify the field here. Simply searching for “kastrup” using the QUERY operator was sufficient. Additionally, notice that we have flights delayed both to and from Kastrup. The Elasticsearch query here:
{ "size": 1000, "query": { "bool": { "filter": [ { "bool": { "filter": [ { "bool": { "filter": [ { "query_string": { "query": "Kastrup", "fields": [], "type": "best_fields", "default_operator": "or", "max_determinized_states": 10000, "enable_position_increments": true, "fuzziness": "AUTO", "fuzzy_prefix_length": 0, "fuzzy_max_expansions": 50, "phrase_slop": 0, "escape": false, "auto_generate_synonyms_phrase_query": true, "fuzzy_transpositions": true, "boost": 1 } }, { "term": { "FlightDelay": { "value": true, "boost": 1 } } } ], "adjust_pure_negative": true, "boost": 1 } }, { "range": { "timestamp": { "from": "2018-06-20", "to": null, "include_lower": false, "include_upper": false, "boost": 1 } } } ], "adjust_pure_negative": true, "boost": 1 } }, { "range": { "timestamp": { "from": null, "to": "2018-06-27", "include_lower": false, "include_upper": false, "boost": 1 } } } ], "adjust_pure_negative": true, "boost": 1 } }, "_source": { "includes": [ "FlightNum", "OriginCityName", "DestCityName" ], "excludes": [] }, "docvalue_fields": [ "timestamp" ], "sort": [ { "timestamp": { "order": "asc" } } ] }
For a user that’s new to Elasticsearch, this represents a relatively complex query. We have a bool query with a nested range, term restriction and query string operator. For a user migrating an application from SQL, this may have been traditionally a fairly daunting task even before concerns as to whether the final query was functionally correct and optimal. The actual query_string operator has been nested in a filter as no relevancy was required (we are sorting by date), thus allowing us to exploit the filter cache, skip scoring and improve the response time.
The parameters for these operators are also exposed in SQL. This final example, illustrates using a MATCH query with multiple search terms across several fields to limit results.
“Find flights to or from Barcelona where the weather included lightning”
For purposes of example, we also sort by and show the relevance score via the Score() function.
sql> SELECT Score(), timestamp, FlightNum, OriginCityName, DestCityName, DestWeather, OriginWeather FROM flights WHERE MATCH('*Weather,*City*', 'Lightning Barcelona', 'type=cross_fields;operator=AND') ORDER BY Score() DESC LIMIT 5; SCORE() | timestamp | FlightNum |OriginCityName | DestCityName | DestWeather | OriginWeather ---------------+------------------------+---------------+---------------+---------------+-------------------+------------------- 6.990964 |2018-05-31T06:00:41.000Z|L637ISB |Barcelona |Santiago |Rain |Thunder & Lightning 6.990964 |2018-06-13T12:23:44.000Z|0GIHB62 |Barcelona |Buenos Aires |Clear |Thunder & Lightning 6.9796515 |2018-06-14T21:04:51.000Z|7GEMEDR |Barcelona |Hyderabad |Thunder & Lightning|Rain 6.9133706 |2018-05-31T01:58:51.000Z|ZTOD7RQ |Barcelona |Dubai |Sunny |Thunder & Lightning 6.9095163 |2018-06-06T14:02:34.000Z|QSQA5CT |Barcelona |Naples |Rain |Thunder & Lightning
We use wildcard patterns to specify the fields to match and request the match be a boolean AND. The cross fields parameter does not require the terms to appear all in one field, rather they are allowed to be present in different fields provided that both are present. Given the structure of the data, this is essential to the match.
Our examples here have returned rows vs groups. The QUERY and MATCH operators can, however, also be used with a GROUP BY — effectively a filtered aggregation to Elasticsearch.
Cross Index Searches & Aliases
So far, our queries have targeted only a single table/index. If we duplicate the flights index, copying the documents to a new named version through a reindex request, we can query both at the same time, provided both indices have identical mappings. Any difference in the mapping would cause the query to error on analysis. In order to query multiple indices together the user can either add them to an Elasticsearch alias, or utilise wildcards in the WHERE clause. For example, assuming i have two indices “flights” and “flights-2” with the following “f_alias” alias:
POST /_aliases { "actions" : [ { "add" : { "index" : "flights-2", "alias" : "f_alias" } }, { "add" : { "index" : "flights", "alias" : "f_alias" } } ] }
The following are logically equivalent:
sql> SELECT FlightNum, OriginCityName, DestCityName, DestWeather, OriginWeather FROM flights* ORDER BY timestamp DESC LIMIT 1; FlightNum |OriginCityName | DestCityName | DestWeather | OriginWeather ---------------+---------------+---------------+---------------+--------------- 6UPDDGK |Zurich |Zurich |Rain |Hail sql> SELECT FlightNum, OriginCityName, DestCityName, DestWeather, OriginWeather FROM f_alias ORDER BY timestamp DESC LIMIT 1; FlightNum |OriginCityName | DestCityName | DestWeather | OriginWeather ---------------+---------------+---------------+---------------+--------------- 6UPDDGK |Zurich |Zurich |Rain |Hail
Try on demo.elastic.co. Our demo environment contains both indices with a pre-configured alias for users to test the above.
We may loosen the above requirements in future, but for now, this simplifies the logic in this initial release.
JOINs
JOINs in a traditional RDBMS SQL implementation allow rows from separate tables to be combined via related columns in a singular tabular response. This allows relational modelling of data and represents a significant topic compared to the options natively available in Elasticsearch to achieve this. Whilst Elasticsearch SQL currently exposes no JOIN operator, it does allow users to exploit nested documents, which offer simple relational modelling of one-to-many. The querying of nested documents is handled transparently to the user. In order to demonstrate this functionality, we need an index with such data. For the purposes of example, we have loaded an “orders” index to demo.elastic.co. This index’s documents represents orders from an ecommerce site and contain fields such as order_date, billing_city and customer_last_name. Additionally, a “products” field contains a nested sub document for every product in the order. For example:
{ "billing_last_name": "Green", "billing_first_name": "Jason", "order_id": 519894, "products": [ { "tax_amount": 0, "taxful_price": 22.99, "quantity": 1, "taxless_price": 22.99, "discount_amount": 0, "base_unit_price": 22.99, "discount_percentage": 0, "product_name": "Briefcase - black", "manufacturer": "Pier One", "min_price": 11.27, "created_on": "2016-11-08T04:16:19+00:00", "unit_discount_amount": 0, "price": 22.99, "product_id": 12733, "base_price": 22.99, "_id": "sold_product_519894_12733", "category": "Men's Accessories", "sku": "PI952HA0M-Q11" }, { "tax_amount": 0, "taxful_price": 16.99, "quantity": 1, "taxless_price": 16.99, "discount_amount": 0, "base_unit_price": 16.99, "discount_percentage": 0, "product_name": "3 PACK - Boxer shorts - white/navy", "manufacturer": "Pier One", "min_price": 8.33, "created_on": "2016-11-08T04:16:19+00:00", "unit_discount_amount": 0, "price": 16.99, "product_id": 18370, "base_price": 16.99, "_id": "sold_product_519894_18370", "category": "Men's Clothing", "sku": "PI982AA0Y-A11" } ], "has_error": false, "customer_last_name": "Green", "currency": "EUR", "billing_first_name": "Jason", "shipping_country_code": "US", "email": "swagelastic@gmail.com", "day_of_week": "Tuesday", "geoip": { "continent_name": "North America", "city_name": "New York", "country_iso_code": "US", "location": { "lon": -73.9862, "lat": 40.7662 }, "region_name": "New York" }, "payment_status": "FULLY_PAID", ... }
Normally, querying these documents would require the user to understand why we would use a nested data type for the products field and also have an understanding of the nested query syntax. However, with Elasticsearch SQL we are able to query these nested documents as if each represents a separate row with the fields of its parent (i.e. we effectively flatten the structure for presentation). Consider the above order which has two products. When queried this is presented as two rows, when requesting fields from the product sub-documents. Each row can also contain the fields of the parent order if requested. For example:
“Find flights the billing name used, and products purchased, for order 518894.”
sql> SELECT billing_last_name, billing_first_name, products.price, products.product_id FROM orders WHERE order_id=519894; billing_last_name|billing_first_name|products.price |products.product_id -----------------+------------------+---------------+------------------- Green |Jason |16.984375 |18370 Green |Jason |22.984375 |12733
The _translate API will show how this query is constructed using a nested query:
{ "size": 1000, "query": { "bool": { "filter": [ { "term": { "order_id": { "value": 519894, "boost": 1 } } }, { "nested": { "query": { "match_all": { "boost": 1 } }, "path": "products", "ignore_unmapped": false, "score_mode": "none", "boost": 1, "inner_hits": { "ignore_unmapped": false, "from": 0, "size": 99, "version": false, "explain": false, "track_scores": false, "_source": false, "stored_fields": "_none_", "docvalue_fields": [ "products.product_id", "products.price" ] } } } ], "adjust_pure_negative": true, "boost": 1 } }, "_source": { "includes": [ "billing_last_name", "billing_first_name" ], "excludes": [] }, "sort": [ { "_doc": { "order": "asc" } } ] }
Conversely if we queried for just parent fields we are presented with only one row:
“Find flights the billing name used for order 518894”
sql> SELECT billing_last_name, billing_first_name FROM orders WHERE order_id=519894; billing_last_name|billing_first_name -----------------+------------------ Green |Jason
What's missing and what's coming...
Current limitations which you might hit early in any experimentation:
- Cross table/index queries - This is possible provided the mappings of the indices are identical. Any differences will currently cause an error at query time.
- JOINs - As highlighted above, we support limited JOINs only using nested documents.
- Ordering of groups in GROUP BY - As discussed above, this limitation is inherited from the Elasticsearch composite aggregation
- Nested SELECT Clauses - These are common mechanism to support drill down on BI tools. Whilst nested SELECT statements using multiple indexes would be equivalent to a JOIN, statements using the same index could possibly be rewritten and executed. This is something we may consider in the future.
Some of the elasticsearch goodness we plan to address in future releases includes:
- The ability to do multi-level GROUP BY (e.g. “Find flight routes (source and destination) by month with an average delay greater than 2 hrs”)
- Exposing of Elasticsearch Geospatial operators
- INTERVAL type support (i.e the ability to operate on the difference between two dates and times.)
- Enhanced Data/Time Math and String functions
- Support for Histograms in grouping
- Cross-index and Cross-Cluster search where the mappings are not identical
Thanks for getting this far and stay tuned for subsequent releases in this Introduction to Elasticsearch SQL series!