From ES|QL to native Pandas dataframes in Python

Since Elasticsearch 8.15 or with Elasticsearch Serverless, ES|QL responses support Apache Arrow streaming format. This blog post will show you how to take advantage of it in Python. In an earlier blog post, I demonstrated how to convert ES|QL queries to Pandas dataframes using CSV as an intermediate representation. Unfortunately, CSV requires explicit type declarations, is slow (especially for larger datasets) and does not handle nested arrays and objects. Apache Arrow lifts all these limitations.

ES|QL produces tables

ES|QL to Pandas dataframes in Python

Importing test data

First, let's import some test data. As before, we will be using the employees sample data and mappings. The easiest way to load this dataset is to run these two Elasticsearch API requests in the Kibana Console.

Converting dataset to a Pandas DataFrame object

OK, with that out of the way, let's convert the full employees dataset to a Pandas DataFrame object using the ES|QL Arrow export:

from elasticsearch import Elasticsearch

client = Elasticsearch(
    "https://[host].elastic-cloud.com",
    api_key="...",
)

response = client.esql.query(
    query="FROM employees | DROP is_rehired,job_positions,salary_change* | LIMIT 500",
    format="arrow",
)
df = response.to_pandas()
print(df)

Even though this dataset only contains 100 records, we use a LIMIT command to avoid ES|QL warning us about potentially missing records. This prints the following dataframe:

    avg_worked_seconds birth_date  emp_no  ...    last_name salary  still_hired
0            268728049 1953-09-02   10001  ...      Facello  57305         True
1            328922887 1964-06-02   10002  ...       Simmel  56371         True
2            200296405 1959-12-03   10003  ...      Bamford  61805        False
3            311267831 1954-05-01   10004  ...      Koblick  36174         True
4            244294991 1955-01-21   10005  ...     Maliniak  63528         True
..                 ...        ...     ...  ...          ...    ...          ...
95           204381503 1954-09-16   10096  ...      Mandell  43889        False
96           206258084 1952-02-27   10097  ...  Waschkowski  71165        False
97           272392146 1961-09-23   10098  ...    Servieres  44817        False
98           377713748 1956-05-25   10099  ...      Sullins  73578         True
99           223910853 1953-04-21   10100  ...    Haraldson  68431         True

[100 rows x 17 columns]

OK, so what actually happened here?

  • Given format="arrow", Elasticsearch returns binary Arrow streaming data
  • The Elasticsearch Python client looks at the Content-Type header and creates a PyArrow object
  • Finally, PyArrow's Pandas integration converts the PyArrow object to a Pandas dataframe. (Note that in most cases this is not a zero copy conversion.)

This means that for this example to work, the Pandas and PyArrow optional dependencies need to be installed. Note that Pandas itself is not strictly required, only PyArrow. If you want to use Polars instead, you can use from_arrow to create a Polars DataFrame from the PyArrow table the client returns.

One limitation is that Elasticsearch does not currently handle multi-valued fields, which is why we had to drop the is_rehired, job_positions and salary_change columns. This limitation will be lifted in a future version of Elasticsearch.

Anyway, you now have a Pandas dataframe that you can use to analyze your data further. But you can also continue massaging the data using ES|QL, which is particularly useful when queries return more than 10,000 rows, the current maximum number of rows that ES|QL queries can return.

More complex queries

In the next example, we're counting how many employees are speaking a given language by using STATS ... BY (not unlike GROUP BY in SQL). And then we sort the result with the languages column using SORT:

response = client.esql.query(
    query="""
    FROM employees
    | DROP is_rehired,job_positions,salary_change*
    | STATS count = COUNT(emp_no) BY languages
    | SORT languages
    | LIMIT 500
    """,
    format="arrow",
)

df = response.to_pandas()
print(df)

Unlike with CSV, we did not have to specify any types, as Arrow data already includes types. Here's the result:

   count  languages
0     15        1.0
1     19        2.0
2     17        3.0
3     18        4.0
4     21        5.0
5     10        NaN

21 employees speak 5 languages, wow!

Queries with parameters

Finally, suppose that you want to expand the query from the previous section to only consider employees that speak N or more languages, with N being a variable parameter. For this we can use ES|QL's built-in support for parameters, which eliminates the risk of an injection attack associated with manually assembling queries with variable parts:

response = client.esql.query(
    query="""
    FROM employees
    | DROP is_rehired,job_positions,salary_change*
    | STATS count = COUNT(emp_no) BY languages
    | WHERE languages >= (?)
    | SORT languages
    | LIMIT 500
    """,
    format="arrow",
    params=[3],
)

df = response.to_pandas()
print(df)

which prints the following:

   count  languages
0     17          3
1     18          4
2     21          5

Conclusion

As we saw, ES|QL's native Arrow support makes working with Pandas and other DataFrame libraries even nicer than using CSV and it will continue to improve over time, with the multi-value support coming in a future version of Elasticsearch.

Additional resources

If you want to learn more about ES|QL, the ES|QL documentation is the best place to start. You can also check out this other Python example using Boston Celtics data. To know more about the Python Elasticsearch client itself, you can refer to the documentation, ask a question on Discuss with the language-clients tag or open a new issue if you found a bug or have a feature request. Thank you!

Ready to try this out on your own? Start a free trial.
Elasticsearch has integrations for tools from LangChain, Cohere and more. Join our advanced semantic search webinar to build your next GenAI app!
Recommended Articles