This blog shows how to fetch information from an Elasticsearch index using natural language expressions and semantic search. We will create a serverless Elasticsearch project, load a historical Olympic Games dataset into an index, generate inferred data (in a sparse vector field) using the inference processor together with the ELSER model, and finally search for historical Olympic competition information in natural language, thanks to the text expansion query.
The tools and the data set
For this project we will use an Elasticsearch serverless project and the serverless Python client (elasticsearch_serverless) to interact with Elasticsearch. To create a serverless project, simply follow the get started with serverless guide. More information on serverless, including pricing, can be found here.
When setting up a serverless project, be sure to select the Elasticsearch option and the general purpose option for working through this tutorial.
The dataset used covers Summer Olympic Games competitors from 1896 to 2020, obtained from Kaggle (Athletes_summer_games.csv). It contains the competition year, the type of competition, the name of the participant, whether they won a medal and, if so, which one, along with other information.
To manipulate the dataset, we will use Eland, a Python client and toolkit for DataFrames and machine learning in Elasticsearch.
Finally, the natural language processing (NLP) model used is the Elastic Learned Sparse EncodeR (ELSER), a retrieval model trained by Elastic that lets you retrieve more relevant search results through semantic search.
Before following the steps below, please make sure you have installed the serverless Python client and Eland.
pip install elasticsearch_serverless
pip install eland
Please note the versions I used below. If you are not using the same versions, you may need to adjust the code for any syntax changes in the versions you are using.
➜ ~ python3 --version
Python 3.9.6
➜ ~ pip3 list | grep -E 'elasticsearch-serverless|eland'
eland                     8.14.0
elasticsearch-serverless  0.3.0.20231031
Download and deploy ELSER model
We will use the Python client to download and deploy the ELSER model. Before doing that, let's first confirm that we can connect to our serverless project. The URL and API key below are read from environment variables; be sure to use the appropriate values for your environment, or read the credentials in whichever way you prefer.
from elasticsearch_serverless import Elasticsearch
from os import environ

serverless_endpoint = environ.get("SERVERLESS_ENDPOINT_URL")
serverless_api_key = environ.get("SERVERLESS_API_KEY")

client = Elasticsearch(
    serverless_endpoint,
    api_key=serverless_api_key
)

client.info()
If everything is properly configured, you should get output like the following:
ObjectApiResponse({'name': 'serverless', 'cluster_name': 'd6c6698e28c34e58b6f858df9442abac', 'cluster_uuid': 'hOuAhMUPQkumEM-PxW_r-Q', 'version': {'number': '8.11.0', 'build_flavor': 'serverless', 'build_type': 'docker', 'build_hash': '00000000', 'build_date': '2023-10-31', 'build_snapshot': False, 'lucene_version': '9.7.0', 'minimum_wire_compatibility_version': '8.11.0', 'minimum_index_compatibility_version': '8.11.0'}, 'tagline': 'You Know, for Search'})
Now that we've confirmed that the Python client is successfully connecting to the serverless Elasticsearch project, let’s download and deploy the ELSER model. We will check whether the model was previously deployed and delete it in order to perform a fresh install. Also, as the deployment phase can take a few minutes, we will repeatedly check the model configuration to make sure the model definition is present before moving to the next phase. For more information, check the Get trained models API.
from elasticsearch_serverless import Elasticsearch, exceptions
import time

# Delete the model if it was already downloaded and deployed
try:
    client.ml.delete_trained_model(model_id=".elser_model_2", force=True)
    print("Model deleted successfully, we will proceed with creating one")
except exceptions.NotFoundError:
    print("Model doesn't exist, but we will proceed with creating one")

# Create the ELSER model configuration. This automatically downloads the model if it doesn't exist.
client.ml.put_trained_model(
    model_id=".elser_model_2",
    input={
        "field_names": ["concatenated_text"]
    }
)

# Check the download and deploy progress
while True:
    status = client.ml.get_trained_models(
        model_id=".elser_model_2",
        include="definition_status"
    )

    if status["trained_model_configs"][0]["fully_defined"]:
        print("ELSER Model is downloaded and ready to be deployed.")
        break
    else:
        print("ELSER Model is downloaded but not ready to be deployed.")
        time.sleep(5)
Once we get confirmation that the model is downloaded and ready to be deployed, we can go ahead and start ELSER. It can take a little while for the deployment to be fully ready.
# A function to check the model's routing state
# https://www.elastic.co/guide/en/elasticsearch/reference/current/get-trained-models-stats.html
def get_model_routing_state(model_id=".elser_model_2"):
    try:
        status = client.ml.get_trained_models_stats(
            model_id=model_id,
        )
        return status["trained_model_stats"][0]["deployment_stats"]["nodes"][0]["routing_state"]["routing_state"]
    except Exception:
        return None

# If ELSER is already started, then we are fine.
if get_model_routing_state(".elser_model_2") == "started":
    print("ELSER Model has been already deployed and is currently started.")

# Otherwise, we will deploy it and monitor the routing state to make sure it is started.
else:
    print("ELSER Model will be deployed.")

    # Start the trained model deployment
    client.ml.start_trained_model_deployment(
        model_id=".elser_model_2",
        number_of_allocations=16,
        threads_per_allocation=4,
        wait_for="starting"
    )

    while True:
        if get_model_routing_state(".elser_model_2") == "started":
            print("ELSER Model has been successfully deployed.")
            break
        else:
            print("ELSER Model is currently being deployed.")
            time.sleep(5)
Load the data set into Elasticsearch using Eland
eland.csv_to_eland reads a comma-separated values (CSV) file into a DataFrame stored in an Elasticsearch index. We will use it to load the Olympics data (Athletes_summer_games.csv) into Elasticsearch. The es_type_overrides parameter lets you override the default mappings.
import eland as ed

index = "elser-olympic-games"
csv_file = "Athletes_summer_games.csv"

ed.csv_to_eland(
    csv_file,
    es_client=client,
    es_dest_index=index,
    es_if_exists='replace',
    es_dropna=True,
    es_refresh=True,
    index_col=0,
    es_type_overrides={
        "City": "text",
        "Event": "text",
        "Games": "text",
        "Medal": "text",
        "NOC": "text",
        "Name": "text",
        "Season": "text",
        "Sport": "text",
        "Team": "text"
    }
)
After executing the lines above, the data will be written to the index elser-olympic-games. You can also capture the resulting DataFrame (eland.DataFrame) in a variable for further manipulation, as sketched below.
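For instance, here is a minimal sketch, assuming the same client, CSV file, and index as above, that keeps the returned eland.DataFrame and inspects it:

import eland as ed

# csv_to_eland returns an eland.DataFrame backed by the Elasticsearch index
df = ed.csv_to_eland(
    "Athletes_summer_games.csv",
    es_client=client,
    es_dest_index="elser-olympic-games",
    es_if_exists="replace",
    es_dropna=True,
    es_refresh=True,
    index_col=0
)

# Operations are evaluated lazily against the index, not in local memory
print(df.shape)                    # (number of documents, number of columns)
print(df["Medal"].value_counts())  # medal distribution, computed via a terms aggregation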
Create an ingest pipeline for inference based on ELSER
The next step in our journey to explore past Olympic competition data using semantic search is to create an ingest pipeline containing an inference processor that runs the ELSER model. A set of fields has been selected and concatenated into a single field on which the inference processor will work. Depending on your use case, you might want to use another strategy.
The concatenation is done using the script processor. The inference processor uses the previously deployed ELSER model, taking the concatenated field as input and storing the output in a sparse vector field (see the following section).
client.ingest.put_pipeline(
    id="elser-ingest-pipeline",
    description="Ingest pipeline for ELSER",
    processors=[
        {
            "script": {
                "description": "Concatenate the values of some selected fields into the `concatenated_text` field",
                "lang": "painless",
                "source": """
                    ctx['concatenated_text'] = ctx['Name'] + ' ' + ctx['Team'] + ' ' + ctx['Games'] + ' ' + ctx['City'] + ' ' + ctx['Event'];
                """
            }
        },
        {
            "inference": {
                "model_id": ".elser_model_2",
                "ignore_missing": True,
                "input_output": [
                    {
                        "input_field": "concatenated_text",
                        "output_field": "concatenated_text_embedding"
                    }
                ]
            }
        }
    ]
)
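Before running the pipeline over the whole index, you can sanity-check it with the simulate pipeline API. Here is a minimal sketch; the sample document values are made up for illustration:

# Run the pipeline on a sample document without indexing anything
response = client.ingest.simulate(
    id="elser-ingest-pipeline",
    docs=[
        {
            "_source": {
                "Name": "Jane Doe",  # hypothetical sample values
                "Team": "France",
                "Games": "1900 Summer",
                "City": "Paris",
                "Event": "Golf Women's Individual"
            }
        }
    ]
)

doc = response["docs"][0]["doc"]["_source"]
print(doc["concatenated_text"])                      # the concatenated input
print(list(doc["concatenated_text_embedding"])[:5])  # a few of the expanded terms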
Preparing the index
This is the last stage before we can query past Olympic competition data using natural language expressions. We will update the previously created index’s mapping by adding a sparse vector field.
Update the mapping: add a sparse vector field
We will update the index mapping by adding a field that will hold the concatenated data, and a sparse vector field that will hold the inferred information computed by the inference processor using the ELSER model.
index="elser-olympic-games"
mappings_properties={ "concatenated_text": { "type": "text" }, "concatenated_text_embedding": { "type": "sparse_vector" }}
client.indices.put_mapping( index=index, properties=mappings_properties)
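To confirm that both fields are in place, you can read the mapping back. A quick check, assuming the index name above:

# Verify that the new fields were added to the mapping
mapping = client.indices.get_mapping(index="elser-olympic-games")
properties = mapping["elser-olympic-games"]["mappings"]["properties"]

assert properties["concatenated_text"]["type"] == "text"
assert properties["concatenated_text_embedding"]["type"] == "sparse_vector"
print("Mapping updated as expected.")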
Populate the sparse vector field
We will run an update by query to call the previously created ingest pipeline in order to populate the sparse vector field in each document.
client.update_by_query(
    index="elser-olympic-games",
    pipeline="elser-ingest-pipeline",
    wait_for_completion=False
)
The request will take a few moments, depending on the number of documents and on the number of allocations and threads per allocation used to deploy ELSER. Once this step is complete, we can start exploring the past Olympic dataset using semantic search. Since we passed wait_for_completion=False, the call returns immediately; one simple way to track progress is sketched below.
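A minimal sketch that polls the enrichment progress by counting documents that already have the sparse vector field (index and field names as above):

import time

index = "elser-olympic-games"
total = client.count(index=index)["count"]

while True:
    # Count documents whose sparse vector field has been populated
    done = client.count(
        index=index,
        query={"exists": {"field": "concatenated_text_embedding"}}
    )["count"]
    print(f"{done}/{total} documents enriched")
    if done >= total:
        break
    time.sleep(10)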
Let's explore the Olympic data set using semantic search
Now we will use text expansion queries to retrieve information about past Olympic competitions using natural language expressions. Before getting to the demonstration, let's create a function to retrieve and format the search results.
def semantic_search(search_text):
    response = client.search(
        index="elser-olympic-games",
        size=3,
        query={
            "bool": {
                "must": [
                    {
                        "text_expansion": {
                            "concatenated_text_embedding": {
                                "model_id": ".elser_model_2",
                                "model_text": search_text
                            }
                        }
                    },
                    {
                        "exists": {
                            "field": "Medal"
                        }
                    }
                ]
            }
        },
        source_excludes=["*_embedding", "concatenated_text"]
    )

    for hit in response["hits"]["hits"]:
        doc_id = hit["_id"]
        score = hit["_score"]
        year = hit["_source"]["Year"]
        event = hit["_source"]["Event"]
        games = hit["_source"]["Games"]
        sport = hit["_source"]["Sport"]
        city = hit["_source"]["City"]
        team = hit["_source"]["Team"]
        name = hit["_source"]["Name"]
        medal = hit["_source"]["Medal"]

        print(f"Score: {score}\nDocument ID: {doc_id}\nYear: {year}\nEvent: {event}\nName: {name}\nCity: {city}\nTeam: {team}\nMedal: {medal}\n")
The function above receives a question about past Olympic competition winners and performs a semantic search using Elastic’s text expansion query. The retrieved results are formatted and printed. Notice that we require the Medal field to exist in the query, as we are only interested in the winners. We also limit the result size to 3, as we expect three winners (gold, silver, bronze). Again, depending on your use case, you might not do exactly the same thing.
🏌️♂️ “Who won the Golf competition in 1900?”
Request:
semantic_search("Who won the Golf competition in 1900?")
Output:
Score: 18.184263
Document ID: 206566
Year: 1900
Event: Golf Men's Individual
Name: Walter Mathers Rutherford
City: Paris
Team: Great Britain
Medal: Silver

Score: 17.443663
Document ID: 209892
Year: 1900
Event: Golf Men's Individual
Name: Charles Edward Sands
City: Paris
Team: United States
Medal: Gold

Score: 16.939331
Document ID: 192747
Year: 1900
Event: Golf Women's Individual
Name: Myra Abigail "Abbie" Pratt (Pankhurst-, Wright-, -Karageorgevich)
City: Paris
Team: United States
Medal: Bronze
🏃♀️ “2004 Women's Marathon winners”
Request:
semantic_search("2004 Women's Marathon winners")
Output:
Score: 24.948284
Document ID: 168955
Year: 2004
Event: Athletics Women's Marathon
Name: Wincatherine Nyambura "Catherine" Ndereba
City: Athina
Team: Kenya
Medal: Silver

Score: 24.08922
Document ID: 58799
Year: 2004
Event: Athletics Women's Marathon
Name: Deena Michelle Drossin-Kastor
City: Athina
Team: United States
Medal: Bronze

Score: 21.391462
Document ID: 172670
Year: 2004
Event: Athletics Women's Marathon
Name: Mizuki Noguchi
City: Athina
Team: Japan
Medal: Gold
🏹 “Women archery winners of 1908”
Request:
semantic_search("Women archery winners of 1908")
Output:
Score: 21.876282
Document ID: 96010
Year: 1908
Event: Archery Women's Double National Round
Name: Beatrice Geraldine Hill-Lowe (Ruxton-, -Thompson)
City: London
Team: Great Britain
Medal: Bronze

Score: 21.0998
Document ID: 170250
Year: 1908
Event: Archery Women's Double National Round
Name: Sybil Fenton Newall
City: London
Team: Great Britain
Medal: Gold

Score: 21.079535
Document ID: 56686
Year: 1908
Event: Archery Women's Double National Round
Name: Charlotte "Lottie" Dod
City: London
Team: Great Britain
Medal: Silver
🚴♂️ “Who won the cycling competition in 1972?”
Request:
semantic_search("Who won the cycling competition in 1972?")
Output:
Score: 20.554308
Document ID: 215559
Year: 1972
Event: Cycling Men's Road Race, Individual
Name: Kevin "Clyde" Sefton
City: Munich
Team: Australia
Medal: Silver

Score: 20.267525
Document ID: 128598
Year: 1972
Event: Cycling Men's Road Race, Individual
Name: Hendrikus Andreas "Hennie" Kuiper
City: Munich
Team: Netherlands
Medal: Gold

Score: 19.108923
Document ID: 19225
Year: 1972
Event: Cycling Men's Team Pursuit, 4,000 metres
Name: Michael John "Mick" Bennett
City: Munich
Team: Great Britain
Medal: Bronze
Conclusion
This blog showed how you can perform semantic search with the Elastic Learned Sparse EncodeR (ELSER) NLP model in Python, using an Elasticsearch serverless project. Be sure to shut down your serverless project after running this tutorial to avoid any extra charges. To go further, feel free to check out our Elasticsearch Relevance Engine (ESRE) Engineer course, where you can learn how to leverage ESRE and large language models (LLMs) to build advanced Retrieval-Augmented Generation (RAG) applications that combine the storage, processing, and search features of Elasticsearch with the generative power of an LLM.
The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.