Geospatial search made simple with LLM and Elasticsearch: Journey through the city

Explore how to create a geospatial search from a question formulated in natural language. In the example below, we'll demonstrate a question requesting lists of Airbnb properties within a certain radius around a subway station or a point of interest. You can expand this everyday use case to other geospatial searches, such as looking for restaurants, sights, schools, and other places within the specified area.

We will use three datasets for New York City: Airbnb listings, points of interest, and subway stations.

There is an accompanying Jupyter Notebook, which will guide you through the process of setting up the datasets, ingesting them into Elasticsearch, and setting up the GenAI and LLM part. We will also show you how to do a geospatial search with Elasticsearch and how to combine the two.

The first step is always to get the right datasets and prepare them for ingestion; the Jupyter Notebook describes this in more detail. We decided to run ELSER on the Airbnb dataset for semantic search. This means we need an ingest pipeline that runs the inference processor with ELSER as the model, and we must map the resulting fields as sparse_vector. Below is a JSON representation of an Airbnb listing, shortened to the important fields: name, description, amenities, and location.

{
    "name": "Lovely 1 bedroom rental in New York",
    "name_embedding": {
      "bed": 0.4812702,
      "studio": 0.3967694,...
    },
    "description": "My guests will enjoy easy access to everything from this centrally located place.",
    "description_embedding": {
      "parking": 0.5157142,
      "studio": 0.2493607,...
    },
    "amenities": ["Mosquito net", "Dishes and silverware", "N/A gas stove", "Refrigerator", "Babysitter recommendations", "Children's books and toys", "N/A  oven", "Air conditioning", "Toaster", "Wifi", "TV", "Security cameras on property", "Long term stays allowed", "Kitchen", "Wine glasses", "Hot water", "Rice maker", "Carbon monoxide alarm", "Bathtub", "Laundromat nearby", "Essentials", "Baking sheet", "Extra pillows and blankets", "Clothing storage", "Free parking on premises", "Smoke alarm", "Paid parking garage off premises", "Hangers", "N/A  conditioner", "Fire extinguisher", "Private hot tub", "Cleaning products", "Dining table", "Dedicated workspace", "Blender", "Safe", "Cooking basics", "Freezer", "Bed linens", "Hair dryer", "Iron", "Window guards", "Fireplace guards", "Coffee", "Heating", "N/A shampoo", "Microwave", "Free street parking"],
    "amenities_embedding": {
      "2000": 0.1171638,
      "tub": 0.8834068,...
    },
    "location": {
      "lon": -73.93711,
      "lat": 40.8015
    }
}
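The pipeline and mapping described above could be sketched roughly as follows. This is a minimal sketch rather than the notebook's exact code: the helper names `build_elser_pipeline` and `build_listing_mappings` are hypothetical, the `input_output` form of the inference processor assumes a recent Elasticsearch 8.x release, and list-valued fields such as amenities may need to be joined into a single string before inference.

```python
ELSER_MODEL_ID = ".elser_model_2_linux-x86_64"  # model ID used in this post's queries
TEXT_FIELDS = ["name", "description", "amenities"]


def build_elser_pipeline(fields=TEXT_FIELDS, model_id=ELSER_MODEL_ID):
    """Return ingest processors that write ELSER tokens to <field>_embedding."""
    return [
        {
            "inference": {
                "model_id": model_id,
                "input_output": [
                    {"input_field": f, "output_field": f"{f}_embedding"}
                    for f in fields
                ],
            }
        }
    ]


def build_listing_mappings(fields=TEXT_FIELDS):
    """Return index mappings: text + sparse_vector per field, plus a geo_point."""
    properties = {"location": {"type": "geo_point"}}
    for f in fields:
        properties[f] = {"type": "text"}
        properties[f"{f}_embedding"] = {"type": "sparse_vector"}
    return {"properties": properties}
```

With the official Python client, these bodies would typically be passed to `client.ingest.put_pipeline(id=..., processors=...)` and `client.indices.create(index=..., mappings=...)`; the pipeline and index names are up to you.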

Leveraging ELSER, we have generated the vector embeddings for the amenities, names, and descriptions. The embeddings help us build a better search experience by letting us find both explicit and related matches. Users might search for "Close to a garden", meaning any park, garden, or recreational area. The answer to this query can therefore contain Central Park, Botanical Garden, Convent Garden, and many other green areas. Below, a user is searching for "Next to Central Park and Empire State Building".

{
  "text_expansion": {
      "description_embedding": {
          "model_id": ".elser_model_2_linux-x86_64",
          "model_text": "Next to Central Park and Empire State Building"
      }
  }
}

This searches against the embeddings of the description field. It will of course be most accurate for Airbnb listings that say "Close to Empire State Building" and/or mention Central Park. Depending on the semantic search capability, it can also find listings that are close to these locations without mentioning them in the description. ELSER might know that Bow Bridge is a scenic bridge located within Central Park, so a description with "Only a short walk to the iconic Bow Bridge" might also appear in the results.

In Python, the entire code needed looks like this:

response = client.search(
    index="airbnb-*",
    size=10,
    query={
        "text_expansion": {
            "description_embedding": {
                "model_id": ".elser_model_2_linux-x86_64",
                "model_text": "Next to Central Park and Empire State Building",
            }
        }
    },
)

This returns the top 10 Airbnb listings whose descriptions best match being close to Central Park and the Empire State Building. The results are sorted by relevance, not by any geographical measurement.

The next step is to focus on proper geospatial search; all the details are in the Jupyter Notebook. There are a couple of search types to discuss before going deeper. Among the available geo queries, geo_bounding_box and geo_distance are the ones we are going to focus on today.

The geo_distance query is pretty simple. Given a certain geo point (it is also possible to use geo shapes, but that's for another blog post), you can search for all documents within a certain radius around that point.

GET /airbnb-listings/_search
{
  "query": {
    "geo_distance": {
      // The maximum radius around the location
      "distance": "1km",
      // The location of the `Empire State Building` from where you want to calculate the 1km radius.
      "location": {
        "lat": 40.74,
        "lon": -73.98
      } 
    }
  }
}
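To build some intuition for what "within a certain radius" means, here is a small Python illustration of a great-circle (haversine) distance check. Elasticsearch performs this kind of computation for you; the sketch below is only an illustration of the idea, not its implementation, and the helper names are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/lon points, in kilometers."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def within_radius(point, center, radius_km):
    """True if `point` lies within `radius_km` of `center`."""
    d = haversine_km(point["lat"], point["lon"], center["lat"], center["lon"])
    return d <= radius_km


empire = {"lat": 40.74, "lon": -73.98}        # rounded coordinates from the query above
listing = {"lat": 40.8015, "lon": -73.93711}  # the sample listing shown earlier
print(within_radius(listing, empire, 1.0))    # False: the listing is several km away
```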

The geo_bounding_box query is a bit more complex. You supply two corner points, for example top_left and bottom_right, that describe a rectangular bounding box. This helps us answer questions like "Find me all Airbnb between Empire State Building and Central Park Bow Bridge" and ensures that any Airbnb north of the Bow Bridge is excluded from the search result. This is what such a search looks like when visualised.

[Map: the bounding box between the Empire State Building and Central Park Bow Bridge]

GET /airbnb-listings/_search
{
  "query": {
    "geo_bounding_box": {
      "location": {
        "top_left": {
          "lat": 40.77,   // lat of Bow Bridge
          "lon": -73.98   // lon of Empire State
        },
        "bottom_right": {
          "lat": 40.74,   // lat of Empire State
          "lon": -73.97   // lon of Bow Bridge
        }
      }
    }
  }
}
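The query above relies on knowing that Bow Bridge lies north-east of the Empire State Building. A small helper can derive the corners from any two points instead. This is a minimal sketch: `bounding_box_query` is a hypothetical name, and it does not handle boxes that cross the antimeridian.

```python
def bounding_box_query(point_a, point_b, field="location"):
    """Build a geo_bounding_box query spanning two points, in any order."""
    top = max(point_a["lat"], point_b["lat"])      # northernmost latitude
    bottom = min(point_a["lat"], point_b["lat"])   # southernmost latitude
    left = min(point_a["lon"], point_b["lon"])     # westernmost longitude
    right = max(point_a["lon"], point_b["lon"])    # easternmost longitude
    return {
        "geo_bounding_box": {
            field: {
                "top_left": {"lat": top, "lon": left},
                "bottom_right": {"lat": bottom, "lon": right},
            }
        }
    }


empire = {"lat": 40.74, "lon": -73.98}
bow_bridge = {"lat": 40.77, "lon": -73.97}
query = bounding_box_query(empire, bow_bridge)
```

Swapping the two arguments yields the same query, so the caller does not need to know which landmark is further north or west.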

This is a very simple example of how to do geospatial search with Elasticsearch. There are many more things you can do, like adding a sort parameter to sort by distance, or adding a filter to filter out certain Airbnb properties that do not meet your requirements, such as too expensive ones or ones with missing amenities.
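As a rough sketch of those extensions, the helper below builds a request body that combines a radius filter with optional price and amenity filters and sorts by distance. The function name is hypothetical, and so are the assumptions that the index has a numeric `price` field and a text `amenities` field; adjust both to your own mapping.

```python
def geo_search_body(center, distance, max_price=None, required_amenities=()):
    """Build a search body: radius filter, optional extra filters, distance sort."""
    filters = [{"geo_distance": {"distance": distance, "location": center}}]
    if max_price is not None:
        # Assumption: the listing has a numeric `price` field.
        filters.append({"range": {"price": {"lte": max_price}}})
    for amenity in required_amenities:
        # Assumption: `amenities` is indexed as a text field.
        filters.append({"match_phrase": {"amenities": amenity}})
    return {
        "query": {"bool": {"filter": filters}},
        "sort": [
            {"_geo_distance": {"location": center, "unit": "km", "order": "asc"}}
        ],
    }


body = geo_search_body(
    {"lat": 40.74, "lon": -73.98}, "1km", max_price=200, required_amenities=["Wifi"]
)
```

The `query` and `sort` parts can then be passed to `client.search(index=..., query=body["query"], sort=body["sort"])`.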

Since we indexed the points of interest as documents in Elasticsearch, we do not need to specify the latitude and longitude manually as above. Instead, we can use the terms query to look up the points of interest by name. Below is an example that fetches the Empire State Building and Central Park Bow Bridge and uses their locations in a geo_bounding_box query.

# We first grab the location of the Empire State Building and Central Park Bow Bridge
response = client.search(
    index="points-of-interest",
    size=2,
    query={
        "terms": {
            "name": ["central park bow bridge", "empire state building"]
        }
    },
)

# for easier access we store the locations in two variables
central = {}
empire = {}
for hit in response["hits"]["hits"]:
    hit = hit["_source"]
    if "central park bow bridge" in hit["name"]:
        central = hit["location"]
    elif "empire state building" in hit["name"]:
        empire = hit["location"]

# Now we can run the geo_bounding_box query and sort it by the 
# distance first to Central Park Bow Bridge
# and then to the Empire State Building.
response = client.search(
    index="airbnb-*",
    size=50,
    query={
        "geo_bounding_box": {
          "location": {
              "top_left": {
                  "lat": central["lat"],
                  "lon": empire["lon"]
              },
              "bottom_right": {
                  "lat": empire["lat"],
                  "lon": central["lon"]
              }
          }
        }
    },
    sort=[
        {
            "_geo_distance": {
                "location": {
                  "lat": central["lat"],
                  "lon": central["lon"]
                },
                "unit": "km",
                "distance_type": "plane",
                "order": "asc"
            }
        },
        {
            "_geo_distance": {
                "location": {
                  "lat": empire["lat"],
                  "lon": empire["lon"]
                },
                "unit": "km",
                "distance_type": "plane",
                "order": "asc"
            }
        }
    ]
)

Asking the LLM to extract entities

Now that we understand how geospatial searches work, we can add the GenAI and LLM part to the picture. The idea is to have a search box, chatbot, or anything else you fancy, in which you can ask in natural language for Airbnb properties close to a certain location or locations. In the Jupyter Notebook, we rely on OpenAI's GPT-3.5 Turbo to extract the information from the natural language and turn it into JSON that is then parsed and processed.

Question: Get me the closest Airbnb within 1 mile of the Empire State Building

question="""
As an expert in named entity recognition machine learning models, I will give you a sentence from which I would like you to extract what needs to be found (location, apartment, airbnb, sight, etc) near which location and the distance between them. The distance needs to be a number expressed in kilometers. I would like the result to be expressed in JSON with the following fields: "what", "near", "distance_in_km". Only return the JSON.
Here is the sentence: "Get me the closest Airbnb within 1 mile of the Empire State Building"
"""

answer = oai_client.completions.create(prompt=question, model=model, max_tokens=100)
print(answer.choices[0].text)
# Output below
{
    "what": "Airbnb",
    "near": "Empire State Building",
    "distance_in_km": 1.610
}
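LLM output is not guaranteed to be clean JSON, so it is worth validating it before building a query from it. The following is a minimal sketch of such a check; `parse_extraction` is a hypothetical helper and not part of the notebook or of any library.

```python
import json

REQUIRED_FIELDS = ("what", "near", "distance_in_km")


def parse_extraction(raw_text):
    """Extract and validate the JSON object contained in the model's answer."""
    # Tolerate extra text around the object by slicing from the first "{" to the last "}".
    start = raw_text.find("{")
    end = raw_text.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in model output")
    data = json.loads(raw_text[start:end + 1])

    missing = [key for key in REQUIRED_FIELDS if key not in data]
    if missing:
        raise ValueError(f"missing fields: {missing}")

    distance = float(data["distance_in_km"])
    if distance <= 0:
        raise ValueError("distance_in_km must be positive")
    data["distance_in_km"] = distance
    return data


answer_text = '{"what": "Airbnb", "near": "Empire State Building", "distance_in_km": 1.610}'
extraction = parse_extraction(answer_text)
```

Raising an error early keeps a malformed answer from turning into a confusing Elasticsearch error further down the line.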

Resolving points of interest

We first run a search against the points-of-interest index and extract the geolocation for the Empire State Building. Next up we run a search against the airbnb-listings index and use the geo_distance query to find all Airbnb properties within 1 mile of the Empire State Building. We then sort the results by distance and return the closest one, as shown below:

Distance to Empire State Building: 0.004002111094864837 km
Title: Comfort and Convenience! 2 Units Near Bryant Park!

Distance to Empire State Building: 0.011231615140053008 km
Title: Relax and Recharge! 3 Relaxing Units, Pets Allowed
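The two-step lookup described above could look roughly like this in Python. It is a sketch under a few assumptions: `find_closest_listings` is a hypothetical helper, the index names follow the ones used in this post, and the points of interest are assumed to be searchable by their lowercased name, as in the earlier terms query.

```python
def find_closest_listings(client, extraction, size=10):
    """Resolve the landmark first, then search listings around it by distance."""
    poi = client.search(
        index="points-of-interest",
        size=1,
        query={"terms": {"name": [extraction["near"].lower()]}},
    )
    poi_hits = poi["hits"]["hits"]
    if not poi_hits:
        return []  # unknown landmark; the caller decides how to report this
    center = poi_hits[0]["_source"]["location"]

    listings = client.search(
        index="airbnb-listings",
        size=size,
        query={
            "geo_distance": {
                "distance": f"{extraction['distance_in_km']}km",
                "location": center,
            }
        },
        sort=[{"_geo_distance": {"location": center, "unit": "km", "order": "asc"}}],
    )
    # Each hit's first sort value is its distance to the landmark in km.
    return [
        (hit["sort"][0], hit["_source"]["name"])
        for hit in listings["hits"]["hits"]
    ]
```

Here `client` is expected to be an `Elasticsearch` client instance and `extraction` the parsed JSON returned by the LLM.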

Finding accessible subway stations

We now have a working geospatial search powered by GenAI and Elasticsearch, but we can take this even further! At the beginning, we also ingested the list of subway stations. Each station has a field called ADA, short for Americans with Disabilities Act, whose value can be:

  • 0: Not accessible
  • 1: Fully accessible
  • 2: Partially accessible

Combining the different datasets, we can search for Airbnb properties that are fully accessible, provide elevator access, handicapped bathrooms, and many more amenities, and ensure that the Airbnb is as close as possible to a fully accessible subway station.

The user question might be: "Find me all Airbnb properties within 250m in Manhattan near the Empire State Building that are fully accessible and close to a subway station that is fully accessible." The LLM extracts the information, and we can run the search against the airbnb-listings and mta-stations indexes to find the best Airbnb for the user.

{
    "what": "Airbnb",
    "near": "Empire State Building",
    "accessibility": "fully accessible",
    "distance_in_km": 0.250
}
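To turn this extraction into query parameters, the accessibility phrase has to be mapped to an ADA code and the distance to a distance string. A minimal sketch, with hypothetical helper names and assuming the LLM returns one of the three phrases listed above:

```python
# ADA codes as described for the subway station dataset.
ADA_CODES = {
    "not accessible": 0,
    "fully accessible": 1,
    "partially accessible": 2,
}


def ada_filter(accessibility):
    """Translate an accessibility phrase into a term filter on the ADA field."""
    key = accessibility.strip().lower()
    if key not in ADA_CODES:
        raise ValueError(f"unknown accessibility level: {accessibility!r}")
    return {"term": {"ADA": ADA_CODES[key]}}


def km_to_distance_string(distance_in_km):
    """Convert kilometers to a whole-meter distance string such as '250m'."""
    return f"{round(distance_in_km * 1000)}m"


extraction = {
    "what": "Airbnb",
    "near": "Empire State Building",
    "accessibility": "fully accessible",
    "distance_in_km": 0.250,
}
station_filter = ada_filter(extraction["accessibility"])
radius = km_to_distance_string(extraction["distance_in_km"])
```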

We can build the following query that searches for fully accessible subway stations next to the Empire State Building. If we do not find any, we can notify the user and tell them that there are no fully accessible subway stations near the Empire State Building and that we should pick another sight.

GET mta-stations/_search
{
  "query": {
    "bool": {
      "filter": [
        // Subway stations have `fully accessible` as `ADA` 1 value.
        {
          "term": {
            "ADA": 1
          }
        }
      ],
      "must": [
        {
          "geo_distance": {
            // The distance, 250m as in the prompt.
            "distance": "250m",
            // The location of the `Empire State Building` from where you want to calculate the 250m radius.
            "location": {
              "lon": -73.985322454067,
              "lat": 40.74842927376084
            }
          }
        }
      ]
    }
  },
  "sort": [
    {
      "_geo_distance": {
        "location": {
          "lon": -73.985322454067,
          "lat": 40.74842927376084
        },
        "unit": "km",
        "distance_type": "plane",
        "order": "asc"
      }
    }
  ]
}

This returns all the subway stations that are fully accessible and within 250m of the Empire State Building. The result is sorted by distance, so the closest subway station comes first. This is a good illustration of an issue we can run into with geospatial search: a hard radius. We might not find anything within the 250m range even though there is a station just a few meters beyond it that we would still consider. That's why we can run a follow-up query that extends the distance to 300m. Running the query a second time with the adapted distance returns the stop named '34 St-Herald Sq', 254 meters from the Empire State Building.
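The retry with a larger radius can be automated. The sketch below is one possible approach, not the notebook's code: `search_with_expanding_radius` is a hypothetical helper, and the step size and upper limit are arbitrary choices. It takes any callable that runs the search for a given radius in meters and returns a list of hits.

```python
def search_with_expanding_radius(run_search, start_m=250, step_m=50, max_m=500):
    """Call run_search(radius_m) with growing radii until it returns hits.

    Returns (radius_m, hits) for the first non-empty result,
    or (None, []) if nothing is found up to max_m.
    """
    radius = start_m
    while radius <= max_m:
        hits = run_search(radius)
        if hits:
            return radius, hits
        radius += step_m
    return None, []


# Stand-in for the real query: a station that is 254 m away only shows up
# once the radius reaches at least 254 m.
def fake_station_search(radius_m):
    return ["34 St-Herald Sq"] if radius_m >= 254 else []


radius, stations = search_with_expanding_radius(fake_station_search)
print(radius, stations)  # 300 ['34 St-Herald Sq']
```

In a real application, `run_search` would wrap the mta-stations query above and pass `f"{radius_m}m"` as the `distance`. It is also worth telling the user that the radius was widened.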

Putting it all together

Now that we have a fully accessible subway station near the Empire State Building, we can run the following query against the airbnb-listings index to find all fully accessible Airbnb properties within 250m of the Empire State Building. We sort the results first by the distance to the Empire State Building and then by the distance to the subway station.

GET airbnb-listings/_search
{
  "query": {
    "bool": {
      "filter": [
        // Airbnb listings that are `fully accessible`
        {
          "text_expansion": {
            "amenities_embedding": {
              "model_id": ".elser_model_2_linux-x86_64",
              "model_text": "fully accessible"
            }
          }
        }
      ],
      "must": [
        {
          "geo_distance": {
            // The distance, 250m as in the prompt.
            "distance": "250m",
            // The location of the `Empire State Building` from where you want to calculate the 250m radius.
            "location": {
              "lon": -73.985322454067,
              "lat": 40.74842927376084
            }
          }
        }
      ]
    }
  },
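  "sort": [
    // First sort key: distance to the `Empire State Building`.
    {
      "_geo_distance": {
        "location": {
          "lon": -73.985322454067,
          "lat": 40.74842927376084
        },
        "unit": "km",
        "distance_type": "plane",
        "order": "asc"
      }
    },
    // Second sort key: distance to the subway station.
    // Note: the coordinates below repeat the Empire State Building's;
    // replace them with the location of the station found above (`34 St-Herald Sq`).
    {
      "_geo_distance": {
        "location": {
          "lon": -73.985322454067,
          "lat": 40.74842927376084
        },
        "unit": "km",
        "distance_type": "plane",
        "order": "asc"
      }
    }
  ]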
}

This query now lists all Airbnb properties that match our ELSER-powered search for fully accessible amenities. In the answer, the sort object contains two values: 0.004 and 0.254, which are in km. So, the Airbnb is 4 meters away from the Empire State Building and 254 meters away from the subway station. This is a great result, and we can now return it to the user and let them make a decision.
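To present such a hit to the user, the sort values can be converted into meters and labeled. A minimal sketch, assuming the hit has a `name` field in `_source` and a `sort` array in the same order as the sort clauses; `describe_hit` is a hypothetical helper.

```python
def describe_hit(hit, labels=("Empire State Building", "subway station")):
    """Render a hit's km sort values as readable distances in meters."""
    parts = [
        f"{round(km * 1000)} m from the {label}"
        for label, km in zip(labels, hit["sort"])
    ]
    title = hit["_source"].get("name", "<untitled>")
    return f"{title}: " + ", ".join(parts)


# Illustrative hit using the sort values discussed above; the name is made up.
example_hit = {"_source": {"name": "Example listing"}, "sort": [0.004, 0.254]}
print(describe_hit(example_hit))
# Example listing: 4 m from the Empire State Building, 254 m from the subway station
```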

Conclusion

We walked you through a lot of different tasks and ideas in this blog post. We started with the basics of geospatial search, then added the GenAI and LLM part to it, and finally combined the two to create a powerful search experience. We hope you enjoyed this blog post and that you learned something new.
