When and How To Percolate - Part 1

As we move beyond search at Elastic, we are seeing a growing number of requirements to take action based on changes in your data. Back in June, we released the first version of Watcher to bring alerting and notification functionality to the Elastic product stack, based on the principle that users and administrators should be able to alert on anything for which they can create a query.

Many of you who are familiar with the existing capabilities of Elasticsearch have perceived some functional overlap with the percolate API [reference: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-percolate.html] and expressed confusion as to when each is applicable.

Although both match documents against queries and can be used to create alerting-type functionality, they each provide discrete capabilities and solve distinct use cases. This post aims to clarify when each is appropriate, before discussing the sizing and scaling considerations for percolator.

Should I Watch or Percolate?

Traditionally, you design documents based on your data, store them in an index, and then define queries via the search API in order to retrieve them. Percolator works in the opposite direction. First, you store queries in an index and then, via the percolate API, you match individual documents against these queries. Percolator provides the capability to ascertain which queries a discrete document would match. It does not consider the corpus of documents and thus does not allow the observation of trends (e.g. using aggregations to spot whether the average volume of requests has gone up in the last hour).
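
As a minimal sketch of this inversion (the index name "notes", document type "note", and field "body" are purely illustrative), a query is first registered under the reserved .percolator type, and a document is then sent to the percolate API to discover which registered queries it matches:

PUT /notes/.percolator/reminder
{
  "query": {
    "match": {
      "body": "percolator sizing"
    }
  }
}

GET /notes/note/_percolate
{
  "doc": {
    "body": "Notes on percolator sizing and scaling"
  }
}

The response to the second request lists the ids of the matching registered queries (here, "reminder"); the document itself is never indexed unless you choose to index it separately.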

A common misconception for those new to Percolator is that alerting is a side-effect of inserting documents, similar to a database trigger. This is not the case.

There is no requirement to index percolated documents and any alerting-type functionality is left to the user. Percolator is also not limited to alerting use cases. For example, queries can represent categories and the percolation process is then used to classify documents which may or may not be inserted into Elasticsearch.

By indexing the queries, Percolator demonstrates linear scalability with respect to the number of registered queries, modulated by the number of concurrent shards processing each request. Because Percolator evaluates a single document at a time, it is less suited to applications that must compare multiple documents at once. Furthermore, Percolator brings a real-time capability to matching, in contrast to the scheduled nature of Watcher. Percolator is well suited to use cases where the requirement focuses on spotting new documents of interest, in isolation from the corpus, in real time.

Comparatively, Watcher periodically schedules pre-defined queries for execution against a document corpus before sending alerts and notifications on matches. It is able to consider either properties of the entire corpus or a subset of documents (e.g. those recently indexed, thus providing trend identification). Functionally, Watcher takes this further by allowing the user to take actions on these matches without the requirement to write code. Watcher exhibits excellent scaling properties with respect to both the total document corpus size and the number of new documents. It is, however, less suited to requirements where there is a need for real-time matches.

Consider the following examples:

Use Case 1: An e-commerce website wishes to provide the capability for users to save searches. When a new item is listed, those users whose searches match the product are notified that the item is available for purchase. Approximately 1,000 new products are listed per minute. The site has approximately 3 million users, who on average save 0.5 queries each. Searches for products consist of a product type, key terms, a price range, and a geographical distance. Users should be notified within 1 minute of a new product being listed.

Solution: The high number of queries precludes Watcher without significant hardware investment. The large number of stored queries, the low document throughput, the requirement to alert in near real time, and the matching of documents in isolation mean this use case is well suited to Percolator. A sketch of one such saved search follows.
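
A registered query for one such saved search might look as follows (Elasticsearch 2.0 syntax; the index name and the product_type, price, and location fields are illustrative assumptions, and the listing fields would need appropriate mappings, e.g. location as geo_point):

PUT /listings/.percolator/user-12345-search-1
{
  "query": {
    "bool": {
      "must": {
        "match": { "description": "carbon road bike" }
      },
      "filter": [
        { "term": { "product_type": "bicycle" } },
        { "range": { "price": { "lte": 1500 } } },
        { "geo_distance": { "distance": "50km", "location": { "lat": 51.5, "lon": -0.1 } } }
      ]
    }
  }
}

Percolating each newly listed product against this index then returns the ids of the matching saved searches, identifying the users to notify.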

Use Case 2: The same website as Use Case 1 monitors the number of items purchased per minute and the number of items listed per minute. These counts are measured every 10 minutes and compared to seasonal trends. Should the values fall beneath pre-determined benchmark values, a series of internal actions is initiated.

Solution: The number of queries required to express the above is low. Additionally, the requirement for periodic monitoring and the need to consider more than one document make this a suitable application for Watcher. A sketch of such a watch is shown below.
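
A minimal watch sketch for this requirement (Watcher 1.0-style syntax; the listings index, @timestamp field, and benchmark value of 100 are illustrative assumptions) runs every 10 minutes, counts the items listed over the last 10 minutes, and fires an action when the count falls below the benchmark:

PUT /_watcher/watch/listing_rate
{
  "trigger": {
    "schedule": { "interval": "10m" }
  },
  "input": {
    "search": {
      "request": {
        "indices": [ "listings" ],
        "body": {
          "query": {
            "range": { "@timestamp": { "gte": "now-10m" } }
          }
        }
      }
    }
  },
  "condition": {
    "compare": { "ctx.payload.hits.total": { "lt": 100 } }
  },
  "actions": {
    "notify": {
      "logging": { "text": "Only {{ctx.payload.hits.total}} items listed in the last 10 minutes" }
    }
  }
}

In practice, the logging action would be replaced with email or webhook actions to initiate the internal processes described above.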

Next Steps

Once a user has selected the tool appropriate to their problem, the topic of scaling and sizing is then typically raised. With respect to Watcher, this evolves into the standard search sizing problem with a few extra dimensions: number of queries, query complexity, and frequency of scheduling. Percolator has some interesting scaling challenges and provides several tools for overcoming these. For the remainder of this post we will focus on the general considerations when scaling Percolator, before discussing techniques used to size and optimise performance next week. In future parts of this series, we will investigate the Watcher side of things.

Scaling Percolator

Registering a query with Percolator creates a document in a specific format, in an arbitrary index under a reserved type “.percolator”. Per normal search behaviour in Elasticsearch, and assuming the index has been sharded, each shard for the index receives a subset of these documents. Additionally, each shard keeps an in-memory store of parsed Lucene queries for those it has been assigned. On receiving a document for percolation, a node broadcasts this to every shard in the index - potentially utilising a replica to take advantage of available resources.

For each shard, the document specified in the request is parsed into a Lucene document and stored in an in-memory index. The queries are in turn executed sequentially against this in-memory index, as required, using a single thread. This happens on each shard on which the percolate request executes. Assuming all percolator queries are broadly equal in cost, we can expect response times that are linear with respect to the number of queries stored on a shard.

Using a single shard and node on representative production hardware, begin your sizing process by identifying the required latency for a single document percolation. In a controlled manner, replay a set of documents against a percolator index containing a sample of N pre-optimised registered queries. The number of queries used here should be a subset of the intended total. Tools such as JMeter [reference: https://jmeter.apache.org/] allow this to be performed with simple configuration, assuming documents are prepared in a simple CSV file. Documents can be fed sequentially in a single JMeter thread, recording the average response time. This test can be repeated, resetting the index each time, with increasing percolator query volumes. Any testing here should be controlled and consistent to minimize factors that may skew results. For example, utilize the same documents, hardware, and configuration between tests.

For the purposes of this example, we utilized a public corpus of Best Buy queries available here [reference: https://www.kaggle.com/c/acm-sf-chapter-hackathon-...]. This dataset, consisting of product listings and user queries, was originally made public for the purposes of a prediction challenge. We utilized two specific files from this dataset:

  • A training CSV file (train.csv) containing approximately 1.8 million entries. Each line represents a user clicking on an item. The information captured includes the item selected, its category, and the search term used to locate the item.
  • A product_data.tar containing a folder of XML files, each holding a set of product entries. These provide significantly more detail than a percolation test requires; for our purposes, we utilized the product name, id, category, and description fields.

Rather than solving the prediction challenge, we treated each user click as a saved percolator query. We assumed the common E-Commerce use case of users wishing to be alerted when a new product is available. For this purpose, we discarded the product id and created a percolation query for each user click consisting of the product category id and their search terms.

The above training file was converted to queries using a script. Each query attempts to replicate a user's search, with a simple search and category filter (i.e. each query would allow a user to be alerted for a product containing terms X in category Y).  

PUT /best_buy/.percolator/1
{
  "query": {
    "bool": {
      "must": {
        "multi_match": {
          "query": "projector screen",
          "fields": [
            "description",
            "name",
            "search_terms"
          ]
        }
      },
      "filter": {
        //term filter
      }
    }
  }
}

Figure 1 - Percolator Query Example.  The above example utilizes Elasticsearch 2.0 syntax.  See https://www.elastic.co/guide/en/elasticsearch/refe... for changes to 1.x.

Practically, this query is likely to be more complex with recall and precision considerations. For the purpose of demonstrating scaling properties we simply utilized a multi_match query.  

500 products were selected from the product data and converted to the appropriate percolate doc format. Each document consisted of a name, a description (if available), and a list of categories to which it belonged. We also enriched each document with search terms commonly used to locate it (taken from the query data), to ensure each document matched queries when percolated.
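
Each percolation request then took a form similar to the following (field values here are invented for illustration, and the product mapping type is an assumption), with the document supplied inline rather than indexed:

GET /best_buy/product/_percolate
{
  "doc": {
    "name": "120-inch portable projector screen",
    "description": "Lightweight tripod projection screen suitable for home theater use",
    "category": [ "Projector Screens", "Home Theater" ],
    "search_terms": "projector screen"
  }
}

The matches array in the response lists the ids of the registered queries satisfied by the document; for the tests below, only the response time was recorded.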

The 500 documents were percolated against query sets of increasing size. 10 tests were performed, starting with an index of 100,000 queries. For each test, the number of queries was increased by 100,000 up to a maximum of 1 million. Each test was therefore a superset of all the previous tests. The following illustrates the linear performance relationship of Percolator:


Figure 2 - Execution time vs Number of Queries

For the purposes of this example, we assumed we needed all document percolations returned within 1 second. The above relationship indicates that each shard can evaluate at most roughly 250,000 percolation queries and still achieve this latency. As an illustration, on this hardware the roughly 1.5 million saved searches implied by Use Case 1 (3 million users × 0.5 queries each) would therefore need to be spread across at least six primary shards (1,500,000 / 250,000).

For the above tests, we have assumed sufficient memory for all nodes - memory pressure and garbage collection are therefore not considered.

Once we have identified the number of percolation queries a single shard can evaluate within the available time, we have several tools for reducing the number of queries each shard must consider to this desired figure. In next week's follow-up blog we'll discuss these and other recommended strategies for improving Percolator performance.

Closing Considerations

The above delineation between Watcher and Percolator highlighted the scheduled nature of Watcher as one of the defining characteristics of this plugin. This behavior is achieved through an extensible trigger architecture, which is currently implemented only by a time/calendar-based trigger. In the future, we will extend it to support additional use cases, including real-time alerting. For example, a trigger could potentially execute on a file change. Furthermore, triggers could be based on percolation results (i.e. a percolation match triggers a watch!). These components can therefore also be complementary.