Term Vectors API

The Term Vectors API returns information and statistics on the terms in the fields of a particular document. The document can either be stored in the index or be supplied artificially by the user.

Term Vectors Request

A TermVectorsRequest expects an index, a type and an id to identify a document, plus the fields for which information is retrieved.

TermVectorsRequest request = new TermVectorsRequest("authors", "_doc", "1");
request.setFields("user");

Term vectors can also be generated for artificial documents, that is, for documents not present in the index:

XContentBuilder docBuilder = XContentFactory.jsonBuilder();
docBuilder.startObject().field("user", "guest-user").endObject();
TermVectorsRequest request = new TermVectorsRequest("authors",
    "_doc",
    docBuilder); 

An artificial document is provided as an XContentBuilder object, the Elasticsearch built-in helper to generate JSON content.

Optional arguments

request.setFieldStatistics(false); 
request.setTermStatistics(true); 
request.setPositions(false); 
request.setOffsets(false); 
request.setPayloads(false); 

Map<String, Integer> filterSettings = new HashMap<>();
filterSettings.put("max_num_terms", 3);
filterSettings.put("min_term_freq", 1);
filterSettings.put("max_term_freq", 10);
filterSettings.put("min_doc_freq", 1);
filterSettings.put("max_doc_freq", 100);
filterSettings.put("min_word_length", 1);
filterSettings.put("max_word_length", 10);

request.setFilterSettings(filterSettings);  

Map<String, String> perFieldAnalyzer = new HashMap<>();
perFieldAnalyzer.put("user", "keyword");
request.setPerFieldAnalyzer(perFieldAnalyzer);  

request.setRealtime(false); 
request.setRouting("routing"); 

Set fieldStatistics to false (default is true) to omit document count, sum of document frequencies, and sum of total term frequencies.

Set termStatistics to true (default is false) to display total term frequency and document frequency.

Set positions to false (default is true) to omit the output of positions.

Set offsets to false (default is true) to omit the output of offsets.

Set payloads to false (default is true) to omit the output of payloads.

Set filterSettings to filter the terms that can be returned based on their tf-idf scores.

Set perFieldAnalyzer to specify a different analyzer than the one that the field has.

Set realtime to false (default is true) to retrieve term vectors in near real time.

Set a routing parameter.
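
To make the filtering idea concrete, the following sketch scores terms with a textbook tf-idf formula and then applies thresholds named after three of the filterSettings keys (min_term_freq, max_doc_freq, max_num_terms). It is only an illustration under stated assumptions: the class TfIdfFilterSketch, its methods, and the formula are hypothetical, and the exact scoring and threshold semantics that Elasticsearch applies on the server may differ.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration only; not Elasticsearch's implementation.
final class TfIdfFilterSketch {

    static final class ScoredTerm {
        final String term;
        final int termFreq; // occurrences of the term in the current document
        final int docFreq;  // number of documents that contain the term
        final double score;

        ScoredTerm(String term, int termFreq, int docFreq, int docCount) {
            this.term = term;
            this.termFreq = termFreq;
            this.docFreq = docFreq;
            this.score = tfIdf(termFreq, docFreq, docCount);
        }
    }

    // Textbook tf-idf (an assumed formula): termFreq * ln(docCount / docFreq).
    static double tfIdf(int termFreq, int docFreq, int docCount) {
        return termFreq * Math.log((double) docCount / docFreq);
    }

    // Keeps terms with termFreq >= minTermFreq and docFreq <= maxDocFreq,
    // then returns at most maxNumTerms of them, highest score first.
    static List<ScoredTerm> filter(List<ScoredTerm> terms,
                                   int minTermFreq, int maxDocFreq, int maxNumTerms) {
        List<ScoredTerm> kept = new ArrayList<>();
        for (ScoredTerm t : terms) {
            if (t.termFreq >= minTermFreq && t.docFreq <= maxDocFreq) {
                kept.add(t);
            }
        }
        kept.sort(Comparator.comparingDouble((ScoredTerm t) -> t.score).reversed());
        return new ArrayList<>(kept.subList(0, Math.min(maxNumTerms, kept.size())));
    }

    public static void main(String[] args) {
        int docCount = 100; // made-up corpus size for the demo
        List<ScoredTerm> terms = new ArrayList<>();
        terms.add(new ScoredTerm("the", 5, 100, docCount));
        terms.add(new ScoredTerm("elasticsearch", 2, 10, docCount));
        terms.add(new ScoredTerm("vector", 3, 20, docCount));
        for (ScoredTerm t : filter(terms, 2, 50, 2)) {
            System.out.println(t.term + " " + t.score);
        }
    }
}
```

In this sketch a term that appears in every document gets a score of zero, which is why such terms tend to be filtered out first.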

Synchronous Execution

When executing a TermVectorsRequest in the following manner, the client waits for the TermVectorsResponse to be returned before continuing with code execution:

TermVectorsResponse response =
        client.termvectors(request, RequestOptions.DEFAULT);

Synchronous calls may throw an IOException if the high-level REST client fails to parse the REST response, if the request times out, or in similar cases where no response comes back from the server.

If the server returns a 4xx or 5xx error code, the high-level client instead tries to parse the error details in the response body, then throws a generic ElasticsearchException with the original ResponseException added to it as a suppressed exception.

Asynchronous Execution

A TermVectorsRequest can also be executed asynchronously so that the client call returns immediately. Users specify how the response or potential failures are handled by passing the request and a listener to the asynchronous term-vectors method:

client.termvectorsAsync(request, RequestOptions.DEFAULT, listener); 

The TermVectorsRequest to execute and the ActionListener to use when the execution completes.

The asynchronous method does not block and returns immediately. Once the execution completes, the ActionListener is called back through its onResponse method if the execution succeeded, or through its onFailure method if it failed. Failure scenarios and expected exceptions are the same as in the synchronous execution case.

A typical listener for term-vectors looks like:

listener = new ActionListener<TermVectorsResponse>() {
    @Override
    public void onResponse(TermVectorsResponse termVectorsResponse) {
        
    }
    @Override
    public void onFailure(Exception e) {
        
    }
};

Called when the execution is successfully completed.

Called when the whole TermVectorsRequest fails.
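
One common way to consume such a callback is to bridge it to a CompletableFuture, so that calling code can wait for or chain on the result. The sketch below shows the pattern with the standard library only. SketchListener is a hypothetical stand-in that mirrors the two callback methods shown above so the example compiles without the Elasticsearch libraries, and ListenerBridgeSketch is likewise an invented name, not part of the client.

```java
import java.util.concurrent.CompletableFuture;

// Hypothetical stand-in for the client's ActionListener; it only mirrors
// the onResponse/onFailure pair described above.
interface SketchListener<T> {
    void onResponse(T response);
    void onFailure(Exception e);
}

final class ListenerBridgeSketch {

    // Returns a listener that completes the given future from either callback.
    static <T> SketchListener<T> completing(CompletableFuture<T> future) {
        return new SketchListener<T>() {
            @Override
            public void onResponse(T response) {
                future.complete(response);
            }

            @Override
            public void onFailure(Exception e) {
                future.completeExceptionally(e);
            }
        };
    }

    public static void main(String[] args) {
        CompletableFuture<String> future = new CompletableFuture<>();
        SketchListener<String> listener = completing(future);
        // Simulates an asynchronous call invoking the listener on another thread.
        new Thread(() -> listener.onResponse("found")).start();
        // Blocks until the listener has been called.
        System.out.println(future.join());
    }
}
```

With the real client, the same idea would apply to a listener typed on TermVectorsResponse, assuming its callbacks match the listener shown above.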

Term Vectors Response

TermVectorsResponse contains the following information:

String index = response.getIndex(); 
String type = response.getType(); 
String id = response.getId(); 
boolean found = response.getFound(); 

The index name of the document.

The type name of the document.

The id of the document.

Indicates whether or not the document was found.

Inspecting Term Vectors

If the TermVectorsResponse contains a non-null list of term vectors, more information about each term vector can be obtained as follows:

for (TermVectorsResponse.TermVector tv : response.getTermVectorsList()) {
    String fieldname = tv.getFieldName(); 
    int docCount = tv.getFieldStatistics().getDocCount(); 
    long sumTotalTermFreq =
            tv.getFieldStatistics().getSumTotalTermFreq(); 
    long sumDocFreq = tv.getFieldStatistics().getSumDocFreq(); 
    if (tv.getTerms() != null) {
        List<TermVectorsResponse.TermVector.Term> terms =
                tv.getTerms(); 
        for (TermVectorsResponse.TermVector.Term term : terms) {
            String termStr = term.getTerm(); 
            int termFreq = term.getTermFreq(); 
            int docFreq = term.getDocFreq(); 
            long totalTermFreq = term.getTotalTermFreq(); 
            float score = term.getScore(); 
            if (term.getTokens() != null) {
                List<TermVectorsResponse.TermVector.Token> tokens =
                        term.getTokens(); 
                for (TermVectorsResponse.TermVector.Token token : tokens) {
                    int position = token.getPosition(); 
                    int startOffset = token.getStartOffset(); 
                    int endOffset = token.getEndOffset(); 
                    String payload = token.getPayload(); 
                }
            }
        }
    }
}

The name of the current field

Field statistics for the current field - document count

Field statistics for the current field - sum of total term frequencies

Field statistics for the current field - sum of document frequencies

Terms for the current field

The name of the term

Term frequency of the term

Document frequency of the term

Total term frequency of the term

Score of the term

Tokens of the term

Position of the token

Start offset of the token

End offset of the token

Payload of the token
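
To illustrate what the term frequency, position, and offset values represent, the following standalone sketch builds a tiny term vector from a string by splitting on whitespace. It does not use the Elasticsearch client; TermVectorSketch, Token, and analyze are hypothetical names, and real analyzers tokenize text in more sophisticated ways than this.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not part of the Elasticsearch client.
final class TermVectorSketch {

    // One occurrence of a term: ordinal position plus character offsets.
    static final class Token {
        final int position;
        final int startOffset; // inclusive
        final int endOffset;   // exclusive

        Token(int position, int startOffset, int endOffset) {
            this.position = position;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
        }
    }

    // Splits on whitespace and records every occurrence of each term.
    // The term frequency of a term is the size of its token list.
    static Map<String, List<Token>> analyze(String text) {
        Map<String, List<Token>> terms = new LinkedHashMap<>();
        int position = 0;
        int i = 0;
        while (i < text.length()) {
            // Skip whitespace between tokens.
            while (i < text.length() && Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            int start = i;
            // Advance to the end of the current token.
            while (i < text.length() && !Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            if (i > start) {
                String term = text.substring(start, i);
                terms.computeIfAbsent(term, k -> new ArrayList<>())
                     .add(new Token(position++, start, i));
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        Map<String, List<Token>> terms = analyze("guest user guest");
        for (Map.Entry<String, List<Token>> e : terms.entrySet()) {
            System.out.println(e.getKey() + " termFreq=" + e.getValue().size());
            for (Token t : e.getValue()) {
                System.out.println("  position=" + t.position
                        + " offsets=[" + t.startOffset + "," + t.endOffset + ")");
            }
        }
    }
}
```

For the input "guest user guest", the term "guest" has a term frequency of 2, with tokens at positions 0 and 2.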