Get term vector information | Elasticsearch API documentation

Get term vector information Generally available

POST /{index}/_termvectors/{id}

Api key auth Basic auth Bearer auth

All methods and paths for this operation:

GET /{index}/_termvectors

POST /{index}/_termvectors

GET /{index}/_termvectors/{id}

POST /{index}/_termvectors/{id}

Get information and statistics about terms in the fields of a particular document.

You can retrieve term vectors for documents stored in the index or for artificial documents passed in the body of the request. You can specify the fields you are interested in through the fields parameter or by adding the fields to the request body. For example:

GET /my-index-000001/_termvectors/1?fields=message

Fields can be specified using wildcards, similar to the multi match query.

Term vectors are real-time by default, not near real-time. To change this, set the realtime parameter to false.

You can request three types of values: term information, term statistics, and field statistics. By default, all term information and field statistics are returned for all fields but term statistics are excluded.

Term information

term frequency in the field (always returned)
term positions (positions: true)
start and end offsets (offsets: true)
term payloads (payloads: true), as base64 encoded bytes

If the requested information wasn't stored in the index, it will be computed on the fly if possible. Term vectors can also be computed for documents that do not exist in the index but are instead provided by the user.

Start and end offsets assume UTF-16 encoding is being used. If you want to use these offsets in order to get the original text that produced this token, you should make sure that the string you are taking a sub-string of is also encoded using UTF-16.
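Python strings are indexed by code point rather than by UTF-16 code unit, so slicing a string directly with these offsets can go wrong once the text contains characters outside the Basic Multilingual Plane. The following is a minimal sketch of one way to apply the offsets. The helper name `slice_by_utf16_offsets` and the sample text are illustrative assumptions, not part of the API.

```python
def slice_by_utf16_offsets(text: str, start_offset: int, end_offset: int) -> str:
    """Return the substring covered by UTF-16 code-unit offsets.

    Hypothetical helper: encodes to UTF-16 (little endian, no BOM), where
    every code unit is exactly 2 bytes, slices the bytes, and decodes back.
    """
    encoded = text.encode("utf-16-le")
    return encoded[2 * start_offset:2 * end_offset].decode("utf-16-le")


# Sample text (an assumption): the emoji occupies two UTF-16 code units,
# so in UTF-16 terms the word "test" spans offsets 3 to 7.
text = "\U0001F600 test"

utf16_slice = slice_by_utf16_offsets(text, 3, 7)
naive_slice = text[3:7]  # code-point slicing is off by one here

print(repr(utf16_slice), repr(naive_slice))
```

For text that stays within the Basic Multilingual Plane, the two ways of slicing give the same result.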

Behaviour

The term and field statistics are not exact. Deleted documents are not taken into account, and the information is retrieved only from the shard where the requested document resides. The statistics are therefore useful only as relative measures; the absolute numbers have no meaning in this context.

By default, when you request term vectors for artificial documents, the shard to get the statistics from is selected at random. To target a particular shard, use routing.

Refer to the linked documentation for detailed examples of how to use this API.
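To make the term and field statistics concrete, the following sketch computes them for a tiny in-memory corpus. The two-document corpus, the helper name `toy_statistics`, and the lowercase whitespace tokenization are assumptions chosen for illustration. Elasticsearch derives these values per shard from the index, using the analyzer configured for the field.

```python
from collections import Counter


def toy_statistics(docs, term):
    """Compute field and term statistics for one field over a list of texts.

    Tokenization is a lowercase whitespace split, which is an assumption
    for illustration and not the behaviour of any particular analyzer.
    """
    # One Counter per document: term -> frequency within that document.
    per_doc = [Counter(text.lower().split()) for text in docs]
    # Only documents that actually have tokens in the field contribute.
    with_field = [counts for counts in per_doc if counts]

    return {
        # Field statistics
        "doc_count": len(with_field),
        "sum_doc_freq": sum(len(counts) for counts in with_field),
        "sum_ttf": sum(sum(counts.values()) for counts in with_field),
        # Term statistics for the requested term
        "doc_freq": sum(1 for counts in with_field if term in counts),
        "ttf": sum(counts[term] for counts in with_field),
    }


# Hypothetical corpus of two documents for a single field.
corpus = ["test test test", "Another test ..."]
stats = toy_statistics(corpus, "test")
print(stats)
```

With this corpus the field statistics are a doc_count of 2, a sum_doc_freq of 4, and a sum_ttf of 6, and the term test has a doc_freq of 2 and a ttf of 4.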

Required authorization

Index privileges: read

External documentation

Path parameters

index string Required

The name of the index that contains the document.
id string Required

A unique identifier for the document.

Query parameters

fields string | array[string]

A comma-separated list of fields or wildcard expressions to include in the statistics. It is used as the default list unless a specific field list is provided in the completion_fields or fielddata_fields parameters.
field_statistics boolean
If true, the response includes:
- The document count (how many documents contain this field).
- The sum of document frequencies (the sum of document frequencies for all terms in this field).
- The sum of total term frequencies (the sum of total term frequencies of each term in this field).
offsets boolean

If true, the response includes term offsets.
payloads boolean

If true, the response includes term payloads.
positions boolean

If true, the response includes term positions.
preference string

The node or shard the operation should be performed on. It is random by default.
realtime boolean

If true, the request is real-time as opposed to near-real-time.
routing string | array[string]

A custom value that is used to route operations to a specific shard.
term_statistics boolean
If true, the response includes:
- The total term frequency (how often a term occurs in all documents).
- The document frequency (the number of documents containing the current term).
By default these values are not returned since term statistics can have a serious performance impact.
version number

An explicit version number for concurrency control.
version_type string
The version type.

Supported values include:
- internal: Use internal versioning that starts at 1 and increments with each update or delete.
- external: Only index the document if the specified version is strictly higher than the version of the stored document or if there is no existing document.
- external_gte: Only index the document if the specified version is equal to or higher than the version of the stored document or if there is no existing document. NOTE: The external_gte version type is meant for special use cases and should be used with care. If used incorrectly, it can result in loss of data.

application/json

Body

doc object

An artificial document (a document not present in the index) for which you want to retrieve term vectors.
filter object

Filter terms based on their tf-idf scores. This can be useful for finding a good characteristic vector of a document. This feature works in a similar manner to the second phase of the More Like This Query.

External documentation
- max_doc_freq number
  
  Ignore words which occur in more than this many docs. Defaults to unbounded.
- max_num_terms number
  
  The maximum number of terms to return per field.
  
  Default value is 25.
- max_term_freq number
  
  Ignore words with more than this frequency in the source doc. It defaults to unbounded.
- max_word_length number
  
  The maximum word length above which words will be ignored.
  
  Default value is 0 (unbounded).
- min_doc_freq number
  
  Ignore terms which do not occur in at least this many docs.
  
  Default value is 1.
- min_term_freq number
  
  Ignore words with less than this frequency in the source doc.
  
  Default value is 1.
- min_word_length number
  
  The minimum word length below which words will be ignored.
  
  Default value is 0.
per_field_analyzer object

Override the default per-field analyzer. This is useful in order to generate term vectors in any fashion, especially when using artificial documents. When providing an analyzer for a field that already stores term vectors, the term vectors will be regenerated.
- * string Additional properties
fields array[string]

A list of fields to include in the statistics. It is used as the default list unless a specific field list is provided in the completion_fields or fielddata_fields parameters.
field_statistics boolean
If true, the response includes:
- The document count (how many documents contain this field).
- The sum of document frequencies (the sum of document frequencies for all terms in this field).
- The sum of total term frequencies (the sum of total term frequencies of each term in this field).
Default value is true.
offsets boolean

If true, the response includes term offsets.

Default value is true.
payloads boolean

If true, the response includes term payloads.

Default value is true.
positions boolean

If true, the response includes term positions.

Default value is true.
term_statistics boolean
If true, the response includes:
- The total term frequency (how often a term occurs in all documents).
- The document frequency (the number of documents containing the current term).
By default these values are not returned since term statistics can have a serious performance impact.
Default value is false.
routing string | array[string]

A custom value that is used to route operations to a specific shard.

version number

An explicit version number for concurrency control.
version_type string
The version type.

Supported values include:
- internal: Use internal versioning that starts at 1 and increments with each update or delete.
- external: Only index the document if the specified version is strictly higher than the version of the stored document or if there is no existing document.
- external_gte: Only index the document if the specified version is equal to or higher than the version of the stored document or if there is no existing document. NOTE: The external_gte version type is meant for special use cases and should be used with care. If used incorrectly, it can result in loss of data.

Responses

200 application/json
- found boolean Required
- _id string
- _index string Required
- term_vectors object
  - * object Additional properties
    - field_statistics object
      - doc_count number Required
      - sum_doc_freq number Required
      - sum_ttf number Required
    - terms object Required
      - * object Additional properties
        - doc_freq number
        - score number
        - term_freq number Required
        - tokens array[object]
        - ttf number
- took number Required
- _version number Required

POST /{index}/_termvectors/{id}

GET /my-index-000001/_termvectors/1
{
  "fields" : ["text"],
  "offsets" : true,
  "payloads" : true,
  "positions" : true,
  "term_statistics" : true,
  "field_statistics" : true
}

resp = client.termvectors(
    index="my-index-000001",
    id="1",
    fields=[
        "text"
    ],
    offsets=True,
    payloads=True,
    positions=True,
    term_statistics=True,
    field_statistics=True,
)

const response = await client.termvectors({
  index: "my-index-000001",
  id: "1",
  fields: ["text"],
  offsets: true,
  payloads: true,
  positions: true,
  term_statistics: true,
  field_statistics: true,
});

response = client.termvectors(
  index: "my-index-000001",
  id: "1",
  body: {
    "fields": [
      "text"
    ],
    "offsets": true,
    "payloads": true,
    "positions": true,
    "term_statistics": true,
    "field_statistics": true
  }
)

$resp = $client->termvectors([
    "index" => "my-index-000001",
    "id" => "1",
    "body" => [
        "fields" => array(
            "text",
        ),
        "offsets" => true,
        "payloads" => true,
        "positions" => true,
        "term_statistics" => true,
        "field_statistics" => true,
    ],
]);

curl -X GET -H "Authorization: ApiKey $ELASTIC_API_KEY" -H "Content-Type: application/json" -d '{"fields":["text"],"offsets":true,"payloads":true,"positions":true,"term_statistics":true,"field_statistics":true}' "$ELASTICSEARCH_URL/my-index-000001/_termvectors/1"

client.termvectors(t -> t
    .fieldStatistics(true)
    .fields("text")
    .id("1")
    .index("my-index-000001")
    .offsets(true)
    .payloads(true)
    .positions(true)
    .termStatistics(true)
);

Request examples

Run `GET /my-index-000001/_termvectors/1` to return all information and statistics for field `text` in document 1.

{
  "fields" : ["text"],
  "offsets" : true,
  "payloads" : true,
  "positions" : true,
  "term_statistics" : true,
  "field_statistics" : true
}

Run `GET /my-index-000001/_termvectors` to set per-field analyzers. Use the `per_field_analyzer` parameter to provide a different analyzer than the one configured for the field.

{
  "doc" : {
    "fullname" : "John Doe",
    "text" : "test test test"
  },
  "fields": ["fullname"],
  "per_field_analyzer" : {
    "fullname": "keyword"
  }
}

Run `GET /imdb/_termvectors` to filter the terms returned based on their tf-idf scores. It returns the three most "interesting" keywords from the artificial document having the given "plot" field value. Notice that the keyword "Tony" or any stop words are not part of the response, as their tf-idf must be too low.

{
  "doc": {
    "plot": "When wealthy industrialist Tony Stark is forced to build an armored suit after a life-threatening incident, he ultimately decides to use its technology to fight against evil."
  },
  "term_statistics": true,
  "field_statistics": true,
  "positions": false,
  "offsets": false,
  "filter": {
    "max_num_terms": 3,
    "min_term_freq": 1,
    "min_doc_freq": 1
  }
}

Run `GET /my-index-000001/_termvectors/1`. Term vectors which are not explicitly stored in the index are automatically computed on the fly. This request returns all information and statistics for the fields in document 1, even though the terms haven't been explicitly stored in the index. Note that for the field text, the terms are not regenerated.

{
  "fields" : ["text", "some_field_without_term_vectors"],
  "offsets" : true,
  "positions" : true,
  "term_statistics" : true,
  "field_statistics" : true
}

Run `GET /my-index-000001/_termvectors`. Term vectors can be generated for artificial documents, that is, for documents not present in the index. If dynamic mapping is turned on (the default), document fields not in the original mapping are dynamically created.

{
  "doc" : {
    "fullname" : "John Doe",
    "text" : "test test test"
  }
}

Response examples (200)

A successful response from `GET /my-index-000001/_termvectors/1`.

{
  "_index": "my-index-000001",
  "_id": "1",
  "_version": 1,
  "found": true,
  "took": 6,
  "term_vectors": {
    "text": {
      "field_statistics": {
        "sum_doc_freq": 4,
        "doc_count": 2,
        "sum_ttf": 6
      },
      "terms": {
        "test": {
          "doc_freq": 2,
          "ttf": 4,
          "term_freq": 3,
          "tokens": [
            {
              "position": 0,
              "start_offset": 0,
              "end_offset": 4,
              "payload": "d29yZA=="
            },
            {
              "position": 1,
              "start_offset": 5,
              "end_offset": 9,
              "payload": "d29yZA=="
            },
            {
              "position": 2,
              "start_offset": 10,
              "end_offset": 14,
              "payload": "d29yZA=="
            }
          ]
        }
      }
    }
  }
}
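A response like the one above groups tokens by term, so recovering the original token order takes a small amount of post-processing. The following is a minimal sketch, assuming positions were requested and the response has already been parsed into a Python dict. The helper name `tokens_in_order` is hypothetical, and the dict below is a trimmed copy of the example response.

```python
import base64


def tokens_in_order(response, field):
    """Flatten the per-term token lists of one field, sorted by position.

    Hypothetical helper: assumes each token carries a "position" key,
    that is, the request was sent with positions enabled.
    """
    terms = response["term_vectors"][field]["terms"]
    tokens = []
    for term, info in terms.items():
        for token in info.get("tokens", []):
            payload = token.get("payload")
            tokens.append({
                "term": term,
                "position": token["position"],
                "start_offset": token.get("start_offset"),
                "end_offset": token.get("end_offset"),
                # Payloads are returned as base64 encoded bytes.
                "payload": base64.b64decode(payload) if payload is not None else None,
            })
    return sorted(tokens, key=lambda t: t["position"])


# Trimmed copy of the example response above.
response = {
    "term_vectors": {
        "text": {
            "terms": {
                "test": {
                    "term_freq": 3,
                    "tokens": [
                        {"position": 0, "start_offset": 0, "end_offset": 4, "payload": "d29yZA=="},
                        {"position": 1, "start_offset": 5, "end_offset": 9, "payload": "d29yZA=="},
                        {"position": 2, "start_offset": 10, "end_offset": 14, "payload": "d29yZA=="},
                    ],
                }
            }
        }
    }
}

ordered = tokens_in_order(response, "text")
for token in ordered:
    print(token["position"], token["term"], token["start_offset"], token["end_offset"], token["payload"])
```

The payload `d29yZA==` in this example decodes to the bytes `b"word"`.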

A successful response from `GET /my-index-000001/_termvectors` with `per_field_analyzer` in the request body.

{
  "_index": "my-index-000001",
  "_version": 0,
  "found": true,
  "took": 6,
  "term_vectors": {
    "fullname": {
      "field_statistics": {
          "sum_doc_freq": 2,
          "doc_count": 4,
          "sum_ttf": 4
      },
      "terms": {
          "John Doe": {
            "term_freq": 1,
            "tokens": [
                {
                  "position": 0,
                  "start_offset": 0,
                  "end_offset": 8
                }
            ]
          }
      }
    }
  }
}

A successful response from `GET /imdb/_termvectors` with a `filter` in the request body.

{
  "_index": "imdb",
  "_version": 0,
  "found": true,
  "term_vectors": {
      "plot": {
        "field_statistics": {
            "sum_doc_freq": 3384269,
            "doc_count": 176214,
            "sum_ttf": 3753460
        },
        "terms": {
            "armored": {
              "doc_freq": 27,
              "ttf": 27,
              "term_freq": 1,
              "score": 9.74725
            },
            "industrialist": {
              "doc_freq": 88,
              "ttf": 88,
              "term_freq": 1,
              "score": 8.590818
            },
            "stark": {
              "doc_freq": 44,
              "ttf": 47,
              "term_freq": 1,
              "score": 9.272792
            }
        }
      }
  }
}
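This page does not spell out the scoring formula. As a hedged observation, the score values in the response above are numerically consistent with a classic TF-IDF weighting of sqrt(term_freq) * (1 + ln(doc_count / (doc_freq + 1))). The sketch below, using the hypothetical helper `classic_tfidf`, checks that assumption against the three returned terms. Treat it as an illustration of how the numbers relate, not as the documented implementation.

```python
import math


def classic_tfidf(term_freq, doc_freq, doc_count):
    """Classic TF-IDF weight (an assumed formula, not confirmed by this page).

    tf  = sqrt(term_freq)
    idf = 1 + ln(doc_count / (doc_freq + 1))
    """
    tf = math.sqrt(term_freq)
    idf = 1.0 + math.log(doc_count / (doc_freq + 1))
    return tf * idf


# Values copied from the example response above.
doc_count = 176214
terms = {
    "armored": {"doc_freq": 27, "term_freq": 1, "score": 9.74725},
    "industrialist": {"doc_freq": 88, "term_freq": 1, "score": 8.590818},
    "stark": {"doc_freq": 44, "term_freq": 1, "score": 9.272792},
}

for name, info in terms.items():
    computed = classic_tfidf(info["term_freq"], info["doc_freq"], doc_count)
    print(f"{name}: reported={info['score']} computed={computed:.6f}")
```

Under this formula, rarer terms (lower doc_freq) receive higher scores, which fits the description of the filter as selecting the most characteristic terms of a document.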
