WARNING: Version 2.3 of Elasticsearch has passed its EOL date.
This documentation is no longer being maintained and may be removed. If you are running this version, we strongly advise you to upgrade. For the latest information, see the current release documentation.
Text scoring in scripts

Text features, such as term frequency or document frequency for a specific term, can be accessed in scripts (see the scripting documentation) with the `_index` variable. This can be useful if, for example, you want to implement your own scoring model using a script inside a function score query.
Statistics over the document collection are computed per shard, not per
index.
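As an illustration, a function score query using such a script might have a request body like the following sketch. This is a hedged example, not taken from the original page: the index name `my_index`, the field `my_field`, and the term `foo` are hypothetical placeholders, and the body is built as a plain Python dict for clarity.

```python
# Sketch of a function_score request body whose script scores each
# document by the term frequency of "foo" in my_field via _index.
# Index, field, and term names are hypothetical placeholders.
import json

query = {
    "query": {
        "function_score": {
            "query": {"match": {"my_field": "foo"}},
            "script_score": {
                "script": "_index['my_field']['foo'].tf()"
            },
        }
    }
}

# The serialized body would be sent to the _search endpoint of the index.
body = json.dumps(query, indent=2)
print(body)
```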
Nomenclature:

- `df`: document frequency. The number of documents a term appears in. Computed per field.
- `tf`: term frequency. The number of times a term appears in a field in one specific document.
- `ttf`: total term frequency. The number of times this term appears in all documents, that is, the sum of `tf` over all documents.

`df` and `ttf` are computed per shard and therefore these numbers can vary depending on the shard the current document resides in.
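These definitions can be illustrated with a small self-contained sketch. This is plain Python over an invented toy collection, only mirroring the definitions, not Elasticsearch's implementation:

```python
# Toy illustration of df, tf, and ttf for a single field.
# The documents below are hypothetical.
docs = [
    "foo bar foo",
    "bar baz",
    "foo foo foo",
]

def tf(term, doc):
    """Term frequency: occurrences of `term` in one document's field."""
    return doc.split().count(term)

def df(term):
    """Document frequency: number of documents containing `term`."""
    return sum(1 for d in docs if term in d.split())

def ttf(term):
    """Total term frequency: sum of tf over all documents."""
    return sum(tf(term, d) for d in docs)

print(df("foo"), tf("foo", docs[0]), ttf("foo"))  # 2 2 5
```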
Shard statistics:

- `_index.numDocs()`: Number of documents in shard.
- `_index.maxDoc()`: Maximal document number in shard.
- `_index.numDeletedDocs()`: Number of deleted documents in shard.
Field statistics:
editField statistics can be accessed with a subscript operator like this:
_index['FIELD']
.
-
_index['FIELD'].docCount()
-
Number of documents containing the field
FIELD
. Does not take deleted documents into account. -
_index['FIELD'].sumttf()
-
Sum of
ttf
over all terms that appear in fieldFIELD
in all documents. -
_index['FIELD'].sumdf()
-
The sum of
df
s over all terms that appear in fieldFIELD
in all documents.
Field statistics are computed per shard and therefore these numbers can vary
depending on the shard the current document resides in.
The number of terms in a field cannot be accessed using the _index
variable. See Token count datatype for how to do that.
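The field-level sums can be sketched the same way, again in plain Python over an invented toy collection (an illustration of the semantics only, not Elasticsearch code):

```python
# Toy illustration of sumdf and sumttf for one field.
# The documents are hypothetical.
docs = ["foo bar foo", "bar baz", "foo foo foo"]

def terms():
    """All distinct terms that appear in the field across all documents."""
    return {t for d in docs for t in d.split()}

def df(term):
    return sum(1 for d in docs if term in d.split())

def sumdf():
    """Sum of df over all terms that appear in the field."""
    return sum(df(t) for t in terms())

def sumttf():
    """Sum of ttf over all terms, i.e. the total token count of the field."""
    return sum(len(d.split()) for d in docs)

print(sumdf(), sumttf())  # 5 8
```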
Term statistics:

Term statistics for a field can be accessed with a subscript operator like this: `_index['FIELD']['TERM']`. This will never return null, even if the term or field does not exist. If you do not need the term frequency, call `_index['FIELD'].get('TERM', 0)` to avoid unnecessary initialization of the frequencies. The flag will only have an effect if you set the `index_options` to `docs`.

- `_index['FIELD']['TERM'].df()`: `df` of term `TERM` in field `FIELD`. Will be returned even if the term is not present in the current document.
- `_index['FIELD']['TERM'].ttf()`: The sum of term frequencies of term `TERM` in field `FIELD` over all documents. Will be returned even if the term is not present in the current document.
- `_index['FIELD']['TERM'].tf()`: `tf` of term `TERM` in field `FIELD`. Will be 0 if the term is not present in the current document.
Term positions, offsets and payloads:

If you need information on the positions of terms in a field, call `_index['FIELD'].get('TERM', flag)` where flag can be

- `_POSITIONS`: if you need the positions of the term
- `_OFFSETS`: if you need the offsets of the term
- `_PAYLOADS`: if you need the payloads of the term
- `_CACHE`: if you need to iterate over all positions several times

The iterator uses the underlying Lucene classes to iterate over positions. For efficiency reasons, you can only iterate over positions once. If you need to iterate over the positions several times, set the `_CACHE` flag.
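The one-pass behavior is analogous to the difference between a one-shot iterator and a materialized list. This is a generic Python analogy, not Elasticsearch code:

```python
# Analogy for the _CACHE flag: a generator can be consumed only once,
# while materializing its values allows repeated iteration.
positions = (p for p in [3, 7, 12])  # one-shot, like the default iterator
first_pass = list(positions)
second_pass = list(positions)        # already exhausted -> empty

cached = [3, 7, 12]                  # like requesting _CACHE
print(first_pass, second_pass, cached)
```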
You can combine the operators with a `|` if you need more than one kind of information. For example, the following will return an object holding the positions and payloads, as well as all statistics:

`_index['FIELD'].get('TERM', _POSITIONS | _PAYLOADS)`

Positions can be accessed with an iterator that returns an object (`POS_OBJECT`) holding position, offsets and payload for each term position.
- `POS_OBJECT.position`: The position of the term.
- `POS_OBJECT.startOffset`: The start offset of the term.
- `POS_OBJECT.endOffset`: The end offset of the term.
- `POS_OBJECT.payload`: The payload of the term.
- `POS_OBJECT.payloadAsInt(missingValue)`: The payload of the term converted to integer. If the current position has no payload, the `missingValue` will be returned. Call this only if you know that your payloads are integers.
- `POS_OBJECT.payloadAsFloat(missingValue)`: The payload of the term converted to float. If the current position has no payload, the `missingValue` will be returned. Call this only if you know that your payloads are floats.
- `POS_OBJECT.payloadAsString()`: The payload of the term converted to string. If the current position has no payload, `null` will be returned. Call this only if you know that your payloads are strings.
Example: sums up all payloads for the term `foo`.

```
termInfo = _index['my_field'].get('foo', _PAYLOADS);
score = 0;
for (pos in termInfo) {
    score = score + pos.payloadAsInt(0);
}
return score;
```
Term vectors:

The `_index` variable can only be used to gather statistics for single terms. If you want to use information on all terms in a field, you must store the term vectors (see `term_vector`). To access them, call `_index.termVectors()` to get a Fields instance. This object can then be used as described in the Lucene documentation to iterate over fields and then for each field iterate over each term in the field. The method will return null if the term vectors were not stored.