Sampler Aggregation
This functionality is in technical preview and may be changed or removed in a future release. Elastic will work to fix any issues, but features in technical preview are not subject to the support SLA of official GA features.
A filtering aggregation used to limit any sub aggregations' processing to a sample of the top-scoring documents.
Example use cases:
- Tightening the focus of analytics to high-relevance matches rather than the potentially very long tail of low-quality matches
- Reducing the running cost of aggregations that can produce useful results using only samples, e.g. significant_terms
Example:
A query on StackOverflow data for the popular term javascript OR the rarer term kibana will match many documents - most of them missing the word Kibana. To focus the significant_terms aggregation on top-scoring documents that are more likely to match the most interesting parts of our query, we use a sample.
POST /stackoverflow/_search?size=0
{
  "query": {
    "query_string": {
      "query": "tags:kibana OR tags:javascript"
    }
  },
  "aggs": {
    "sample": {
      "sampler": {
        "shard_size": 200
      },
      "aggs": {
        "keywords": {
          "significant_terms": {
            "field": "tags",
            "exclude": ["kibana", "javascript"]
          }
        }
      }
    }
  }
}
Response:
{ ... "aggregations": { "sample": { "doc_count": 1000, "keywords": { "doc_count": 1000, "buckets": [ { "key": "elasticsearch", "doc_count": 150, "score": 1.078125, "bg_count": 200 }, { "key": "logstash", "doc_count": 50, "score": 0.5625, "bg_count": 50 } ] } } } }
1000 documents were sampled in total because we asked for a maximum of 200 per shard from an index with 5 shards. The cost of performing the nested significant_terms aggregation was therefore limited rather than unbounded.
Without the sampler aggregation the request query considers the full "long tail" of low-quality matches and therefore identifies less significant terms such as jquery and angular rather than focusing on the more insightful Kibana-related terms.
POST /stackoverflow/_search?size=0
{
  "query": {
    "query_string": {
      "query": "tags:kibana OR tags:javascript"
    }
  },
  "aggs": {
    "low_quality_keywords": {
      "significant_terms": {
        "field": "tags",
        "size": 3,
        "exclude": ["kibana", "javascript"]
      }
    }
  }
}
Response:
{ ... "aggregations": { "low_quality_keywords": { "doc_count": 1000, "buckets": [ { "key": "angular", "doc_count": 200, "score": 0.02777, "bg_count": 200 }, { "key": "jquery", "doc_count": 200, "score": 0.02777, "bg_count": 200 }, { "key": "logstash", "doc_count": 50, "score": 0.0069, "bg_count": 50 } ] } } }
shard_size
The shard_size parameter limits how many top-scoring documents are collected in the sample processed on each shard. The default value is 100.
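For instance, a minimal sketch of a request that raises the per-shard sample to 500 top-scoring documents (reusing the stackoverflow index and tags field from the examples above; the value 500 is arbitrary and purely illustrative):
POST /stackoverflow/_search?size=0
{
  "query": {
    "query_string": {
      "query": "tags:kibana OR tags:javascript"
    }
  },
  "aggs": {
    "sample": {
      "sampler": {
        "shard_size": 500
      },
      "aggs": {
        "keywords": {
          "significant_terms": {
            "field": "tags"
          }
        }
      }
    }
  }
}
Larger shard_size values give the sub-aggregations more evidence to work with at the cost of collecting and scoring more documents per shard.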
Limitations
Cannot be nested under breadth_first aggregations
Being a quality-based filter the sampler aggregation needs access to the relevance score produced for each document. It therefore cannot be nested under a terms aggregation which has the collect_mode switched from the default depth_first mode to breadth_first, as this discards scores. In this situation an error will be thrown.
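As a rough illustration of the invalid case (field names reused from the examples above; the exact error reported may vary by version), a request like the following, which nests a sampler under a terms aggregation running with breadth_first collection, will be rejected:
POST /stackoverflow/_search?size=0
{
  "aggs": {
    "tags": {
      "terms": {
        "field": "tags",
        "collect_mode": "breadth_first"
      },
      "aggs": {
        "sample": {
          "sampler": {
            "shard_size": 200
          }
        }
      }
    }
  }
}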