Codec module

Codecs define how documents are written to and read from disk. The postings format is the part of the codec that is responsible for reading and writing the term dictionary, the postings lists, and the positions, payloads and offsets stored in the postings lists.

Configuring custom postings formats is an expert feature. The built-in postings formats, described in the mapping section, will most likely suit your needs.

Codecs are available in Elasticsearch from version 0.90.0.beta1.

Configuring a custom postings format

A custom postings format can be defined in the codec part of the index settings. The codec part can be configured when creating an index or when updating index settings. An example of how to define a custom postings format:

curl -XPUT 'http://localhost:9200/twitter/' -d '{
    "settings" : {
        "index" : {
            "codec" : {
                "postings_format" : {
                    "my_format" : {
                        "type" : "pulsing",
                        "freq_cut_off" : "5"
                    }
                }
            }
        }
    }
}'

Then, when defining your mapping, you can use the my_format name in the postings_format option, as the example below illustrates:

{
    "person" : {
        "properties" : {
            "second_person_id" : {"type" : "string", "postings_format" : "my_format"}
        }
    }
}
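
The settings and the mapping can also be supplied together. The following is a minimal sketch, assuming the create index request accepts a mappings section next to settings in the same body. It reuses the my_format name, the person type and the second_person_id field from the examples above; nothing else is introduced.

```json
{
    "settings" : {
        "index" : {
            "codec" : {
                "postings_format" : {
                    "my_format" : {
                        "type" : "pulsing",
                        "freq_cut_off" : "5"
                    }
                }
            }
        }
    },
    "mappings" : {
        "person" : {
            "properties" : {
                "second_person_id" : {"type" : "string", "postings_format" : "my_format"}
            }
        }
    }
}
```

This body would be sent with the same curl -XPUT request as in the first example. Declaring the format and the field in one request avoids a window in which the mapping refers to a postings format name that has not been defined yet.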

Available postings formats

Direct postings format

Wraps the default postings format for on-disk storage, but at read time loads all terms and postings and stores them directly in RAM. This postings format makes no effort to compress the terms and postings lists, so it is memory intensive, but in return it gives a substantial increase in search performance. Because it holds all term bytes as a single byte[], you cannot have more than 2.1GB worth of terms in a single segment.

This postings format offers the following parameters:

min_skip_count
The minimum number of terms with a shared prefix required for a skip pointer to be written. The default is 8.
low_freq_cutoff
Terms with a document frequency lower than this cutoff use a single array object representation for postings and positions. The default is 32.

Type name: direct
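
As an illustration, a direct postings format could be declared in the index settings as sketched below. The name my_direct_format and the parameter values are hypothetical and chosen only for the example; the structure follows the codec settings shown in the configuration section above.

```json
{
    "settings" : {
        "index" : {
            "codec" : {
                "postings_format" : {
                    "my_direct_format" : {
                        "type" : "direct",
                        "min_skip_count" : 16,
                        "low_freq_cutoff" : 64
                    }
                }
            }
        }
    }
}
```

Because this format keeps uncompressed terms and postings in RAM, it is probably best reserved for a few small, frequently searched fields rather than applied to every field in a mapping.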

Memory postings format

A postings format that stores terms & postings (docs, positions, payloads) in RAM, using an FST. This postings format does write to disk, but loads everything into memory. The memory postings format has the following options:

pack_fst
A boolean option that defines whether the in-memory structure should be packed once it is built. Packing reduces the size of the data structure in memory but requires more memory during building. The default is false.
acceptable_overhead_ratio
The compression ratio, specified as a float, that is used to compress internal structures. Example ratios: 0 (compact: no memory overhead at all, but the returned implementation may be slow), 0.5 (fast: at most 50% memory overhead, always selects a reasonably fast implementation), 7 (fastest: at most 700% memory overhead, no compression). The default is 0.2.

Type name: memory
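
A memory postings format with both options set might look like the sketch below. The name my_memory_format is hypothetical, and the values are picked from the examples in the option descriptions above rather than being recommendations.

```json
{
    "settings" : {
        "index" : {
            "codec" : {
                "postings_format" : {
                    "my_memory_format" : {
                        "type" : "memory",
                        "pack_fst" : true,
                        "acceptable_overhead_ratio" : 0.5
                    }
                }
            }
        }
    }
}
```

Here pack_fst trades extra memory at build time for a smaller structure afterwards, while the ratio of 0.5 accepts up to 50% memory overhead in exchange for faster access.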

Bloom filter postings format

The bloom filter postings format wraps a delegate postings format and creates a bloom filter on top of it, which is written to disk. During opening, this bloom filter is loaded into memory and used to offer "fast-fail" reads. This postings format is useful for low doc-frequency fields such as primary keys. The bloom filter postings format has the following options:

delegate
The name of the configured postings format that the bloom filter postings format will wrap.
fpp
The desired false positive probability, specified as a floating point number between 0 and 1.0. The fpp can be configured for multiple expected insertions. Example expression: 10k=0.01,1m=0.03. If the number of docs per index segment is larger than 1m, then 0.03 is used as the fpp, and if the number of docs per segment is larger than 10k, 0.01 is used. The last fallback value is always 0.03. This example expression is also the default.

Type name: bloom
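
The following sketch shows how a bloom filter postings format might wrap another configured format. The names my_pulsing and my_bloom are hypothetical, and the example assumes that delegate may refer to a postings format declared in the same codec block. The fpp value is the default expression quoted above.

```json
{
    "settings" : {
        "index" : {
            "codec" : {
                "postings_format" : {
                    "my_pulsing" : {
                        "type" : "pulsing",
                        "freq_cut_off" : "5"
                    },
                    "my_bloom" : {
                        "type" : "bloom",
                        "delegate" : "my_pulsing",
                        "fpp" : "10k=0.01,1m=0.03"
                    }
                }
            }
        }
    }
}
```

A field mapped with "postings_format" : "my_bloom" would then store its postings through my_pulsing, with the bloom filter consulted first on lookups.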

Added in 0.90.9. It can sometimes make sense to disable bloom filters. For instance, if you are logging into an index per day and you have thousands of indices, the bloom filters can take up a sizable amount of memory. For most queries you are only interested in recent indices, so you don’t mind CRUD operations on older indices taking slightly longer.

In these cases you can disable loading of the bloom filter on a per-index basis by updating the index settings:

PUT /old_index/_settings?index.codec.bloom.load=false

This setting, which defaults to true, can be updated on a live index. Note, however, that changing the value will cause the index to be reopened, which will invalidate any existing caches.

Pulsing postings format

The pulsing implementation in-lines the postings lists of very low-frequency terms into the term dictionary. This is useful to improve lookup performance for low-frequency terms. This postings format offers the following parameters:

min_block_size
The minimum block size the default Lucene term dictionary uses to encode on-disk blocks. Defaults to 25.
max_block_size
The maximum block size the default Lucene term dictionary uses to encode on-disk blocks. Defaults to 48.
freq_cut_off
The document frequency cutoff at which pulsing in-lines postings lists into the term dictionary. Terms with a document frequency less than or equal to the cutoff will be in-lined. The default is 1.

Type name: pulsing

Default postings format

The default postings format has the following options:

min_block_size
The minimum block size the default Lucene term dictionary uses to encode on-disk blocks. Defaults to 25.
max_block_size
The maximum block size the default Lucene term dictionary uses to encode on-disk blocks. Defaults to 48.

Type name: default
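
For completeness, a sketch of a custom format based on the default type is given below. The name my_default_format and the block sizes are hypothetical values chosen for illustration only; whether changing them helps will depend on the data.

```json
{
    "settings" : {
        "index" : {
            "codec" : {
                "postings_format" : {
                    "my_default_format" : {
                        "type" : "default",
                        "min_block_size" : 32,
                        "max_block_size" : 64
                    }
                }
            }
        }
    }
}
```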