Logs data stream

edit

Logs data streams and the logsdb index mode are in tech preview and may be changed or removed in the future. Don’t use logs data streams or logsdb index mode in production.

A logs data stream is a data stream type that stores log data more efficiently.

In benchmarks, log data stored in a logs data stream used ~2.5 times less disk space than a regular data stream. The exact impact will vary depending on your data set.

Create a logs data stream

edit

To create a logs data stream, set your index template index.mode to logsdb:

resp = client.indices.put_index_template(
    name="my-index-template",
    index_patterns=[
        "logs-*"
    ],
    data_stream={},
    template={
        "settings": {
            "index.mode": "logsdb"
        }
    },
    priority=101,
)
print(resp)
const response = await client.indices.putIndexTemplate({
  name: "my-index-template",
  index_patterns: ["logs-*"],
  data_stream: {},
  template: {
    settings: {
      "index.mode": "logsdb",
    },
  },
  priority: 101,
});
console.log(response);
PUT _index_template/my-index-template
{
  "index_patterns": ["logs-*"],
  "data_stream": { },
  "template": {
     "settings": {
        "index.mode": "logsdb" 
     }
  },
  "priority": 101 
}

The index mode setting.

The index template priority. By default, Elasticsearch ships with an index template with a logs-*-* pattern with a priority of 100. You need to define a priority higher than 100 to ensure that this index template gets selected over the default index template for the logs-*-* pattern. See the avoid index pattern collision section for more information.

After the index template is created, new indices that use the template will be configured as a logs data stream. You can start indexing data and using the data stream.

Synthetic source

edit

By default, logsdb mode uses synthetic source, which omits storing the original _source field and synthesizes it from doc values or stored fields upon document retrieval. Synthetic source comes with a few restrictions which you can read more about in the documentation section dedicated to it.

When dealing with multi-value fields, the index.mapping.synthetic_source_keep setting controls how field values are preserved for synthetic source reconstruction. In logsdb, the default value is arrays, which retains both duplicate values and the order of entries but not necessarily the exact structure when it comes to array elements or objects. Preserving duplicates and ordering could be critical for some log fields. This could be the case, for instance, for DNS A records, HTTP headers, or log entries that represent sequential or repeated events.

For more details on this setting and ways to refine or bypass it, check out this section.

Index sort settings

edit

The following settings are applied by default when using the logsdb mode for index sorting:

  • index.sort.field: ["host.name", "@timestamp"] In logsdb mode, indices are sorted by host.name and @timestamp fields by default. For data streams, the @timestamp field is automatically injected if it is not present.
  • index.sort.order: ["desc", "desc"] The default sort order for both fields is descending (desc), prioritizing the latest data.
  • index.sort.mode: ["min", "min"] The default sort mode is min, ensuring that indices are sorted by the minimum value of multi-value fields.
  • index.sort.missing: ["_first", "_first"] Missing values are sorted to appear first (_first) in logsdb index mode.

logsdb index mode allows users to override the default sort settings. For instance, users can specify their own fields and order for sorting by modifying the index.sort.field and index.sort.order.

When using default sort settings, the host.name field is automatically injected into the mappings of the index as a keyword field to ensure that sorting can be applied. This guarantees that logs are efficiently sorted and retrieved based on the host.name and @timestamp fields.

If subobjects is set to true (which is the default), the host.name field will be mapped as an object field named host, containing a name child field of type keyword. On the other hand, if subobjects is set to false, a single host.name field will be mapped as a keyword field.

Once an index is created, the sort settings are immutable and cannot be modified. To apply different sort settings, a new index must be created with the desired configuration. For data streams, this can be achieved by means of an index rollover after updating relevant (component) templates.

If the default sort settings are not suitable for your use case, consider modifying them. Keep in mind that sort settings can influence indexing throughput, query latency, and may affect compression efficiency due to the way data is organized after sorting. For more details, refer to our documentation on index sorting.

For data streams, the @timestamp field is automatically injected if not already present. However, if custom sort settings are applied, the @timestamp field is injected into the mappings, but it is not automatically added to the list of sort fields.

Specialized codecs

edit

logsdb index mode uses the best_compression codec by default, which applies ZSTD compression to stored fields. Users are allowed to override it and switch to the default codec for faster compression at the expense of slightly larger storage footprint.

logsdb index mode also adopts specialized codecs for numeric doc values that are crafted to optimize storage usage. Users can rely on these specialized codecs being applied by default when using logsdb index mode.

Doc values encoding for numeric fields in logsdb follows a static sequence of codecs, applying each one in the following order: delta encoding, offset encoding, Greatest Common Divisor GCD encoding, and finally Frame Of Reference (FOR) encoding. The decision to apply each encoding is based on heuristics determined by the data distribution. For example, before applying delta encoding, the algorithm checks if the data is monotonically non-decreasing or non-increasing. If the data fits this pattern, delta encoding is applied; otherwise, the next encoding is considered.

The encoding is specific to each Lucene segment and is also re-applied at segment merging time. The merged Lucene segment may use a different encoding compared to the original Lucene segments, based on the characteristics of the merged data.

The following methods are applied sequentially:

  • Delta encoding: a compression method that stores the difference between consecutive values instead of the actual values.
  • Offset encoding: a compression method that stores the difference from a base value rather than between consecutive values.
  • Greatest Common Divisor (GCD) encoding: a compression method that finds the greatest common divisor of a set of values and stores the differences as multiples of the GCD.
  • Frame Of Reference (FOR) encoding: a compression method that determines the smallest number of bits required to encode a block of values and uses bit-packing to fit such values into larger 64-bit blocks.

For keyword fields, Run Length Encoding (RLE) is applied to the ordinals, which represent positions in the Lucene segment-level keyword dictionary. This compression is used when multiple consecutive documents share the same keyword.

ignore_malformed, ignore_above, ignore_dynamic_beyond_limit

edit

By default, logsdb index mode sets ignore_malformed to true. This setting allows documents with malformed fields to be indexed without causing indexing failures, ensuring that log data ingestion continues smoothly even when some fields contain invalid or improperly formatted data.

Users can override this setting by setting index.mapping.ignore_malformed to false. However, this is not recommended as it might result in documents with malformed fields being rejected and not indexed at all.

In logsdb index mode, the index.mapping.ignore_above setting is applied by default at the index level to ensure efficient storage and indexing of large keyword fields.The index-level default for ignore_above is set to 8191 characters. If using UTF-8 encoding, this results in a limit of 32764 bytes, depending on character encoding. The mapping-level ignore_above setting still takes precedence. If a specific field has an ignore_above value defined in its mapping, that value will override the index-level index.mapping.ignore_above value. This default behavior helps to optimize indexing performance by preventing excessively large string values from being indexed, while still allowing users to customize the limit, overriding it at the mapping level or changing the index level default setting.

In logsdb index mode, the setting index.mapping.total_fields.ignore_dynamic_beyond_limit is set to true by default. This allows dynamically mapped fields to be added on top of statically defined fields without causing document rejection, even after the total number of fields exceeds the limit defined by index.mapping.total_fields.limit. The index.mapping.total_fields.limit setting specifies the maximum number of fields an index can have (static, dynamic and runtime). When the limit is reached, new dynamically mapped fields will be ignored instead of failing the document indexing, ensuring continued log ingestion without errors.

When automatically injected, host.name and @timestamp contribute to the limit of mapped fields. When host.name is mapped with subobjects: true it consists of two fields. When host.name is mapped with subobjects: false it only consists of one field.

Fields without doc values

edit

When logsdb index mode uses synthetic _source, and doc_values are disabled for a field in the mapping, Elasticsearch may set the store setting to true for that field as a last resort option to ensure that the field’s data is still available for reconstructing the document’s source when retrieving it via synthetic source.

For example, this happens with text fields when store is false and there is no suitable multi-field available to reconstruct the original value in synthetic source.

This automatic adjustment allows synthetic source to work correctly, even when doc values are not enabled for certain fields.

LogsDB settings summary

edit

The following is a summary of key settings that apply when using logsdb index mode in Elasticsearch:

  • index.mode: "logsdb"
  • index.mapping.synthetic_source_keep: "arrays"
  • index.sort.field: ["host.name", "@timestamp"]
  • index.sort.order: ["desc", "desc"]
  • index.sort.mode: ["min", "min"]
  • index.sort.missing: ["_first", "_first"]
  • index.codec: "best_compression"
  • index.mapping.ignore_malformed: true
  • index.mapping.ignore_above: 8191
  • index.mapping.total_fields.ignore_dynamic_beyond_limit: true