Logs data stream

The Elasticsearch logsdb index mode is generally available in Elastic Cloud Hosted and self-managed Elasticsearch as of version 8.17, and is enabled by default for logs in Elastic Cloud Serverless.

A logs data stream is a data stream type that stores log data more efficiently.

In benchmarks, log data stored in a logs data stream used ~2.5 times less disk space than a regular data stream. The exact impact varies by data set.

Create a logs data stream

To create a logs data stream, set index.mode to logsdb in your index template:

resp = client.indices.put_index_template(
    name="my-index-template",
    index_patterns=[
        "logs-*"
    ],
    data_stream={},
    template={
        "settings": {
            "index.mode": "logsdb"
        }
    },
    priority=101,
)
print(resp)

const response = await client.indices.putIndexTemplate({
  name: "my-index-template",
  index_patterns: ["logs-*"],
  data_stream: {},
  template: {
    settings: {
      "index.mode": "logsdb",
    },
  },
  priority: 101,
});
console.log(response);

PUT _index_template/my-index-template
{
  "index_patterns": ["logs-*"],
  "data_stream": { },
  "template": {
     "settings": {
        "index.mode": "logsdb" 
     }
  },
  "priority": 101 
}

In this example, "index.mode": "logsdb" is the index mode setting, and "priority": 101 is the index template priority. By default, Elasticsearch ships with a logs-*-* index template with a priority of 100. To make sure your index template takes priority over the default logs-*-* template, set its priority to a number higher than 100. For more information, see Avoid index pattern collisions.

After the index template is created, new data streams that match the template's index patterns are created as logs data streams. You can start indexing data and using the data stream.

You can also set the index mode and adjust other template settings in the Elastic UI.

Synthetic source

If you have the required subscription, logsdb index mode uses synthetic _source, which omits storing the original _source field. Instead, the document source is synthesized from doc values or stored fields upon document retrieval.

If you don’t have the required subscription, logsdb mode uses the original _source field.

Before using synthetic source, make sure to review the restrictions.

When working with multi-value fields, the index.mapping.synthetic_source_keep setting controls how field values are preserved for synthetic source reconstruction. In logsdb, the default value is arrays, which retains both duplicate values and the order of entries. However, the exact structure of array elements and objects is not necessarily retained. Preserving duplicates and ordering can be critical for some log fields, such as DNS A records, HTTP headers, and log entries that represent sequential or repeated events.
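
If preserving duplicates and ordering is unnecessary for a specific field, you can override this behavior per field with the synthetic_source_keep mapping parameter. The following is a minimal sketch that reuses the template from the earlier example; the process.pid field is purely illustrative:

resp = client.indices.put_index_template(
    name="my-index-template",
    index_patterns=[
        "logs-*"
    ],
    data_stream={},
    template={
        "settings": {
            "index.mode": "logsdb"
        },
        "mappings": {
            "properties": {
                # Hypothetical field: arrays of PIDs carry no meaningful
                # duplicates or ordering, so skip preserving them.
                "process.pid": {
                    "type": "long",
                    "synthetic_source_keep": "none"
                }
            }
        }
    },
    priority=101,
)
print(resp)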

Index sort settings

In logsdb index mode, the following sort settings are applied by default:

index.sort.field: ["host.name", "@timestamp"]
Indices are sorted by host.name and @timestamp by default. The @timestamp field is automatically injected if it is not present.
index.sort.order: ["desc", "desc"]
Both host.name and @timestamp are sorted in descending (desc) order, prioritizing the latest data.
index.sort.mode: ["min", "min"]
The min mode sorts indices by the minimum value of multi-value fields.
index.sort.missing: ["_first", "_first"]
Missing values are sorted to appear _first.

You can override these default sort settings. For example, to sort on different fields and change the order, manually configure index.sort.field and index.sort.order. For more details, see Index Sorting.
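
For example, the following sketch overrides the defaults in an index template to sort by a hypothetical service.name field (ascending) and then by @timestamp (descending). Custom sort fields are not injected automatically, so the sketch maps service.name explicitly as a keyword:

resp = client.indices.put_index_template(
    name="my-index-template",
    index_patterns=[
        "logs-*"
    ],
    data_stream={},
    template={
        "settings": {
            "index.mode": "logsdb",
            "index.sort.field": ["service.name", "@timestamp"],
            "index.sort.order": ["asc", "desc"]
        },
        "mappings": {
            "properties": {
                # Illustrative sort field; map custom sort fields explicitly
                # because only @timestamp is injected automatically.
                "service.name": {"type": "keyword"}
            }
        }
    },
    priority=101,
)
print(resp)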

When using the default sort settings, the host.name field is automatically injected into the index mappings as a keyword field to ensure that sorting can be applied. This guarantees that logs are efficiently sorted and retrieved based on the host.name and @timestamp fields.

If subobjects is set to true (default), the host field is mapped as an object field named host with a name child field of type keyword. If subobjects is set to false, a single host.name field is mapped as a keyword field.

To apply different sort settings to an existing data stream, update the data stream’s component templates, and then perform or wait for a rollover.
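
For example, after updating the relevant component templates, you can trigger a manual rollover so that the next backing index picks up the new settings. This is a minimal sketch; logs-myapp-default is a hypothetical data stream name:

# Roll over the data stream so a new backing index is created with the
# updated sort settings.
resp = client.indices.rollover(alias="logs-myapp-default")
print(resp)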

In logsdb mode, the @timestamp field is automatically injected if it’s not already present. If you apply custom sort settings, the @timestamp field is injected into the mappings but is not automatically added to the list of sort fields.

Existing data streams

If you’re enabling logsdb index mode on a data stream that already exists, make sure to check mappings and sorting. The logsdb mode automatically maps host.name as a keyword if it’s included in the sort settings. If a host.name field already exists but has a different type, mapping errors might occur, preventing logsdb mode from being fully applied.
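
Before switching, you can inspect how host.name is currently mapped. The sketch below uses a hypothetical data stream name:

# Check the existing host.name mapping on the backing indices to spot
# type conflicts before enabling logsdb mode.
resp = client.indices.get_field_mapping(
    index="logs-myapp-default",
    fields="host.name",
)
print(resp)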

To avoid mapping conflicts, consider these options:

  • Adjust mappings: Check your existing mappings to ensure that host.name is mapped as a keyword.
  • Change sorting: If needed, you can remove host.name from the sort settings and use a different set of fields. Sorting by @timestamp can be a good fallback.
  • Switch to a different index mode: If resolving host.name mapping conflicts is not feasible, you can choose not to use logsdb mode.

On existing data streams, logsdb mode is applied on rollover (automatic or manual).

Specialized codecs

By default, logsdb index mode uses the best_compression codec, which applies ZSTD compression to stored fields. You can switch to the default codec, which compresses and decompresses faster but results in a slightly larger storage footprint.
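
For example, a template can opt back into the default codec. This sketch reuses the template from the earlier example:

resp = client.indices.put_index_template(
    name="my-index-template",
    index_patterns=[
        "logs-*"
    ],
    data_stream={},
    template={
        "settings": {
            "index.mode": "logsdb",
            # Trade some storage efficiency for faster stored-field
            # compression and decompression.
            "index.codec": "default"
        }
    },
    priority=101,
)
print(resp)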

The logsdb index mode also automatically applies specialized codecs for numeric doc values, in order to optimize storage usage. Numeric fields are encoded using the following sequence of codecs:

  • Delta encoding: Stores the difference between consecutive values instead of the actual values.
  • Offset encoding: Stores the difference from a base value rather than between consecutive values.
  • Greatest Common Divisor (GCD) encoding: Finds the greatest common divisor of a set of values and stores the differences as multiples of the GCD.
  • Frame Of Reference (FOR) encoding: Determines the smallest number of bits required to encode a block of values and uses bit-packing to fit such values into larger 64-bit blocks.

Each encoding is evaluated according to heuristics determined by the data distribution. For example, the algorithm checks whether the data is monotonically non-decreasing or non-increasing. If so, delta encoding is applied; otherwise, the process continues with the next encoding method (offset).
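
The following is a conceptual sketch only, not the Lucene implementation, showing how a monotonicity check can decide between delta and offset encoding:

def encode_block(values):
    # Conceptual illustration; the real doc-value encoders also consider
    # GCD and Frame Of Reference encoding and operate on Lucene segments.
    deltas = [b - a for a, b in zip(values, values[1:])]
    if all(d >= 0 for d in deltas) or all(d <= 0 for d in deltas):
        # Monotonic block: store the first value plus consecutive deltas.
        return {"encoding": "delta", "base": values[0], "deltas": deltas}
    # Otherwise store offsets from a common base value.
    base = min(values)
    return {"encoding": "offset", "base": base, "offsets": [v - base for v in values]}

print(encode_block([1000, 1004, 1004, 1010]))  # monotonic, so delta encoding
print(encode_block([1000, 950, 1010, 980]))    # not monotonic, falls back to offset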

Encoding is specific to each Lucene segment and is reapplied when segments are merged. The merged Lucene segment might use a different encoding than the original segments, depending on the characteristics of the merged data.

For keyword fields, Run Length Encoding (RLE) is applied to the ordinals, which represent positions in the Lucene segment-level keyword dictionary. This compression is used when multiple consecutive documents share the same keyword.

ignore settings

The logsdb index mode uses the following ignore settings. You can override these settings as needed.

ignore_malformed

By default, logsdb index mode sets ignore_malformed to true. With this setting, documents with malformed fields can be indexed without causing ingestion failures.
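
If you prefer ingestion to fail on malformed fields instead, you can override the index-level default in your template. A minimal sketch:

resp = client.indices.put_index_template(
    name="my-index-template",
    index_patterns=[
        "logs-*"
    ],
    data_stream={},
    template={
        "settings": {
            "index.mode": "logsdb",
            # Reject documents with malformed fields rather than indexing
            # them with those fields ignored.
            "index.mapping.ignore_malformed": False
        }
    },
    priority=101,
)
print(resp)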

ignore_above

In logsdb index mode, the index.mapping.ignore_above setting is applied by default at the index level to ensure efficient storage and indexing of large keyword fields. The index-level default for ignore_above is 8191 characters. Because a UTF-8 character can occupy up to four bytes, this corresponds to a limit of up to 32764 bytes.

The mapping-level ignore_above setting takes precedence. If a specific field has an ignore_above value defined in its mapping, that value overrides the index-level index.mapping.ignore_above value. This default behavior helps to optimize indexing performance by preventing excessively large string values from being indexed.

If you need to customize the limit, you can override it at the mapping level or change the index level default.
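
For example, the following sketch lowers the index-level default and overrides it for one field; the url.original field and the specific limits are illustrative:

resp = client.indices.put_index_template(
    name="my-index-template",
    index_patterns=[
        "logs-*"
    ],
    data_stream={},
    template={
        "settings": {
            "index.mode": "logsdb",
            # Hypothetical lower index-level limit for keyword fields.
            "index.mapping.ignore_above": 1024
        },
        "mappings": {
            "properties": {
                # The mapping-level ignore_above takes precedence for this field.
                "url.original": {"type": "keyword", "ignore_above": 4096}
            }
        }
    },
    priority=101,
)
print(resp)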

ignore_dynamic_beyond_limit

In logsdb index mode, the setting index.mapping.total_fields.ignore_dynamic_beyond_limit is set to true by default. This setting allows dynamically mapped fields to be added on top of statically defined fields, even when the total number of fields exceeds the index.mapping.total_fields.limit. Instead of triggering an index failure, additional dynamically mapped fields are ignored so that ingestion can continue.

When automatically injected, host.name and @timestamp count toward the limit of mapped fields. If host.name is mapped with subobjects: true (the default), it counts as two fields; with subobjects: false, it counts as one.
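
For example, the following sketch raises the field limit while keeping the overflow behavior; the limit of 2000 is illustrative:

resp = client.indices.put_index_template(
    name="my-index-template",
    index_patterns=[
        "logs-*"
    ],
    data_stream={},
    template={
        "settings": {
            "index.mode": "logsdb",
            # Hypothetical higher field limit; dynamically mapped fields
            # beyond it are ignored instead of failing the indexing request.
            "index.mapping.total_fields.limit": 2000,
            "index.mapping.total_fields.ignore_dynamic_beyond_limit": True
        }
    },
    priority=101,
)
print(resp)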

Fields without doc_values

When the logsdb index mode uses synthetic _source and doc_values are disabled for a field in the mapping, Elasticsearch might set the store setting to true for that field. This ensures that the field’s data remains accessible for reconstructing the document’s source when using synthetic source.

For example, this adjustment occurs with text fields when store is false and no suitable multi-field is available for reconstructing the original value.

Settings reference

The logsdb index mode uses the following settings:

  • index.mode: "logsdb"
  • index.mapping.synthetic_source_keep: "arrays"
  • index.sort.field: ["host.name", "@timestamp"]
  • index.sort.order: ["desc", "desc"]
  • index.sort.mode: ["min", "min"]
  • index.sort.missing: ["_first", "_first"]
  • index.codec: "best_compression"
  • index.mapping.ignore_malformed: true
  • index.mapping.ignore_above: 8191
  • index.mapping.total_fields.ignore_dynamic_beyond_limit: true