Data frame analytics limitations
This functionality is in technical preview and may be changed or removed in a future release. Elastic will work to fix any issues, but features in technical preview are not subject to the support SLA of official GA features.
The following limitations and known problems apply to the 7.6.2 release of the Elastic data frame analytics feature:
Cross-cluster search is not supported
Cross-cluster search is not supported for data frame analytics.
Deleting a data frame analytics job does not delete the destination index
The delete data frame analytics job API does not delete the destination index that contains the annotated data of the data frame analytics job. That index must be deleted separately.
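For example, the destination index can be removed with the delete index API after the job itself is deleted. This is a sketch; the job and index names are hypothetical:

```console
DELETE _ml/data_frame/analytics/my-analytics-job

DELETE my-dest-index
```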
Data frame analytics jobs cannot be updated
You cannot update data frame analytics configurations. Instead, delete the data frame analytics job and create a new one.
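In practice this means deleting the job and recreating it with the revised settings. A sketch using the data frame analytics APIs, with hypothetical job and index names:

```console
DELETE _ml/data_frame/analytics/my-analytics-job

PUT _ml/data_frame/analytics/my-analytics-job
{
  "source": { "index": "my-source-index" },
  "dest": { "index": "my-dest-index" },
  "analysis": { "outlier_detection": {} }
}
```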
Data frame analytics memory limitation
Data frame analytics can only perform analyses that fit into the memory available for machine learning. Overspill to disk is not currently possible. For general machine learning settings, see Machine learning settings in Elasticsearch.
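The memory available to a single job is bounded by its model_memory_limit setting in the job configuration. A sketch with hypothetical names, capping the analysis at 1 GB:

```console
PUT _ml/data_frame/analytics/my-analytics-job
{
  "source": { "index": "my-source-index" },
  "dest": { "index": "my-dest-index" },
  "analysis": { "outlier_detection": {} },
  "model_memory_limit": "1gb"
}
```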
Data frame analytics jobs runtime may vary
The runtime of data frame analytics jobs depends on numerous factors, such as the number of data points in the data set, the type of analytics, the number of fields included in the analysis, the supplied hyperparameters, the type of the analyzed fields, and so on. For this reason, there is no general runtime value that applies to all or most situations. The runtime of a data frame analytics job may range from a couple of minutes up to 35 hours in extreme cases.
The runtime increases nearly linearly with the number of analyzed fields. For data sets of more than 100,000 points, start with a low training percent. Run a few data frame analytics jobs to see how the runtime scales with an increased number of data points and how the quality of results scales with an increased training percentage.
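A low training percent is set via the training_percent option of the regression (or classification) analysis. A sketch with hypothetical names, training on 10% of the eligible documents:

```console
PUT _ml/data_frame/analytics/my-regression-job
{
  "source": { "index": "my-source-index" },
  "dest": { "index": "my-dest-index" },
  "analysis": {
    "regression": {
      "dependent_variable": "price",
      "training_percent": 10
    }
  }
}
```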
Documents with missing values in analyzed fields are skipped
If there are missing values in feature fields (fields that are the subjects of the data frame analytics), the document that contains these missing values is skipped during the analysis.
Outlier detection field types
Outlier detection requires numeric or boolean data to analyze. The algorithms don't support missing values (see also Documents with missing values in analyzed fields are skipped), so fields that have data types other than numeric or boolean are ignored. Documents in which included fields contain missing values, null values, or an array are also ignored. A destination index may therefore contain documents that don't have an outlier score: these documents are still reindexed from the source index to the destination index, but they are not included in the outlier detection analysis, so no outlier score is computed for them.
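If you want the analysis to consider only specific fields, the analyzed_fields object in the job configuration restricts which fields are used. A sketch with hypothetical job, index, and field names:

```console
PUT _ml/data_frame/analytics/my-outlier-job
{
  "source": { "index": "my-source-index" },
  "dest": { "index": "my-dest-index" },
  "analysis": { "outlier_detection": {} },
  "analyzed_fields": {
    "includes": [ "response_time", "bytes_sent" ]
  }
}
```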
Regression field types
Regression supports fields that have numeric, boolean, text, keyword, or ip data types. It is also tolerant of missing values. Fields that are supported are included in the analysis; other fields are ignored. Documents in which included fields contain an array are also ignored. Documents in the destination index that don't contain a results field are not included in the regression analysis.
Classification field types
Classification supports fields that have numeric, boolean, text, keyword, or ip data types. It is also tolerant of missing values. Fields that are supported are included in the analysis; other fields are ignored. Documents in which included fields contain an array are also ignored. Documents in the destination index that don't contain a results field are not included in the classification analysis.
Imbalanced class sizes affect classification performance
If your training data is very imbalanced, classification analysis may not provide good predictions. Try to avoid highly imbalanced situations: we recommend having at least 50 examples of each class and a ratio of no more than 10 to 1 for the majority to minority class labels in the training data. If your training data set is very imbalanced, consider downsampling the majority class, upsampling the minority class, or gathering more data.
Deeply nested objects affect inference performance
If the data that you run inference against contains documents that have a series of combinations of dot delimited and nested fields (for example: {"a.b": "c", "a": {"b": "c"},...}), the performance of the operation might be slightly slower. Consider using as simple a mapping as possible for the best performance profile.
Analytics runtime performance may significantly slow down with feature importance computation
For complex models (such as those with many deep trees), the calculation of feature importance takes significantly more time. Feature importance is calculated at the end of the analysis and therefore the job may appear to be stuck at 99% for several hours.
If a reduction in runtime is important to you, try strategies such as disabling feature importance, using a smaller transform, setting hyperparameter values, or only selecting fields that are relevant for analysis.
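As an illustration of the first strategy, feature importance is controlled per analysis by the num_top_feature_importance_values option; leaving it unset (or at 0) keeps the computation disabled. This is a sketch with hypothetical names; check the API reference for your exact version:

```console
PUT _ml/data_frame/analytics/my-regression-job
{
  "source": { "index": "my-source-index" },
  "dest": { "index": "my-dest-index" },
  "analysis": {
    "regression": {
      "dependent_variable": "price",
      "num_top_feature_importance_values": 0
    }
  }
}
```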
Analytics training on multi-field values may affect inference
Data frame analytics jobs dynamically select the best field when multi-field values are included. For example, if a multi-field foo is included for training, foo.keyword is actually used. This poses a complication for inference with the inference processor: documents supplied to ingest pipelines are not mapped, so only the field foo is present. Consequently, a model trained with the field foo.keyword does not take the field foo into account. You can work around this limitation by using the field_mappings parameter in the inference processor.
Example:
{
  "inference": {
    "model_id": "my_model_with_multi-fields",
    "field_mappings": {
      "foo": "foo.keyword"
    },
    "inference_config": {
      "regression": {}
    }
  }
}