Data frame analytics job resources

edit

Data frame analytics job resources

edit

Data frame analytics resources relate to APIs such as Create data frame analytics jobs and Get data frame analytics jobs.

Properties

edit
analysis
(object) The type of analysis that is performed on the source. For example: outlier_detection or regression. For more information, see Analysis objects.
analyzed_fields

(object) You can specify both includes and/or excludes patterns. If analyzed_fields is not set, only the relevant fields will be included. For example, all the numeric fields for outlier detection. For the supported field types, see Supported fields.

includes
(array) An array of strings that defines the fields that will be included in the analysis.
excludes
(array) An array of strings that defines the fields that will be excluded from the analysis.
PUT _ml/data_frame/analytics/loganalytics
{
  "source": {
    "index": "logdata"
  },
  "dest": {
    "index": "logdata_out"
  },
  "analysis": {
    "outlier_detection": {
    }
  },
  "analyzed_fields": {
        "includes": [ "request.bytes", "response.counts.error" ],
        "excludes": [ "source.geo" ]
  }
}
description
(Optional, string) A description of the job.
dest

(object) The destination configuration of the analysis.

index
(Required, string) Defines the destination index to store the results of the data frame analytics job.
results_field
(Optional, string) Defines the name of the field in which to store the results of the analysis. Default to ml.
id
(string) The unique identifier for the data frame analytics job. This identifier can contain lowercase alphanumeric characters (a-z and 0-9), hyphens, and underscores. It must start and end with alphanumeric characters. This property is informational; you cannot change the identifier for existing jobs.
model_memory_limit
(string) The approximate maximum amount of memory resources that are permitted for analytical processing. The default value for data frame analytics jobs is 1gb. If your elasticsearch.yml file contains an xpack.ml.max_model_memory_limit setting, an error occurs when you try to create data frame analytics jobs that have model_memory_limit values greater than that setting. For more information, see Machine learning settings.
source

(object) The source configuration consisting an index and optionally a query object.

index
(Required, string or array) Index or indices on which to perform the analysis. It can be a single index or index pattern as well as an array of indices or patterns.
query
(Optional, object) The Elasticsearch query domain-specific language (DSL). This value corresponds to the query object in an Elasticsearch search POST body. All the options that are supported by Elasticsearch can be used, as this object is passed verbatim to Elasticsearch. By default, this property has the following value: {"match_all": {}}.

Analysis objects

edit

Data frame analytics resources contain analysis objects. For example, when you create a data frame analytics job, you must define the type of analysis it performs.

Outlier detection configuration objects

edit

An outlier_detection configuration object has the following properties:

feature_influence_threshold
(double) The minimum outlier score that a document needs to have in order to calculate its feature influence score. Value range: 0-1 (0.1 by default).
method
(string) Sets the method that outlier detection uses. If the method is not set outlier detection uses an ensemble of different methods and normalises and combines their individual outlier scores to obtain the overall outlier score. We recommend to use the ensemble method. Available methods are lof, ldof, distance_kth_nn, distance_knn.
n_neighbors
(integer) Defines the value for how many nearest neighbors each method of outlier detection will use to calculate its outlier score. When the value is not set, different values will be used for different ensemble members. This helps improve diversity in the ensemble. Therefore, only override this if you are confident that the value you choose is appropriate for the data set.

Regression configuration objects

edit
PUT _ml/data_frame/analytics/house_price_regression_analysis
{
  "source": {
    "index": "houses_sold_last_10_yrs" 
  },
  "dest": {
    "index": "house_price_predictions" 
  },
  "analysis":
    {
      "regression": { 
        "dependent_variable": "price" 
      }
    }
}

Training data is taken from source index houses_sold_last_10_yrs.

Analysis results will be output to destination index house_price_predictions.

The regression analysis configuration object.

Regression analysis will use field price to train on. As no other parameters have been specified it will train on 100% of eligible data, store its prediction in destination index field price_prediction and use in-built hyperparameter optimization to give minimum validation errors.

Standard parameters
edit
dependent_variable
(Required, string) Defines which field of the document is to be predicted. This parameter is supplied by field name and must match one of the fields in the index being used to train. If this field is missing from a document, then that document will not be used for training, but a prediction with the trained model will be generated for it. The data type of the field must be numeric. It is also known as continuous target variable.
prediction_field_name
(Optional, string) Defines the name of the prediction field in the results. Defaults to <dependent_variable>_prediction.
training_percent
(Optional, integer) Defines what percentage of the eligible documents that will be used for training. Documents that are ignored by the analysis (for example those that contain arrays) won’t be included in the calculation for used percentage. Defaults to 100.
Advanced parameters
edit

Advanced parameters are for fine-tuning regression analysis. They are set automatically by hyperparameter optimization to give minimum validation error. It is highly recommended to use the default values unless you fully understand the function of these parameters. If these parameters are not supplied, their values are automatically tuned to give minimum validation error.

eta
(Optional, double) The shrinkage applied to the weights. Smaller values result in larger forests which have better generalization error. However, the smaller the value the longer the training will take. For more information, see this wiki article about shrinkage.
feature_bag_fraction
(Optional, double) Defines the fraction of features that will be used when selecting a random bag for each candidate split.
maximum_number_trees
(Optional, integer) Defines the maximum number of trees the forest is allowed to contain. The maximum value is 2000.
gamma
(Optional, double) Regularization parameter to prevent overfitting on the training dataset. Multiplies a linear penalty associated with the size of individual trees in the forest. The higher the value the more training will prefer smaller trees. The smaller this parameter the larger individual trees will be and the longer train will take.
lambda
(Optional, double) Regularization parameter to prevent overfitting on the training dataset. Multiplies an L2 regularisation term which applies to leaf weights of the individual trees in the forest. The higher the value the more training will attempt to keep leaf weights small. This makes the prediction function smoother at the expense of potentially not being able to capture relevant relationships between the features and the dependent variable. The smaller this parameter the larger individual trees will be and the longer train will take.

Hyperparameter optimization

edit

If you don’t supply regression parameters, hyperparameter optimization will be performed by default to set a value for the undefined parameters. The starting point is calculated for data dependent parameters by examining the loss on the training data. Subject to the size constraint, this operation provides an upper bound on the improvement in validation loss.

A fixed number of rounds is used for optimization which depends on the number of parameters being optimized. The optimitazion starts with random search, then Bayesian Optimisation is performed that is targeting maximum expected improvement. If you override any parameters, then the optimization will calculate the value of the remaining parameters accordingly and use the value you provided for the overridden parameter. The number of rounds are reduced respectively. The validation error is estimated in each round by using 4-fold cross validation.