Classification
This functionality is in technical preview and may be changed or removed in a future release. Elastic will work to fix any issues, but features in technical preview are not subject to the support SLA of official GA features.
Classification is a machine learning process for predicting the class or category of a given data point in a dataset. Typical examples of classification problems are predicting loan risk, classifying music, or detecting cancer in a DNA sequence. In the first case, for example, our dataset consists of data on loan applicants that covers investment history, employment status, debt status, and so on. Based on historical data, the classification analysis predicts whether it is safe or risky to lend money to a given loan applicant. In the second case, the data we have represents songs and the analysis – based on the features of the data points – classifies the songs as hip-hop, country, classical, or any other genre available in the set of categories we have. Therefore, classification is for predicting discrete, categorical values, unlike regression analysis, which predicts continuous, numerical values.
From the perspective of the possible output, there are two types of classification: binary and multi-class classification. In binary classification, the variable you want to predict has only two potential values. The loan example above is a binary classification problem where the two potential outputs are safe or risky. The music classification problem is an example of multi-class classification where there are many different potential outputs: one for every possible music genre. In the 7.6.2 version of the Elastic Stack, you can perform only binary classification analysis.
Feature variables
When you perform classification, you must identify a subset of fields that you want to use to create a model for predicting another field value. We refer to these fields as the feature variables and the dependent variable, respectively. Feature variables are the values that the dependent variable value depends on. There are three different types of feature variables that you can use with our classification algorithm: numerical, categorical, and boolean. Arrays are not supported in the feature variable fields.
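As a sketch, the field-type rules above can be illustrated in plain Python. The document and all field names below are hypothetical, not part of any real index:

```python
# A hypothetical source document. The classification algorithm accepts
# numerical, categorical, and boolean fields as feature variables;
# arrays are not supported.
source_doc = {
    "credit_score": 720,              # numerical
    "employment_status": "employed",  # categorical
    "has_defaulted": False,           # boolean
    "previous_loan_ids": [17, 42],    # array: NOT usable as a feature variable
    "loan_status": "safe",            # the dependent variable to predict
}

# Select the fields usable as feature variables: everything except the
# dependent variable and any array-valued fields.
features = {
    name: value
    for name, value in source_doc.items()
    if name != "loan_status" and not isinstance(value, list)
}
print(sorted(features))  # ['credit_score', 'employment_status', 'has_defaulted']
```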
Training the classification model
Classification – just like regression – is a supervised machine learning process. This means that you need to supply a labeled training dataset that has some feature variables and a dependent variable. The classification algorithm learns the relationships between the features and the dependent variable. Once you’ve trained the model on your training dataset, you can reuse the knowledge that the model has learned about these relationships to classify new data. Your training dataset should be approximately balanced, which means the number of data points belonging to the various classes should not differ widely; otherwise, the classification analysis may not provide the best predictions. Read Imbalanced class sizes affect classification performance to learn more.
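A quick balance check on a labeled training dataset can be sketched in a few lines of Python. The dataset here is a hypothetical toy example; the field names are illustrative only:

```python
from collections import Counter

# A toy labeled training dataset: each record carries feature variables and
# the dependent variable ("loan_status") that the model learns to predict.
training_data = [
    {"credit_score": 720, "employed": True,  "loan_status": "safe"},
    {"credit_score": 610, "employed": False, "loan_status": "risky"},
    {"credit_score": 680, "employed": True,  "loan_status": "safe"},
    {"credit_score": 590, "employed": False, "loan_status": "risky"},
]

# Count how many data points belong to each class of the dependent variable.
class_counts = Counter(doc["loan_status"] for doc in training_data)
print(class_counts)

# A rough balance check: the rarest class should not be vastly outnumbered
# by the most frequent one.
balance_ratio = min(class_counts.values()) / max(class_counts.values())
print(f"balance ratio: {balance_ratio:.2f}")  # 1.00 means perfectly balanced
```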
Classification algorithms
The ensemble algorithm that we use in the Elastic Stack is a type of boosting called a boosted tree model, which combines multiple weak models into a composite one. We use decision trees to learn to predict the probability that a data point belongs to a certain class.
Interpreting classification results
The following sections help you understand and interpret the results of a classification analysis.
class_probability
The value of class_probability shows how likely it is that a given data point belongs to a certain class. It is a value between 0 and 1. The higher the number, the higher the probability that the data point belongs to the named class. This information is stored in the top_classes array for each document in your destination index. See the Viewing classification results section in the classification example.
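As an illustration, a destination-index document might be post-processed like this. The document shape below is a hypothetical sketch of the result fields described above; the exact mapping may differ between versions:

```python
# A hypothetical document from the destination index of a binary
# classification job.
result_doc = {
    "ml": {
        "loan_status_prediction": "safe",
        "top_classes": [
            {"class_name": "safe",  "class_probability": 0.87},
            {"class_name": "risky", "class_probability": 0.13},
        ],
    }
}

# Each entry in top_classes carries a class_probability between 0 and 1;
# the predicted class is the one with the highest probability.
top = max(result_doc["ml"]["top_classes"], key=lambda c: c["class_probability"])
print(top["class_name"], top["class_probability"])  # safe 0.87
```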
class_score
The value of class_score controls the probability at which a class label is assigned to a data point. In the normal case – where you maximize the number of correct labels – a class label is assigned when its predicted probability is greater than 0.5. The class_score makes it possible to change this behavior, so the threshold can be less than or greater than 0.5. For example, suppose our two classes are denoted class 0 and class 1; then the value of class_score is always non-negative and its definition is:
class_score(class 0) = 0.5 / (1.0 - k) * probability(class 0)
class_score(class 1) = 0.5 / k * probability(class 1)
Here, k is a positive constant less than one. It represents the predicted probability of class 1 at which a data point is labeled class 1, and it is chosen to maximize the minimum recall of any class. This is useful, for example, in the case of highly imbalanced data. If class 0 is much more frequent in the training data than class 1, you may achieve the best accuracy by assigning class 0 to every data point. This is equivalent to zero recall for class 1. Instead of this behavior, the default scheme of the Elastic Stack classification analysis is to choose k < 0.5 and accept a higher rate of errors where an actual class 0 data point is predicted as class 1, or, in other words, a slight degradation of the overall accuracy.
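The definition above can be checked with a few lines of Python. This is a sketch of the scoring formula only, not of the actual implementation in the Elastic Stack:

```python
def class_scores(p_class1: float, k: float) -> dict:
    """Apply the binary class_score definition from above.

    k is the predicted probability of class 1 at which a data point is
    labeled class 1; k = 0.5 reproduces plain probability thresholding.
    """
    p_class0 = 1.0 - p_class1
    return {
        "class 0": 0.5 / (1.0 - k) * p_class0,
        "class 1": 0.5 / k * p_class1,
    }

# With k = 0.5 the scores simply equal the probabilities.
print(class_scores(0.3, k=0.5))

# With k = 0.2 (e.g. when class 1 is rare), a predicted probability of 0.3
# for class 1 is already enough to make class 1 the higher-scoring label.
scores = class_scores(0.3, k=0.2)
print(max(scores, key=scores.get))  # class 1
```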
Feature importance
Feature importance is calculated for supervised machine learning methods such as regression and classification. This value provides further insight into the results of a data frame analytics job and therefore helps interpret these results. As we mentioned, there are multiple features of a data point that are analyzed during data frame analytics. These features are responsible for a particular prediction to varying degrees. Feature importance shows to what degree a given feature of a data point contributes to the prediction. The feature importance value of a feature can be either positive or negative depending on its effect on the prediction. If the feature reduces the prediction value, the value is negative. If the feature increases the prediction, the feature importance value is positive. The magnitude of the feature importance value shows how significantly the feature affects the prediction, both locally (for a given data point) and globally (for the whole dataset).
Feature importance in the Elastic Stack is calculated using the SHAP (SHapley Additive exPlanations) method as described in Lundberg, S. M., & Lee, S.-I. A Unified Approach to Interpreting Model Predictions. In NeurIPS 2017.
By default, feature importance values are not calculated. To generate this information, you must specify the num_top_feature_importance_values property when you create a data frame analytics job. The feature importance values are stored in the destination index in fields prefixed by ml.feature_importance. The number of feature importance values for each document might be less than the num_top_feature_importance_values property value, because only features that had a positive or negative effect on the prediction are returned.
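A hypothetical example of reading these values back from a destination-index document follows. The field layout and names here are illustrative; the exact structure depends on your job configuration:

```python
# A hypothetical destination-index document; feature importance values are
# stored in fields prefixed by ml.feature_importance.
result_doc = {
    "ml": {
        "loan_status_prediction": "risky",
        "feature_importance": [
            {"feature_name": "credit_score", "importance": -1.2},
            {"feature_name": "employed",     "importance": 0.4},
        ],
    }
}

# Rank the features by the magnitude of their effect on this prediction.
# The sign gives the direction: negative values reduce the prediction,
# positive values increase it.
ranked = sorted(
    result_doc["ml"]["feature_importance"],
    key=lambda f: abs(f["importance"]),
    reverse=True,
)
for f in ranked:
    direction = "increases" if f["importance"] > 0 else "reduces"
    print(f["feature_name"], direction, "the prediction")
```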
Measuring model performance
You can measure how well the model has performed on your dataset by using the classification evaluation type of the evaluate data frame analytics API. The metric that the evaluation provides is the multi-class confusion matrix, which tells you how many times a given data point that belongs to a given class was classified correctly and incorrectly. In other words, how many times a data point that belongs to class X was mistakenly classified as class Y.
Another crucial measurement is how well your model performs on unseen data points. To assess how well the trained model will perform on data it has never seen before, you must set aside a proportion of the training dataset for testing. This held-out portion is the testing dataset. Once the model has been trained, you can let it predict the values of the data points it has never seen before and compare the predictions to the actual values by using the evaluate data frame analytics API.
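The idea behind the confusion matrix can be sketched in a few lines of Python. The labels below are hypothetical; in practice, the counts come from the evaluate data frame analytics API:

```python
from collections import Counter

# Hypothetical actual vs. predicted class labels on a held-out testing set.
actual    = ["safe", "safe", "risky", "risky", "safe", "risky"]
predicted = ["safe", "risky", "risky", "safe", "safe", "risky"]

# The confusion matrix counts, for each actual class, how many times each
# predicted class was assigned; off-diagonal cells are misclassifications.
matrix = Counter(zip(actual, predicted))
for (actual_class, predicted_class), count in sorted(matrix.items()):
    print(f"actual={actual_class} predicted={predicted_class}: {count}")
```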