Classification

This functionality is in technical preview and may be changed or removed in a future release. Elastic will work to fix any issues, but features in technical preview are not subject to the support SLA of official GA features.

Classification is a machine learning process that enables you to predict the class or category of a data point in your data set. Typical examples of classification problems are predicting loan risk, classifying music, or detecting the potential for cancer in a DNA sequence. In the first case, the data set might contain the investment history, employment status, debit status, and other financial details for loan applicants. Based on this data, you could use classification analysis to create a model that predicts whether it is safe or risky to lend money to applicants. In the second case, the data contains song details that enable you to classify music into genres such as hip-hop, country, or classical. Classification predicts discrete, categorical values, whereas regression analysis predicts continuous, numerical values.

When you create a classification job, you must specify which field contains the classes that you want to predict. This field is known as the dependent variable. It must contain no more than 30 classes. By default, all other supported fields are included in the analysis and are known as feature variables. The runtime and resources used by the job increase with the number of feature variables. Therefore, you can optionally include or exclude fields from the analysis. For more information about field selection, see the explain data frame analytics API.
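As a minimal sketch of the configuration described above, the following Python snippet builds a request body for the create data frame analytics jobs API. The index names and field names (loan-applicants, risk_label, applicant_id, notes) are hypothetical, and the overall shape of the body is an assumption that should be checked against the API reference before use.

```python
import json

# Hypothetical index and field names, used only for illustration.
job_config = {
    "source": {"index": "loan-applicants"},
    "dest": {"index": "loan-applicants-predictions"},
    "analysis": {
        "classification": {
            # Field that holds the classes to predict (the dependent variable).
            "dependent_variable": "risk_label",
            # Optional: percentage of eligible documents used for training.
            "training_percent": 80,
        }
    },
    # Optional: exclude fields that should not be used as feature variables.
    "analyzed_fields": {"excludes": ["applicant_id", "notes"]},
}

print(json.dumps(job_config, indent=2))
```

Excluding identifier-like or free-text fields in this way reduces the number of feature variables, which is the lever the paragraph above describes for limiting runtime and resource usage.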

Training the classification model

Classification, just like regression, is a supervised machine learning process. When you create the data frame analytics job, you must provide a data set that contains the ground truth. That is to say, your data set must contain the dependent variable and the feature variable fields that are related to it. You can divide the data set into training and testing data by specifying a training_percent. By default, when you use the create data frame analytics jobs API, 100% of the eligible documents in the data set are used for training. If you divide your data set, the job stratifies the data to ensure that both the training and testing data sets contain classes in proportions that are representative of the class proportions in the full data set.
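To illustrate what stratification means, here is a small Python sketch of a stratified split. It is not the implementation used by the job; the function name stratified_split and the sample labels are assumptions made for the example.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, training_percent, seed=0):
    """Return (train_indices, test_indices), splitting each class separately."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Take the same share of every class for training.
        cut = round(len(indices) * training_percent / 100)
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return sorted(train), sorted(test)


# Illustrative data: 80 "safe" and 20 "risky" data points (a 4:1 ratio).
labels = ["safe"] * 80 + ["risky"] * 20
train, test = stratified_split(labels, training_percent=75)
train_counts = Counter(labels[i] for i in train)
test_counts = Counter(labels[i] for i in test)
print(train_counts, test_counts)
```

In this example both the training set (60 safe, 15 risky) and the testing set (20 safe, 5 risky) keep the 4:1 class ratio of the full data set.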

When you are collecting a data set to train your model, ensure that it captures information for all of the classes. If some classes are poorly represented in the training data set (that is, you have very few data points per class), the model might be unaware of them. In general, complex decision boundaries between classes are harder to learn and require more data points per class in the training data.

Classification algorithms

Classification analysis uses an ensemble algorithm, a type of boosting called boosted tree regression, which combines multiple weak models into a composite one. It uses decision trees to learn to predict the probability that a data point belongs to a certain class. A sequence of decision trees is trained, and each tree learns from the mistakes of the previous one, so each iteration improves on the decisions made so far.
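The idea of trees that learn from earlier mistakes can be sketched with a toy gradient boosting loop in Python. This is a generic illustration using one-split trees (stumps) on a single feature and a log-loss residual, not the algorithm as implemented in the Elastic Stack; all function names, the data, and the parameter values are assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_stump(xs, residuals):
    """Find the single threshold split that best fits the residuals."""
    best = None
    for threshold in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        if not left or not right:
            continue
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = sum((r - left_mean) ** 2 for r in left)
        error += sum((r - right_mean) ** 2 for r in right)
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]


def predict_score(stumps, x, learning_rate):
    """Sum the scaled outputs of all stumps trained so far."""
    return sum(
        learning_rate * (left if x <= threshold else right)
        for threshold, left, right in stumps
    )


def train(xs, ys, n_trees=20, learning_rate=0.5):
    stumps = []
    for _ in range(n_trees):
        # Each new stump is fitted to the mistakes of the current ensemble:
        # the label minus the currently predicted probability.
        residuals = [
            y - sigmoid(predict_score(stumps, x, learning_rate))
            for x, y in zip(xs, ys)
        ]
        stumps.append(fit_stump(xs, residuals))
    return stumps


# Illustrative one-feature data set with binary labels.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [0, 0, 0, 1, 0, 1, 1, 1]
stumps = train(xs, ys)
probabilities = [sigmoid(predict_score(stumps, x, 0.5)) for x in xs]
print([round(p, 2) for p in probabilities])
```

Each stump on its own is a weak model, but the summed scores give a predicted probability of class 1 that is low for small feature values and high for large ones.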

Classification performance

As a rule of thumb, a classification analysis with many classes takes more time to run than a binary classification, which has only two classes. The relationship between the number of classes and the runtime is roughly linear.

For data sets below 200,000 data points, the runtime also scales approximately linearly with the number of documents. Therefore, if you double the number of documents, the runtime of the analysis roughly doubles.

To improve the performance of your classification analysis, consider using a smaller training_percent value when you create the job. That is to say, use a smaller percentage of your documents to train the model more quickly. It is a good strategy to make progress iteratively: use a smaller training percentage first, run the analysis, and evaluate the performance. Then, based on the results, decide whether it is necessary to increase the training_percent value. If possible, prepare your input data so that it has fewer classes. You can also remove fields that are not relevant from the analysis by specifying excludes patterns in the analyzed_fields object when you configure the data frame analytics job.

Interpreting classification results

The following sections help you understand and interpret the results of a classification analysis.

class_probability

The value of class_probability shows how likely it is that a given data point belongs to a certain class. It is a value between 0 and 1. The higher the number, the higher the probability that the data point belongs to the named class. This information is stored in the top_classes array for each document in your destination index. See the Viewing classification results section in the classification example.
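As a brief sketch of how these values might be read, the following Python snippet picks the most probable class from a top_classes array. The document is hypothetical: the values are made up, and the enclosing ml object and the class_name key are assumptions about the layout of the results in the destination index.

```python
# Hypothetical result document; the values are made up for illustration.
result = {
    "ml": {
        "top_classes": [
            {"class_name": "classical", "class_probability": 0.72},
            {"class_name": "country", "class_probability": 0.19},
            {"class_name": "hip-hop", "class_probability": 0.09},
        ]
    }
}

top_classes = result["ml"]["top_classes"]
# The entry with the highest class_probability is the most likely class.
most_likely = max(top_classes, key=lambda entry: entry["class_probability"])
print(most_likely["class_name"], most_likely["class_probability"])
```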

class_score

The value of class_score controls the probability at which a class label is assigned to a data point. In the normal case, where you maximize the number of correct labels, a class label is assigned when its predicted probability is greater than 0.5. The class_score makes it possible to change this behavior, so the threshold can be less than or greater than 0.5. For example, suppose the two classes are denoted class 0 and class 1. The value of class_score is then always non-negative and is defined as:

$\text{class\_score}(\text{class 0}) = \frac{0.5}{1.0 - k} \cdot \text{probability}(\text{class 0})$

$\text{class\_score}(\text{class 1}) = \frac{0.5}{k} \cdot \text{probability}(\text{class 1})$

Here, k is a positive constant less than one. It is the predicted probability of class 1 at which a data point is labeled class 1, and it is chosen to maximize the minimum recall of any class. This is useful, for example, with highly imbalanced data. If class 0 is much more frequent in the training data than class 1, you might achieve the best accuracy by assigning class 0 to every data point, which is equivalent to zero recall for class 1. Instead, the default scheme of the Elastic Stack classification analysis is to choose k < 0.5 and accept a higher rate of errors where actual class 0 is predicted as class 1, or in other words, a slight degradation of the overall accuracy.
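The two-class definitions of class_score can be worked through in a few lines of Python. This is a sketch under stated assumptions: the function name class_scores and the values of k and the probabilities are illustrative, and the label is assumed to go to the class with the higher score.

```python
def class_scores(probability_class_1, k):
    """Apply the two-class class_score definitions for a given k."""
    if not 0.0 < k < 1.0:
        raise ValueError("k must be strictly between 0 and 1")
    probability_class_0 = 1.0 - probability_class_1
    score_0 = 0.5 / (1.0 - k) * probability_class_0
    score_1 = 0.5 / k * probability_class_1
    return score_0, score_1


# Illustrative values: k = 0.2 and a data point with probability(class 1) = 0.3.
score_0, score_1 = class_scores(0.3, k=0.2)
# Assumption: the label goes to the class with the higher score.
label = 1 if score_1 > score_0 else 0
print(score_0, score_1, label)

# When probability(class 1) equals k, both scores are 0.5.
boundary_0, boundary_1 = class_scores(0.2, k=0.2)
```

With k = 0.2, a data point whose probability for class 1 is only 0.3 gets scores of about 0.4375 for class 0 and 0.75 for class 1, so it is labeled class 1 even though class 0 is more probable. The two scores are equal exactly when the probability of class 1 equals k.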

Feature importance

Feature importance provides further information about the results of an analysis and helps you interpret the results in a more nuanced way. For more information, see the feature importance documentation.

Measuring model performance

You can measure how well the model has performed on your data set by using the classification evaluation type of the evaluate data frame analytics API. The metric that the evaluation provides you is a confusion matrix. The more classes you have, the more complex the confusion matrix is. The matrix tells you how many data points that belong to a given class were classified correctly and incorrectly.
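To make the structure of a confusion matrix concrete, here is a minimal Python sketch that counts actual and predicted class pairs. It does not call the evaluate data frame analytics API; the function name confusion_matrix and the sample labels are assumptions for illustration.

```python
from collections import Counter


def confusion_matrix(actual, predicted):
    """Count how often each actual class was predicted as each class."""
    classes = sorted(set(actual) | set(predicted))
    counts = Counter(zip(actual, predicted))
    return {a: {p: counts[(a, p)] for p in classes} for a in classes}


# Illustrative labels for six data points.
actual = ["safe", "safe", "risky", "safe", "risky", "risky"]
predicted = ["safe", "risky", "risky", "safe", "safe", "risky"]
matrix = confusion_matrix(actual, predicted)

for actual_class, row in matrix.items():
    print(actual_class, row)
```

Each outer key is an actual class and each inner key is a predicted class, so matrix["safe"]["risky"] is the number of safe data points that were incorrectly classified as risky. Entries where both keys match are the correct classifications.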

If you split your data set into training and testing data, you can determine how well your model performs on data it has never seen before and compare the prediction to the actual value.

For more information, see Evaluating data frame analytics.