Predicting flight delays with regression analysis
This functionality is in beta and is subject to change. The design and code are less mature than official GA features and are provided as-is with no warranties. Beta features are not subject to the support SLA of official GA features.
Let’s try to predict flight delays by using the
sample flight data. The data set contains
information such as weather conditions, flight destinations and origins, flight
distances, carriers, and the number of minutes each flight was delayed. When you
create a data frame analytics job for regression analysis, it learns the relationships
between the fields in your data in order to predict the value of a
dependent variable, which in this case is the numeric FlightDelayMin
field.
For an overview of these concepts, see Regression and
Introduction to supervised learning.
Preparing your data
Each document in the data set contains details for a single flight, so this data is ready for analysis; it is already in a two-dimensional entity-based data structure. In general, you often need to transform the data into an entity-centric index before you analyze the data.
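For example, if your source data were recorded as multiple documents per flight, you could use a transform to pivot them into one entity-centric document per flight. The following sketch is purely illustrative (the sample flight data does not require it); the transform ID, group-by field, and aggregations are assumptions chosen for the example:
PUT _transform/flights-per-flight
{
  "source": { "index": "kibana_sample_data_flights" },
  "pivot": {
    "group_by": {
      "FlightNum": { "terms": { "field": "FlightNum" } }
    },
    "aggregations": {
      "total_delay_min": { "sum": { "field": "FlightDelayMin" } },
      "distance_km": { "max": { "field": "DistanceKilometers" } }
    }
  },
  "dest": { "index": "flights-per-flight" }
}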
In order to be analyzed, a document must contain at least one field with a supported data type (numeric, boolean, text, keyword, or ip) and must not contain arrays with more than one item. If your source data consists of some documents that contain the dependent variable and some that do not, the model is trained on the subset of the documents that contain it.
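If you are not sure which fields in your index have supported types, you can inspect the index mappings, for example:
GET kibana_sample_data_flights/_mapping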
Example source document
{ "_index": "kibana_sample_data_flights", "_type": "_doc", "_id": "S-JS1W0BJ7wufFIaPAHe", "_version": 1, "_seq_no": 3356, "_primary_term": 1, "found": true, "_source": { "FlightNum": "N32FE9T", "DestCountry": "JP", "OriginWeather": "Thunder & Lightning", "OriginCityName": "Adelaide", "AvgTicketPrice": 499.08518599798685, "DistanceMiles": 4802.864932998549, "FlightDelay": false, "DestWeather": "Sunny", "Dest": "Chubu Centrair International Airport", "FlightDelayType": "No Delay", "OriginCountry": "AU", "dayOfWeek": 3, "DistanceKilometers": 7729.461862731618, "timestamp": "2019-10-17T11:12:29", "DestLocation": { "lat": "34.85839844", "lon": "136.8049927" }, "DestAirportID": "NGO", "Carrier": "ES-Air", "Cancelled": false, "FlightTimeMin": 454.6742272195069, "Origin": "Adelaide International Airport", "OriginLocation": { "lat": "-34.945", "lon": "138.531006" }, "DestRegion": "SE-BD", "OriginAirportID": "ADL", "OriginRegion": "SE-BD", "DestCityName": "Tokoname", "FlightTimeHour": 7.577903786991782, "FlightDelayMin": 0 } }
The sample flight data is used in this example because it is easily accessible. However, the data has been manually created and contains some inconsistencies. For example, a flight can be both delayed and canceled. This is a good reminder that the quality of your input data affects the quality of your results.
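For instance, a quick way to see how many documents are affected is to count the flights that are marked as both delayed and canceled. The request below is only a sanity check on the sample data and is not part of the analysis:
GET kibana_sample_data_flights/_count
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "FlightDelay": true } },
        { "term": { "Cancelled": true } }
      ]
    }
  }
}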
Creating a regression model
To predict the number of minutes delayed for each flight:
- Verify that your environment is set up properly to use machine learning features. If the Elastic Stack security features are enabled, you need a user that has authority to create and manage data frame analytics jobs. See Setup and security.
- Create a data frame analytics job. You can use the wizard on the Machine Learning > Data Frame Analytics tab in Kibana or the create data frame analytics jobs API.
  - Choose kibana_sample_data_flights as the source index.
  - Choose regression as the job type.
  - Optionally improve the quality of the analysis by adding a query that removes erroneous data. In this case, we omit flights with a distance of 0 kilometers or less.
  - Choose FlightDelayMin as the dependent variable, which is the field that we want to predict with the regression analysis.
  - Add Cancelled, FlightDelay, and FlightDelayType to the list of excluded fields. These fields will be excluded from the analysis. It is recommended to exclude fields that either contain erroneous data or describe the dependent_variable.
    The wizard includes a scatterplot matrix, which enables you to explore the relationships between the numeric fields. The color of each point is affected by the value of the dependent variable for that document, as shown in the legend. You can use this matrix to help you decide which fields to include or exclude from the analysis.
    If you want these charts to represent data from a larger sample size or from a randomized selection of documents, you can change the default behavior. However, a larger sample size might slow down the performance of the matrix, and a randomized selection might put more load on the cluster due to the more intensive query.
  - Choose a training percent of 90, which means it randomly selects 90% of the source data for training.
  - If you want to experiment with feature importance, specify a value in the advanced configuration options. In this example, we choose to return a maximum of 5 feature importance values per document. This option affects the speed of the analysis, so it is disabled by default.
  - Use the default memory limit for the job. If the job requires more than this amount of memory, it fails to start. If the available memory on the node is limited, this setting makes it possible to prevent job execution.
  - Add a job ID and optionally a job description.
  - Add the name of the destination index that will contain the results of the analysis. In Kibana, the index name matches the job ID by default. It will contain a copy of the source index data, where each document is annotated with the results. If the index does not exist, it will be created automatically.
API example
PUT _ml/data_frame/analytics/model-flight-delays
{
  "source": {
    "index": [
      "kibana_sample_data_flights"
    ],
    "query": {
      "range": {
        "DistanceKilometers": {
          "gt": 0
        }
      }
    }
  },
  "dest": {
    "index": "model-flight-delays"
  },
  "analysis": {
    "regression": {
      "dependent_variable": "FlightDelayMin",
      "training_percent": 90
    }
  },
  "analyzed_fields": {
    "includes": [],
    "excludes": [
      "Cancelled",
      "FlightDelay",
      "FlightDelayType"
    ]
  }
}
- Start the job in Kibana or use the start data frame analytics jobs API.
  The job takes a few minutes to run. Runtime depends on the local hardware and also on the number of documents and fields that are analyzed. The more fields and documents, the longer the job runs. It stops automatically when the analysis is complete.
API example
POST _ml/data_frame/analytics/model-flight-delays/_start
- Check the job stats to follow the progress in Kibana or use the get data frame analytics jobs statistics API.
  When the job stops, the results are ready to view and evaluate. To learn more about the job phases, see How it works.
API example
GET _ml/data_frame/analytics/model-flight-delays/_stats
The API call returns the following response:
{ "count" : 1, "data_frame_analytics" : [ { "id" : "model-flight-delays", "state" : "stopped", "progress" : [ { "phase" : "reindexing", "progress_percent" : 100 }, { "phase" : "loading_data", "progress_percent" : 100 }, { "phase" : "feature_selection", "progress_percent" : 100 }, { "phase" : "coarse_parameter_search", "progress_percent" : 100 }, { "phase" : "fine_tuning_parameters", "progress_percent" : 100 }, { "phase" : "final_training", "progress_percent" : 100 }, { "phase" : "writing_results", "progress_percent" : 100 }, { "phase" : "inference", "progress_percent" : 100 } ], "data_counts" : { "training_docs_count" : 11210, "test_docs_count" : 1246, "skipped_docs_count" : 0 }, "memory_usage" : { "timestamp" : 1599773614155, "peak_usage_bytes" : 50156565, "status" : "ok" }, "analysis_stats" : { "regression_stats" : { "timestamp" : 1599773614155, "iteration" : 18, "hyperparameters" : { "alpha" : 19042.721566629778, "downsample_factor" : 0.911884068909842, "eta" : 0.02331774683318904, "eta_growth_rate_per_tree" : 1.0143154178910303, "feature_bag_fraction" : 0.5504020748926737, "gamma" : 53.373570122718846, "lambda" : 2.94058933878574, "max_attempts_to_add_tree" : 3, "max_optimization_rounds_per_hyperparameter" : 2, "max_trees" : 894, "num_folds" : 4, "num_splits_per_feature" : 75, "soft_tree_depth_limit" : 2.945317520946171, "soft_tree_depth_tolerance" : 0.13448633124842999 }, "timing_stats" : { "elapsed_time" : 302959, "iteration_time" : 13075 }, "validation_loss" : { "loss_type" : "mse" } } } } ] }
Viewing regression results
Now you have a new index that contains a copy of your source data with predictions for your dependent variable.
When you view the regression results in Kibana, the contents of the destination index are shown in a tabular format. In this example, the table shows a column for the dependent variable (FlightDelayMin), which contains the ground truth values that we are trying to predict with the regression analysis. It also shows a column for the prediction values (ml.FlightDelayMin_prediction) and a column that indicates whether the document was used in the training set (ml.is_training). You can filter the table to show only testing or training data, and you can select which fields are shown in the table. You can also enable histogram charts to get a better understanding of the distribution of values in your data.
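If you prefer the API over the Kibana table, you can retrieve a comparable view directly from the destination index. The sketch below assumes the job and field names used in this example and returns a handful of testing documents with the actual and predicted delay:
GET model-flight-delays/_search
{
  "size": 5,
  "_source": [ "FlightDelayMin", "ml.FlightDelayMin_prediction", "ml.is_training" ],
  "query": {
    "term": { "ml.is_training": false }
  }
}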
If you chose to calculate feature importance, the destination index also contains ml.feature_importance objects. Every field that is included in the regression analysis (known as a feature of the data point) is assigned a feature importance value. This value has both a magnitude and a direction (positive or negative), which indicates how each field affects a particular prediction. Only the most significant values (in this case, the top 5) are stored in the index. However, the trained model metadata also contains the average magnitude of the feature importance values for each field across all the training data. You can view this summarized information in Kibana.
You can also see the feature importance values for each individual prediction in the form of a decision plot:
The decision path starts at a baseline, which is the average of the predictions for all the data points in the training data set. From there, the feature importance values are added to the decision path until it arrives at its final prediction. The features with the most significant positive or negative impact appear at the top. Thus in this example, the features related to the flight distance had the most significant influence on this particular predicted flight delay. This type of information can help you to understand how models arrive at their predictions. It can also indicate which aspects of your data set are most influential or least useful when you are training and tuning your model.
If you do not use Kibana, you can see summarized feature importance values by using the get trained model API and the individual values by searching the destination index.
API example
GET _ml/inference/model-flight-delays*?include=total_feature_importance,feature_importance_baseline
The snippet below shows an example of the total feature importance details in the trained model metadata:
{ "count" : 1, "trained_model_configs" : [ { "model_id" : "model-flight-delays-1601312043770", ... "metadata" : { ... "feature_importance_baseline" : { "baseline" : 47.43643652716527 }, "total_feature_importance" : [ { "feature_name" : "dayOfWeek", "importance" : { "mean_magnitude" : 0.38674590521018903, "min" : -9.42823116446923, "max" : 8.707461689065173 } }, { "feature_name" : "OriginWeather", "importance" : { "mean_magnitude" : 0.18548393012368913, "min" : -9.079576266629092, "max" : 5.142479101907649 } ...
In this response, baseline is the starting point for the feature importance decision path; it is the average of the prediction values across all the training data. For each field, mean_magnitude is the average of the absolute feature importance values across all the training data, min is the minimum feature importance value across all the training data for that field, and max is the maximum feature importance value across all the training data for that field.
To see the top feature importance values for each prediction, search the destination index. For example:
GET model-flight-delays/_search
The snippet below shows a part of a document with the annotated results:
... "DestCountry" : "CH", "DestRegion" : "CH-ZH", "OriginAirportID" : "VIE", "DestCityName" : "Zurich", "ml": { "FlightDelayMin_prediction": 277.5392150878906, "feature_importance": [ { "feature_name": "DestCityName", "importance": 0.6285966753441136 }, { "feature_name": "DistanceKilometers", "importance": 84.4982943868267 }, { "feature_name": "DistanceMiles", "importance": 103.90011847132116 }, { "feature_name": "FlightTimeHour", "importance": 3.7119156097309345 }, { "feature_name": "FlightTimeMin", "importance": 38.700587425831365 } ], "is_training": true } ...
Evaluating regression results
Though you can look at individual results and compare the predicted value
(ml.FlightDelayMin_prediction) to the actual value (FlightDelayMin), you
typically need to evaluate the success of the regression model as a whole.
Kibana provides training error metrics, which represent how well the model performed on the training data set. It also provides generalization error metrics, which represent how well the model performed on testing data.
A mean squared error (MSE) of zero means that the model predicts the dependent variable with perfect accuracy. This is the ideal, but it is typically not possible. Likewise, an R-squared value of 1 indicates that all of the variance in the dependent variable can be explained by the feature variables. Typically, you compare the MSE and R-squared values from multiple regression models to find the best balance or fit for your data.
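For reference, these metrics follow the standard definitions, where y_i is the actual delay for document i, ŷ_i is the predicted delay, ȳ is the mean of the actual values, and n is the number of documents evaluated:
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2
\qquad
R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}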
For more information about interpreting the evaluation metrics, see Regression evaluation.
You can alternatively generate these metrics with the data frame analytics evaluate API.
API example
POST _ml/data_frame/_evaluate
{
  "index": "model-flight-delays",
  "query": {
    "bool": {
      "filter": [
        { "term": { "ml.is_training": true } }
      ]
    }
  },
  "evaluation": {
    "regression": {
      "actual_field": "FlightDelayMin",
      "predicted_field": "ml.FlightDelayMin_prediction",
      "metrics": {
        "r_squared": {},
        "mse": {},
        "msle": {},
        "huber": {}
      }
    }
  }
}
In this request, the query that filters on ml.is_training: true calculates the training error by evaluating only the training data. The actual_field is the field that contains the actual (ground truth) value, and the predicted_field is the field that contains the predicted value.
The API returns a response like this:
{ "regression" : { "huber" : { "value" : 30.216037330465102 }, "mse" : { "value" : 2847.2211476422967 }, "msle" : { "value" : "NaN" }, "r_squared" : { "value" : 0.6956530017255125 } } }
Next, we calculate the generalization error by evaluating only the testing data, that is, the documents where ml.is_training is false:
POST _ml/data_frame/_evaluate
{
  "index": "model-flight-delays",
  "query": {
    "bool": {
      "filter": [
        { "term": { "ml.is_training": false } }
      ]
    }
  },
  "evaluation": {
    "regression": {
      "actual_field": "FlightDelayMin",
      "predicted_field": "ml.FlightDelayMin_prediction",
      "metrics": {
        "r_squared": {},
        "mse": {},
        "msle": {},
        "huber": {}
      }
    }
  }
}
When you have trained a satisfactory model, you can deploy it to make predictions about new data. Those steps are not covered in this example. See Inference.
If you don’t want to keep the data frame analytics job, you can delete it. For example, use Kibana or the delete data frame analytics job API. When you delete data frame analytics jobs in Kibana, you have the option to also remove the destination indices and index patterns.
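API example (assuming you no longer need the job created above):
DELETE _ml/data_frame/analytics/model-flight-delays
If you also want to remove the results, you can then delete the destination index:
DELETE model-flight-delays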