Generating alerts for anomaly detection jobs

Elastic Stack Serverless

Kibana alerting features include support for machine learning rules, which run scheduled checks for anomalies in one or more anomaly detection jobs or check the health of the job with certain conditions. If the conditions of the rule are met, an alert is created and the associated action is triggered. For example, you can create a rule to check an anomaly detection job every fifteen minutes for critical anomalies and to notify you in an email. To learn more about Kibana alerting features, refer to Alerting.

The following machine learning rules are available:

Anomaly detection alert
Checks if the anomaly detection job results contain anomalies that match the rule conditions.
Anomaly detection jobs health
Monitors job health and alerts if an operational issue occurred that may prevent the job from detecting anomalies.
Tip

If you have created rules for specific anomaly detection jobs and you want to monitor whether these jobs work as expected, anomaly detection jobs health rules are ideal for this purpose.

In Stack Management > Rules, you can create both types of machine learning rules. In the Machine Learning app, you can create only anomaly detection alert rules; create them from the anomaly detection job wizard after you start the job or from the anomaly detection job list.

When you create an anomaly detection alert rule, you must select the job that the rule applies to.

You must also select a type of machine learning result. In particular, you can create rules based on bucket, record, or influencer results.

Selecting result type, severity, and test interval

For each rule, you can configure the anomaly_score that triggers the action. The anomaly_score indicates the significance of a given anomaly compared to previous anomalies. The default severity threshold is 75, which means every anomaly with an anomaly_score of 75 or higher triggers the associated action.

You can select whether you want to include interim results. Interim results are created by the anomaly detection job before a bucket is finalized. These results might disappear after the bucket is fully processed. Include interim results if you want to be notified earlier about a potential anomaly even if it might be a false positive. If you want to get notified only about anomalies of fully processed buckets, do not include interim results.

You can also configure advanced settings. Lookback interval sets an interval that is used to query previous anomalies during each condition check. Its value is derived from the bucket span of the job and the query delay of the datafeed by default. It is not recommended to set the lookback interval lower than the default value as it might result in missed anomalies. Number of latest buckets sets how many buckets to check to obtain the highest anomaly from all the anomalies that are found during the Lookback interval. An alert is created based on the anomaly with the highest anomaly score from the most anomalous bucket.

You can also test the configured conditions against your existing data and check the sample results by providing a valid interval for your data. The generated preview contains the number of potentially created alerts during the relative time range you defined.

Tip

You must also provide a check interval that defines how often to evaluate the rule conditions. It is recommended to select an interval that is close to the bucket span of the job.

As the last step in the rule creation process, define its actions.
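
If you prefer to create the rule programmatically instead of through the wizard, the settings described above map onto the Kibana create rule API. The following Python sketch is a minimal example, assuming the xpack.ml.anomaly_detection_alert rule type; the params field names, the consumer value, and all IDs and credentials shown here are illustrative assumptions, so verify them against your Kibana version before use.

import requests

KIBANA_URL = "https://localhost:5601"            # hypothetical Kibana endpoint
AUTH = ("elastic", "changeme")                   # placeholder credentials

# Sketch: a rule that checks record results every 15 minutes and triggers
# on anomalies with an anomaly_score of 75 or higher.
rule = {
    "name": "my-anomaly-alert",                  # hypothetical rule name
    "rule_type_id": "xpack.ml.anomaly_detection_alert",
    "consumer": "alerts",                        # assumption; may differ per setup
    "schedule": {"interval": "15m"},             # keep close to the job's bucket span
    "params": {
        # Field names below are assumptions about the rule type's params schema.
        "jobSelection": {"jobIds": ["my_job_id"]},   # hypothetical job ID
        "resultType": "record",                  # bucket | record | influencer
        "severity": 75,                          # anomaly_score threshold
        "includeInterim": False,                 # only alert on finalized buckets
        "lookbackInterval": None,                # None = derive from bucket span and query delay
        "topNBuckets": 1,                        # number of latest buckets to check
    },
    "actions": [],                               # actions are defined in a later step
}

response = requests.post(
    f"{KIBANA_URL}/api/alerting/rule",
    json=rule,
    auth=AUTH,
    headers={"kbn-xsrf": "true"},                # required by Kibana's HTTP API
)
response.raise_for_status()
print(response.json()["id"])                     # ID of the created rule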

When you create an anomaly detection jobs health rule, you must select the job or group that the rule applies to. If you assign more jobs to the group, they are included the next time the rule conditions are checked.

You can also use a wildcard character (*) to apply the rule to all your jobs. Jobs created after the rule is created are automatically included. You can exclude jobs that are not critically important by using the Exclude field.

Enable the health check types that you want to apply. All checks are enabled by default. At least one check needs to be enabled to create the rule. The following health checks are available:

Datafeed is not started
Notifies if the corresponding datafeed of the job is not started but the job is in an opened state. The notification message recommends the necessary actions to solve the error.
Model memory limit reached
Notifies if the model memory status of the job reaches the soft or hard model memory limit. Optimize your job by following these guidelines or consider amending the model memory limit.
Data delay has occurred
Notifies when the job missed some data. You can define the threshold for the amount of missing documents you get alerted on by setting Number of documents. You can control the lookback interval for checking delayed data with Time interval. Refer to the Handling delayed data page to see what to do about delayed data.
Errors in job messages
Notifies when the job messages contain error messages. Review the notification; it contains the error messages, the corresponding job IDs, and recommendations on how to fix the issue. This check looks for job errors that occur after the rule is created; it does not look at historical behavior.
Selecting health checkers
Tip

You must also provide a check interval that defines how often to evaluate the rule conditions. It is recommended to select an interval that is close to the bucket span of the job.

As the last step in the rule creation process, define its actions.
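
As with the anomaly detection alert rule, the same settings can be expressed through the Kibana create rule API. The sketch below is a rough illustration, assuming the xpack.ml.anomaly_detection_jobs_health rule type; the params field names (includeJobs, excludeJobs, testsConfig) and the consumer value are assumptions and may vary between versions.

import requests

KIBANA_URL = "https://localhost:5601"            # hypothetical Kibana endpoint
AUTH = ("elastic", "changeme")                   # placeholder credentials

# Sketch: a jobs health rule that covers every job except one non-critical job
# and enables all four health checks.
rule = {
    "name": "ml-jobs-health",                    # hypothetical rule name
    "rule_type_id": "xpack.ml.anomaly_detection_jobs_health",
    "consumer": "alerts",                        # assumption; may differ per setup
    "schedule": {"interval": "15m"},             # keep close to the job's bucket span
    "params": {
        # Field names below are assumptions about the rule type's params schema.
        "includeJobs": {"jobIds": ["*"]},        # wildcard: apply the rule to all jobs
        "excludeJobs": {"jobIds": ["test_job"]}, # hypothetical non-critical job to skip
        "testsConfig": {
            "datafeed": {"enabled": True},       # Datafeed is not started
            "mml": {"enabled": True},            # Model memory limit reached
            "delayedData": {                     # Data delay has occurred
                "enabled": True,
                "docsCount": 5,                  # Number of documents threshold
                "timeInterval": "2h",            # lookback interval for delayed data
            },
            "errorMessages": {"enabled": True},  # Errors in job messages
        },
    },
    "actions": [],
}

requests.post(
    f"{KIBANA_URL}/api/alerting/rule",
    json=rule,
    auth=AUTH,
    headers={"kbn-xsrf": "true"},
).raise_for_status()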

You can optionally send notifications when the rule conditions are met and when they are no longer met. In particular, these rules support:

  • alert summaries
  • actions that run when the anomaly score matches the conditions (for anomaly detection alert rules)
  • actions that run when an issue is detected (for anomaly detection jobs health rules)
  • recovery actions that run when the conditions are no longer met

Each action uses a connector, which stores connection information for a Kibana service or supported third-party integration, depending on where you want to send the notifications. For example, you can use a Slack connector to send a message to a channel. Or you can use an index connector that writes a JSON object to a specific index. For details about creating connectors, refer to Connectors.
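
Connectors can also be created through the Kibana connectors API. The snippet below is a small sketch of an index connector; the connector name and destination index are hypothetical, and you would use a different connector type (for example, a Slack connector with a webhook URL in its secrets) if you want channel notifications instead.

import requests

connector = {
    "name": "ml-alert-notifications",            # hypothetical connector name
    "connector_type_id": ".index",               # writes one JSON document per action
    "config": {"index": "ml-alert-notifications"},  # hypothetical destination index
}

response = requests.post(
    "https://localhost:5601/api/actions/connector",  # hypothetical Kibana endpoint
    json=connector,
    auth=("elastic", "changeme"),                # placeholder credentials
    headers={"kbn-xsrf": "true"},
)
response.raise_for_status()
connector_id = response.json()["id"]             # reference this ID from the rule's actions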

After you select a connector, you must set the action frequency. You can choose to create a summary of alerts on each check interval or on a custom interval. For example, send Slack notifications that summarize the new, ongoing, and recovered alerts:

Adding an alert summary action to the rule
Tip

If you choose a custom action interval, it cannot be shorter than the rule's check interval.

Alternatively, you can set the action frequency such that actions run for each alert. Choose how often the action runs (at each check interval, only when the alert status changes, or at a custom action interval). For anomaly detection alert rules, you must also choose whether the action runs when the anomaly score matches the condition or when the alert recovers:

Adding an action for each alert in the rule

In anomaly detection jobs health rules, choose whether the action runs when the issue is detected or when it is recovered:

Adding an action for each alert in the rule

You can further refine the rule by specifying that actions run only when they match a KQL query or when an alert occurs within a specific time frame.
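
When rules are created through the API rather than the UI, these same choices appear in the rule's actions array. The sketch below shows one summary action on a custom interval and one per-alert action restricted by a KQL query and a time frame; the frequency and alerts_filter fields follow the Kibana alerting API, but the action group name (anomaly_score_match) and the filter field (kibana.alert.job_id) are assumptions for this rule type and should be checked against your deployment.

# Sketch of the actions array for an anomaly detection alert rule.
slack_connector_id = "aaaa-bbbb-cccc"            # hypothetical connector ID

actions = [
    {   # Summary action: one notification per custom interval covering
        # new, ongoing, and recovered alerts.
        "id": slack_connector_id,
        "group": "anomaly_score_match",          # assumption: the rule type's action group
        "params": {"message": "{{context.message}}"},
        "frequency": {
            "summary": True,
            "notify_when": "onThrottleInterval",
            "throttle": "1h",                    # must not be shorter than the check interval
        },
    },
    {   # Per-alert action that runs only when the alert status changes,
        # filtered by a KQL query and a weekday time frame.
        "id": slack_connector_id,
        "group": "anomaly_score_match",
        "params": {"message": "{{context.message}}"},
        "frequency": {
            "summary": False,
            "notify_when": "onActionGroupChange",
            "throttle": None,
        },
        "alerts_filter": {
            "query": {"kql": "kibana.alert.job_id: my_job_id", "filters": []},
            "timeframe": {
                "days": [1, 2, 3, 4, 5],         # Monday through Friday
                "hours": {"start": "08:00", "end": "18:00"},
                "timezone": "UTC",
            },
        },
    },
]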

There is a set of variables that you can use to customize the notification messages for each action. Click the icon above the message text box to get the list of variables or refer to action variables. For example:

Customizing your message

After you save the configurations, the rule appears in the Stack Management > Rules list; you can check its status and see the overview of its configuration information.

When an alert occurs for an anomaly detection alert rule, its name is always the same as the job ID of the anomaly detection job that triggered it. You can review how the alerts that occurred correlate with the anomaly detection results in the Anomaly Explorer by using the Anomaly timeline swim lane and the Alerts panel.

If necessary, you can snooze rules to prevent them from generating actions. For more details, refer to Snooze and disable rules.

The following variables are specific to the machine learning rule types. An asterisk (*) marks the variables that you can use in actions related to recovered alerts.

You can also specify variables common to all rules.

Every anomaly detection alert has the following action variables:

context.anomalyExplorerUrl*
URL to open in the Anomaly Explorer.
context.isInterim
Indicates if top hits contain interim results.
context.jobIds*
List of job IDs that triggered the alert.
context.message*
A preconstructed message for the alert.
context.score
Anomaly score at the time of the notification action.
context.timestamp
The bucket timestamp of the anomaly.
context.timestampIso8601
The bucket timestamp of the anomaly in ISO8601 format.
context.topInfluencers
The list of top influencers. Limited to a maximum of 3 documents.
context.topRecords
The list of top records. Limited to a maximum of 3 documents.
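
For example, an action message for an anomaly detection alert rule might combine several of these variables with Mustache syntax. The template below is only an illustration; {{rule.name}} is one of the variables common to all rules mentioned above, and the overall wording is hypothetical.

# Hypothetical notification message built from the documented action variables.
message = (
    "[{{rule.name}}] Anomalies detected for job(s) {{context.jobIds}}\n"
    "Score {{context.score}} at {{context.timestampIso8601}}\n"
    "Top records: {{context.topRecords}}\n"
    "Open in the Anomaly Explorer: {{context.anomalyExplorerUrl}}\n"
    "{{context.message}}"
)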

Every health check has two main variables: context.message and context.results. The properties of context.results vary based on the type of check that triggered the alert.

context.message*
A preconstructed message for the alert.
context.results
The results of the health check. Its properties depend on the check type (Datafeed is not started, Model memory limit reached, Data delay has occurred, or Errors in job messages). For the Data delay has occurred check, context.results for recovered alerts is either empty (when there is no delayed data) or the same as for an active alert (when the number of missing documents is less than the Number of documents threshold set by the user).