Troubleshooting anomaly detection

Use the information in this section to troubleshoot common problems and known issues.

Unintended anomaly detection job closures on upgrade

When you perform a rolling upgrade to or from versions 7.14.0 or 7.14.1, you may find that anomaly detection jobs that were open during the upgrade are incorrectly closed after the upgrade.

Symptoms:

  • Some (but not necessarily all) anomaly detection jobs that were in the opened state before the upgrade are closed after the upgrade. The response from the get anomaly detection jobs API for these jobs contains a blocked property with revert as its reason.
  • The datafeed associated with a closed anomaly detection job is in the started state; this combination should be impossible.
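
The two symptoms above can be checked together with a small script. The following is a minimal sketch in Python, not part of the product: `find_affected_jobs` is a hypothetical helper, and it assumes you have already fetched and parsed the JSON bodies of the get jobs, get job stats, get datafeeds, and get datafeed stats APIs, and that those bodies carry the `jobs`, `datafeeds`, `job_id`, `datafeed_id`, `state`, and `blocked.reason` fields used below.

```python
def find_affected_jobs(jobs, job_stats, datafeeds, datafeed_stats):
    """Return the IDs of jobs that match both symptoms.

    Each argument is the parsed JSON body of one response (assumed shapes):
      jobs           -> GET _ml/anomaly_detectors
      job_stats      -> GET _ml/anomaly_detectors/_stats
      datafeeds      -> GET _ml/datafeeds
      datafeed_stats -> GET _ml/datafeeds/_stats
    """
    # Jobs whose configuration reports a block with reason "revert".
    blocked_on_revert = {
        job["job_id"]
        for job in jobs.get("jobs", [])
        if job.get("blocked", {}).get("reason") == "revert"
    }
    # Jobs that are currently closed.
    closed = {
        stats["job_id"]
        for stats in job_stats.get("jobs", [])
        if stats.get("state") == "closed"
    }
    # Datafeeds that are currently started.
    started_feeds = {
        stats["datafeed_id"]
        for stats in datafeed_stats.get("datafeeds", [])
        if stats.get("state") == "started"
    }
    # Jobs whose datafeed is started.
    jobs_with_started_feed = {
        feed["job_id"]
        for feed in datafeeds.get("datafeeds", [])
        if feed["datafeed_id"] in started_feeds
    }
    # A job is reported only if it shows every symptom.
    return sorted(blocked_on_revert & closed & jobs_with_started_feed)
```

The function deliberately requires all three conditions, so a job that is merely closed, or whose datafeed you stopped on purpose, is not reported.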

Resolution:

To avoid this problem, enable machine learning upgrade mode before you start the rolling upgrade and disable it after the rolling upgrade is complete. Do not enable and disable machine learning upgrade mode more than once; enable it before upgrading the first node of the rolling upgrade and disable it after upgrading the last node. It is only safe to enable machine learning upgrade mode again after all anomaly detection jobs that were opened have been assigned to nodes and fully recovered; this may take 30 minutes in large environments.

To remediate the problem if you experience it:

  1. Force-stop the started datafeed associated with the closed anomaly detection job by calling the stop datafeeds API with force set to true.
  2. Complete the revert operation that the anomaly detection job is blocked on by calling the revert model snapshots API with delete_intervening_results set to true. To find the appropriate model snapshot to revert to, open the "Job Messages" tab for the anomaly detection job in Kibana and look for the model snapshot reversion that started during your rolling upgrade.
  3. Open the incorrectly closed anomaly detection job.
  4. Start the associated datafeed.

Steps 3 and 4 can be done by clicking the start button for the job in Kibana.
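
The four remediation steps can be written out as an ordered list of requests. The following Python sketch only builds the requests and does not send them. `remediation_requests` and `to_console` are hypothetical helper names, the three IDs are placeholders you must supply, and the endpoint paths reflect the stop datafeeds, revert model snapshots, open jobs, and start datafeeds APIs as I understand them; verify them against the API reference for your version.

```python
import json


def remediation_requests(job_id, datafeed_id, snapshot_id):
    """Build the four requests, in order, as (method, path, body) tuples."""
    job = f"_ml/anomaly_detectors/{job_id}"
    feed = f"_ml/datafeeds/{datafeed_id}"
    return [
        # 1. Force-stop the started datafeed.
        ("POST", f"{feed}/_stop", {"force": True}),
        # 2. Complete the revert that the job is blocked on.
        ("POST", f"{job}/model_snapshots/{snapshot_id}/_revert",
         {"delete_intervening_results": True}),
        # 3. Open the incorrectly closed job.
        ("POST", f"{job}/_open", None),
        # 4. Start the associated datafeed.
        ("POST", f"{feed}/_start", None),
    ]


def to_console(requests):
    """Render the requests as text that can be pasted into the Kibana Console."""
    lines = []
    for method, path, body in requests:
        lines.append(f"{method} {path}")
        if body is not None:
            lines.append(json.dumps(body))
    return "\n".join(lines)
```

For example, `print(to_console(remediation_requests("my-job", "datafeed-my-job", "1627000000")))` prints the four requests in order; the job, datafeed, and snapshot IDs in that call are made up for illustration.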

Incorrect mappings in 7.9.0 or higher

This problem occurs when you upgrade to 7.9.0 and incorrect mappings are added to the machine learning annotations index or the machine learning config index.

This problem can also occur for the machine learning config index when you upgrade to 7.9.1 or higher after having previously upgraded through several earlier 7.x versions. If you skip version 7.9.0 and upgrade directly to version 7.9.1 or higher, the mappings on the machine learning annotations index will be correct. However, if you upgraded to version 7.9.0 and ended up with incorrect mappings, upgrading to 7.9.1 does not fix them; you must follow the procedure detailed below.

Symptoms:

  • Some pages in the Machine Learning UI do not display correctly. For example, the Anomaly Explorer fails to load.
  • The following error occurs in Kibana when you try to view annotations for anomaly detection jobs: Error loading the list of annotations for this job
  • You cannot create or update any machine learning jobs. The error messages in this case are illegal argument exceptions such as mapper [model_plot_config.annotations_enabled] cannot be changed from type [keyword] to [boolean]. This problem is most likely to occur if, after upgrading, you open an existing anomaly detection job in 7.9.0 before you create or update a job.
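
To confirm that a mapping conflict is the cause, you can compare the current field types with the expected ones. The following Python sketch is illustrative only: `mapping_problems` and `mapped_type` are hypothetical helpers, the expected types are the ones set for the .ml-config index in the Resolution that follows, and the code assumes the parsed body of `GET .ml-config/_mapping` has the shape `{index name: {"mappings": {"properties": {...}}}}`.

```python
# Expected field types for the machine learning config index,
# taken from the mapping update shown in the Resolution section.
EXPECTED_CONFIG_TYPES = {
    "analysis_config.per_partition_categorization.enabled": "boolean",
    "analysis_config.per_partition_categorization.stop_on_warn": "boolean",
    "max_num_threads": "integer",
    "model_plot_config.annotations_enabled": "boolean",
}


def mapped_type(properties, dotted_path):
    """Return the mapped type of a dotted field path, or None if it is unmapped."""
    node = {"properties": properties}
    for part in dotted_path.split("."):
        node = node.get("properties", {}).get(part)
        if node is None:
            return None
    return node.get("type")


def mapping_problems(mapping_response, expected):
    """List (index, field, actual_type, expected_type) for every mismatch.

    Fields that are not mapped at all are skipped: they are not in
    conflict yet, although they may still need the manual update.
    """
    problems = []
    for index_name, body in mapping_response.items():
        properties = body.get("mappings", {}).get("properties", {})
        for path, want in expected.items():
            got = mapped_type(properties, path)
            if got is not None and got != want:
                problems.append((index_name, path, got, want))
    return problems
```

An empty result means none of the listed fields is mapped to a conflicting type. The same function can be used for the annotations index by passing a dictionary of its expected types instead.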

Resolution:

To avoid this problem, manually update the mappings on the machine learning annotations and config indices in your old Elasticsearch version before you upgrade to 7.9.0. For example:

PUT .ml-annotations-6/_mapping
{
  "properties": {
    "event" : {
      "type" : "keyword"
    },
    "detector_index" : {
      "type" : "integer"
    },
    "partition_field_name" : {
      "type" : "keyword"
    },
    "partition_field_value" : {
      "type" : "keyword"
    },
    "over_field_name" : {
      "type" : "keyword"
    },
    "over_field_value" : {
      "type" : "keyword"
    },
    "by_field_name" : {
      "type" : "keyword"
    },
    "by_field_value" : {
      "type" : "keyword"
    }
  }
}

PUT .ml-config/_mapping
{
  "properties": {
    "analysis_config": {
      "properties": {
        "per_partition_categorization" : {
          "properties" : {
            "enabled" : {
              "type" : "boolean"
            },
            "stop_on_warn" : {
              "type" : "boolean"
            }
          }
        }
      }
    },
    "max_num_threads" : {
      "type" : "integer"
    },
    "model_plot_config" : {
      "properties" : {
        "annotations_enabled" : {
          "type" : "boolean"
        }
      }
    }
  }
}

If security features are enabled, you must have the superuser role to alter the .ml-config index.

If you did not manually update the mappings before the upgrade, you can still try to update them after the upgrade. If either update fails, you must reindex the affected index. For example, follow these steps:

  1. To reindex the machine learning annotations index:

    1. Enable upgrade mode:

      POST _ml/set_upgrade_mode?enabled=true&timeout=10m
    2. Create a temporary index:

      PUT temp_ml_annotations
    3. Reindex the .ml-annotations-6 index into the temporary index:

      POST _reindex
      {
        "source": { "index": ".ml-annotations-6" },
        "dest": { "index": "temp_ml_annotations" }
      }
    4. Delete the .ml-annotations-6 index:

      DELETE .ml-annotations-6
    5. Disable upgrade mode:

      POST _ml/set_upgrade_mode?enabled=false&timeout=10m
    6. Wait for .ml-annotations-6 to be recreated.
    7. Reindex the temporary index into the .ml-annotations-6 index:

      POST _reindex
      {
        "source": { "index": "temp_ml_annotations" },
        "dest": { "index": ".ml-annotations-6" }
      }
    8. Delete the temporary index:

      DELETE temp_ml_annotations
  2. To reindex the machine learning config index:

    1. Enable upgrade mode:

      POST _ml/set_upgrade_mode?enabled=true&timeout=10m
    2. Create a temporary index:

      PUT temp_ml_config
    3. Reindex the .ml-config index into the temporary index:

      POST _reindex
      {
        "source": { "index": ".ml-config" },
        "dest": { "index": "temp_ml_config" }
      }
    4. Delete the .ml-config index:

      DELETE .ml-config
    5. Create the .ml-config index:

      PUT .ml-config
      {
        "settings": { "auto_expand_replicas": "0-1"}
      }
    6. Reindex the temporary index into the .ml-config index:

      POST _reindex
      {
        "source": { "index": "temp_ml_config" },
        "dest": { "index": ".ml-config" }
      }
    7. Disable upgrade mode:

      POST _ml/set_upgrade_mode?enabled=false&timeout=10m
    8. Delete the temporary index:

      DELETE temp_ml_config
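
Both procedures delete the original index after copying it, so it is worth checking each reindex response before you run the following DELETE. The sketch below is an optional safeguard, not part of the documented procedure. `reindex_completed` is a hypothetical helper, and it assumes the parsed `_reindex` response contains the `timed_out`, `failures`, `total`, `created`, and `updated` fields.

```python
def reindex_completed(response):
    """Return True if a _reindex response reports a clean, complete copy."""
    # Any timeout or reported failure means the copy cannot be trusted.
    if response.get("timed_out") or response.get("failures"):
        return False
    total = response.get("total")
    if total is None:
        # Not a recognizable reindex response.
        return False
    # Every source document must have been written to the destination.
    copied = response.get("created", 0) + response.get("updated", 0)
    return copied == total
```

If the function returns False, keep the source index and investigate the response before you continue.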

Suboptimal job assignment on Debian 8

Where possible, machine learning jobs are assigned to nodes based on the memory requirement of the job and the memory available on the node. However, in certain cases, the amount of memory on a node cannot be accurately determined and jobs are assigned by balancing the number of jobs per machine learning node. This can lead to a situation where all the jobs with high memory requirements are on one node and the less memory-intensive jobs are on another.

One particular case of this problem is that Elasticsearch fails to determine the amount of memory on a machine that is running Debian 8 with the default Cgroups setup and certain updates of Java versions earlier than Java 15. For example, Java 8u271 is known to be affected while Java 8u272 is not. Java 15 has included the fix since its initial release.

If you are running Elasticsearch on Debian 8 with an old version of Java and have not already modified the Cgroups setup, do one of the following:

  • Upgrade Java to version 15.
  • Upgrade to the latest Java update for the version of Java you are running.
  • Enable the "memory" Cgroup by editing /etc/default/grub and adding:

    GRUB_CMDLINE_LINUX_DEFAULT="quiet cgroup_enable=memory swapaccount=1"

    Update your GRUB configuration by running sudo update-grub, then reboot the machine.
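
To check whether the "memory" Cgroup is currently enabled on a host, before or after the reboot, you can inspect /proc/cgroups. The following Python sketch is illustrative only: `memory_cgroup_enabled` is a hypothetical helper, and it assumes the cgroup v1 layout of /proc/cgroups, where each line lists the subsystem name, hierarchy ID, number of cgroups, and an enabled flag.

```python
def memory_cgroup_enabled(proc_cgroups_text):
    """Return True if the contents of /proc/cgroups show the memory controller enabled."""
    for line in proc_cgroups_text.splitlines():
        # Skip the header line and any blank lines.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.split()
        # Assumed columns: subsys_name, hierarchy, num_cgroups, enabled
        if len(fields) >= 4 and fields[0] == "memory":
            return fields[3] == "1"
    # No memory line at all: treat the controller as unavailable.
    return False
```

On the host itself, pass the file contents to the function, for example `memory_cgroup_enabled(open("/proc/cgroups").read())`. A result of False on Debian 8 suggests the GRUB change above has not been applied or the machine has not yet been rebooted.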