Resolve lifecycle policy execution errors

edit

Resolve lifecycle policy execution errors

edit

When ILM executes a lifecycle policy, it’s possible for errors to occur while performing the necessary index operations for a step. When this happens, ILM moves the index to an ERROR step. If {ilm-init] cannot resolve the error automatically, execution is halted until you resolve the underlying issues with the policy, index, or cluster.

For example, you might have a shrink-index policy that shrinks an index to four shards once it is at least five days old:

PUT _ilm/policy/shrink-index
{
  "policy": {
    "phases": {
      "warm": {
        "min_age": "5d",
        "actions": {
          "shrink": {
            "number_of_shards": 4
          }
        }
      }
    }
  }
}

There is nothing that prevents you from applying the shrink-index policy to a new index that has only two shards:

PUT /my-index-000001
{
  "settings": {
    "index.number_of_shards": 2,
    "index.lifecycle.name": "shrink-index"
  }
}

After five days, ILM attempts to shrink my-index-000001 from two shards to four shards. Because the shrink action cannot increase the number of shards, this operation fails and ILM moves my-index-000001 to the ERROR step.

You can use the ILM Explain API to get information about what went wrong:

GET /my-index-000001/_ilm/explain

Which returns the following information:

{
  "indices" : {
    "my-index-000001" : {
      "index" : "my-index-000001",
      "managed" : true,
      "policy" : "shrink-index",                
      "lifecycle_date_millis" : 1541717265865,
      "age": "5.1d",                            
      "phase" : "warm",                         
      "phase_time_millis" : 1541717272601,
      "action" : "shrink",                      
      "action_time_millis" : 1541717272601,
      "step" : "ERROR",                         
      "step_time_millis" : 1541717272688,
      "failed_step" : "shrink",                 
      "step_info" : {
        "type" : "illegal_argument_exception",  
        "reason" : "the number of target shards [4] must be less that the number of source shards [2]"
      },
      "phase_execution" : {
        "policy" : "shrink-index",
        "phase_definition" : {                  
          "min_age" : "5d",
          "actions" : {
            "shrink" : {
              "number_of_shards" : 4
            }
          }
        },
        "version" : 1,
        "modified_date_in_millis" : 1541717264230
      }
    }
  }
}

The policy being used to manage the index: shrink-index

The index age: 5.1 days

The phase the index is currently in: warm

The current action: shrink

The step the index is currently in: ERROR

The step that failed to execute: shrink

The type of error and a description of that error.

The definition of the current phase from the shrink-index policy

To resolve this, you could update the policy to shrink the index to a single shard after 5 days:

PUT _ilm/policy/shrink-index
{
  "policy": {
    "phases": {
      "warm": {
        "min_age": "5d",
        "actions": {
          "shrink": {
            "number_of_shards": 1
          }
        }
      }
    }
  }
}

Retrying failed lifecycle policy steps

edit

Once you fix the problem that put an index in the ERROR step, you might need to explicitly tell ILM to retry the step:

POST /my-index-000001/_ilm/retry

ILM subsequently attempts to re-run the step that failed. You can use the ILM Explain API to monitor the progress.