Troubleshooting index lifecycle management errors

When ILM executes a lifecycle policy, errors can occur while it performs the index operations required by a step. When this happens, ILM moves the index to an ERROR step. If ILM cannot resolve the error automatically, execution halts until you resolve the underlying issue with the policy, index, or cluster.

See this video for a walkthrough of troubleshooting current ILM health issues, and this video for a walkthrough of troubleshooting historical ILM issues.

For example, you might have a shrink-index policy that shrinks an index to four shards once it is at least five days old:

Python:

resp = client.ilm.put_lifecycle(
    name="shrink-index",
    policy={
        "phases": {
            "warm": {
                "min_age": "5d",
                "actions": {
                    "shrink": {
                        "number_of_shards": 4
                    }
                }
            }
        }
    },
)
print(resp)

Ruby:

response = client.ilm.put_lifecycle(
  policy: 'shrink-index',
  body: {
    policy: {
      phases: {
        warm: {
          min_age: '5d',
          actions: {
            shrink: {
              number_of_shards: 4
            }
          }
        }
      }
    }
  }
)
puts response

JavaScript:

const response = await client.ilm.putLifecycle({
  name: "shrink-index",
  policy: {
    phases: {
      warm: {
        min_age: "5d",
        actions: {
          shrink: {
            number_of_shards: 4,
          },
        },
      },
    },
  },
});
console.log(response);

Console:

PUT _ilm/policy/shrink-index
{
  "policy": {
    "phases": {
      "warm": {
        "min_age": "5d",
        "actions": {
          "shrink": {
            "number_of_shards": 4
          }
        }
      }
    }
  }
}

Nothing prevents you from applying the shrink-index policy to a new index that has only two shards:

Python:

resp = client.indices.create(
    index="my-index-000001",
    settings={
        "index.number_of_shards": 2,
        "index.lifecycle.name": "shrink-index"
    },
)
print(resp)

Ruby:

response = client.indices.create(
  index: 'my-index-000001',
  body: {
    settings: {
      'index.number_of_shards' => 2,
      'index.lifecycle.name' => 'shrink-index'
    }
  }
)
puts response

JavaScript:

const response = await client.indices.create({
  index: "my-index-000001",
  settings: {
    "index.number_of_shards": 2,
    "index.lifecycle.name": "shrink-index",
  },
});
console.log(response);

Console:

PUT /my-index-000001
{
  "settings": {
    "index.number_of_shards": 2,
    "index.lifecycle.name": "shrink-index"
  }
}

After five days, ILM attempts to shrink my-index-000001 from two shards to four shards. Because the shrink action cannot increase the number of shards, this operation fails and ILM moves my-index-000001 to the ERROR step.
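The constraint behind this failure can be checked before a policy is applied. The following is a minimal sketch, not part of ILM: the helper name is_valid_shrink_target is hypothetical, and it assumes the shrink rule that the target shard count must be a factor of the source shard count and cannot exceed it.

```python
def is_valid_shrink_target(source_shards: int, target_shards: int) -> bool:
    """Return True if an index with source_shards primaries could be shrunk to target_shards.

    Hypothetical helper for illustration; it only mirrors the shard-count rule
    and does not contact a cluster.
    """
    if source_shards < 1 or target_shards < 1:
        return False
    # Shrink cannot increase the shard count, and the target must divide
    # the source count evenly.
    return target_shards <= source_shards and source_shards % target_shards == 0


print(is_valid_shrink_target(2, 4))  # False: the case that fails in this example
print(is_valid_shrink_target(2, 1))  # True
```

A check like this could run in a review script before a policy is attached to an index; it does not replace the validation Elasticsearch performs itself.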

You can use the ILM Explain API to get information about what went wrong:

Python:

resp = client.ilm.explain_lifecycle(
    index="my-index-000001",
)
print(resp)

Ruby:

response = client.ilm.explain_lifecycle(
  index: 'my-index-000001'
)
puts response

JavaScript:

const response = await client.ilm.explainLifecycle({
  index: "my-index-000001",
});
console.log(response);

Console:

GET /my-index-000001/_ilm/explain

The API returns the following information:

{
  "indices" : {
    "my-index-000001" : {
      "index" : "my-index-000001",
      "managed" : true,
      "index_creation_date_millis" : 1541717265865,
      "time_since_index_creation": "5.1d",
      "policy" : "shrink-index",                
      "lifecycle_date_millis" : 1541717265865,
      "age": "5.1d",                            
      "phase" : "warm",                         
      "phase_time_millis" : 1541717272601,
      "action" : "shrink",                      
      "action_time_millis" : 1541717272601,
      "step" : "ERROR",                         
      "step_time_millis" : 1541717272688,
      "failed_step" : "shrink",                 
      "step_info" : {
        "type" : "illegal_argument_exception",  
        "reason" : "the number of target shards [4] must be less that the number of source shards [2]"
      },
      "phase_execution" : {
        "policy" : "shrink-index",
        "phase_definition" : {                  
          "min_age" : "5d",
          "actions" : {
            "shrink" : {
              "number_of_shards" : 4
            }
          }
        },
        "version" : 1,
        "modified_date_in_millis" : 1541717264230
      }
    }
  }
}

The key fields in this response are:

policy: The policy being used to manage the index (shrink-index).

age: The index age (5.1 days).

phase: The phase the index is currently in (warm).

action: The current action (shrink).

step: The step the index is currently in (ERROR).

failed_step: The step that failed to execute (shrink).

step_info: The type of error and a description of that error.

phase_definition: The definition of the current phase from the shrink-index policy.
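When many indices are managed, it can help to pull only the failing ones out of an explain response. The sketch below assumes the response has already been fetched and converted to a plain dictionary with the shape shown above; the function name find_ilm_errors and the summary keys it returns are hypothetical.

```python
def find_ilm_errors(explain_response: dict) -> list[dict]:
    """Collect a short summary for every index that is in the ERROR step."""
    errors = []
    for name, info in explain_response.get("indices", {}).items():
        if info.get("step") != "ERROR":
            continue
        step_info = info.get("step_info", {})
        errors.append({
            "index": name,
            "policy": info.get("policy"),
            "phase": info.get("phase"),
            "failed_step": info.get("failed_step"),
            "error_type": step_info.get("type"),
            "reason": step_info.get("reason"),
        })
    return errors


# Trimmed copy of the explain response for my-index-000001.
sample = {
    "indices": {
        "my-index-000001": {
            "index": "my-index-000001",
            "managed": True,
            "policy": "shrink-index",
            "phase": "warm",
            "action": "shrink",
            "step": "ERROR",
            "failed_step": "shrink",
            "step_info": {
                "type": "illegal_argument_exception",
                "reason": "the number of target shards [4] must be less that the number of source shards [2]",
            },
        }
    }
}

for error in find_ilm_errors(sample):
    print(f"{error['index']}: {error['failed_step']} failed ({error['error_type']})")
```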

To resolve this, you could update the policy to shrink the index to a single shard after 5 days:

Python:

resp = client.ilm.put_lifecycle(
    name="shrink-index",
    policy={
        "phases": {
            "warm": {
                "min_age": "5d",
                "actions": {
                    "shrink": {
                        "number_of_shards": 1
                    }
                }
            }
        }
    },
)
print(resp)

Ruby:

response = client.ilm.put_lifecycle(
  policy: 'shrink-index',
  body: {
    policy: {
      phases: {
        warm: {
          min_age: '5d',
          actions: {
            shrink: {
              number_of_shards: 1
            }
          }
        }
      }
    }
  }
)
puts response

JavaScript:

const response = await client.ilm.putLifecycle({
  name: "shrink-index",
  policy: {
    phases: {
      warm: {
        min_age: "5d",
        actions: {
          shrink: {
            number_of_shards: 1,
          },
        },
      },
    },
  },
});
console.log(response);

Console:

PUT _ilm/policy/shrink-index
{
  "policy": {
    "phases": {
      "warm": {
        "min_age": "5d",
        "actions": {
          "shrink": {
            "number_of_shards": 1
          }
        }
      }
    }
  }
}

Retrying failed lifecycle policy steps

Once you fix the problem that put an index in the ERROR step, you might need to explicitly tell ILM to retry the step:

Python:

resp = client.ilm.retry(
    index="my-index-000001",
)
print(resp)

JavaScript:

const response = await client.ilm.retry({
  index: "my-index-000001",
});
console.log(response);

Console:

POST /my-index-000001/_ilm/retry

ILM subsequently attempts to re-run the step that failed. You can use the ILM Explain API to monitor the progress.
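The explain and retry calls can be combined so that only indices currently in the ERROR step are retried. This is a rough sketch under assumptions: client is expected to be an Elasticsearch Python client exposing the ilm.explain_lifecycle and ilm.retry calls used in this document, the explain response is assumed to support dictionary-style access, and the helper name retry_failed_indices is hypothetical.

```python
def retry_failed_indices(client, index_pattern: str) -> list[str]:
    """Retry the failed step for every index matching index_pattern that is in ERROR.

    Returns the names of the indices for which a retry was requested.
    """
    explain = client.ilm.explain_lifecycle(index=index_pattern)
    retried = []
    for name, info in explain["indices"].items():
        # Only indices in the ERROR step have a failed step to retry.
        if info.get("step") == "ERROR":
            client.ilm.retry(index=name)
            retried.append(name)
    return retried
```

For example, retry_failed_indices(client, "my-index-*") would request a retry for each matching index in the ERROR step and leave healthy indices alone. Fix the underlying problem first; retrying an unchanged failure will most likely put the index back into the ERROR step.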

Common ILM setting issues

How min_age is calculated

When setting up an ILM policy or automating rollover with ILM, be aware that min_age can be relative to either the rollover time or the index creation time.

If you use ILM rollover, min_age is calculated relative to the time the index was rolled over. This is because the rollover API generates a new index and updates the age of the previous index to reflect the rollover time. If the index hasn’t been rolled over, its age is calculated from the index’s creation_date.

You can override how min_age is calculated using the index.lifecycle.origination_date and index.lifecycle.parse_origination_date ILM settings.
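The effect of these rules on phase timing can be illustrated with a small model. This is a simplified sketch, not ILM's implementation: the function names are hypothetical, and it assumes that an explicit origination date, when set, takes precedence over both the rollover time and the creation time.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def lifecycle_origin(creation: datetime,
                     rollover: Optional[datetime] = None,
                     origination: Optional[datetime] = None) -> datetime:
    """Pick the timestamp that index age is measured from (simplified model)."""
    # Assumption: an explicit origination date overrides the other timestamps.
    if origination is not None:
        return origination
    # After rollover, age is measured from the rollover time.
    if rollover is not None:
        return rollover
    # Otherwise age is measured from index creation.
    return creation


def min_age_reached(now: datetime, min_age: timedelta, creation: datetime,
                    rollover: Optional[datetime] = None,
                    origination: Optional[datetime] = None) -> bool:
    return now - lifecycle_origin(creation, rollover, origination) >= min_age


created = datetime(2024, 1, 1, tzinfo=timezone.utc)
rolled_over = datetime(2024, 1, 4, tzinfo=timezone.utc)
now = datetime(2024, 1, 7, tzinfo=timezone.utc)

# Six days since creation, but only three days since rollover.
print(min_age_reached(now, timedelta(days=5), created))               # True
print(min_age_reached(now, timedelta(days=5), created, rolled_over))  # False
```

The second call shows why an index can appear old enough for a phase yet remain where it is: with rollover in use, the five-day clock starts at the rollover time.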

Common ILM errors

Here’s how to resolve the most common errors reported in the ERROR step.

Problems with rollover aliases are a common cause of errors. Consider using data streams instead of managing rollover with aliases.

Rollover alias [x] can point to multiple indices, found duplicated alias [x] in index template [z]

The target rollover alias is specified in an index template’s index.lifecycle.rollover_alias setting. You need to explicitly configure this alias one time when you bootstrap the initial index. The rollover action then manages setting and updating the alias to roll over to each subsequent index.

Do not explicitly configure this same alias in the aliases section of an index template.

See this resolving duplicate alias video for an example troubleshooting walkthrough.
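One way to catch this misconfiguration is to inspect a template body before it is stored. The sketch below assumes a composable index template represented as a dictionary, with settings written as flat dotted keys; the function name and the sample template are hypothetical.

```python
def template_redeclares_rollover_alias(index_template: dict) -> bool:
    """Return True if the rollover alias also appears in the template's aliases section."""
    template = index_template.get("template", {})
    settings = template.get("settings", {})
    # Assumption: the setting is written as a flat dotted key, not nested objects.
    rollover_alias = settings.get("index.lifecycle.rollover_alias")
    aliases = template.get("aliases", {})
    return rollover_alias is not None and rollover_alias in aliases


bad_template = {
    "index_patterns": ["my-index-*"],
    "template": {
        "settings": {"index.lifecycle.rollover_alias": "my-alias"},
        "aliases": {"my-alias": {}},
    },
}

print(template_redeclares_rollover_alias(bad_template))  # True
```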

index.lifecycle.rollover_alias [x] does not point to index [y]

Either the index is using the wrong alias or the alias does not exist.

Check the index.lifecycle.rollover_alias index setting. To see what aliases are configured, use _cat/aliases.

See this resolving not point to index video for an example troubleshooting walkthrough.

Setting [index.lifecycle.rollover_alias] for index [y] is empty or not defined

The index.lifecycle.rollover_alias setting must be configured for the rollover action to work.

Update the index settings to set index.lifecycle.rollover_alias.

See this resolving empty or not defined video for an example troubleshooting walkthrough.

Alias [x] has more than one write index [y,z]

Only one index can be designated as the write index for a particular alias.

Use the aliases API to set is_write_index:false for all but one index.

See this resolving more than one write index video for an example troubleshooting walkthrough.
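The request body for the aliases API can be generated so that exactly one index keeps the write flag. This is a minimal sketch: the helper name single_write_index_actions and the index names in the example are hypothetical, and the resulting dictionary is meant to be sent as the body of a POST _aliases request.

```python
import json


def single_write_index_actions(alias: str, indices: list[str], write_index: str) -> dict:
    """Build an aliases API body that leaves write_index as the only write index."""
    if write_index not in indices:
        raise ValueError(f"{write_index} is not one of the indices behind {alias}")
    actions = [
        # Re-adding the alias updates is_write_index for each index.
        {"add": {"index": index, "alias": alias, "is_write_index": index == write_index}}
        for index in indices
    ]
    return {"actions": actions}


body = single_write_index_actions(
    alias="my-alias",
    indices=["my-index-000001", "my-index-000002"],
    write_index="my-index-000002",
)
print(json.dumps(body, indent=2))
```

Sending all actions in one request keeps the change atomic, so the alias does not pass through a state with zero or two write indices.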

index name [x] does not match pattern ^.*-\d+

The index name must match the regex pattern ^.*-\d+ for the rollover action to work. The most common problem is that the index name does not contain trailing digits. For example, my-index does not match the pattern requirement.

Append a numeric value to the index name, for example my-index-000001.

See this resolving does not match pattern video for an example troubleshooting walkthrough.
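Index names can be validated against this pattern before an index is bootstrapped. The sketch below assumes the whole name must match, so the digits have to be at the end of the name; the function name is hypothetical.

```python
import re

# Any characters, then a hyphen, then one or more digits at the end of the name.
ROLLOVER_NAME_PATTERN = re.compile(r".*-\d+")


def matches_rollover_pattern(index_name: str) -> bool:
    """Return True if index_name ends with a hyphen followed by digits."""
    return ROLLOVER_NAME_PATTERN.fullmatch(index_name) is not None


print(matches_rollover_pattern("my-index"))         # False
print(matches_rollover_pattern("my-index-000001"))  # True
```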

CircuitBreakingException: [x] data too large, data for [y]

This indicates that the cluster is hitting resource limits.

Before continuing to set up ILM, you’ll need to take steps to alleviate the resource issues. For more information, see Circuit breaker errors.

High disk watermark [x] exceeded on [y]

This indicates that the cluster is running out of disk space. This can happen when you don’t have index lifecycle management set up to roll over from hot to warm nodes. For more information, see Fix watermark errors.

security_exception: action [<action-name>] is unauthorized for user [<user-name>] with roles [<role-name>], this action is granted by the index privileges [manage_follow_index,manage,all]

This indicates that the ILM action cannot be executed because the user that ILM runs the action as does not have the required privileges. This can happen when the user’s privileges have been dropped after the ILM policy was last updated. ILM actions run as though they were performed by the last user who modified the policy. The account used to create or modify the policy should have permissions to perform all operations that are part of that policy.