Troubleshooting corruption

edit

Elasticsearch expects that the data it reads from disk is exactly the data it previously wrote. If it detects that the data on disk is different from what it wrote then it will report some kind of exception such as:

  • org.apache.lucene.index.CorruptIndexException
  • org.elasticsearch.gateway.CorruptStateException
  • org.elasticsearch.index.translog.TranslogCorruptedException

Typically these exceptions happen due to a checksum mismatch. Most of the data that Elasticsearch writes to disk is followed by a checksum using a simple algorithm known as CRC32 which is fast to compute and good at detecting the kinds of random corruption that may happen when using faulty storage. A CRC32 checksum mismatch definitely indicates that something is faulty, although of course a matching checksum doesn’t prove the absence of corruption.

Verifying a checksum is expensive since it involves reading every byte of the file which takes significant effort and might evict more useful data from the filesystem cache, so systems typically don’t verify the checksum on a file very often. This is why you tend only to encounter a corruption exception when something unusual is happening. For instance, corruptions are often detected during merges, shard movements, and snapshots. This does not mean that these processes are causing corruption: they are examples of the rare times where reading a whole file is necessary. Elasticsearch takes the opportunity to verify the checksum at the same time, and this is when the corruption is detected and reported. It doesn’t indicate the cause of the corruption or when it happened. Corruptions can remain undetected for many months.

The files that make up a Lucene index are written sequentially from start to end and then never modified or overwritten. This access pattern means the checksum computation is very simple and can happen on-the-fly as the file is initially written, and also makes it very unlikely that an incorrect checksum is due to a userspace bug at the time the file was written. The routine that computes the checksum is straightforward, widely used, and very well-tested, so you can be very confident that a checksum mismatch really does indicate that the data read from disk is different from the data that Elasticsearch previously wrote.

The files that make up a Lucene index are written in full before they are used. If a file is needed to recover an index after a restart then your storage system previously confirmed to Elasticsearch that this file was durably synced to disk. On Linux this means that the fsync() system call returned successfully. Elasticsearch sometimes detects that an index is corrupt because a file needed for recovery has been truncated or is missing its footer. This indicates that your storage system acknowledges durable writes incorrectly.

There are many possible explanations for Elasticsearch detecting corruption in your cluster. Databases like Elasticsearch generate a challenging I/O workload that may find subtle infrastructural problems which other tests may miss. Elasticsearch is known to expose the following problems as file corruptions:

  • Filesystem bugs, especially in newer and nonstandard filesystems which might not have seen enough real-world production usage to be confident that they work correctly.
  • Kernel bugs.
  • Bugs in firmware running on the drive or RAID controller.
  • Incorrect configuration, for instance configuring fsync() to report success before all durable writes have completed.
  • Faulty hardware, which may include the drive itself, the RAID controller, your RAM or CPU.

Data corruption typically doesn’t result in other evidence of problems apart from the checksum mismatch. Do not interpret this as an indication that your storage subsystem is working correctly and therefore that Elasticsearch itself caused the corruption. It is rare for faulty storage to show any evidence of problems apart from the data corruption, but data corruption itself is a very strong indicator that your storage subsystem is not working correctly.

To rule out Elasticsearch as the source of data corruption, generate an I/O workload using something other than Elasticsearch and look for data integrity errors. On Linux the fio and stress-ng tools can both generate challenging I/O workloads and verify the integrity of the data they write. Use version 0.12.01 or newer of stress-ng since earlier versions do not have strong enough integrity checks. You can check that durable writes persist across power outages using a script such as diskchecker.pl.

To narrow down the source of the corruptions, systematically change components in your cluster’s environment until the corruptions stop. The details will depend on the exact configuration of your hardware, but may include the following:

  • Try a different filesystem or a different kernel.
  • Try changing each hardware component in turn, ideally changing to a different model or manufacturer.
  • Try different firmware versions for each hardware component.