Understanding the YARN environment

YARN stands for "Yet Another Resource Negotiator" and was added later as part of Hadoop 2.0. YARN takes the resource management capabilities that were in MapReduce and packages them so they can be used by new engines. This also streamlines MapReduce to do what it does best, process data. With YARN, you can now run multiple applications in Hadoop, all sharing a common resource management. As of September, 2014, YARN manages only CPU (number of cores) and memory [..]

-- Wikipedia

In its current incarnation, Elasticsearch on YARN interacts with the YARN APIs in order to start and stop Elasticsearch nodes on YARN infrastructure. In YARN terminology, Elasticsearch on YARN has several components:

Client Application
The entity that bootstraps the entire process and controls the life-cycle of the cluster based on user feedback. This is the CLI (Command-Line Interface) that the user interacts with.
Application Manager
Based on the user configuration, the dedicated ApplicationManager negotiates with YARN for the number of Elasticsearch nodes to be created as YARN containers and for their capabilities (memory and CPU); a minimal sketch of such a request follows this list. It oversees the cluster life-cycle and handles container allocation.
Node/Container setup
Handles the actual Elasticsearch invocation for each allocated Container.
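
To make the Application Manager's role more concrete, below is a minimal sketch of how a YARN application master asks the ResourceManager for containers with a given amount of memory and vcores, using the standard AMRMClient API. It illustrates the YARN mechanism only, not the actual Elasticsearch on YARN implementation; the container count and resource values are placeholders.

    // Minimal sketch of a YARN application master requesting containers.
    // Illustrative only - not the actual Elasticsearch on YARN code; the
    // container count and resource values below are placeholders.
    import org.apache.hadoop.yarn.api.records.FinalApplicationStatus;
    import org.apache.hadoop.yarn.api.records.Priority;
    import org.apache.hadoop.yarn.api.records.Resource;
    import org.apache.hadoop.yarn.client.api.AMRMClient;
    import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
    import org.apache.hadoop.yarn.conf.YarnConfiguration;

    public class ContainerRequestSketch {
        public static void main(String[] args) throws Exception {
            AMRMClient<ContainerRequest> amClient = AMRMClient.createAMRMClient();
            amClient.init(new YarnConfiguration());
            amClient.start();

            // Register this application master with the ResourceManager
            amClient.registerApplicationMaster("", 0, "");

            // Ask for 3 containers, each with 2 GB of memory and 2 vcores
            Resource capability = Resource.newInstance(2048, 2);
            Priority priority = Priority.newInstance(0);
            for (int i = 0; i < 3; i++) {
                amClient.addContainerRequest(new ContainerRequest(capability, null, null, priority));
            }

            // allocate() would then be called periodically to receive the granted
            // containers and launch an Elasticsearch node in each of them

            amClient.unregisterApplicationMaster(FinalApplicationStatus.SUCCEEDED, "", "");
            amClient.stop();
        }
    }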

As YARN is all about cluster resource management, it is important to configure YARN and Elasticsearch properly, since over- or under-allocating resources can lead to undesirable consequences. There are plenty of resources available on how to configure and plan your YARN cluster; the sections below touch on the core resources and their impact on Elasticsearch.

CPU

As of Hadoop 2.4, YARN can restrict the amount of CPU allocated to a container: each container is assigned a number of so-called vcores (virtual cores), with a minimum of 1. What this translates to in practice depends heavily on your underlying hardware and cluster configuration; a good approximation is to map each vcore to an actual core on the CPU. Just as with native hardware, expect the core to be shared with the rest of the applications, so depending on system load the amount of actual CPU available can be considerably lower. It is therefore recommended to allocate multiple vcores to Elasticsearch; a good starting point is the number of actual cores your CPU supports.
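
As a rough illustration (a sketch only; the capping policy is an assumption, not part of Elasticsearch on YARN), the snippet below derives a starting vcore request from the number of cores the JVM reports and caps it at the NodeManager's configured vcore capacity (yarn.nodemanager.resource.cpu-vcores):

    // Sketch: pick a starting vcore count for an Elasticsearch container.
    // Illustrative only; the capping policy is an assumption.
    import org.apache.hadoop.yarn.conf.YarnConfiguration;

    public class VcoreSizing {
        public static void main(String[] args) {
            YarnConfiguration conf = new YarnConfiguration();

            // Cores the JVM sees on this machine - a reasonable starting request
            int physicalCores = Runtime.getRuntime().availableProcessors();

            // vcores the NodeManager advertises (yarn.nodemanager.resource.cpu-vcores)
            int nodeVcores = conf.getInt(YarnConfiguration.NM_VCORES,
                                         YarnConfiguration.DEFAULT_NM_VCORES);

            // Do not ask for more vcores than a single node offers
            int requestVcores = Math.min(physicalCores, nodeVcores);
            System.out.println("Requesting " + requestVcores + " vcores per container");
        }
    }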

Memory

Simplifying things a lot, YARN requires containers to request the amount of memory they need within a certain band: requests above the configured maximum are denied, while requests below the configured minimum are rounded up to it. By default, YARN enforces a minimum allocation of 1 GB (1024 MB) and a maximum of 8 GB (8192 MB). While Elasticsearch can work with this amount of RAM, you typically want to increase it for performance reasons. Out of the box, Elasticsearch on YARN requests only 2 GB of memory per container so that users can easily try it out even in a test YARN environment (such as a pseudo-distributed setup or a VM); significantly increase this amount once you get beyond the YARN 'Hello World' stage.
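
As an illustration of that band (a sketch that assumes the default scheduler behaviour of rounding requests up to a multiple of the minimum allocation and rejecting anything above the maximum), the snippet below reads the limits from the YARN configuration and normalizes a 2 GB request:

    // Sketch: how a memory request relates to YARN's allocation band.
    // Assumes the default scheduler behaviour of rounding requests up to a
    // multiple of the minimum allocation and rejecting requests above the maximum.
    import org.apache.hadoop.yarn.conf.YarnConfiguration;

    public class MemoryBand {
        public static void main(String[] args) {
            YarnConfiguration conf = new YarnConfiguration();

            // yarn.scheduler.minimum-allocation-mb, 1024 MB by default
            int minMb = conf.getInt(YarnConfiguration.RM_SCHEDULER_MINIMUM_ALLOCATION_MB,
                                    YarnConfiguration.DEFAULT_RM_SCHEDULER_MINIMUM_ALLOCATION_MB);
            // yarn.scheduler.maximum-allocation-mb, 8192 MB by default
            int maxMb = conf.getInt(YarnConfiguration.RM_SCHEDULER_MAXIMUM_ALLOCATION_MB,
                                    YarnConfiguration.DEFAULT_RM_SCHEDULER_MAXIMUM_ALLOCATION_MB);

            int requestedMb = 2048; // Elasticsearch on YARN's out-of-the-box request

            if (requestedMb > maxMb) {
                throw new IllegalArgumentException(
                    "Requests above yarn.scheduler.maximum-allocation-mb are rejected");
            }
            // Round up to the next multiple of the minimum allocation
            int grantedMb = ((requestedMb + minMb - 1) / minMb) * minMb;
            System.out.println("Requested " + requestedMb + " MB, container gets " + grantedMb + " MB");
        }
    }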

Storage

Each container inside YARN is responsible for persisting its state and data across container restarts. In general, there are two main strategies one can take:

Use a globally accessible file-system (like HDFS)
With storage accessible by all Hadoop nodes, each container can use it as its backing store. For example, one can use HDFS to save data from one container and read it from another. With Elasticsearch, one can mount HDFS through its NFS gateway and point each Elasticsearch node to a folder on it. Note however that performance is going to suffer significantly: HDFS is designed as big, network-based storage, so each call is likely to incur a significant delay (due to network latency).
Use the container node's local storage
Each container can currently access its node's local storage; with proper configuration this can be kept outside the disposable container folder, allowing the data to survive between restarts. This is the recommended approach as it offers the best performance and, thanks to Elasticsearch itself, redundancy as well (through replicas); a sketch of this setup appears below.

Note that the approaches above require additional configuration to either Elasticsearch or your YARN cluster. There are plans to simplify this procedure in the future.

If no storage is configured, Elasticsearch will, out of the box, use its container storage, which means that when the container is disposed of, so is its data. In other words, any existing data is destroyed between restarts.
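
For the local-storage approach, the essential idea is to point each Elasticsearch node's data path at a directory that lives outside the disposable container working directory. The sketch below assembles a container launch command that passes such a path to an Elasticsearch 1.x node via the -Des.path.data system property; the /data/elasticsearch path and the way the command is put together are illustrative assumptions, not the exact mechanism used by Elasticsearch on YARN.

    // Sketch: build a container launch command that keeps Elasticsearch data
    // outside the disposable container directory. The data path and command
    // layout are illustrative assumptions.
    import java.util.Arrays;
    import java.util.List;

    public class LaunchCommandSketch {
        public static void main(String[] args) {
            // Node-local directory that survives container restarts
            String dataPath = "/data/elasticsearch";

            // Elasticsearch 1.x accepts settings as -Des.<setting> system properties
            List<String> command = Arrays.asList(
                "bin/elasticsearch",
                "-Des.path.data=" + dataPath);

            // This is the command the application master would ultimately hand
            // to the container's ContainerLaunchContext
            System.out.println(command);
        }
    }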

Node affinity

Currently, Elasticsearch on YARN does not provide any option for tying Elasticsearch nodes to specific YARN nodes; however, this will be addressed in the future. In practice, this means that unless YARN is specifically configured, there are no guarantees about the cluster topology between restarts, that is, about which machines the Elasticsearch nodes will run on each time.

Security

The Hadoop ecosystem uses Kerberos for authentication between its different components. The services that make up the components of the ecosystem require keytab files to be present on each node to allow for passwordless login. When submitting an application to a secured YARN cluster, YARN runs the application as the user that submitted it. In the case of this integration, the Application Master requires access to HDFS in order to distribute the Elasticsearch packages during container provisioning. The Application Master must be able to find a valid keytab file on any of the nodes that it may start on, as well as a principal that can be found within said keytab. You can achieve this by using a dedicated user as the owner of the Elasticsearch cluster in YARN and distributing that user's keytabs to all machines in the cluster.
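
As an illustration of the keytab-based login itself, the sketch below uses Hadoop's standard UserGroupInformation API rather than anything specific to Elasticsearch on YARN; the principal and keytab path are placeholders.

    // Sketch: passwordless Kerberos login from a keytab, as an application
    // master would perform before accessing HDFS. The principal and keytab
    // path are placeholders.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.security.UserGroupInformation;

    public class KeytabLoginSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("hadoop.security.authentication", "kerberos");
            UserGroupInformation.setConfiguration(conf);

            // The same keytab (and principal) must be available on every node
            // the application master may be started on
            UserGroupInformation.loginUserFromKeytab(
                "elasticsearch/host.example.com@EXAMPLE.COM",
                "/etc/security/keytabs/elasticsearch.keytab");

            System.out.println("Logged in as " + UserGroupInformation.getLoginUser());
        }
    }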