Requirements

Before running elasticsearch-hadoop, check the requirements below. This matters even more when deploying elasticsearch-hadoop across a cluster, where the software on some machines might be slightly out of sync. While elasticsearch-hadoop tries its best to fall back and validate its environment, a quick sanity check, especially during upgrades, can save you a lot of headaches.

Make sure to verify all nodes in a cluster when checking the version of a certain artifact.

elasticsearch-hadoop adds no extra requirements to Hadoop (or the various libraries built on top of it, such as Cascading or Pig) or to Elasticsearch. However, as a rule of thumb, use the latest stable version of each library (checking its compatibility with Hadoop and the JDK, where applicable).

JDK

JDK level 6.0 (or above), just like Hadoop. Because both JDK 6 and JDK 7 have reached end of life and are not supported by recent product updates, we strongly recommend using the latest JDK 8 (u20 or higher). If that is not an option, use JDK 7 update 55 (u55; required for Elasticsearch 1.2 or higher). An up-to-date support matrix for Elasticsearch is available here. Note that the JVM version is critical for a stable environment, as an incorrect version can corrupt the underlying data, as explained in this blog post.

One can check the available JDK version from the command line:

$ java -version
java version "1.8.0_45"
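
When the same check has to be repeated on many nodes, the version string can be compared programmatically. The following is a minimal Python sketch, not part of elasticsearch-hadoop: the function names `parse_java_version` and `meets_minimum` are hypothetical, and the default minimum of JDK 8 u20 is taken from the recommendation above.

```python
import re
from typing import Tuple


def parse_java_version(output: str) -> Tuple[int, int]:
    """Return (major, update) from `java -version` style output."""
    match = re.search(r'version "([^"]+)"', output)
    if match is None:
        raise ValueError("no version string found")
    version = match.group(1)
    # Legacy scheme: "1.8.0_45" means major 8, update 45.
    legacy = re.fullmatch(r"1\.(\d+)\.\d+(?:_(\d+))?.*", version)
    if legacy:
        return int(legacy.group(1)), int(legacy.group(2) or 0)
    # Newer scheme (JDK 9 and later): "11.0.2" means major 11.
    # The update number is simply reported as 0 in this sketch.
    return int(re.match(r"\d+", version).group(0)), 0


def meets_minimum(output: str, minimum: Tuple[int, int] = (8, 20)) -> bool:
    """True if the reported JDK is at least `minimum` (major, update)."""
    return parse_java_version(output) >= minimum


print(meets_minimum('java version "1.8.0_45"'))
```

Note that `java -version` writes to standard error, so when capturing its output (for example with `subprocess.run(["java", "-version"], capture_output=True, text=True)`) read the `stderr` field rather than `stdout`.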

Elasticsearch

We strongly recommend using the latest stable version of Elasticsearch (currently 6.2.4), although elasticsearch-hadoop maintains backwards compatibility with previous versions. You can find a matrix of supported versions here.

The Elasticsearch version is shown in its folder name:

$ ls
elasticsearch-6.2.4

If Elasticsearch is running (locally or remotely), one can find out its version through REST:

$ curl -XGET http://localhost:9200
{
  "status" : 200,
  "name" : "Dazzler",
  "version" : {
    "number" : "6.2.4",
    ...
  },
  "tagline" : "You Know, for Search"
}
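
The version number can also be extracted from that response in a script. This is a hedged Python sketch that assumes the response body has the shape shown above (a `version` object with a `number` field); the helper name `elasticsearch_version` is hypothetical.

```python
import json
from typing import Tuple


def elasticsearch_version(body: str) -> Tuple[int, ...]:
    """Parse the root endpoint response and return the version as a tuple."""
    info = json.loads(body)
    number = info["version"]["number"]
    # Drop any qualifier after a dash (e.g. "-SNAPSHOT") and split on dots,
    # so "6.2.4" becomes (6, 2, 4).
    return tuple(int(part) for part in number.split("-")[0].split("."))


# Sample body for illustration only; a real response has more fields.
sample = '{"name": "Dazzler", "version": {"number": "6.2.4"}}'
print(elasticsearch_version(sample))
```

Returning a tuple of integers lets versions be compared directly, for example `elasticsearch_version(body) >= (6, 0, 0)`. The body itself could be fetched with `urllib.request.urlopen("http://localhost:9200").read().decode()`, assuming the node is reachable at that address without authentication.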

Hadoop

Hadoop 2.x (ideally the latest stable version, currently 2.7.3). elasticsearch-hadoop is tested daily against Apache Hadoop; any distro compatible with Apache Hadoop should work just fine.

To check the version of Hadoop, look at its folder or jars (which contain the version in their names) or use the command line:

$ bin/hadoop version
Hadoop 2.4.1

Apache YARN / Hadoop 2.x

The elasticsearch-hadoop binary is tested against Hadoop 2.x and is designed to run on YARN without any changes or modifications.

Cascading

Cascading version 2.1.x (2.1.6) or higher. We recommend using the latest release of Cascading (currently 2.6.3).

Since Cascading is a library, the best way to find out the target version is to look at its file name:

$ ls
cascading-2.6.3.jar
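
The same file-name convention applies to most jars mentioned on this page, so the version can be read from the name in a script. This is a minimal Python sketch assuming names of the form `<artifact>-<version>.jar` with a purely numeric, dot-separated version; the helper `version_from_jar_name` is hypothetical.

```python
import re
from typing import Optional, Tuple


def version_from_jar_name(name: str) -> Optional[Tuple[int, ...]]:
    """Return the numeric version embedded in a jar file name, if any."""
    match = re.search(r"-(\d+(?:\.\d+)*)\.jar$", name)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split("."))


found = version_from_jar_name("cascading-2.6.3.jar")
# Compare against the minimum Cascading version listed above (2.1.6).
print(found is not None and found >= (2, 1, 6))
```

Names that carry a non-numeric qualifier in the version would not match this pattern and return `None`, so treat a `None` result as "unknown" rather than "too old".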

Apache Hive

Apache Hive 0.10 or higher. We recommend using the latest release of Hive (currently 1.2.1).

One can find out the Hive version from its folder name or from the command line:

$ bin/hive --version
Hive version 1.2.1

Apache Pig

Pig 0.10.0 or higher. We recommend using the latest release of Pig (currently 0.15.0).

In a similar fashion, the Pig version can be discovered from its folder path or from the command line:

$ bin/pig -i
Apache Pig version 0.15.0

Apache Spark

Spark 1.3.0 or higher. We recommend using the latest release of Spark (currently 2.2.0). Because elasticsearch-hadoop provides native integration with Apache Spark (which is the recommended approach), it does not matter which Spark binary one is using. The same applies when using the Hadoop layer to integrate the two, as elasticsearch-hadoop supports the majority of Hadoop distributions.

The Spark version can typically be discovered by looking at its folder name:

$ pwd
/libs/spark/spark-2.2.0-bin-XXXXX

or by running its shell:

$ bin/spark-shell
...
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.2.0
      /_/
...

Apache Spark SQL

If you plan to use Spark SQL, make sure to download the appropriate jar. While Spark SQL is part of the Spark distribution, it is not part of Spark core and has its own jar. Thus, when constructing the classpath, make sure to include spark-sql-<scala-version>.jar or the Spark assembly: spark-assembly-2.2.0-<distro>.jar

elasticsearch-hadoop supports Spark SQL 1.3 through 1.6 and also Spark SQL 2.0. Since Spark SQL 2.0 is not backwards compatible with Spark SQL 1.6 or lower, elasticsearch-hadoop provides two different artifacts: its main jar, which supports Spark SQL 2.2.0, and a dedicated jar for Spark SQL 1.6 or lower. See the Spark chapter for more information. Note that Spark 1.0-1.2 are no longer supported (again due to backwards-incompatible changes in Spark).

Apache Storm

Storm 1.0.0 or higher. Note that Storm 1.0.0 broke backwards compatibility with previous versions (by changing the package name); however, upgrading is easy and recommended. We recommend using the latest release of Storm (currently 1.0.1).

One can discover the Storm version by looking at its folder or by invoking the command:

$ bin/storm version
1.0.1