Requirements

Before running elasticsearch-hadoop, check the requirements below. This matters even more when deploying elasticsearch-hadoop across a cluster, where the software on some machines might be slightly out of sync. While elasticsearch-hadoop tries its best to fall back and validate its environment, a quick sanity check, especially during upgrades, can save you a lot of headaches.

When checking the version of an artifact, make sure to verify it on all nodes in the cluster.
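One possible way to check every node is sketched below. It assumes passwordless `ssh` access to each machine; the helper names `collect_versions` and `report_version_drift` are hypothetical and not part of elasticsearch-hadoop.

```shell
# Print "<host> <first line of the command output>" for each host.
# Assumption: every host is reachable through passwordless ssh.
collect_versions() {
  cmd=$1
  shift
  for host in "$@"; do
    # -n keeps ssh from reading the caller's standard input
    printf '%s %s\n' "$host" "$(ssh -n "$host" "$cmd" 2>&1 | head -n 1)"
  done
}

# Read "<host> <version text>" lines on stdin and print every host whose
# version text differs from the first line. Exits non-zero on a mismatch.
report_version_drift() {
  awk '{
         host = $1
         sub(/^[^ ]+ +/, "")          # drop the host column, keep the rest
         if (NR == 1) ref = $0
         if ($0 != ref) {
           printf "%s: %s (expected %s)\n", host, $0, ref
           bad = 1
         }
       }
       END { exit bad }'
}
```

With placeholder host names, a run could look like `collect_versions 'java -version' node1 node2 node3 | report_version_drift`; no output and a zero exit status would mean all nodes reported the same version line.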

elasticsearch-hadoop adds no extra requirements to Hadoop (or the various libraries built on top of it, such as Cascading or Pig) or Elasticsearch. However, as a rule of thumb, use the latest stable version of each library (checking its compatibility with Hadoop and the JDK, where applicable).

JDK

JDK level 6.0 (or above), just like Hadoop. We strongly recommend using JDK 6 update 25 or, if possible, JDK 7 update 55 (required for Elasticsearch 1.2 or higher). See this blog post for more information on why JVM versions matter (from Lucene's point of view) and how some JVM updates (for example, anything between JDK 7 update 25 and update 55) can cause index corruption due to some nasty bugs.

One can check the available JDK version from the command line:

$ java -version
java version "1.7.0_55"
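As a rough illustration of automating this check, the sketch below extracts the version string and compares it against a minimum. The helper names `parse_java_version` and `java_at_least` are hypothetical, and the parsing assumes the `1.<major>.0_<update>` format shown above.

```shell
# Read `java -version` output on stdin and print the quoted version string.
parse_java_version() {
  sed -n 's/.*version "\([^"]*\)".*/\1/p' | head -n 1
}

# Succeed if a version such as 1.7.0_55 is at least the given major and
# update numbers. Returns 2 if the string is not in the assumed format.
java_at_least() {
  version=$1
  want_major=$2
  want_update=$3
  major=$(printf '%s\n' "$version" | sed -n 's/^1\.\([0-9][0-9]*\)\..*/\1/p')
  update=$(printf '%s\n' "$version" | sed -n 's/.*_\([0-9][0-9]*\).*/\1/p')
  [ -n "$major" ] || return 2
  update=${update:-0}                  # no "_NN" suffix means update 0
  if [ "$major" -gt "$want_major" ]; then
    return 0
  fi
  [ "$major" -eq "$want_major" ] && [ "$update" -ge "$want_update" ]
}
```

Because `java -version` writes to standard error, a possible invocation is `version=$(java -version 2>&1 | parse_java_version)` followed by `java_at_least "$version" 7 55`.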

Elasticsearch

Elasticsearch version 0.90 or higher is needed to run elasticsearch-hadoop. Using a lower version is not possible, as elasticsearch-hadoop uses features added in 0.90 for distributed, parallel interactions with Elasticsearch. We strongly recommend using the latest stable version of Elasticsearch (currently 1.3.x).

The Elasticsearch version is shown in its folder name:

$ ls
elasticsearch-1.3.4

If Elasticsearch is running (locally or remotely), one can find out its version through REST:

$ curl -XGET http://localhost:9200
{
  "status" : 200,
  "name" : "Dazzler",
  "version" : {
    "number" : "1.3.4",
    "build_hash" : "a70f3ccb52200f8f2c87e9c370c6597448eb3e45",
    "build_timestamp" : "2014-09-30T09:07:17Z",
    "build_snapshot" : false,
    "lucene_version" : "4.9"
  },
  "tagline" : "You Know, for Search"
}
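The response above can also be checked in a script. The following is a minimal sketch that pulls the `number` field out of the JSON with `sed` and compares it against the 0.90 minimum; `es_version_number` and `version_at_least` are hypothetical helper names, and the parsing assumes the pretty-printed layout shown above rather than arbitrary JSON.

```shell
# Read the root REST response on stdin and print the value of "number".
es_version_number() {
  sed -n 's/.*"number"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1
}

# Succeed if a dotted version ($1, for example 1.3.4) is at least
# major $2 and minor $3. Only the first two components are compared.
version_at_least() {
  major=${1%%.*}        # text before the first dot
  rest=${1#*.}          # text after the first dot
  minor=${rest%%.*}     # second component
  if [ "$major" -gt "$2" ]; then
    return 0
  fi
  [ "$major" -eq "$2" ] && [ "$minor" -ge "$3" ]
}
```

A possible use would be `v=$(curl -s -XGET http://localhost:9200 | es_version_number)` followed by `version_at_least "$v" 0 90`.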

Hadoop

Hadoop 1.x (ideally the latest stable version in the 1.x line, currently 1.2.1) or 2.x (ideally the latest stable version, currently 2.2.0). elasticsearch-hadoop is tested daily against Apache Hadoop; any distro compatible with Apache Hadoop should work just fine.

To check the version of Hadoop, one can refer to its folder or jars (which contain the version in their names) or use the command line:

$ bin/hadoop version
Hadoop 1.2.1

As a guide, the table below lists the Hadoop-based distributions against which this version has been tested at various points in time:

Distribution       Release

Apache Hadoop      2.4.x
Apache Hadoop      2.2.x
Apache Hadoop      1.2.x
Apache Hadoop      1.1.x
Amazon EMR         3.1.x
Amazon EMR         3.0.x
Amazon EMR         2.4.x
Cloudera CDH       5.1.x
Cloudera CDH       5.0.x
Cloudera CDH       4.5.x
Cloudera CDH       4.4.x
Cloudera CDH       4.2.2
Hortonworks HDP    2.1.x
Hortonworks HDP    2.0.x
Hortonworks HDP    1.3.x
Greenplum GPHD     1.2
Intel Hadoop       2.5.1
Pivotal HD         2.1.x
Pivotal HD         2.0.x
Pivotal HD         1.1.x
MapR               4.0.x
MapR               3.1.x
MapR               3.0.x
MapR               2.1.x

Use the table above for guidance only. If your distro (or its version) is not listed, that does not mean elasticsearch-hadoop is incompatible with it; go ahead, try it out, and let us know how it went.

Apache YARN / Hadoop 2.x

The elasticsearch-hadoop binary runs transparently on both Hadoop 1.x and YARN / Hadoop 2.x without any changes or modifications.

Cascading

Cascading version 2.1.x (2.1.6) or higher. We recommend using the latest release of Cascading (currently 2.5.6).

Since Cascading is a library, the best way to find out the target version is to look at its file name:

$ ls
cascading-2.5.6.jar
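Reading the version off a jar name can be scripted as well. The sketch below assumes the common `<name>-<version>.jar` naming convention, which not every jar follows; `jar_version` is a hypothetical helper name.

```shell
# Print the version embedded in a jar file name, for example
# cascading-2.5.6.jar gives 2.5.6. Prints nothing if no "-<digit>" is found.
jar_version() {
  name=${1##*/}         # strip any leading directories
  name=${name%.jar}     # strip the .jar extension
  printf '%s\n' "$name" | sed -n 's/.*-\([0-9].*\)$/\1/p'
}
```

For instance, `jar_version cascading-2.5.6.jar` would print the version part of the file name listed above.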

Apache Hive

Apache Hive 0.9 or higher. We recommend using the latest release of Hive (currently 0.12.0 or 0.13.1).

One can find out the Hive version from its folder name or the command line:

$ bin/hive --version
Hive version 0.13.1

Apache Pig

Pig 0.10.0 or higher. We recommend using the latest release of Pig (currently 0.13).

In a similar fashion, the Pig version can be discovered from its folder path or through the command line:

$ bin/pig -i
Apache Pig version 0.13.0