Installation

elasticsearch-hadoop binaries can be obtained either by downloading them from the elastic.co site as a ZIP (containing project jars, sources and documentation) or by using any Maven-compatible tool with the following dependency:

<dependency>
  <groupId>org.elasticsearch</groupId>
  <artifactId>elasticsearch-hadoop</artifactId>
  <version>7.10.2</version>
</dependency>

The jar above contains all the features of elasticsearch-hadoop and does not require any other dependencies at runtime; in other words it can be used as is.

The elasticsearch-hadoop binary is suitable for Hadoop 2.x (also known as YARN) environments. Support for Hadoop 1.x environments is deprecated as of 5.5 and is no longer tested against as of 6.0.

Minimalistic binaries

In addition to the uber jar, elasticsearch-hadoop provides minimalistic jars for each integration, tailored for those who use just one module (in all other situations the uber jar is recommended). These jars are smaller, and each uses a dedicated pom that covers only the dependencies it needs. They are available under the same groupId, using an artifactId with the pattern elasticsearch-hadoop-{integration} (the Spark and Storm artifacts listed here use the shorter elasticsearch-{integration} form):

Map/Reduce.

<dependency>
  <groupId>org.elasticsearch</groupId>
  <artifactId>elasticsearch-hadoop-mr</artifactId> 
  <version>7.10.2</version>
</dependency>

mr artifact

Apache Hive.

<dependency>
  <groupId>org.elasticsearch</groupId>
  <artifactId>elasticsearch-hadoop-hive</artifactId> 
  <version>7.10.2</version>
</dependency>

hive artifact

Apache Pig.

<dependency>
  <groupId>org.elasticsearch</groupId>
  <artifactId>elasticsearch-hadoop-pig</artifactId> 
  <version>7.10.2</version>
</dependency>

pig artifact

Apache Spark.

<dependency>
  <groupId>org.elasticsearch</groupId>
  <artifactId>elasticsearch-spark-20_2.10</artifactId> 
  <version>7.10.2</version>
</dependency>

spark artifact. The -20 part of the suffix indicates the Spark version the artifact is compatible with: use 20 for Spark 2.0+ and 13 for Spark 1.3-1.6. The _2.10 part indicates the compatible Scala version, which currently matches the Scala version used by Spark itself.

The Spark connector framework is the most sensitive to version incompatibilities. For your convenience, a version compatibility matrix has been provided below:

Spark Version   Scala Version   ES-Hadoop Artifact ID
1.0 - 1.2       2.10            <unsupported>
1.0 - 1.2       2.11            <unsupported>
1.3 - 1.6       2.10            elasticsearch-spark-13_2.10
1.3 - 1.6       2.11            elasticsearch-spark-13_2.11
2.0+            2.10            elasticsearch-spark-20_2.10
2.0+            2.11            elasticsearch-spark-20_2.11
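
As a rough illustration of how a row of the matrix above maps to a POM, the sketch below builds the artifactId from two Maven properties. The property names es.spark.compat and scala.binary.version are assumptions chosen for this example; elasticsearch-hadoop does not define them, and any names would work.

```xml
<!-- Sketch only: the property names are illustrative assumptions, not part of elasticsearch-hadoop -->
<properties>
  <!-- "20" for Spark 2.0+, "13" for Spark 1.3-1.6 -->
  <es.spark.compat>20</es.spark.compat>
  <!-- should match the Scala version used by your Spark distribution -->
  <scala.binary.version>2.11</scala.binary.version>
</properties>

<dependencies>
  <dependency>
    <groupId>org.elasticsearch</groupId>
    <!-- with the values above this resolves to elasticsearch-spark-20_2.11 -->
    <artifactId>elasticsearch-spark-${es.spark.compat}_${scala.binary.version}</artifactId>
    <version>7.10.2</version>
  </dependency>
</dependencies>
```

Keeping the two values in properties means that a change of Spark or Scala version touches one line each, which makes it easier to keep the connector in step with the rest of the build.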

Storm.

<dependency>
  <groupId>org.elasticsearch</groupId>
  <artifactId>elasticsearch-storm</artifactId> 
  <version>7.10.2</version>
</dependency>

storm artifact

Releases are available in the central Maven repository.

Development Builds

Development builds (also called nightly or snapshot builds) are published daily to the sonatype-oss repository (see below). Make sure to use snapshot versioning:

<dependency>
  <groupId>org.elasticsearch</groupId>
  <artifactId>elasticsearch-hadoop</artifactId>
  <version>{version}-SNAPSHOT</version> 
</dependency>

Notice the -SNAPSHOT suffix, which indicates a development build.

You must also enable the dedicated snapshots repository:

<repositories>
  <repository>
    <id>sonatype-oss</id>
    <url>http://oss.sonatype.org/content/repositories/snapshots</url> 
    <snapshots><enabled>true</enabled></snapshots> 
  </repository>
</repositories>

The url element adds the snapshot repository.

The snapshots element enables snapshot resolution on the repository; without it, Maven will not find the development builds.

Upgrading Your Stack

Elasticsearch for Apache Hadoop is a client library for Elasticsearch, albeit one with extended functionality for supporting operations on Hadoop/Spark. When upgrading Hadoop/Spark versions, verify that the connector supports your new versions, and upgrade your elasticsearch-hadoop version as appropriate.

Elasticsearch for Apache Hadoop maintains backwards compatibility with the most recent minor version of Elasticsearch’s previous major release (5.X supports back to 2.4.X, 6.X supports back to 5.6.X, and so on). When you are upgrading your version of Elasticsearch, it is best to upgrade elasticsearch-hadoop to the new version (or higher) first. The new elasticsearch-hadoop version should continue to work with your previous Elasticsearch version, allowing you to upgrade as normal.
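
One possible way to make the connector-first upgrade order a one-line change is to keep the connector version in a single Maven property. This is only a sketch: the property name elasticsearch-hadoop.version is hypothetical, and the version shown is the one used elsewhere on this page.

```xml
<properties>
  <!-- hypothetical property name; bump this first, before upgrading the Elasticsearch cluster -->
  <elasticsearch-hadoop.version>7.10.2</elasticsearch-hadoop.version>
</properties>

<dependencies>
  <dependency>
    <groupId>org.elasticsearch</groupId>
    <artifactId>elasticsearch-hadoop</artifactId>
    <!-- every module that inherits this property picks up the new connector version -->
    <version>${elasticsearch-hadoop.version}</version>
  </dependency>
</dependencies>
```

After rebuilding and redeploying jobs against the new connector version, the Elasticsearch upgrade itself can proceed as described above.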

Elasticsearch for Apache Hadoop does not support rolling upgrades well. During a rolling upgrade, the nodes that elasticsearch-hadoop communicates with regularly disappear and come back online. Because of the constant connection failures that elasticsearch-hadoop experiences during a rolling upgrade, there is a high probability that your jobs will fail. It is therefore recommended that you disable any elasticsearch-hadoop based read or write jobs against Elasticsearch for the duration of the rolling upgrade.