WARNING: Version 5.1 has passed its EOL date.
This documentation is no longer being maintained and may be removed. If you are running this version, we strongly advise you to upgrade. For the latest information, see the current release documentation.
Key features
The key features of Elasticsearch for Apache Hadoop include:
- Scalable Map/Reduce model
- elasticsearch-hadoop is built around Map/Reduce: every operation done in elasticsearch-hadoop results in multiple Hadoop tasks (based on the number of target shards) that interact, in parallel, with Elasticsearch.
- REST based
- elasticsearch-hadoop uses the Elasticsearch REST interface for communication, allowing for flexible deployments by minimizing the number of ports that need to be open within a network.
- Self contained
- The library has been designed to be small and efficient. At around 300KB and with no extra dependencies outside Hadoop itself, distributing elasticsearch-hadoop within your cluster is simple and fast.
- Universal jar
- Whether you are using Hadoop 1.x or Hadoop 2.x, vanilla Apache Hadoop or a specific distribution, the same elasticsearch-hadoop jar works transparently across all of them.
- Memory and I/O efficient
- elasticsearch-hadoop is focused on performance. From pull-based parsing to bulk updates and direct conversion to/from native types, elasticsearch-hadoop keeps its memory and network I/O usage finely tuned.
- Adaptive I/O
- elasticsearch-hadoop detects transport errors and retries automatically. If an Elasticsearch node dies, it re-routes the request to the available nodes (which are discovered automatically). Additionally, if Elasticsearch is overloaded, elasticsearch-hadoop detects the rejected data and resends it until it is either processed or the user-defined policy applies.
- Facilitates data co-location
- elasticsearch-hadoop fully integrates with Hadoop, exposing its network access information so that co-located Elasticsearch and Hadoop clusters are aware of each other and reduce network I/O.
- Map/Reduce API support
- At its core, elasticsearch-hadoop uses the low-level Map/Reduce API to read and write data to Elasticsearch, allowing for maximum integration flexibility and performance.
- old (mapred) & new (mapreduce) Map/Reduce APIs supported
- elasticsearch-hadoop automatically adjusts to your environment; one does not have to change between using the mapred or mapreduce APIs - both are supported, by the same classes, at the same time.
- Apache Hive support
- Run Hive queries against Elasticsearch for advanced analytics and real-time responses. elasticsearch-hadoop exposes Elasticsearch as a Hive table so your scripts can crunch through data faster than ever.
- Apache Pig support
- elasticsearch-hadoop supports Apache Pig, exposing Elasticsearch as a native Pig Storage. Run your Pig scripts against Elasticsearch without any modifications to your configuration or the Pig client.
- Cascading support
- Cascading is an application framework that lets Java developers build robust applications on Apache Hadoop. With elasticsearch-hadoop, Cascading can run its flows directly against Elasticsearch.
- Apache Spark
- Run fast transformations directly against Elasticsearch, either by streaming data or indexing arbitrary RDDs. Available in both Java and Scala flavors.
- Apache Storm
- elasticsearch-hadoop supports Apache Storm, exposing Elasticsearch as both a Spout (source) and a Bolt (sink).
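
The Adaptive I/O item above describes resending rejected data until it is either processed or a user-defined policy applies. The following is a minimal sketch of that idea only, not the elasticsearch-hadoop implementation: `send_with_retries`, `send_bulk`, `make_flaky_sender`, and `max_retries` are hypothetical names introduced for illustration, and the simulated sender stands in for an overloaded Elasticsearch node. A real client would typically also wait between attempts, which is omitted here.

```python
from typing import Callable, Dict, List, Tuple

Doc = Dict[str, object]


def send_with_retries(
    docs: List[Doc],
    send_bulk: Callable[[List[Doc]], List[Doc]],
    max_retries: int = 3,
) -> Tuple[int, List[Doc]]:
    """Send docs as a bulk request, resending only the rejected ones.

    send_bulk is a hypothetical transport function (an assumption, not a
    library call) that returns the documents the server rejected; an empty
    list means everything was accepted. max_retries plays the role of the
    user-defined policy.

    Returns (number accepted, documents still rejected when retries ran out).
    """
    pending = list(docs)
    accepted = 0
    attempts = 0
    # One initial attempt plus up to max_retries resends.
    while pending and attempts <= max_retries:
        rejected = send_bulk(pending)
        accepted += len(pending) - len(rejected)
        pending = rejected  # only rejected documents are sent again
        attempts += 1
    return accepted, pending


def make_flaky_sender(capacity: int) -> Callable[[List[Doc]], List[Doc]]:
    """Simulate an overloaded node that accepts at most `capacity` docs per call."""
    def send_bulk(batch: List[Doc]) -> List[Doc]:
        return batch[capacity:]  # everything past the capacity is rejected
    return send_bulk


if __name__ == "__main__":
    docs = [{"id": i} for i in range(10)]
    accepted, leftover = send_with_retries(docs, make_flaky_sender(4))
    print(accepted, len(leftover))
```

Resending only the rejected subset, rather than the whole batch, avoids indexing already accepted documents twice and keeps each retry smaller than the last.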