WARNING: Version 2.1 has passed its EOL date.
This documentation is no longer being maintained and may be removed. If you are running this version, we strongly advise you to upgrade. For the latest information, see the current release documentation.
Configuration
elasticsearch-hadoop behavior can be customized through the properties below, typically by setting them on the target job Hadoop Configuration. However, some of them can be specified through other means depending on the library used (see the relevant section).
All configuration properties start with the es prefix. The namespace es.internal is reserved by the library for its internal use and should not be used by the user at any point.
Required settings
-
es.resource
- Elasticsearch resource location, where data is read from and written to. Requires the format <index>/<type> (relative to the Elasticsearch host/port; see below).
es.resource = twitter/tweet # index 'twitter', type 'tweet'
-
es.resource.read
(defaults to es.resource) - Elasticsearch resource used for reading (but not writing) data. Useful when reading and writing data to different Elasticsearch indices within the same job. Typically set automatically (except for the Map/Reduce module, which requires manual configuration).
-
es.resource.write
(defaults to es.resource) - Elasticsearch resource used for writing (but not reading) data. Typically used for dynamic resource writes or when writing and reading data to different Elasticsearch indices within the same job. Typically set automatically (except for the Map/Reduce module, which requires manual configuration).
Note that [multiple](https://www.elastic.co/guide/en/elasticsearch/guide/current/multi-index-multi-type.html) indices and/or types are allowed only for reading. Use _all/types to search types in all indices or index/ to search all types within index.
Do note that reading multiple indices/types typically works only when they have the same structure and only with some libraries. Integrations that require a strongly typed mapping (such as a table in Hive or SparkSQL) are likely to fail.
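The fallback between es.resource, es.resource.read and es.resource.write described above can be modelled in a few lines. This is only a sketch of the documented defaults, not the library's code: the helper name effective_resources and the target index twitter-copy are hypothetical, and the settings are held in a plain dictionary rather than a Hadoop Configuration.

```python
def effective_resources(settings):
    """Return the (read, write) resources, falling back to es.resource."""
    base = settings.get("es.resource")
    read = settings.get("es.resource.read", base)
    write = settings.get("es.resource.write", base)
    if read is None and write is None:
        raise ValueError("es.resource (or a read/write override) is required")
    return read, write


# Read from one index and write to another within the same job.
job_settings = {
    "es.resource": "twitter/tweet",
    "es.resource.write": "twitter-copy/tweet",  # hypothetical target index
}
read_from, write_to = effective_resources(job_settings)
print(read_from, write_to)
```

With only es.resource set, both values resolve to the same resource.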
Dynamic/multi resource writes
For writing, elasticsearch-hadoop allows the target resource to be resolved at runtime by using patterns (in the {<field-name>} format), based on the data being streamed to Elasticsearch. That is, one can save documents to a certain index or type based on one or multiple fields resolved from the document about to be saved.
For example, assuming the following document set (described here in JSON for readability - feel free to translate this into the actual Java objects):
{ "media_type":"game", "title":"Final Fantasy VI", "year":"1994" },
{ "media_type":"book", "title":"Harry Potter", "year":"2010" },
{ "media_type":"music", "title":"Surfing With The Alien", "year":"1987" }
to index each of them based on their media_type one would use the following pattern:
# index the documents based on their type
es.resource.write = my-collection/{media_type}
which would result in Final Fantasy VI indexed under my-collection/game, Harry Potter under my-collection/book and Surfing With The Alien under my-collection/music.
For more information, please refer to the dedicated integration section.
Dynamic resources are supported only for writing; for multi-index/type reads, use an appropriate search query.
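To make the pattern resolution concrete, here is a minimal sketch of how a {<field-name>} placeholder could be expanded per document. It only models the behavior described in this section; the function name resolve_resource and the regular expression are assumptions for illustration, not the library's implementation.

```python
import re

# Matches a plain {field-name} placeholder (no formatting part).
FIELD = re.compile(r"\{([^{}:]+)\}")


def resolve_resource(pattern, doc):
    """Expand every {field} in `pattern` with the value found in `doc`."""
    def lookup(match):
        name = match.group(1)
        if name not in doc:
            raise KeyError("field %r missing from document" % name)
        return str(doc[name])

    return FIELD.sub(lookup, pattern)


docs = [
    {"media_type": "game", "title": "Final Fantasy VI", "year": "1994"},
    {"media_type": "book", "title": "Harry Potter", "year": "2010"},
    {"media_type": "music", "title": "Surfing With The Alien", "year": "1987"},
]
targets = [resolve_resource("my-collection/{media_type}", d) for d in docs]
print(targets)
```

A document lacking the referenced field raises an error in this sketch; how the library itself reacts in that case is not covered here.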
Formatting dynamic/multi resource writes
When using dynamic/multi writes, one can also specify a formatting of the value returned by the field. Out of the box, elasticsearch-hadoop provides formatting for date/timestamp fields, which is useful for automatically grouping time-based data (such as logs) within a certain time range under the same index. By using the Java SimpleDateFormat syntax, one can format and parse the date in a locale-sensitive manner.
For example, assuming the data contains a @timestamp field, one can group the documents in daily indices by adding a formatting pattern to the field name in es.resource.write, as in {@timestamp:YYYY.MM.dd}.
The same configuration property is used (es.resource.write); however, the special : character introduces the formatting pattern. Please refer to the SimpleDateFormat javadocs for more information on the syntax supported.
In this case YYYY.MM.dd translates the date into the year (specified by four digits), followed by the month and the day (two digits each), such as 2015.01.28.
Logstash users will find this pattern quite familiar.
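The effect of a formatted placeholder can be sketched as follows. This is an approximation written in Python rather than Java: the three pattern tokens used in the text are translated by hand to strftime directives, which is an assumption of this example and covers nothing beyond YYYY, MM and dd. The index prefix logs- and the type event are hypothetical.

```python
import re
from datetime import datetime

# Hand-written translation of the tokens used above to strftime directives.
TOKENS = {"YYYY": "%Y", "MM": "%m", "dd": "%d"}

# {field} or {field:format}
PLACEHOLDER = re.compile(r"\{([^{}:]+)(?::([^{}]+))?\}")


def to_strftime(fmt):
    """Translate e.g. 'YYYY.MM.dd' into '%Y.%m.%d'."""
    return re.sub("YYYY|MM|dd", lambda m: TOKENS[m.group(0)], fmt)


def resolve(pattern, doc):
    """Expand placeholders, formatting date values when a format is given."""
    def expand(match):
        field, fmt = match.group(1), match.group(2)
        value = doc[field]
        if fmt is None:
            return str(value)
        return value.strftime(to_strftime(fmt))

    return PLACEHOLDER.sub(expand, pattern)


doc = {"@timestamp": datetime(2015, 1, 28, 10, 30), "message": "started"}
daily = resolve("logs-{@timestamp:YYYY.MM.dd}/event", doc)
print(daily)
```

All documents whose @timestamp falls on the same day resolve to the same resource, which is what groups them into one daily index.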
Essential settings
Network
-
es.nodes
(default localhost) - List of Elasticsearch nodes to connect to. When using Elasticsearch remotely, do set this option. Note that the list does not have to contain every node inside the Elasticsearch cluster; these are discovered automatically by elasticsearch-hadoop by default (see below). Each node can have its HTTP/REST port specified manually (e.g. mynode:9600).
-
es.port
(default 9200) - Default HTTP/REST port used for connecting to Elasticsearch. This setting is applied to the nodes in es.nodes that do not have any port specified.
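The interplay between es.nodes and es.port can be illustrated with a short sketch. It assumes the node list is comma-separated and ignores IPv6 literals; the helper name node_addresses and the host es1 are made up for the example.

```python
def node_addresses(settings):
    """Return (host, port) pairs, applying es.port where no port is given."""
    default_port = int(settings.get("es.port", 9200))
    nodes = settings.get("es.nodes", "localhost")
    addresses = []
    for entry in nodes.split(","):  # assumption: comma-separated list
        entry = entry.strip()
        if not entry:
            continue
        host, sep, port = entry.partition(":")
        addresses.append((host, int(port) if sep else default_port))
    return addresses


print(node_addresses({"es.nodes": "es1, mynode:9600"}))
```

Only es1 picks up the default port here; mynode keeps the port given explicitly.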
Querying
-
es.query
(default none) - Holds the query used for reading data from the specified es.resource. By default it is not set/empty, meaning the entire data under the specified index/type is returned. es.query can have three forms:
-
uri query
- using the form ?uri_query, one can specify a query string. Notice the leading ?.
-
query dsl
- using the form query_dsl. Note that the query dsl needs to start with { and end with }.
-
external resource
- if neither of the two forms above matches, elasticsearch-hadoop will try to interpret the parameter as a path within the HDFS file-system. If that is not the case, it will try to load the resource from the classpath or, if that fails, from the Hadoop DistributedCache. The resource should contain either a uri query or a query dsl.
To wit, here is an example:
# uri (or parameter) query
es.query = ?q=costinl
# query dsl
es.query = { "query" : { "term" : { "user" : "costinl" } } }
# external resource
es.query = org/mypackage/myquery.json
In other words, es.query is flexible enough so that you can use whatever search api you prefer, either inline or by loading it from an external resource.
We recommend using query dsl externalized in a file, included within the job jar (and thus available on its classpath). This makes it easy to identify, debug and organize your queries. Throughout the documentation we use the uri query to save text and increase readability; real-life queries quickly become unwieldy when used as uris.
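The three forms of es.query are distinguished purely by the shape of the value. A rough sketch of that decision, following the rules listed above (the function name query_form and the returned labels are illustrative only):

```python
def query_form(value):
    """Classify an es.query value as 'uri', 'dsl' or 'external'."""
    text = value.strip()
    if text.startswith("?"):
        return "uri"
    if text.startswith("{") and text.endswith("}"):
        return "dsl"
    # Anything else is treated as the location of an external resource.
    return "external"


for candidate in (
    "?q=costinl",
    '{ "query" : { "term" : { "user" : "costinl" } } }',
    "org/mypackage/myquery.json",
):
    print(query_form(candidate), "<-", candidate)
```

The sketch stops at classification; locating the external resource (HDFS, classpath, DistributedCache) is left out.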
Operation
-
es.input.json
(default false) - Whether the input is already in JSON format or not (the default). Please see the appropriate section of each integration for more details about using JSON directly.
-
es.write.operation
(default index) - The write operation elasticsearch-hadoop should perform. Can be any of:
-
index
(default) - new data is added while existing data (based on its id) is replaced (reindexed).
-
create
- adds new data - if the data already exists (based on its id), an exception is thrown.
-
update
- updates existing data (based on its id). If no data is found, an exception is thrown.
-
upsert
- known as merge: inserts the data if it does not exist, updates it if it does (based on its id).
-
Added in 2.1.
-
es.output.json
(default false) - Whether the output from the connector should be in JSON format or not (the default). When enabled, the documents are returned in raw JSON format (as returned from Elasticsearch). Please see the appropriate section of each integration for more details about using JSON directly.
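The four values of es.write.operation differ only in how they treat a document id that is already present or absent. The in-memory model below illustrates those rules; it uses a dictionary in place of an index and treats update as a shallow merge of fields, which is an assumption of this sketch rather than a statement about Elasticsearch internals.

```python
def apply_write(store, operation, doc_id, doc):
    """Apply one write to `store`, a dict standing in for an index."""
    exists = doc_id in store
    if operation == "index":
        store[doc_id] = dict(doc)  # add, or replace existing data
    elif operation == "create":
        if exists:
            raise ValueError("document %r already exists" % doc_id)
        store[doc_id] = dict(doc)
    elif operation == "update":
        if not exists:
            raise KeyError("document %r not found" % doc_id)
        store[doc_id].update(doc)
    elif operation == "upsert":
        store.setdefault(doc_id, {}).update(doc)  # insert or merge
    else:
        raise ValueError("unknown operation %r" % operation)


index = {}
apply_write(index, "index", "1", {"user": "costinl", "likes": 1})
apply_write(index, "upsert", "1", {"likes": 2})       # exists: merged
apply_write(index, "upsert", "2", {"user": "other"})  # missing: inserted
print(index)
```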
Mapping (when writing to Elasticsearch)
-
es.mapping.id
(default none) - The document field/property name containing the document id.
-
es.mapping.parent
(default none) - The document field/property name containing the document parent. To specify a constant, use the <CONSTANT> format.
-
es.mapping.version
(default none) - The document field/property name containing the document version. To specify a constant, use the <CONSTANT> format.
-
es.mapping.version.type
(default depends on es.mapping.version) - Indicates the type of versioning used. If es.mapping.version is undefined (default), its value is unspecified. If es.mapping.version is specified, its value becomes external.
-
es.mapping.routing
(default none) - The document field/property name containing the document routing. To specify a constant, use the <CONSTANT> format.
-
es.mapping.ttl
(default none) - The document field/property name containing the document time-to-live. To specify a constant, use the <CONSTANT> format.
-
es.mapping.timestamp
(default none) - The document field/property name containing the document timestamp. To specify a constant, use the <CONSTANT> format.
Added in 2.1.
-
es.mapping.date.rich
(default true) - Whether to create a rich Date-like object for Date fields in Elasticsearch or return them as primitives (String or long). By default this is true. The actual object type is based on the library used; a notable exception is Map/Reduce, which provides no built-in Date object, and as such LongWritable and Text are returned regardless of this setting.
Added in 2.1.
-
es.mapping.include
(default none) - Field/property to be included in the document sent to Elasticsearch. Useful for extracting the needed data from entities. The syntax is similar to that of Elasticsearch include/exclude. Multiple values can be specified by using a comma. By default, no value is specified meaning all properties/fields are included.
Added in 2.1.
-
es.mapping.exclude
(default none) - Field/property to be excluded in the document sent to Elasticsearch. Useful for eliminating unneeded data from entities. The syntax is similar to that of Elasticsearch include/exclude. Multiple values can be specified by using a comma. By default, no value is specified meaning no properties/fields are excluded.
For example:
# extracting the id from the field called 'uuid'
es.mapping.id = uuid
# specifying a parent with id '123'
es.mapping.parent = <123>
# combine include / exclude for complete control
# include
es.mapping.include = u*, foo.*
# exclude
es.mapping.exclude = *.description
Using the configuration above, each entry will have only its top-level fields starting with u and nested fields under foo included in the document, with the exception of any nested field named description. Additionally, the document parent will be 123 while the document id is extracted from the field uuid.
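The combined effect of es.mapping.include and es.mapping.exclude can be approximated with simple wildcard matching over dotted field paths. The sketch below uses Python's fnmatch for the wildcards, which may not match the library's (or Elasticsearch's) rules in every corner case; the helper names and the sample document are invented for illustration.

```python
from fnmatch import fnmatchcase


def flatten(doc, prefix=""):
    """Yield (dotted.path, value) pairs for every leaf field of `doc`."""
    for key, value in doc.items():
        path = prefix + key
        if isinstance(value, dict):
            yield from flatten(value, path + ".")
        else:
            yield path, value


def select_fields(doc, include, exclude):
    """Keep fields matching `include` (if any) and not matching `exclude`."""
    kept = {}
    for path, value in flatten(doc):
        if include and not any(fnmatchcase(path, p) for p in include):
            continue
        if any(fnmatchcase(path, p) for p in exclude):
            continue
        kept[path] = value
    return kept


entry = {
    "uuid": "abc",
    "name": "ignored",
    "foo": {"bar": 1, "description": "dropped"},
}
print(select_fields(entry, ["u*", "foo.*"], ["*.description"]))
```

An empty include list keeps everything, mirroring the documented default of including all fields.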
Metadata (when reading from Elasticsearch)
-
es.read.metadata
(default false) - Whether to include the document metadata (such as id and version) in the results or not (default).
-
es.read.metadata.field
(default _metadata) - The field under which the metadata information is placed. When es.read.metadata is set to true, the information is returned as a Map under the specified field.
-
es.read.metadata.version
(default false) - Whether to include the document version in the returned metadata. Applicable only if es.read.metadata is enabled.
Update settings (when writing to Elasticsearch)
When using the update or upsert operation, additional settings (that mirror the update API) are available:
-
es.update.script
(default none) - Script used for updating the document.
-
es.update.script.lang
(default none) - Script language. By default, no value is specified applying the node configuration.
-
es.update.script.params
(default none) - Script parameters (if any). Each parameter names the field/property of the document (currently read) whose value is used. To specify a constant, use the <CONSTANT> format. Multiple values can be specified through commas (,).
For example:
# specifying 2 parameters, one extracting the value from field 'number', the other containing the value '123':
es.update.script.params = param1:number,param2:<123>
-
es.update.script.params.json
- Script parameters specified in raw, JSON format. The specified value is passed as is, without any further processing or filtering. Typically used for migrating existing update scripts.
For example:
es.update.script.params.json = {"param1":1, "param2":2}
-
es.update.retry.on.conflict
(default 0) - How many times an update to a document is retried in case of conflict. Useful in concurrent environments.
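The es.update.script.params syntax shown earlier (name:field pairs, with <CONSTANT> for literal values) can be read as a small mapping from parameter names to document values. The sketch below models that reading; the helper name script_params is hypothetical, and constants are kept as strings here, which is an assumption.

```python
def script_params(spec, doc):
    """Resolve 'name:field,name:<CONSTANT>' against the current document."""
    params = {}
    for pair in spec.split(","):
        name, _, source = pair.strip().partition(":")
        if source.startswith("<") and source.endswith(">"):
            params[name] = source[1:-1]  # constant, taken literally
        else:
            params[name] = doc[source]   # value read from the document
    return params


current = {"number": 7, "title": "example"}
print(script_params("param1:number,param2:<123>", current))
```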
Advanced settings
Index
-
es.index.auto.create
(default yes) - Whether elasticsearch-hadoop should create an index (if it is missing) when writing data to Elasticsearch or fail.
-
es.index.read.missing.as.empty
(default no) - Whether elasticsearch-hadoop will allow reading of non-existing indices (and return an empty data set) or not (and throw an exception).
-
es.field.read.empty.as.null
(default yes) - Whether elasticsearch-hadoop will treat empty fields as null. This setting is typically not needed (as elasticsearch-hadoop already handles the null case) but is enabled to make it easier to work with text fields that haven’t been sanitized yet.
-
es.field.read.validate.presence
(default warn) - To help spot possible mistakes when querying data from Hadoop (which result in incorrect data being returned), elasticsearch-hadoop can perform validation, spotting missing fields and potential typos. Possible values are:
-
ignore
- no validation is performed
-
warn
- a warning message is logged in case the validation fails
-
strict
- an exception is thrown, halting the job, if a field is missing
-
The default (warn) will log any typos to the console when the job starts:
WARN main mr.EsInputFormat - Field(s) [naem, adress] not found in the Elasticsearch mapping specified; did you mean [name, location.address]?
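The three modes of es.field.read.validate.presence amount to a small decision on what to do with fields that are requested but absent from the mapping. A minimal model of that decision follows; the function name, logger name and message wording are illustrative and do not reproduce the library's output or its typo suggestions.

```python
import logging

log = logging.getLogger("es.sketch")  # hypothetical logger name


def validate_fields(requested, mapped, mode="warn"):
    """Return the missing fields, warning or failing according to `mode`."""
    if mode == "ignore":
        return []  # no validation is performed
    missing = [field for field in requested if field not in mapped]
    if not missing:
        return missing
    if mode == "warn":
        log.warning("Field(s) %s not found in the mapping", missing)
        return missing
    if mode == "strict":
        raise ValueError("Field(s) %s not found in the mapping" % missing)
    raise ValueError("unknown mode %r" % mode)


mapping_fields = {"name", "location.address"}
print(validate_fields(["name", "naem"], mapping_fields))
```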
Network
-
es.nodes.discovery
(default true) - Whether to discover the nodes within the Elasticsearch cluster or only to use the ones given in es.nodes for metadata queries. Note that this setting only applies during start-up; afterwards, when reading and writing, elasticsearch-hadoop uses the target index shards (and their hosting nodes) unless es.nodes.client.only is enabled.
-
es.nodes.client.only
(default false) - Whether to use Elasticsearch client nodes (or load-balancers). When enabled, elasticsearch-hadoop will route all its requests (after nodes discovery, if enabled) through the client nodes within the cluster. Note this typically significantly reduces the node parallelism and thus it is disabled by default. Enabling it also disables es.nodes.data.only (since a client node is a non-data node).
Added in 2.1.2.
-
es.nodes.data.only
(default true) - Whether to use Elasticsearch data nodes only. When enabled, elasticsearch-hadoop will route all its requests (after nodes discovery, if enabled) through the data nodes within the cluster. The purpose of this configuration setting is to avoid overwhelming non-data nodes as these tend to be "smaller" nodes. This is enabled by default.
Added in 2.2.
-
es.http.timeout
(default 1m) - Timeout for HTTP/REST connections to Elasticsearch.
-
es.http.retries
(default 3) - Number of retries for establishing a (broken) http connection. The retries are applied for each conversation with an Elasticsearch node. Once the retries are depleted, the connection will automatically be rerouted to the next available Elasticsearch node (based on the declaration of es.nodes, followed by the discovered nodes, if enabled).
es.scroll.keepalive
(default 10m) - The maximum duration of result scrolls between query requests.
-
es.scroll.size
(default 50) - Number of results/items returned by each individual scroll.
-
es.action.heart.beat.lead
(default 15s) - The lead to task timeout before elasticsearch-hadoop informs Hadoop the task is still running to prevent task restart.
Added in 2.1.
Basic Authentication
-
es.net.http.auth.user
- Basic Authentication user name
-
es.net.http.auth.pass
- Basic Authentication password
Added in 2.1.
SSL
-
es.net.ssl
(default false) - Enable SSL
-
es.net.ssl.keystore.location
- key store (if used) location (typically a URL, without a prefix it is interpreted as a classpath entry)
-
es.net.ssl.keystore.pass
- key store password
-
es.net.ssl.keystore.type
(default JKS) - key store type. PKCS12 is a common alternative format
-
es.net.ssl.truststore.location
- trust store location (typically a URL, without a prefix it is interpreted as a classpath entry)
-
es.net.ssl.truststore.pass
- trust store password
-
es.net.ssl.cert.allow.self.signed
(default false) - Whether or not to allow self signed certificates
-
es.net.ssl.protocol
(default TLS) - SSL protocol to be used
Proxy
-
es.net.proxy.http.host
- Http proxy host name
-
es.net.proxy.http.port
- Http proxy port
-
es.net.proxy.http.user
- Http proxy user name
-
es.net.proxy.http.pass
- Http proxy password
-
es.net.proxy.http.use.system.props
(default yes) - Whether to use the system Http proxy properties (namely http.proxyHost and http.proxyPort) or not
-
es.net.proxy.socks.host
- Socks proxy host name
-
es.net.proxy.socks.port
- Socks proxy port
-
es.net.proxy.socks.user
- Socks proxy user name
-
es.net.proxy.socks.pass
- Socks proxy password
-
es.net.proxy.socks.use.system.props
(default yes) - Whether to use the system Socks proxy properties (namely socksProxyHost and socksProxyPort) or not
elasticsearch-hadoop allows proxy settings to be applied only to its own connections using the settings above. Take extra care when there is already a JVM-wide proxy setting (typically through system properties) to avoid unexpected behavior.
Serialization
-
es.batch.size.bytes
(default 1mb) - Size (in bytes) for batch writes using Elasticsearch bulk API. Note the bulk size is allocated per task instance. Always multiply by the number of tasks within a Hadoop job to get the total bulk size at runtime hitting Elasticsearch.
-
es.batch.size.entries
(default 1000) - Size (in entries) for batch writes using Elasticsearch bulk API (0 disables it). Companion to es.batch.size.bytes; once either one matches, the batch update is executed. Similar to the size, this setting is per task instance; it gets multiplied at runtime by the total number of Hadoop tasks running.
-
es.batch.write.refresh
(default true) - Whether to invoke an index refresh or not after a bulk update has been completed. Note this is called only after the entire write (meaning multiple bulk updates) has been executed.
-
es.batch.write.retry.count
(default 3) - Number of retries for a given batch in case Elasticsearch is overloaded and data is rejected. Note that only the rejected data is retried. If there is still data rejected after the retries have been performed, the Hadoop job is cancelled (and fails). A negative value indicates infinite retries; be careful in setting this value as it can have unwanted side effects.
-
es.batch.write.retry.wait
(default 10s) - Time to wait between batch write retries.
-
es.ser.reader.value.class
(default depends on the library used) - Name of the ValueReader implementation for converting JSON to objects. This is set by the framework depending on the library (Map/Reduce, Cascading, Hive, Pig, etc…) used.
-
es.ser.writer.value.class
(default depends on the library used) - Name of the ValueWriter implementation for converting objects to JSON. This is set by the framework depending on the library (Map/Reduce, Cascading, Hive, Pig, etc…) used.
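Because es.batch.size.bytes and es.batch.size.entries apply per task instance, the load reaching Elasticsearch scales with the number of tasks. The arithmetic, together with the "whichever limit is hit first" flush rule, can be sketched as below. The function names are hypothetical, and 1mb is assumed to mean 1024 * 1024 bytes.

```python
MB = 1024 * 1024  # assumption: "1mb" means 1024 * 1024 bytes


def job_wide_bulk_load(task_count, batch_bytes=1 * MB, batch_entries=1000):
    """Upper bound on bytes/entries in flight when every task flushes at once."""
    return task_count * batch_bytes, task_count * batch_entries


def should_flush(buffered_bytes, buffered_entries,
                 batch_bytes=1 * MB, batch_entries=1000):
    """Flush once either limit is reached; 0 entries disables that limit."""
    if buffered_bytes >= batch_bytes:
        return True
    return batch_entries > 0 and buffered_entries >= batch_entries


total_bytes, total_entries = job_wide_bulk_load(task_count=20)
print(total_bytes, total_entries)
```

With the defaults and 20 concurrent tasks, up to 20 MB or 20000 entries may be sent to the cluster at the same moment, which is worth keeping in mind when sizing batches.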