Kerberos
Added in 6.7.
Kerberos support for Elasticsearch for Apache Hadoop requires Elasticsearch 6.7 or greater.
Securing Hadoop means using Kerberos. Elasticsearch supports Kerberos as an authentication method. While the use of Kerberos is not required for securing Elasticsearch, it is a convenient option for those who already deploy Kerberos to secure their Hadoop clusters. This chapter aims to explain the steps needed to set up elasticsearch-hadoop to use Kerberos authentication for Elasticsearch.
Elasticsearch for Apache Hadoop communicates with Elasticsearch entirely over HTTP. In order to support Kerberos authentication over HTTP, elasticsearch-hadoop uses the Simple and Protected GSSAPI Negotiation Mechanism (SPNEGO) to negotiate which underlying authentication method to use (in this case, Kerberos) and to transmit the agreed upon credentials to the server. This authentication mechanism is performed using the HTTP Negotiate authentication standard, where a request is sent to the server and a response is received back with a payload that further advances the negotiation. Once the negotiation between the client and server is complete, the request is accepted and a successful response is returned.
Elasticsearch for Apache Hadoop makes use of Hadoop’s user management processes: the Kerberos credentials of the current Hadoop user are used when authenticating to Elasticsearch. This means that Kerberos authentication must be enabled in Hadoop in order for elasticsearch-hadoop to obtain a user’s Kerberos credentials. When using an integration that does not depend on Hadoop’s runtime (e.g. Storm), additional steps may be required to ensure that the running process has Kerberos credentials available for authentication. It is recommended that you consult the documentation of each framework that you are using on how to configure security.
Setting up your environment
This documentation assumes that you have already provisioned a Hadoop cluster with Kerberos authentication enabled (required). The general process of deploying Kerberos and securing Hadoop is beyond the scope of this documentation.
Before starting, you will need to ensure that principals for your users are provisioned in your Kerberos deployment, as well as service principals for each Elasticsearch node. To enable Kerberos authentication on Elasticsearch, it must be configured with a Kerberos realm. It is recommended that you familiarize yourself with how to configure Elasticsearch Kerberos realms so that you can make appropriate adjustments to fit your deployment. You can find more information on how they work in the Elastic Stack documentation.
Additionally, you will need to configure the API Key Realm in Elasticsearch. Hadoop and other distributed data processing frameworks only authenticate with Kerberos in the process that launches a job. Once a job has been launched, the worker processes are often cut off from the original Kerberos credentials and need some other form of authentication. Hadoop services often provide mechanisms for obtaining Delegation Tokens during job submission. These tokens are then distributed to worker processes which use the tokens to authenticate on behalf of the user running the job. Elasticsearch for Apache Hadoop obtains API Keys in order to provide tokens for worker processes to authenticate with.
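For illustration, here is a sketch of what the relevant settings might look like in elasticsearch.yml on an Elasticsearch 6.7 node. The realm name (kerb1), keytab path, and realm order are placeholders, and the realm setting layout differs between Elasticsearch versions, so verify the exact names against the Elastic Stack documentation for your release:

xpack.security.authc.api_key.enabled: true

xpack.security.authc.realms.kerb1:
  type: kerberos
  order: 2
  keytab.path: es.keytab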
Connector Settings
The following settings are used to configure elasticsearch-hadoop to use Kerberos authentication:
es.security.authentication (default simple, or basic if es.net.http.auth.user is set)
Required. Similar to most Hadoop integrations, this property signals which method to use in order to authenticate with Elasticsearch. By default, the value is simple, unless es.net.http.auth.user is set, in which case it will default to basic. The available options for this setting are simple for no authentication, basic for basic HTTP authentication, pki if relying on certificates, and kerberos if Kerberos authentication over SPNEGO should be used.

es.net.spnego.auth.elasticsearch.principal (default none)
Required if es.security.authentication is set to kerberos. Details the name of the service principal that the Elasticsearch server is running as. This will usually be of the form HTTP/node.address@REALM. Since Elasticsearch is distributed and should be using a service principal per node, you can use the _HOST pattern (like so: HTTP/_HOST@REALM) to have elasticsearch-hadoop substitute the address of the node it is communicating with at runtime. Note that elasticsearch-hadoop will attempt to reverse resolve node IP addresses to hostnames in order to perform this substitution.

es.net.spnego.auth.mutual (default false)
Optional. The SPNEGO mechanism assumes that authentication may take multiple back-and-forth request-response cycles before a request is fully accepted by the server. When a request is finally accepted by the server, the response contains a payload that can be verified to ensure that the server is the principal it claims to be. Setting this to true instructs elasticsearch-hadoop to perform this mutual authentication, and to fail the response if it detects invalid credentials from the server.
Kerberos on Hadoop
Requirements
Before using Kerberos authentication to Elasticsearch, Kerberos authentication must be enabled in Hadoop.
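For reference, enabling Kerberos in Hadoop is typically done with the standard Hadoop security properties in core-site.xml, shown below as a minimal sketch; consult your Hadoop distribution’s security guide for the full procedure (service principals, keytabs, and so on):

<configuration>
  ...
  <property>
    <name>hadoop.security.authentication</name>
    <value>kerberos</value>
  </property>
  <property>
    <name>hadoop.security.authorization</name>
    <value>true</value>
  </property>
  ...
</configuration>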
Configure elasticsearch-hadoop
Elasticsearch for Apache Hadoop only needs a few settings to configure Kerberos authentication. It is best to
set these properties in your core-site.xml
configuration so that they can be obtained across your entire Hadoop
deployment, just like you would for turning on security options for services in Hadoop.
<configuration>
  ...
  <property>
    <name>es.security.authentication</name>
    <value>kerberos</value>
  </property>
  <property>
    <name>es.net.spnego.auth.elasticsearch.principal</name>
    <value>HTTP/_HOST@REALM.NAME.HERE</value>
  </property>
  ...
</configuration>
Kerberos on YARN
When applications launch on a YARN cluster, they send along all of their application credentials to the Resource Manager process for them to be distributed to the containers. The Resource Manager has the ability to renew any tokens in those credentials that are about to expire and to cancel tokens once a job has completed. The tokens from Elasticsearch have a default lifespan of 7 days and they are not renewable. It is a best practice to configure YARN so that it is able to cancel those tokens at the end of a run in order to lower the risk of unauthorized use, and to lower the amount of bookkeeping Elasticsearch must perform to maintain them.
In order to configure YARN to allow it to cancel Elasticsearch tokens at the end of a run, you must add the elasticsearch-hadoop jar to the
Resource Manager’s classpath. You can do that by placing the jar on the Resource Manager’s local filesystem, and setting
the path to the jar in the YARN_USER_CLASSPATH
environment variable. Once the jar is added, the Resource Manager will
need to be restarted.
export YARN_USER_CLASSPATH=/path/to/elasticsearch-hadoop.jar
Additionally, the connection information for elasticsearch-hadoop should be present in the Hadoop configuration,
preferably the core-site.xml
. This is because when Resource Manager cancels a token, it does not take the job
configuration into account. Without the connection settings in the Hadoop configuration, the Resource Manager will not
be able to communicate to Elasticsearch in order to cancel the token.
Here are a few common security properties that you will need in order for the Resource Manager to contact Elasticsearch to cancel tokens:
<configuration>
  ...
  <property>
    <name>es.nodes</name>
    <value>es-master-1,es-master-2,es-master-3</value>
  </property>
  <property>
    <name>es.security.authentication</name>
    <value>kerberos</value>
  </property>
  <property>
    <name>es.net.spnego.auth.elasticsearch.principal</name>
    <value>HTTP/_HOST@REALM</value>
  </property>
  <property>
    <name>es.net.ssl</name>
    <value>true</value>
  </property>
  <property>
    <name>es.net.ssl.keystore.location</name>
    <value>file:///path/to/ssl/keystore</value>
  </property>
  <property>
    <name>es.net.ssl.truststore.location</name>
    <value>file:///path/to/ssl/truststore</value>
  </property>
  <property>
    <name>es.keystore.location</name>
    <value>file:///path/to/es/secure/store</value>
  </property>
  ...
</configuration>
es.nodes: the addresses of some Elasticsearch nodes. These can be any nodes (or all of them) as long as they all belong to the same cluster.
es.security.authentication: authentication must be configured as kerberos.
es.net.spnego.auth.elasticsearch.principal: the name of the Elasticsearch service principal is not required for token cancellation, but keeping the property in the shared configuration does no harm.
es.net.ssl: SSL should be enabled if you are using a secured Elasticsearch deployment.
es.net.ssl.keystore.location: location on the local filesystem to reach the SSL keystore.
es.net.ssl.truststore.location: location on the local filesystem to reach the SSL truststore.
es.keystore.location: location on the local filesystem to reach the elasticsearch-hadoop secure store for secure settings.
Kerberos with Map/Reduce
Before launching your Map/Reduce job, you must add a delegation token for Elasticsearch to the job’s credential set. The
EsMapReduceUtil
utility class can be used to do this for you. Simply pass your job to it before submitting it to the
cluster. Using the local Kerberos credentials, the utility will establish a connection to Elasticsearch, request an API Key, and
stow the key in the job’s credential set for the worker processes to use.
Job job = Job.getInstance(getConf(), "My-Job-Name");
// Configure Job Here...

EsMapReduceUtil.initCredentials(job);

if (!job.waitForCompletion(true)) {
    return 1;
}
Creating a new job instance.
EsMapReduceUtil obtains job delegation tokens for Elasticsearch.
Submit the job to the cluster.
You can obtain the job delegation tokens at any time during the configuration of the Job object, as long as your elasticsearch-hadoop specific configurations are set. It’s usually sufficient to do it right before submitting the job. You should only do this once per job since each call will wastefully obtain another API Key.
The utility is also compatible with the mapred API classes.
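A minimal sketch of the same credential step with the mapred API might look like the following; it assumes EsMapReduceUtil exposes an equivalent initCredentials overload that accepts a JobConf, and the job class and name are placeholders:

JobConf jobConf = new JobConf(getConf(), MyMapredJob.class);
jobConf.setJobName("My-Job-Name");
// Configure JobConf here...

// Obtain an Elasticsearch API key and stow it in the job's credentials
EsMapReduceUtil.initCredentials(jobConf);

if (!JobClient.runJob(jobConf).isSuccessful()) {
    return 1;
}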
Kerberos with Hive
Requirements
Using Kerberos authentication to Elasticsearch in Hive is only supported through HiveServer2.
Before using Kerberos authentication to Elasticsearch in Hive, Kerberos authentication must be enabled for Hadoop. Make sure you have done all the required steps for configuring your Hadoop cluster as well as the steps for configuring your YARN services before using Kerberos authentication for Elasticsearch.
Finally, ensure that Hive Security is enabled.
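As a point of reference, a Kerberos-secured HiveServer2 is usually configured with properties along these lines in hive-site.xml; the principal and keytab values below are placeholders, and your distribution may manage these settings for you:

<property>
  <name>hive.server2.authentication</name>
  <value>KERBEROS</value>
</property>
<property>
  <name>hive.server2.authentication.kerberos.principal</name>
  <value>hive/_HOST@REALM</value>
</property>
<property>
  <name>hive.server2.authentication.kerberos.keytab</name>
  <value>/path/to/hive.keytab</value>
</property>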
Since Hive relies on user impersonation in Elasticsearch, it is advised that you familiarize yourself with Elasticsearch authentication and authorization.
Configure user impersonation settings for Hive
Hive’s security model follows a proxy-based approach. When a client submits a query to a secured Hive server, Hive authenticates the client using Kerberos. Once Hive is sure of the client’s identity, it wraps its own identity with a proxy user. The proxy user contains the client’s simple user name, but contains no credentials. Instead, it is expected that all interactions are executed as the Hive principal impersonating the client user. This is why, when configuring Hive security, one must specify in the Hadoop configuration which users Hive is allowed to impersonate:
<property>
  <name>hadoop.proxyuser.hive.hosts</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.hive.groups</name>
  <value>*</value>
</property>
Elasticsearch supports user impersonation, but only users from certain realm implementations can be impersonated. Most deployments of Kerberos include other identity management components like LDAP or Active Directory. In those cases, you can configure those realms in Elasticsearch to allow for user impersonation.
If you are only using Kerberos, or you are using a solution for which Elasticsearch does not support user impersonation, you must
mirror your Kerberos principals to either a
native realm or a
file realm in Elasticsearch. When mirroring a
Kerberos principal to one of these realms, set the new user’s username to just the main part of the principal name, without
any realm or host information. For instance, client@REALM
would just be client
and someservice/domain.name@REALM
would just be someservice
.
You can follow this step-by-step process for mirroring users:
Create End User Roles
Create a role for your end users that will be querying Hive. In this example, we will make a simple role for accessing
indices that match hive-index-*
. All our Hive users will end up using this role to read, write, and update indices
in Elasticsearch.
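A sketch of such a role definition is shown below, using the hive_user_role name referenced by the later examples. The exact index privileges you grant should be tailored to your own requirements:

PUT /_xpack/security/role/hive_user_role
{
  "indices" : [
    {
      "names" : [ "hive-index-*" ],
      "privileges" : [ "read", "write", "manage" ]
    }
  ]
}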
Create role mapping for Kerberos user principal
Now that the user role is created, we must map the Kerberos user principals to the role. Elasticsearch does not know the complete list of principals that are managed by Kerberos. As such, each principal that wishes to connect to Elasticsearch must be mapped to a list of roles that they will be granted after authentication.
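Below is a sketch of such a role mapping, following the same role mapping API format used later for the Hive service principal. The mapping name and the principal shown are placeholders for your own users:

POST /_xpack/security/role_mapping/kerberos_user_mapping
{
  "roles": [ "hive_user_role" ],
  "enabled": true,
  "rules": {
    "field" : {
      "username" : "hive.user.1@REALM"
    }
  }
}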
Mirror the user to the native realm
You may not have to perform this step if you are deploying LDAP or Active Directory along with Kerberos. Elasticsearch will perform user impersonation by looking up the user names in those realms as long as the simple names (e.g. hive.user.1) on the Kerberos principals match the user names in LDAP or Active Directory exactly.
Mirroring the user to the native realm will allow Elasticsearch to accept authentication requests from the original principal as well as accept requests from Hive which is impersonating the user. You can create a user in the native realm like so:
PUT /_xpack/security/user/hive.user.1
{
  "enabled" : true,
  "password" : "swordfish",
  "roles" : [ "hive_user_role" ],
  "metadata" : {
    "principal" : "hive.user.1@REALM"
  }
}
The user name is the simple name of the Kerberos principal, in this case hive.user.1.
Provide a password here for the user. This should ideally be a securely generated random password, since this mirrored user is just for impersonation purposes.
Setting the user’s roles to be the example role.
This is not required, but setting the original principal on the user as metadata may be helpful for your own bookkeeping.
Create a role to impersonate Hive users
Once you have configured Elasticsearch with a role mapping for your Kerberos principals and native users for impersonation, you must create a role that Hive will use to impersonate those users.
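Below is a sketch of such a role, using the hive_proxier name referenced in the role mapping that follows. The run_as entries list the users that the Hive principal is allowed to impersonate:

PUT /_xpack/security/role/hive_proxier
{
  "run_as": [ "hive.user.1" ]
}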
Create role mapping for Hive’s service principal
Now that there are users to impersonate, and a role that can impersonate them, make sure to map the Hive principal to the proxier role, as well as any of the roles that the users it is impersonating would have. This allows the Hive principal to create and read indices, documents, or do anything else its impersonated users might be able to do. While Hive is impersonating the user, it must have these roles or else it will not be able to fully impersonate that user.
POST /_xpack/security/role_mapping/hive_hiveserver2_mapping
{
  "roles": [
    "hive_user_role",
    "hive_proxier"
  ],
  "enabled": true,
  "rules": {
    "field" : {
      "username" : "hive/hiveserver2.address@REALM"
    }
  }
}
Here we set the roles to be the superset of the roles from the users we want to impersonate. In our example, this is the hive_user_role created earlier.
The role that allows Hive to impersonate Hive end users.
The name of the Hive server principal to match against.
If managing Kerberos role mappings via the APIs is not desired, they can instead be managed in a role mapping file.
Running your Hive queries
Once all user accounts are configured and all previous steps for enabling Kerberos authentication in Hadoop and Hive are complete, Hive queries can be written and run exactly as before.
Kerberos with Pig
Requirements
Before using Kerberos authentication to Elasticsearch in Pig, Kerberos authentication must be enabled for Hadoop. Make sure you have done all the required steps for configuring your Hadoop cluster as well as the steps for configuring your YARN services before using Kerberos authentication for Elasticsearch.
If elasticsearch-hadoop is configured for Kerberos authentication and Hadoop security is enabled, elasticsearch-hadoop’s storage functions in Pig will automatically obtain delegation tokens for jobs when submitting them to the cluster.
Kerberos with Spark
Requirements
Using Kerberos authentication in elasticsearch-hadoop for Spark has the following requirements:
- Your Spark jobs must be deployed on YARN. Using Kerberos authentication in elasticsearch-hadoop does not support any other Spark cluster deployments (Mesos, Standalone).
- Your version of Spark must be 2.1.0 or above. This is the version in which Spark added the ability to plug in third-party credential providers to obtain delegation tokens.
Before using Kerberos authentication to Elasticsearch in Spark, Kerberos authentication must be enabled for Hadoop. Make sure you have done all the required steps for configuring your Hadoop cluster as well as the steps for configuring your YARN services before using Kerberos authentication for Elasticsearch.
EsServiceCredentialProvider
Before Spark submits an application to a YARN cluster,
it loads a number of
credential provider implementations that are used to determine if any additional credentials must be obtained before
the application is started. These implementations are loaded using Java’s ServiceLoader
architecture. Thus, any jar
that is on the classpath when the Spark application is submitted can offer implementations to be loaded and used.
EsServiceCredentialProvider
is one such implementation that is loaded whenever elasticsearch-hadoop is on the job’s classpath.
Once loaded, EsServiceCredentialProvider determines if Kerberos authentication is enabled for elasticsearch-hadoop. If it is, the credential provider will automatically obtain delegation tokens from Elasticsearch and add them to the credentials on the YARN application submission context. Additionally, if the job is a long-lived process like a Spark Streaming job, the credential provider is used to update or obtain new delegation tokens when the current tokens approach their expiration time.
The time that Spark’s credential providers are loaded and called depends on the cluster deploy mode when submitting your
Spark app. When running in client
deploy mode, Spark runs the user’s driver code in the local JVM, and launches the
YARN application to oversee the processing as needed. The providers are loaded and run whenever the YARN application
first comes online. When running in cluster
deploy mode, Spark launches the YARN application immediately, and the
user’s driver code is run from the resulting Application Master in YARN. The providers are loaded and run immediately,
before any user code is executed.
Configuring the credential provider
All implementations of the Spark credential providers use settings from only a few places:
- The entries from the local Hadoop configuration files
- The entries of the local Spark configuration file
- The entries that are specified from the command line when the job is initially launched
Settings that are configured from the user code are not used because the provider must run once for all jobs that are submitted for a particular Spark application. User code is not guaranteed to be run before the provider is loaded. To make things more complicated, a credential provider is only given the local Hadoop configuration to determine if it should load delegation tokens.
These limitations mean that the settings to configure elasticsearch-hadoop for Kerberos authentication need to be in specific places:
First, es.security.authentication
MUST be set in the local Hadoop configuration files as kerberos. If it is not set
in the Hadoop configurations, then the credential provider will assume that simple authentication is to be used, and
will not obtain delegation tokens.
Secondly, all general connection settings for elasticsearch-hadoop (like es.nodes
, es.ssl.enabled
, etc…) must be specified either
in the local Hadoop configuration files, in the local Spark configuration file, or from the command
line. If these settings are not available here, then the credential provider will not be able to contact Elasticsearch in order
to obtain the delegation tokens that it requires.
$> bin/spark-submit \
    --class org.myproject.MyClass \
    --master yarn \
    --deploy-mode cluster \
    --jars path/to/elasticsearch-hadoop.jar \
    --conf 'spark.es.nodes=es-node-1,es-node-2,es-node-3' \
    --conf 'spark.es.ssl.enabled=true' \
    --conf 'spark.es.net.spnego.auth.elasticsearch.principal=HTTP/_HOST@REALM' \
    path/to/jar.jar
An example of some connection settings specified at submission time.
Be sure to include the Elasticsearch service principal.
Specifying this many configurations on the spark-submit command line makes it easy to miss important settings. It is therefore advised to set them in the cluster-wide Hadoop config instead.
Renewing credentials for streaming jobs
In the event that you are running a streaming job, it is best to use the cluster
deploy mode to allow YARN to
manage running the driver code for the streaming application.
Since streaming jobs are expected to run continuously without stopping, you should configure Spark so that the credential provider can obtain new tokens before the original tokens expire.
Configuring Spark to obtain new tokens is different from configuring YARN to renew and cancel tokens. YARN can only renew existing tokens up to their maximum lifetime. Tokens from Elasticsearch are not renewable. Instead, they have a simple lifetime of 7 days. After those 7 days elapse, the tokens are expired. In order for an ongoing streaming job to continue running without interruption, completely new tokens must be obtained and sent to worker tasks. Spark has facilities for automatically obtaining and distributing completely new tokens once the original token lifetime has ended.
When submitting a Spark application on YARN, users can provide a principal and keytab file to the spark-submit
command. Spark will log in with these credentials instead of depending on the local Kerberos TGT Cache for the current
user. In the event that any delegation tokens are close to expiring, the loaded credential providers are given the
chance to obtain new tokens using the given principal and keytab before the current tokens fully expire. Any new tokens
are automatically distributed by Spark to the containers on the YARN cluster.
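As a sketch, supplying the principal and keytab at submission time looks like the following; the principal name, keytab path, and job class are placeholders for your own streaming application:

$> bin/spark-submit \
    --class org.myproject.MyStreamingClass \
    --master yarn \
    --deploy-mode cluster \
    --jars path/to/elasticsearch-hadoop.jar \
    --principal streaming.user@REALM \
    --keytab /path/to/streaming.user.keytab \
    path/to/jar.jar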
Disabling the credential provider
When elasticsearch-hadoop is on the classpath, EsServiceCredentialProvider is ALWAYS loaded by Spark. If Kerberos authentication is enabled for elasticsearch-hadoop in the local Hadoop configuration, then the provider will attempt to load delegation tokens for Elasticsearch regardless of whether they are needed for that particular job.
It is advised that you do not add elasticsearch-hadoop libraries to jobs that are not configured to connect to or interact with Elasticsearch. This is the easiest way to avoid the confusion of unrelated jobs failing to launch because they cannot connect to Elasticsearch.
If you find yourself in a place where you cannot easily remove elasticsearch-hadoop from the classpath of jobs that do not need to interact with Elasticsearch, then you can explicitly disable the credential provider by setting a property at launch time. The property to set is dependent on your version of Spark:
- For Spark 2.3.0 and up: set the spark.security.credentials.elasticsearch.enabled property to false.
- For Spark 2.1.0-2.3.0: set the spark.yarn.security.credentials.elasticsearch.enabled property to false. This property is still accepted in Spark 2.3.0+, but is marked as deprecated.
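For example, on Spark 2.3.0 or later the provider could be disabled at launch time like so (a sketch; substitute the spark.yarn.* property name on older versions, and your own class and jar paths):

$> bin/spark-submit \
    --class org.myproject.MyClass \
    --master yarn \
    --conf 'spark.security.credentials.elasticsearch.enabled=false' \
    path/to/jar.jar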
Kerberos with Storm
Requirements
Your Storm deployment should be secured, but configuring it for security is not strictly required.
Storm is not always deployed alongside a Hadoop distribution. Thus, configuring Kerberos authentication for Hadoop is not required for using Kerberos authentication to Elasticsearch on Storm.
Using Storm’s AutoCredential plugins
Storm provides a
myriad of plugin interfaces that can be loaded and used to collect, update, and renew credentials over the lifetime of
a running topology. elasticsearch-hadoop provides the AutoElasticsearch
class which Storm can use to automatically obtain and renew
Elasticsearch delegation tokens for a topology.
AutoElasticsearch implements Storm’s INimbusCredentialPlugin, IAutoCredentials, and ICredentialsRenewer interfaces. The first of these is used to obtain delegation tokens on Nimbus before submitting a topology, the second is used for updating the credentials on the worker nodes, and the third is used for obtaining new delegation tokens when the current tokens are close to expiring.
Configuring AutoElasticsearch
In order for the AutoElasticsearch
plugin to obtain credentials, Kerberos authentication must be enabled for elasticsearch-hadoop in its
settings. You must specify the es.security.authentication
setting in either the storm.yaml file or on the topology
configuration.
The AutoElasticsearch
plugin provides two settings for denoting the principal and keytab to be used when executing:
es.storm.autocredentials.user.principal (default none)
Required. The principal that the plugin should use for obtaining credentials for this topology. Can be set in the storm.yaml configuration or in the topology configuration.

es.storm.autocredentials.user.keytab (default none)
Required. The path to the keytab on Nimbus that will be used for logging in as the given principal. This can be set in the storm.yaml configuration or in the topology configuration. The file must exist on Nimbus.
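For example, the Kerberos settings could be placed in storm.yaml along these lines; the principal and keytab path shown are placeholders for your environment:

es.security.authentication: "kerberos"
es.storm.autocredentials.user.principal: "storm.client@REALM"
es.storm.autocredentials.user.keytab: "/path/to/storm.client.keytab"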
Configuring Nimbus
Nimbus must be configured to use AutoElasticsearch
as a credential plugin from the storm.yaml
configuration file.
It is safe to specify AutoElasticsearch
in these settings even if your topology does not interact with Elasticsearch. The
plugin will perform no operations unless AutoElasticsearch
is explicitly enabled on the topology.
nimbus.autocredential.plugins.classes: ["org.elasticsearch.storm.security.AutoElasticsearch"]
nimbus.credential.renewers.classes: ["org.elasticsearch.storm.security.AutoElasticsearch"]
nimbus.credential.renewers.freq.secs: 30
The list of auto credential plugins to be run on Nimbus when submitting a topology.
The list of all the credential renewers available for Nimbus to run.
The frequency (in seconds) at which the credential renewers on Nimbus should be executed to check and update credentials.
In order for the plugin to be loaded, elasticsearch-hadoop must be present on the Nimbus classpath. You can add it to the classpath by using an environment variable on Nimbus.
export STORM_EXT_CLASSPATH=/path/to/elasticsearch-hadoop.jar
Configuring topologies
Once Nimbus is configured, you must add AutoElasticsearch
to your topology configuration in order for delegation
tokens to be obtained and updated. If you do not specify it in the topology configuration, then Storm will not attempt
to obtain Elasticsearch delegation tokens when the topology is submitted.
Config conf = new Config();

List<String> plugins = new ArrayList<String>();
plugins.add(AutoElasticsearch.class.getName());
conf.put(Config.TOPOLOGY_AUTO_CREDENTIALS, plugins);
...
conf.put(ConfigurationOptions.ES_SECURITY_AUTHENTICATION, "kerberos");
conf.put(ConfigurationOptions.ES_NET_SPNEGO_AUTH_ELASTICSEARCH_PRINCIPAL, "HTTP/elasticsearch.node.address@REALM");
...
Configure the topology with AutoElasticsearch as an auto credential plugin.
If you have not enabled Kerberos authentication for elasticsearch-hadoop in the storm.yaml configuration file, you will need to set the properties here.