Webhdfs output plugin

  • Plugin version: v3.1.0
  • Released on: 2023-10-03
  • Changelog

For other versions, see the Versioned plugin docs.

Getting Help

For questions about the plugin, open a topic in the Discuss forums. For bugs or feature requests, open an issue on GitHub. For the list of Elastic supported plugins, please consult the Elastic Support Matrix.

Description

This plugin sends Logstash events into files in HDFS via the webhdfs REST API.

Dependencies

This plugin has no dependency on jars from Hadoop, which reduces configuration and compatibility problems. It uses the webhdfs gem from Kazuki Ohta and TAGOMORI Satoshi (see https://github.com/kzk/webhdfs). The zlib and snappy gems are optional dependencies, needed only if you use the compression functionality.

Operational Notes

If you get an error like:

Max write retries reached. Exception: initialize: name or service not known {:level=>:error}

make sure that the hostname of your namenode is resolvable on the host running Logstash. When creating or appending to a file, webhdfs sometimes sends a 307 TEMPORARY_REDIRECT with the HOSTNAME of the machine it is running on.
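
One way to ensure resolution is an entry in /etc/hosts on the Logstash host. The hostname and address below are placeholders for your own namenode:

# /etc/hosts on the host running Logstash (hypothetical values)
10.0.0.5    namenode.example.com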

Usage

This is an example of a Logstash config:

input {
  ...
}
filter {
  ...
}
output {
  webhdfs {
    host => "127.0.0.1"                 # (required)
    port => 50070                       # (optional, default: 50070)
    path => "/user/logstash/dt=%{+YYYY-MM-dd}/logstash-%{+HH}.log"  # (required)
    user => "hue"                       # (required)
  }
}

Webhdfs Output Configuration Options

This plugin supports the following configuration options plus the Common Options described later.

Also see Common Options for a list of options supported by all output plugins.

compression

  • Value can be any of: none, snappy, gzip
  • Default value is "none"

Compress the output. One of [none, snappy, gzip].
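
For example, to gzip-compress files before they are written to HDFS (a minimal sketch; host, path, and user are placeholder values):

output {
  webhdfs {
    host => "namenode.example.com"
    path => "/user/logstash/dt=%{+YYYY-MM-dd}/logstash-%{+HH}.log"
    user => "logstash"
    compression => "gzip"
  }
}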

flush_size

  • Value type is number
  • Default value is 500

Send data to webhdfs once the number of buffered events exceeds this value, even if idle_flush_time has not yet elapsed (see the combined example under idle_flush_time).

host

  • This is a required setting.
  • Value type is string
  • There is no default value for this setting.

The server name for webhdfs/httpfs connections.

idle_flush_time

  • Value type is number
  • Default value is 1

Send data to webhdfs at intervals of this many seconds.
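
flush_size and idle_flush_time work together: a flush happens as soon as either threshold is reached. A sketch that flushes every 1000 events or every 5 seconds, whichever comes first (host, path, and user are placeholders):

output {
  webhdfs {
    host => "namenode.example.com"
    path => "/user/logstash/dt=%{+YYYY-MM-dd}/logstash-%{+HH}.log"
    user => "logstash"
    flush_size => 1000      # flush once 1000 events are buffered ...
    idle_flush_time => 5    # ... or after 5 seconds, whichever comes first
  }
}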

kerberos_keytab

  • Value type is string
  • There is no default value for this setting.

Set the kerberos keytab file. Note that the gssapi library must be available to use this.
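
A minimal sketch of a kerberized setup, assuming use_kerberos_auth (described below) is also enabled; the keytab path and other values are placeholders:

output {
  webhdfs {
    host => "namenode.example.com"
    path => "/user/logstash/dt=%{+YYYY-MM-dd}/logstash-%{+HH}.log"
    user => "logstash"
    use_kerberos_auth => true
    kerberos_keytab => "/etc/security/keytabs/logstash.keytab"  # hypothetical path
  }
}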

open_timeout

  • Value type is number
  • Default value is 30

WebHdfs open timeout, default 30s.

path

  • This is a required setting.
  • Value type is string
  • There is no default value for this setting.

The path to the file to write to. Event fields can be used here, as well as date fields in the Joda-Time format, e.g.: /user/logstash/dt=%{+YYYY-MM-dd}/%{@source_host}-%{+HH}.log

port

  • Value type is number
  • Default value is 50070

The server port for webhdfs/httpfs connections.

read_timeout

  • Value type is number
  • Default value is 30

The WebHdfs read timeout, default 30s.

retry_interval

  • Value type is number
  • Default value is 0.5

How long to wait between retries (see the combined example under retry_times).

retry_known_errors

  • Value type is boolean
  • Default value is true

Retry some known webhdfs errors. These may be caused by race conditions when appending to the same file, etc.

retry_times

  • Value type is number
  • Default value is 5

How many times to retry. If retry_times is exceeded, an error is logged and the event is discarded.
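
The three retry settings work together. A sketch that tolerates flaky namenode responses by retrying up to 10 times with 2 seconds between attempts (host, path, and user are placeholders):

output {
  webhdfs {
    host => "namenode.example.com"
    path => "/user/logstash/dt=%{+YYYY-MM-dd}/logstash-%{+HH}.log"
    user => "logstash"
    retry_known_errors => true
    retry_interval => 2     # wait 2 seconds between attempts
    retry_times => 10       # give up (and drop the event) after 10 attempts
  }
}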

single_file_per_thread

  • Value type is boolean
  • Default value is false

Avoid appending to the same file from multiple threads. This solves some problems with multiple Logstash output threads and locked file leases in webhdfs. If this option is set to true, %{[@metadata][thread_id]} needs to be used in the path config setting, as in the example below.
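
For example (host and user are placeholders; note the %{[@metadata][thread_id]} reference in the path):

output {
  webhdfs {
    host => "namenode.example.com"
    # one file per output thread, so threads never contend for a file lease
    path => "/user/logstash/dt=%{+YYYY-MM-dd}/logstash-%{[@metadata][thread_id]}.log"
    user => "logstash"
    single_file_per_thread => true
  }
}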

snappy_bufsize

  • Value type is number
  • Default value is 32768

Set the snappy chunk size. Only necessary for the stream format. Defaults to 32k. Max is 65536 (see http://code.google.com/p/snappy/source/browse/trunk/framing_format.txt).

snappy_format

  • Value can be any of: stream, file
  • Default value is "stream"

Set the snappy format. One of "stream", "file". Set to stream to be Hive compatible.
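
A sketch of snappy compression with Hive-compatible framing; this requires the optional snappy gem mentioned under Dependencies (host, path, and user are placeholders):

output {
  webhdfs {
    host => "namenode.example.com"
    path => "/user/logstash/dt=%{+YYYY-MM-dd}/logstash-%{+HH}.log"
    user => "logstash"
    compression => "snappy"
    snappy_format => "stream"   # "stream" is the Hive-compatible framing
  }
}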

ssl_cert

  • Value type is string
  • There is no default value for this setting.

Set the SSL certificate file.

ssl_key

  • Value type is string
  • There is no default value for this setting.

Set the SSL key file.

standby_host

  • Value type is string
  • Default value is false

Standby namenode hostname for HA HDFS (see the example under standby_port).

standby_port

  • Value type is number
  • Default value is 50070

Standby namenode port for HA HDFS.
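
A sketch of an HA setup with a standby namenode (all hostnames are placeholders):

output {
  webhdfs {
    host => "namenode1.example.com"           # active namenode
    standby_host => "namenode2.example.com"   # failover target
    standby_port => 50070
    path => "/user/logstash/dt=%{+YYYY-MM-dd}/logstash-%{+HH}.log"
    user => "logstash"
  }
}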

use_httpfs

  • Value type is boolean
  • Default value is false

Use httpfs mode if set to true, else webhdfs.

use_kerberos_auth

  • Value type is boolean
  • Default value is false

Enable kerberos authentication.

use_ssl_auth

  • Value type is boolean
  • Default value is false

Enable SSL authentication. Note that the openssl library must be available to use this.
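
A sketch of certificate-based authentication combining use_ssl_auth with the ssl_cert and ssl_key options described above (all file paths and hostnames are placeholders):

output {
  webhdfs {
    host => "namenode.example.com"
    path => "/user/logstash/dt=%{+YYYY-MM-dd}/logstash-%{+HH}.log"
    user => "logstash"
    use_ssl_auth => true
    ssl_cert => "/etc/logstash/certs/logstash.crt"  # hypothetical paths
    ssl_key  => "/etc/logstash/certs/logstash.key"
  }
}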

user

  • This is a required setting.
  • Value type is string
  • There is no default value for this setting.

The username for webhdfs.

Common Options

The following configuration options are supported by all output plugins:

Setting         Input type   Required
codec           codec        No
enable_metric   boolean      No
id              string       No

codec

  • Value type is codec
  • Default value is "line"

The codec used for output data. Output codecs are a convenient method for encoding your data before it leaves the output without needing a separate filter in your Logstash pipeline.
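
For example, to write one JSON document per line instead of plain lines, you could swap in the json_lines codec that ships with Logstash (a sketch; host, path, and user are placeholders):

output {
  webhdfs {
    host => "namenode.example.com"
    path => "/user/logstash/dt=%{+YYYY-MM-dd}/logstash-%{+HH}.log"
    user => "logstash"
    codec => "json_lines"   # newline-delimited JSON
  }
}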

enable_metric

  • Value type is boolean
  • Default value is true

Disable or enable metric logging for this specific plugin instance. By default we record all the metrics we can, but you can disable metrics collection for a specific plugin.

id

  • Value type is string
  • There is no default value for this setting.

Add a unique ID to the plugin configuration. If no ID is specified, Logstash will generate one. It is strongly recommended to set this ID in your configuration. This is particularly useful when you have two or more plugins of the same type, for example, if you have two webhdfs outputs. Adding a named ID in this case will help in monitoring Logstash when using the monitoring APIs.

output {
  webhdfs {
    id => "my_plugin_id"
  }
}

Variable substitution in the id field only supports environment variables and does not support the use of values from the secret store.