Google Cloud Storage Input Plugin
- Plugin version: v0.14.0
- Released on: 2023-05-02
- Changelog
For other versions, see the Versioned plugin docs.
Installation
For plugins not bundled by default, install the plugin by running bin/logstash-plugin install logstash-input-google_cloud_storage. See Working with plugins for more details.
Getting Help
For questions about the plugin, open a topic in the Discuss forums. For bugs or feature requests, open an issue in GitHub. For the list of Elastic supported plugins, please consult the Elastic Support Matrix.
Description
Extracts events from files in a Google Cloud Storage bucket.
Example use-cases:
- Read Stackdriver logs from a Cloud Storage bucket into Elastic.
- Read gzipped logs from cold-storage into Elastic.
- Restore data from an Elastic dump.
- Extract data from Cloud Storage, transform it with Logstash and load it into BigQuery.
Note: While this project is partially maintained by Google, this is not an official Google product.
Installation Note
Attempting to install this plugin may result in an error:
Bundler::VersionConflict: Bundler could not find compatible versions for gem "mimemagic": In Gemfile: logstash-input-google_cloud_storage (= 0.11.0) was resolved to 0.11.0, which depends on mimemagic (>= 0.3.7) Could not find gem 'mimemagic (>= 0.3.7)', which is required by gem 'logstash-input-google_cloud_storage (= 0.11.0)', in any of the sources or in gems cached in vendor/cache
If this error occurs, you can fix it by manually installing the "mimemagic" dependency directly into Logstash’s internal Ruby Gems cache, located at vendor/bundle/jruby/<ruby_version>/gems/. You can do this with the bundled Ruby gem command in the bin/ folder of the Logstash installation.
To manually install the "mimemagic" gem into Logstash use:
bin/ruby -S gem install mimemagic -v '>= 0.3.7'
The mimemagic gem also requires the shared-mime-info package to be present. Install it using apt-get install shared-mime-info on Debian/Ubuntu or yum install shared-mime-info on Red Hat/RockyOS distributions.
Then install the plugin as usual with:
bin/logstash-plugin install logstash-input-google_cloud_storage
Metadata Attributes
The plugin exposes several metadata attributes about the object being read. You can access these later in the pipeline to augment the data or perform conditional logic.
Key | Type | Description |
---|---|---|
[@metadata][gcs][bucket] | string | The name of the bucket the file was read from. |
[@metadata][gcs][name] | string | The name of the object. |
[@metadata][gcs][metadata] | object | A map of metadata on the object. |
[@metadata][gcs][md5] | string | MD5 hash of the data. Encoded using base64. |
[@metadata][gcs][crc32c] | string | CRC32c checksum, as described in RFC 4960. Encoded using base64 in big-endian byte order. |
[@metadata][gcs][generation] | long | The content generation of the object. Used for object versioning. |
[@metadata][gcs][line] | long | The position of the event in the file. 1 indexed. |
[@metadata][gcs][line_id] | string | A deterministic, unique ID describing this line. This lets you do idempotent inserts into Elasticsearch. |
More information about object metadata can be found in the official documentation.
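As a minimal sketch of using these attributes later in the pipeline, the filter below copies the bucket and object name into regular event fields and tags events that came from gzipped objects. It assumes the bucket and object name are exposed as [@metadata][gcs][bucket] and [@metadata][gcs][name], following the same pattern as the [@metadata][gcs][line_id] key used in the Elasticsearch example on this page. The gcs_bucket and gcs_object field names and the gzipped tag are hypothetical.

```
filter {
  # Fields under @metadata are not emitted by outputs, so copy the ones
  # you want to keep into the event itself.
  mutate {
    add_field => {
      "gcs_bucket" => "%{[@metadata][gcs][bucket]}"   # assumed key name
      "gcs_object" => "%{[@metadata][gcs][name]}"     # assumed key name
    }
  }

  # Conditional logic on the object name: tag events read from .gz objects.
  if [@metadata][gcs][name] =~ /\.gz$/ {
    mutate { add_tag => ["gzipped"] }
  }
}
```

Keeping the values under @metadata until they are needed avoids adding storage-specific fields to every indexed document.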
Example Configurations
Basic
Basic configuration to read JSON logs, for example Stackdriver logs, every minute from my-logs-bucket.
input {
  google_cloud_storage {
    interval => 60
    bucket_id => "my-logs-bucket"
    json_key_file => "/home/user/key.json"
    file_matches => ".*json"
    codec => "json_lines"
  }
}
output {
  stdout { codec => rubydebug }
}
Idempotent Inserts into Elasticsearch
If your pipeline might insert the same file multiple times, you can use the line_id metadata key as a deterministic ID.
The ID has the format: gs://<bucket_id>/<object_id>:<line_num>@<generation>.
line_num represents the nth event deserialized from the file, starting at 1.
generation is a unique ID Cloud Storage generates for the object. When an object is overwritten, it gets a new generation.
input {
  google_cloud_storage {
    bucket_id => "batch-jobs-output"
  }
}
output {
  elasticsearch {
    document_id => "%{[@metadata][gcs][line_id]}"
  }
}
From Cloud Storage to BigQuery
Extract data from Cloud Storage, transform it with Logstash and load it into BigQuery.
input {
  google_cloud_storage {
    interval => 60
    bucket_id => "batch-jobs-output"
    file_matches => "purchases.*.csv"
    json_key_file => "/home/user/key.json"
    codec => "plain"
  }
}
filter {
  csv {
    columns => ["transaction", "sku", "price"]
    convert => {
      "transaction" => "integer"
      "price" => "float"
    }
  }
}
output {
  google_bigquery {
    project_id => "my-project"
    dataset => "logs"
    csv_schema => "transaction:INTEGER,sku:INTEGER,price:FLOAT"
    json_key_file => "/path/to/key.json"
    error_directory => "/tmp/bigquery-errors"
    ignore_unknown_values => true
  }
}
Additional Resources
Google Cloud Storage Input Configuration Options
This plugin supports the following configuration options plus the Common Options described later.
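Setting | Input type | Required |
---|---|---|
bucket_id | string | Yes |
json_key_file | path | No |
interval | number | No |
file_matches | string | No |
file_exclude | string | No |
metadata_key | string | No |
processed_db_path | path | No |
delete | boolean | No |
unpack_gzip | boolean | No |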
Also see Common Options for a list of options supported by all input plugins.
bucket_id
- Value type is string
- There is no default value for this setting.
The bucket containing your log files.
json_key_file
- Value type is path
- There is no default value for this setting.
The path to the key used to authenticate your user to the bucket.
This service user should have the storage.objects.update permission so that it can create metadata on the object, which prevents the object from being scanned multiple times.
If no key is provided, the plugin tries to use the default application credentials. If they don’t exist, it falls back to unauthenticated mode.
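The fallback behavior can be illustrated with a configuration that leaves out json_key_file altogether. This is only a sketch: the bucket name is hypothetical, and it assumes the host already has Google application default credentials available, for example through the GOOGLE_APPLICATION_CREDENTIALS environment variable or an attached service account.

```
input {
  google_cloud_storage {
    bucket_id => "my-logs-bucket"   # hypothetical bucket name
    # json_key_file is intentionally omitted, so the plugin tries the
    # default application credentials before falling back to
    # unauthenticated access.
  }
}
```

Unauthenticated access can generally only read publicly readable objects, so an explicit key or default credentials are the safer choice for private buckets.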
interval
- Value type is number
- Default is: 60
The number of seconds between looking for new files in your bucket.
file_matches
- Value type is string
- Default is: .*\.log(\.gz)?
A regex pattern to filter files. Only files with names matching this pattern are considered. The default pattern targets .log files and their gzipped .log.gz variants.
file_exclude
- Value type is string
- Default is: ^$
Any files matching this regex are excluded from processing. No files are excluded by default.
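The two filters can be combined, as in the sketch below, which considers plain and gzipped .log objects but skips any whose name contains "debug". The bucket name and the exclude pattern are hypothetical. This page does not state how the patterns are anchored against the object name, so it is worth checking them against a few real object names first.

```
input {
  google_cloud_storage {
    bucket_id => "my-logs-bucket"        # hypothetical bucket name
    file_matches => ".*\.log(\.gz)?"     # same as the documented default
    file_exclude => ".*debug.*"          # hypothetical: skip debug logs
  }
}
```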
metadata_key
- Value type is string
- Default is: x-goog-meta-ls-gcs-input
This key is set on objects after the plugin has processed them. That way, you can stop and restart the plugin without it uploading the same files again, and you can prevent files from being uploaded by setting the field manually.
Note: The key is only a flag. If a file was partially processed before Logstash exited, some events will be resent.
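A custom key can be set as sketched below. The key value is hypothetical and simply mirrors the prefix of the documented default. Whether two independent pipelines can track the same bucket separately by using different keys is an assumption that follows from the description above, not something this page confirms.

```
input {
  google_cloud_storage {
    bucket_id => "my-logs-bucket"                          # hypothetical bucket name
    json_key_file => "/home/user/key.json"
    # Hypothetical custom flag; objects carrying this key are treated
    # as already processed.
    metadata_key => "x-goog-meta-ls-gcs-input-archive"
  }
}
```

Because partially processed files can be resent, pairing this with the line_id document ID from the idempotent inserts example helps avoid duplicates downstream.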
processed_db_path
- Value type is path
- Default is: LOGSTASH_DATA/plugins/inputs/google_cloud_storage/db
If set, the plugin will store the list of processed files locally. This allows you to create a service account for the plugin that does not have write permissions. However, the data will not be shared across multiple running instances of Logstash.
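A sketch of this read-only setup is shown below. Both paths are hypothetical; the key file is assumed to belong to a service account that can read objects but not update them.

```
input {
  google_cloud_storage {
    bucket_id => "my-logs-bucket"                              # hypothetical bucket name
    json_key_file => "/home/user/read-only-key.json"           # hypothetical read-only key
    # Processed files are tracked on local disk instead of on the objects.
    processed_db_path => "/var/lib/logstash/gcs-processed-db"  # hypothetical path
  }
}
```

Since the list lives on one machine, a second Logstash instance pointed at the same bucket would not see it and would read the same files again.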
Common Options
The following configuration options are supported by all input plugins:
Details
codec
- Value type is codec
- Default value is "plain"
The codec used for input data. Input codecs are a convenient method for decoding your data before it enters the input, without needing a separate filter in your Logstash pipeline.
enable_metric
- Value type is boolean
- Default value is true
Disable or enable metric logging for this specific plugin instance. By default, we record all the metrics we can, but you can disable metrics collection for a specific plugin.
id
- Value type is string
- There is no default value for this setting.
Add a unique ID to the plugin configuration. If no ID is specified, Logstash will generate one.
It is strongly recommended to set this ID in your configuration. This is particularly useful when you have two or more plugins of the same type, for example, if you have 2 google_cloud_storage inputs.
Adding a named ID in this case will help in monitoring Logstash when using the monitoring APIs.
input {
  google_cloud_storage {
    id => "my_plugin_id"
  }
}
Variable substitution in the id field only supports environment variables and does not support the use of values from the secret store.
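Environment variable substitution in the id field could look like the sketch below. DEPLOY_ENV is a hypothetical variable name, and the part after the colon is the fallback used when the variable is not set.

```
input {
  google_cloud_storage {
    bucket_id => "my-logs-bucket"            # hypothetical bucket name
    # Resolves to "gcs_input_prod" when DEPLOY_ENV=prod,
    # or "gcs_input_dev" when DEPLOY_ENV is unset.
    id => "gcs_input_${DEPLOY_ENV:dev}"
  }
}
```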
tags
- Value type is array
- There is no default value for this setting.
Add any number of arbitrary tags to your event.
This can help with processing later.
type
- Value type is string
- There is no default value for this setting.
Add a type field to all events handled by this input.
Types are used mainly for filter activation.
The type is stored as part of the event itself, so you can also use the type to search for it in Kibana.
If you try to set a type on an event that already has one (for example, when you send an event from a shipper to an indexer), a new input will not override the existing type. A type set at the shipper stays with that event for its life, even when sent to another Logstash server.
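As a brief sketch of how type and tags can drive filter activation, the configuration below marks events from one input and then applies a filter only to those events. The bucket name, type value, tag, and added field are all hypothetical.

```
input {
  google_cloud_storage {
    bucket_id => "my-logs-bucket"     # hypothetical bucket name
    type => "gcs_logs"                # hypothetical type
    tags => ["cold-storage"]          # hypothetical tag
  }
}

filter {
  # Runs only for events that carry both the type and the tag set above.
  if [type] == "gcs_logs" and "cold-storage" in [tags] {
    mutate {
      add_field => { "source_system" => "gcs" }
    }
  }
}
```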