INSERT INTO LOGSTASH SELECT DATA FROM DATABASE
Ever want to search your database entities from Elasticsearch? Now you can use Logstash to do just that! In this blog we introduce the JDBC input, which has been created to import data from any database that supports the JDBC interface. Below, we show you few examples of using this plugin.
Getting Started
Installation
bin/plugin install logstash-input-jdbc
Driver Support
Popular databases like Oracle, Postgresql, and MySQL have compatible JDBC drivers that can be used with this input. This plugin does not come packaged with any of these JDBC drivers out of the box, but is straightforward to download. You can then configure the plugin to use the desired jdbc driver library. The setting jdbc_driver_library
and jdbc_driver_class
are used to load the library path and the driver's class name.
Lets get started with the examples!
Example 1: Simple Postgres Input
Here is an example of how you get started reading from a local Postgresql database. As a prerequisite, download the Postgresql JDBC drivers to use with the plugin.
Setting Up The Database
Before we get started, let's create a table called contacts
and populate it with some contacts!
create table contacts ( <p> uid serial, email VARCHAR(80) not null, first_name VARCHAR(80) NOT NULL, last_name VARCHAR(80) NOT NULL ); INSERT INTO contacts(email, first_name, last_name) VALUES('jim@example.com', 'Jim', 'Smith'); INSERT INTO contacts(email, first_name, last_name) VALUES(null, 'John', 'Smith'); INSERT INTO contacts(email, first_name, last_name) VALUES('carol@example.com', 'Carol', 'Smith'); INSERT INTO contacts(email, first_name, last_name) VALUES('sam@example.com', 'Sam', null);</p>
After this runs, here are the contents in the database in table form.
First Name | Last Name | |
null | John | Smith |
carol@example.com | Carol | Smith |
sam@example.com | Sam | Smith |
jim@example.com | Jim | Smith |
Logstash Configuration
We can go ahead and output all these events to the console with this sample Logstash configuration:
# file: simple-out.conf input { jdbc { # Postgres jdbc connection string to our database, mydb jdbc_connection_string => "jdbc:postgresql://localhost:5432/mydb" # The user we wish to execute our statement as jdbc_user => "postgres" # The path to our downloaded jdbc driver jdbc_driver_library => "/path/to/postgresql-9.4-1201.jdbc41.jar" # The name of the driver class for Postgresql jdbc_driver_class => "org.postgresql.Driver" # our query statement => "SELECT * from contacts" } } output { stdout { codec => json_lines } }
Now we can run Logstash and see the results!
<p style="line-height: 22.3999996185303px; color: rgb(57, 57, 57); font-family: Arial, Helvetica, Verdana, Tahoma, sans-serif; white-space: normal;">$ logstash-1.5.3/bin/logstash -f simple-out.conf </p><p style="line-height: 22.3999996185303px; color: rgb(57, 57, 57); font-family: Arial, Helvetica, Verdana, Tahoma, sans-serif; white-space: normal;">Logstash startup completed</p><p style="line-height: 22.3999996185303px; color: rgb(57, 57, 57); font-family: Arial, Helvetica, Verdana, Tahoma, sans-serif; white-space: normal;">{"uid":1,"email":null,"first_name":"hello","last_name":null,"@version":"1","@timestamp":"2015-07-29T21:03:18.958Z"} {"uid":2,"email":"jim@example.com","first_name":"Jim","last_name":"Smith","@version":"1","@timestamp":"2015-07-29T21:03:18.959Z"} {"uid":3,"email":null,"first_name":"John","last_name":"Smith","@version":"1","@timestamp":"2015-07-29T21:03:18.959Z"} {"uid":4,"email":"carol@example.com","first_name":"Carol","last_name":"Smith","@version":"1","@timestamp":"2015-07-29T21:03:18.959Z"} {"uid":5,"email":"sam@example.com","first_name":"Sam","last_name":null,"@version":"1","@timestamp":"2015-07-29T21:03:18.961Z"}</p><p style="line-height: 22.3999996185303px; color: rgb(57, 57, 57); font-family: Arial, Helvetica, Verdana, Tahoma, sans-serif; white-space: normal;">Logstash shutdown completed</p>
Awesome, we read data from Postgresql!
Up next, we will demonstrate two examples of how you may use this plugin in the context of Elasticsearch.
Example 2: Synchronizing Data In Your Table To Elasticsearch
In the case that we are using our database as an input source for Elasticsearch, we may be interested in keeping our existing documents in-sync with our data as the database undergoes updates. In this case, we can simply index our rows in Elasticsearch with unique ids such that any time we re-index them, they will just update. This way, we prevent Elasticsearch from assigning a new ID for each record and generating duplicates!
<p># file: contacts-index-logstash.conf input { jdbc { jdbc_connection_string => "jdbc:postgresql://localhost:5432/mydb" jdbc_user => "postgres" jdbc_validate_connection => true jdbc_driver_library => "/path/to/postgresql-9.4-1201.jdbc41.jar" jdbc_driver_class => "org.postgresql.Driver" statement => "SELECT * from contacts" } } output { elasticsearch { protocol => http index => "contacts" document_type => "contact" document_id => "%{uid}" host => "ES_NODE_HOST" } }</p>
Let's do a quick check to see that "Sam" was indexed into Elasticsearch
<p>curl ES_NODE_HOST:9200/contacts/contact/5?pretty { "_index" : "contacts", "_type" : "contact", "_id" : "5", "_version" : 1, "found" : true, "_source":{"uid":5,"email":"sam@example.com","first_name":"Sam","last_name":null,"@version":"1","@timestamp":"2015-07-29T22:12:20.146Z"} }</p>
So far we just saw how to use a query to fetch results from a database query, but what if we want to update our index with new changes? What if some of our contacts changed emails, or we want to update someone's last name? Here is a sequence of changes that we can apply to our table and later verify the behavior we want in the resulting Elasticsearch index.
<p>UPDATE contacts SET last_name = 'Smith' WHERE email = 'sam@example.com'; UPDATE contacts SET email = 'john@example.com' WHERE uid = 3; INSERT INTO contacts(email, first_name, last_name) VALUES('new@example.com', 'New', 'Smith');</p>
Now we can run Logstash with the same configuration. When we do the same query as before, we will
notice that our document containing Sam has been updated and @version
is now 2
<p>curl ES_NODE_HOST:9200/contacts/contact/5?pretty {</p><p> "_index" : "contacts", "_type" : "contact", "_id" : "5", "_version" : 2, "found" : true, "_source":{"uid":5,"email":"sam@example.com","first_name":"Sam","last_name":"Smith","@version":"1","@timestamp":"2015-07-29T22:12:56.980Z"} }</p>
Using this method, we can re-index our table into Elasticsearch without ending up with duplicates. One thing to note is that we are not able to capture deletes to documents under this scheme.
Example 3: MusicBrainz Demo
MusicBrainz is an open music database containing up-to-date information about artists, their works, and everything in-between. You can learn more at http://musicbrainz.org.
MusicBrainz graciously hosts a biweekly data dump of their database here.
This data is 1.8GB with information about around 18 million tracks
How to get the MusicBrainz data
You must first run your own mirror of the MusicBrainz database. This can be achieved using a tool called mbslave. The project's repo has instructions on syncing with the data-dump.
Formulating a query to load
Now that we have all of this wealth of music data in an accessible database, we can choose a subset of the data we wish to index into Elasticsearch. We may be interested in exploring the data about artists and their releases. Here is a SQL query to fetch a few attributes belonging to artists and their releases:
<p>SELECT release_group.gid AS album_id, release_group.type AS album_primary_type_id, release_group_primary_type.name AS album_primary_type_name, release.name AS release_name, artist.name AS artist_name, artist.gid AS artist_gid, artist_credit.id AS artist_credit_id, artist.type AS artist_type_id, artist_type.name AS artist_type_name, artist.begin_date_year artist_begin_date_year, area.name AS artist_country_name, release_country.date_year AS release_year, release_country.date_month AS release_month, release_country.date_day AS release_day FROM musicbrainz.artist INNER JOIN musicbrainz.artist_credit_name ON artist_credit_name.artist = artist.id INNER JOIN musicbrainz.artist_credit ON artist_credit.id = artist_credit_name.artist_credit INNER JOIN musicbrainz.release_group ON release_group.artist_credit = artist_credit.id <span></span>INNER JOIN musicbrainz.release ON release.release_group = release_group.id INNER JOIN musicbrainz.release_country ON release.id = release_country.release INNER JOIN musicbrainz.artist_type ON artist.type = artist_type.id INNER JOIN musicbrainz.area ON artist.area = area.id INNER JOIN musicbrainz.release_group_primary_type ON release_group_primary_type.id = release_group.type <span></span>WHERE ((release_country.date_year IS NOT NULL) AND (release_country.date_month IS NOT NULL) AND (release_country.date_day IS NOT NULL))</p>
Using Logstash to Query The Database and Index Into Elasticsearch
<p>input { jdbc { jdbc_driver_library => "/path/to/driver.jar" jdbc_driver_class => "org.postgresql.Driver" jdbc_url => "jdbc://postgresql" jdbc_user => "musicbrainz" statement_filepath => "query.sql" } }</p><p>output { elasticsearch { protocol => http }</p><p>}</p>
In this case, we have such a complex query that we chose to leverage the statement_filepath
parameter option.
Exploring Data in Kibana
One great feature of migrating a part of the data into Elasticsearch is the ability to generate great insightful visualizations using Kibana. For starters, let's see how many musical releases are introduced year over year!
(image: Number of releases per year from 1900 to 2010)
Seeing general counts is a nice starter, but we can explore much more! For example, these releases have artists, countries, and release types associated with them. In the following Kibana dashboard we can see the 20 artists with the most number of releases associated to them. We can also visualize the differences between album, EP, and singles releases across the various producing countries.
We can drill into our donut visualization and filter for EPs that were produced by artists from the United Kingdom.
You may recognize some of these artists, while others may be new to you. Musicbrainz collects data about so many artists, there is always something new to discover!
More Information
This post only begins to explore the integrations and features the JDBC plugin includes. For more information, check out the plugin's documentation. We would love your feedback on our forum; if you think you’ve found a bug in this plugin, please submit an issue.