Deprecating Rivers
When I added rivers way back in time in the early days of Elasticsearch, the idea was somewhat novel. One of the first tasks that users do when using Elasticsearch is to get data into Elasticsearch to make it searchable, so why not allow our community to write plugins that can be installed directly on an Elasticsearch cluster to pull data into Elasticsearch automatically.
The first few rivers implementations were quite successful and very helpful. The CouchDB river was immensely simple and popular (thanks to CouchDB changes API). Others were popular as well, for example, the RabbitMQ one. They did start to slowly show the problematic nature of rivers as well.
What was the problem we were witnessing? Cluster stability. You see, by their nature, rivers deal with external systems, and those external systems require external libraries to work with. Those are great to use, but they come with an overhead. Part of it is built in overhead, things like additional memory usage, more sockets, file descriptors and so on. Others, sadly, are bugs.
Part of our efforts in the past couple of years has been to improve Elasticsearch resiliency, and we kept seeing, time and time again, that rivers are a big cause for cluster instability, due to their inherent notion of working with external systems and external libraries. When Found joined us a couple of months ago, we found that they see the same thing, with rivers plugins causing most of the cluster instabilities across the thousands of clusters under management.
We knew it for some time, and in the spirit of helping users build more resilient systems, we decided to deprecate rivers and ask users to focus on getting data to Elasticsearch from "outside" the cluster. Rivers are deprecated from 1.5 moving forward. We will probably keep the infrastructure around in 2.0, and only remove them at a later version, to ease the migration.
The ease of use in getting data into Elasticsearch is still important though, so where should you go from here?
For more than a year, we've had official client libraries for Elasticsearch in most programming languages. It means that hooking into your application and getting data through an existing codebase should be relatively simple. This technique also allows to easily munge the data before it gets to Elasticsearch. A common example is an application that already used an ORM to map the domain model to a database, and hooking and indexing the domain model back to Elasticsearch tends to be simple to implement.
Also, Logstash, or similar tools, can be used to ship data into Elasticsearch. For example, some of the rivers Elasticsearch came with are now implemented as Logstash plugins (like the CouchDB one) in the forthcoming Logstash 1.5.
I love the community and work that has gone into a vibrant set of river plugins, and the decision to deprecate rivers was not taken lightly. What should you do if you are a river plugin author?
Since rivers are relatively self-sufficient, the code can be extracted into a common library that can be used to get data into Elasticsearch. Then, this library can be used in various different places. A simple "main class" can be written to allow to execute it as a standalone process.
Another option is to move the plugin to be a Logstash input. Logstash inputs are very simple to write, and in 1.5 the Logstash team has made the process of writing, maintaining, and discovering plugins super simple.
It is a hard decision to take something away, especially with all the effort that has gone into writing those river plugins by the wonderful authors. I deeply apologize for it, and we would love to help out with any questions and ideas on how to move forward adapting them. The issue for it is #10345.