This post will describe how we're using Elasticsearch as part of Crate. Crate is an open source data store for any data. Crate is a shared-nothing, fully searchable, document-oriented cluster store. As far as we know, we're the first to use Elasticsearch as a framework.
We were always impressed by the amazing simplicity of Elasticsearch. I began using it in 2010, to support large scale systems we built out at Lovely Systems (our systems integration company prior to founding Crate.io). Elasticsearch isn't just simple, it is also ultra fast and scales extremely easy. Click here to see Jodok talking about how we threw 24 billion records into Elasticsearch.
At the time, we needed special aggregations and other functionality for some of our projects, so, in 2011, we wrote our own plugins(1) for Elasticsearch. This made me like Java again, since Elasticsearch is written in a very direct and clean way.
When we began thinking about the ingredients of a good, modern, data store - a no-brainer data store for big data applications - we wanted to take elasticsearch as a model for simplicity and scalability.
We decided to use Elasticsearch as a framework. Elasticsearch is currently included from source as a git submodule. Unlike Elasticsearch, Crate uses Gradle as build tool, so we have our own project file for the Elasticsearch source tree.
There are two main reasons why we are including the source tree instead of just adding a dependency on the jar:
- Elasticsearch shades and minimizes some of its dependencies at distribution time. This makes it hard to extend components where the dependencies are used. For example Crate uses its own Netty HTTPHandler in order to support streaming data from the client. This handler uses Netty classes which are not redistributed with the elasticsearch jar. It is also not possible to add Netty as a direct dependency which is also the case in this issue http://elasticsearch-users.115913.n3.nabble.com/netty-transport-and-shaded-jar-td4028065.html
- We use our own Elasticsearch fork, since we have some minor patches in it. In most cases those patches are merged into Elasticsearch (e.g: https://github.com/elasticsearch/elasticsearch/pull/6127) . Our own fork gives us the possibility to do real world testing and stabilize the change before doing a pull request on the upstream.
Today, Crate depends on many Elasticsearch components, such as:
- The settings machinery from Elasticsearch is used in many places. It allows us to store transient and persistent settings on global, node or table level in a very simple way. We don't have to care about how the cluster state gets synced between the nodes - which is great.
- The transport protocol, and also the request handlers are used all over the place in Crate.io (Download here) for all the internal network processing and for the SQL interface.
- Currently we store the data in the same way as Elasticsearch does, so right now it is even possible to retrieve the data via the Elasticsearch API (2). We already have special settings with our tables and their underlying indices. For example we use DocValues for simple column types , since this better matches the database like usage.
- As already mentioned above, we use a derived Netty handler class, in order to support streaming, however it is just a patched version of the original.
- BLOBs are actually treated as a normal tables from a cluster point of view - therefore the whole routing implementation is used from Elasticsearch. However, on the shard level we operate directly on the filesystem instead of using Lucene.
We still enjoy coding against the Elasticsearch codebase and also how contributions and communications are done in this Project. There are many other things we use from Elasticsearch, feel free to take a look at source on GitHub or contact us on https://community.crate.io if you are interested in details.
Get Crate
Get crate now and then watch the youtube video with 1 minute installation of crate.
(1) While most of them have been project specific - some of them have been made open source like the inout https://github.com/crate/elasticsearch-inout-plugin and the timefacets plugin https://github.com/crate/elasticsearch-timefacets-plugin. The functionality of these Plugins are now included in Crate.
(2) The Elasticsearch-API can be enabled via configuration. However this might change in the future, and is currently only there for debugging purposes.