From time to time, a nearby university invites me to deliver a guest lecture about databases. Towards the end of my talk, I include a simple demonstration of query optimization. Beginning with a slow query, I run EXPLAIN ANALYZE, show how to interpret the output, add an index, then re-run EXPLAIN to show how the query is now faster.
I usually conclude by casually remarking “…and if you can learn how to do that, you too can make 2000 a day as a database performance consultant.”
This usually attracts the attention of someone who had been snoozing at the back. They’ll timidly raise their hand and ask, “Excuse me… would you mind running through that again?”
The Magic of Indexes
The first relational databases were developed between 1976 and 1979. Relational data modeling is a concept as old as I am. The introduction of a declarative query language that describes the result set of the query, but which is agnostic about how the data is fetched and filtered was, at the time, a revolutionary idea. This can, however, lead to slow queries that may take literally hours to execute at first. A database performance expert is then brought in to define and add appropriate indexes. Running the same query against the same table may then result in it completing in less than a second.
The Rise of "Schemaless" Databases
Around 2010 the schemaless approach started to gain popularity with application developers, liberating them from having to define a complete database schema up front. With this approach, data can simply be inserted, often as arbitrary JSON documents. JSON documents are self-describing and can include data types such as strings, numbers and Booleans organized as nested objects and arrays. A schemaless approach to storing such data removes the need to spend time negotiating a schema definition with your database architect: developers can just start writing code. As you might expect, developers loved this!
Whilst this was a revolutionary simplification of data management, it wasn’t entirely schemaless. At some point, and preferably before going into production, you still need to add indexes to ensure queries made by your application perform adequately.
CrateDB: Taking Schemaless to the Next Level
CrateDB is also from the 2010 era school of modern database thinking. The founders of CrateDB set out to implement the schemaless vision more completely than others before them. With CrateDB you don’t have to plan or choose which indexes to add… every column is indexed with a Lucene index by default.
Lucene is an open-source project from the Apache Software Foundation. It is the library for state-of-the-art search and indexing. Indexes are compact data structures designed to handle huge volumes of data. They are so efficient that, as CrateDB does, it’s possible to maintain a Lucene index on every column in your database.
What does this mean in practice? You can insert arbitrary data into CrateDB and expect any query to be reasonably fast. Whichever column you filter on, order or group by, it’s already going to be indexed for you.
Trade-offs in Indexing
I hear what you’re thinking: there’s no such thing as a free lunch. You’re right… the revolutionary innovation here is that indexing every column is possible: I didn’t say it’s a zero-cost option. A CrateDB database uses about 3x the storage that storing the unindexed data alone requires. Think of this as a trade-off: without the indexes you would spend much more CPU time on the same queries... This might be of the order of hours or days in a large database, versus performing the same query over an index in seconds or even milliseconds. The increased disk footprint should be seen as a storage/CPU tradeoff. You can of course always selectively disable indexing when you don’t need it for certain columns.
Ask your Data Anything!
If I can ask any question, what should I ask...?
After 50 years working with databases that require careful up-front planning... To suddenly sit in front of a CrateDB database that liberates you from all that and is ready to accept any query... feels unreal. But it’s possible and it works. Soon we will all be used to it, forgetting that there ever was another way.
And those students in the back will have to find some other job. The job of flying to a customer just to add a few indexes will no longer exist.
Liked this article? Download the original PDF here.