We announced CrateDB 2.3 two weeks ago.
The 2.3.x release line is stable now, so head on over to the downloads page to get started.
In my last Feature Focus I took a look at the new support for UNION ALL
, metrics for container monitoring, and authentication.
In this post, I will take a closer look at some of the other new features in 2.3.
Introducing: HyperLogLog
We had a customer with high cardinality columns, and they wanted faster COUNT DISTINCT
queries.
After some discussion with them, we all agreed that an implementation of the HyperLogLog algorithm was a good idea. They were willing to trade away precision for faster performance. HyperLogLog is an algorithm for approximating the number of distinct values in a table.
So now, instead of this:
SELECT COUNT(DISTINCT column) FROM table
Starting in CrateDB 2.3, you can, if you wish, execute this:
SELECT hyperloglog_distinct(column) FROM table
HyperLogLog outperforms a normal COUNT DISTINCT
on high cardinality columns, i.e. when many of the values are different. We performed some benchmarks, and, for example, with a column with a medium cardinality (~120 million distinct values across ~1.3 billion rows) we saw 13x speedups. For columns with much lower cardinality, we saw speedups like 1.8x or less.
COUNT DISTINCT
is great for general purpose amounts of data. But with very large datasets, the CPU and memory costs can be prohibitive. The same is not true for HyperLogLog.
So for that reason, if you're working with a lot of data, and 100% accuracy isn't too important, hyperloglog_distinct
is recommended, even for low cardinality columns.
An Example Use-Case
HyperLogLog works really well with time-series data.
Let's say you're running a factory with a hundred or more sensors all logging events. And you want to know how many warn_type_a
events happen per hour, every hour, so that you can track them over time.
You can run hyperloglog_distinct
across millions of records, every second, in real-time, as the data is pouring into the database. This in turn can be used to power live analytics, data visualization, reporting, and so on.
In Other News
CrateDB now returns expected parameter types for INSERT
, UPDATE
, and DELETE
statements.
For example, consider this query:
INSERT INTO table (id, name) VALUES (?, ?)
CrateDB will now make a best guess and tell the client attempting to prepare a statement what it expects those placeholders to be.
We have also added sub-query support for UPDATE
and DELETE
statements.
For example, with UPDATE
:
UPDATE table_1
SET name = 'hello'
WHERE id IN (
SELECT id FROM table_2
)
And with DELETE
:
DELETE FROM table_1
WHERE id IN (
SELECT id FROM table_2
)
Also, we added support for real-time primary key lookups inside subqueries:
SELECT COUNT(*)
FROM (
SELECT x FROM table WHERE id = ANY(?)
) AS table_alias
In CrateDB, by default, tables are refreshed every 1000 milliseconds. Normal queries can hence sometimes return slightly stale data. However, filtering a table on its primary key is a special type of query because it is guaranteed to read the data directly from the translogs, which always contain the latest changes.
In the past, filtering on primary key inside a subselect broke this guarantee. But in 2.3, this has been fixed.
Wrap Up
In this post, we covered the new support for a HyperLogLog powered distinct count function. As well as some improvements for subqueries. Check out my previous Feature Focus post for more. Or, for a full list of change, check out the CrateDB 2.3.x release notes.
We’re already at work on our next batches of improvements for CrateDB, so watch this space for future announcements and posts.