Feature Focus: Making Things Hyper Fast Fast

Written by PJ Hagerty | 2018-01-31

We announced CrateDB 2.3 two weeks ago.

The 2.3.x release line is stable now, so head on over to the downloads page to get started.

In my last Feature Focus I took a look at the new support for UNION ALL, metrics for container monitoring, and authentication.

In this post, I will take a closer look at some of the other new features in 2.3.

Introducing: HyperLogLog

We had a customer with high cardinality columns, and they wanted faster COUNT DISTINCT queries.

After some discussion with them, we all agreed that an implementation of the HyperLogLog algorithm was a good idea. They were willing to trade away precision for faster performance. HyperLogLog is an algorithm for approximating the number of distinct values in a table.

So now, instead of this:

SELECT COUNT(DISTINCT column) FROM table

Starting in CrateDB 2.3, you can, if you wish, execute this:

SELECT hyperloglog_distinct(column) FROM table

HyperLogLog outperforms a normal COUNT DISTINCT on high cardinality columns, i.e. when many of the values are different. We performed some benchmarks, and, for example, with a column with a medium cardinality (~120 million distinct values across ~1.3 billion rows) we saw 13x speedups. For columns with much lower cardinality, we saw speedups like 1.8x or less.

COUNT DISTINCT is great for general purpose amounts of data. But with very large datasets, the CPU and memory costs can be prohibitive. The same is not true for HyperLogLog.

So for that reason, if you're working with a lot of data, and 100% accuracy isn't too important, hyperloglog_distinct is recommended, even for low cardinality columns.

An Example Use-Case

HyperLogLog works really well with time-series data.

Let's say you're running a factory with a hundred or more sensors all logging events. And you want to know how many warn_type_a events happen per hour, every hour, so that you can track them over time.

You can run hyperloglog_distinct across millions of records, every second, in real-time, as the data is pouring into the database. This in turn can be used to power live analytics, data visualization, reporting, and so on.

In Other News

CrateDB now returns expected parameter types for INSERT, UPDATE, and DELETE statements.

For example, consider this query:

INSERT INTO table (id, name) VALUES (?, ?)

CrateDB will now make a best guess and tell the client attempting to prepare a statement what it expects those placeholders to be.

We have also added sub-query support for UPDATE and DELETE statements.

For example, with UPDATE:

UPDATE table_1
   SET name = 'hello'
 WHERE id IN (
   SELECT id FROM table_2
  )

And with DELETE:

DELETE FROM table_1
      WHERE id IN (
        SELECT id FROM table_2
      )

Also, we added support for real-time primary key lookups inside subqueries:

SELECT COUNT(*)
FROM (
  SELECT x FROM table WHERE id = ANY(?)
) AS table_alias

In CrateDB, by default, tables are refreshed every 1000 milliseconds. Normal queries can hence sometimes return slightly stale data. However, filtering a table on its primary key is a special type of query because it is guaranteed to read the data directly from the translogs, which always contain the latest changes.

In the past, filtering on primary key inside a subselect broke this guarantee. But in 2.3, this has been fixed.

Wrap Up

In this post, we covered the new support for a HyperLogLog powered distinct count function. As well as some improvements for subqueries. Check out my previous Feature Focus post for more. Or, for a full list of change, check out the CrateDB 2.3.x release notes.

We’re already at work on our next batches of improvements for CrateDB, so watch this space for future announcements and posts.

View full post