Select performance
Aggregations and GROUP BY
It is common to run GROUP BY queries for analytics purposes. For example, you might select the avg, max, and min of some measurements over a billion records and group them by device ID. Such a query might look like this:
cr> SELECT
        device_id,
        max(value),
        avg(value),
        min(value)
    FROM
        measures
    GROUP BY
        device_id
    ORDER BY
        1 DESC;
+-----------+------------+-------------------+------------+
| device_id | max(value) | avg(value)        | min(value) |
+-----------+------------+-------------------+------------+
|         4 |      10000 | 5003.748816285036 |          0 |
|         3 |      10000 | 5005.297395430482 |          0 |
|         2 |      10000 | 5002.940588080021 |          0 |
|         1 |      10000 | 5002.216030711031 |          0 |
+-----------+------------+-------------------+------------+
By default, CrateDB processes all matching records. This may require a lot of processing power, depending on the data set and the size of the CrateDB cluster.
Some databases can limit the number of records that are processed for GROUP BY operations (also known as down-sampling) to improve performance at the cost of less accurate results. This exploits the fact that if your data set has a normal distribution, the average over the whole data set is likely not much different from the average over a subset of the data. Aggregating 100,000 records instead of 10 or 100 million records can make a huge difference in query response times. For some analytics use cases, this is an acceptable trade-off.
CrateDB users can emulate this down-sampling behaviour with a combination of LIMITs and sub-selects. However, doing so involves costly data merges in the query execution plan that reduce the parallelization (and thus performance) of a distributed query.
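As a rough sketch (reusing the measures table from above, with an arbitrary sample size of 100,000 rows), such an emulation might look like this:
cr> SELECT
        device_id,
        max(value),
        avg(value),
        min(value)
    FROM (
        -- Arbitrary sample: take the first 100,000 rows the cluster returns.
        -- The LIMIT forces an intermediate merge before the GROUP BY runs,
        -- which is why this approach reduces parallelization.
        SELECT device_id, value
        FROM measures
        LIMIT 100000
    ) AS sampled
    GROUP BY
        device_id
    ORDER BY
        1 DESC;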
A better way to emulate down-sampling is to filter on the _docid system column using a modulo (%) operation, like this:
cr> SELECT
        device_id,
        max(value),
        avg(value),
        min(value)
    FROM
        measures
    WHERE
        _docid % 10 = 0
    GROUP BY
        device_id
    ORDER BY
        1 DESC;
+-----------+------------+--------------------+------------+
| device_id | max(value) | avg(value)         | min(value) |
+-----------+------------+--------------------+------------+
|         4 |      10000 |  5013.052623224065 |          1 |
|         3 |      10000 | 4999.1253575025175 |          0 |
|         2 |      10000 |  5001.400379047543 |          0 |
|         1 |      10000 |  5035.220951927276 |          0 |
+-----------+------------+--------------------+------------+
You’ll notice that the result has changed slightly, but is still fairly close to the original result.
Tip
The % 10 in this example was arbitrary and roughly translates to: "only consider every 10th row." The higher the number, the fewer records match the query and the less accurate the result becomes; larger numbers trade accuracy for performance.
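For example, assuming the same measures table, sampling roughly one row in a hundred would look like this:
cr> SELECT
        device_id,
        avg(value)
    FROM
        measures
    WHERE
        -- Roughly every 100th row is considered, so the query touches about
        -- 10x fewer records than "% 10", at the cost of a less accurate result.
        _docid % 100 = 0
    GROUP BY
        device_id;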
Note
The _docid system column exposes the internal document ID each document has within a Lucene segment. The IDs are unique within a segment but not across segments or shards. This is good enough for a modulo sampling operation. Furthermore, the internal ID is already available and doesn't have to be read from the file system. This makes it an ideal candidate for modulo-based sampling.
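Like other system columns, _docid is not returned by SELECT *, but you can select it explicitly if you want to get a feel for the values it holds (the exact numbers will differ per segment and shard). A minimal sketch:
cr> SELECT
        _docid,
        device_id,
        value
    FROM
        measures
    LIMIT 5;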