For many applications, query throughput (i.e. how many queries the system can handle concurrently) is an important consideration. Specifically, being able to scale query throughput easily makes it possible (and economical) to grow your system as your user base grows.
In the last few weeks, we assembled a performance engineering team to run some benchmarks that would help us demonstrate CrateDB’s ability to scale query throughput linearly.
To do this, we simulated an application that collects time-series data (specifically, sensor readings) and stores them in a multi-tenant database.
This is a common use case for our users. CrateDB can ingest massive amounts of data and query it in real time, which makes it uniquely well suited for monitoring and analyzing machine data (e.g. sensors, Internet of Things devices, logs, and so on) via dashboards.
In this post, I will go through the basic benchmark setup and share the high-level results with you. If you want to dig deeper, the full details are available in the whitepaper: CrateDB Query Throughput Scalability Benchmark.
We ran this benchmark on CrateDB 1.0.4.
We used Amazon AWS c3.2xlarge instances with the following configuration:
The data set we used contains 314,496,000 records. These data points simulate sensor readings gathered over the period of one year. Each record represents a single sensor reading, and also includes data about the sensor itself. The data was partitioned by week.
Here’s the table schema we used for this data:
DROP TABLE IF EXISTS b.t2;

-- The "tree" analyzer splits colon-delimited values such as 'a:b:c'
-- into the hierarchical tokens 'a', 'a:b', and 'a:b:c'.
CREATE ANALYZER "tree" (
  TOKENIZER tree WITH (
    type = 'path_hierarchy',
    delimiter = ':'
  )
);

CREATE TABLE IF NOT EXISTS b.t2 (
  "uuid" STRING,
  "ts" TIMESTAMP,
  "tenant_id" INTEGER,
  "sensor_id" STRING,
  "sensor_type" STRING,
  "v1" INTEGER,
  "v2" INTEGER,
  "v3" FLOAT,
  "v4" FLOAT,
  "v5" BOOLEAN,
  -- Partition column, derived automatically from the reading's timestamp
  "week_generated" TIMESTAMP GENERATED ALWAYS AS date_trunc('week', ts),
  -- Named fulltext index over sensor_type, using the hierarchical "tree" analyzer
  INDEX "taxonomy" USING FULLTEXT (sensor_type) WITH (analyzer = 'tree')
) PARTITIONED BY ("week_generated")
  CLUSTERED BY ("tenant_id") INTO 156 SHARDS;
For this benchmark, we ran three different queries. Query parameters were randomly generated so that the queries operated across different parts of the dataset, emulating real-world use.
Data from a single tenant, aggregated across multiple partitions:
SELECT min(v1) AS v1_min, max(v1) AS v1_max, avg(v1) AS v1_avg, sum(v1) AS v1_sum
FROM b.t2
WHERE tenant_id = ? AND week_generated BETWEEN ? AND ?;
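For example, with made-up parameter values, a single-tenant aggregation over a three-week window looks like this. Because week_generated is the partition column, only the matching weekly partitions need to be scanned:

-- Hypothetical parameter values for illustration.
SELECT min(v1) AS v1_min, max(v1) AS v1_max, avg(v1) AS v1_avg, sum(v1) AS v1_sum
FROM b.t2
WHERE tenant_id = 42
  AND week_generated BETWEEN '2016-03-07' AND '2016-03-21';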
Data from all tenants, aggregated within a single partition, grouped and ordered by tenant (i.e. for doing cross-tenant analytics):
SELECT count(*) AS num_docs, tenant_id,
min(v1) AS v1_min, max(v1) AS v1_max, avg(v1) AS v1_avg, sum(v1) AS v1_sum,
min(v2) AS v2_min, max(v2) AS v2_max, avg(v2) AS v2_avg, sum(v2) AS v2_sum,
min(v3) AS v3_min, max(v3) AS v3_max, avg(v3) AS v3_avg, sum(v3) AS v3_sum,
min(v4) AS v4_min, max(v4) AS v4_max, avg(v4) AS v4_avg, sum(v4) AS v4_sum
FROM b.t2
WHERE week_generated = ?
GROUP BY tenant_id
ORDER BY tenant_id;
Sensor lookup within taxonomy (term + children):
SELECT sensor_id, sensor_type
FROM b.t2
WHERE taxonomy = ?
ORDER BY v1
LIMIT 100;
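To see why this returns a term plus its children, consider a hypothetical sensor_type value of 'plant1:line3:temperature'. The path_hierarchy tokenizer indexes it as the tokens 'plant1', 'plant1:line3', and 'plant1:line3:temperature', so a lookup on any prefix matches that node and everything nested beneath it:

-- Hypothetical taxonomy value; matches 'plant1:line3' itself and all sensor types below it.
SELECT sensor_id, sensor_type
FROM b.t2
WHERE taxonomy = 'plant1:line3'
ORDER BY v1
LIMIT 100;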
As we increased the size of the CrateDB cluster, we also increased the number of simulated users.
Here’s a summary of the scaling setup we used:
| Nodes | Users |
|---|---|
| 3 | 30 |
| 6 | 30 |
| 12 | 60 |
| 24 | 120 |
For each cluster, a JMeter test harness simulated a number of concurrent users, each connecting to the database and executing the three test queries (in random sequence and with random query parameters).
These results show the average queries per second across 314 million rows:
| | 3 nodes | 6 nodes | 12 nodes | 24 nodes |
|---|---|---|---|---|
| # Users | 30 | 30 | 60 | 120 |
| Query 1 | 3.2395 | 6.1410 | 10.2985 | 18.2665 |
| Query 2 | 3.0398 | 5.6857 | 10.0139 | 17.5259 |
| Query 3 | 3.1790 | 5.9053 | 10.2393 | 17.7649 |
| Total | 9.0698 | 17.0139 | 29.8725 | 52.0404 |
If we graph this, we get:
As you can see, query throughput increases linearly with cluster size.
This benchmark demonstrates that, at least for this setup, CrateDB scales query throughput linearly with cluster size.
Linear scaling is an important property because it makes capacity planning a lot more predictable, and a lot more affordable.
If you want to know more, check out the whitepaper, or get in touch.