For many applications, query throughput (i.e. how many queries the system can handle concurrently) is an important consideration. Specifically, being able to scale query throughput easily makes it possible (and economical) to grow your system as your user base grows.
In the last few weeks, we assembled a performance engineering team to run some benchmarks that would help us demonstrate CrateDB’s ability to scale query throughput linearly.
To do this, we simulated an application that collects time-series data (specifically, sensor readings) and stores them in a multi-tenant database.
This is a common use case for our users. CrateDB can ingest massive amounts of data and query it in real time, which makes it uniquely well suited for monitoring and analyzing machine data (e.g. sensors, Internet of Things devices, logs, and so on) via dashboards.
In this post, I will go through the basic benchmark setup and share the high-level results with you. If you want to dig deeper, the full details are available in the whitepaper: CrateDB Query Throughput Scalability Benchmark.
We ran this benchmark on CrateDB 1.0.4.
We used Amazon AWS c3.2xlarge instances with the following configuration:
The data set we used contains 314,496,000 records. These data points simulate sensor readings gathered over the period of one year. Each record represents a single sensor reading, and also includes data about the sensor itself. The data was partitioned by week.
Here’s the table schema we used for this data:
DROP TABLE IF EXISTS b.t2;

-- The "tree" analyzer splits colon-delimited values such as 'a:b:c'
-- into the hierarchical tokens 'a', 'a:b', and 'a:b:c'.
CREATE ANALYZER "tree" (
  TOKENIZER tree WITH (
    type = 'path_hierarchy',
    delimiter = ':'
  )
);

CREATE TABLE IF NOT EXISTS b.t2 (
  "uuid" STRING,
  "ts" TIMESTAMP,
  "tenant_id" INTEGER,
  "sensor_id" STRING,
  "sensor_type" STRING,
  "v1" INTEGER,
  "v2" INTEGER,
  "v3" FLOAT,
  "v4" FLOAT,
  "v5" BOOLEAN,
  -- Partition column, derived automatically from the reading's timestamp
  "week_generated" TIMESTAMP GENERATED ALWAYS AS date_trunc('week', ts),
  -- Named fulltext index over sensor_type, using the hierarchical "tree" analyzer
  INDEX "taxonomy" USING FULLTEXT (sensor_type) WITH (analyzer = 'tree')
) PARTITIONED BY ("week_generated")
  CLUSTERED BY ("tenant_id") INTO 156 SHARDS;
For this benchmark, we ran three different queries. Query parameters were randomly generated so that the queries operated across different parts of the dataset, emulating real-world use.
Data from a single tenant, aggregated across multiple partitions:
SELECT min(v1) AS v1_min, max(v1) AS v1_max, avg(v1) AS v1_avg, sum(v1) AS v1_sum
FROM b.t2
WHERE tenant_id = ? AND week_generated BETWEEN ? AND ?;
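For example, with made-up parameter values, a single-tenant aggregation over a three-week window looks like this. Because week_generated is the partition column, only the matching weekly partitions need to be scanned:

-- Hypothetical parameter values for illustration.
SELECT min(v1) AS v1_min, max(v1) AS v1_max, avg(v1) AS v1_avg, sum(v1) AS v1_sum
FROM b.t2
WHERE tenant_id = 42
  AND week_generated BETWEEN '2016-03-07' AND '2016-03-21';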
Data from all tenants, aggregated within a single partition, grouped and ordered by tenant (i.e. for doing cross-tenant analytics):
SELECT count(*) AS num_docs, tenant_id,
min(v1) AS v1_min, max(v1) AS v1_max, avg(v1) AS v1_avg, sum(v1) AS v1_sum,
min(v2) AS v2_min, max(v2) AS v2_max, avg(v2) AS v2_avg, sum(v2) AS v2_sum,
min(v3) AS v3_min, max(v3) AS v3_max, avg(v3) AS v3_avg, sum(v3) AS v3_sum,
min(v4) AS v4_min, max(v4) AS v4_max, avg(v4) AS v4_avg, sum(v4) AS v4_sum
FROM b.t2
WHERE week_generated = ?
GROUP BY tenant_id
ORDER BY tenant_id;
Sensor lookup within taxonomy (term + children):
SELECT sensor_id, sensor_type
FROM b.t2
WHERE taxonomy = ?
ORDER BY v1
LIMIT 100;
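To see why this returns a term plus its children, consider a hypothetical sensor_type value of 'plant1:line3:temperature'. The path_hierarchy tokenizer indexes it as the tokens 'plant1', 'plant1:line3', and 'plant1:line3:temperature', so a lookup on any prefix matches that node and everything nested beneath it:

-- Hypothetical taxonomy value; matches 'plant1:line3' itself and all sensor types below it.
SELECT sensor_id, sensor_type
FROM b.t2
WHERE taxonomy = 'plant1:line3'
ORDER BY v1
LIMIT 100;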
As we increased the size of the CrateDB cluster, we also increased the number of simulated users.
Here’s a summary of the scaling setup we used:
| Nodes | Users |
|---|---|
| 3 | 30 |
| 6 | 30 |
| 12 | 60 |
| 24 | 120 |
For each cluster, a JMeter test harness simulated a number of concurrent users, each connecting to the database and executing the three test queries (in random sequence and with random query parameters).
These results show the average queries per second across 314 million rows:
| | 3 nodes | 6 nodes | 12 nodes | 24 nodes |
|---|---|---|---|---|
| # Users | 30 | 30 | 60 | 120 |
| Query 1 | 3.2395 | 6.1410 | 10.2985 | 18.2665 |
| Query 2 | 3.0398 | 5.6857 | 10.0139 | 17.5259 |
| Query 3 | 3.1790 | 5.9053 | 10.2393 | 17.7649 |
| Total | 9.0698 | 17.0139 | 29.8725 | 52.0404 |
If we graph this, we get:
As you can see, query throughput increases linearly with cluster size.
This benchmark demonstrates that, at least for this setup, CrateDB scales query throughput linearly with cluster size.
Linear scaling is an important property because it makes capacity planning a lot more predictable, and a lot more affordable.
If you want to know more, check out the whitepaper, or get in touch.