Clustering

CrateDB provides scalability through partitioning, sharding, and replication.

Overview

CrateDB uses a shared-nothing architecture to form high-availability, resilient database clusters with minimal effort of configuration, effectively implementing a distributed SQL database.

About

CrateDB relies on Lucene for storage and inherits components from Elasticsearch / OpenSearch for cluster consensus. Fundamental concepts of CrateDB are familiar to Elasticsearch users, because both are actually using the same implementation.

Details

Sharding and partitioning are techniques used to distribute data evenly across multiple nodes in a cluster, ensuring data scalability, availability, and performance.

Replication can be applied to increase redundancy, which reduces the chance of data loss, and to improve read performance.

Sharding:

In CrateDB, tables are split into a configured number of shards. Then, the shards are distributed across multiple nodes of the database cluster. Each shard in CrateDB is stored in a dedicated Lucene index.

You can think of shards as a self-contained part of a table, that includes both a subset of records and the corresponding indexing structures.

Figuring out how many shards to use for your tables requires you to think about the type of data you are processing, the types of queries you are running, and the type of hardware you are using.

Partitioning:

CrateDB also supports splitting up data across another dimension with partitioning. Tables can be partitioned by defining partition columns. You can think of a partition as a set of shards.

  • Partitioned tables optimize access efficiency when querying data, because only a subset of data needs to be addressed and acquired.

  • Each partition can be backed up and restored individually, for efficient operations.

  • Tables allow to change the number of shards even after creation time for future partitions. This feature enables you to start out with few shards per partition, and scale up the number of shards for later partitions once traffic and ingest rates increase over the lifetime of your application or system.

Replication:

You can configure CrateDB to replicate tables. When you configure replication, CrateDB will ensure that every table shard has one or more copies available at all times.

Replication can also improve read performance because any increase in the number of shards distributed across a cluster also increases the opportunities for CrateDB to parallelize query execution across multiple nodes.

Concepts

Reference Manual

Guides

Clustering Sharding Partitioning Replication

Synopsis

With a monthly throughput of 300 GB, partitioning your table by month, and using six shards, each shard will manage 50 GB of data, which is within the recommended size range (5 - 50 GB).

Through replication, the table will store three copies of your data, in order to reduce the chance of permanent data loss.

CREATE TABLE timeseries_table (
    ts TIMESTAMP,
    val DOUBLE PRECISION,
    part GENERATED ALWAYS AS date_trunc('month', ts)
)
CLUSTERED INTO 6 SHARDS
PARTITIONED BY (part)
WITH (number_of_replicas = 2);

Learn

Individual characteristics and shapes of data need different sharding and partitioning strategies. Learn about the details of shard allocation, that will support you to choose the right strategy for your data and your most prominent types of workloads.

Sharding and Partitioning

  • Introduction to the concepts of sharding and partitioning.

  • Learn how to choose a strategy that fits your needs.

Sharding and Partitioning

Sharding Performance Guide

  • Optimising for query performance.

  • Optimising for ingestion performance.

Sharding and Partitioning

Sharding and partitioning guide for time-series data

A hands-on walkthrough to support you with building a sharding and partitioning strategy for your time series data.

Sharding and partitioning guide for time-series data