Sharding with CrateDB

Sharding splits datasets into smaller units called shards, which are distributed across the individual nodes of the cluster. When nodes are added or removed, or when the data distribution becomes uneven, CrateDB automatically redistributes the shards to keep the data balanced. If the number of shards is not defined, CrateDB sets a default value based on the number of nodes in the cluster.

Choosing the right number of shards per table lets you optimize the performance of your application by making the most of the existing hardware. Ideally, each node holds as many shards of a table as it has available CPUs. In very heavy read/write scenarios, advanced configuration options help to avoid bottlenecks, for example by defining dedicated master-eligible nodes that don’t hold data, but only the cluster configuration.
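The sizing rule above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, the sketch assumes every node has the same number of CPUs, and it treats all shard copies alike. The generated statement uses the CLUSTERED INTO ... SHARDS clause of CREATE TABLE; the table and column names are made up.

```python
def target_shard_count(data_nodes: int, cpus_per_node: int) -> int:
    """Return a table-wide shard count that gives each node one shard per CPU.

    Assumption: all nodes are identical, so the total is simply
    nodes multiplied by CPUs per node.
    """
    if data_nodes < 1 or cpus_per_node < 1:
        raise ValueError("data_nodes and cpus_per_node must be positive")
    return data_nodes * cpus_per_node


def create_table_sql(table: str, columns: str, shards: int) -> str:
    """Build a CREATE TABLE statement with an explicit shard count."""
    return f"CREATE TABLE {table} ({columns}) CLUSTERED INTO {shards} SHARDS"


# Hypothetical cluster: 3 data nodes with 8 CPUs each.
shards = target_shard_count(data_nodes=3, cpus_per_node=8)
statement = create_table_sql("sensor_readings", "ts TIMESTAMP, value DOUBLE", shards)
```

For this hypothetical cluster the helper yields 24 shards, that is, 8 shards of the table on each of the 3 nodes.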

In addition to partitioning, you can scale your cluster horizontally to cope with a growing application workload by creating more shards for more recent partitions. For example, in a constantly growing application, the next weekly or monthly partition can be split into more shards than the previous one to distribute the workload across a larger number of nodes.
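As a rough sketch of this growth strategy, the following Python snippet picks a shard count for the next partition after the cluster has grown. The helper names and the table name are hypothetical. The sketch also assumes that changing number_of_shards on a partitioned table only affects partitions created afterwards; verify this against the CrateDB reference for your version before relying on it.

```python
def shards_for_next_partition(current_shards: int, data_nodes: int, cpus_per_node: int) -> int:
    """Pick the shard count for the next partition.

    Grows to one shard per CPU per node, but never goes below the
    count used by the previous partition.
    """
    return max(current_shards, data_nodes * cpus_per_node)


def alter_shards_sql(table: str, shards: int) -> str:
    """Build the statement that sets the shard count on a partitioned table.

    Assumption: on a partitioned table this applies to partitions
    created after the change, not to existing ones.
    """
    return f"ALTER TABLE {table} SET (number_of_shards = {shards})"


# Hypothetical scenario: last month's partition used 12 shards on 3 nodes
# with 4 CPUs each; the cluster has since grown to 5 nodes.
next_shards = shards_for_next_partition(current_shards=12, data_nodes=5, cpus_per_node=4)
statement = alter_shards_sql("sensor_readings", next_shards)
```

Running the statement before the next weekly or monthly partition is created would let that partition spread across the newly added nodes, under the assumption stated above.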

The ideal shard size is between 3 and 70 GB, depending on your use case. Smaller shards help reduce the time nodes need to restore data during automatic failovers and recoveries.
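To relate this size range to a shard count, a back-of-the-envelope calculation like the one below can help. It is a sketch only: the function name is hypothetical, and it assumes the data is spread evenly across shards, which real workloads may not achieve.

```python
import math

MIN_SHARD_GB = 3   # lower bound of the suggested shard size
MAX_SHARD_GB = 70  # upper bound of the suggested shard size


def shard_count_range(total_gb: float) -> tuple[int, int]:
    """Return (fewest, most) shards that keep each shard within 3 to 70 GB.

    Assumption: data is distributed evenly, so shard size is
    total size divided by shard count.
    """
    if total_gb <= 0:
        raise ValueError("total_gb must be positive")
    fewest = max(1, math.ceil(total_gb / MAX_SHARD_GB))
    most = max(fewest, math.floor(total_gb / MIN_SHARD_GB))
    return fewest, most


# Hypothetical table (or partition) expected to hold about 600 GB.
fewest, most = shard_count_range(600)
```

For 600 GB this gives a range of 9 to 200 shards. Where to land within that range depends on the use case, for example on the CPU-based guideline and on how quickly nodes need to recover.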