Replication
Replication in CrateDB allows users to replicate data across multiple nodes in a cluster. Data is replicated at the shard level, and replica shards automatically step in as primary shards if the primary one becomes unavailable due to failures or maintenance. Maintaining at least two replicas is recommended to ensure high availability of the CrateDB cluster.
Replication helps increase performance with parallel data query and data availability. Read requests are broken down and executed in parallel across multiple shards on multiple nodes, massively improving read performance.
CrateDB offers multiple configuration options to find the optimal balance between shards, partitions, and replications:- Partitioning and sharding help scale the cluster horizontally and optimize hardware usage during ingest and read. Ingest and query workloads are parallelized and distributed across the whole cluster.
- Replication helps optimize the resiliency of the cluster and allow for higher read throughput.
CrateDB ensures efficient data synchronization from the primary shard to all replica shards, leveraging the append-only characteristic of Lucene segments. When a node failure results in the loss of a primary shard, a replica shard is automatically promoted to primary status. This system enhances both fault tolerance and the capacity for potent query execution across multiple nodes.
Read operations are executed on primary shards and their replicas to increase query distribution and enhance performance. CrateDB randomly assigns a shard when routing an operation, with the option to configure this behavior. Write operations follow a synchronous process across all active replicas.
CREATE TABLE t1 ( name STRING ) CLUSTERED INTO 3 SHARDS WITH (“number_of_replicas” = '1');
Additional resources
CrateDB at Berlin Buzzwords 2023
When milliseconds matter: maximizing query performance in CrateDB.
Timestamp: 9:55 – 10:35