Understanding partitioned tables and sharding in CrateDB

Written by Ivan Sanchez | 2025-08-21

These days, terms like sharding, partitioning, and segments are commonly used across many databases. In this blog, we’ll take a closer look at what they mean specifically in the context of CrateDB.

The storage model

To understand shards and partitions, we first need to understand the storage model of CrateDB.

Apache Lucene is the cornerstone of the data model. A table is split is several chunks called shards, a shard is the same as a Lucene Index. Every index is composed of segments, segments are immutable and write-only, akin to pages, and they contain the documents/rows. Segments can be searched in sequence, as each segment is its own index.

A table clustered (or split) in 2 shards would look like:

(r) records or rows.

The number of shards that a table will be divided into is calculated automatically using a simple formula:

max(4, num_data_nodes * 2)

Alternatively, you can manually set in the DDL query how many shards a table will be split into:

We can inspect how many shards a table has by querying sys.shards

An index has a path because it's a file, as we said before, composed of immutable segments, this allows for easy backup and synchronization between different nodes as the database just needs to send and receive files.

Operations of the storage models

The fundamental part of CrateDB storage model is the segment, and as operations (read/write/update/delete) happen on node, segments are created and merged. They are merged because search performance decreases as segments pile up, and each segment consumes file handles, memory, and CPU time.

Merging segments

Segments are merged periodically by default. In Lucene, there are other merge strategies than time-based, and CrateDB might change the policy in the future.

When two segments are merged, all the 'valid' records of each segment are combined into a new one.

You can also manually merge the segments of a table explicitly by calling:

OPTIMIZE TABLE table_name WITH (max_num_segments=1)

This often results in less disk usage and faster search operations, it's typically best to let CrateDB merge the segments since in some situations, mostly after heavy writes, it can be an expensive operation.

Michael McCandless has a great post where you can visualize how merges happen with different merge policies.

Deleting a record

When a record is deleted, it's not really deleted, the information stays on disk (in the segment), and the record is marked as deleted/invalid, when an IndexReader reads from the indexes, it will skip these records.

When a merge occurs, the new segment will not contain the records that were marked as deleted.

Updating a record

When a record is updated, a new segment with the updated record is created, where the previous version of the record is marked as deleted. After a refresh, the IndexReader will be aware of the new segment and the record will show up in search results.

Eventually, the segments will be merged and the resulting segment will be the same as if we had updated the original segment.

Inserting a record (routing)

When inserting records, they have to be approximately evenly routed to the shards; otherwise, shard imbalance could degrade performance.

This is the used formula:

shard number = hash(routing column) % total primary shards

If a primary key exists, that will be used as the routing column. The user can also specify an explicit routing column with CRATE TABLE t (a integer, b text) CLUSTERED BY (a), if no primary key or explicit routing column exists, the internal column _id is used.

We can see the effect of routing after inserting a value and checking the shards system table.

After one insert:

After another insert:

Inserting a record (segments)

The record(s) is first committed to both the translog and an in-memory buffer, once its written to the translog we can ensure that the data will not be lost if a node failure happens as we can recover and write the new segments from it.

When a refresh happens, a segment is created, still in memory, and it will now be available in search results, after that at some point the in-memory segments will be committed to disk.

We can see this by checking the segments system table, after inserting a new record and manually calling refresh:

There are three segments, each with one record, all available for search, and the newly added segment is still not committed to disk. If we call optimize to forcibly merge the segments:

Segments are now merged, if you are wondering why there are two instead of one, remember that the segments belong to a shard, the merged segments were from the shard nº2.

Partitioned tables

A partitioned table is a table that has a defined partition column, for every unique value of that partition column, a new partition for the table will be created.

Partitions are split into shards, the number of shards is calculated the same as before, either the default formula or defined by the user. New records will be routed to its partition and then a shard, once could see partitions as the specialization of shards.

To understand the concept better, let's look at this table:

ts_month will contain the month of ts, and the table is partitioned by it, so for every unique value of the partition column (every month) a new partition will be created, furthermore every partition will have 3 shards, therefore 12 months (and partitions) * 3 shards = 36 shards is the maximum number of shards that table can have (without replication).

The shard map of the table will look like this:

As you can see, the structure of a table doesn't really change, it is still composed of shards and segments, but when it has a partition column, the rule of creating shards and routing records depends on that partition column.

As a rule of thumb:

Un-partitioned tables: Number of shards is the default formula or defined in clustered by n_shards. Routing is default _id column or primary key or explicit routing column if defined.
Partitioned tables: Number of shards is the default formula or defined in clustered by n_shards multiplied by the number of partitions, the number of partitions is equal to the count unique values of the routing column.

Why do we partition tables?

The answer to why sharding tables is easy, it allows for easy backup, replication and faster queries, as different indexes can be searched concurrently, but why partitioning or specializing sets of shards to a routing column?

Consider this situation similar to partitions in CrateDB, where we have a set of dates, grouped by their months:

When filtering values by month, months can be skipped to directly access the needed dates. For example, to get the date 1994-03-26, it only requires scanning one group and 4 records instead of scanning 12 records if the data is not grouped by month.

By grouping your data into buckets or partitions by a partition column, queries that filter on that column will be faster, since the query engine can just skip entire partitions. A very typical use case is time-series data, where there is a date or timestamp column and queries usually filter on time.

Choosing the partition_column is critical, a badly chosen column can result in a very high number of shards and most likely hit the default limit of 1000 shards per table, imagine if we chose seconds as the partition column, if we added many timestamp records, very quickly we would create many partitions, hurting performance and storage. The right partition column will depend on the data, use case and requirements, you can read more about this here.

Notes: Replication is turned off

Sharding a table is part of the fundamental structure of the data model in CrateDB, another fundamental aspect is replication.

By default tables have one replica, this multiplies the number of shards, the total number of shards is: primary shards + replica shards. In this article, the replication was turned off and the images showing shards do not show replica shards, only primary shards.

This was done to simplify the different explanations and query results.

Summary

I hope that by the end of the article, you have a deeper understanding of CrateDB's storage model and partitioning: How every table in CrateDB is split into shards and how partitioning is just creating and grouping shards depending on the values of a column. One could even think that an un-partitioned table is just a partitioned table with only one partition and a static number of shards.

You can read about the best practices at: Guide to sharding and partitioning best practices in CrateDB

View full post