
Reducing storage costs in CrateDB v5.10

As businesses generate and analyze ever-growing datasets, managing storage costs becomes a critical concern. Without proper optimization, storage costs can escalate due to increasing data volume, inefficient indexing, and unnecessary data retention. But reducing storage costs is not just about saving money; it also impacts database performance, scalability, and maintainability.

In this blog post, we will explore how CrateDB managed to reduce storage costs while maintaining high performance and analytical capabilities.

New version – new format

CrateDB v5.10 introduces a new table storage format, automatically applied to new tables and partitions, that can reduce a table's data footprint on disk by as much as 50%.

Rows in CrateDB tables carry an internal field named _source, which serves three purposes: it is the entry in a data structure that lets individual insert statements return quickly, before the aggregate indexes are written to disk; it drives operations-based recovery and replication, ensuring data integrity in a distributed system; and it backs the special _doc system column, which can be used in SELECT statements both for querying and as a return value. This field can account for up to 50% of the disk space used by a table’s row data. The new storage format removes the third usage, which means that once data is fully committed to all replicas of a table, the field can be dropped entirely.

Transaction logs

CrateDB takes an index-everything approach to its row data, allowing users to efficiently run SELECT queries over exceptionally large tables. Indexes are by their nature aggregate data structures, built over many rows at once; but when you insert a row into a table, you want both an immediate response and a guarantee that the row has been added to the system and won’t disappear in the event of a crash. This gap between individual rows and bulk indexes is bridged by a data structure called the transaction log. As each row is added to an indexer as a collection of column entries, it is also consolidated into a binary blob and appended to the transaction log. Once the indexer contains enough rows to fill its memory buffer, it writes its contents to disk and tells the transaction log that it can remove the entries for those rows. If a node crashes or is shut down with rows still in the indexer’s buffer, then on restart it can replay the entries from the transaction log and ensure that no data is lost.
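The interplay between the in-memory indexer buffer and the transaction log can be sketched in a few lines of Python. This is an illustrative model only, not CrateDB's actual implementation; all class and method names here are invented for the example.

```python
class TransactionLog:
    """Append-only log that survives crashes (modeled as a plain list here)."""
    def __init__(self):
        self.entries = []

    def append(self, row_blob):
        self.entries.append(row_blob)

    def truncate(self, count):
        # The indexer has durably flushed `count` rows; their entries
        # are no longer needed for crash recovery.
        self.entries = self.entries[count:]


class Indexer:
    """Buffers rows in memory and flushes them to disk in bulk."""
    def __init__(self, translog, buffer_size=3):
        self.translog = translog
        self.buffer = []
        self.buffer_size = buffer_size
        self.flushed = []  # stands in for on-disk index segments

    def insert(self, row):
        # Append to the transaction log first, so the row is durable
        # before the insert is acknowledged.
        self.translog.append(row)
        self.buffer.append(row)
        if len(self.buffer) >= self.buffer_size:
            self.flush()

    def flush(self):
        self.flushed.extend(self.buffer)           # write a segment to disk
        self.translog.truncate(len(self.buffer))   # drop the covered entries
        self.buffer = []

    def recover(self):
        # After a crash, replay whatever is still in the transaction log.
        self.buffer = list(self.translog.entries)
```

Inserting four rows with a buffer of three flushes the first three to the "index" and leaves only the fourth in the log, which is exactly what a restarted node would replay.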

Replication 

CrateDB allows table data to be replicated across multiple machines. In the normal run of things, rows are added to a primary shard and then sent to all replica shards. However, if a replica falls behind for whatever reason – a network error, or a node restart or crash – then it can ask its primary for any rows it has missed when it rejoins the cluster. Sometimes this can be achieved by directly copying index files from one node to another, but if the missing rows form only part of an index, then the replica needs to replay individual row operations. If those operations are still present in the primary’s transaction log, they can be replayed from there; if the data has been committed on the primary and the rows are no longer in the transaction log, they need to be read directly from the index. For this reason, transaction log entries are also stored in the index, in an internal column called _source.

The cluster keeps track of how up to date each copy of a shard’s data is by using objects called retention leases.
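The role of retention leases can be illustrated with a small model: each copy of a shard holds a lease at the sequence number up to which it has received all operations, and the primary may only discard history below the minimum lease. This is a simplified sketch, and all names in it are invented for illustration.

```python
class PrimaryShard:
    def __init__(self):
        self.ops = {}       # seq_no -> operation: the retained history
        self.next_seq = 0
        self.leases = {}    # copy id -> lowest seq_no that copy still needs

    def index(self, op):
        self.ops[self.next_seq] = op
        self.next_seq += 1

    def renew_lease(self, copy_id, seq_no):
        # The copy acknowledges it holds everything below seq_no.
        self.leases[copy_id] = seq_no

    def prune(self):
        # Operations below every lease can no longer be requested
        # by any copy, so they are safe to discard.
        floor = min(self.leases.values(), default=0)
        self.ops = {s: op for s, op in self.ops.items() if s >= floor}

    def missing_ops(self, copy_id):
        # Replay history to a copy that fell behind.
        start = self.leases.get(copy_id, 0)
        return [self.ops[s] for s in sorted(self.ops) if s >= start]
```

If replica A has acknowledged everything up to sequence number 5 but replica B only up to 2, pruning keeps operations 2 and above, and B can still be caught up by replay.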

Retrieval

The underlying Apache Lucene storage engine used by CrateDB contains several data structures for efficient search and retrieval of data. The important ones for retrieval are per-document lookup structures called DocValue fields, and block-based binary blob stores called Stored Fields. DocValue fields are used for sorting and querying and can also be used to retrieve values. Stored Fields are structured as compressed blocks containing the values for several documents at a time, and are slower to access than DocValue fields because the entire block needs to be decompressed to fetch the values for an individual row. The _source column is held in a Stored Field and is available to queries as the system column _doc. It can also be loaded and parsed to provide values for columns that are not stored individually in DocValue fields.
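The access-cost difference between the two structures can be sketched as follows. This is a toy model, not Lucene's actual on-disk format; the block size, compression scheme, and class names are all invented for illustration.

```python
import json
import zlib


class DocValues:
    """Per-column store: one value per document, addressable directly."""
    def __init__(self, values):
        self.values = values

    def get(self, doc_id):
        return self.values[doc_id]  # direct lookup, no decompression


class StoredFields:
    """Block store: documents are grouped and compressed together."""
    BLOCK = 4  # documents per block (illustrative)

    def __init__(self, docs):
        self.blocks = []
        for i in range(0, len(docs), self.BLOCK):
            raw = json.dumps(docs[i:i + self.BLOCK]).encode()
            self.blocks.append(zlib.compress(raw))

    def get(self, doc_id):
        # Fetching one document means decompressing its whole block.
        block = zlib.decompress(self.blocks[doc_id // self.BLOCK])
        return json.loads(block)[doc_id % self.BLOCK]
```

Both stores can answer a point lookup, but the Stored Fields path pays the cost of decompressing a full block of neighbouring documents each time.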

Data duplication

Each transaction log entry in _source is stored as a JSON-formatted map, which can conveniently be used to look up individual columns. However, this means that columns which are stored individually in DocValue fields have their data duplicated in this separate field, and for tables with many columns this can amount to a lot of extra bytes on disk. Once a row is fully committed to the index on all replicas, those extra bytes are wasted.

The solution

Not all column types support DocValue fields, and many column types allow DocValue fields to be disabled. CrateDB currently uses the _source field to retrieve values for these columns. If the columns that do not keep their data in DocValue fields store it in Stored Fields instead, then the _source field is no longer needed for retrieval at all, and no column has duplicated data. The _doc object can be reconstructed from either the DocValue or the Stored Field for each column.
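With per-column storage in place, the _doc object can be assembled column by column rather than parsed from a single JSON blob. A minimal sketch of that reconstruction, with all names invented for the example:

```python
def reconstruct_doc(doc_id, docvalue_columns, stored_columns):
    """Rebuild the _doc object for one row from per-column storage.

    docvalue_columns: {name: {doc_id: value}} for columns with DocValue fields
    stored_columns:   {name: {doc_id: value}} for columns kept in Stored Fields
    """
    doc = {}
    for name, column in docvalue_columns.items():
        if doc_id in column:
            doc[name] = column[doc_id]
    for name, column in stored_columns.items():
        if doc_id in column:
            doc[name] = column[doc_id]
    return doc
```

Every column contributes its own value from whichever structure holds it, so nothing needs to be read from a duplicated _source blob.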

When an index segment is rewritten as part of a merge, the indexer can check each row to see whether its _source field is still required. If no retention lease in the cluster refers to that row, then it is fully replicated and can safely be written out without a _source entry.

This can happen organically as part of a background merge during ongoing indexing or can be triggered explicitly by an OPTIMIZE call.
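The merge-time check can be sketched like this. It is a simplified model – in reality the comparison of sequence numbers against retention leases happens inside Lucene's merge machinery – and all names here are invented for illustration.

```python
def rewrite_segment(rows, retention_leases):
    """Rewrite a segment, dropping _source from fully replicated rows.

    rows: list of dicts with 'seq_no', 'columns', and an optional '_source'
    retention_leases: sequence numbers from which some shard copy
                      may still request operations
    """
    # Everything below the lowest lease can never be requested again.
    floor = min(retention_leases, default=float("inf"))
    out = []
    for row in rows:
        row = dict(row)
        if row["seq_no"] < floor:
            # No copy can ask for this operation any more: the
            # transaction-log entry kept in _source is dead weight.
            row.pop("_source", None)
        out.append(row)
    return out
```

With a lease still held at sequence number 2, rows 0 and 1 shed their _source on rewrite while rows 2 and 3 keep it for potential replay.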

This new format is automatically adopted by all tables and partitions created by clusters running CrateDB v5.10, so you should start to see a reduced data footprint for new data as soon as you upgrade.

Stay tuned!

We are working on comparing our internal benchmarks against real-world data!