CrateDB v5.10 Release: 50% storage space reduction and fast outer joins

CrateDB 5.10, a new feature release, is available today. Highlights:

- A new table storage format that reduces size on disk by up to 50%.
- A hash join implementation for LEFT and RIGHT OUTER JOINs.
  - Previously, only INNER JOINs used CrateDB’s distributed hash join algorithm, while other joins were slower to execute.

Performance & Scaling

Table storage file format

Because CrateDB (typically) indexes all your data, the tradeoff has been that data is essentially stored twice: once in the table itself as the original source record, and once as the index used for querying. Starting in 5.10, the original record is rebuilt from the indexes, so there is no longer a need to store the source record at all. In our benchmarks, we’ve seen up to a 50% size reduction in the data files on disk! A separate blog post covers in depth how we made this possible.
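To make the idea concrete, here is a toy sketch of a store that keeps only per-column values and rebuilds a record from them when asked. It is an illustration only, not CrateDB's actual storage engine; the `ColumnStore` class and its methods are hypothetical names invented for this example.

```python
from collections import defaultdict


class ColumnStore:
    """Toy model: values are kept per column, no copy of the full record."""

    def __init__(self):
        # column name -> {row id -> value}; stands in for the per-column index
        self.columns = defaultdict(dict)
        self.next_row_id = 0

    def insert(self, record):
        """Store each field in its column and return the new row id."""
        row_id = self.next_row_id
        self.next_row_id += 1
        for name, value in record.items():
            self.columns[name][row_id] = value
        return row_id

    def rebuild(self, row_id):
        """Reassemble the record from the columns that hold this row id."""
        return {
            name: values[row_id]
            for name, values in self.columns.items()
            if row_id in values
        }


store = ColumnStore()
rid = store.insert({"sensor": "s1", "temp": 21.5})
print(store.rebuild(rid))
```

Because the record can be reassembled from the columns, a second copy of the full record is redundant, which is where the space saving comes from in this simplified picture.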

You might immediately be wondering how to upgrade to this new format. New partitions are created in the new file format. So if you use some form of time-based partitioning, where new partitions are created monthly or quarterly and old ones are deleted when their retention period runs out, the smoothest way to upgrade is to do absolutely nothing! Just sit back and wait until all your partitions have rotated into the new format.

If your schema is not partitioned by time, the other option is to re-insert the data into new tables. This is a manual process, and we will be investing in a solution that neither requires downtime nor depends on using time as your primary key.
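The manual re-insert could look roughly like the statement sequence generated below. This is a sketch under assumptions: the new table has already been created in 5.10 with the same columns as the old one, `SELECT *` is acceptable for the schema (generated columns may need an explicit column list), and writes to the old table are paused during the copy. The helper names and the example table names are hypothetical.

```python
def quote_ident(name):
    """Double-quote an SQL identifier, escaping embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def reinsert_statements(schema, old_table, new_table):
    """Return the SQL statements for copying old_table into new_table.

    Assumes new_table already exists with a matching schema.
    """
    old = f"{quote_ident(schema)}.{quote_ident(old_table)}"
    new = f"{quote_ident(schema)}.{quote_ident(new_table)}"
    return [
        # Copy all rows; the new table is written in the new storage format.
        f"INSERT INTO {new} (SELECT * FROM {old})",
        # Make the copied rows visible before switching over.
        f"REFRESH TABLE {new}",
        # Swap the names so clients keep using the original table name.
        f"ALTER CLUSTER SWAP TABLE {new} TO {old}",
    ]


for statement in reinsert_statements("doc", "metrics", "metrics_v2"):
    print(statement + ";")
```

After the swap, the old data lives under the `metrics_v2` name and can be dropped once the copy has been verified.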

Fast LEFT OUTER JOIN

CrateDB JOINs are state-of-the-art for distributed databases. This starts with the implementation on each data node, where the joined rows are matched with a hash-based algorithm. The implementation for LEFT OUTER equi-joins has changed to a hash join algorithm, which reduces the runtime from O(N*M) to O(N), that is, linear complexity. This increases performance significantly.

In 5.10 we expanded this algorithm so that it now also works for OUTER JOINs (both LEFT and RIGHT). As a reminder, OUTER JOINs are what you use when you want to include rows from one table even if they have no match in the joined table. In that case, the columns from the joined table are simply NULL.
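For readers who want to see why hashing avoids comparing every row with every other row, here is a minimal single-node sketch of a hash-based LEFT OUTER equi-join. It is a textbook illustration, not CrateDB's distributed implementation; the function name, parameters, and sample data are made up for this example.

```python
from collections import defaultdict


def hash_left_outer_join(left, right, key, right_columns):
    """Join two lists of dict rows on `key`, keeping every left row."""
    # Build phase: one pass over the right side, grouping rows by join key.
    # Rows with a NULL (None) key are skipped, since NULL never equals NULL.
    buckets = defaultdict(list)
    for row in right:
        if row[key] is not None:
            buckets[row[key]].append(row)

    # Probe phase: one pass over the left side, one hash lookup per row.
    null_row = {col: None for col in right_columns}
    result = []
    for row in left:
        matches = buckets.get(row[key])
        if matches:
            for match in matches:
                # Left values win if a column name exists on both sides.
                result.append({**match, **row})
        else:
            # No match: right-hand columns are filled with NULL (None).
            result.append({**null_row, **row})
    return result


customers = [
    {"customer_id": 1, "name": "Ada"},
    {"customer_id": 2, "name": "Bob"},
]
orders = [{"customer_id": 1, "total": 30}]

joined = hash_left_outer_join(customers, orders, "customer_id", ["total"])
for row in joined:
    print(row)
```

Each input is scanned once, so the work grows with the sum of the two input sizes plus the number of matches, instead of with their product as in a nested loop. A RIGHT OUTER JOIN can be expressed with the same routine by swapping the two inputs.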

User Experience

Object IGNORED behavior

As CrateDB has become a popular analytical database used together with document databases like MongoDB or DynamoDB, we have reviewed and fixed behavior in various corner cases when using OBJECT(DYNAMIC) or OBJECT(IGNORED) columns. More details about each fix are in the release notes.

Error messages in bulk inserts

Although CrateDB supports bulk inserts, such inserts are not atomic. Each inserted row can succeed or fail independently, and it can fail for a different reason than the next row. Starting in 5.10, for the HTTP client protocol, we return an array that includes all errors that happened across the bulk write.
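On the client side, the per-row results could be matched back to the submitted rows along these lines. This is a sketch under assumptions: the response is taken to contain a `results` array with one entry per row in `bulk_args`, a failed row is marked with `rowcount` of -2, and the error details sit under an `error` key with `code` and `message` fields. The exact shape should be checked against the HTTP endpoint documentation; the sample response below is hand-written for illustration and not captured from a real cluster.

```python
def failed_rows(bulk_args, response):
    """Pair each failed bulk row with its error details.

    Returns a list of (row index, submitted args, error or None) tuples.
    The "error" key is an assumed field name, see the note above.
    """
    failures = []
    pairs = zip(bulk_args, response["results"])
    for index, (args, result) in enumerate(pairs):
        if result.get("rowcount") == -2 or "error" in result:
            failures.append((index, args, result.get("error")))
    return failures


# Hypothetical payload and response, for illustration only.
bulk_args = [[1, "Ada"], [1, "Bob"], [2, "Eve"]]
sample_response = {
    "results": [
        {"rowcount": 1},
        {"rowcount": -2, "error": {"code": 4091, "message": "duplicate key"}},
        {"rowcount": 1},
    ]
}

for index, args, error in failed_rows(bulk_args, sample_response):
    print(f"row {index} {args} failed: {error}")
```

Keeping the row index alongside the error makes it straightforward to retry or log only the rows that failed, rather than resubmitting the whole batch.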

Admin

Lucene 9.12

The Lucene library was upgraded to the latest 9.12 version. This is a patch upgrade only.

Cluster-level JWT configuration

Previously the JWT module in the authentication framework allowed a user to configure a JWT identity provider for their own account. As of 5.10, the cluster administrator can configure a JWT identity provider for the entire cluster.

For a more detailed listing of changes in 5.10, see the release notes.

CrateDB 5.10 is available in CrateDB Cloud immediately!

