Live Stream on Jan 23rd: Unlocking Real Time Insights in the Renewable Energy Sector with CrateDB

Register now
Skip to content
Blog

CrateDB Version 5.5: Vector Store

With the release of CrateDB version 5.5, we're excited to introduce new features, including vector store functionality and the ability to drop columns.

Vector Store

One of the prominent features included in this release is the added vector store functionality. This means it is now possible to store floating point vectors in CrateDB and do approximate KNN search queries against them. The current dimension limit a float_vector data type can be configured for is 2048.

Learn more about the vector data and vector store functionality

CREATE TABLE my_vectors (
  xs float_vector(2)
 );

INSERT INTO my_vectors (xs) VALUES ([3.14, 8.17]), ([14.3, 19.4]);

SELECT xs, _score FROM my_vectors
WHERE knn_match(xs, [3.14, 8], 2)
ORDER BY _score DESC;

Drop Columns

Additionally, we are happy to announce that the new release introduces the ability to DROP COLUMNs. Please be aware, that the storage space of any dropped column's value will still be occupied. Please look at the related documentation to learn more.

CREATE TABLE t1 (id int, obj object as (x int, y int));

ALTER TABLE t1 DROP COLUMN obj['y'];

-- Re-add same (sub-)column but with a different data type:
ALTER TABLE t1 ADD COLUMN obj['y'] text;

Operation Memory Limits

Another quite interesting administrative feature we’ve added is the ability to limit the memory usage a single operation can use. Compared to our existing memory protections (aka. Circuit Breaker), where only global memory limits shared across all queries can be set, this new feature lets users assign a memory limit a single operation can use (and thus a single query, more on this explained below). This is useful to prevent individual (heavy) queries from resulting in an unresponsive cluster.

You may wonder why this is called per operation and not per query, so let's dive into query execution a bit more in detail. Let’s use a simple query that returns all rows where their timestamp values match a concrete timestamp range:

SELECT ts, level FROM my_timeseries_data 
WHERE ts > '2023-10-17' AND ts < '2023-10-18';

If we look at the plans used for this query using the EXPLAIN statement

 EXPLAIN SELECT ts, level FROM my_timeseries_data 
         WHERE ts > '2023-10-17' AND ts < '2023-10-18';                                                                                                                                                      
+--------------------------------------------------------------------------------------------------------------------------------+
| QUERY PLAN                                                                                                                     |
+--------------------------------------------------------------------------------------------------------------------------------+
| Collect[doc.my_timeseries_data | [ts, level] | ((ts > 1697500800000::bigint) AND (ts < 1697587200000::bigint))] (rows=unknown) |
+--------------------------------------------------------------------------------------------------------------------------------+

we see that this results in one logical Collect plan which will map to one collect operation.

When adding an ordering and limit to the query

EXPLAIN SELECT ts, level FROM my_timeseries_data 
        WHERE ts > '2023-10-17' AND ts < '2023-10-18' 
        ORDER BY ts 
        LIMIT 10;                                                                                                                                 
+-------------------------------------------------------------------------------------------------------------------------------------------+
| QUERY PLAN                                                                                                                                |
+-------------------------------------------------------------------------------------------------------------------------------------------+
| Fetch[ts, level] (rows=10)                                                                                                                |
|   └ Limit[10::bigint;0] (rows=10)                                                                                                         |
|     └ OrderBy[ts ASC] (rows=unknown)                                                                                                      |
|       └ Collect[doc.my_timeseries_data | [_fetchid, ts] | ((ts > 1697500800000::bigint) AND (ts < 1697587200000::bigint))] (rows=unknown) |
+-------------------------------------------------------------------------------------------------------------------------------------------+

The plan shown looks different as now more logical plans are added to the overall plan, each one resulting in a concrete operation. One is collecting the matching rows, the next one will order them, while the last one will apply the final limit (ignoring the Fetch plan here for simplicity).

A very valid question coming to your head probably is: why do we expose such details to the user and let the user not just add a limit per query instead of kinda internal operations?

The answer is quite simple: code complexity and error-proneness.

CrateDB is a heavily distributed database using a parallel execution engine splitting the work across all data nodes inside a cluster. Thus the operations a query is using are spanned across nodes, which makes it quite hard to track memory consumption as this would result in a lot of synchronization and serialization steps (actual memory consumption must be passed to other nodes, memory tracking must support concurrent operation, etc.).

This release contains more interesting minor features, please have a look at the CrateDB 5.5 release notes for a complete list.