Unlocking the power of vector support and KNN search in CrateDB

With CrateDB version 5.5, we are thrilled to introduce two exciting features: vector support and KNN search! This addition rounds out CrateDB as a multi-model database: whether your data is structured, semi-structured, or unstructured, CrateDB can store and query it in one place.

In this feature-focused blog post, we will introduce how CrateDB can be used as a vector database and how the vector store is implemented. We will also explore the possibilities of the K-Nearest Neighbors (KNN) search, and demonstrate vector capabilities with easy-to-follow examples.

Introduction

With the growing importance and popularity of AI technologies, vector stores and vector search have emerged as critical components of many applications. Vectors, mathematical representations of data points, provide a versatile way to represent data, enabling AI systems to efficiently capture patterns in information like text and images. Vector search allows efficient and accurate retrieval of semantically similar items, facilitating tasks such as NLP, image matching, or content recommendation. In these use cases, a piece of data is represented as a vector in a multi-dimensional space. By measuring the distance between two vectors, we can tell how similar the corresponding items are.

KNN (k-nearest neighbors) search is an algorithm that finds the 'k' closest data points to a given query point based on a specified distance metric. By identifying the most similar neighbors to a target point, KNN search enables tasks like finding similar records, recognizing patterns in data, and detecting outliers.
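To make the definition concrete, here is a minimal brute-force sketch of exact KNN in Python. The function name `knn` and the sample points are made up for illustration; this is not how CrateDB executes a search, only the textbook idea of ranking every point by its distance to the query.

```python
import heapq
import math

def knn(points, query, k):
    """Return the k points closest to `query` by Euclidean distance.

    Exact, brute-force search: every point is compared with the query.
    """
    return heapq.nsmallest(k, points, key=lambda p: math.dist(p, query))

# Hypothetical sample data for illustration only.
points = [(1.0, 1.0), (2.0, 2.0), (5.0, 5.0), (9.0, 9.0)]
nearest = knn(points, (1.2, 1.2), 2)
print(nearest)  # [(1.0, 1.0), (2.0, 2.0)]
```

The cost of this approach grows linearly with the number of stored points, which is why large datasets usually rely on approximate methods instead.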

KNN search in CrateDB

Vector data type

A vector in CrateDB is a one-dimensional array-like structure that consists of multiple values, often referred to as components or features. Each element in the vector holds numerical information that can represent one or more attributes or characteristics of an object. CrateDB offers vector support with the float_vector data type that allows you to store dense vectors of float values of fixed length:

CREATE TABLE my_data (
  xs FLOAT_VECTOR(2)
);
INSERT INTO my_data VALUES ([1.6, 2.7]), ([4.6, 7.8]);

CrateDB supports vectors with up to 2048 dimensions. This allows the outputs of major embedding models, such as those from OpenAI or Hugging Face, to be stored in CrateDB effectively.
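Since the vector length is fixed per column and bounded, it can be handy to check embeddings on the client side before inserting them. The helper below is a hypothetical sketch (`check_vector` and `MAX_DIMENSIONS` are our own names, not part of any CrateDB client); it only mirrors the two constraints mentioned above.

```python
MAX_DIMENSIONS = 2048  # upper bound on dimensions stated in the text above

def check_vector(values, dimensions):
    """Validate a candidate value for a FLOAT_VECTOR(dimensions) column.

    Hypothetical client-side helper: returns the values as floats,
    or raises ValueError if the length constraints are violated.
    """
    if not 1 <= dimensions <= MAX_DIMENSIONS:
        raise ValueError(f"dimensions must be between 1 and {MAX_DIMENSIONS}")
    if len(values) != dimensions:
        raise ValueError(f"expected {dimensions} values, got {len(values)}")
    return [float(v) for v in values]

print(check_vector([1.6, 2.7], 2))  # [1.6, 2.7]
```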

Approximate nearest neighbor search by Lucene

Many of CrateDB’s core features are based on the Apache Lucene library, an open-source project providing powerful search and indexing capabilities. This also includes the new features for storing and searching numeric vectors. The vector search in CrateDB builds on Lucene's implementation of ANN (Approximate Nearest Neighbor) search. ANN search aims to return a close approximation of the nearest neighbors within a certain margin of error, which allows for a significant speedup in retrieval times.

The ANN search in Lucene uses the Hierarchical Navigable Small World (HNSW) algorithm. It constructs a navigable graph that hierarchically organizes data points, creating connections that help guide the search process efficiently. This approach enables quick identification of candidate neighbors and reduces the computational burden of exhaustive search, which makes it particularly well-suited for large-scale data sets and real-time applications.
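The core idea of walking a layered graph toward the query can be sketched in a few lines of Python. This is a heavily simplified toy, not Lucene's implementation: the two layers are built by hand, and the search keeps a single candidate per layer. Real HNSW assigns nodes to layers randomly and tracks a list of candidates at the bottom layer. All names (`greedy_search`, `search_layers`, the sample graph) are assumptions made for illustration.

```python
import math

def greedy_search(neighbors, vectors, query, entry):
    """Follow graph edges toward the query until no neighbor is closer."""
    current = entry
    current_dist = math.dist(vectors[current], query)
    improved = True
    while improved:
        improved = False
        for candidate in neighbors[current]:
            d = math.dist(vectors[candidate], query)
            if d < current_dist:
                current, current_dist = candidate, d
                improved = True
    return current

def search_layers(layers, vectors, query, entry):
    """Descend from the sparse top layer to the dense bottom layer.

    The best node found on one layer is the entry point for the next.
    """
    for neighbors in layers:
        entry = greedy_search(neighbors, vectors, query, entry)
    return entry

# Eight points on a line; node i sits at (i, 0).
vectors = {i: (float(i), 0.0) for i in range(8)}
# Sparse top layer with long-range links, hand-built for this toy example.
top = {0: [4], 4: [0, 7], 7: [4]}
# Dense bottom layer: every node links to its direct neighbors.
bottom = {i: [j for j in (i - 1, i + 1) if 0 <= j < 8] for i in range(8)}

result = search_layers([top, bottom], vectors, (5.2, 0.0), entry=0)
print(result)  # 5
```

The top layer lets the search jump from node 0 to node 4 in one hop, and the bottom layer refines the result to node 5, so only a handful of distances are computed instead of all eight.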

knn_match function

The knn_match(search_vector, query_vector, k) function in CrateDB performs an approximate k-nearest neighbors (KNN) search: it finds the k vectors that are most similar to a given query vector. It provides an efficient way to retrieve approximate nearest neighbors, balancing accuracy and speed. It takes three parameters:

  • search_vector: the column containing the vectors to search
  • query_vector: the vector for which you want to find the nearest neighbors
  • k: the number of vectors most similar to query_vector to return from each shard
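Because k is applied per shard, the combined result can hold more than k rows. The following Python sketch simulates that behavior under simplifying assumptions: the shards are plain lists, the search within a shard is exact rather than approximate, and the function name `knn_across_shards` and the data are invented for illustration.

```python
import heapq

def squared_distance(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def knn_across_shards(shards, query, k):
    """Take the k nearest vectors from each shard, then merge and sort them."""
    hits = []
    for rows in shards:
        hits.extend(heapq.nsmallest(k, rows, key=lambda v: squared_distance(v, query)))
    return sorted(hits, key=lambda v: squared_distance(v, query))

# Hypothetical distribution of five vectors over two shards.
shards = [
    [(0.0, 0.0), (1.0, 1.0), (9.0, 9.0)],
    [(0.5, 0.5), (8.0, 8.0)],
]
merged = knn_across_shards(shards, (0.0, 0.0), k=2)
print(len(merged))  # 4 rows, although k=2
print(merged[:2])   # [(0.0, 0.0), (0.5, 0.5)]
```

Truncating the merged list, as `merged[:2]` does here, corresponds to capping the overall result size on top of the per-shard k.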

As illustrated by the following example, you can use the knn_match function to find vectors in the xs column similar to vector [3.14, 1.33]:

SELECT xs, _score FROM my_data
WHERE knn_match(xs, [3.14, 1.33], 2)
ORDER BY _score DESC;

In CrateDB, the _score measures the similarity between each stored vector and the query vector. The similarity is based on the Euclidean distance, and the _score is computed as:

1 / (1 + squareDistance(v1, v2));

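Euclidean distance provides a straightforward measure of how far apart two points are and is commonly used in ML applications such as clustering, classification, or similarity analysis. With this formula, identical vectors get a _score of 1, and the _score approaches 0 as the distance grows.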
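The formula is easy to reproduce outside the database. The Python sketch below applies it to the two vectors inserted into my_data earlier and the query vector [3.14, 1.33]; the function names `knn_score` and `distance_from_score` are our own, and the second one simply inverts the formula to get the Euclidean distance back from a score.

```python
import math

def knn_score(v1, v2):
    """Score from the formula above: 1 / (1 + squared Euclidean distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(v1, v2))
    return 1.0 / (1.0 + squared_distance)

def distance_from_score(score):
    """Invert the formula to recover the Euclidean distance from a score."""
    return math.sqrt(max(0.0, 1.0 / score - 1.0))

query = [3.14, 1.33]
rows = [[1.6, 2.7], [4.6, 7.8]]  # the two vectors inserted into my_data

# Print each vector with its score and distance, best match first.
for xs in sorted(rows, key=lambda v: knn_score(v, query), reverse=True):
    score = knn_score(xs, query)
    print(xs, round(score, 4), round(distance_from_score(score), 4))
```

The vector [1.6, 2.7] is closer to the query than [4.6, 7.8], so it gets the higher score and is listed first.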

Example with word embeddings

To illustrate vector similarity search in CrateDB, we will use a simple dataset of text documents and their corresponding word embeddings (real-valued vectors that encode the meaning of the text). First, we will create a word_embeddings table where the text column holds the document text and the embedding column contains the word embedding as a float vector:

CREATE TABLE word_embeddings (
    text STRING PRIMARY KEY,
    embedding FLOAT_VECTOR(4)
);

For demonstration purposes, we will insert data with simplified embeddings:

INSERT INTO word_embeddings (text, embedding)
VALUES
    ('Exploring the cosmos', [0.1, 0.5, -0.2, 0.8]),
    ('Discovering moon', [0.2, 0.4, 0.1, 0.7]),
    ('Discovering galaxies', [0.2, 0.4, 0.2, 0.9]),
    ('Sending the mission', [0.5, 0.9, -0.1, -0.7])
;

Now that the embeddings are stored in CrateDB, let’s find similar embeddings for a specific query. We will use the word embedding [0.3, 0.6, 0.0, 0.9] for the query “Space exploration” and k=2:

SELECT text, _score
FROM word_embeddings
WHERE knn_match(embedding, [0.3, 0.6, 0.0, 0.9], 2)
ORDER BY _score DESC;

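Running this query returns the text and relevance score of each match, ranked by similarity to the query embedding. Note that the result can contain more than k rows, as it does here, because k applies to each shard; a LIMIT clause can be added to cap the total number of rows.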

|         text           | _score |
|------------------------|--------|
|Discovering galaxies    |0.917431|
|Discovering moon        |0.909090|
|Exploring the cosmos    |0.909090|
|Sending the mission     |0.270270|
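As a sanity check, the scores in this table can be recomputed from the _score formula given earlier, 1 / (1 + squared distance). The Python sketch below does that for the four sample rows; `knn_score` is our own helper name, and the script is a client-side recomputation, not a description of how CrateDB evaluates the query.

```python
def knn_score(v1, v2):
    """1 / (1 + squared Euclidean distance), as in the _score formula."""
    return 1.0 / (1.0 + sum((a - b) ** 2 for a, b in zip(v1, v2)))

documents = {
    "Exploring the cosmos": [0.1, 0.5, -0.2, 0.8],
    "Discovering moon": [0.2, 0.4, 0.1, 0.7],
    "Discovering galaxies": [0.2, 0.4, 0.2, 0.9],
    "Sending the mission": [0.5, 0.9, -0.1, -0.7],
}
query = [0.3, 0.6, 0.0, 0.9]  # embedding used for "Space exploration"

# Rank the texts by score, highest first, and print them.
ranked = sorted(documents, key=lambda text: knn_score(documents[text], query), reverse=True)
for text in ranked:
    print(text, knn_score(documents[text], query))
```

The squared distances work out to 0.09, 0.10, 0.10, and 2.70, which gives 1/1.09, 1/1.10, 1/1.10, and 1/3.70. "Discovering moon" and "Exploring the cosmos" are equally far from the query, so their relative order is arbitrary, and the table appears to truncate the scores to six decimals rather than round them.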

Similarly, we can compare the other data points with an example text already stored in the table. For instance, let’s see how similar the other text items are to “Discovering galaxies”:

SELECT text, _score
FROM word_embeddings
WHERE knn_match(embedding, (SELECT embedding FROM word_embeddings WHERE text = 'Discovering galaxies'), 2)
ORDER BY _score DESC;

Finally, the result shows text data ranked by their similarity:

|         text           | _score |
|------------------------|--------|
|Discovering galaxies    |1       |
|Discovering moon        |0.952381|
|Exploring the cosmos    |0.840336|
|Sending the mission     |0.250626|

As expected, we can observe a high similarity between the “Discovering galaxies” and “Discovering moon” records. The above examples can easily be extended with real-world data as a starting point for many interesting applications.

Wrap up

We have taken a first step into similarity-based querying with CrateDB's vector store and vector search. These tools make it possible to explore relationships and patterns within datasets and enable a variety of AI applications. To try the vector capabilities yourself, start a cluster on CrateDB Cloud, a streamlined solution for seamless, secure, and scalable data operations. For questions and more tutorials, visit our documentation and join the CrateDB community.