In the context of Generative AI, multimodal vector embeddings are becoming increasingly popular. Whatever the kind of source data (text, images, audio, or video), an embedding algorithm of your choice translates it into a vector representation. The number of values in this vector, its dimensionality, depends on the algorithm used. These vectors, along with the corresponding chunks of the source data, are then stored in a vector store.
Vector databases are ideal for tasks such as similarity search, natural language processing, and computer vision. They provide a structured way to comprehend intricate patterns within large volumes of data. The process of integrating this vector data with CrateDB is straightforward, thanks to its native SQL interface.
CrateDB offers a FLOAT_VECTOR(n) data type, where you specify the length of the vector. Behind the scenes, this creates an HNSW (Hierarchical Navigable Small World) graph for efficient nearest neighbour search. The KNN_MATCH function executes an approximate K-nearest neighbour (KNN) search and uses Euclidean distance to determine similar vectors. You pass in the vector column, the target vector, and the number of nearest neighbours you wish to discover.
The example below illustrates how to create a table with both a text field and a 4-dimensional embedding field, insert records with a simple INSERT INTO statement, and use the KNN_MATCH function to perform a similarity search.
CREATE TABLE word_embeddings (
    text STRING PRIMARY KEY,
    embedding FLOAT_VECTOR(4)
);
INSERT INTO word_embeddings (text, embedding)
VALUES
    ('Exploring the cosmos', [0.1, 0.5, -0.2, 0.8]),
    ('Discovering moon', [0.2, 0.4, 0.1, 0.7]),
    ('Discovering galaxies', [0.2, 0.4, 0.2, 0.9]),
    ('Sending the mission', [0.5, 0.9, -0.1, -0.7]);
SELECT text, _score
FROM word_embeddings
WHERE knn_match(embedding, [0.3, 0.6, 0.0, 0.9], 2)
ORDER BY _score DESC;
| text                  | _score   |
|-----------------------|----------|
| Discovering galaxies  | 0.917431 |
| Discovering moon      | 0.909090 |
| Exploring the cosmos  | 0.909090 |
| Sending the mission   | 0.270270 |
The example below shows how to search for data similar to an existing entry, 'Discovering galaxies'. For that, you combine the KNN_MATCH function with a sub-select query that returns the embedding associated with 'Discovering galaxies'.
SELECT text, _score
FROM word_embeddings
WHERE knn_match(
    embedding,
    (SELECT embedding FROM word_embeddings WHERE text = 'Discovering galaxies'),
    2
)
ORDER BY _score DESC;
| text                  | _score   |
|-----------------------|----------|
| Discovering galaxies  | 1        |
| Discovering moon      | 0.952381 |
| Exploring the cosmos  | 0.840336 |
| Sending the mission   | 0.250626 |
Combining Vectors, Source and Contextual Information
Combining your vector data (vectorized chunks of your source data) with the original data and some additional contextual information is very powerful: a single SQL query can then return not only the nearest vectors, but also the original text and its metadata.
As we will outline in this chapter, a JSON payload offers the most flexible way to store and query your metadata. A typical table schema contains a FLOAT_VECTOR column for the embedding and an OBJECT column for the source and contextual information.
In the example below, the table contains a FLOAT_VECTOR column with 1536 dimensions. If you are using multiple embedding algorithms, you can add further columns with different vector lengths; a sketch of such a schema follows the statement below.
CREATE TABLE input_values (
    source OBJECT(DYNAMIC),
    embedding FLOAT_VECTOR(1536)
);
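For instance, if you wanted to keep embeddings from two different models side by side, a minimal sketch of such a schema could look as follows; the column names and the 384-dimensional second vector are assumptions chosen for illustration only:

-- Hypothetical schema holding embeddings from two different models per record
CREATE TABLE input_values_multi (
    source OBJECT(DYNAMIC),
    embedding_model_a FLOAT_VECTOR(1536),  -- e.g. a 1536-dimensional model
    embedding_model_b FLOAT_VECTOR(384)    -- e.g. a 384-dimensional model
);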
In an INSERT statement, you can simply pass your existing JSON data, such as a chunk of text extracted from a PDF file or any other information source, together with the embedding generated by your preferred algorithm. If subsequent pieces of source data carry different annotations, context information, or metadata, you can simply add them to the JSON document; because the OBJECT column is DYNAMIC, the new attributes automatically appear as new columns in the table (see the sketch after the example below).
INSERT INTO input_values (source, embedding) VALUES (
    '{
        "id": "chunk_001",
        "text": "This is the first chunk of text. It contains some information that will be vectorized.",
        "metadata": {
            "author": "Author A",
            "date": "2024-03-15",
            "category": "Education"
        },
        "annotations": [
            { "type": "keyword", "value": "vectorized" },
            { "type": "sentiment", "value": "neutral" }
        ],
        "context": {
            "previous_chunk": "",
            "next_chunk": "chunk_002",
            "related_topics": ["Data Processing", "Machine Learning"]
        }
    }',
    [1.2, 2.1, ..., 3.2]  -- Embedding created by your favorite algorithm
);
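To illustrate the dynamic schema behaviour, here is a sketch of a second record that carries an additional language attribute not present in the first chunk; the attribute name and values are assumptions chosen for illustration, and the embedding is abbreviated like in the example above. Because the source column is declared as OBJECT(DYNAMIC), CrateDB adds the corresponding column automatically and it becomes queryable right away.

-- Hypothetical second chunk with an extra "language" metadata attribute
INSERT INTO input_values (source, embedding) VALUES (
    '{
        "id": "chunk_002",
        "text": "This is the second chunk of text.",
        "metadata": {
            "author": "Author B",
            "date": "2024-03-16",
            "category": "Education",
            "language": "en"
        }
    }',
    [0.7, 1.4, ..., 2.9]  -- Embedding created by your favorite algorithm
);

-- The new attribute is immediately available as a queryable column
SELECT source['id'], source['metadata']['language'] FROM input_values;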
Adding Filters to Similarity Search
You can also add query filters to your similarity search easily.
The example below shows how to search for similar text snippets in the 'Education' category. CrateDB's flexibility allows you to combine the vector search with any other filters you need, such as geospatial filters (see the sketch after the query below), making it adaptable to your specific use case.
SELECT source['id'], source['text']
FROM input_values
WHERE knn_match(embedding, ?, 10)  -- Embedding to search
  AND source['metadata']['category'] = 'Education'
ORDER BY _score DESC
LIMIT 10;
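As a sketch of such a geospatial filter, assume a variant of the table with an explicit GEO_POINT column named location; the column name and coordinates are assumptions for illustration. The query combines the vector search with CrateDB's distance() function to keep only matches within roughly 50 km of a given point:

-- Hypothetical variant of the table with an explicit GEO_POINT column
CREATE TABLE input_values_geo (
    source OBJECT(DYNAMIC),
    location GEO_POINT,
    embedding FLOAT_VECTOR(1536)
);

-- Vector search restricted to records near a given coordinate
SELECT source['id'], source['text']
FROM input_values_geo
WHERE knn_match(embedding, ?, 10)                          -- Embedding to search
  AND distance(location, 'POINT(9.7417 47.4124)') < 50000  -- distance in metres
ORDER BY _score DESC
LIMIT 10;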