Vector Search

Vector search on machine learning embeddings: CrateDB is all you need.

Overview

CrateDB can be used as a vector database (VDBMS) for storing and retrieving vector embeddings.

CrateDB’s FLOAT_VECTOR data type stores vector embeddings, and its accompanying KNN_MATCH and VECTOR_SIMILARITY functions run k-nearest neighbor (kNN) searches to find vectors that are similar to a query vector. The search is HNSW-based and is commonly known as semantic similarity search, or vector search.

About

Vector search leverages machine learning (ML) to capture the meaning and context of unstructured data, including text and images, transforming it into a numeric representation.

Frequently used for semantic search, vector search finds similar data using approximate nearest neighbor (ANN) algorithms. Compared to traditional keyword search, vector search can yield more relevant results and execute faster.

Feature vectors are computed from raw data via ML methods such as feature extraction, word embeddings, or deep neural networks.
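To make the idea concrete, here is a minimal sketch of feature extraction followed by an exact nearest-neighbor scan. The toy vocabulary and the `embed`, `euclidean`, and `nearest` helpers are hypothetical names chosen for illustration. Real embeddings come from trained models, and ANN indexes such as HNSW approximate this exhaustive scan rather than performing it.

```python
import math
from collections import Counter

# Hypothetical toy vocabulary; real systems learn embeddings from data.
VOCABULARY = ["moon", "galaxies", "cosmos", "mission"]


def embed(text: str) -> list[float]:
    """Turn text into a feature vector of per-word counts."""
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in VOCABULARY]


def euclidean(a: list[float], b: list[float]) -> float:
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest(query: str, documents: list[str], k: int) -> list[str]:
    """Exact kNN: compare the query against every document, keep the k closest."""
    q = embed(query)
    return sorted(documents, key=lambda doc: euclidean(embed(doc), q))[:k]


documents = [
    "Exploring the cosmos",
    "Discovering moon",
    "Discovering galaxies",
    "Sending the mission",
]
top = nearest("galaxies in the cosmos", documents, k=2)
print(top)
```

The exhaustive scan costs one distance computation per stored vector, which is the cost that graph-based ANN indexes are designed to avoid.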

Details

CrateDB uses Lucene as a storage layer, so it inherits the implementation and concepts of Lucene Vector Search, in the same spirit as Elasticsearch.

To learn more details about what’s inside, please refer to the HNSW graph search algorithm, how Lucene implemented it, how Elasticsearch now also builds on it, and why effectively Lucene Is All You Need.

While Elasticsearch uses a query DSL based on JSON, in CrateDB, you can work with Lucene Vector Search using SQL.

Reference Manual

SQL Semantic Search Machine Learning ML Embeddings Vector Store

Synopsis

Store and query word embeddings using similarity search based on Euclidean distance.

DDL

CREATE TABLE word_embeddings (
  text STRING PRIMARY KEY,
  embedding FLOAT_VECTOR(4)
);

DML

INSERT INTO word_embeddings (text, embedding)
VALUES
  ('Exploring the cosmos', [0.1, 0.5, -0.2, 0.8]),
  ('Discovering moon', [0.2, 0.4, 0.1, 0.7]),
  ('Discovering galaxies', [0.2, 0.4, 0.2, 0.9]),
  ('Sending the mission', [0.5, 0.9, -0.1, -0.7])
;

DQL

WITH param AS
  (SELECT [0.3, 0.6, 0.0, 0.9] AS sv)
SELECT
  text,
  VECTOR_SIMILARITY(embedding, (SELECT sv FROM param))
    AS score
FROM
  word_embeddings
WHERE
  KNN_MATCH(embedding, (SELECT sv FROM param), 2)
ORDER BY
  score DESC;

Result

+----------------------+-----------+
| text                 |     score |
+----------------------+-----------+
| Discovering galaxies | 0.9174312 |
| Exploring the cosmos | 0.9090909 |
| Discovering moon     | 0.9090909 |
| Sending the mission  | 0.2702703 |
+----------------------+-----------+
SELECT 4 rows in set (0.078 sec)
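The scores in this result can be checked by hand. The sketch below assumes that the score is derived from the squared Euclidean distance d² as 1 / (1 + d²), the mapping Lucene uses for Euclidean similarity. That formula is an assumption here, not something the synopsis states, but it reproduces the values in the table to the precision shown.

```python
def euclidean_similarity(a, b):
    """Assumed score mapping: 1 / (1 + squared Euclidean distance)."""
    squared_distance = sum((x - y) ** 2 for x, y in zip(a, b))
    return 1.0 / (1.0 + squared_distance)


# Query vector and rows copied from the DML and DQL statements above.
query = [0.3, 0.6, 0.0, 0.9]
rows = {
    "Exploring the cosmos": [0.1, 0.5, -0.2, 0.8],
    "Discovering moon": [0.2, 0.4, 0.1, 0.7],
    "Discovering galaxies": [0.2, 0.4, 0.2, 0.9],
    "Sending the mission": [0.5, 0.9, -0.1, -0.7],
}

scores = {text: euclidean_similarity(vec, query) for text, vec in rows.items()}

# Print rows ordered like the SQL result (highest score first).
for text, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{text:22s} {score:.7f}")
```

Note that the result contains four rows although the query passes 2 as the third argument to KNN_MATCH. This is consistent with the limit being applied per shard rather than to the table as a whole. Treat that reading as an assumption to confirm against the KNN_MATCH reference.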

Usage

Working with vector data in CrateDB.

Pure SQL

CrateDB’s vector store features are available through SQL and can be used by any application that speaks it. The FLOAT_VECTOR data type is fundamentally a plain array of floating point numbers, so it is transmitted as such through CrateDB’s HTTP and PostgreSQL interfaces.
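As an illustration of the HTTP route, the following sketch builds a request for CrateDB's `/_sql` endpoint and passes the query vector as an ordinary JSON array. It makes several assumptions: an instance at `localhost:4200`, the `word_embeddings` table from the synopsis, and that `?` placeholders are accepted inside KNN_MATCH and VECTOR_SIMILARITY. The `build_knn_request` helper is a hypothetical name, and the request is only constructed, not sent.

```python
import json
from urllib.request import Request

# Assumed address of a local CrateDB instance; adjust to your setup.
CRATEDB_URL = "http://localhost:4200/_sql"


def build_knn_request(query_vector, k):
    """Build (but do not send) an HTTP request running a kNN query."""
    stmt = (
        "SELECT text, VECTOR_SIMILARITY(embedding, ?) AS score "
        "FROM word_embeddings "
        "WHERE KNN_MATCH(embedding, ?, ?) "
        "ORDER BY score DESC"
    )
    # The vector travels as a plain JSON array of numbers.
    payload = {"stmt": stmt, "args": [query_vector, query_vector, k]}
    return Request(
        CRATEDB_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_knn_request([0.3, 0.6, 0.0, 0.9], k=2)
print(request.data.decode("utf-8"))
# To run it against a live database: urllib.request.urlopen(request)
```

Binding the vector as a parameter keeps the statement text constant across queries and avoids formatting floating point numbers into SQL strings by hand.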

Framework Integrations

CrateDB supports applications using the vector data type through corresponding framework adapters. The page about Machine learning illustrates all of them, covering both machine learning operations (MLOps) and vector database operations (similarity search).

Learn

Learn how to set up your database for vector search, how to create the relevant indices, and how to semantically query your data efficiently. A few must-reads for anyone looking to make sense of large volumes of unstructured text data.

Tutorials

Vector Support and KNN Search through SQL

The addition of vector support and KNN search makes CrateDB the optimal multi-model database for all types of data. Whether it is structured, semi-structured, or unstructured data, CrateDB stands as the all-in-one solution, capable of handling diverse data types with ease.

In this feature-focused blog post, we will introduce how CrateDB can be used as a vector database and how the vector store is implemented. We will also explore the possibilities of the K-Nearest Neighbors (KNN) search, and demonstrate vector capabilities with easy-to-follow examples.

Introduction
Vector Store
SQL

Retrieval Augmented Generation (RAG) with CrateDB and SQL

This notebook illustrates CrateDB’s vector store using pure SQL, by way of an example that exercises a RAG workflow.

It uses the white paper Time series data in manufacturing as input data, generates embeddings using an OpenAI embedding model, stores them in a table using FLOAT_VECTOR(1536), and queries it using the KNN_MATCH and VECTOR_SIMILARITY functions.
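The two text-handling ends of such a workflow can be sketched without any external service: splitting the input document into chunks before embedding, and assembling a prompt from the chunks that a KNN_MATCH query returns. This is not the notebook's code. The `chunk_text` and `build_prompt` helpers, the chunk sizes, and the prompt wording are all hypothetical, and the embedding and retrieval steps in between are deliberately left out.

```python
def chunk_text(text: str, size: int, overlap: int) -> list[str]:
    """Split text into fixed-size character chunks that overlap slightly."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    return [text[start:start + size] for start in range(0, len(text), step)]


def build_prompt(question: str, retrieved_chunks: list[str]) -> str:
    """Combine retrieved context and the user question into one prompt."""
    context = "\n---\n".join(retrieved_chunks)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


# Each chunk would be embedded and stored in a FLOAT_VECTOR column.
chunks = chunk_text("abcdefghij", size=4, overlap=2)
print(chunks)

# After similarity search, the best-matching chunks feed the language model.
prompt = build_prompt("What is stored?", chunks[:2])
print(prompt)
```

Overlapping chunks reduce the chance that a relevant sentence is cut in half at a chunk boundary, at the cost of storing some text twice.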

Fundamentals
Vector Store
LangChain
pandas
SQL

Technologies

Support for Vector Search in Apache Lucene

Uwe Schindler talks at Berlin Buzzwords 2023 about the new vector search features of Lucene 9, and about the journey of implementing HNSW from 2016 to 2021.

Uwe Schindler - What’s coming next with Apache Lucene?

Fundamentals Lucene Vector Search