Vector Search
Vector search on machine learning embeddings: CrateDB is all you need.
Overview
CrateDB can be used as a vector database (VDBMS) for storing and retrieving vector embeddings. It builds on the FLOAT_VECTOR data type and its accompanying KNN_MATCH and VECTOR_SIMILARITY functions, which conduct HNSW-based semantic similarity searches, also known as vector search.
About
Vector search leverages machine learning (ML) to capture the meaning and context of unstructured data, including text and images, transforming it into a numeric representation.
Frequently used for semantic search, vector search finds similar data using approximate nearest neighbor (ANN) algorithms. Compared to traditional keyword search, vector search can yield more relevant results when meaning matters more than exact wording, and ANN indexes keep queries fast on large datasets.
Details
CrateDB uses Lucene as a storage layer, so it inherits the implementation and concepts of Lucene Vector Search, in the same spirit as Elasticsearch.
To learn more about the internals, please refer to the HNSW graph search algorithm, how Lucene implemented it, how Elasticsearch now also builds on it, and why, effectively, Lucene Is All You Need.
While Elasticsearch uses a query DSL based on JSON, in CrateDB, you can work with Lucene Vector Search using SQL.
Reference Manual
Related
SQL Semantic Search Machine Learning ML Embeddings Vector Store
Synopsis
Store and query word embeddings using similarity search based on Euclidean distance.
DDL
CREATE TABLE word_embeddings (
    text STRING PRIMARY KEY,
    embedding FLOAT_VECTOR(4)
);
DML
INSERT INTO word_embeddings (text, embedding)
VALUES
    ('Exploring the cosmos', [0.1, 0.5, -0.2, 0.8]),
    ('Discovering moon', [0.2, 0.4, 0.1, 0.7]),
    ('Discovering galaxies', [0.2, 0.4, 0.2, 0.9]),
    ('Sending the mission', [0.5, 0.9, -0.1, -0.7]);
DQL
WITH param AS
    (SELECT [0.3, 0.6, 0.0, 0.9] AS sv)
SELECT
    text,
    VECTOR_SIMILARITY(embedding, (SELECT sv FROM param)) AS score
FROM
    word_embeddings
WHERE
    KNN_MATCH(embedding, (SELECT sv FROM param), 2)
ORDER BY
    score DESC;
Result
+----------------------+-----------+
| text                 | score     |
+----------------------+-----------+
| Discovering galaxies | 0.9174312 |
| Exploring the cosmos | 0.9090909 |
| Discovering moon     | 0.9090909 |
| Sending the mission  | 0.2702703 |
+----------------------+-----------+
SELECT 4 rows in set (0.078 sec)
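The scores in the result table above can be checked by hand. The following is a minimal sketch, assuming that for Euclidean distance the score is 1 / (1 + d²), where d is the distance between the stored vector and the search vector; that assumption reproduces the four values shown. The function name `euclidean_similarity` is illustrative and not part of CrateDB.

```python
def euclidean_similarity(a, b):
    """Score two vectors as 1 / (1 + squared Euclidean distance).

    Assumption: this formula matches the scores in the result table;
    it is a client-side check, not CrateDB's implementation.
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    squared_distance = sum((x - y) ** 2 for x, y in zip(a, b))
    return 1.0 / (1.0 + squared_distance)


# Same data as in the DML and DQL statements above.
search_vector = [0.3, 0.6, 0.0, 0.9]
word_embeddings = {
    "Exploring the cosmos": [0.1, 0.5, -0.2, 0.8],
    "Discovering moon": [0.2, 0.4, 0.1, 0.7],
    "Discovering galaxies": [0.2, 0.4, 0.2, 0.9],
    "Sending the mission": [0.5, 0.9, -0.1, -0.7],
}

scores = {
    text: euclidean_similarity(vector, search_vector)
    for text, vector in word_embeddings.items()
}

# Highest score first, like ORDER BY score DESC.
ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
for text, score in ranked:
    print(f"{text:<20} {score:.7f}")
```

Note that the query returns four rows although the third argument to KNN_MATCH is 2. According to the KNN_MATCH reference, the number of neighbours is applied per shard, so a LIMIT clause is the usual way to cap the total number of rows.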
Usage
Working with vector data in CrateDB.
Pure SQL
CrateDB’s vector store features are available through SQL and can be used by any application that speaks it. A FLOAT_VECTOR value is fundamentally a plain array of floating point numbers, and it is communicated as such through CrateDB’s HTTP and PostgreSQL interfaces.
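To illustrate the point about plain arrays, here is a minimal sketch that sends the search vector as a JSON array to CrateDB's HTTP endpoint, using only the Python standard library. It assumes a CrateDB instance listening on http://localhost:4200 that contains the word_embeddings table from the Synopsis, and it assumes that a `?` placeholder bound to a list of numbers is accepted where a FLOAT_VECTOR is expected. The helper names `build_knn_request` and `run_knn_query` are hypothetical.

```python
import json
import urllib.request

# Assumption: CrateDB running locally on its default HTTP port.
CRATEDB_URL = "http://localhost:4200/_sql"


def build_knn_request(search_vector, k=2, url=CRATEDB_URL):
    """Build an HTTP request that runs the Synopsis query with a bound vector."""
    stmt = (
        "SELECT text, VECTOR_SIMILARITY(embedding, ?) AS score "
        "FROM word_embeddings "
        f"WHERE KNN_MATCH(embedding, ?, {int(k)}) "
        "ORDER BY score DESC"
    )
    # The vector travels as an ordinary JSON array of numbers,
    # once per placeholder.
    payload = {"stmt": stmt, "args": [search_vector, search_vector]}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_knn_query(search_vector, k=2):
    """Send the request and return the result rows (needs a running server)."""
    request = build_knn_request(search_vector, k)
    with urllib.request.urlopen(request) as response:
        result = json.loads(response.read())
    return result["rows"]


# Example call, commented out because it needs a reachable CrateDB instance:
# rows = run_knn_query([0.3, 0.6, 0.0, 0.9])
```

Separating request construction from sending makes it possible to inspect the payload without a database at hand. The same statement should also work through a PostgreSQL driver, with the driver's own placeholder syntax.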
Framework Integrations
CrateDB supports applications using the vector data type through corresponding framework adapters. The page about Machine Learning illustrates all of them, covering both machine learning operations (MLOps) and vector database operations (similarity search).
Learn
Learn how to set up your database for vector search, how to create the relevant indices, and how to semantically query your data efficiently. A few must-reads for anyone looking to make sense of large volumes of unstructured text data.
Tutorials
Vector Support and KNN Search through SQL
The addition of vector support and KNN search makes CrateDB the optimal multi-model database for all types of data. Whether it is structured, semi-structured, or unstructured data, CrateDB stands as the all-in-one solution, capable of handling diverse data types with ease.
In this feature-focused blog post, we will introduce how CrateDB can be used as a vector database and how the vector store is implemented. We will also explore the possibilities of the K-Nearest Neighbors (KNN) search, and demonstrate vector capabilities with easy-to-follow examples.
Introduction
Vector Store
SQL
Retrieval Augmented Generation (RAG) with CrateDB and SQL
This notebook illustrates CrateDB’s vector store using pure SQL, by means of an example exercising a RAG workflow. It uses the white paper Time series data in manufacturing as input data, generates embeddings using OpenAI’s ChatGPT, stores them into a table using FLOAT_VECTOR(1536), and queries it using the KNN_MATCH and VECTOR_SIMILARITY functions.
Fundamentals
Vector Store
LangChain
pandas
SQL
Technologies
Support for Vector Search in Apache Lucene
Uwe Schindler talks at Berlin Buzzwords 2023 about the new vector search features of Lucene 9, and about the journey of implementing HNSW from 2016 to 2021.
Fundamentals Lucene Vector Search
See also
Features: Advanced Querying • Full-Text Search
Domains: Industrial Data • Machine Learning • Time Series Data
Product: Relational Database • Vector Database