Vector Search

Vector search on machine learning embeddings: CrateDB is all you need.

Overview

CrateDB can be used as a vector database (VDBMS) for storing and retrieving vector embeddings. It provides the FLOAT_VECTOR data type together with the KNN_MATCH and VECTOR_SIMILARITY functions, which run HNSW-based semantic similarity searches, also known as vector search.

About

Vector search leverages machine learning (ML) to capture the meaning and context of unstructured data, including text and images, transforming it into a numeric representation.

Frequently used for semantic search, vector search finds similar data using approximate nearest neighbor (ANN) algorithms. Compared to traditional keyword search, it can return more relevant results when meaning matters more than exact wording, and ANN indexes keep lookups fast on large collections.
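To make the term concrete: an exact nearest neighbor search compares the query against every stored vector, and ANN algorithms such as HNSW approximate that result without the full scan. The following Python sketch shows only the exact baseline. The function name `exact_knn` and the sample vectors are illustrative assumptions and not part of CrateDB.

```python
import math


def exact_knn(query, vectors, k):
    """Return the k (distance, key) pairs closest to `query` by Euclidean distance."""
    # Brute force: compute the distance to every stored vector, then sort.
    scored = [(math.dist(query, vec), key) for key, vec in vectors.items()]
    return sorted(scored)[:k]


# Hypothetical two-dimensional sample data.
vectors = {
    "a": [0.0, 0.0],
    "b": [1.0, 0.0],
    "c": [3.0, 4.0],
}

nearest = exact_knn([0.9, 0.1], vectors, k=2)
print([key for _, key in nearest])
```

The cost of this baseline grows linearly with the number of stored vectors, which is the motivation for graph-based indexes on larger collections.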

Details

CrateDB uses Lucene as a storage layer, so it inherits the implementation and concepts of Lucene Vector Search, in the same spirit as Elasticsearch.

To learn more about the internals, refer to the HNSW graph search algorithm, how Lucene implemented it, how Elasticsearch now also builds on it, and why, effectively, Lucene Is All You Need.

While Elasticsearch uses a query DSL based on JSON, in CrateDB, you can work with Lucene Vector Search using SQL.

Reference Manual

  • FLOAT_VECTOR

  • KNN_MATCH

  • VECTOR_SIMILARITY

Related

  • SQL

  • Full-Text Search

  • Geospatial Search

  • Hybrid Search

  • Machine Learning

  • Advanced Querying

SQL Semantic Search Machine Learning ML Embeddings Vector Store

Synopsis

Store and query word embeddings using similarity search based on Euclidean distance.

DDL

CREATE TABLE word_embeddings (
  text STRING PRIMARY KEY,
  embedding FLOAT_VECTOR(4)
);

DML

INSERT INTO word_embeddings (text, embedding)
VALUES
  ('Exploring the cosmos', [0.1, 0.5, -0.2, 0.8]),
  ('Discovering moon', [0.2, 0.4, 0.1, 0.7]),
  ('Discovering galaxies', [0.2, 0.4, 0.2, 0.9]),
  ('Sending the mission', [0.5, 0.9, -0.1, -0.7])
;

DQL

WITH param AS
  (SELECT [0.3, 0.6, 0.0, 0.9] AS sv)
SELECT
  text,
  VECTOR_SIMILARITY(embedding, (SELECT sv FROM param))
    AS score
FROM
  word_embeddings
WHERE
  KNN_MATCH(embedding, (SELECT sv FROM param), 2)
ORDER BY
  score DESC;

Result

+----------------------+-----------+
| text                 |     score |
+----------------------+-----------+
| Discovering galaxies | 0.9174312 |
| Exploring the cosmos | 0.9090909 |
| Discovering moon     | 0.9090909 |
| Sending the mission  | 0.2702703 |
+----------------------+-----------+
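SELECT 4 rows in set (0.078 sec)

The result contains four rows although k is 2, because KNN_MATCH matches at most k records on each shard of the table. To cap the total number of results, add a LIMIT clause to the query.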
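The scores in the result above are consistent with mapping the squared Euclidean distance d² between two vectors to 1 / (1 + d²), which is how Lucene defines its Euclidean similarity. The following Python sketch recomputes the score column from the rows and the query vector of the statements above. The helper name `euclidean_score` is illustrative, and the formula should be read as an assumption about the configured similarity rather than a specification of VECTOR_SIMILARITY.

```python
# Rows and query vector copied from the DML and DQL statements above.
rows = {
    "Exploring the cosmos": [0.1, 0.5, -0.2, 0.8],
    "Discovering moon": [0.2, 0.4, 0.1, 0.7],
    "Discovering galaxies": [0.2, 0.4, 0.2, 0.9],
    "Sending the mission": [0.5, 0.9, -0.1, -0.7],
}
query = [0.3, 0.6, 0.0, 0.9]


def euclidean_score(a, b):
    """Map squared Euclidean distance to a score in (0, 1]; 1.0 means identical."""
    squared_distance = sum((x - y) ** 2 for x, y in zip(a, b))
    return 1.0 / (1.0 + squared_distance)


scores = {text: euclidean_score(vec, query) for text, vec in rows.items()}

# Print the rows from highest to lowest score.
for text, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{text:<22}{score:.7f}")
```

A score of 1.0 would mean the stored vector equals the query vector, and the score approaches 0 as the distance grows.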

Usage

Working with vector data in CrateDB.

Pure SQL

CrateDB’s vector store features are available through SQL and can be used by any application that speaks it. The FLOAT_VECTOR data type is fundamentally a plain array of floating point numbers, so it is communicated as such through CrateDB’s HTTP and PostgreSQL interfaces.
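As an illustration of the HTTP interface, the following Python sketch builds a request for CrateDB’s `/_sql` endpoint, passing the query vector as a plain JSON array. It rests on several assumptions: a cluster listening on `localhost:4200`, the `word_embeddings` table from the synopsis, and that all three arguments can be bound as `?` parameters. If parameter binding does not work for your version, inline the literals as in the DQL statement. The helper name `build_knn_request` is hypothetical.

```python
import json
import urllib.request


def build_knn_request(query_vector, k, url="http://localhost:4200/_sql"):
    """Build (but do not send) an HTTP request for a parameterized kNN query."""
    stmt = (
        "SELECT text, VECTOR_SIMILARITY(embedding, ?) AS score "
        "FROM word_embeddings "
        "WHERE KNN_MATCH(embedding, ?, ?) "
        "ORDER BY score DESC"
    )
    # The vector travels as an ordinary JSON array of numbers.
    payload = {"stmt": stmt, "args": [query_vector, query_vector, k]}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_knn_request([0.3, 0.6, 0.0, 0.9], k=2)
print(request.data.decode("utf-8"))

# To execute against a running cluster (assumed to be reachable):
# with urllib.request.urlopen(request) as response:
#     print(json.load(response)["rows"])
```

Building the request separately from sending it keeps the payload easy to inspect before it reaches the database.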

Framework Integrations

CrateDB supports applications using the vector data type through corresponding framework adapters. The Machine Learning page illustrates all of them, covering both machine learning operations (MLOps) and vector database operations (similarity search).

Learn

Learn how to set up your database for vector search, how to create the relevant indices, and how to semantically query your data efficiently. A few must-reads for anyone looking to make sense of large volumes of unstructured text data.

Tutorials

Vector Support and KNN Search through SQL

The addition of vector support and KNN search makes CrateDB the optimal multi-model database for all types of data. Whether it is structured, semi-structured, or unstructured data, CrateDB stands as the all-in-one solution, capable of handling diverse data types with ease.

In this feature-focused blog post, we will introduce how CrateDB can be used as a vector database and how the vector store is implemented. We will also explore the possibilities of the K-Nearest Neighbors (KNN) search, and demonstrate vector capabilities with easy-to-follow examples.

Blog

Introduction • Vector Store • SQL

Retrieval Augmented Generation (RAG) with CrateDB and SQL

This notebook illustrates CrateDB’s vector store using pure SQL, by means of an example that exercises a RAG workflow.

It uses the white paper Time series data in manufacturing as input data, generates embeddings using OpenAI’s embeddings API, stores them in a table using FLOAT_VECTOR(1536), and queries it using the KNN_MATCH and VECTOR_SIMILARITY functions.

Notebook on GitHub • Notebook on Colab • Notebook on Binder

Fundamentals • Vector Store • LangChain • pandas • SQL

Technologies

Support for Vector Search in Apache Lucene

Uwe Schindler talks at Berlin Buzzwords 2023 about the new vector search features of Lucene 9, and about the journey of implementing HNSW from 2016 to 2021.

  • Uwe Schindler - What’s coming next with Apache Lucene?

Fundamentals Lucene Vector Search

See also

Features: Advanced Querying • Full-Text Search

Domains: Industrial Data • Machine Learning • Time Series Data

Product: Relational Database • Vector Database
