
Vector data in CrateDB

Store embeddings, run similarity search, and combine vector queries with SQL filters and aggregations, all in the same database as your operational data.

The standard pattern for AI applications is a separate vector database synchronized with your operational database: two systems to operate, data duplicated across both, and a synchronization pipeline that lags behind every write.

When the vector store and the operational database are separate, hybrid queries that combine semantic similarity with structured filters, time constraints, or aggregations require results to be fetched from both systems and merged in application code. CrateDB stores vectors in the same rows as your structured and semi-structured data. Similarity search runs alongside filters, time constraints, and aggregations in a single SQL query, with no synchronization overhead and no application-side merging.

 

Vector search examples

knn_match in standard SQL. No proprietary SDK, no separate query interface, no client-side result merging. 

        

/* Find the nearest vectors to a given query embedding using knn_match.
 Results include a relevance score. This runs on the same distributed SQL engine
 as your analytical queries — no separate search index, no separate request. */

SELECT text, _score
FROM word_embeddings
WHERE knn_match(embedding, [0.3, 0.6, 0.0, 0.9], 2)
ORDER BY _score DESC;
        

|------------------------|--------|
|         text           | _score |
|------------------------|--------|
|Discovering galaxies    |0.917431|
|Discovering moon        |0.909090|
|Exploring the cosmos    |0.909090|
|Sending the mission     |0.270270|
|------------------------|--------|
        

/* Use a stored embedding as the query target rather than an inline vector.
 The subquery retrieves the reference embedding and the outer query finds similar ones.
 This pattern is common in RAG pipelines where context is retrieved 
 by similarity to a stored document embedding. */

SELECT text, _score
FROM word_embeddings
WHERE knn_match(embedding, (SELECT embedding FROM word_embeddings WHERE text = 'Discovering galaxies'), 2)
ORDER BY _score DESC;
        

|------------------------|--------|
|         text           | _score |
|------------------------|--------|
|Discovering galaxies    |1       |
|Discovering moon        |0.952381|
|Exploring the cosmos    |0.840336|
|Sending the mission     |0.250626|
|------------------------|--------|

Unified vector storage

Store vector embeddings directly in CrateDB alongside the data they describe. Vectors and metadata live in the same row. No ETL jobs, no sync workflows, no risk of the two falling out of alignment. CrateDB supports float vectors up to 2,048 dimensions and integrates with Python pipelines, LangChain, LlamaIndex, and HuggingFace embedding models through standard PostgreSQL or HTTP interfaces.
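
A minimal sketch of this pattern, reusing the word_embeddings table from the examples above (the created_at column is illustrative):

```sql
/* One row holds the text, its metadata, and its embedding.
   FLOAT_VECTOR supports up to 2,048 dimensions. */
CREATE TABLE word_embeddings (
    text TEXT,
    created_at TIMESTAMP WITH TIME ZONE,  -- illustrative metadata column
    embedding FLOAT_VECTOR(4)
);

/* Vector and metadata are written together, in one statement and one row. */
INSERT INTO word_embeddings (text, created_at, embedding)
VALUES ('Discovering galaxies', now(), [0.3, 0.6, 0.0, 0.9]);
```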


Similarity search

CrateDB's knn_match function performs fast nearest-neighbor lookups across high-dimensional vector spaces. Results are returned in milliseconds even across millions of rows, using distributed execution across all cluster nodes. Product search, semantic search, anomaly detection, and RAG retrieval all use the same query interface.

SELECT id, title
FROM documents
WHERE knn_match(embedding, $query_vector, 5)
ORDER BY _score DESC
LIMIT 5;


Hybrid search

Vector similarity and structured filters run in the same SQL statement. Combine knn_match with MATCH for full-text relevance, with WHERE clauses for structured filters, and with time constraints for recency-weighted retrieval. No application-side merging, no dual-system query coordination.

SELECT id, title
FROM documents
WHERE MATCH(text, 'energy storage')
  AND knn_match(embedding, $query_vector, 10)
ORDER BY _score DESC
LIMIT 10;

This pattern is what makes CrateDB useful for RAG pipelines. You can filter by metadata, time, or full-text relevance at the same time as you retrieve by semantic similarity.
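
As a sketch of that combination, a retrieval query that restricts candidates by recency and a metadata value before ranking by similarity (the created_at and category columns are illustrative assumptions):

```sql
/* Retrieve recent, on-topic context for a RAG prompt in one statement:
   vector similarity, a time constraint, and a metadata filter together. */
SELECT id, title, _score
FROM documents
WHERE knn_match(embedding, $query_vector, 20)
  AND created_at > now() - INTERVAL '30 days'
  AND category = 'energy'
ORDER BY _score DESC
LIMIT 5;
```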


AI-ready integrations

CrateDB works as a vector store, feature store, and RAG backend depending on what you are building, without any reconfiguration. It fits into RAG pipelines by storing embeddings, running semantic search, retrieving context, and feeding external LLMs such as OpenAI or Llama. No proprietary SDKs required, just SQL through the PostgreSQL wire protocol or HTTP endpoint.


Scalability

CrateDB's shared-nothing architecture scales vector and non-vector data uniformly across cluster nodes. Add nodes as your embedding dataset grows and the cluster rebalances automatically. No hand-tuned infrastructure, no specialized vector-only systems to manage alongside your operational database.


CrateDB handles vector data alongside time-series, JSON, geospatial, full-text, and relational data in the same engine. No separate pipelines.

Keynote: The transformative effects of real-time AI 


CrateDB's VP of Product on the convergence of multi-model SQL databases and large language models. From the AI and Big Data Expo Europe 2023.

Dev talk: How to use private data in generative AI


How to combine CrateDB and LangChain to build RAG pipelines using private data as context for large language models. From FOSDEM 2024.

Demo: anomaly detection on machine data with CrateDB and PyCaret


How to connect CrateDB to PyCaret for low-code machine learning on live operational data to detect anomalies and potential failures in industrial sensor streams without exporting data to a separate ML platform.

Additional resources

FAQ

What is a vector in machine learning?

In machine learning and AI, a vector is a numerical array that represents the semantic meaning of a piece of data: a word, a sentence, an image, or any other input processed by an embedding model. For example, the sentence "discovering galaxies" might be represented as [-0.326, -0.123, -0.287, ...], a point in high-dimensional space. Vectors that are semantically similar cluster closer together in that space, which is what makes similarity search possible. In CrateDB, vectors are stored as one-dimensional arrays of float values using the float_vector data type, supporting up to 2,048 dimensions.


What is a vector database?

A vector database is designed to store and manage high-dimensional data, grouping vectors based on their similarities. These databases use advanced indexing and search algorithms to quickly find the most similar vectors to a given query. Examples of vector databases include CrateDB, Pinecone, Zilliz, and Weaviate. CrateDB works as a vector store database with features like vector storage and similarity search. If you want to learn more, read this blog post on how to choose the best database for vector data.

How does CrateDB compare to a dedicated vector database?

CrateDB's advantage over dedicated vector databases is that vector search runs in the same engine as your structured analytics. You can combine knn_match similarity search with SQL filters, time constraints, full-text search, and aggregations in a single query, without fetching results from multiple systems and merging them in application code. Teams whose entire workload is vector search on static embeddings may find a dedicated vector database simpler. Teams who need vector search alongside real-time analytics, time-series data, or operational workloads will find CrateDB significantly easier to operate.

What is vector data used for?

Vector data is used in various applications, including e-commerce recommendations, chatbots and customer support, anomaly and fraud detection, multimodal search, and generative AI. One of the key applications is similarity search, where algorithms like k-nearest neighbors (KNN) identify the data points most similar to a given query vector. This capability is crucial for recommendation systems, image retrieval, and anomaly detection. CrateDB supports these applications by integrating vector storage and similarity search within a scalable database solution. Watch CrateDB's keynote to learn more about vectors for real-time AI.

Which distance metrics are used to compare vectors?

Typical distance metrics for comparing vectors include Euclidean distance, cosine similarity, and Manhattan distance. In AI and ML, they are used for similarity search.

CrateDB's knn_match function ranks results by Euclidean distance: the _score values in the examples above follow 1 / (1 + squared distance), so an identical vector scores 1 and the score decays as distance grows. Cosine similarity measures the angle between two vectors and works well for text and semantic embeddings. Euclidean distance measures the straight-line distance between two points in vector space and suits image and audio embeddings. Manhattan distance sums the absolute differences across all dimensions and is less common in practice for ML workloads. The right metric depends on how your embedding model was trained, so consult your model documentation for the recommended similarity function.
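
For reference, for vectors a, b in n-dimensional space the three metrics are:

```latex
d_{\mathrm{euclidean}}(a, b) = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2}, \qquad
\cos(a, b) = \frac{\sum_{i=1}^{n} a_i b_i}{\lVert a \rVert \, \lVert b \rVert}, \qquad
d_{\mathrm{manhattan}}(a, b) = \sum_{i=1}^{n} \lvert a_i - b_i \rvert
```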

Start querying in minutes

Free forever on CrateDB Cloud. No credit card required.