Vector data in CrateDB
The standard pattern for AI applications is a separate vector database synchronized with your operational database: two systems to operate, data duplicated across both, and a synchronization pipeline that lags behind every write.
When the vector store and the operational database are separate, hybrid queries that combine semantic similarity with structured filters, time constraints, or aggregations require results to be fetched from both systems and merged in application code. CrateDB stores vectors in the same rows as your structured and semi-structured data. Similarity search runs alongside filters, time constraints, and aggregations in a single SQL query, with no synchronization overhead and no application-side merging.
Vector search examples
knn_match in standard SQL. No proprietary SDK, no separate query interface, no client-side result merging.
/* Find the nearest vectors to a given query embedding using knn_match.
Results include a relevance score. This runs on the same distributed SQL engine
as your analytical queries — no separate search index, no separate request. */
SELECT text, _score
FROM word_embeddings
WHERE knn_match(embedding, [0.3, 0.6, 0.0, 0.9], 2)
ORDER BY _score DESC;
|----------------------|----------|
| text                 | _score   |
|----------------------|----------|
| Discovering galaxies | 0.917431 |
| Discovering moon     | 0.909090 |
| Exploring the cosmos | 0.909090 |
| Sending the mission  | 0.270270 |
|----------------------|----------|
/* Use a stored embedding as the query target rather than an inline vector.
The subquery retrieves the reference embedding and the outer query finds similar ones.
This pattern is common in RAG pipelines where context is retrieved
by similarity to a stored document embedding. */
SELECT text, _score
FROM word_embeddings
WHERE knn_match(embedding, (SELECT embedding FROM word_embeddings WHERE text = 'Discovering galaxies'), 2)
ORDER BY _score DESC;
|----------------------|----------|
| text                 | _score   |
|----------------------|----------|
| Discovering galaxies | 1        |
| Discovering moon     | 0.952381 |
| Exploring the cosmos | 0.840336 |
| Sending the mission  | 0.250626 |
|----------------------|----------|
Unified vector storage
Store vector embeddings directly in CrateDB alongside the data they describe. Vectors and metadata live in the same row. No ETL jobs, no sync workflows, no risk of the two falling out of alignment. CrateDB supports float vectors up to 2,048 dimensions and integrates with Python pipelines, LangChain, LlamaIndex, and HuggingFace embedding models through standard PostgreSQL or HTTP interfaces.
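As a minimal sketch of what that looks like (table and column names are illustrative; the float_vector dimension must match your embedding model, and a small dimension is used here only for readability):
/* Embeddings live in the same row as the structured and full-text data they describe. */
CREATE TABLE documents (
    id TEXT PRIMARY KEY,
    title TEXT,
    source TEXT,
    text TEXT INDEX USING FULLTEXT,
    created_at TIMESTAMP WITH TIME ZONE,
    embedding FLOAT_VECTOR(4)
);
/* The embedding is written together with its metadata in a single INSERT;
   there is nothing to synchronize afterwards. */
INSERT INTO documents (id, title, source, text, created_at, embedding)
VALUES ('doc-1', 'Exploring the cosmos', 'knowledge_base', 'Notes on deep-space observation.', now(), [0.3, 0.6, 0.0, 0.9]);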
Similarity search
CrateDB's knn_match function performs fast nearest-neighbor lookups across high-dimensional vector spaces. Results are returned in milliseconds even across millions of rows, using distributed execution across all cluster nodes. Product search, semantic search, anomaly detection, and RAG retrieval all use the same query interface.
SELECT id, title, _score
FROM documents
WHERE knn_match(embedding, $query_vector, 5)
ORDER BY _score DESC
LIMIT 5;
Hybrid search
Vector similarity and structured filters run in the same SQL statement. Combine knn_match with MATCH for full-text relevance, with WHERE clauses for structured filters, and with time constraints for recency-weighted retrieval. No application-side merging, no dual-system query coordination.
SELECT id, title, _score
FROM documents
WHERE knn_match(embedding, $query_vector, 10)
  AND MATCH(text, 'energy storage')
ORDER BY _score DESC
LIMIT 10;
This pattern is what makes CrateDB useful for RAG pipelines. You can filter by metadata, time, or full-text relevance at the same time as you retrieve by semantic similarity.
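As a sketch of such a retrieval query (the source filter, the 30-day recency window, and the $query_vector placeholder are illustrative):
/* Semantic similarity, full-text relevance, a metadata filter, and a recency
   constraint in a single statement; $query_vector is the embedding of the user's question. */
SELECT id, title, _score
FROM documents
WHERE knn_match(embedding, $query_vector, 10)
  AND MATCH(text, 'energy storage')
  AND source = 'knowledge_base'
  AND created_at > now() - INTERVAL '30 days'
ORDER BY _score DESC
LIMIT 10;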
AI-ready integrations
CrateDB works as a vector store, feature store, and RAG backend depending on what you are building, without any reconfiguration. It fits into RAG pipelines by storing embeddings, running semantic search, retrieving context, and feeding external LLMs such as OpenAI or Llama. No proprietary SDKs required, just SQL through the PostgreSQL wire protocol or HTTP endpoint.
Scalability
CrateDB's shared-nothing architecture scales vector and non-vector data uniformly across cluster nodes. Add nodes as your embedding dataset grows and the cluster rebalances automatically. No hand-tuned infrastructure, no specialized vector-only systems to manage alongside your operational database.
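As a minimal sketch (table name, shard count, and replica count are illustrative and depend on cluster size):
/* Table data, including float_vector columns, is split into shards that CrateDB
   distributes and rebalances across the nodes of the cluster. */
CREATE TABLE document_embeddings (
    id TEXT PRIMARY KEY,
    embedding FLOAT_VECTOR(384)
) CLUSTERED INTO 6 SHARDS
WITH (number_of_replicas = 1);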
Keynote: The transformative effects of real-time AI
Dev talk: How to use private data in generative AI
Demo: anomaly detection on machine data with CrateDB and PyCaret
Additional resources
Documentation
FAQ
In machine learning and AI, a vector is a numerical array that represents the semantic meaning of a piece of data: a word, a sentence, an image, or any other input processed by an embedding model. For example, the sentence "discovering galaxies" might be represented as [-0.326, -0.123, -0.287, ...], a point in high-dimensional space. Vectors that are semantically similar cluster closer together in that space, which is what makes similarity search possible. In CrateDB, vectors are stored as one-dimensional arrays of float values using the float_vector data type, which supports up to 2,048 dimensions and allows efficient storage and querying of dense vector data.
A vector database is designed to store and manage high-dimensional data, grouping vectors based on their similarities. These databases use advanced indexing and search algorithms to find the most similar vectors to a given query quickly. Examples of vector databases include CrateDB, Pinecone, Zilliz, and Weaviate. CrateDB excels as a vector store database with features like vector storage and similarity search. If you want to learn more, read this blog post on how to choose the best database for vector data.
CrateDB's advantage over dedicated vector databases is that vector search runs in the same engine as your structured analytics. You can combine knn_match similarity search with SQL filters, time constraints, full-text search, and aggregations in a single query, without fetching results from multiple systems and merging them in application code. Teams whose entire workload is vector search on static embeddings may find a dedicated vector database simpler. Teams who need vector search alongside real-time analytics, time-series data, or operational workloads will find CrateDB significantly easier to operate.
Vector data is used in various applications, including e-commerce recommendations, chatbots and customer support, anomaly and fraud detection, multimodal searches, and generative AI. One of the key applications is similarity search, where algorithms like k-nearest neighbors (KNN) identify the most similar data points to a given query vector. This capability is crucial for recommendation systems, image retrieval, and anomaly detection. CrateDB enhances these applications by integrating vector storage and similarity search within a scalable database solution. Watch CrateDB's keynote to learn more about vectors for real-time AI.
Typical distance metrics for comparing vectors include Euclidean distance, cosine similarity, and Manhattan distance; in AI and ML they underpin similarity search.
CrateDB's knn_match function ranks nearest neighbors by Euclidean distance, which measures the straight-line distance between two points in vector space. Cosine similarity measures the angle between two vectors and is a common choice for text and semantic embeddings, while Manhattan distance sums the absolute differences across all dimensions and is less common in practice for ML workloads. The right metric depends on how your embedding model was trained, so consult your model's documentation for the recommended similarity function.
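For intuition, compare two toy vectors a = [1, 0] and b = [0, 1] (illustrative numbers only): the Euclidean distance is √((1-0)² + (0-1)²) = √2 ≈ 1.41, the cosine similarity is (1·0 + 0·1) / (1 · 1) = 0, meaning the vectors point in unrelated directions, and the Manhattan distance is |1-0| + |0-1| = 2.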