Vector Data

A single database for vectors, search, and real-time analytics.

CrateDB unlocks the full potential of vector data by combining vector storage, similarity search, hybrid retrieval, and real-time analytics, all in a single, scalable SQL-native database. No additional vector store required. No fragile synchronization pipelines. Just one unified system for operational, semantic, and analytical data. By centralizing all your structured, semi-structured, and unstructured data alongside their vector embeddings, CrateDB reduces development effort, simplifies architecture, and lowers total cost of ownership.

 

Vector data querying with SQL

Hyper-fast. Queries in milliseconds.

        

SELECT text, _score
FROM word_embeddings
WHERE knn_match(embedding, [0.3, 0.6, 0.0, 0.9], 2)
ORDER BY _score DESC;
        

|------------------------|--------|
|         text           | _score |
|------------------------|--------|
|Discovering galaxies    |0.917431|
|Discovering moon        |0.909090|
|Exploring the cosmos    |0.909090|
|Sending the mission     |0.270270|
|------------------------|--------|
        

SELECT text, _score
FROM word_embeddings
WHERE knn_match(embedding, (SELECT embedding FROM word_embeddings WHERE text = 'Discovering galaxies'), 2)
ORDER BY _score DESC;
        

|------------------------|--------|
|         text           | _score |
|------------------------|--------|
|Discovering galaxies    |1       |
|Discovering moon        |0.952381|
|Exploring the cosmos    |0.840336|
|Sending the mission     |0.250626|
|------------------------|--------|
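
The scores in both result tables are consistent with a similarity of 1 / (1 + squared Euclidean distance). The Python sketch below reproduces them outside the database. The stored vectors are assumptions (the page shows the text column but not the embeddings), chosen so that the numbers line up; treat the formula as an observation about this example rather than a specification of the scoring.

```python
# Assumed table contents: the page shows the text column but not the stored vectors.
WORD_EMBEDDINGS = {
    "Exploring the cosmos": [0.1, 0.5, -0.2, 0.8],
    "Discovering moon": [0.2, 0.4, 0.1, 0.7],
    "Discovering galaxies": [0.2, 0.4, 0.2, 0.9],
    "Sending the mission": [0.5, 0.9, -0.1, -0.7],
}


def squared_euclidean(a, b):
    """Sum of squared per-dimension differences between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    return sum((x - y) ** 2 for x, y in zip(a, b))


def similarity_score(a, b):
    """Turn a distance into a score in (0, 1]; identical vectors score exactly 1."""
    return 1.0 / (1.0 + squared_euclidean(a, b))


def rank(query_vector, rows=WORD_EMBEDDINGS):
    """Return (text, score) pairs ordered from most to least similar."""
    scored = [(text, similarity_score(vec, query_vector)) for text, vec in rows.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Same query vector as the first SQL example.
    for text, score in rank([0.3, 0.6, 0.0, 0.9]):
        print(f"{text:<24}{score:.6f}")
```

Passing the assumed vector for 'Discovering galaxies' as the query corresponds to the second SQL example, where the row matches itself with a score of 1.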

Unified vector storage & management

Store vector embeddings directly in CrateDB using your preferred machine learning or embedding models. Keep vectors and metadata perfectly aligned within the same row, without additional services, ETL jobs, or complex sync workflows. This unified approach removes the operational burden of managing separate databases for semantics, text, search, and analytics.

CrateDB supports high-dimensional float vector types (max 2,048 dimensions) and stores vectors alongside time series, JSON, geospatial, and full-text data for truly unified analytics. Integrate seamlessly with Python pipelines, LangChain, LlamaIndex, and HuggingFace embedding models using standard PostgreSQL or HTTP interfaces.
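
As a rough sketch of keeping vectors and metadata in the same row, the Python helpers below build a table definition and check an embedding before it is inserted. The table and column names are hypothetical, the `FLOAT_VECTOR(n)` column syntax is assumed from CrateDB's float_vector type, and the 2,048-dimension limit is the one quoted above.

```python
import math

MAX_DIMENSIONS = 2048  # upper bound quoted in the text above


def create_table_sql(table, dimensions):
    """Build DDL that keeps content, metadata, and the embedding in one row."""
    if not table.isidentifier():
        raise ValueError("table name must be a plain identifier")
    if not 1 <= dimensions <= MAX_DIMENSIONS:
        raise ValueError(f"dimensions must be between 1 and {MAX_DIMENSIONS}")
    # Column names are illustrative, not prescribed by CrateDB.
    return (
        f"CREATE TABLE {table} (\n"
        "    id TEXT PRIMARY KEY,\n"
        "    content TEXT,\n"
        "    metadata OBJECT(DYNAMIC),\n"
        f"    embedding FLOAT_VECTOR({dimensions})\n"
        ")"
    )


def validate_embedding(values, dimensions):
    """Return the embedding as a list of finite floats of the expected length."""
    vector = [float(v) for v in values]
    if len(vector) != dimensions:
        raise ValueError(f"expected {dimensions} dimensions, got {len(vector)}")
    if not all(math.isfinite(v) for v in vector):
        raise ValueError("embedding contains NaN or infinite values")
    return vector


if __name__ == "__main__":
    print(create_table_sql("documents", 384))
    print(validate_embedding([0.1, 0.5, -0.2, 0.8], 4))
```

Checking the length on the client side catches a mismatch between the embedding model's output size and the column definition before the row reaches the database.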


Powerful similarity search

CrateDB supports high-performance vector similarity search, enabling:

  • Fast nearest-neighbor lookups
  • Efficient exploration of high-dimensional datasets
  • Real-time matching for AI-powered applications

Whether it’s product search, semantic search, anomaly detection, or RAG workloads, CrateDB makes vector search immediate and scalable. Vector queries return results in milliseconds, even across millions of rows, thanks to CrateDB’s distributed execution engine.

SELECT id, title, _score
FROM documents
WHERE KNN_MATCH(embedding, $query_vector, 5)
ORDER BY _score DESC
LIMIT 5;


Hybrid search for maximum relevance

CrateDB natively combines:

  • Vector similarity (KNN_MATCH)
  • Full-text search (MATCH)
  • Keyword/structured filters
  • External semantic enrichment via your embedding models

This hybrid retrieval approach improves precision and relevance by querying vectors, text, and structured attributes in a single SQL statement.

Hybrid queries make CrateDB ideal for semantic search, product search, knowledge retrieval, and multi-attribute ranking scenarios.

SELECT id, title, _score
FROM documents
WHERE MATCH(text, 'energy storage')
  AND KNN_MATCH(embedding, $query_vector, 10)
ORDER BY _score DESC
LIMIT 10;
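
To make the idea concrete, here is a small in-memory sketch of hybrid retrieval in Python: a keyword filter narrows the candidates, and vector similarity orders what remains. It is not how CrateDB executes such a query. The documents, vectors, and function names are invented for illustration, and the keyword filter is a naive substring check rather than real full-text search.

```python
from dataclasses import dataclass


@dataclass
class Document:
    id: int
    title: str
    text: str
    embedding: list


def squared_distance(a, b):
    """Squared Euclidean distance; smaller means more similar."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def hybrid_search(documents, keywords, query_vector, limit=10):
    """Keep documents containing every keyword, then order them by vector distance."""
    terms = keywords.lower().split()
    matching = [d for d in documents if all(t in d.text.lower() for t in terms)]
    matching.sort(key=lambda d: squared_distance(d.embedding, query_vector))
    return matching[:limit]


# Hypothetical sample rows with two-dimensional embeddings.
DOCS = [
    Document(1, "Grid batteries", "Energy storage for the power grid", [0.9, 0.1]),
    Document(2, "Home batteries", "Residential energy storage systems", [0.2, 0.8]),
    Document(3, "Solar panels", "Rooftop solar energy generation", [0.9, 0.2]),
]

if __name__ == "__main__":
    for doc in hybrid_search(DOCS, "energy storage", [1.0, 0.0]):
        print(doc.id, doc.title)
```

Document 3 sits close to the query vector but is dropped by the keyword filter, which is the kind of precision gain a hybrid query aims for.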


Semantically enriched data

Attach vector embeddings to any database row to enrich your data with semantic context. This increases explainability, captures relationships that traditional attributes miss, and strengthens downstream machine learning and search applications. This also enables multi-modal use cases by storing text, images, audio, or video embeddings together with their metadata.

AI-ready architecture

CrateDB integrates cleanly with LLMs, vectorizers, ML pipelines, and AI platforms. Use it as:

  • A vector store
  • A real-time feature store
  • A RAG backend
  • A context store for conversational memory

All while benefiting from CrateDB’s distributed SQL engine, automatic indexing, and interoperability.

CrateDB fits naturally into Retrieval-Augmented Generation (RAG) pipelines by storing embeddings, running semantic search, retrieving context, and feeding external LLMs such as OpenAI or Llama. No proprietary SDKs required, just SQL.
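
The retrieval half of such a pipeline can be sketched in a few lines of Python. Everything here is hypothetical: `retrieve` stands in for a similarity query against the database, `build_prompt` is an invented helper, the sample rows are made up, and the call to the LLM itself is left out.

```python
def squared_distance(a, b):
    """Squared Euclidean distance; smaller means more similar."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def retrieve(rows, query_vector, k=2):
    """Stand-in for a similarity query: texts of the k rows closest to the query."""
    ranked = sorted(rows, key=lambda row: squared_distance(row[1], query_vector))
    return [text for text, _ in ranked[:k]]


def build_prompt(question, passages):
    """Assemble retrieved passages and the user question into one prompt string."""
    context = "\n".join(f"- {p}" for p in passages)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


# Hypothetical (text, embedding) rows; real embeddings come from an embedding model.
ROWS = [
    ("Pump P-7 was serviced in March.", [0.9, 0.1]),
    ("The cafeteria opens at 8 am.", [0.1, 0.9]),
    ("Pump P-7 vibration limits are 4 mm/s.", [0.8, 0.2]),
]

if __name__ == "__main__":
    passages = retrieve(ROWS, [1.0, 0.0], k=2)
    print(build_prompt("When was pump P-7 last serviced?", passages))
```

In a real pipeline the query vector would be the embedding of the user's question, produced by the same model that embedded the stored rows.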


Effortless scalability

Thanks to its shared-nothing, distributed architecture, CrateDB scales with your dataset and vector dimensionality, without requiring hand-tuned infrastructure or specialized vector-only systems. Eliminate the cost and complexity of maintaining a separate vector database and focus on building your AI applications.

Automatic rebalancing ensures vector and non-vector data scale uniformly across cluster nodes. Indexing occurs within seconds, enabling near real-time use cases.


Accelerated development & lower overhead

With native vector types, SQL queries, and multi-model functionality in one platform, teams avoid the integration, maintenance, and learning curve associated with external vector solutions. Development becomes faster, operations become simpler, and maintenance costs decrease.

No need to manage separate vector stores, sync pipelines, or duplicated data models: CrateDB keeps everything consistent by design. This consolidation significantly reduces total cost of ownership (TCO).


Keynote - The transformative effects of real-time AI 

In this keynote at the AI & Big Data Expo Europe 2023, CrateDB's VP Product shares his vision for the future with multi-model SQL databases and Large Language Models.

Dev Talk - How to use private data in generative AI

This talk at FOSDEM 2024 covers the combination of CrateDB and LangChain. It shows how to get started with using private data as context for large language models through LangChain, using Retrieval-Augmented Generation (RAG).

5 essential things you need to know about vector databases

This infographic gives you some basic understanding of vector databases, from what you should look for when choosing one to combining vector data with other data types.

Demo – Harnessing CrateDB’s multi-model capabilities for AI-powered applications

In this video, we explore the integration of CrateDB and PyCaret to detect anomalies in machine data, crucial for identifying potential failures or inefficiencies in technological systems. CrateDB's capability for handling large-scale data with ease pairs seamlessly with PyCaret's low-code approach to machine learning, offering a streamlined path to uncovering insights within vast datasets.

FAQ

In a geospatial context, vector data captures the complex details of points, lines, and polygons for data analysis, mapping, and spatial decision-making, and can be stored in file formats such as Shapefile (.shp), GeoJSON (.geojson), KML (Keyhole Markup Language), and GML (Geography Markup Language). In AI applications, the term refers instead to the vector embeddings described on this page. CrateDB handles vector data in a highly scalable database that can be queried using SQL, simplifying data management and reducing development time and overall costs.

Vectors are numerical representations used to quantify and compare features or characteristics of data items, such as text, images, or sounds, in a high-dimensional space. For example, a vector can look like this: [-0.32643065, -0.12308089, -0.2873811], representing a point in a multi-dimensional space. In CrateDB, vectors are stored as one-dimensional arrays of float values using the float_vector data type, allowing for efficient storage and querying of dense vector data.

A vector database is designed to store and manage high-dimensional data, grouping vectors based on their similarities. These databases use advanced indexing and search algorithms to find the most similar vectors to a given query quickly. Examples of vector databases include CrateDB, Pinecone, Zilliz, and Weaviate. CrateDB excels as a vector store database with features like vector storage and similarity search. If you want to learn more, read this blog post on how to choose the best database for vector data.

Vector data is used in various applications, including e-commerce recommendations, chatbots and customer support, anomaly and fraud detection, multimodal searches, and generative AI. One of the key applications is similarity search, where algorithms like k-nearest neighbors (KNN) identify the most similar data points to a given query vector. This capability is crucial for recommendation systems, image retrieval, and anomaly detection. CrateDB enhances these applications by integrating vector storage and similarity search within a scalable database solution. Watch CrateDB’s keynote to learn more about vectors for real-time AI.

Typical measures for comparing vectors include Euclidean distance, cosine similarity, and Manhattan distance. In AI and ML, they are used for similarity search.
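
For reference, the three measures can be written out in plain Python. This is a minimal sketch with invented function names, meant to show the definitions rather than any database internals.

```python
import math


def _check_same_length(a, b):
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")


def euclidean_distance(a, b):
    """Straight-line distance: square root of the summed squared differences."""
    _check_same_length(a, b)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan_distance(a, b):
    """Sum of absolute per-dimension differences."""
    _check_same_length(a, b)
    return sum(abs(x - y) for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Cosine of the angle between the vectors; ignores their magnitudes."""
    _check_same_length(a, b)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    a, b = [0.0, 0.0], [3.0, 4.0]
    print(euclidean_distance(a, b))
    print(manhattan_distance(a, b))
    print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```

For the two distances, smaller values mean more similar vectors; for cosine similarity, larger values do.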
