AI integration¶

Introduction

CrateDB is not just a real-time analytics database, it’s a powerful platform to feed and interact with machine learning models, thanks to its ability to store, query, and transform structured, unstructured, and vectorized data at scale using standard SQL.

Whether you’re training models, running batch or real-time inference, or integrating with AI pipelines, CrateDB offers:

High ingestion performance for time-series or sensor data.
SQL-powered transformations and filtering.
Unified queries across structured and semi-structured data:
Full-text, vector, and JSON.
Native support for embeddings via FLOAT_VECTOR data type,
for conducting similarity searches in vector spaces (HNSW).

Benefits of using CrateDB in ML pipelines

Use Case	CrateDB Role
Feature Store	Store pre-computed features with SQL access
Real-Time Inference	Serve vector-based results with `KNN_MATCH`
Experimentation	Use SQL for fast slicing, filtering, and aggregations
Monitoring	Track model performance, drift, or input quality
Data Collection	Capture telemetry, events, logs, and raw user data

Use cases

Similarity search¶

Feature: Store and query word embeddings using HNSW nearest neighbor search through SQL. CrateDB supports high-dimensional vectors with FLOAT_VECTOR. To query these vectors for similarity-based inference, see Vector Search and the quick synopsis below.

Create a table using the FLOAT_VECTOR data type.

CREATE TABLE word_embeddings (
  text STRING PRIMARY KEY,
  embedding FLOAT_VECTOR(4)
);

Insert a few vectors.

INSERT INTO word_embeddings (text, embedding)
VALUES
  ('Exploring the cosmos', [0.1, 0.5, -0.2, 0.8]),
  ('Discovering moon', [0.2, 0.4, 0.1, 0.7]),
  ('Discovering galaxies', [0.2, 0.4, 0.2, 0.9]),
  ('Sending the mission', [0.5, 0.9, -0.1, -0.7])
;

Query dataset for similarities.

SELECT *
FROM word_embeddings
WHERE KNN_MATCH(
    embedding, [0.3, 0.6, 0.0, 0.9], 2);

ML engineering¶

Feature Engineering: Use SQL to build features dynamically from raw data.

SELECT
  user_id,
  AVG(duration) AS avg_session,
  COUNT(DISTINCT page) AS page_diversity
FROM sessions
GROUP BY user_id;

Feature Engineering: Use CrateDB as a feature store. Centralize your features and use them in production models.

SELECT *
FROM user_features
WHERE last_active > NOW() - INTERVAL '1 day';

Training Dataset Extraction: Efficiently extract and filter relevant training data from large datasets using plain SQL.

SELECT *
FROM telemetry
WHERE temperature > 80
  AND error_code IS NOT NULL
  AND ts BETWEEN NOW() - INTERVAL '7 days' AND NOW();

Model Training Pipeline: An example architecture for feature engineering and model training with CrateDB, see also MLflow and PyCaret for AutoML purposes.

[ Sensors, APIs ] (ingest)
     ↓
[ CrateDB ] (real-time analytics store)
     ↓
[ Python ML* ] (model training)
     ↓
[ Model Registry ] (model serving)

* … using frameworks or platforms like Apache Spark, Databricks, MLflow, pandas, PyCaret, scikit-learn.

Real-Time Inference with Hybrid Queries: An example architecture for hybrid queries, see also Vector Search and Hybrid Search.

[ User ] (query expression)
   ↓
[ CrateDB ] (data source)
  - JSON filters
  - Full-text search (BM25)
  - FLOAT_VECTOR support (HNSW)
  - SQL + KNN_MATCH
   ↓
[ Application response ]

Note

For advanced ML engineering tasks and use cases, based on industry-approved frameworks and libraries, see CrateDB’s support for MLflow and PyCaret at Machine learning.

Text-to-SQL¶

Text‑to‑SQL lets you query data in natural language with contemporary large language models, optionally offline.

Related features¶

Learn more about how to combine ML features with other major features of CrateDB.

Feature	Description	Documentation
FLOAT_VECTOR	High-dimensional embedding support	Vector Search
KNN_MATCH	Nearest neighbor search in SQL	Vector Search
JSON & full-text	Store unstructured and semi-structured data	Document Store, Full-Text Search
Joins & aggregates	Combine data in real-time	Relational / JOINs, Aggregations
Time-series support	Perfect for sensor-based training data	Time series data

AI integration¶

Similarity search¶

ML engineering¶

Text-to-SQL¶

Related features¶

See also¶