AI integration¶
Introduction
CrateDB is not just a real-time analytics database, it’s a powerful platform to feed and interact with machine learning models, thanks to its ability to store, query, and transform structured, unstructured, and vectorized data at scale using standard SQL.
Whether you’re training models, running batch or real-time inference, or integrating with AI pipelines, CrateDB offers:
High ingestion performance for time-series or sensor data.
SQL-powered transformations and filtering.
Unified queries across structured and semi-structured data:
Full-text, vector, and JSON.Native support for embeddings via FLOAT_VECTOR data type,
for conducting similarity searches in vector spaces (HNSW).
Benefits of using CrateDB in ML pipelines
Use Case |
CrateDB Role |
|---|---|
Feature Store |
Store pre-computed features with SQL access |
Real-Time Inference |
Serve vector-based results with |
Experimentation |
Use SQL for fast slicing, filtering, and aggregations |
Monitoring |
Track model performance, drift, or input quality |
Data Collection |
Capture telemetry, events, logs, and raw user data |
Use cases
Similarity search¶
Feature: Store and query word embeddings using HNSW nearest neighbor
search through SQL. CrateDB supports high-dimensional vectors with FLOAT_VECTOR.
To query these vectors for similarity-based inference, see
Vector Search and the quick synopsis below.
Create a table using the FLOAT_VECTOR data type.
CREATE TABLE word_embeddings (
text STRING PRIMARY KEY,
embedding FLOAT_VECTOR(4)
);
Insert a few vectors.
INSERT INTO word_embeddings (text, embedding)
VALUES
('Exploring the cosmos', [0.1, 0.5, -0.2, 0.8]),
('Discovering moon', [0.2, 0.4, 0.1, 0.7]),
('Discovering galaxies', [0.2, 0.4, 0.2, 0.9]),
('Sending the mission', [0.5, 0.9, -0.1, -0.7])
;
Query dataset for similarities.
SELECT *
FROM word_embeddings
WHERE KNN_MATCH(
embedding, [0.3, 0.6, 0.0, 0.9], 2);
ML engineering¶
Feature Engineering: Use SQL to build features dynamically from raw data.
SELECT
user_id,
AVG(duration) AS avg_session,
COUNT(DISTINCT page) AS page_diversity
FROM sessions
GROUP BY user_id;
Feature Engineering: Use CrateDB as a feature store. Centralize your features and use them in production models.
SELECT *
FROM user_features
WHERE last_active > NOW() - INTERVAL '1 day';
Training Dataset Extraction: Efficiently extract and filter relevant training data from large datasets using plain SQL.
SELECT *
FROM telemetry
WHERE temperature > 80
AND error_code IS NOT NULL
AND ts BETWEEN NOW() - INTERVAL '7 days' AND NOW();
Model Training Pipeline: An example architecture for feature engineering and model training with CrateDB, see also MLflow and PyCaret for AutoML purposes.
[ Sensors, APIs ] (ingest)
↓
[ CrateDB ] (real-time analytics store)
↓
[ Python ML* ] (model training)
↓
[ Model Registry ] (model serving)
* … using frameworks or platforms like Apache Spark, Databricks, MLflow, pandas, PyCaret, scikit-learn.
Real-Time Inference with Hybrid Queries: An example architecture for hybrid queries, see also Vector Search and Hybrid Search.
[ User ] (query expression)
↓
[ CrateDB ] (data source)
- JSON filters
- Full-text search (BM25)
- FLOAT_VECTOR support (HNSW)
- SQL + KNN_MATCH
↓
[ Application response ]
Note
For advanced ML engineering tasks and use cases, based on industry-approved frameworks and libraries, see CrateDB’s support for MLflow and PyCaret at Machine learning.
Text-to-SQL¶
Text‑to‑SQL lets you query data in natural language with contemporary large language models, optionally offline.