AI integration

Introduction

CrateDB is more than a real-time analytics database: it is also a platform for feeding and interacting with machine learning models. It can store, query, and transform structured, unstructured, and vectorized data at scale using standard SQL.

Whether you’re training models, running batch or real-time inference, or integrating with AI pipelines, CrateDB offers:

  • High ingestion performance for time-series or sensor data.

  • SQL-powered transformations and filtering.

  • Unified queries across structured and semi-structured data:
    full-text, vector, and JSON.

  • Native support for embeddings via the FLOAT_VECTOR data type,
    which enables similarity searches in vector spaces (HNSW).

Benefits of using CrateDB in ML pipelines

Use Case             CrateDB Role
-------------------  -----------------------------------------------------
Feature Store        Store pre-computed features with SQL access
Real-Time Inference  Serve vector-based results with KNN_MATCH
Experimentation      Use SQL for fast slicing, filtering, and aggregations
Monitoring           Track model performance, drift, or input quality
Data Collection      Capture telemetry, events, logs, and raw user data

Use cases

ML engineering

Feature Engineering: Use SQL to build features dynamically from raw data.

SELECT
  user_id,
  AVG(duration) AS avg_session,
  COUNT(DISTINCT page) AS page_diversity
FROM sessions
GROUP BY user_id;
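To make the semantics of this aggregation concrete, the following sketch computes the same two features in plain Python. The sample rows and the `user_features` helper are hypothetical and only illustrate what the query returns; NULL handling is not modeled.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical session rows: (user_id, page, duration in seconds).
sessions = [
    ("u1", "/home", 30.0),
    ("u1", "/docs", 90.0),
    ("u1", "/home", 60.0),
    ("u2", "/pricing", 20.0),
]


def user_features(rows):
    """Mirror AVG(duration) and COUNT(DISTINCT page), grouped by user_id."""
    durations = defaultdict(list)
    pages = defaultdict(set)
    for user_id, page, duration in rows:
        durations[user_id].append(duration)
        pages[user_id].add(page)  # a set keeps each page once per user
    return {
        user_id: {
            "avg_session": mean(durations[user_id]),
            "page_diversity": len(pages[user_id]),
        }
        for user_id in durations
    }


features = user_features(sessions)
```

In practice the database performs this grouping, so only the aggregated feature rows need to leave CrateDB.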

Feature Store: Use CrateDB as a feature store. Centralize your features and use them in production models.

SELECT *
FROM user_features
WHERE last_active > NOW() - INTERVAL '1 day';

Training Dataset Extraction: Efficiently extract and filter relevant training data from large datasets using plain SQL.

SELECT *
FROM telemetry
WHERE temperature > 80
  AND error_code IS NOT NULL
  AND ts BETWEEN NOW() - INTERVAL '7 days' AND NOW();
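From Python, an extraction like this can be run through any DB-API cursor. The sketch below is one possible shape, not a prescribed client API: the `fetch_training_rows` helper is hypothetical, the `?` placeholders assume a driver that uses the qmark parameter style, and the time window is computed client-side on the assumption that the `ts` column accepts ISO 8601 strings for comparison.

```python
from datetime import datetime, timedelta, timezone

# Same filter as the query above, with the literals replaced by bind parameters.
TRAINING_QUERY = """
SELECT *
FROM telemetry
WHERE temperature > ?
  AND error_code IS NOT NULL
  AND ts BETWEEN ? AND ?
"""


def fetch_training_rows(cursor, min_temperature=80, days=7, now=None):
    """Run the extraction query through a DB-API cursor and return all rows."""
    now = now or datetime.now(timezone.utc)
    start = now - timedelta(days=days)
    # Assumption: the driver and the `ts` column accept ISO 8601 strings.
    params = (min_temperature, start.isoformat(), now.isoformat())
    cursor.execute(TRAINING_QUERY, params)
    return cursor.fetchall()
```

The cursor could come, for instance, from a connection opened with the `crate` Python client. Passing `now` explicitly makes the time window reproducible across training runs.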

Model Training Pipeline: An example architecture for feature engineering and model training with CrateDB. See also MLflow and PyCaret for AutoML purposes.

[ Sensors, APIs ] (ingest)
     ↓
[ CrateDB ] (real-time analytics store)
     ↓
[ Python ML* ] (model training)
     ↓
[ Model Registry ] (model serving)

* … using frameworks or platforms like Apache Spark, Databricks, MLflow, pandas, PyCaret, scikit-learn.
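The training and registry stages of this pipeline can be sketched in a few lines of plain Python. Everything here is illustrative: the sample rows stand in for data read from CrateDB, the one-feature threshold model stands in for a real framework such as scikit-learn, and a dictionary stands in for a model registry. The model assumes that failures occur at higher temperatures than normal readings.

```python
from statistics import mean


def train_threshold_model(rows):
    """Fit a one-feature classifier: the midpoint between the class means."""
    failures = [temp for temp, failed in rows if failed]
    normals = [temp for temp, failed in rows if not failed]
    return {"threshold": (mean(failures) + mean(normals)) / 2}


def predict(model, temperature):
    """Flag a reading as a likely failure when it exceeds the threshold."""
    return temperature > model["threshold"]


# Hypothetical (temperature, failed) pairs, standing in for rows from CrateDB.
training_rows = [(70.0, False), (75.0, False), (90.0, True), (95.0, True)]

# A plain dict stands in for a model registry; the key is a made-up name.
registry = {}
registry["overheat-detector/v1"] = train_threshold_model(training_rows)
```

A serving component would then look up the model by name and call `predict` on incoming readings.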

Real-Time Inference with Hybrid Queries: An example architecture for hybrid queries. See also Vector Search and Hybrid Search.

[ User ] (query expression)
   ↓
[ CrateDB ] (data source)
  - JSON filters
  - Full-text search (BM25)
  - FLOAT_VECTOR support (HNSW)
  - SQL + KNN_MATCH
   ↓
[ Application response ]
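As a rough illustration of the query an application might send in this architecture, the sketch below assembles a parameterized statement that combines KNN_MATCH with a JSON filter and an optional full-text MATCH predicate. The `documents` table, its `embedding`, `content`, and `metadata` columns, and the `build_hybrid_query` helper are all hypothetical. Whether these predicates combine well in a single WHERE clause for a given workload is an assumption to verify against your own schema.

```python
def build_hybrid_query(query_vector, k, category, search_terms=None):
    """Return (sql, params) for a KNN search narrowed by JSON and text filters."""
    # Vector similarity plus a filter on a field of an object (JSON) column.
    conditions = ["KNN_MATCH(embedding, ?, ?)", "metadata['category'] = ?"]
    params = [list(query_vector), k, category]
    if search_terms:
        # Optional full-text predicate on a column with a full-text index.
        conditions.append("MATCH(content, ?)")
        params.append(search_terms)
    sql = (
        "SELECT id, content, _score\n"
        "FROM documents\n"
        "WHERE " + "\n  AND ".join(conditions) + "\n"
        "ORDER BY _score DESC\n"
        "LIMIT ?"
    )
    params.append(k)  # cap the overall number of returned rows
    return sql, params
```

The returned pair can be passed to a DB-API cursor's `execute` method, again assuming a driver that uses `?` placeholders.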

Note

For advanced ML engineering tasks and use cases built on established frameworks and libraries, see CrateDB’s support for MLflow and PyCaret at Machine learning.

Text-to-SQL

Text‑to‑SQL lets you query data in natural language with contemporary large language models, optionally offline.

See also