Machine learning¶

CrateDB provides a vector type natively, and adapters for integrating with machine learning frameworks.

Modern AI and machine learning applications demand efficient storage and retrieval of high-dimensional vectors, seamless integration with ML frameworks, and the ability to combine traditional analytics with semantic search capabilities. From retrieval-augmented generation (RAG) systems to predictive maintenance models, organizations need a unified platform that handles vector embeddings, training datasets, and production model artifacts without juggling multiple specialized systems.

CrateDB unifies vector search, time series analysis, and ML operations in a single platform. Store and query high-dimensional embeddings using native FLOAT_VECTOR support with HNSW-based similarity search, integrate directly with LangChain and LlamaIndex for AI applications, and leverage MLflow and PyCaret for end-to-end MLOps workflows. Whether you’re building semantic search engines, training forecasting models on large time series datasets, or implementing hybrid search combining full-text and vector similarity, CrateDB eliminates data movement and infrastructure complexity.

By keeping vector embeddings, training data, and model metadata in one queryable system, you avoid the overhead of synchronizing between specialized vector databases, data lakes, and model registries. Your ML pipelines remain agile, your queries span structured and vector data seamlessly, and your infrastructure stays lean.

With CrateDB, compatible to PostgreSQL, you can do all of that using plain SQL. Other than integrating well with commodity systems using standard database access interfaces like ODBC or JDBC, it provides a proprietary HTTP interface on top.

Vector store¶

Vector databases can be used for similarity search, multi-modal search, recommendation engines, large language models (LLMs), and other applications.

These applications can answer questions about specific sources of information, for example using techniques like Retrieval Augmented Generation (RAG). RAG is a technique for augmenting LLM knowledge with additional data, often private or real-time.

CrateDB supports high-dimensional vectors with FLOAT_VECTOR, e.g. to store and query word embeddings using HNSW-based nearest neighbor search through SQL.

Documentation: Vector search

CrateDB’s FLOAT_VECTOR data type implements a vector store and the k‑nearest neighbors (k‑NN) search algorithm to find vectors that are similar to a query vector.

Vector search on machine learning embeddings: CrateDB is all you need.

Vector Search

Documentation: Hybrid search

Hybrid search is a technique to enhance relevancy and accuracy by combining traditional full-text with semantic search algorithms, for achieving better accuracy and relevancy than each algorithm would individually.

Combined BM25 term search and vector search based on Apache Lucene, using SQL: CrateDB is all you need.

Hybrid Search

Integration: LangChain

LangChain is a framework for developing applications powered by language models, written in Python, and with a strong focus on composability. It supports retrieval-augmented generation (RAG).

The LangChain adapter lets you use CrateDB as a vector store database, load documents via document loaders, and use LangChain’s conversational memory.

LangChain

Text-to-SQL¶

Integrate CrateDB with Text-to-SQL solutions, and provide MCP and AI enterprise data integrations.

Text-to-SQL with LlamaIndex

Text-to-SQL is a technique that converts natural language queries into SQL queries that can be executed by a database.

LlamaIndex

All about MCP

The Model Context Protocol (MCP) is an open protocol that enables seamless integration between LLM applications and external data sources and tools.

Model Context Protocol (MCP)

MindsDB

MindsDB is the platform for customizing AI from enterprise data.

MindsDB

Time series analysis¶

Load and analyze data from database systems for time series anomaly detection and forecasting.

Time series analysis using ML

End-to-end: Statistical analysis and visualization on huge datasets.
Traditional: Regression analysis within a Jupyter Notebook.
Predictive maintenance: Build a machine learning model to predict machine failures.
Advanced time series analysis: Conduct advanced data analysis on large time series datasets.

Time series analysis

MLOps and model training¶

CrateDB supports MLOps procedures through adapters to best-of-breed software frameworks.

Training a machine learning model, running it in production, and maintaining it, requires a significant amount of data processing and bookkeeping operations.

Machine Learning Operations MLOps is a paradigm that aims to deploy and maintain machine learning models in production reliably and efficiently, including experiment tracking, and in the spirit of continuous development and DevOps.

MLflow

MLflow is an open-source platform to manage the whole ML lifecycle, including experimentation, reproducibility, deployment, and a central model registry.

CrateDB can be used as a storage database for the MLflow Tracking subsystem.

MLflow

PyCaret

PyCaret is an open-source, low-code machine learning library for Python that automates machine learning workflows (AutoML).

CrateDB can be used as a storage database for training and production datasets.

PyCaret

Advanced time series analysis with MLflow and PyCaret

Learn how to conduct advanced data analysis on large time series datasets with CrateDB, MLflow, and PyCaret.

What’s inside: Anomaly detection and forecasting, time series decomposition, exploratory data analysis (EDA).

Advanced time series analysis