Machine learning
CrateDB provides a vector type natively, and adapters for integrating with machine learning frameworks.
Modern AI and machine learning applications demand efficient storage and retrieval of high-dimensional vectors, seamless integration with ML frameworks, and the ability to combine traditional analytics with semantic search capabilities. From retrieval-augmented generation (RAG) systems to predictive maintenance models, organizations need a unified platform that handles vector embeddings, training datasets, and production model artifacts without juggling multiple specialized systems.
CrateDB unifies vector search, time series analysis, and ML operations in a single platform. Store and query high-dimensional embeddings using native FLOAT_VECTOR support with HNSW-based similarity search, integrate directly with LangChain and LlamaIndex for AI applications, and leverage MLflow and PyCaret for end-to-end MLOps workflows. Whether you’re building semantic search engines, training forecasting models on large time series datasets, or implementing hybrid search combining full-text and vector similarity, CrateDB eliminates data movement and infrastructure complexity.
By keeping vector embeddings, training data, and model metadata in one queryable system, you avoid the overhead of synchronizing between specialized vector databases, data lakes, and model registries. Your ML pipelines remain agile, your queries span structured and vector data seamlessly, and your infrastructure stays lean.
Because CrateDB is compatible with PostgreSQL, you can do all of that using plain SQL. Besides integrating well with commodity systems through standard database access interfaces like ODBC or JDBC, it also provides a proprietary HTTP interface.
Vector store
Vector databases can be used for similarity search, multi-modal search, recommendation engines, large language models (LLMs), and other applications.
These applications can answer questions about specific sources of information, for example by using retrieval-augmented generation (RAG), a technique for augmenting LLM knowledge with additional data, often private or real-time.
CrateDB supports high-dimensional vectors with FLOAT_VECTOR, for example to store and query word embeddings using HNSW-based nearest neighbor search through SQL.
CrateDB’s FLOAT_VECTOR data type implements a vector store and the k‑nearest neighbors (k‑NN) search algorithm to find vectors that are similar to a query vector.
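To make the idea of k‑NN search concrete, here is a minimal sketch in plain Python. It is an exact brute-force search over a handful of made-up three-dimensional vectors, not CrateDB's HNSW-based implementation, and it assumes Euclidean distance as the similarity measure. The names `euclidean`, `knn`, and `embeddings` are hypothetical, as are the table and column names in the SQL string, which reflects our understanding of CrateDB's `knn_match` function and should be checked against the reference documentation.

```python
import math


def euclidean(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn(query, vectors, k):
    """Return the k (key, distance) pairs closest to `query`, nearest first."""
    scored = [(key, euclidean(query, vec)) for key, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1])
    return scored[:k]


# Hypothetical toy embeddings; real embeddings have hundreds of dimensions.
embeddings = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.2, 0.1],
    "car": [0.1, 0.9, 0.7],
}

nearest = knn([1.0, 0.0, 0.0], embeddings, k=2)
print([key for key, _ in nearest])

# Assumed SQL counterpart; table and column names are illustrative only.
KNN_QUERY = """
SELECT word, _score
FROM word_embeddings
WHERE knn_match(embedding, [1.0, 0.0, 0.0], 2)
ORDER BY _score DESC
"""
```

A brute-force scan compares the query against every stored vector, which is why databases use an index such as HNSW to approximate the same result on large tables.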
Hybrid search combines traditional full-text search with semantic search to achieve better relevancy and accuracy than either algorithm would individually.
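One common way to merge a full-text result list with a vector result list is reciprocal rank fusion (RRF). The sketch below is an illustration of that general technique in plain Python, not a description of how CrateDB combines results. The function name, the document identifiers, and the constant `k=60` (a value often used for RRF) are assumptions.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranked list.

    Each document earns 1 / (k + rank) per list it appears in, so items
    ranked highly by several retrievers rise to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending score; break ties by id for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical result lists, best hit first.
fulltext_hits = ["doc-a", "doc-b", "doc-c"]
vector_hits = ["doc-c", "doc-a", "doc-d"]

fused = reciprocal_rank_fusion([fulltext_hits, vector_hits])
print(fused)
```

Because RRF works on ranks rather than raw scores, it does not require the full-text and vector scores to be on the same scale.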
LangChain is a framework for developing applications powered by language models, written in Python, and with a strong focus on composability. It supports retrieval-augmented generation (RAG).
Text-to-SQL
Integrate CrateDB with Text-to-SQL solutions, and connect it to LLM applications through MCP and enterprise AI data integrations.
Text-to-SQL is a technique that converts natural language queries into SQL queries that can be executed by a database.
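As a rough sketch of the moving parts, a Text-to-SQL pipeline typically hands the language model a description of the schema together with the question, then checks the generated statement before running it. The example below shows only those two steps in plain Python and leaves out the model call. The function names, the prompt wording, and the `sensors` table are hypothetical, and the read-only check is deliberately crude.

```python
def build_prompt(question, schema):
    """Assemble an LLM prompt from a question and a {table: {column: type}} schema."""
    lines = ["Translate the question into one read-only SQL query.", "", "Schema:"]
    for table, columns in schema.items():
        cols = ", ".join(f"{name} {ctype}" for name, ctype in columns.items())
        lines.append(f"  {table}({cols})")
    lines += ["", f"Question: {question}", "SQL:"]
    return "\n".join(lines)


def is_read_only(sql):
    """Crude guard: accept a single statement that starts with SELECT.

    Any statement containing a second ';' is rejected, even inside a
    string literal, which errs on the side of caution.
    """
    statement = sql.strip().rstrip(";").strip()
    return statement.upper().startswith("SELECT") and ";" not in statement


# Hypothetical schema and question.
schema = {"sensors": {"ts": "TIMESTAMP", "device": "TEXT", "temperature": "DOUBLE"}}
prompt = build_prompt("What was the average temperature per device?", schema)
print(prompt)
```

A production setup would add proper SQL parsing and database-level permissions rather than rely on a string check.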
The Model Context Protocol (MCP) is an open protocol that enables seamless integration between LLM applications and external data sources and tools.
MindsDB is a platform for customizing AI from enterprise data.
Time series analysis
Load and analyze data from database systems for time series anomaly detection and forecasting.
End-to-end: Statistical analysis and visualization on huge datasets.
Traditional: Regression analysis within a Jupyter Notebook.
Predictive maintenance: Build a machine learning model to predict machine failures.
Advanced time series analysis: Conduct advanced data analysis on large time series datasets.
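As a small illustration of time series anomaly detection, the sketch below flags readings that deviate strongly from a trailing window, using a rolling z-score. This is one simple statistical approach chosen for illustration, not the method used in the tutorials listed above. The function name, the window size, the threshold, and the sample readings are all assumptions.

```python
import statistics


def zscore_anomalies(values, window=5, threshold=3.0):
    """Return indices of values that lie more than `threshold` standard
    deviations away from the mean of the preceding `window` values."""
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mean = statistics.mean(history)
        stdev = statistics.stdev(history)
        if stdev == 0:
            # A perfectly flat window gives no scale to compare against.
            continue
        if abs(values[i] - mean) / stdev > threshold:
            anomalies.append(i)
    return anomalies


# Hypothetical temperature readings with one obvious spike.
readings = [20.1, 20.3, 19.9, 20.0, 20.2, 20.1, 35.0, 20.0, 20.2, 19.8]
print(zscore_anomalies(readings))
```

In practice, the readings would be loaded from the database with a SQL query, and the window and threshold would be tuned to the data.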
MLOps and model training
CrateDB supports MLOps procedures through adapters to best-of-breed software frameworks.
Training a machine learning model, running it in production, and maintaining it require a significant amount of data processing and bookkeeping.
Machine Learning Operations (MLOps) is a paradigm that aims to deploy and maintain machine learning models in production reliably and efficiently, including experiment tracking, in the spirit of continuous development and DevOps.
MLflow is an open-source platform to manage the whole ML lifecycle, including experimentation, reproducibility, deployment, and a central model registry.
PyCaret is an open-source, low-code machine learning library for Python that automates machine learning workflows (AutoML).
Learn how to conduct advanced data analysis on large time series datasets with CrateDB, MLflow, and PyCaret.