Machine Learning with CrateDB

Machine learning applications and frameworks that can be used together with CrateDB.


Learn how to integrate CrateDB with machine learning frameworks and tools for MLOps and vector database operations.



LangChain is a framework for developing applications powered by language models, written in Python, with a strong focus on composability. As a language model integration framework, LangChain’s use cases largely overlap with those of language models in general, including document analysis and summarization, chatbots, and code analysis.

LangChain supports retrieval-augmented generation (RAG), a technique for augmenting LLM knowledge with additional, often private or real-time, data. RAG is typically combined with prompt engineering, the process of structuring text so that a generative AI model can interpret and understand it. A prompt is natural language text describing the task that an AI should perform.
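The retrieve-then-prompt idea behind RAG can be sketched in plain Python. This is a minimal illustration, not LangChain's implementation: the bag-of-words `embed` function is a toy stand-in for a real embedding model, and the names `embed`, `retrieve`, and `build_prompt` as well as the sample documents are assumptions made for this example.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9_]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(question: str, documents: list[str], k: int = 2) -> list[str]:
    """Return the k documents most similar to the question."""
    query_vector = embed(question)
    ranked = sorted(
        documents,
        key=lambda doc: cosine_similarity(query_vector, embed(doc)),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question: str, context: list[str]) -> str:
    """Structure retrieved passages and the question into a single prompt."""
    context_block = "\n".join(f"- {passage}" for passage in context)
    return (
        "Answer the question using only the context below.\n"
        f"Context:\n{context_block}\n"
        f"Question: {question}"
    )


# Hypothetical sample data for illustration only.
documents = [
    "CrateDB can be used as a vector store database.",
    "MLflow manages the ML lifecycle.",
    "PyCaret automates machine learning workflows.",
]
question = "Which database can serve as a vector store?"
prompt = build_prompt(question, retrieve(question, documents, k=1))
print(prompt)
```

In a real application, the prompt would be passed to a language model, and the in-memory list would be replaced by a vector store.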

The LangChain adapter for CrateDB lets you use CrateDB as a vector store database and load documents using LangChain’s DocumentLoader. It also supports LangChain’s conversational memory subsystem.
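To give a rough idea of what a vector store does at the SQL level, the following sketch builds statements using CrateDB's `FLOAT_VECTOR` column type and `KNN_MATCH` search function. It does not show the LangChain adapter's API or its actual schema: the helper names, the table layout, and the column names `content` and `embedding` are assumptions for illustration.

```python
def create_table_sql(table: str, dimensions: int) -> str:
    """Build a CREATE TABLE statement with a vector column (hypothetical schema)."""
    return (
        f"CREATE TABLE IF NOT EXISTS {table} "
        f"(content TEXT, embedding FLOAT_VECTOR({int(dimensions)}))"
    )


def knn_search_sql(table: str, query_vector: list[float], k: int) -> str:
    """Build a nearest-neighbour query ordered by relevance score.

    The table name is interpolated directly, so it must come from trusted code.
    Vector components and k are coerced to numbers before formatting.
    """
    vector_literal = "[" + ", ".join(str(float(x)) for x in query_vector) + "]"
    return (
        f"SELECT content, _score FROM {table} "
        f"WHERE KNN_MATCH(embedding, {vector_literal}, {int(k)}) "
        f"ORDER BY _score DESC LIMIT {int(k)}"
    )


print(create_table_sql("docs", 2))
print(knn_search_sql("docs", [0.1, 0.2], 3))
```

The generated statements could be run with any CrateDB client. The sketch only produces strings, so it needs no database connection.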


MLflow is an open source platform to manage the whole ML lifecycle, including experimentation, reproducibility, deployment, and a central model registry.

The MLflow adapter for CrateDB, available through the mlflow-cratedb package, lets you use CrateDB as the storage database for the MLflow Tracking subsystem, which records and queries experiments across code, data, configuration, and results.


PyCaret is an open-source, low-code machine learning library for Python that automates machine learning workflows.

It is a high-level interface and AutoML wrapper on top of popular machine learning libraries such as scikit-learn, xgboost, ray, lightgbm, and many more. PyCaret provides a universal interface to these libraries, so you can use them without needing to know the details of the underlying model architectures and parameters.



scikit-learn

Machine Learning in Python.

  • Simple and efficient tools for predictive data analysis

  • Accessible to everybody, and reusable in various contexts

  • Built on NumPy, SciPy, and matplotlib


pandas

The open source data analysis and manipulation tool.

Pandas is a software library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series.

Project Jupyter

Interactive computing across all programming languages.

JupyterLab is the latest web-based interactive development environment for notebooks, code, and data. Its flexible interface allows users to configure and arrange workflows in data science, scientific computing, computational journalism, and machine learning. A modular design invites extensions to expand and enrich functionality.