CrateDB and DataFrame libraries

DataFrame libraries and frameworks that can be used together with CrateDB.

Tutorials

Learn how to use CrateDB together with popular open-source DataFrame libraries through hands-on tutorials and code examples.

DataFrame Libraries
SQLAlchemy

CrateDB’s SQLAlchemy dialect implementation provides the foundation for integrations with Dask, pandas, and Polars.
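As a minimal sketch of how a DataFrame library talks to a database through SQLAlchemy, the following example writes a pandas DataFrame to a table and reads it back. The helper name `roundtrip` and the table name `demo` are hypothetical. The example uses an in-memory SQLite URL as a stand-in so it runs without a server. With CrateDB, assuming the `sqlalchemy-cratedb` package is installed and a server listens on localhost, the URL would look like `crate://localhost:4200`.

```python
import pandas as pd
import sqlalchemy as sa


def roundtrip(engine_url: str, table: str, frame: pd.DataFrame) -> pd.DataFrame:
    """Write a DataFrame to a table, then read it back (hypothetical helper)."""
    engine = sa.create_engine(engine_url)
    # pandas uses the SQLAlchemy engine to create the table and insert the rows.
    frame.to_sql(table, engine, index=False, if_exists="replace")
    with engine.connect() as connection:
        query = sa.text(f"SELECT * FROM {table} ORDER BY id")
        return pd.read_sql(query, connection)


# Stand-in URL (assumption): in-memory SQLite instead of a CrateDB server.
frame = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})
result = roundtrip("sqlite://", "demo", frame)
print(result)
```

When pointing this at CrateDB, keep in mind that newly written rows become visible to queries only after a table refresh, so reading back immediately after writing may need a `REFRESH TABLE` statement in between.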

Dask

Dask is a parallel computing library for analytics with task scheduling. It is built on top of the Python programming language, making it easy to scale the Python libraries that you know and love, like NumPy, pandas, and scikit-learn.

  • Dask DataFrames help you process large tabular data by parallelizing pandas, either on your laptop for larger-than-memory computing, or on a distributed cluster of computers.

  • Dask Futures, implementing a real-time task framework, allow you to scale generic Python workflows across a Dask cluster with minimal code changes, by extending Python’s concurrent.futures interface.
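The `concurrent.futures` pattern mentioned above can be sketched with the standard library alone. The example below uses a local `ThreadPoolExecutor` so it runs without a cluster; the function `square` is hypothetical. On a Dask cluster, a `dask.distributed.Client` offers a `submit` method and futures with a `result` method in the same style, so the structure of the code would stay largely the same.

```python
from concurrent.futures import ThreadPoolExecutor


def square(value: int) -> int:
    """Toy task standing in for a real unit of work."""
    return value * value


with ThreadPoolExecutor(max_workers=4) as executor:
    # submit() schedules a call and immediately returns a future.
    futures = [executor.submit(square, n) for n in range(5)]
    # result() blocks until the corresponding task has finished.
    results = [future.result() for future in futures]
    # Results of earlier tasks can feed into a follow-up task.
    total = executor.submit(sum, results).result()

print(results, total)
```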

pandas

pandas is a fast, powerful, flexible, and easy-to-use open source library for data analysis and manipulation, written for the Python programming language. In particular, it offers data structures and operations for manipulating numerical tables and time series.

Data Model

  • Pandas is built around data structures called Series and DataFrames. Data for these collections can be imported from various file formats such as comma-separated values, JSON, Parquet, SQL database tables or queries, and Microsoft Excel.

  • A Series is a 1-dimensional data structure built on top of NumPy’s array.

  • Pandas includes support for time series, such as the ability to interpolate values and filter using a range of timestamps.

  • By default, a Pandas index is a series of integers ascending from 0, similar to the indices of Python lists. However, indices can use any NumPy data type, including floating point, timestamps, or strings.

  • Pandas supports hierarchical indices with multiple values per data point. An index with this structure, called a “MultiIndex”, allows a single DataFrame to represent multiple dimensions, similar to a pivot table in Microsoft Excel. Each level of a MultiIndex can be given a unique name.
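The points above can be illustrated with a short sketch. All values, column names, and level names are made up for illustration.

```python
import numpy as np
import pandas as pd

# A Series is one-dimensional; the default index counts up from 0.
readings = pd.Series([20.5, 21.0, 19.5])
default_index = list(readings.index)

# An index can also hold timestamps, which enables time series operations.
timestamps = pd.date_range("2024-01-01", periods=5, freq="D")
temperature = pd.Series([10.0, np.nan, 14.0, np.nan, 18.0], index=timestamps)

# Fill the gaps by linear interpolation between neighbouring values.
filled = temperature.interpolate()

# Filter using a range of timestamps (label slices include both ends).
window = filled.loc["2024-01-02":"2024-01-04"]

# A MultiIndex lets a single DataFrame represent several dimensions.
index = pd.MultiIndex.from_product(
    [["berlin", "vienna"], [2023, 2024]], names=["city", "year"]
)
sales = pd.DataFrame({"revenue": [100, 120, 80, 90]}, index=index)

# Select all rows that belong to one value of the outer level.
vienna = sales.loc["vienna"]
print(vienna)
```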

Polars

Polars is a blazingly fast DataFrames library with language bindings for Rust, Python, Node.js, R, and SQL. It is open source, written in Rust, and powered by a multithreaded, vectorized query engine.

  • Fast: Written from scratch in Rust and with performance in mind, designed close to the machine, and without external dependencies.

  • I/O: First class support for all common data storage layers: local, cloud storage & databases.

  • Intuitive API: Write your queries the way they were intended. Polars, internally, will determine the most efficient way to execute using its query optimizer. Polars’ expressions are intuitive and empower you to write readable and performant code at the same time.

  • Out of Core: The streaming API allows you to process your results without requiring all your data to be in memory at the same time.

  • Parallel: Polars’ multi-threaded query engine utilises the power of your machine by dividing the workload amongst the available CPU cores without any additional configuration.

  • Vectorized Query Engine: Uses Apache Arrow, a columnar data format, to process your queries in a vectorized manner and SIMD to optimize CPU usage. This enables cache-coherent algorithms and high performance on modern processors.

  • Open Source: Polars is and always will be open source, driven by an active community of developers. Everyone is encouraged to add new features and contribute. It is free to use under the MIT license.

Data formats

Polars supports reading from and writing to many common data formats. This allows you to easily integrate Polars into your existing data stack.

  • Text: CSV & JSON

  • Binary: Parquet, Delta Lake, Avro & Excel

  • IPC: Feather, Arrow

  • Databases: MySQL, Postgres, SQL Server, SQLite, Redshift & Oracle

  • Cloud Storage: S3, Azure Blob & Azure File