Dask

Dask logo
Dask CI

About

Dask is a parallel computing library for analytics with task scheduling. It is built on top of the Python programming language, making it easy to scale the Python libraries that you know and love, like NumPy, pandas, and scikit-learn.

  • Dask DataFrames help you process large tabular data by parallelizing pandas, either on your laptop for larger-than-memory computing, or on a distributed cluster of computers.

  • Dask Futures, implementing a real-time task framework, allow you to scale generic Python workflows across a Dask cluster with minimal code changes, by extending Python’s concurrent.futures interface.

Install

pip install 'dask[dataframe]' 'sqlalchemy-cratedb'

Synopsis

Write Dask dataframe to CrateDB.

example.py

import dask.dataframe as dd
from sqlalchemy_cratedb import insert_bulk

CRATEDB_URI = "crate://crate:crate@localhost:4200"
TABLE_NAME = "example"

df = makeTimeDataFrame(rows=500_000, freq="s")
ddf = dd.from_pandas(df, npartitions=4)
ddf.to_sql(
    TABLE_NAME,
    uri=CRATEDB_URI,
    index=False,
    if_exists="replace",
    chunksize=20_000,
    parallel=True,
    method=insert_bulk,
)

Quickstart example

Create the file example.py including the synopsis code shared above. Complete the example by using the makeTimeDataFrame() function.

def makeTimeDataFrame(rows=5_000, freq = "B"):
    import numpy as np
    import pandas as pd
    return pd.DataFrame(
        np.random.default_rng(2).standard_normal((rows, 4)),
        columns=pd.Index(list("ABCD"), dtype=object),
        index=pd.date_range("2000-01-01", periods=rows, freq=freq),
    )

Start CrateDB using Docker or Podman, then invoke the example program.

docker run --rm --publish=5432:5432 docker.io/crate '-Cdiscovery.type=single-node'
pip install 'dask[dataframe]' 'sqlalchemy-cratedb'
python example.py

Full example

Connect to CrateDB and CrateDB Cloud using Dask.

https://github.com/crate/cratedb-examples/tree/main/by-dataframe/dask

Guides

Related sections