Dask¶
About
Dask is a parallel computing library for analytics with task scheduling. It is built on top of the Python programming language, making it easy to scale the Python libraries that you know and love, like NumPy, pandas, and scikit-learn.
Dask DataFrames help you process large tabular data by parallelizing pandas, either on your laptop for larger-than-memory computing, or on a distributed cluster of computers.
Dask Futures, implementing a real-time task framework, allow you to scale generic Python workflows across a Dask cluster with minimal code changes, by extending Python’s
concurrent.futuresinterface.
Install
pip install 'dask[dataframe]' 'sqlalchemy-cratedb'
Synopsis
Write Dask dataframe to CrateDB.
example.py
import dask.dataframe as dd
from sqlalchemy_cratedb import insert_bulk
CRATEDB_URI = "crate://crate:crate@localhost:4200"
TABLE_NAME = "example"
df = makeTimeDataFrame(rows=500_000, freq="s")
ddf = dd.from_pandas(df, npartitions=4)
ddf.to_sql(
TABLE_NAME,
uri=CRATEDB_URI,
index=False,
if_exists="replace",
chunksize=20_000,
parallel=True,
method=insert_bulk,
)
Quickstart example
Create the file example.py including the synopsis code shared above.
Complete the example by using the makeTimeDataFrame() function.
def makeTimeDataFrame(rows=5_000, freq = "B"):
import numpy as np
import pandas as pd
return pd.DataFrame(
np.random.default_rng(2).standard_normal((rows, 4)),
columns=pd.Index(list("ABCD"), dtype=object),
index=pd.date_range("2000-01-01", periods=rows, freq=freq),
)
Start CrateDB using Docker or Podman, then invoke the example program.
docker run --rm --publish=5432:5432 docker.io/crate '-Cdiscovery.type=single-node'
pip install 'dask[dataframe]' 'sqlalchemy-cratedb'
python example.py
Full example
Connect to CrateDB and CrateDB Cloud using Dask.
Guides
Related sections
Efficient batch/bulk INSERT operations for pandas, Dask, and Polars
Importing Parquet files into CrateDB using Apache Arrow and SQLAlchemy