dlt¶

dlt logo dlt

dlt (data load tool)—think ELT as Python code—is a popular, production-ready Python library for moving data. It loads data from various and often messy data sources into well-structured, live datasets.

dlt supports 30+ databases supported by SQLAlchemy, and is also the workhorse behind the ingestr toolkit.

Just code: no need to use any backends or containers.
Platform agnostic: Does not replace your data platform, deployments, or security models. Simply import dlt in your favorite code editor, or add it to your Jupyter Notebook.
Versatile: You can load data from any source that produces Python data structures, including APIs, files, databases, and more.

Synopsis¶

Prerequisites: Install dlt and the CrateDB destination adapter:

pip install --upgrade dlt-cratedb

Load data from cloud storage or files into CrateDB.

import dlt
import dlt_cratedb
from dlt.sources.filesystem import filesystem

resource = filesystem(
    bucket_url="s3://example-bucket",
    file_glob="*.csv"
)

pipeline = dlt.pipeline(
    pipeline_name="filesystem_example",
    destination=dlt.destinations.cratedb("postgresql://crate:crate@localhost:5432/"),
    dataset_name="doc",
)

pipeline.run(resource)

Load data from SQL databases into CrateDB.

import dlt_cratedb
from dlt.sources.sql_database import sql_database

source = sql_database(
    "mysql+pymysql://rfamro@mysql-rfam-public.ebi.ac.uk:4497/Rfam"
)

pipeline = dlt.pipeline(
    pipeline_name="sql_database_example",
    destination=dlt.destinations.cratedb("postgresql://crate:crate@localhost:5432/"),
    dataset_name="doc",
)

pipeline.run(source)

Supported features¶

Data loading¶

Data is loaded into CrateDB using the most efficient method depending on the data source.

For local files, the psycopg2 library is used to directly load files into CrateDB tables using the INSERT command.
For files in remote storage like S3 or Azure Blob Storage, CrateDB data loading functions are used to read the files and insert the data into tables.

Datasets¶

Use dataset_name="doc" to address CrateDB’s default schema doc. When addressing other schemas, make sure they contain at least one table. [1]

File formats¶

The SQL INSERT file format is the preferred format for both direct loading and staging.

Column types¶

The cratedb destination has a few specific deviations from the default SQL destinations.

CrateDB does not support the time datatype. Time will be loaded to a text column.
CrateDB does not support the binary datatype. Binary will be loaded to a text column.
CrateDB can produce rounding errors under certain conditions when using the float/double datatype. Make sure to use the decimal datatype if you can’t afford to have rounding errors.

Column hints¶

CrateDB supports the following column hints.

primary_key - marks the column as part of the primary key. Multiple columns can have this hint to create a composite primary key.

File staging¶

CrateDB supports Amazon S3, Google Cloud Storage, and Azure Blob Storage as file staging destinations.

dlt will upload CSV or JSONL files to the staging location and use CrateDB data loading functions to load the data directly from the staged files.

Please refer to the filesystem documentation to learn how to configure credentials for the staging destinations.

Invoke a pipeline with staging enabled.

pipeline = dlt.pipeline(
  pipeline_name='chess_pipeline',
  destination='cratedb',
  staging='filesystem',  # add this to activate staging
  dataset_name='chess_data'
)

dbt support¶

Integration with dbt is generally supported via dbt-cratedb2 but not tested by us.

dlt state sync¶

The CrateDB destination fully supports dlt state sync.

dlt¶

Synopsis¶

Supported features¶

Data loading¶

Datasets¶

File formats¶

Column types¶

Column hints¶

File staging¶

dbt support¶

dlt state sync¶

See also¶