dlt¶
dlt (data load tool)—think ELT as Python code—is a popular, production-ready Python library for moving data. It loads data from various and often messy data sources into well-structured, live datasets.
dlt supports 30+ databases supported by SQLAlchemy, and is also the workhorse behind the ingestr toolkit.
Just code: no need to use any backends or containers.
Platform agnostic: Does not replace your data platform, deployments, or security models. Simply import dlt in your favorite code editor, or add it to your Jupyter Notebook.
Versatile: You can load data from any source that produces Python data structures, including APIs, files, databases, and more.
Synopsis¶
Prerequisites: Install dlt and the CrateDB destination adapter:
pip install --upgrade dlt-cratedb
Load data from cloud storage or files into CrateDB.
import dlt
import dlt_cratedb
from dlt.sources.filesystem import filesystem
resource = filesystem(
bucket_url="s3://example-bucket",
file_glob="*.csv"
)
pipeline = dlt.pipeline(
pipeline_name="filesystem_example",
destination=dlt.destinations.cratedb("postgresql://crate:crate@localhost:5432/"),
dataset_name="doc",
)
pipeline.run(resource)
Load data from SQL databases into CrateDB.
import dlt_cratedb
from dlt.sources.sql_database import sql_database
source = sql_database(
"mysql+pymysql://rfamro@mysql-rfam-public.ebi.ac.uk:4497/Rfam"
)
pipeline = dlt.pipeline(
pipeline_name="sql_database_example",
destination=dlt.destinations.cratedb("postgresql://crate:crate@localhost:5432/"),
dataset_name="doc",
)
pipeline.run(source)
Supported features¶
Data loading¶
Data is loaded into CrateDB using the most efficient method depending on the data source.
For local files, the
psycopg2library is used to directly load files into CrateDB tables using theINSERTcommand.For files in remote storage like S3 or Azure Blob Storage, CrateDB data loading functions are used to read the files and insert the data into tables.
Datasets¶
Use dataset_name="doc" to address CrateDB’s default schema doc.
When addressing other schemas, make sure they contain at least one table. [1]
File formats¶
The SQL INSERT file format is the preferred format for both direct loading and staging.
Column types¶
The cratedb destination has a few specific deviations from the default SQL destinations.
CrateDB does not support the
timedatatype. Time will be loaded to atextcolumn.CrateDB does not support the
binarydatatype. Binary will be loaded to atextcolumn.CrateDB can produce rounding errors under certain conditions when using the
float/doubledatatype. Make sure to use thedecimaldatatype if you can’t afford to have rounding errors.
Column hints¶
CrateDB supports the following column hints.
primary_key- marks the column as part of the primary key. Multiple columns can have this hint to create a composite primary key.
File staging¶
CrateDB supports Amazon S3, Google Cloud Storage, and Azure Blob Storage as file staging destinations.
dlt will upload CSV or JSONL files to the staging location and use CrateDB data loading functions
to load the data directly from the staged files.
Please refer to the filesystem documentation to learn how to configure credentials for the staging destinations.
Invoke a pipeline with staging enabled.
pipeline = dlt.pipeline(
pipeline_name='chess_pipeline',
destination='cratedb',
staging='filesystem', # add this to activate staging
dataset_name='chess_data'
)
dbt support¶
Integration with dbt is generally supported via dbt-cratedb2 but not tested by us.
dlt state sync¶
The CrateDB destination fully supports dlt state sync.
See also¶
Examples
Exercise a canonical dlt init example with CrateDB.
Executable code examples on GitHub that demonstrate how to use dlt with CrateDB.
Resources
dlt-cratedbThe dlt destination adapter for CrateDB is based on the dlt PostgreSQL adapter.
ingestrThe ingestr data import/export application uses dlt as a workhorse.
Video tutorials