
Batch Ingestion

Efficient bulk loading for historical datasets, scheduled pipelines, and large file imports. Unify batch workflows with real-time analytics in one database.

CrateDB supports high-volume batch ingestion for loading historical datasets, scheduled imports, and lakehouse extractions. You can bring large files from object storage, ETL tools, and data lakes into CrateDB and make them queryable within seconds.
Batch ingestion complements streaming pipelines by unifying cold and hot data inside one database.

Why batch ingestion

Batch ingestion provides a reliable and efficient way to load structured, semi-structured, or unstructured files into CrateDB. It is ideal for scenarios where data arrives in chunks, is generated by legacy systems, or is produced at scheduled intervals.

With batch ingestion, you can:

  • Consolidate historical and real-time datasets in one database
  • Enrich operational analytics with large backfills
  • Import data from data lakes and lakehouses
  • Run nightly or periodic updates without operational complexity
  • Scale ingestion by using the full capacity of the cluster

CrateDB automatically distributes incoming data across the cluster and indexes it so it becomes instantly available for analytics, search, and AI workloads.
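
How incoming rows are spread across the cluster follows from the target table's shard and partition layout. A minimal sketch, assuming a hypothetical sensor_readings table; the column names, shard count, and daily partitioning are illustrative choices rather than defaults:

```sql
-- Hypothetical target table; names, shard count, and partitioning are illustrative.
CREATE TABLE IF NOT EXISTS sensor_readings (
    device_id   TEXT,
    reading_ts  TIMESTAMP WITH TIME ZONE,
    reading_day TIMESTAMP WITH TIME ZONE
                GENERATED ALWAYS AS date_trunc('day', reading_ts),
    value       DOUBLE PRECISION
)
CLUSTERED INTO 6 SHARDS       -- rows are distributed across shards on different nodes
PARTITIONED BY (reading_day); -- each day of data becomes its own partition
```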

How it works

Batch ingestion integrates with file-based and lakehouse-based workflows. CrateDB processes each file in parallel across the cluster, taking advantage of distributed compute for fast loading. Files can be loaded as they appear in object storage or through scheduled pipelines.

CrateDB supports ingestion through:

  • SQL COPY for high-throughput bulk loads (see the sketch after this list)
  • External file connectors built on OSS standards
  • Cloud storage triggers that load files as they arrive
  • ETL and ELT jobs orchestrated by tools like Airflow or dbt
  • Imports from data lakes or lakehouses that store data in Parquet and similar formats
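
As a minimal sketch of the first option, assuming a hypothetical sensor_readings table and file path (none of these names are prescribed by CrateDB):

```sql
-- Bulk-load JSON files from a path visible to the cluster nodes;
-- table name and path are hypothetical.
COPY sensor_readings
FROM 'file:///data/imports/readings-*.json'
WITH (bulk_size = 10000)  -- rows per internal insert batch; tune for your cluster
RETURN SUMMARY;           -- reports rows loaded and any per-file errors
```

RETURN SUMMARY makes row counts and errors visible per file, which is useful when a single statement loads many files at once.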

Supported sources and formats

  • File-based ingestion via SQL COPY FROM or via the UI / Cloud Console (see the example after this list)
  • File formats: CSV, JSON (JSON-Lines, JSON documents, or JSON arrays), and Parquet
  • Compression: gzip, when specified, for supported file formats
  • Sources: local filesystem, remote URLs, and cloud object storage (e.g. S3 or other storage accessible via URL)
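
A sketch of how these options combine in a single statement, assuming gzip-compressed CSV files in an S3 bucket; the bucket name, credentials, and path are placeholders:

```sql
-- Load gzip-compressed CSV files directly from object storage.
COPY sensor_readings
FROM 's3://my-access-key:my-secret-key@example-bucket/exports/*.csv.gz'
WITH (
    format = 'csv',        -- file format of the source files
    compression = 'gzip'   -- source files are gzip-compressed
)
RETURN SUMMARY;
```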

Batch ingestion use cases

Batch ingestion unlocks value in scenarios such as:

  • Analytics on historical sensor data stored in object storage
  • Merging years of archives with fresh IoT streams
  • Importing large customer or transaction files
  • Loading data for model training or feature stores
  • Periodic refreshes of enterprise datasets
  • Backfilling missing records after downtime in upstream systems

Batch ingestion can be combined with streaming ingestion, so you can unify batch and real-time pipelines in one database.

Example workflow

An analytics team stores several years of device data in Parquet files on S3. Using CrateDB Cloud, they import the files through SQL COPY and make the entire dataset queryable in minutes.

New Parquet files land on S3 daily and are automatically detected by their orchestration tool, which triggers a bulk load into CrateDB. Once ingested, the data is indexed for aggregations, hybrid search, and vector queries.
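
The recurring load itself can remain a single statement that the orchestration tool issues on each run. The sketch below assumes a hypothetical device_data table, a dated S3 prefix, and files in a format accepted by COPY FROM; all of these details would be adapted to the actual pipeline:

```sql
-- Statement a scheduled job might run after a new day's files land;
-- table name, bucket, prefix, and credentials are placeholders.
COPY device_data
FROM 's3://access-key:secret-key@device-archive/ingest/2024-11-05/*'
WITH (shared = true)  -- treat the URI as storage visible to every node so files are read only once
RETURN SUMMARY;
```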


CrateDB architecture guide

This comprehensive guide covers all the key concepts you need to know about CrateDB's architecture. It will help you gain a deeper understanding of what makes it performant, scalable, flexible, and easy to use. Armed with this knowledge, you will be better equipped to make informed decisions about when to leverage CrateDB for your data projects.

