
Batch Ingestion

Efficient bulk loading for historical datasets, scheduled pipelines, and large file imports. Unify batch workflows with real-time analytics in one database.

CrateDB supports high-volume batch ingestion for loading historical datasets, scheduled imports, and lakehouse extractions. You can bring large files from object storage, ETL tools, and data lakes into CrateDB and make them queryable within seconds.
Batch ingestion complements streaming pipelines by unifying cold and hot data inside one database.

Why batch ingestion

Batch ingestion provides a reliable and efficient way to load structured, semi-structured, or unstructured files into CrateDB. It is ideal for scenarios where data arrives in chunks, is generated by legacy systems, or is produced at scheduled intervals.

With batch ingestion, you can:

  • Consolidate historical and real-time datasets in one database
  • Enrich operational analytics with large backfills
  • Import data from data lakes and lakehouses
  • Run nightly or periodic updates without operational complexity
  • Scale ingestion by using the full capacity of the cluster

CrateDB automatically distributes incoming data across the cluster and indexes it so it becomes instantly available for analytics, search, and AI workloads.
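
How incoming rows are spread across the cluster follows from the target table's shard and partition layout. A minimal sketch, assuming a hypothetical sensor_readings table; the column names, shard count, and daily partitioning are illustrative choices rather than defaults:

```sql
-- Hypothetical target table; names, shard count, and partitioning are illustrative.
CREATE TABLE IF NOT EXISTS sensor_readings (
    device_id   TEXT,
    reading_ts  TIMESTAMP WITH TIME ZONE,
    reading_day TIMESTAMP WITH TIME ZONE
                GENERATED ALWAYS AS date_trunc('day', reading_ts),
    value       DOUBLE PRECISION
)
CLUSTERED INTO 6 SHARDS       -- rows are distributed across shards on different nodes
PARTITIONED BY (reading_day); -- each day of data becomes its own partition
```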

How it works

Batch ingestion integrates with file-based and lakehouse-based workflows. CrateDB processes each file in parallel across the cluster, taking advantage of distributed compute for fast loading. Files can be loaded as they appear in object storage or through scheduled pipelines.

CrateDB supports ingestion through:

  • SQL COPY for high-throughput bulk loads (see the sketch after this list)
  • External file connectors built on OSS standards
  • Cloud storage triggers that load files as they arrive
  • ETL and ELT jobs orchestrated by tools like Airflow or dbt
  • Imports from data lakes or lakehouses that store data in Parquet and similar formats
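
As a minimal sketch of the first option, assuming a hypothetical sensor_readings table and file path (none of these names are prescribed by CrateDB):

```sql
-- Bulk-load JSON files from a path visible to the cluster nodes;
-- table name and path are hypothetical.
COPY sensor_readings
FROM 'file:///data/imports/readings-*.json'
WITH (bulk_size = 10000)  -- rows per internal insert batch; tune for your cluster
RETURN SUMMARY;           -- reports rows loaded and any per-file errors
```

RETURN SUMMARY makes row counts and errors visible per file, which is useful when a single statement loads many files at once.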

Supported sources and formats

  • File-based ingestion via SQL COPY FROM or via the UI / Cloud Console (see the example after this list)
  • File formats: CSV, JSON (JSON-Lines, JSON documents, or JSON arrays), and Parquet
  • Compression: gzip, when specified, for supported file formats
  • Sources: local filesystem, remote URLs, and cloud object storage (e.g. S3 or other storage accessible via URL)
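
A sketch of how these options combine in a single statement, assuming gzip-compressed CSV files in an S3 bucket; the bucket name, credentials, and path are placeholders:

```sql
-- Load gzip-compressed CSV files directly from object storage.
COPY sensor_readings
FROM 's3://my-access-key:my-secret-key@example-bucket/exports/*.csv.gz'
WITH (
    format = 'csv',        -- file format of the source files
    compression = 'gzip'   -- source files are gzip-compressed
)
RETURN SUMMARY;
```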

Batch ingestion use cases

Batch ingestion unlocks value in scenarios such as:

  • Analytics on historical sensor data stored in object storage
  • Merging years of archives with fresh IoT streams
  • Importing large customer or transaction files
  • Loading data for model training or feature stores
  • Periodic refreshes of enterprise datasets
  • Backfilling missing records after downtime in upstream systems

Batch ingestion can be combined with streaming ingestion, so you can unify batch and real-time pipelines in one database.

Example workflow

An analytics team stores several years of device data in Parquet files on S3. Using CrateDB Cloud, they import the files through SQL COPY and make the entire dataset queryable in minutes.

New Parquet files land on S3 daily and are automatically detected by their orchestration tool, which triggers a bulk load into CrateDB. Once ingested, the data is indexed for aggregations, hybrid search, and vector queries.
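
The recurring load itself can remain a single statement that the orchestration tool issues on each run. The sketch below assumes a hypothetical device_data table, a dated S3 prefix, and files in a format accepted by COPY FROM; all of these details would be adapted to the actual pipeline:

```sql
-- Statement a scheduled job might run after a new day's files land;
-- table name, bucket, prefix, and credentials are placeholders.
COPY device_data
FROM 's3://access-key:secret-key@device-archive/ingest/2024-11-05/*'
WITH (shared = true)  -- treat the URI as storage visible to every node so files are read only once
RETURN SUMMARY;
```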


CrateDB architecture guide

This comprehensive guide covers all the key concepts you need to know about CrateDB's architecture. It will help you gain a deeper understanding of what makes it performant, scalable, flexible, and easy to use. Armed with this knowledge, you will be better equipped to make informed decisions about when to leverage CrateDB for your data projects.

