
Ingesting Time Series Data

Batch Import

CrateDB offers efficient and flexible handling of large datasets, making it ideal for batch imports.

The accompanying Jupyter notebook explains how to import data in batches into CrateDB. It leverages Dask, an open-source library that parallelizes Python code for large-scale computations. With Dask, data can be processed in chunks, which is useful for managing datasets that exceed the memory capacity of a single machine.

By combining CrateDB's full-text indexing with Dask's parallel processing capabilities, large datasets can be imported at high speed and optimized for query performance.
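For illustration, here is a minimal sketch of such a chunked import using Dask's DataFrame API. The file name weather.csv, the table name weather_data, and the connection URI are placeholders; the crate:// SQLAlchemy dialect is assumed to be available via the sqlalchemy-cratedb package.

```python
import dask.dataframe as dd

# Read the source file lazily; Dask splits it into partitions that
# each fit into memory, so the full dataset never has to be loaded
# at once.
ddf = dd.read_csv("weather.csv")

# Write all partitions to CrateDB through SQLAlchemy's "crate://"
# dialect.
ddf.to_sql(
    "weather_data",
    uri="crate://localhost:4200",
    index=False,
    if_exists="append",
    chunksize=10_000,  # rows per bulk INSERT
    parallel=True,     # write partitions concurrently
)
```

With parallel=True, each Dask worker writes its own partitions independently, which is where the bulk of the speedup comes from.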

Stream Ingest

CrateDB is optimized for time series applications that predominantly work with real-time data. Streaming ingestion of this data is crucial for several reasons:

  • Always up-to-date information: The database stays current and accurate, so queries always reflect the latest state of the data.
  • Real-time insights and decision making: Fresh data enables immediate action, which is essential for industries where acting on the latest information yields significant operational improvements.
  • Efficient handling of volume & velocity: Streaming ingest manages the high volume and velocity of time series data, preventing data loss and ensuring system sustainability.

Data can be streamed into CrateDB via a multitude of technologies, thanks to its SQL and PostgreSQL Wire Protocol compatibility. One combination worth highlighting is Apache Kafka with Apache Flink, as both are widely used:

  • Apache Kafka acts as a queue that routes and buffers incoming data, ensuring no data is lost even during ingest spikes.
  • Apache Flink consumes the data from Kafka and incrementally ingests it into CrateDB, facilitating micro-batching.

As CrateDB is consistently updated with the most recent data, services can read from either the stream or directly from CrateDB. This setup provides a robust architecture for handling real-time ingestion and processing of time series data.
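The micro-batching pattern itself can be sketched with a plain Kafka consumer in Python, without Flink. The topic name sensor-readings, the table sensor_data, and the column names are illustrative assumptions; the kafka-python and crate client packages are assumed to be installed.

```python
import json

from crate import client
from kafka import KafkaConsumer

# Consume JSON messages from Kafka; topic and broker are placeholders.
consumer = KafkaConsumer(
    "sensor-readings",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

connection = client.connect("http://localhost:4200")
cursor = connection.cursor()

BATCH_SIZE = 1000
batch = []
for message in consumer:
    record = message.value
    batch.append((record["sensor_id"], record["ts"], record["value"]))
    # Flush in micro-batches: one bulk INSERT per 1000 rows is far
    # cheaper than one INSERT per message.
    if len(batch) >= BATCH_SIZE:
        cursor.executemany(
            "INSERT INTO sensor_data (sensor_id, ts, value) VALUES (?, ?, ?)",
            batch,
        )
        batch.clear()
```

In production, Flink adds fault tolerance, checkpointing, and backpressure handling on top of this basic loop.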

For a practical example of streaming data with Apache Kafka and Apache Flink, please refer to this example application.

Want to read more?

Whitepaper" Guide for Time Series Data Projects