Advanced Indexing and Columnar Storage
CrateDB uses Lucene as a storage engine enabling fast search queries and aggregations. By default, everything gets indexed using various index methods (full-text, geospatial, BKD-trees) for each data type and stored in a column store, also referred to as doc values, to improve query speed for aggregations across large data sets.
Columnar Storage
In conjunction with advanced indexing strategies, CrateDB adopts a columnar storage approach that facilitates fast queries and complex aggregations across large data sets. In CrateDB, each value term is stored in a column store by default, alongside storing the row data as-is and indexing each value automatically. This design provides various advantages:
- Storing data for the same field together optimizes file-system cache utilization. This eliminates the need to load unnecessary data for fields not needed by the query and improves global aggregations and groupings.
- By segmenting data into blocks and incorporating metadata about the range or set of unique values in the block header, certain queries may entirely skip unnecessary blocks during execution. This enables the possibility of ordering as data for one column is packed at one place.
- Implementation of specific techniques allows querying data without decompressing it first.
CREATE TABLE t1 ( id INTEGER, url TEXT INDEX OFF STORAGE WITH (columnstore = false) );
Indexing
When ingesting data, the queries you will make over time may not be clear initially, and use cases tend to evolve, leading to new requirements and query patterns.
To ensure fast query responses for any query type, CrateDB automatically indexes every attribute by default. Depending on the data type, the following strategies are applied:
- Inverted Index for text values: Facilitates efficient search for precise text matches, including support for wildcards and regular expressions. Text columns are using a plain inverted index by default that doesn’t analyze the text. One can also define a full-text index with custom analyzers.
- Block k-d trees (BKD) for numeric, date, and geospatial values: Highly efficient indexes designed for optimal IO. Numeric values, including geopoints, are indexed using BKD-trees. Most data structure resides in on-disk blocks, with a small in-heap binary tree index structure for locating blocks at search time. This design ensures excellent query and update performance regardless of the number of updates performed.
- HNSW (Hierarchical Navigable Small World) graphs for high dimensional vectors: Enables efficient approximate nearest neighbor search, commonly known as similarity search.
Additionally, full-text indexes can be added on-demand to unlock features like fuzzy search, phrase search, and attribute boosting. CrateDB offers over 30 languages, 11 analyzers, 15 tokenizers, more than 35 token filters, and the flexibility for custom analyzers and tokenizers.
Object Storage and Indexing
CrateDB dynamically maps each sub-column to a data type and stores primitive object properties as sub-columns using the above mention column store. Each sub-column is also indexed by default. This dynamic mapping can be disabled by the user.
The efficient indexing through Lucene provides fast search across billions of records.
// obj JSON { "MachineID" : "drill001", "Sensors" : [ { "name" : "temp1", "value": 21.5 }, { "name" : "temp2", "value": 20.8 }, { "name" : "accel", "value": 3.2 } ], "Events" : { "type" : "info"} }
// find all Machines that emit an INFO event and provide a temperature value SELECT obj['MachineID'] as "machine" FROM my_devices WHERE obj['Events']['type'] = 'info' AND 'temp1' = ANY(obj['Sensors']['name']);
CrateDB at Big Data Conference Europe 2022
Not all Time-Series are Equal – Challenges of Storing and Analyzing Industrial Data
Timestamp: 24:06 – 27:00
CrateDB Architecture Guide
This comprehensive guide covers all the key concepts you need to know about CrateDB's architecture. It will help you gain a deeper understanding of what makes it performant, scalable, flexible and easy to use. Armed with this knowledge, you will be better equipped to make informed decisions about when to leverage CrateDB for your data projects.