Indexing, Columnar Storage, and Aggregations

CrateDB uses Apache Lucene as its storage engine, which enables fast search queries and aggregations. By default, every column is indexed with a method suited to its data type (for example, inverted indexes for text and BKD trees for numeric, date, and geospatial values). Every column is also stored in a column store, also referred to as doc values, to speed up aggregations across large data sets.

Columnar Storage

In conjunction with its indexing strategies, CrateDB uses columnar storage to enable fast queries and complex aggregations across large data sets. By default, each value is stored in a column store, in addition to being stored as part of the original row and indexed automatically. This design provides several advantages:

Storing the data for a field together makes better use of the file-system cache. Data for fields the query does not need is never loaded, which speeds up global aggregations and groupings.
By segmenting data into blocks and storing metadata about the range or set of unique values in each block header, certain queries can skip irrelevant blocks entirely during execution. Packing the data for one column in one place also makes it possible to order that data.
Specific techniques allow data to be queried without decompressing it first.
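
As a rough illustration of the kind of query that benefits from the column store, the sketch below aggregates over a hypothetical `sensor_readings` table (the table and all column names are assumptions, not taken from CrateDB material). It also shows the `columnstore` storage option, which can switch the column store off for a column that is only ever fetched and never aggregated or sorted. Which data types accept this option depends on the CrateDB version, so treat that part as something to verify against the documentation.

```sql
-- Hypothetical table; all names are illustrative only.
CREATE TABLE sensor_readings (
    machine_id TEXT,
    ts TIMESTAMP WITH TIME ZONE,
    temperature DOUBLE PRECISION,
    -- Assumption: this column is never filtered, aggregated, or sorted on,
    -- so both its index and its column store are switched off.
    raw_payload TEXT INDEX OFF STORAGE WITH (columnstore = false)
);

-- Only machine_id and temperature have to be read from the column store;
-- ts and raw_payload are not touched by this query.
SELECT machine_id,
       avg(temperature) AS avg_temp,
       max(temperature) AS max_temp
FROM sensor_readings
GROUP BY machine_id
ORDER BY avg_temp DESC;
```

Switching the column store off trades aggregation and sorting speed on that column for lower disk usage, so it is usually only worth considering for large values that are fetched as-is.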

Indexing

When you first ingest data, it is often unclear which queries you will need over time. Use cases also tend to evolve, bringing new requirements and query patterns.

To ensure fast query responses for any query type, CrateDB automatically indexes every attribute by default. Depending on the data type, the following strategies are applied:

Inverted index for text values: Enables efficient search for exact text matches, including support for wildcards and regular expressions. By default, text columns use a plain inverted index that does not analyze the text. You can also define a full-text index with custom analyzers.
Block k-d trees (BKD) for numeric, date, and geospatial values: Highly efficient indexes designed for optimal I/O. Most of the data structure resides in on-disk blocks, with a small in-heap binary tree index used to locate blocks at search time. This design ensures excellent query and update performance regardless of the number of updates performed.
HNSW (Hierarchical Navigable Small World) graphs for high-dimensional vectors: Enable efficient approximate nearest neighbor search, commonly known as similarity search.
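
The sketch below shows how these default indexes might be exercised from SQL. The `machines` table, its columns, and the sample values are all hypothetical. The vector part assumes a CrateDB version that provides the `FLOAT_VECTOR` type and the `knn_match` function.

```sql
-- Hypothetical table; no index definitions are needed, the defaults apply.
CREATE TABLE machines (
    name TEXT,                              -- plain inverted index
    temperature DOUBLE PRECISION,           -- BKD tree
    installed_at TIMESTAMP WITH TIME ZONE,  -- BKD tree
    location GEO_POINT,                     -- BKD tree
    embedding FLOAT_VECTOR(3)               -- HNSW graph
);

-- Exact match and wildcard search on a text column.
SELECT name FROM machines WHERE name = 'drill001';
SELECT name FROM machines WHERE name LIKE 'drill%';

-- Range filters on numeric and date values.
SELECT name
FROM machines
WHERE temperature BETWEEN 20.0 AND 25.0
  AND installed_at >= '2022-01-01';

-- Geospatial filter: machines within 1000 meters of a point (WKT is lon lat).
SELECT name
FROM machines
WHERE distance(location, 'POINT (9.74 47.41)') < 1000;

-- Approximate nearest neighbor search around a query vector.
SELECT name, _score
FROM machines
WHERE knn_match(embedding, [0.1, 0.2, 0.3], 2)
ORDER BY _score DESC
LIMIT 2;
```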

Additionally, full-text indexes can be added on demand to unlock features like fuzzy search, phrase search, and attribute boosting. CrateDB supports over 30 languages and provides 11 analyzers, 15 tokenizers, and more than 35 token filters, along with the flexibility to define custom analyzers and tokenizers.
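
A minimal sketch of these full-text features follows. It assumes a hypothetical `articles` table whose full-text indexes are declared when the table is created, using the built-in `english` analyzer. The search terms and the boost factor are illustrative only.

```sql
-- Hypothetical table with one analyzed full-text index per text column.
CREATE TABLE articles (
    id INTEGER PRIMARY KEY,
    title TEXT INDEX USING FULLTEXT WITH (analyzer = 'english'),
    body TEXT INDEX USING FULLTEXT WITH (analyzer = 'english')
);

-- Fuzzy search: tolerate one edit, so the typo 'columar' can still match.
SELECT id, title, _score
FROM articles
WHERE MATCH(body, 'columar') USING best_fields WITH (fuzziness = 1)
ORDER BY _score DESC;

-- Phrase search: the terms must appear together as a phrase.
SELECT id, title
FROM articles
WHERE MATCH(body, 'column store') USING phrase;

-- Attribute boosting: hits in title count twice as much as hits in body.
SELECT id, title, _score
FROM articles
WHERE MATCH((title 2.0, body), 'aggregation')
ORDER BY _score DESC;
```

The `_score` system column carries the relevance score of each match, which is why the examples sort on it.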

// example value of the obj column, shown as JSON
{
   "MachineID" : "drill001",
   "Sensors" : [
     { "name" : "temp1", "value": 21.5 },
     { "name" : "temp2", "value": 20.8 },
     { "name" : "accel", "value": 3.2 }
  ],
   "Events" : { "type" : "info"}
}

-- Find all machines that emit an info event and report a temp1 sensor value.

SELECT obj['MachineID'] AS "machine"
FROM my_devices
WHERE obj['Events']['type'] = 'info'
AND 'temp1' = ANY(obj['Sensors']['name']);
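
To try the query above, a table and a row are needed. The following is one possible setup, not necessarily the schema behind the original example. It assumes `my_devices` has a single dynamic object column named `obj` and inserts the JSON document shown earlier as an object literal.

```sql
-- Assumption: one dynamic object column; new keys are added to the schema
-- and indexed automatically as they appear.
CREATE TABLE my_devices (
    obj OBJECT(DYNAMIC)
);

-- Object literals use key = value pairs; quoted keys keep their case.
INSERT INTO my_devices (obj) VALUES (
    {
        "MachineID" = 'drill001',
        "Sensors" = [
            {"name" = 'temp1', "value" = 21.5},
            {"name" = 'temp2', "value" = 20.8},
            {"name" = 'accel', "value" = 3.2}
        ],
        "Events" = {"type" = 'info'}
    }
);

-- Make the new row visible to subsequent queries.
REFRESH TABLE my_devices;
```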
