Indexing Techniques for Time Series Data

Indexing in CrateDB enhances data retrieval speed, crucial for effective time series data analysis. To enable efficient aggregation, grouping and search over billions of rows, CrateDB combines multiple approaches:

Automatic indexing: All attributes are automatically indexed based on their data types, enabling fast ad-hoc queries and aggregations, and eliminating the need for a custom indexing strategy.
Memory-mapped columnar storage: This approach enables fast aggregations over huge datasets by optimizing frequently accessed data.
Block KD-trees and inverted index: These are leveraged for efficient handling of numeric and textual data respectively. Full-text indexes can be created on-demand.
HNSW (Hierarchical Navigable Small World) index: Used for vector data, this approach enables efficient nearest neighbour searches.

Automatic Indexing

CrateDB indexes every attribute by default, based on its data type. It also indexes attributes at all depth levels within nested data objects (JSON document for example). This eliminates the need for continuous adjustments and optimizations, streamlining the process. The system ensures quick responses to both structured and semi-structured queries and unforeseen ad-hoc requests, across extensive datasets.

Example 1: querying JSON attributes

In the Part 1 of this white paper series, we created a weather_data table. Now, we edit its schema by using the ALTER TABLE command to add a new ‘sensor_description’ column with the OBJECT(DYNAMIC) type to store some JSON data. With a SELECT query, we can instantly identify devices with low battery value by using the ‘battery_life’ JSON attribute:

ALTER TABLE weather_data  ADD COLUMN sensor_description OBJECT(DYNAMIC);

INSERT INTO weather_data (ts, location, temperature, humidity, wind_speed, sensor_description) VALUES ('2024-01-12T14:32:00', 'San Francisco', 17.8, 61.0, 33.2, '{"quality":90, "battery_life":35}');

SELECT DISTINCT(location), sensor_description['battery_life'] AS battery_life FROM weather_data WHERE sensor_description['battery_life'] < 40 LIMIT 100;

The updated data, including the new attribute, can be instantly viewed and filtered with a SELECT query like in Example 1.

This approach ensures high query performance and adaptability to different data types. Indexing strategies are tailored accordingly and can be turned off on specific fields when needed. With automatic indexing, CrateDB optimizes resource utilization, allowing users to focus on deriving insights rather than managing indexing complexities.

Columnar Storage

CrateDB uses columnar storage, which stores data table columns separately, as opposed to row-based storage which stores entire rows together. This offers several advantages for time series data:

Efficient data scanning: Only relevant columns are read from storage for queries, aggregations, and sort operations.
Better compression: Similar data types in each column allow for effective compression techniques.
Improved memory handling: Fast reads for frequently accessed columns are enabled via memory-mapped columnar storage files.

In our example, the columnar store links record IDs to temperatures at each weather station. When aggregating, such as when calculating the average temperature, CrateDB accesses the 'temperature' column's values directly, speeding up the query execution.

This method is ideal for time-series data, where datasets are large and queries often involve time interval aggregations. Fast range scans, even when combined with additional filters, make this approach highly efficient.

Block KD-Trees

CrateDB utilizes Block KD-Trees (BKD-Trees), a unique type of binary search trees, to enhance the indexing of numerical values such as time stamps, IP addresses, geospatial data, and measurement values. These trees are designed to ensure high query performance.

Each node in a BKD-Tree contains a block of points instead of individual points to reduce the overhead associated with traversing the tree and to improve cache efficiency, especially for datasets with high dimensionality or large numbers of points. BKD-Trees are commonly used in database systems and spatial indexing applications to accelerate search operations on multidimensional data.

BKD-Trees in CrateDB offer logarithmic search time complexity, which means they efficiently handle search operations, even for very large datasets. This is due to the ability to halve the search space at each step, allowing the algorithm to scale effectively with the size of the dataset. BKD-Trees are useful for time series analytics requiring filtering, range-queries, and handling multi-dimensional geospatial data. They also excel in indexing disk-stored data, pinpointing relevant disk blocks during queries, and avoiding unnecessary disk scanning. This efficiency is crucial for time-series data where the index continually grows due to extensive data insertions and data older data oftentimes needs to be read from disk.

Inverted Index and Full-text Index

In CrateDB, an inverted index is a data structure that maps the content to its location in a table or a document.

The inverted index is used for exact text matching. It's straightforward and fast for queries that know the exact ‘phrases' they are looking for. In the illustration below, we have an index mapping each weather station to an 'ID', perfect for direct lookups.

For more complex searches, a full-text index supports searches for partial matches and variations. In the illustration above, the term 'Berlin' is linked to IDs 1 and 2, corresponding to ‘Berlin West' and ‘Berlin South'. This allows users to search for a term like ‘Berlin' and find all related stations in Berlin, regardless of their exact names.

Full-text search in CrateDB, extending beyond simple pattern matching, is enabled by creating full-text indexes, including composite indexes for indexing multiple text columns together.

In this example, a composite full-text index on the 'name' and 'description' columns can be used to perform sophisticated searches. Custom analyzers allow for tailored text tokenization processes suitable for different languages or specific needs. In a typical use case, a full-text search is conducted on the 'stations' table to locate key weather stations using the term 'key station' in the name or description field.

Results are ranked based on relevance, indicating how closely each entry matches the search term.

CrateDB's use of standard SQL simplifies full-text search integration with other filter types and query criteria. It supports, amongst others, fuzzy searches, phrase searches, and attribute boosting, enhancing result relevance. These features enable efficient handling of advanced time series use cases, including data filtering, slicing, and complex searches.

HNSW (Hierarchical Navigable Small World) Index

The Hierarchical Navigable Small World (HNSW) indexing in CrateDB is highly efficient for managing high-dimensional vectors. It's crucial for executing approximate nearest neighbour searches, also known as similarity searches.

HNSW excels at managing and interpreting unstructured data like text, video, and audio, often found in AI applications. This algorithm is used to create vector indexes, which greatly improve query speed.

Combining Multiple Indexing Techniques for Time Series Data

Automatic Indexing

Example 1: querying JSON attributes

Example 2: adding JSON attributes dynamically

Columnar Storage

Block KD-Trees

Inverted Index and Full-text Index

HNSW (Hierarchical Navigable Small World) Index

Want to read more?

Company

Ecosystem

Contact