Indexing in CrateDB enhances data retrieval speed, crucial for effective time series data analysis. To enable efficient aggregation, grouping and search over billions of rows, CrateDB combines multiple approaches:
- Automatic indexing: All attributes are automatically indexed based on their data types, enabling fast ad-hoc queries and aggregations, and eliminating the need for a custom indexing strategy.
- Memory-mapped columnar storage: This approach enables fast aggregations over huge datasets by optimizing frequently accessed data.
- Block KD-trees and inverted index: These are leveraged for efficient handling of numeric and textual data respectively. Full-text indexes can be created on-demand.
- HNSW (Hierarchical Navigable Small World) index: Used for vector data, this approach enables efficient nearest neighbour searches.
Automatic Indexing
CrateDB indexes every attribute by default, based on its data type. It also indexes attributes at all depth levels within nested data objects (JSON document for example). This eliminates the need for continuous adjustments and optimizations, streamlining the process. The system ensures quick responses to both structured and semi-structured queries and unforeseen ad-hoc requests, across extensive datasets.
Example 1: querying JSON attributes
In the Part 1 of this white paper series, we created a weather_data table. Now, we edit its schema by using the ALTER TABLE command to add a new ‘sensor_description’ column with the OBJECT(DYNAMIC) type to store some JSON data. With a SELECT query, we can instantly identify devices with low battery value by using the ‘battery_life’ JSON attribute:
ALTER TABLE weather_data ADD COLUMN sensor_description OBJECT(DYNAMIC); INSERT INTO weather_data (ts, location, temperature, humidity, wind_speed, sensor_description) VALUES ('2024-01-12T14:32:00', 'San Francisco', 17.8, 61.0, 33.2, '{"quality":90, "battery_life":35}'); SELECT DISTINCT(location), sensor_description['battery_life'] AS battery_life FROM weather_data WHERE sensor_description['battery_life'] < 40 LIMIT 100;
Example 2: adding JSON attributes dynamically
We create a 'weather_data_json' table with 3 columns for timestamp, location, and a JSON object containing other values:
CREATE TABLE weather_data_json ( ts TIMESTAMP, location VARCHAR, vals OBJECT, PRIMARY KEY (ts, location) );
We start by inserting a record where the JSON object contains only the ‘temperature’ and ‘humidity’ attributes:
INSERT INTO weather_data_json (ts, location, vals) VALUES ('2024-01-12T14:30:00', 'San Francisco', '{"temperature":17.5, "humidity":60.0}');
Then, we insert another record where the JSON object contains the new ‘wind_speed’ attribute:
INSERT INTO weather_data_json (ts, location, vals) VALUES ('2024-01-12T14:31:00', 'San Francisco', '{"temperature":17.5, "humidity":60.0, "wind_speed":35.6}');
The updated data, including the new attribute, can be instantly viewed and filtered with a SELECT query like in Example 1.
This approach ensures high query performance and adaptability to different data types. Indexing strategies are tailored accordingly and can be turned off on specific fields when needed. With automatic indexing, CrateDB optimizes resource utilization, allowing users to focus on deriving insights rather than managing indexing complexities.
Columnar Storage
CrateDB uses columnar storage, which stores data table columns separately, as opposed to row-based storage which stores entire rows together. This offers several advantages for time series data:
- Efficient data scanning: Only relevant columns are read from storage for queries, aggregations, and sort operations.
- Better compression: Similar data types in each column allow for effective compression techniques.
- Improved memory handling: Fast reads for frequently accessed columns are enabled via memory-mapped columnar storage files.
In our example, the columnar store links record IDs to temperatures at each weather station. When aggregating, such as when calculating the average temperature, CrateDB accesses the 'temperature' column's values directly, speeding up the query execution.
This method is ideal for time-series data, where datasets are large and queries often involve time interval aggregations. Fast range scans, even when combined with additional filters, make this approach highly efficient.
Block KD-Trees
Inverted Index and Full-text Index
For more complex searches, a full-text index supports searches for partial matches and variations. In the illustration above, the term 'Berlin' is linked to IDs 1 and 2, corresponding to ‘Berlin West' and ‘Berlin South'. This allows users to search for a term like ‘Berlin' and find all related stations in Berlin, regardless of their exact names.
Full-text search in CrateDB, extending beyond simple pattern matching, is enabled by creating full-text indexes, including composite indexes for indexing multiple text columns together.
CREATE TABLE weather_stations ( id INT, name TEXT, description TEXT, INDEX name_description USING fulltext(name, description) WITH (analyzer = 'english') );
SELECT name, description, _score FROM weather_stations WHERE match(name_description, ‘key station’) ORDER BY _score;
In this example, a composite full-text index on the 'name' and 'description' columns can be used to perform sophisticated searches. Custom analyzers allow for tailored text tokenization processes suitable for different languages or specific needs. In a typical use case, a full-text search is conducted on the 'stations' table to locate key weather stations using the term 'key station' in the name or description field.
Results are ranked based on relevance, indicating how closely each entry matches the search term.
CrateDB's use of standard SQL simplifies full-text search integration with other filter types and query criteria. It supports, amongst others, fuzzy searches, phrase searches, and attribute boosting, enhancing result relevance. These features enable efficient handling of advanced time series use cases, including data filtering, slicing, and complex searches.
HNSW (Hierarchical Navigable Small World) Index
The Hierarchical Navigable Small World (HNSW) indexing in CrateDB is highly efficient for managing high-dimensional vectors. It's crucial for executing approximate nearest neighbour searches, also known as similarity searches.
HNSW excels at managing and interpreting unstructured data like text, video, and audio, often found in AI applications. This algorithm is used to create vector indexes, which greatly improve query speed.