Indexing

CrateDB Academy: CrateDB Fundamentals

In this video, we'll dive deeper into indexing in CrateDB, beginning with what is indexed and how. I'll show you how the different object column policies influence the indexing of objects, and we'll end with a quick look at how to configure or even disable indexing.

Let's start with the question: what's indexed by default in CrateDB? Well, everything is. To ensure fast responses for any type of query, CrateDB automatically indexes every attribute by default. This is especially useful when query patterns and use cases evolve over time and may not have been clear when the data was ingested into the database.

How does CrateDB achieve this? Let's find out. CrateDB leverages the powerful search and indexing capabilities of Apache Lucene. Inverted indexes are used for text values. These facilitate efficient search for exact text matches, including support for wildcards and regular expressions. Block k-d trees, or BKD trees, are used for numeric, date, and geospatial values. These are highly efficient indexes designed for optimal I/O. Hierarchical Navigable Small World (HNSW) graphs enable efficient approximate nearest neighbour search, commonly known as vector similarity search. Full-text indexes can be added to enable features such as fuzzy search, phrase search and attribute boosting. CrateDB supports multiple languages and offers a range of analysers and tokenizers, plus the flexibility to create custom analysers and tokenizers.

In conjunction with advanced indexing strategies, CrateDB uses a columnar storage approach. This facilitates fast queries and complex aggregations across large data sets. How does this work? Let's consider a simple example. Imagine we have data about products and quantities in stock stored in a table such as the one on the left. If we want to search for a given product term, for example almond, we could go through the table row by row to find matches, but that gets slower as the table grows. Instead, an inverted index is created when the data is stored. This index maps each term to the IDs of the records that it appears in. This approach enables faster queries: rather than going through the entire table to look for the term, the query engine can refer to the inverted index to get the IDs of matching records directly.
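
The almond example can be sketched in a few lines of Python. This is only a toy model of the idea, not how Lucene or CrateDB implement it: the sample rows, the whitespace tokenisation and the function names are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical sample rows: (record ID, product name, quantity in stock).
ROWS = [
    (1, "almond milk", 12),
    (2, "oat milk", 30),
    (3, "almond butter", 7),
    (4, "peanut butter", 15),
]


def scan(rows, term):
    """Baseline: go through the table row by row looking for the term."""
    return [rid for rid, product, _qty in rows if term in product.lower().split()]


def build_inverted_index(rows):
    """Map each lower-cased term to the sorted IDs of the records containing it."""
    index = defaultdict(set)
    for rid, product, _qty in rows:
        for term in product.lower().split():
            index[term].add(rid)
    return {term: sorted(ids) for term, ids in index.items()}


# Built once, when the data is stored.
INDEX = build_inverted_index(ROWS)


def lookup(term):
    """Answer the query from the index without touching the rows."""
    return INDEX.get(term.lower(), [])


print(lookup("almond"))  # [1, 3]
```

Both `scan` and `lookup` return the same record IDs. The difference is that `lookup` does a single dictionary access, while `scan` has to visit every row.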

Using columnar storage to keep data for the same field together optimises file system cache utilisation and eliminates the need to load unnecessary data for fields not used in a given query. CrateDB segments this data into blocks and stores each block with metadata that allows some queries to skip blocks that don't contain matching values entirely, and to query data without decompressing it first.
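
To make block skipping concrete, here is a minimal sketch that assumes the per-block metadata is simply the minimum and maximum value in the block. That choice, the block size and all of the names are assumptions for illustration. The actual on-disk format is not described in this video, and the sketch does not model compression at all.

```python
from dataclasses import dataclass


@dataclass
class Block:
    values: list   # the column values held in this block
    minimum: int   # metadata: smallest value in the block
    maximum: int   # metadata: largest value in the block


def make_blocks(values, block_size):
    """Split one column into fixed-size blocks and record min/max for each."""
    blocks = []
    for start in range(0, len(values), block_size):
        chunk = values[start:start + block_size]
        blocks.append(Block(chunk, min(chunk), max(chunk)))
    return blocks


def count_at_least(blocks, threshold):
    """Count values >= threshold; return (matches, number of blocks read)."""
    matches = 0
    blocks_read = 0
    for block in blocks:
        if block.maximum < threshold:
            continue  # metadata proves nothing here can match, so skip the block
        blocks_read += 1
        matches += sum(1 for value in block.values if value >= threshold)
    return matches, blocks_read


# Hypothetical "quantity in stock" column, stored separately from other fields.
QUANTITIES = [3, 5, 2, 4, 40, 42, 38, 45, 7, 6, 9, 8]
BLOCKS = make_blocks(QUANTITIES, block_size=4)

print(count_at_least(BLOCKS, 30))  # (4, 1)
```

With a threshold of 30, only the middle block is read. The other two are ruled out by their maximum values alone.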

Objects offer a lot of flexibility when modelling data with CrateDB. Let's take a moment to learn how they are indexed. This is a CREATE TABLE statement for a table that models the details of Chicago's taxi fleet. Recall that CrateDB has three different object schema policies. When using DYNAMIC, the default policy, CrateDB creates a schema for the object based on the data inserted into it.

Here we're storing an object containing two text values in the operator field. CrateDB indexes dynamic objects, so both name and affiliation are indexed. The details object has a STRICT schema definition. Objects inserted here must match the declared schema. Everything inside this object is indexed too. Finally, an object using the IGNORED policy can have any structure. In this case, only fields specified in the schema are indexed, so we see that emissions and seatbeltcheck are not indexed.
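
The three policies can be summarised with a small Python model. It only mimics the behaviour described above and is not CrateDB code: `apply_policy` is a hypothetical helper, and the declared field names `make` and `inspected` are made up for the example. The fields `name`, `affiliation`, `emissions` and `seatbeltcheck` come from the table in the video.

```python
def apply_policy(policy, declared_fields, document):
    """Model which fields of an inserted object end up indexed.

    Returns (indexed_fields, schema_after_insert) as sorted lists.
    """
    schema = set(declared_fields)
    unknown = set(document) - schema

    if policy == "dynamic":
        schema |= unknown  # new fields are added to the schema, so they get indexed
    elif policy == "strict":
        if unknown:
            # objects must match the declared schema
            raise ValueError(f"fields not in schema: {sorted(unknown)}")
    elif policy == "ignored":
        pass  # any structure is accepted; unknown fields stay out of the schema
    else:
        raise ValueError(f"unknown policy: {policy}")

    indexed = schema & set(document)  # only fields in the schema are indexed
    return sorted(indexed), sorted(schema)


# DYNAMIC: both fields become part of the schema and are indexed.
print(apply_policy("dynamic", [], {"name": "A", "affiliation": "B"}))

# IGNORED: only the declared field is indexed.
print(apply_policy("ignored", ["inspected"],
                   {"inspected": True, "emissions": 95, "seatbeltcheck": True}))
```

A STRICT object with an undeclared field raises an error in this model, which corresponds to the insert being rejected.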

This video provided an overview of indexing in CrateDB. CrateDB ensures fast responses to all types of query by indexing everything by default. Full-text and vector similarity search are also possible. These topics are covered in other videos.