Introduction
CrateDB combines the advantages of typical SQL databases and strict schemas with the dynamic properties of NoSQL databases. While object-relational databases like PostgreSQL allow to store and process JSON data in columns, CrateDB handles objects as real first-level citizens. This allows users to access object properties in the same manner as columns in a table, including full indexing and aggregation capabilities.
In CrateDB an object is a container data type and is represented as a collection of key-value pairs. An object property can contain a value of any type, including nested objects or arrays. There are different kinds of variations in how CrateDB handles objects. From very strict schema enforcement to completely ignoring schema definitions. Further through the dynamic mapping capabilities, new object properties can be mapped to a datatype and fully indexed on the fly. The goal of this article is to give an overview of different types of objects supported in CrateDB and to improve the overall understanding of the corresponding indexing strategy.
Overview of Objects in CrateDB
In CrateDB objects can be defined as strict, dynamic, or ignored depending on whether we want to enforce strict or dynamic schemas and whether dynamically added properties result in a schema update of sub-columns. For the official syntax on object specification in CrateDB, we refer to the documentation.
Strict, Dynamic and Ignored Objects
In CrateDB, an object can be specified as strict if we want to enforce a predefined number and type of sub-columns. This leads to the rejection of objects with additional sub-columns that are not defined upfront in the schema. If the object schema is not predefined, one can use a dynamic object that allows adding new sub-columns dynamically. However, every time an object with a new sub-column is added, the table schema gets updated. In the updated schema, a new sub-column will have a type that is interfered based on its value and will be indexed. Finally, if we want the flexibility of dynamic objects without changing table schema, we should use ignored objects. With ignored objects, we still have column constraints as with strict or dynamic objects, but adding new sub-columns dynamically does not lead to schema updates or index updates. In the following sections, we will talk about object indexing in more detail. To clarify the difference between different object types let’s consider the table with three columns defined as follows:
The first column stores an object with strict predefined mapping, the second column stores objects with dynamic mapping, and the third column stores objects ignoring the datatype of properties. For simplicity reasons, let’s consider each object to have one sub-column called name
. Now, let’s imagine that we want to add the following object as the first, second, and third columns:
In the case of the first column, CrateDB will throw the Exception
because dynamically added columns are rejected in strict objects. In the case of the second column, insertion of a new value causes the change of obj_table
schema as follows:
Based on the guessed value type, CrateDB updates the sub-column type. In the case of age, the sub-column is anticipated as BIGINT
. Finally, the insertion of the same object to the ignored column would store the value without affecting the table schema.
Indexing Objects in CrateDB
Index structures are one of the most important tools that database management systems leverage to improve query performance. CrateDB indexing strategy is based on a Lucene index which enables efficient search on very large datasets.
By default, CrateDB stores object in a flat dotted structure. For example,
becomes:
CrateDB builds indexes for each sub-column (i.e., object property), depending on the type of sub-column value. For example, for a sub-column that stores text value, CrateDB builds an inverted index. For a numeric value, it builds BKD-trees, etc. To learn more about data structures used for building indexes in CrateDB, check our previous article on indexing and storage. Furthermore, for each stored object CrateDB builds a Column Store for the JSON representation of that object.
Indexing Dynamically Added Columns
The exception to the indexing strategy comes with objects declared as ignored. Dynamically added columns of ignored objects do not result in new indexes, as they do not affect table schema. Filtering on non-indexed columns is slower than on indexed columns, as value lookup is performed for each matching row. Furthermore, CrateDB does not create column stores for dynamically added values in ignored objects. This means that operations such as sorting, grouping, and aggregation operations are significantly slower on ignored objects.
However, CrateDB indexes all statically defined columns of ignored objects. To further clarify how the object indexing works, let’s consider our obj_table
with three object columns: strict, dynamic, and ignored and the following insert statement:
The resulting indexes are further illustrated in the table below:
Column | Type | Index Structures |
obj1 |
static object | column store |
obj1.name |
text | column store and inverted index |
obj2 |
dynamic object | column store |
obj2.name |
text | column store and inverted index |
obj2.age |
bigint | column store and BKD tree |
obj3 |
ignored object | column store |
obj3.name |
text | column store and inverted index |
obj3.age |
no schema update | no index |
As shown, in the case of dynamically added columns, a new index is created if the object is dynamic. If the object is ignored, dynamically columns do not result in a new index.
Accessing Undefined Properties
Another difference between dynamic and ignored objects is the way the queries are performed on objects that do not exist. As an example, let’s consider the list of two dynamic objects as follows:
Now, let’s query the objects on three columns: name
, age
, and phone
. The result is illustrated in the table below:
obj1.name |
obj1.age |
obj1.phone |
'Ana' | 20 | NULL |
NULL |
22 | 'abcde' |
Accessing the column that is not defined the dynamic object returns NULL
as long as the column is defined in the table schema. However, if we attempt to query the column that is not part of the schema, e.g., column address
, the system will throw ColumnUnknownException
. The difference between dynamic and ignored objects in this regard is that a query on a column that is not part of the schema would always return a NULL
value.
Now, let’s consider that we want to extend the obj2
column in obj_table
with a new subcolumn called address
. The query for adding a new subcolumn looks like the follows:
After adding a new subcolumn the table schema changes, as well as the indexes. This means that querying address
property on obj2
would return NULL
value. The same behavior is expected when querying address
on obj3
. however, as already discussed this is expected behavior as obj3
is ignored object. What is important to mention here is that changing schema affects existing indices: altering object column has as a consequence that only values added after changing schema are indexed.
Summary
This article describes the fundamentals of objects in CrateDB. Objects can be either strict, dynamic, or ignored and which one works the best depends on a use case. For properties in strict and dynamic objects, CrateDB generates indexes that allow the fast search of stored objects and efficient updates. As we illustrated, there are exceptions to this rule which should be taken into account when defining table schema.