CrateDB is a multi-model database. Its strength lies in the efficient handling of multiple data models within the same database and even within the same table. This includes structured data such as tables, time series and geospatial; semi structured data, for example JSON documents; and unstructured data including text, binaries, images, videos or vectors. All of these data models can be combined in the same record and accessed by SQL. This allows for complex queries and searches over all of your data. In this video, we'll explore some multi-model queries and introduce a couple of new concepts. Let's get started.
Here's a multi-model query using the Community Areas data set. And here's the table schema for the community_areas table. The details field is an object with a defined schema of its own. Recall that CrateDB also supports dynamic schemas for objects and that all fields will be indexed, ensuring the best query performance. Inside the details object, the description field is stored as text with full-text indexing. Don't forget that vector and hybrid searches are also possible with CrateDB. boundaries is defined as a GEO_SHAPE. Shapes can be specified as GeoJSON or in Well Known Text format (WKT).
This query returns the community area name and population for community areas whose boundaries intersect a line that represents the path of a flight between Chicago's O'Hare Airport and the smaller Midway Airport, and whose description text matches the term historic, and which have a population exceeding 30,000 people. We're returning the results in descending order of population, and we can see that the flight passes over three community areas that are thought to be of historic interest and have sufficient population.
Here's a second query that demonstrates some of the other powerful features of CrateDB. I'm using our taxi_rides and taxis tables to find details of which cars earn the most money, what their biggest single fare was, and when the ride that yielded that took place. Let's look at a couple of details. Firstly, we're joining two tables: The vehicle details come from the taxis table and the metrics from aggregations over the taxi_ rides table.
Remember that while CrateDB is a distributed database, it does support joins. With CrateDB this sort of query is extremely fast due to the use of columnar storage. Aggregation operations are fast and efficient. Let's run the query and find out which taxis are the big earners. It seems that Toyota vehicles generate the most revenue.
Next, I'll introduce two more features that will help you structure complex queries in CrateDB: Common Table Expressions and Window Functions. In SQL, a common table expression, or CTE, specifies a temporarily named result set for use within the context of a larger query. Think of common table expressions as a way to more easily write and maintain complex queries, especially when there is a need to reference or join the same data set multiple times. Let's demonstrate this by revisiting an example with taxi and taxi_rides data.
Imagine we want to combine some summary data from our taxi_rides table with data from the taxis table. This query tells us the number of rides each taxi performed and the average distance it drove.
And this query tells us the vehicleid, make, and model of each taxi. Let's use a common table expression to combine these. With the data from the taxi_rides table as our temporary named result set. CTEs are introduced using the WITH keyword.
Here I'm defining a CTE called ride_info containing the results of running that first query that we looked at, and here I'm referring to it in the SELECT query that also includes data from the taxis table. Running this query shows us some more details about the vehicles that have performed the most trips. It looks like Toyota taxis are always out on the road earning a living.
We can also reference CTEs in WHERE clauses. This query only considers those taxis whose average ride length was over 30 miles. It does this by using average miles declared in the CTE in the WHERE clause.
A window function performs a calculation across a set of rows that are somehow related to the current row. The window in window functions refers to that set of rows. Window functions are similar to aggregations, but unlike aggregate functions, which reduce the output into a single row per group, window functions return a value for each row in the output. In SQL, window functions are introduced with the OVER keyword.
Here is an example query that returns the start time and number of miles driven for each trip made by one of the taxis on April the 29th, 2024. We're using a window function to calculate a running total of the number of miles the taxi has driven that day.
The running total is created by this clause. Here, for each matching ride, sum is provided with the value of miles for that ride, plus all the rides for this cab that came before it in the day achieved with the ORDER BY clause. Let's run the query and see what it produces.
We see each of Taxi 5955's 8 trips for the day in order with a running total of the number of miles driven that day after each ride. We can use multiple window functions in a single query.
Consider this query, which again shows the cumulative miles driven by Taxi 5955 in a day, but now also ranks those trips by distance, with a rank of one being the longest.
Here we use two window functions. The first window function uses dense_rank to rank the results in descending order of miles driven, and the second calculates our running total of miles driven for the day as before. When we run this query, here's what's returned.
What if we wanted to run reports like this for more than one taxi? Partitioning allows us to group the records in a window rather like a GROUP BY clause. The window function is then only applied to the matching subset.
This query returns a report containing the cost of each ride that a taxi performed and the cumulative amount earned by that taxi for the day at the end of each ride. We're running it for two taxis, 5635 and 3523. The now running window function to calculate the running cumulative earnings is partitioned by the vehicleid, ensuring that only the records for the right cab are used in the calculation.
Here is the result of running this query. We can see that each cab has a separate running total for it's earnings so far that day. Here we have a query that combines some of the ideas we've been looking at for a given taxi. This query generates a report showing the length of each trip the taxi performed on a given day, as well as running totals for Dollars earned and time spent driving. We've also included columns that show whether a given trip was the taxi's 2nd, 3rd, 4th, etc. of the day and the trip's rank based on distance. Let's run the query now.
We start to get a picture of what the taxi drivers day looks like. This sort of information is useful for planning things like break times, fuel stops, and when the driver should think about returning to the depot. This query uses multiple window functions with some degree of repetition. Here we see we're using the same window three times. We can factor this out by using a named window.
This version of the query introduces a named window w and refers to it by name whenever we need it. Running this query yields the same results as the previous version, but our SQL is a bit more maintainable now.
In this video we demonstrated the power of querying a multi-model database with SQL. We also introduced the concepts of window functions and common table expressions. I would encourage you to experiment with these concepts using the sample data set in your own CrateDB cluster.