In SQL, a JOIN clause combines columns from one or more tables, allowing us to retrieve data stored in different tables by using values common to each. Unlike some distributed databases, CrateDB fully supports joins. In this video, you'll see how to join data from our Chicago taxis and taxi_rides tables.
Let's get started. This the schema for our taxis table. This table contains details about the vehicles comprising Chicago's taxi fleet, for example, the make, model and colour of each one. Each vehicle is assigned a unique vehicleid.
And here's the schema for our taxi_rides table. This table holds details of each taxi ride taken in April 2024. The vehicleid in this table refers to the vehicle that performed the ride. If we know the vehicleid for a ride, we can find the details of the taxi that performed the ride in the taxis table, and we can use a vehicleid from the taxis table to query the taxi_rides table. This query retrieves some of the details for Kia Soul taxis. There are eight of these vehicles in Chicago's taxi fleet. Using the vehicleid of one of these taxis, we can see the details of the rides performed by that vehicle. What if we wanted to show details from the taxis table in the result set for this sort of query? This is where we need to use a join.
This query retrieves the start time of the most recent taxi ride performed by each of our Kia Soul taxis. We want to retrieve the vehicleid, make, model and latest start time for each. The data is spread over two tables, taxis and taxi_rides. We use the t or r prefix when referencing the vehicleid column, as this column is common to both tables in the WHERE clause, we're joining the two tables by specifying that for each result, the vehicleid in the taxis table must match the vehicleid in the taxi_rides table. This query returns a result set containing data from both tables. vehicleid and ridestart come from data in the taxi_rides table, while make and model come from data in the taxis table.
Here's the same query but expressed differently. This time we're using a JOIN clause to specify how the rows in the two tables relate to each other. Running the query produces the same result as before. This query returns the number of trips made by each of our Kia Soul Taxis where the trip originated in community area 76 O'Hare and ended in community area 32: The Loop. Running this query returns three results. Each result is for a vehicle which had matching taxi rides. Recall that there are 8 Kia Soul taxis in Chicago's fleet. So why do we only have three results? This is because we're performing what's called an 'inner join' here. With an 'inner join', only records that have a match in both tables are returned, as five of our Kia Soul taxis didn't perform a trip that met the criteria. There are no matching results in the taxi_rides table for them, and they are omitted from the result set.
What if we wanted to produce a report including all the Kia Soul taxis, even those that didn't perform any qualifying trips? For that, we need an OUTER JOIN. In this case, we want a LEFT JOIN. We want to return all of the rows from the left hand table taxis even if there aren't any matching records in the taxi_rides table. We achieved that by specifying LEFT JOIN here. When we run this query, we see that there are now 8 rows in the result set, one for each of the Kia Soul taxis. This includes the five taxis that didn't perform any qualifying rides.
In this video you saw how to perform queries with joins in CrateDB. CrateDB supports inner, outer and cross joins. For more information, consult the documentation online.