This exercise requires a CrateDB cluster with the course dataset loaded. If you didn't create your cluster earlier in the course, complete these steps before going further.
Try Some of the Queries from the Video
Let’s try a query that finds the most recent ride start time for each Kia Soul taxi:
SELECT t.vehicleid, make, model, max(r.starttime) AS ride_start
FROM taxis t JOIN taxi_rides r ON t.vehicleid = r.vehicleid
WHERE make='KIA' AND model='SOUL'
GROUP BY t.vehicleid, make, model
ORDER BY ride_start DESC;
As all 8 of the Kia Souls completed at least one taxi ride, data is returned for each.
Now, run this query which returns details of how many rides each Kia Soul taxi undertook that originated in the O’Hare area and ended in The Loop:
SELECT t.vehicleid, make, model, count(r.vehicleid) AS num_trips FROM taxis t
JOIN taxi_rides r ON t.vehicleid = r.vehicleid AND pickupcommunityarea = 76 AND dropoffcommunityarea = 32
WHERE t.make='KIA' AND t.model='SOUL'
GROUP BY t.vehicleid, make, model
ORDER BY num_trips DESC;
How many rows does this return? Why are there fewer than 8 rows?
Finally, run this left outer join query which also returns details of the number of rides undertaken by each Kia Soul taxi from the O’Hare area to The Loop.
SELECT t.vehicleid, make, model, count(r.vehicleid) AS num_trips FROM taxis t
LEFT JOIN taxi_rides r ON t.vehicleid = r.vehicleid AND pickupcommunityarea=76 AND dropoffcommunityarea=32
WHERE t.make='KIA' AND t.model='SOUL'
GROUP BY t.vehicleid, make, model
ORDER BY num_trips DESC;
Why does this query return 8 rows?
Try Some Joins of Your Own
Take a moment to try out some queries of your own that use joins. Our examples used the taxis and taxi_rides tables, but you could also use the following pairs of tables:
- community_areas (areanumber) joined on three_eleven_calls (locationdetails['communityarea']).
- libraries (location['communityarea']) joined on community_areas (areanumber).
Refer to the joins documentation for more information on SQL syntax and available options.
Check if your CrateDB cluster has these tables using a count query:
SELECT count(*) FROM table_name;
Replacing table_name with community_areas, three_eleven_calls and libraries as needed.
If one or more of the queries error, create and populate the table using the SQL statements below.
Community Areas Table
DROP TABLE IF EXISTS community_areas;
CREATE TABLE community_areas (
areanumber INTEGER PRIMARY KEY,
name TEXT,
details OBJECT(STRICT) AS (
description TEXT INDEX USING fulltext WITH (analyzer='english'),
population BIGINT
),
boundaries GEO_SHAPE INDEX USING geohash WITH (PRECISION='1m', DISTANCE_ERROR_PCT=0.025)
);
COPY community_areas
FROM 'https://github.com/crate/cratedb-datasets/raw/main/academy/chicago-data/chicago_community_areas.json'
RETURN SUMMARY;
Libraries Table
DROP TABLE IF EXISTS libraries;
CREATE TABLE libraries (
name TEXT,
location OBJECT(DYNAMIC) AS (
address TEXT,
zipcode TEXT,
communityarea INTEGER,
position GEO_POINT
),
phone TEXT,
website TEXT
);
COPY libraries FROM 'https://github.com/crate/cratedb-datasets/raw/main/academy/chicago-data/chicago_libraries.json' RETURN SUMMARY;
311 Calls Table
DROP TABLE IF EXISTS three_eleven_calls;
CREATE TABLE three_eleven_calls (
srnumber TEXT,
srtype TEXT,
srshortcode TEXT,
createddept TEXT,
ownerdept TEXT,
status TEXT,
origin TEXT,
createddate TIMESTAMP,
lastmodifieddate TIMESTAMP,
closeddate TIMESTAMP,
week GENERATED ALWAYS AS date_trunc('week', createddate),
isduplicate BOOLEAN,
createdhour SMALLINT,
createddayofweek SMALLINT,
createdmonth SMALLINT,
locationdetails OBJECT(DYNAMIC) AS (
streetaddress TEXT,
city TEXT,
state TEXT,
zipcode TEXT,
streetnumber TEXT,
streetdirection TEXT,
streetname TEXT,
streettype TEXT,
communityarea SMALLINT,
ward SMALLINT,
policesector SMALLINT,
policedistrict SMALLINT,
policebeat SMALLINT,
precinct SMALLINT,
latitude DOUBLE PRECISION,
longitude DOUBLE PRECISION,
location GEO_POINT
)
) PARTITIONED BY (week);
COPY three_eleven_calls
FROM 'https://github.com/crate/cratedb-datasets/raw/main/academy/chicago-data/311_records_apr_2024.json.gz'
WITH (compression='gzip')
RETURN SUMMARY;