We've doubled the JOINs performance for almost all queries, and removed many of the previously existing restrictions!
Since the first public release of JOINs in Crate we've been working hard to remove limitations and improve query speed. These improvements will be available in the upcoming testing release of Crate 0.54.0
.
Removing Limitations
ORDER BY
Limitation
The limitation that every item in the select list of the query needs to be included in the order clause is now gone. What previously had to be written as:
SELECT
employee.id, employee.name, department.name,
FROM
employee, department
WHERE
employee.department_id = department.id
ORDER BY
employee.id, employee.name, department.name;
Can now be written as:
SELECT
employee.id, employee.name, department.name,
FROM
employee, department
WHERE
employee.department_id = department.id
ORDER BY
employee.id;
Explicit INNER JOIN
Syntax
We've added full syntax support for INNER JOIN
, which means you can now join tables explicitly:
SELECT
employee.id, employee.name, department.name,
FROM
employee
INNER JOIN department
ON employee.department_id = department.id
ORDER BY
employee.id;
Optimizing for Speed
By removing the requirement for ordering by all expressions in the select list, query time has accelerated greatly!
Example: Weather Data
To show this in an example I am using data from the German Weather Service which Matthias wrote a blog post about Analyzing Germany's weather with Crate.
The data consists of two tables, one for the actual data points and one for the stations that recorded the data:
Table Name | Records |
---|---|
german_climate.data | ~ 235M |
german_climate.stations | 1455 |
With the old "ORDER BY limitation" the query looked like this:We want to query the top 10 hottest weather records and their station names.
SELECT stations.name, data.temp, data.air_pressure, date_format(data.date)
FROM german_climate.data, german_climate.stations
WHERE data.station_id = stations.id
AND data.temp IS NOT NULL
AND data.air_pressure IS NOT NULL
ORDER BY data.temp DESC, stations.name, data.air_pressure, date_format(data.date)
LIMIT 10;
And the result:
+------------------------+-----------+-------------------+-----------------------------+
| stations.name | data.temp | data.air_pressure | date_format(data.date) |
+------------------------+-----------+-------------------+-----------------------------+
| Oldenburg | 49.1 | 1027.2 | 1993-03-25T21:00:00.000000Z |
| Oldenburg | 42.1 | 1006.5 | 1992-07-21T13:00:00.000000Z |
| Zwiesel | 42.0 | 1013.5 | 2002-10-04T02:00:00.000000Z |
| Bremervörde | 41.3 | 1018.2 | 1991-08-02T15:00:00.000000Z |
| Oldenburg | 41.0 | 959.2 | 1992-10-24T00:00:00.000000Z |
| Weiden | 40.9 | 1012.3 | 1994-07-25T14:00:00.000000Z |
| Karlsruhe | 40.1 | 1016.3 | 2003-08-13T13:00:00.000000Z |
| Karlsruhe | 39.9 | 1015.4 | 2003-08-13T15:00:00.000000Z |
| Karlsruhe | 39.8 | 1014.9 | 2003-08-09T16:00:00.000000Z |
| Schönhagen (Ostseebad) | 39.8 | 1007.0 | 1993-03-22T20:00:00.000000Z |
+------------------------+-----------+-------------------+-----------------------------+
SELECT 10 rows in set (6.908 sec)
Due to the multiple ordering, the query took about 6.9 seconds.
However, now it's possible to sort only by the relevant fields, in this case by temperature:
SELECT stations.name, data.temp, data.air_pressure, data.date
FROM german_climate.data
INNER JOIN german_climate.stations
ON data.station_id = stations.id
WHERE data.temp IS NOT NULL
AND data.air_pressure IS NOT NULL
ORDER BY data.temp DESC
LIMIT 10;
And the result:
+------------------------+-----------+-------------------+-----------------------------+
| stations.name | data.temp | data.air_pressure | date_format(data.date) |
+------------------------+-----------+-------------------+-----------------------------+
| Oldenburg | 49.1 | 1027.2 | 1993-03-25T21:00:00.000000Z |
| Oldenburg | 42.1 | 1006.5 | 1992-07-21T13:00:00.000000Z |
| Zwiesel | 42.0 | 1013.5 | 2002-10-04T02:00:00.000000Z |
| Bremervörde | 41.3 | 1018.2 | 1991-08-02T15:00:00.000000Z |
| Oldenburg | 41.0 | 959.2 | 1992-10-24T00:00:00.000000Z |
| Weiden | 40.9 | 1012.3 | 1994-07-25T14:00:00.000000Z |
| Karlsruhe | 40.1 | 1016.3 | 2003-08-13T13:00:00.000000Z |
| Karlsruhe | 39.9 | 1015.4 | 2003-08-13T15:00:00.000000Z |
| Karlsruhe | 39.8 | 1014.9 | 2003-08-09T16:00:00.000000Z |
| Schönhagen (Ostseebad) | 39.8 | 1007.0 | 1993-03-22T20:00:00.000000Z |
+------------------------+-----------+-------------------+-----------------------------+
SELECT 10 rows in set (2.578 sec)
As you can see, by removing the additional unused order by fields the query time shrank from ~6.9 seconds to ~2.6 seconds, which is an increase in speed of over 165%!
This is a simple example of a minor optimization. We're working on bigger optimizations for implementating JOIN
s, and look forward to sharing them with you soon!