For a while now, CrateDB has supported subselects in the FROM
clause, as long as the subselect can be collapsed into as a single select.
In one of our recent batches of work, my team addressed this limitation and now CrateDB supports non-collapsable subselects.
In this post, I will introduce you to the problem in a little more detail, and show you the steps we took to solve it.
Introduction
Collapsable Subselects
Here’s an example of a very simple query using a subselect:
SELECT column_a,
column_b
FROM (
SELECT *
FROM my_table
WHERE column_a > 10
) AS subselect_a
ORDER BY column_b
This is functionally the same as:
SELECT column_a,
column_b
FROM my_table
WHERE column_a > 10
ORDER BY column_b
That’s because, in the original query, the inner select applies the WHERE
operator and the outer select applies the ORDER BY
operator. And these two operators can be merged and executed in a single step.
We do not have to produce an intermediate result set from the inner select and then apply the operations of the outer select.
These are the only sorts of subselects that CrateDB was able to handle before we started our work.
Non-Collapsable Subselects
Let’s take a look at a more complex example of a subselect:
SELECT *
FROM (
SELECT *
FROM my_table
ORDER BY column_a DESC
LIMIT 10
) AS subselect_a
ORDER BY column_b
LIMIT 5
Here, the inner select applies ORDER BY
and LIMIT
operators, and the outer select applies ORDER BY
and LIMIT
operators. There is no way to merge these operators and execute them in a single step. Instead, we must first produce an intermediary result set by applying the operators from the inner select to the table, and then apply the operators from the outer select to the intermediary result set.
The first select operation looks like this:
SELECT *
FROM my_table
ORDER BY column_a DESC
LIMIT 10
If the result set produced by this query was a table called subselect_a
, the final select operation would look like this:
SELECT *
FROM subselect_a
ORDER BY column_b
LIMIT 5
It is the combination of these two queries that is needed.
In the past CrateDB did not support any sort of subqueries that required an intermediary result set, so these sorts of queries would raise an error.
Here’s another example:
SELECT *
FROM (
SELECT *
FROM (
SELECT column_a,
column_b
FROM my_table
WHERE column_c > 10
) AS subselect_a
ORDER BY column_a DESC
LIMIT 10
) AS subselect_b
ORDER BY column_b
LIMIT 5
This can be partially collapsed, to this:
SELECT *
FROM (
SELECT column_a,
column_b
FROM my_table
WHERE column_c > 10
ORDER BY column_a DESC
LIMIT 10
) AS subselect_a
ORDER BY column_b
LIMIT 5
Here, the innermost select has been collapsed. But we cannot collapse it any further, for the same reasons as before.
Adding support for non-collapsable subselects (which includes partially non-collapsable subselects) was our primary goal for this batch of work. And in the next section, I will show you how we did it.
How We Implemented Non-Collapsable Subselects
Changes to Query Rewriting
To support non-collapsable subselects, we had to make some changes to our SubselectRewriter
. This is the module responsible for collapsing subselects, per the examples in the previous section.
Previously, the SubselectRewriter
would construct a nested hierarchy of objects, modeling the nested hierarchy of selects. QueriedSelectRelation
objects are used to model selects that operate on the results of a select operation, and QueriedDocTable
objects are used to model selects that operate directly on a table.
Take this select:
SELECT column_a,
column_b
FROM (
SELECT *
FROM my_table
WHERE column_a > 10
) AS subselect_a
ORDER BY column_b
This would be modelled with the following nested object tree:
The top level object is always a QueriedSelectRelation
, because the outermost select of a nested collection of selects is always operating on the result set of another select. And the innermost select is always operating directly on a table.
Previously, the SubselectRewriter
would recursively collapse and rewrite this nested hierarchy of objects until only the QueriedDocTable
object was remaining. This corresponds to a single select operating directly on a table.
The final tree would look like this:
But let’s take this select:
SELECT *
FROM (
SELECT *
FROM my_table
ORDER BY column_a DESC
LIMIT 10
) AS subselect_a
ORDER BY column_b
LIMIT 5
We would model this with the following nested object tree:
But we cannot collapse it any further.
The SubselectRewriter
collapses the object tree from the inside out. Every time it encounters a QueriedSelectRelation
above the innermost QueriedDocTable
, it replaces both of them with a new QueriedDocTable
.
If it was unable to complete one of replacements, it would give up, and the end result would be a partially collapsed tree, with a QueriedDocTable
on the inside and at least one uncollapsed QueriedSelectRelation
object wrapped around it. CrateDB was unable to handle QueriedSelectRelation
objects, so an error would be raised.
To add support for QueriedSelectRelation
objects, we first had to change something about the way this collapsing was being handled.
In the past, we never bothered to update the parent object’s references to the fields of the child object. Because we assumed that the only useful end result would be a single collapsed QueriedDocTable
table. So a hierarchy of object with broken references was fine. Because this was an error condition.
But now we want to support a hierarchy of objects, every time we collapse an object pair, we need to update the parent object’s references.
We start with something like this:
To collapse this, we need to replace the references to xx
with references to (x+x)
:
We can collapse again by replacing x
with (x+x) + (x+x)
:
Implementing this object rewriting was a major chunk of the work we undertook.
Implementing this object rewriting was a major chunk of the work we undertook.
Changes to Data Fetching
The most complex part of the implementation though has to do with optimizing the fetch operations that take place during the execution of non-collapsable subselects.
What Is a Fetch Operation?
Query execution fetches row data in two steps:
- CrateDB fetches the minimum amount of data necessary for doing distributed filtering and ordering on specific columns
- Once CrateDB knows precisely which rows need returning to the client, it requests the rest of the data needed for those rows only
Consider this query:
SELECT column_a,
column_b,
column_c
FROM my_table
WHERE column_a > 10
ORDER BY column_b
LIMIT 100
This is what happens when CrateDB executes this query:
- Each
my_table
shard is asked to produce a set of_id
,column_a
) tuples wherecolumn_a
is greater than 10. - These tuples are then sent to the query handling node, which merges them together into a single set. This is the first fetch operation.
- The merged set of tuples is then ordered.
- The set of tuples is then limited to 100.
- The 100 document IDs we now have are used to request the rest of the data for each row. This is the second fetch.
This approach avoids unnecessary disk operations and unnecessary network traffic, thereby improving the performance of both overall.
What We Changed
To support non-collapsable subqueries, we need to modify this behaviour in a way that continues to optimise for both disk and network performance.
So, let’s take this modified example:
SELECT *
FROM (
SELECT column_a,
column_b,
column_c
FROM my_table
WHERE column_a > 10
ORDER BY column_b DESC
LIMIT 10
) AS subselect_a
ORDER BY column_a DESC
LIMIT 5
The inner query, if it was a standalone query, would fetch a set of (_id
, column_a
) tuples, and then fetch the values of column_b
and column_c
as the final step.
As the inner query of a subselect, this process is modified so that the outer query is notified that there’s a pending fetch operation for the values of column_b
and column_c
.
Because column_b
and column_c
are also not required for the execution of the outer query, this fetch can be postponed. Whether or not this fetch needs to be done later depends on whether the outer query's final fetch would fetch this data.
Because column_b
and column_c
are also not required for the execution of the outer query, but only for the final response, this fetch can be postponed. Whether the fetch needs to be done immediately or not, depends on whether the outer query operates on those columns (e.g. uses them in the WHERE
clause, ORDER BY
clause, and so on). In some cases, the fetch can be ignored entirely, because the fetch that the outer query produces fetches all the necessary data.
Additional Work
Group By on Subselects
A GROUP BY
on a subselect is more complex than on a table.
Consider this non-collapsable subselect:
SELECT column_a,
count(*)
FROM (
SELECT column_a
FROM my_table
ORDER BY column_a
LIMIT 100
) AS subselect_a
GROUP BY column_a
HAVING count(*) > 10
ORDER BY 2
LIMIT 10
Let’s take a look at how CrateDB now handles this.
First of all, the inner select is planned as if it was an independent query.
If that plan only touches the node that is handling the client request, then the GROUP BY
of the outer select will be handled with a non-distributed grouping plan. However, if the plan for the inner select touches additional nodes, the GROUP BY
of the outer select will be handled with a distributed grouping plan.
In the case of non-distributed grouping plan, the grouping operation takes place on the handler node.
In the case of distributed grouping plan, the grouping operation takes place on all nodes that hold the data being operated on.
In both cases, the grouping operation also involves calculating aggregate functions like min
, max
, avg
, and so on.
If the grouping operation was performed across multiple nodes, the handling node fetches that data and merges it together. This includes merging the results of aggregate functions.
HAVING
, ORDER BY
, and LIMIT
clauses are performed the GROUP BY
is done (which for distributed grouping plans means after the merge is complete).
Group By on Joins
A GROUP BY
on a join requires similar handling.
Consider this query:
SELECT table_a.column_b
table_b.column_b,
count(*)
FROM FROM table_a
INNER JOIN table_b
ON table_a.column_a = table_b.column_a
GROUP BY table_a.column_c,
table_b.column_c
HAVING count(*) > 10
ORDER BY 3
LIMIT 10
To handle this, we need to rewrite the query so that the join conditions are separated from the GROUP BY
operation and any subsequent operations.
We can do that, like so:
SELECT subselect_a.ta_column_b,
subselect_a.tb_column_b,
count(*)
FROM (
SELECT table_a.column_b AS ta_column_b,
table_b.column_b AS tb_column_b,
table_a.column_c AS ta_column_c,
table_b.column_c AS tb_column_c,
FROM table_a
INNER JOIN table_b
ON table_a.column_a = table_b.column_a
) AS subselect_a
GROUP BY subselect_a.ta_column_c,
subselect_a.tb_column_c
HAVING count(*) > 10
ORDER BY 3
LIMIT 10
The inner select is planned as if it was a independent select and a distributed or non-distributed plan is produced. On top of this plan, the grouping plan is created along with any subsequent operations.
This whole process functions like it does with grouping on subselects, as detailed in the previous section.
Implement DISTINCT for All Queries
The DISTINCT
operator is implemented by actually adding a group by operation on top of the existing query. Example:
When you use a DISTINCT
operation, like this:
SELECT DISTINCT column_a,
column_b,
column_c
FROM my_table
WHERE column_a > 10
ORDER BY 2
Internally this is rewritten by adding a GROUP BY
operation, like so:
SELECT column_a,
column_b,
column_c
FROM my_table
WHERE column_a > 10
ORDER BY 2
GROUP BY column_a,
column_b,
column_c
Previously doing work already covered in this post, it was not possible to use DISTINCT
with queries that already had a GROUP BY
clause.
Why?
Consider this query:
SELECT DISTINCT column_a,
max(column_b)
FROM my_table
GROUP BY column_a,
column_b
Internally, this would be rewritten as:
SELECT column_a,
max(column_b)
FROM (
SELECT DISTINCT column_a,
max(column_b)
FROM my_table
GROUP BY column_a,
column_b
) AS subselect_a
GROUP BY column_a,
max(column_b)
This is a non-collapsable subselect, and so wasn’t supported.
With the addition of support for non-collapsable subselects and grouping on subselects, DISTINCT
operations can now be rewritten internally and are thereby now supported for all queries.
Wrap Up
In this batch of work we:
- Implemented nested subquery class attribute rewriting, finally ;)
- Rewrote how row fetching works so that subqueries could be supported without intermediary result sets prematurely fetching data (which would degrade disk and network performance, increasing infrastructure costs)
- Modified CrateDB’s grouping logic (which includes aggregate functions) so that subselects are supported
- Added support for grouping on joins, because now they can be internally rewritten as subselects
- Added support for DISTINCT for all queries, because the queries we rewrite them to internally are now also supported
This isn’t the end of our work on subselects.
In my next lab notes post I’m going to take a look at some additional subselects improvements we’ve been working on.
Stay tuned.