Relational / JOINs

Overview

CrateDB implements the relational concept of joining tables.

About

When selecting data from CrateDB, you can join one or more relations (tables) to combine columns into one result set.

Details

Joins are essential operations in relational databases. They create a link between rows based on common values and allow the meaningful combination of these rows.

CrateDB was designed to support distributed joins effectively from the very beginning.

Blog: Support for Joins on Multi-Row Sub-Selects

Joining on virtual tables is a crucial feature for many users, and it is especially useful when doing analytics on your data. A virtual table is what we call the result set returned by a subquery.

Being able to use sub-selects as virtual tables for the lifetime of a query is very useful because it means that you can slice and dice your data multiple ways without having to alter your source data, or store duplicate versions of it.

Blog

Example

SELECT *
      FROM (SELECT *
              FROM table_1
          ORDER BY column_a
             LIMIT 100)
        AS virtual_table_1,
INNER JOIN (SELECT *
              FROM (SELECT *
                      FROM table_2
                  ORDER BY column_b
                     LIMIT 100)
                AS virtual_table_2
          GROUP BY column_c)
        AS virtual_table_3
     ON virtual_table_1.column_a = 
        virtual_table_3.column_a

Lab Notes

Join Feature Sub-Selects

2018

Blog: How We Made Joins 23 Thousand Times Faster

Introduces you to the nested-loop join, equi-join, sorted merge vs. hash join, and the block hash join algorithms, and the advancement into a distributed block hash join algorithm.

This blog post illustrates the implementation of the distributed block hash join algorithm to support users who want to run joins on large tables for their analytics needs.

Joins x23 - Part 1 Joins x23 - Part 2 Joins x23 - Part 2
FOSDEM ‘22: Distributed Join Algorithms in CrateDB

Lab Notes

CrateDB Internals Join Algorithms

2018

Blog: Fine-tuning the query optimizer in CrateDB

In cases where you need it, the query optimizer, which tries to find the best logical plan possible for a given query, can be fine-tuned for specific queries, in order to yield better performance.

By default, the effective join order is based on table statistics collected into the pg_catalog.pg_stats system table.

CrateDB offers the option to disable join-reordering and rely exactly on the order of the tables as described in the query, which is helpful to get more control over the query execution.

Blog

Example

SET optimizer_reorder_hash_join = false;
SET optimizer_reorder_nested_loop_join = false;

Tuning Tipps

Join Performance Performance Settings

2023