In https://cratedb.com/blog/hybrid-search-explained we learned about Hybrid Search and how to do it in pure SQL. The resulting query can be hard to understand if you don't have too much experience with Common Table Expressions (CTEs). In this article, we will dive deeper into CTEs and the finer details of the query.
Recap
In the previous article, we learned that hybrid search is about running some queries that capture different meanings and combine them. Keep this in mind as we see how CTEs are very similar.
Common Table Expressions.
Common table expressions are sub-queries that can be referenced from a main query. They are temporal, meaning that outside of this main query, they do not exist.
Let's look at the most basic CTE:
WITH a AS (SELECT 1) SELECT a FROM a
+---+ | 1 | +---+ | 1 | +---+
WITH x as is where we define the name of the sub-query, inside the brackets (query) is the actual sub-query. The main query comes after the definition of the sub-queries.
We can include as many sub-queries as needed.
WITH Q AS (SELECT 1), Q2 AS (SELECT 2), Q3 AS (SELECT 3) SELECT * FROM Q, Q2, Q3
+---+---+---+ | 1 | 2 | 3 | +---+---+---+ | 1 | 2 | 3 | +---+---+---+
See the pattern now? To top it off, let's look at a practical example. Let's build a query that reports the names of the tables in CrateDB, their health, and how many rows there are. We'll use this for our shiny new dashboard.
In CrateDB, several tables contain metadata about the database itself. Some of these can be found in the sys schema. In sys.health there is information about the health of the tables. Read more about health.
SELECT table_name, CASE WHEN 'RED' = ANY(ARRAY_AGG(health)) then 'RED' WHEN 'YELLOW' = ANY(ARRAY_AGG(health)) then 'YELLOW' ELSE 'GREEN' END AS overall_health FROM sys.health GROUP BY table_name, table_schema
+---------------+----------------+ | table_name | overall_health | +---------------+----------------+ | "foo" | "GREEN" | | "tbl1" | "GREEN" | | "restaurants" | "GREEN" | | "taxi" | "GREEN" | | "data" | "GREEN" | +---------------+----------------+
Now, for the second part, the row count information can be found in sys.shards are split into different records, each representing one shard. If we sum all the records from the primary shards of a table, we get the total existing records for that table.
SELECT table_name, SUM(num_docs) as total_records FROM sys.shards WHERE primary = TRUE GROUP BY table_name, schema_name
+---------------+---------------+ | table_name | total_records | +---------------+---------------+ | "foo" | 4 | | "taxi" | 10 | | "tbl1" | 2 | | "restaurants" | 25359 | | "data" | 149 | +---------------+---------------+
Now, we can perform both queries as sub-queries in a CTE. We will join both sub-queries using an inner-join on the table name.
WITH health as ( SELECT table_name, CASE WHEN 'RED' = ANY(ARRAY_AGG(health)) then 'RED' WHEN 'YELLOW' = ANY(ARRAY_AGG(health)) then 'YELLOW' ELSE 'GREEN' END AS overall_health FROM sys.health GROUP BY table_name, table_name ), row_count as ( SELECT table_name, SUM(num_docs) as total_records FROM sys.shards WHERE primary = TRUE GROUP BY table_name, schema_name ) SELECT * FROM health h, row_count r WHERE h.table_name = r.table_name
+--------------+----------------+---------------+ | table_name | overall_health | total_records | +--------------+----------------+---------------+ | foo | GREEN | 4 | | taxi | GREEN | 10 | | tbl1 | GREEN | 2 | | restaurants | GREEN | 25359 | | data | GREEN | 149 | +--------------+----------------+---------------+
Does this ring a bell with our Hybrid Search queries? We did the same thing, made a vector search query and a full-text query, and joined the results using some special weighted combination (convex or RRF).
Let's have a look at the `convex` combination query again.
WITH bm25 as ( SELECT (_score - MIN(_score) OVER ()) / (MAX(_score) OVER () - MIN(_score) OVER ()) as _score, title FROM fs_search WHERE MATCH("content", 'knn search') ORDER BY _score DESC ), vector as ( SELECT _score, title FROM fs_search WHERE KNN_MATCH(xs, [vector...], 15) ) SELECT a * bm25._score + (1 - a) * vector._score as hybrid_score, bm25._score as bm25_score, vector._score as vector_score, bm25.title FROM bm25, vector WHERE bm25.title = vector.title ORDER BY hybrid_score DESC
Two sub-queries are defined: bm25 and vector, and later on in the main query they are joined by title and the data we want selected, one of them being the actual convex combination that gives us the hybrid score.
Looking at the query you might be wondering what are we doing with _score - MIN(_score) OVER ()) / (MAX(_score) OVER () - MIN(_score) OVER () in the bm25 sub-query, we are normalizing the values.
Normalizing bm25 _score.
bm25 or full-text searches in CrateDB return a _score column. The higher the scores are the more the searched tokens are valued by the algorithm. The scores are not normalized between 0 and 1, e.g:
+-----------+-----------------+-----------------+------------+ | _score | city | country | population | |-----------|-----------------|-----------------|------------| | 2.4358468 | "New York" | "United States" | 18908608 | | 2.4358468 | "New Orleans" | "United States" | 932759 | +-----------+-----------------+-----------------+------------+
See how the first row's _score is 2.4358468.
A Vector search _score is a measure of how similar two vectors are. Even though we might use the euclidean distance to calculate their similarity, the actual distance is not returned by our KNN_MATCH function, only how related they are. We usually can divide 1 / distance and normalize between 0 and 1 to get the aforementioned value.
We should not sum a 0-1 normalized value with another value that can reach high values like 14 for example; a 'good' vector match might give us 0.89 (while 1 is a perfect match) but a bad full-text match might give a score of 3 if a higher score exists within the same result set. It would bias the final score towards the bad full-text search result and get a very inaccurate result.
Note that _score values of different queries cannot be meaningfully compared, as their values depend on the context of the result set.
By normalizing the bm25 value between 0 and 1 we can more accurately combine it with other normalized values. In this case using a common normalize method, min-max. The only special thing here is that we use window functions MIN(_score) OVER ()` and `MAX(_score) OVER () to calculate the min and max.
There are several methods to normalize values, using one or another might depend on your data and experimentation, although one thing you could do is use a simple min-max normalization method and fix the bias in the convex combination weights.
RRF query.
WITH bm25 as ( SELECT _score, RANK() OVER ( ORDER BY _score DESC ) as rank, title FROM fs_search WHERE MATCH("content", 'knn search') ORDER BY _score DESC ), vector as ( SELECT _score, RANK() OVER ( ORDER BY _score DESC ) as rank, title FROM fs_search WHERE KNN_MATCH( xs, [vector...], 15 ) ) SELECT TRUNC((1.0 / (bm25.rank + 60)) + (1.0 / (vector.rank + 60)), 6) as final_rank, bm25.rank as bm25_rank, vector.rank as vector_rank, bm25.title FROM bm25, vector WHERE bm25.title = vector.title ORDER BY final_rank DESC
At this point, you are well aware of the CTE structure, and we can skip that part. The only new thing we do in this query is using the RANK OVER () window function to assign a rank to our result set. Let's see an example re-using the `health-total_records` query we built before in the article, for clarity we will omit the sub-queries.
WITH health as (...), row_count as (...) SELECT * FROM health h, row_count r WHERE h.table_name = r.table_name ORDER BY total_records DESC
We will assign a rank, in descending order, to the tables that have more records, with RANK() OVER (). We need to fill in the OVER () clause with the condition for the ranks to be assigned, in this case OVER( ORDER BY SUM(NUM_DOCS) DESC ).
WITH health as (...), row_count as (...) SELECT *, RANK() OVER ( ORDER BY SUM(num_docs) DESC ) as rank FROM health h, row_count r WHERE h.table_name = r.table_name ORDER BY total_records DESC
+--------------+------------------+---------------+------+ | table_name | overall_health | total_records | RANK | +--------------+------------------+---------------+------+ | restaurants | GREEN | 25359 | 1 | | data | GREEN | 149 | 2 | | taxi | GREEN | 10 | 3 | | foo | GREEN | 4 | 4 | | tbl1 | GREEN | 2 | 5 | +--------------+------------------+---------------+------+
Besides RANK the only new thing we do is TRUNC the final_rank because the formula returns long decimals, at some point they carry nothing meaningful for us and can be safely truncated. This makes it easier to read and fit in consoles/on screen. You can, of course, not truncate at all: it's up to you.
Overview
We looked at what CTEs are, how they are constructed, and how to use them to build powerful hybrid-search queries. Also, we looked at smaller details of the hybrid search queries sucSh as how and why we normalized the scores in bm25 queries, how RANK() works, and why we TRUNC in RRF.