Independent Time Series Benchmark Confirms CrateDB’s Top-Tier Performance

Read now
Skip to content
Blog

Dissecting a Hybrid Search query in SQL

In https://cratedb.com/blog/hybrid-search-explained we learned about Hybrid Search and how to do it in pure SQL. The resulting query can be hard to understand if you don't have too much experience with Common Table Expressions (CTEs). In this article, we will dive deeper into CTEs and the finer details of the query.

Recap

In the previous article, we learned that hybrid search is about running some queries that capture different meanings and combine them. Keep this in mind as we see how CTEs are very similar.

Common Table Expressions.

Common table expressions are sub-queries that can be referenced from a main query. They are temporal, meaning that outside of this main query, they do not exist.

Let's look at the most basic CTE:

WITH a AS (SELECT 1)
SELECT a FROM a
+---+
| 1 |
+---+
| 1 |
+---+

WITH x as is where we define the name of the sub-query, inside the brackets (query) is the actual sub-query. The main query comes after the definition of the sub-queries.

We can include as many sub-queries as needed.

WITH
Q AS (SELECT 1),
Q2 AS (SELECT 2),
Q3 AS (SELECT 3)
SELECT * FROM Q, Q2, Q3
+---+---+---+
| 1 | 2 | 3 |
+---+---+---+
| 1 | 2 | 3 |
+---+---+---+

See the pattern now? To top it off, let's look at a practical example. Let's build a query that reports the names of the tables in CrateDB, their health, and how many rows there are. We'll use this for our shiny new dashboard.

In CrateDB, several tables contain metadata about the database itself. Some of these can be found in the sys schema. In sys.health there is information about the health of the tables. Read more about health.

SELECT table_name,
  CASE 
    WHEN 'RED' = ANY(ARRAY_AGG(health)) then 'RED' 
    WHEN 'YELLOW' = ANY(ARRAY_AGG(health)) then 'YELLOW'
    ELSE 'GREEN'
  END AS overall_health
FROM
  sys.health
GROUP BY 
  table_name,
  table_schema
+---------------+----------------+
| table_name    | overall_health |
+---------------+----------------+
| "foo"         | "GREEN"        |
| "tbl1"        | "GREEN"        |
| "restaurants" | "GREEN"        |
| "taxi"        | "GREEN"        |
| "data"        | "GREEN"        |
+---------------+----------------+

Now, for the second part, the row count information can be found in sys.shards are split into different records, each representing one shard. If we sum all the records from the primary shards of a table, we get the total existing records for that table.

Read more about shards.

SELECT
  table_name,
  SUM(num_docs) as total_records
FROM
  sys.shards
WHERE
  primary = TRUE
GROUP BY
  table_name,
  schema_name
+---------------+---------------+
| table_name    | total_records |
+---------------+---------------+
| "foo"         | 4             |
| "taxi"        | 10            | 
| "tbl1"        | 2             |
| "restaurants" | 25359         |
| "data"        | 149           |
+---------------+---------------+

Now, we can perform both queries as sub-queries in a CTE. We will join both sub-queries using an inner-join on the table name.

WITH health as (
  SELECT 
    table_name,
    CASE WHEN 'RED' = ANY(ARRAY_AGG(health)) then 'RED'
    WHEN 'YELLOW' = ANY(ARRAY_AGG(health)) then 'YELLOW'
    ELSE 'GREEN' END AS overall_health 
  FROM sys.health
    GROUP BY
      table_name,
      table_name
),
row_count as (
  SELECT 
    table_name,
    SUM(num_docs) as total_records
  FROM
    sys.shards
  WHERE primary = TRUE 
  GROUP BY
    table_name,
    schema_name
)
SELECT 
  * 
FROM 
  health h,
  row_count r
WHERE 
  h.table_name = r.table_name
+--------------+----------------+---------------+
|  table_name  | overall_health | total_records |
+--------------+----------------+---------------+
| foo          | GREEN          | 4             |
| taxi         | GREEN          | 10            |
| tbl1         | GREEN          | 2             |
| restaurants  | GREEN          | 25359         |
| data         | GREEN          | 149           |
+--------------+----------------+---------------+

Does this ring a bell with our Hybrid Search queries? We did the same thing, made a vector search query and a full-text query, and joined the results using some special weighted combination (convex or RRF).

Let's have a look at the `convex` combination query again.

WITH bm25 as (
  SELECT
    (_score - MIN(_score) OVER ()) / (MAX(_score) OVER () - MIN(_score) OVER ()) as _score,
    title
  FROM
    fs_search
  WHERE
    MATCH("content", 'knn search')
  ORDER BY
    _score DESC
),
vector as (
  SELECT
    _score,
    title
  FROM
    fs_search
  WHERE
    KNN_MATCH(xs, [vector...], 15)
)
SELECT
  a * bm25._score + (1 - a) * vector._score as hybrid_score,
  bm25._score as bm25_score,
  vector._score as vector_score,
  bm25.title
FROM
  bm25,
  vector
WHERE
  bm25.title = vector.title
ORDER BY
  hybrid_score DESC

Two sub-queries are defined: bm25 and vector, and later on in the main query they are joined by title and the data we want selected, one of them being the actual convex combination that gives us the hybrid score.

Looking at the query you might be wondering what are we doing with _score - MIN(_score) OVER ()) / (MAX(_score) OVER () - MIN(_score) OVER () in the bm25 sub-query, we are normalizing the values.

Normalizing bm25 _score.

bm25 or full-text searches in CrateDB return a _score column. The higher the scores are the more the searched tokens are valued by the algorithm. The scores are not normalized between 0 and 1, e.g:

+-----------+-----------------+-----------------+------------+
| _score    | city            | country         | population |
|-----------|-----------------|-----------------|------------|
| 2.4358468 | "New York"      | "United States" | 18908608   |
| 2.4358468 | "New Orleans"   | "United States" | 932759     |
+-----------+-----------------+-----------------+------------+

See how the first row's _score is 2.4358468.

A Vector search _score is a measure of how similar two vectors are. Even though we might use the euclidean distance to calculate their similarity, the actual distance is not returned by our KNN_MATCH function, only how related they are. We usually can divide 1 / distance and normalize between 0 and 1 to get the aforementioned value.

We should not sum a 0-1 normalized value with another value that can reach high values like 14 for example; a 'good' vector match might give us 0.89 (while 1 is a perfect match) but a bad full-text match might give a score of 3 if a higher score exists within the same result set. It would bias the final score towards the bad full-text search result and get a very inaccurate result.

Note that _score values of different queries cannot be meaningfully compared, as their values depend on the context of the result set.

By normalizing the bm25 value between 0 and 1 we can more accurately combine it with other normalized values. In this case using a common normalize method, min-max. The only special thing here is that we use window functions MIN(_score) OVER ()` and `MAX(_score) OVER () to calculate the min and max.

There are several methods to normalize values, using one or another might depend on your data and experimentation, although one thing you could do is use a simple min-max normalization method and fix the bias in the convex combination weights.

RRF query.

WITH bm25 as (
  SELECT
    _score,
    RANK() OVER (
      ORDER BY
        _score DESC
    ) as rank,
    title
  FROM
    fs_search
  WHERE
    MATCH("content", 'knn search')
  ORDER BY
    _score DESC
),

vector as (
  SELECT
    _score,
    RANK() OVER (
      ORDER BY
        _score DESC
    ) as rank,
    title
  FROM
    fs_search
  WHERE
    KNN_MATCH(
      xs,
      [vector...],
      15
    )
)

SELECT
  TRUNC((1.0 / (bm25.rank + 60)) + (1.0 / (vector.rank + 60)), 6) as final_rank,
  bm25.rank as bm25_rank,
  vector.rank as vector_rank,
  bm25.title
FROM
  bm25,
  vector
WHERE
  bm25.title = vector.title
ORDER BY final_rank DESC

At this point, you are well aware of the CTE structure, and we can skip that part. The only new thing we do in this query is using the RANK OVER () window function to assign a rank to our result set. Let's see an example re-using the `health-total_records` query we built before in the article, for clarity we will omit the sub-queries.

WITH health as (...),
row_count as (...)
SELECT
  *
FROM
  health h,
  row_count r
WHERE
  h.table_name = r.table_name
ORDER BY
  total_records DESC

We will assign a rank, in descending order, to the tables that have more records, with RANK() OVER (). We need to fill in the OVER () clause with the condition for the ranks to be assigned, in this case OVER( ORDER BY SUM(NUM_DOCS) DESC ).

WITH health as (...),
row_count as (...)
SELECT
  *,
  RANK() OVER (
    ORDER BY
      SUM(num_docs) DESC
  ) as rank
FROM
  health h,
  row_count r
WHERE
  h.table_name = r.table_name
ORDER BY
  total_records DESC
+--------------+------------------+---------------+------+
|  table_name  |  overall_health  | total_records | RANK |
+--------------+------------------+---------------+------+
| restaurants  | GREEN	          | 25359         | 1    |
| data	       | GREEN	          | 149           | 2    |
| taxi	       | GREEN	          | 10            | 3    |
| foo          | GREEN	          | 4             | 4    |
| tbl1	       | GREEN	          | 2             | 5    |
+--------------+------------------+---------------+------+

Besides RANK the only new thing we do is TRUNC the final_rank because the formula returns long decimals, at some point they carry nothing meaningful for us and can be safely truncated. This makes it easier to read and fit in consoles/on screen. You can, of course, not truncate at all: it's up to you.

Overview

We looked at what CTEs are, how they are constructed, and how to use them to build powerful hybrid-search queries. Also, we looked at smaller details of the hybrid search queries sucSh as how and why we normalized the scores in bm25 queries, how RANK() works,  and why we TRUNC in RRF.