Hybrid Search Explained

Written by Ivan Sanchez | 2024-07-22

CrateDB supports three search functions: kNN search via KNN_MATCH, bm25 search via MATCH, and geospatial search via MATCH.

Hybrid Search is a technique that enhances relevancy and accuracy by combining the results of two or more search algorithms, achieving better accuracy and relevancy than each algorithm would individually.

A common scenario is to combine semantic search (vector search) with lexical search (keyword search). Semantic search excels at understanding the context of a phrase. Lexical search is great at finding how many times a keyword or phrase appears in a document, taking into account the length and the average length of your documents.

Imagine an e-commerce platform that sells tech products. The user could query: GPU ASUS or graphic card ASUS. Semantic search would understand that both GPU and graphic card refer to a graphics processing unit. If we relied solely on semantic search, we would only get results for graphic cards. However, since the user specifies a particular brand, ASUS, wouldn't it be great if we could return results that match the exact product type and the specific brand they are looking for?

We will go through both search methods: bm25 and vector search, learn different re-ranking techniques, and finally apply them in CrateDB.

Full-text search (BM25)

More in: https://cratedb.com/docs/crate/reference/en/latest/general/dql/fulltext.html

BM25 or full-text search is a bag-of-words algorithm that ranks an unordered set of documents based on the keyword of a query. It considers the number of times the keywords appear, the length of the document, and the average length of all documents.

Source: Wikipedia - Okapi BM25.

CrateDB leverages Lucene's powerful full-text search capabilities allowing you to tweak this search; you can search on several columns, add fuzziness, choose between several analyzers, tokenizers, composite indexes, and more.

To enable full-text search on a column, add a full-text index:

Now we can start doing interesting MATCH queries:

We can tweak fuzziness to get approximate matches.

Vector search (k-nearest neighbors)

There are machine learning models that can transform data like text or images into dense vectors (list of non-zero floating point numbers).
The closer two vectors are in space, the more similar they are, we can use this property to
calculate how similar a query is to our data.

"nvidia gpu" -> [0.393, 0.393, 0.02, 0.18, 0.84, 0.49...]

They are commonly used for search, clustering, recommendations, classification, etc.

You can store these vectors in the FLOAT_VECTOR type, which supports vectors up to a length of 2048, for reference models like OpenAI's text-embedding-3-small can generate vectors of length 1536.

The k-nearest neighbors algorithm calculates the k-closest vectors from a given vector, returning data that is similar/related, search quality heavily depends on
the model used and the quality of the data. You will need to transform the data you want to search into vectors.

Note: CrateDB uses Approximate nearest neighbor search, making it suitable to search through big amounts of data.

The query is "city of love"

Interestingly, "Napoleon was born in this city" yields:

+ LIMIT 1

Actually doing hybrid search

As we mentioned before, hybrid search is the combination of two search results; we do one search, we do another and then combine the results into a single one, there are many different ways of merging results, this process is commonly called Re-ranking. We will cover two simple yet powerful methods.

Convex combination

We can apply a weighted function known as a Convex combination, in simple terms, we add both scores with a weight, and the optimal value of the weight will depend on the data and use case, for some datasets semantic search might be more important than lexical search for example. You will need to experiment and tweak the weight of your use case.

Where α is the weight.

Note: Both scores need to be normalised between [0, 1].

Let's apply the formula to the following search results:

If we take the first row (id = 4) with α = 0.3,

convex = 0.3 * 0.3843 + (1 - 0.3) * 0.2891
convex = 0.31766

The same table ordered by convex:

Reciprocal rank fusion

Reciprocal rank fusion or RRF is a very popular method due to its simplicity and effectiveness. It merges ranks without caring about the specific scores given by different search methods.

Source: Cormack et al.

Note: k is typically recommended to be 60.

Unlike Convex Combination, which sums the scores for each search method, RRF uses the ranks. A rank is the position in the search results; a lower rank indicates a better score, with the best rank being 1.

Let's see an example, given two search results:

With two searches, our formula expands to:

rrf(1) = 1 / (60 + 1) + 1 / (60 + 1) = 0.03278
rrf(4) = 1 / (60 + 2) + 1 / (60 + 3) = 0.03200
rrf(6) = 1 / (60 + 3) + 1 / (60 + 2) = 0.03200

Let's put everything together to get a more precise look:

Wrapping it up (An actual example)

Now that you have a more clear overview of hybrid search in CrateDB, let's see
what would a one-query hybrid search look like.

We will leverage common table expressions to structure our query like:

Note how we use an inner-join to join results from both search methods into a single table.
This means that you will need to have the vector and the full-text indexed columns in the same table or in different tables that can be inner-join, either with an id or some other column.

Example of convex Combination

In this example, we apply a min/max normalization to bm25, which is not normalized by default.

With α = 0.4

Note: In this example we join by title, but as they are in the same table, we could have used the internal _id column.

Reciprocal rank fusion

We will use RANK window function to assign a rank to every row, note how we don't care about normalizing bm25 score here.

Note: final_rank is truncated for convenience.

Summary

In this article, we covered the concept of hybrid search in CrateDB, combining results from different search methods to get more relevant results leveraging semantic and lexical search. CrateDB is a great database for creating products with this technique due to its strong resiliency, fast queries thanks to its unique index-all, SQL language, and its rich and complete data types.

View full post