Skip to content
Explore

Try CrateDB Live

Scenario: Hybrid Search

To familiarize yourself with the data, let's run a simple SELECT statement that returns data for one specific area number:

SELECT
    name,
    details['description'] AS desc_text,
    details['description_vec'] AS desc_vec
FROM demo.community_areas
WHERE areanumber = 51;

Take a look at the values for desc_text and desc_vec.

  • desc_text is a free-text description of the characteristics of the community area, sourced from Wikipedia. We'll use this to explore CrateDB's full-text search capabilities.
  • desc_vec is a FLOAT_VECTOR column, containing vector embeddings created from the text in desc_text by running it through OpenAI's text-embedding-3-large model. These embeddings have been created for you, so you don't need to use the OpenAI API to work with data. We chose to use 2048 dimensions.

A. Full-text Search

The first type of search we'll learn about here is full-text search. We use full-text search when we want to find documents containing particular words or phrases whilst considering the search query can contain typos or synonyms and that we may want to search for given prefixes or perform fuzzy matching.

CrateDB uses Apache Lucene for full-text search. Search indexes can be built over any number of TEXT columns in a table, including those deeply nested inside OBJECT columns. Composite indexes containing data from more than one TEXT column can also be created.

Consider our community_areas table schema:

CREATE TABLE demo.community_areas (
   ...
   details OBJECT(DYNAMIC) AS (
       description TEXT INDEX USING fulltext WITH (analyzer = 'english'),
       ...

Here, description is declared as TEXT with the additional INDEX using full-text clause. This tells CrateDB to create a full-text index for this field and that we expect the content to be in English.

The MATCH predicate is used to perform full-text searches. Let's search for the term "railway" in our community area data:

SELECT
    name,
    _score,
    details['description'] as description
FROM demo.community_areas
WHERE MATCH(details['description'], 'railway')
ORDER BY _score DESC;

MATCH returns a special column, _score. The _score column represents how well a document matches the query. Higher scores indicate more relevant results.

A1. Experimenting with Full-text Search

The following query searches for the terms "railroad" OR "tracks":

SELECT
    name,
    _score,
    details['description'] AS description
FROM demo.community_areas
WHERE MATCH(details['description'], 'railroad tracks')
ORDER BY _score DESC
LIMIT 5;

Take a moment to study where the terms "railroad" or "tracks" are contained in the above matches and how the _score differs.

What if we wanted to search for the specific phrase "railroad tracks"? For that, we add USING phrase:

SELECT
    name,
    _score,
    details['description'] AS description
FROM demo.community_areas
WHERE MATCH(details['description'], 'railroad tracks') USING phrase
ORDER BY _score DESC
LIMIT 5;

Take a moment to look at the results here and see how they differ to those from the previous query that searched for "railroad" or "tracks".

Now, let's search for communities whose description matches both "railroad" and "historic":

SELECT
    name,
    _score,
    details['description'] AS description
FROM demo.community_areas
WHERE MATCH(details['description'], 'railroad historic') USING best_fields WITH (operator = 'and')
ORDER BY _score DESC
LIMIT 5;

Again, take a moment to study the text in each matching result.

A2. Combining Full-text Search with other criteria

As full-text search in CrateDB uses SQL, you can combine it with other criteria. For example, let's search for community areas whose description matches term "Univresity" (term containing a typo).

SELECT
    name,
    _score,
    details['population'] AS population,
    details['description'] AS description
FROM demo.community_areas
WHERE MATCH(details['description'], 'Univresity')
ORDER BY _score DESC
LIMIT 5;

How many results do we get? None... because there's a small typo in the search term. Specifying a fuzziness factor helps compensate for this sort of error in user input. Let's try again:

SELECT
    name,
    _score,
    details['population'] AS population,
    details['description'] AS description
FROM demo.community_areas
WHERE MATCH(details['description'], 'Univresity') USING best_fields WITH (fuzziness = 2)
ORDER BY _score DESC
LIMIT 5;

By adding a second clause, we can limit the results to those areas with a population of at least 30,000 people:

SELECT
    name,
    _score,
    details['population'] AS population,
    details['description'] AS description
FROM demo.community_areas
WHERE MATCH(details['description'], 'Univresity') USING best_fields WITH (fuzziness = 2)
  AND details['population'] >= 30000
ORDER BY _score DESC
LIMIT 5;

Here's an example of a negative search... we'll look for smaller communities with a population of 10,000 or fewer and whose descriptions don't mention railroads:

SELECT
    name,
    _score,
    details['population'] AS population,
    details['description'] AS description
FROM demo.community_areas
WHERE NOT MATCH(details['description'], 'railroad')
  AND details['population'] <= 10000
ORDER BY _score DESC;