Try CrateDB Live
- 1. Run CrateDB
- 2. Choose Scenario
- 3. Get Ready
- 4. Import Data
- 5. Explore Queries
- 6. More Queries
- 7. Connect
- 8. Next Steps
To familiarize yourself with the data, let's run a simple SELECT statement that returns data for one specific area number:
SELECT name, details['description'] AS desc_text, details['description_vec'] AS desc_vec FROM demo.community_areas WHERE areanumber = 51;
Take a look at the values for desc_text and desc_vec.
desc_textis a free-text description of the characteristics of the community area, sourced from Wikipedia. We'll use this to explore CrateDB's full-text search capabilities.desc_vecis aFLOAT_VECTORcolumn, containing vector embeddings created from the text in desc_text by running it through OpenAI's text-embedding-3-large model. These embeddings have been created for you, so you don't need to use the OpenAI API to work with data. We chose to use 2048 dimensions.
A. Full-text Search
The first type of search we'll learn about here is full-text search. We use full-text search when we want to find documents containing particular words or phrases whilst considering the search query can contain typos or synonyms and that we may want to search for given prefixes or perform fuzzy matching.
CrateDB uses Apache Lucene for full-text search. Search indexes can be built over any number of TEXT columns in a table, including those deeply nested inside OBJECT columns. Composite indexes containing data from more than one TEXT column can also be created.
Consider our community_areas table schema:
CREATE TABLE demo.community_areas ( ... details OBJECT(DYNAMIC) AS ( description TEXT INDEX USING fulltext WITH (analyzer = 'english'), ...
Here, description is declared as TEXT with the additional INDEX using full-text clause. This tells CrateDB to create a full-text index for this field and that we expect the content to be in English.
The MATCH predicate is used to perform full-text searches. Let's search for the term "railway" in our community area data:
SELECT name, _score, details['description'] as description FROM demo.community_areas WHERE MATCH(details['description'], 'railway') ORDER BY _score DESC;
MATCH returns a special column, _score. The _score column represents how well a document matches the query. Higher scores indicate more relevant results.
A1. Experimenting with Full-text Search
The following query searches for the terms "railroad" OR "tracks":
SELECT name, _score, details['description'] AS description FROM demo.community_areas WHERE MATCH(details['description'], 'railroad tracks') ORDER BY _score DESC LIMIT 5;
Take a moment to study where the terms "railroad" or "tracks" are contained in the above matches and how the _score differs.
What if we wanted to search for the specific phrase "railroad tracks"? For that, we add USING phrase:
SELECT name, _score, details['description'] AS description FROM demo.community_areas WHERE MATCH(details['description'], 'railroad tracks') USING phrase ORDER BY _score DESC LIMIT 5;
Take a moment to look at the results here and see how they differ to those from the previous query that searched for "railroad" or "tracks".
Now, let's search for communities whose description matches both "railroad" and "historic":
SELECT name, _score, details['description'] AS description FROM demo.community_areas WHERE MATCH(details['description'], 'railroad historic') USING best_fields WITH (operator = 'and') ORDER BY _score DESC LIMIT 5;
Again, take a moment to study the text in each matching result.
A2. Combining Full-text Search with other criteria
As full-text search in CrateDB uses SQL, you can combine it with other criteria. For example, let's search for community areas whose description matches term "Univresity" (term containing a typo).
SELECT name, _score, details['population'] AS population, details['description'] AS description FROM demo.community_areas WHERE MATCH(details['description'], 'Univresity') ORDER BY _score DESC LIMIT 5;
How many results do we get? None... because there's a small typo in the search term. Specifying a fuzziness factor helps compensate for this sort of error in user input. Let's try again:
SELECT name, _score, details['population'] AS population, details['description'] AS description FROM demo.community_areas WHERE MATCH(details['description'], 'Univresity') USING best_fields WITH (fuzziness = 2) ORDER BY _score DESC LIMIT 5;
By adding a second clause, we can limit the results to those areas with a population of at least 30,000 people:
SELECT name, _score, details['population'] AS population, details['description'] AS description FROM demo.community_areas WHERE MATCH(details['description'], 'Univresity') USING best_fields WITH (fuzziness = 2) AND details['population'] >= 30000 ORDER BY _score DESC LIMIT 5;
Here's an example of a negative search... we'll look for smaller communities with a population of 10,000 or fewer and whose descriptions don't mention railroads:
SELECT name, _score, details['population'] AS population, details['description'] AS description FROM demo.community_areas WHERE NOT MATCH(details['description'], 'railroad') AND details['population'] <= 10000 ORDER BY _score DESC;