CrateDB supports full text search over text data with Lucene. In this video I'll show you when you should consider full text search, how it works and how its behaviour can be customized. We will of course also examine some full text search queries using our Chicago neighbourhoods data set.
When should you consider using full text search?
- When you have lots of semi or unstructured data and you want to query for terms and phrases in that data.
- When you need a search engine that understands the language that the text is written in.
- When you want to consider search results in order of relevance to your query.
Querying text data for character sequences using LIKE isn't going to help with these requirements, especially if you need fuzzy searching or want to search for phrases.
What can full text search be applied to in CrateDB? Any text column in a table, including those deeply nested in objects.
And how does this work in Crate DB? Full text search is powered by the indexing capabilities of Apache Lucene and queries are expressed in standard SQL.
Full text search queries can also be combined with the querying and filtering by other data types. For example, with our Chicago data set, we might perform a full text search over the community areas data, find areas that match the term historic or railroad and which have a population above a certain threshold, where the population data is stored in a regular numerical column in the table.
Here's a SQL statement that creates a table for our community area data. As part of the data set, we have a textual description of each area that we're storing as the field description inside the details object. Description is of type TEXT and here we're telling CrateDB to create a full text index for this field and that we have English language content.
When new rows are written to this table, CrateDB will analyse the content of the description field and build and maintain a search index for it. This indexing process is extremely fast, minimising the time between new data being received and the ability to query it. In this version of the CREATE TABLE statement for our community_areas table, we're defining a separate Composite Index that combines the contents of the name and the description fields. Let's look at how the indexing process works in more detail.
Imagine we have two text documents that look like this, and that content is a TEXT column with full text search indexing enabled. As documents in this column are added or updated, they are first processed by an analyzer. The analyzer takes the contents of the field and splits it into tokens which are used in the search index. The analyzer consists of a tokenizer, zero or more token filters, and zero or more char filters.
When content is analysed to become a stream of tokens, the char filter is applied, filtering out some special characters, for example apostrophes. A tokeniser takes the filtered character stream and splits it into tokens, for example using white space to distinguish between tokens. Using this process an inverted search index is created. This maps tokens or terms back to the documents that contain them and normalises them for example to lowercase. The analyser process is highly configurable and can be adapted to different languages and for different content types.
CrateDB supports numerous analysers out-of-the-box. These are pre configured for common scenarios. For example, the standard analyzer includes the lowercase token filter to transform tokens to lowercase and a stop words token filter. This removes commonly used words that aren't needed for searches because they contain very little meaning. Examples stop words in English include a, the, is, and are. When configuring an analyzer here are some of the tokenier options supported. When processing documents containing e-mail addresses, the URL e-mail tokeniser ensures that these are recognised as individual tokens.
Here are some of the supported token filters:
For example, stemmer will have the token plus words with the same root form or stem. This is language specific. In English, runs, running and ran might all stem to the word run.
Phonetic converts tokens to their phonetic representations. Finally, it is a representative selection of the character filters.
The HTML strip filter removes HTML elements from analyse text, useful when the text contains markup.
Remember that more than one token and character filter can be used when creating a custom analyzer. There are many more options available to you. Consult the full Text Analyzers section of the CrateDB documentation for a comprehensive guide.
Custom analyzers can be created using the CREATE ANALYZER SQL statement. Here I'm specifying an analyzer called myanalyzer that uses the white space tokenizer and the lowercase filter. It also contains a custom char filter that maps certain sequences of characters to others, for example replacing ph with F. The custom analyzer can then be referenced by name when defining full text indexes against text columns in a CrateDB table. You can also create new analyzers by extending the behaviour of existing ones.
Here I'm extending the Snowball analyzer to use the German language when tokenizing. CrateDB supports a wide range of languages for building full text search indexes out-of-the-box. Consult the documentation for more details on what's supported with each language.
Now we've indexed our data for full text search, it's time to try some queries against it. The MATCH predicate is used to perform these queries. At it's simplest, MATCH takes the field to query as a first parameter and a search query term as a second. This query searches for community areas whose description matches the term railway. When using MATCH, we can also return a score that indicates the relative quality of the match. Select _score to retrieve. This is the result of running this query. This is the result with the highest relevant score and we can see where our search term matched in the description. This query uses 2 terms, 'railroad' and 'tracks'. MATCH treats this as a query for railroad or tracks. When we run this query, we see that the highest scoring result contains 2 occurrences of tracks and one of railroad.
Here I'm specifying I want to search for the phrase 'railroad yard'. This means that I want to find the results containing those terms together in that order. Here we see we have a match for railroad yards. Other community areas that contain mentions of railroads and stock yards don't match and aren't returned here. MATCH also supports the option to specify a 'slop' or proximity for words in a phrase. For example, setting a slop of one would match 'railroad goods yards'. It's important to remember that with CrateDB, other data types can be used in SELECT queries that also leverage full text search.
Here I'm looking for community areas whose descriptions contain both the terms 'railroad' and 'historic' or words that stem from these words and I've specified the AND operator instead of the default OR as an option to the MATCH predicate. I've also added a regular SQL AND clause to check the value of the population field inside the details object.
These multi model queries can be very powerful. For example, we might combine geospatial data with review ratings and text descriptions to find a restaurant that's got a high star rating, allows dogs to visit, and is within 5 kilometres of our hotel.
This video provided a high level introduction to full text search in CrateDB. To learn more about how to create and configure full text search indexes and perform rich queries with the match predicate, please refer to the Crate DB documentation where you'll find reference materials and tutorials.