Analyzers, Tokenizers, and Filters¶
CrateDB provides the capabilities to adjust the full-text search behaviour by using analyzers from the large array of open-source Lucene components, or by employing proprietary analyzers.
Analyzers consist of two parts, filters, and tokenizers. Each analyzer must contain one tokenizer and only one tokenizer can be used.
Tokenizers decide how to divide the given text into parts. Filters perform a series of transformations by passing the given text through a number of operations. They are divided into token filters and character filters, discriminating between filters applied before, or after the tokenization step.
Popular filters are stopword lists, lowercase transformations, or word stemmers. Those articles share more insights about the concepts of analyzers on behalf of Apache Solr and Elasticsearch, likewise building upon Lucene technologies.
Through parameterization of filters and tokenizers, the ingredients of
analyzers, and the CREATE ANALYZER
SQL command, analyzers can be created and
configured at runtime.
Adjusting Tokenization¶
A common approach in information retrieval is to use n-grams for fuzzy search or spelling corrections. Each word will be split into tokens of n number of grams. By decreasing the length of the n-grams, the search will become more fuzzy.
Tip
CrateDB also provides a dedicated option for conducting Fuzzy Search, which is the recommended way to conduct fuzzy searches. This section outlines how to achieve the same by using a custom analyzer.
Example
-- Start with a blank canvas.
DROP TABLE IF EXISTS person;
-- Define a custom analyzer, configuring its tokenizer.
-- A 2-2 n-gram tokenizer will split a word into tokens
-- starting from length 2 up until length 2.
CREATE ANALYZER fuzzy_ngram (
TOKENIZER _ WITH (
type='ngram',
min_gram=2,
max_gram=2,
token_chars=['letter']));
-- Define table schema using a full-text index
-- and the custom analyzer defined above.
CREATE TABLE person (
firstname VARCHAR,
lastname VARCHAR,
INDEX lastname_ft
USING FULLTEXT (lastname) WITH (analyzer = 'fuzzy_ngram'));
-- Provide a few samples worth of data.
INSERT INTO person (firstname, lastname)
VALUES ('charly', 'brown'), ('charly', 'braun'), ('charly', 'browne');
-- Synchronize writes.
REFRESH TABLE person;
-- Invoke a full-text query including a typing error.
SELECT firstname, lastname, _score
FROM person
WHERE MATCH(lastname_ft, 'bronw')
ORDER BY _score DESC;
+-----------+----------+------------+
| firstname | lastname | _score |
+-----------+----------+------------+
| charly | brown | 0.17363958 |
| charly | browne | 0.1585405 |
| charly | braun | 0.13076457 |
+-----------+----------+------------+
SELECT 3 rows in set (0.016 sec)
Synonyms¶
In a table with firstname and lastname, find matches that tolerate different variants of first names coming from user input, to find nicknames and diminutive names such as
William=Bill
.
CrateDB supports synonym files. The synonym file needs to be placed in the config folder and must be in the Solr or WordNet synonym file format.
Inherited from Elasticsearch, this feature provides the same features as illustrated at the documentation about its synonym token filter.
Apache Solr File Format
Solr’s synonym file format contains a list of synonyms, one per line. This may also be a comma-separated list of multiple paths. This format uses two different definitions:
- Equivalents:
Groups of words that are equivalent. Define by separating words by commas.
couch, sofa, divan ipod, i-pod, i pod computer, pc, laptop
- Mappings:
Explicitly map a group of words to other words. Words on the left hand side of the rule definition are expanded into all the possibilities described on the right hand side.
teh => the small => tiny,teeny,weeny huge,ginormous,humungous => large personal computer => pc sea biscuit, sea biscit => seabiscuit
Wordnet File Format
A typical Wordnet file defining synonyms looks like this.
s(900516492,3,'Computer',n,1,0).
s(900516492,3,'PC',n,1,0).
s(900516492,3,'Laptop',n,1,0).
Examples
To exercise the examples below, manifest two synonym files, and start CrateDB using the Docker command. In a different setup, please adjust file system paths accordingly.
File synonyms-solr.txt
:
William => Bill
File synonyms-wordnet.txt
:
s(900516492,3,'William',n,1,0).
s(900516492,3,'Bill',n,1,0).
docker run --rm -it --name=cratedb --publish=4200:4200 --env=CRATE_HEAP_SIZE=2g \
--volume="$PWD/synonyms-solr.txt:/crate/config/synonyms-solr.txt" \
--volume="$PWD/synonyms-wordnet.txt:/crate/config/synonyms-wordnet.txt" \
crate -Cdiscovery.type=single-node
This example uses the synonyms-solr.txt
in Solr format.
-- Start with a blank canvas.
DROP TABLE IF EXISTS person;
DROP ANALYZER firstname_synonyms;
-- Define a custom analyzer using a synonym file in Solr format.
CREATE ANALYZER firstname_synonyms (
TOKENIZER lowercase,
TOKEN_FILTERS (
_ WITH (
type='synonym',
synonyms_path='synonyms-solr.txt')));
-- Define table schema using a full-text index
-- and the custom analyzer defined above.
CREATE TABLE person (
firstname VARCHAR,
lastname VARCHAR,
INDEX firstname_ft
USING FULLTEXT (firstname) WITH (analyzer = 'firstname_synonyms'));
-- Provide a few samples worth of data.
INSERT INTO person (firstname, lastname)
VALUES ('Bill', 'Gates'), ('William', 'Pereira'), ('William', 'Shatner');
-- Synchronize writes.
REFRESH TABLE person;
-- Invoke a full-text query including a typing error.
SELECT firstname, lastname, _score
FROM person
WHERE MATCH(firstname_ft, 'William')
ORDER BY _score DESC;
+-----------+----------+------------+
| firstname | lastname | _score |
+-----------+----------+------------+
| Bill | Gates | 0.13076457 |
| William | Shatner | 0.13076457 |
| William | Pereira | 0.13076457 |
+-----------+----------+------------+
SELECT 3 rows in set (0.070 sec)
This example uses the synonyms-wordnet.txt
in Wordnet format.
-- Start with a blank canvas.
DROP TABLE IF EXISTS person;
DROP ANALYZER firstname_synonyms;
-- Define a custom analyzer using a synonym file in Wordnet format.
CREATE ANALYZER firstname_synonyms (
TOKENIZER lowercase,
TOKEN_FILTERS (
_ WITH (
type='synonym',
format='wordnet',
synonyms_path='synonyms-wordnet.txt')));
-- Define table schema using a full-text index
-- and the custom analyzer defined above.
CREATE TABLE person (
firstname VARCHAR,
lastname VARCHAR,
INDEX firstname_ft
USING FULLTEXT (firstname) WITH (analyzer = 'firstname_synonyms'));
-- Provide a few samples worth of data.
INSERT INTO person (firstname, lastname)
VALUES ('Bill', 'Gates'), ('William', 'Pereira'), ('William', 'Shatner');
-- Synchronize writes.
REFRESH TABLE person;
-- Invoke a full-text query including a typing error.
SELECT firstname, lastname, _score
FROM person
WHERE MATCH(firstname_ft, 'William')
ORDER BY _score DESC;
+-----------+----------+------------+
| firstname | lastname | _score |
+-----------+----------+------------+
| Bill | Gates | 0.20922333 |
| William | Pereira | 0.1325975 |
| William | Shatner | 0.1325975 |
+-----------+----------+------------+
SELECT 3 rows in set (0.107 sec)