Fulltext indices¶
Fulltext indices take the contents of one or more fields and split it up into tokens that are used for fulltext-search. The transformation from a text to separate tokens is done by an analyzer. In order to create fulltext search queries a fulltext index with an analyzer must be defined for the related columns.
Table of contents
Index definition¶
In CrateDB, every column’s data is indexed using the plain
index method by
default. Currently there are three choices related to index definition:
Warning
Creating an index after a table was already created is currently not supported, so think carefully while designing your table definition.
Disable indexing¶
Indexing can be turned off by using the INDEX OFF
column definition.
cr> create table table_a (
... first_column text INDEX OFF
... );
CREATE OK, 1 row affected (... sec)
Note
Disabling indexing will reduce disk consumption but cause poorer query performance.
Note
INDEX OFF
cannot be used with
partition columns, as those are not stored
as normal columns of a table.
Plain index (default)¶
An index of type plain
is indexing the input data as-is without analyzing.
Using the plain
index method is the default behaviour but can also be
declared explicitly:
cr> create table table_b1 (
... first_column text INDEX using plain
... );
CREATE OK, 1 row affected (... sec)
This results in the same behaviour than without any index declaration:
cr> create table table_b2 (
... first_column text
... );
CREATE OK, 1 row affected (... sec)
Fulltext index with analyzer¶
By defining an index on a column, it’s analyzed data is indexed instead of the raw data. Thus, depending on the used analyzer, querying for the exact data may not work anymore. See Built-in analyzers for details about available builtin analyzer or Creating a custom analyzer.
If no analyzer is specified when using a fulltext index, the standard analyzer is used:
cr> create table table_c (
... first_column text INDEX using fulltext
... );
CREATE OK, 1 row affected (... sec)
Defining the usage of a concrete analyzer is straight forward by defining the
analyzer as a parameter using the WITH
statement:
cr> create table table_d (
... first_column text INDEX using fulltext with (analyzer = 'english')
... );
CREATE OK, 1 row affected (... sec)
Defining a named index column definition¶
It’s also possible to define an index column which treat the data of a given column as input. This is especially useful if you want to search for both, the exact and analyzed data:
cr> create table table_e (
... first_column text,
... INDEX first_column_ft using fulltext (first_column)
... );
CREATE OK, 1 row affected (... sec)
Of course defining a custom analyzer is possible here too:
cr> create table table_f (
... first_column text,
... INDEX first_column_ft
... using fulltext(first_column) with (analyzer = 'english')
... );
CREATE OK, 1 row affected (... sec)
Defining a composite index¶
Defining a composite (or combined) index is done using the same syntax as above
despite multiple columns are given to the fulltext
index method:
cr> create table documents_a (
... title text,
... body text,
... INDEX title_body_ft
... using fulltext(title, body) with (analyzer = 'english')
... );
CREATE OK, 1 row affected (... sec)
Composite indices can include nested columns within object columns as well:
cr> create table documents_b (
... title text,
... author object(dynamic) as (
... name text,
... birthday timestamp with time zone
... ),
... INDEX author_title_ft using fulltext(title, author['name'])
... );
CREATE OK, 1 row affected (... sec)
Note
If plain
index method is used, this internally translates to
fulltext with (analyzer = 'keyword')
.
Creating a custom analyzer¶
An analyzer consists of one tokenizer, zero or more token-filters, and zero or more char-filters.
When a field-content is analyzed to become a stream of tokens, the char-filter is applied at first. It is used to filter some special chars from the stream of characters that make up the content.
Tokenizers split the possibly filtered stream of characters into tokens.
Token-filters can add tokens, delete tokens or transform them to finally produce the desired stream of tokens.
With these elements in place, analyzers provide fine grained control over building a token stream used for fulltext search. For example you can use language specific analyzers, tokenizers and token-filters to get proper search results for data provided in a certain language.
Here is a simple Example:
cr> CREATE ANALYZER myanalyzer (
... TOKENIZER whitespace,
... TOKEN_FILTERS (
... lowercase,
... kstem
... ),
... CHAR_FILTERS (
... html_strip
... )
... );
CREATE OK, 1 row affected (... sec)
This creates a custom analyzer called myanalyzer
. It uses the built-in
Whitespace tokenizer tokenizer and two built-in token filters.
lowercase and kstem, as well as a
mapping char-filter.
:
It is possible to further customize the built-in token filters, char-filters or
tokenizers:
cr> create ANALYZER myanalyzer_customized (
... TOKENIZER whitespace,
... TOKEN_FILTERS (
... lowercase,
... kstem
... ),
... CHAR_FILTERS (
... mymapping WITH (
... type='mapping',
... mappings = ['ph=>f', 'qu=>q', 'foo=>bar']
... )
... )
... );
CREATE OK, 1 row affected (... sec)
This example creates another analyzer. This time called
myanalyzer_customized
. It uses the same tokenizer and token filters as in
the previous example, but specifies custom options to the
mapping char-filter.
:
The name (mymapping
) is a custom name which may not conflict with built-in
char-filters or other custom char-filters.
The provided type
property is required as it specifies which built-in
char-filter should be customized. The other option mappings
is specific to
the used type/char-filter.
Tokenizer and token-filters can be customized in the same way.
Note
Altering analyzers is not supported yet.
See also
CREATE ANALYZER for the syntax reference.
Built-in tokenizers for a list of built-in tokenizer.
Built-in token filters for a list of built-in token-filter.
Built-in char filter for a list of built-in char-filter.
Extending a built-in analyzer¶
Existing Analyzers can be used to create custom Analyzers by means of extending them.
You can extend and parameterize Built-in analyzers like this:
cr> create ANALYZER "german_snowball" extends snowball WITH (
... language = 'german'
... );
CREATE OK, 1 row affected (... sec)
If you extend Built-in analyzers, tokenizer, char-filter or token-filter cannot be defined. In this case use the parameters available for the extended Built-in analyzers.
If you extend custom-analyzers, every part of the analyzer that is omitted will be taken from the extended one. Example:
cr> create ANALYZER e2 EXTENDS myanalyzer (
... TOKENIZER mypattern WITH (
... type = 'pattern',
... pattern = '.*'
... )
... );
CREATE OK, 1 row affected (... sec)
This analyzer will use the char-filters and token-filters from myanalyzer
and will override the tokenizer with mypattern
.
See also
See the reference documentation of the Built-in analyzers to get detailed information on the available analyzers.