Fulltext search¶
In order to use fulltext search on one or more columns, a
fulltext index with an analyzer has to be
defined while creating the column: either with CREATE TABLE
or ALTER
TABLE ADD COLUMN
. For more information see Fulltext indices.
Table of contents
MATCH
Predicate¶
Synopsis¶
MATCH (
{ column_or_idx_ident | ( column_or_idx_ident [boost] [, ...] ) }
, query_term
) [ using match_type [ with ( match_parameter [= value] [, ... ] ) ] ]
The MATCH predicate performs a fulltext search on one or more indexed columns or indices and supports different matching techniques. It can also be used to perform geographical searches on Geometric shapes indices.
The actual applicability of the MATCH predicate depends on the index’s type. In
fact, the availability of certain match_types
and match_parameters
depend on the index. This section however, only covers the usage of the MATCH
predicate on fulltext
indices on text
columns. To use MATCH on
Geometric shapes indices, see Geo search.
In order to use fulltext searches on a column, a fulltext index with an analyzer must be created for this column. See Fulltext indices for details. There are different types of Fulltext indices with different goals, however it’s not possible to query multiple index columns with different index types within the same MATCH predicate.
To get the relevance of a matching row, a specific system column _score can be selected. It contains a numeric score relative to the other rows: The higher, the more relevant the row:
cr> select name, _score from locations
... where match(name_description_ft, 'time') order by _score desc;
+-----------+-----------+
| name | _score |
+-----------+-----------+
| Altair | 0.6570115 |
| Bartledan | 0.6416173 |
+-----------+-----------+
SELECT 2 rows in set (... sec)
The MATCH predicate in its simplest form performs a fulltext search against a
single column. It takes the query_term
and, if no analyzer
was
provided, analyzes the term with the analyzer configured on
column_or_idx_ident
. The resulting tokens are then matched against the
index at column_or_idx_ident
and if one of them matches, MATCH returns
TRUE
.
The MATCH predicate can be also used to perform a fulltext search on multiple
columns with a single query_term
and to add weight to specific columns it’s
possible to add a boost
argument to each column_or_idx_ident
. Matches
on columns with a higher boost result in a higher _score value for that document.
The match_type
argument determines how a single query_term
is applied
and how the resulting _score is computed. For more information see
Match Types.
Results are ordered by _score
by default, but can be overridden by adding an ORDER BY
clause.
Arguments¶
- column_or_idx_ident:
A reference to a column or an index.
If the column has an implicit index (e.g. created with something like
TEXT column_a INDEX USING FULLTEXT
) this should be the name of the column.If the column has an explicit index (e.g. created with something like
INDEX "column_a_idx" USING FULLTEXT ("column_a") WITH (...)
) this should be the name of the index.By default every column is indexed but only the raw data is stored, so matching against a
text
column without a fulltext index is equivalent to using the=
operator. To perform real fulltext searches use a fulltext index.- boost:
A column ident can have a boost attached. That is a weight factor that increases the relevance of a column in respect to the other columns. The default boost is 1.
- query_term:
This string is analyzed (using the explicitly given
analyzer
or the analyzer of the columns to perform the search on) and the resulting tokens are compared to the index. The tokens used for search are combined using the booleanOR
operator unless stated otherwise using theoperator
option.- match_type:
Optional. Defaults to
best_fields
for fulltext indices. For details see Match Types.
Note
The MATCH
predicate can only be used in the WHERE clause
and on user-created tables. Using the MATCH
predicate on system tables is
not supported.
One MATCH
predicate cannot combine columns of both relations of a join.
Additionally, MATCH
predicates cannot be used on columns of both
relations of a join if they cannot be logically applied to each of them
separately. For example:
This is allowed:
FROM t1, t2 WHERE match(t1.txt, 'foo') AND match(t2.txt, 'bar');``
But this is not:
FROM t1, t2 WHERE match(t1.txt, 'foo') OR match(t2.txt, 'bar');
Match Types¶
The match type determines how the query_term
is applied and the
_score is created, thereby
influencing which documents are considered more relevant. The default
match_type
for fulltext indices is best_fields
.
- best_fields:
Use the _score of the column that matched best. For example if a column contains all the tokens of the
query_term
it’s considered more relevant than other columns containing only one.This type is the default, if omitted.
- most_fields:
This match type takes the _score of every matching column and averages their scores.
- cross_fields:
This match type analyzes the
query_term
into tokens and searches all tokens in all given columns at once as if they were one big column (given they have the same analyzer). All tokens have to be present in at least one column, so querying forfoo bar
should have the tokensfoo
in one column andbar
in the same or any other.- phrase:
This match type differs from
best_fields
in that it constructs a phrase query from thequery_term
. A phrase query will only match if the tokens in the columns are exactly in the same order as the analyzed columns from thequery_term
. So, querying forfoo bar
(analyzed tokens:foo
andbar
) will only match if one of the columns contains those two token in that order - without any other tokens in between.- phrase_prefix:
This match type is roughly the same than
phrase
but it allows to match by prefix on the last token of thequery_term
. For example if your query forfoo ba
, one of the columns has to containfoo
and a token that starts withba
in that order. So a column containingfoo baz
would match andfoo bar
too.
Options¶
The match options further distinguish the way the matching process using a certain match type works. Not all options are applicable to all match types. See the options below for details.
- analyzer:
The analyzer used to convert the
query_term
into tokens.- boost:
This numeric value is multiplied with the resulting _score of this
match
call.If this
match
call is used with other conditions in the where clause a value above1.0
will increase its influence on the overall _score of the whole query, a value below1.0
will decrease it.- cutoff_frequency:
The token frequency is the number of occurrences of a token in a column.
This option specifies a minimum token frequency that excludes matching tokens with a higher frequency from the overall _score. Their _score is only included if another token with a lower frequency also matches. This can be used to suppress results where only high frequency terms like
the
would cause a match.
- fuzziness:
Can be used to perform fuzzy full text search.
On numeric columns use a numeric, on timestamp columns a long indicating milliseconds, on strings use a number indicating the maximum allowed Levenshtein Edit Distance. Use
prefix_length
,fuzzy_rewrite
andmax_expansions
to fine tune the fuzzy matching process.- fuzzy_rewrite:
The same than
rewrite
but only applies to queries usingfuzziness
.- max_expansions:
When using
fuzziness
orphrase_prefix
this options controls to how many different possible tokens a search token will be expanded. Thefuzziness
controls how big the distance or difference between the original token and the set of tokens it is expanded to can be. This option controls how big this set can get.- minimum_should_match:
The number of tokens from the
query_term
to match whenor
is used. Defaults to1
.- operator:
Can be
or
orand
. The default operator isor
. It is used to combine the tokens of thequery_term
. Ifand
is used, every token from thequery_term
has to match. Ifor
is used only the number ofminimum_should_match
have to match.- prefix_length:
When used with
fuzziness
option or withphrase_prefix
this options controls how long the common prefix of the tokens that are considered as similar (same prefix or fuzziness distance/difference)has to be.- rewrite:
When using
phrase_prefix
the prefix query is constructed using all possible terms and rewriting them into another kind of query to compute the score. Possible values areconstant_score_auto
,constant_score_boolean
,constant_score_filter
,scoring_boolean
,``top_terms_N``,top_terms_boost_N
. Theconstant_...
values can be used together with theboost
option to set a constant _score for rows with a matching prefix or fuzzy match.- slop:
When matching for phrases this option controls how exact the phrase match should be (proximity search). If set to
0
(the default), the terms must be in the exact order. If two transposed terms should match, a minimumslop
of2
has to be set. Only applicable tophrase
andphrase_prefix
queries. As an example withslop
2, querying forfoo bar
will not only matchfoo bar
but alsofoo what a bar
.- tie_breaker:
When using
best_fields
,phrase
orphrase_prefix
the _score of every other column will be multiplied with this value and added to the _score of the best matching column.Defaults to
0.0
.Not applicable to match type
most_fields
as this type is executed as if it had atie_breaker
of1.0
.- zero_terms_query:
If no tokens are generated analyzing the
query_term
then no documents are matched. Ifall
is given here, all documents are matched.
Usage¶
A fulltext search is done using the MATCH Predicate predicate:
cr> select name from locations where match(name_description_ft, 'time') order by _score desc;
+-----------+
| name |
+-----------+
| Altair |
| Bartledan |
+-----------+
SELECT 2 rows in set (... sec)
It returns TRUE
for rows which match the search string. To get more
detailed information about the quality of a match, the relevance of the row,
the _score can be selected:
cr> select name, _score
... from locations where match(name_description_ft, 'time') order by _score desc;
+-----------+-----------+
| name | _score |
+-----------+-----------+
| Altair | 0.6570115 |
| Bartledan | 0.6416173 |
+-----------+-----------+
SELECT 2 rows in set (... sec)
Note
The _score
is not an absolute value. It just sets a row in relation to
the other ones.
Searching On Multiple Columns¶
There are two possibilities if a search should span the contents of multiple columns:
use a composite index column on your table. See Defining a composite index.
use the MATCH Predicate predicate on multiple columns. Note, that each of those columns must be indexed using
FULLTEXT
to have the same effect as usingMATCH
against a composite index.
When querying multiple columns, there are many ways how the relevance a.k.a. _score can be computed. These different techniques are called Match Types.
To increase the relevance of rows where one column matches extremely well, use
best_fields
(the default).
If rows with good matches spread over all included columns should be more
relevant, use most_fields
. If searching multiple columns as if they were
one, use cross_fields
.
For searching of matching phrases (tokens are in the exact same order) use
phrase
or phrase_prefix
:
cr> select name, _score from locations
... where match(
... (name_description_ft, inhabitants['name'] 1.5, kind 0.75),
... 'end of the galaxy'
... ) order by _score desc;
+-------------------+------------+
| name | _score |
+-------------------+------------+
| NULL | 1.2599468 |
| Altair | 0.49754602 |
| Outer Eastern Rim | 0.47476405 |
| North West Ripple | 0.46413797 |
| Aldebaran | 0.23530701 |
+-------------------+------------+
SELECT 5 rows in set (... sec)
cr> select name, description, _score from locations
... where match(
... (name_description_ft), 'end of the galaxy'
... ) using phrase with (analyzer='english', slop=4);
+------+-------------------------+-----------+
| name | description | _score |
+------+-------------------------+-----------+
| NULL | The end of the Galaxy.% | 1.2599468 |
+------+-------------------------+-----------+
SELECT 1 row in set (... sec)
A vast amount of options exist to fine-tune your fulltext search. A detailed reference can be found here MATCH Predicate.
Negative Search¶
A negative fulltext search can be done using a NOT
clause:
cr> select name, _score from locations
... where not match(name_description_ft, 'time')
... order by _score, name asc;
+------------------------------------+--------+
| name | _score |
+------------------------------------+--------+
| | 1.0 |
| Aldebaran | 1.0 |
| Algol | 1.0 |
| Allosimanius Syneca | 1.0 |
| Alpha Centauri | 1.0 |
| Argabuthon | 1.0 |
| Arkintoofle Minor | 1.0 |
| Galactic Sector QQ7 Active J Gamma | 1.0 |
| North West Ripple | 1.0 |
| Outer Eastern Rim | 1.0 |
| NULL | 1.0 |
+------------------------------------+--------+
SELECT 11 rows in set (... sec)
Filter By _score¶
It is possible to filter results by the _score column but as its value is a computed value relative to the highest score of all results and consequently never absolute or comparable across searches the usefulness outside of sorting is very limited.
Although possible, filtering by the greater-than-or-equals operator (>=
) on the _score column would not make much sense and
can lead to unpredictable result sets.
Anyway let’s do it here for demonstration purpose:
cr> select name, _score
... from locations where match(name_description_ft, 'time')
... and _score >= 0.8 order by _score;
+-----------+-----------+
| name | _score |
+-----------+-----------+
| Bartledan | 1.6416173 |
| Altair | 1.6570115 |
+-----------+-----------+
SELECT 2 rows in set (... sec)
As you might have noticed, the _score value has changed for the same query text and document because it’s a ratio relative to all results, and by filtering on _score, ‘all results’ has already changed.
Caution
As noted above _score is a relative number and not comparable across searches. Filtering is therefore greatly discouraged.