Full-Text Search¶
BM25 term search based on Apache Lucene, using SQL: CrateDB is all you need.
Overview
CrateDB can be used as a database to conduct full-text search operations building upon Apache Lucene.
CrateDB is an exceptional choice for handling complex queries and large-scale data sets. One of its standout features are its full-text search capabilities, built on top of the powerful Lucene library. This makes it a great fit for organizing, searching, and analyzing extensive datasets.
About
Full-text search leverages the BM25 search ranking algorithm, effectively implementing the storage and retrieval parts of a search engine.
For managing a full-text search index for text values, Lucene uses an inverted index data structure, and the Okapi BM25 search ranking algorithm.
The inverted index data structure is a central component of a typical search engine indexing algorithm. Together with ranking, which enables search result relevance features, both effectively provide the storage and retrieval parts of a search engine.
Reference Manual
Related
SQL Full-Text Search Okapi BM25
Full-text search using SQL
CrateDB uses Lucene as a storage layer, so it inherits the implementation and concepts of Lucene, in the same spirit as the Apache Solr search server, and Elasticsearch. While Elasticsearch uses a query DSL based on JSON, in CrateDB, you can work with text search using SQL, using a PostgreSQL-compatible interface.
Details about the inverted index
Lucene’s indexing strategy for text fields relies on a data structure called [inverted index], which enables very efficient search over textual data, and is defined as a “data structure storing a mapping from content, such as words and numbers, to its location in the database file, document or set of documents”.
Depending on the configuration of a column, the index can be plain (default) or full-text. An index of type “plain” indexes content of one or more fields without analyzing and tokenizing their values into terms.
To create a “full-text” index, the field value is first analyzed and, based on the used analyzer, split into smaller units, such as individual words, a processing step called tokenization. A full-text index is then created for each text unit separately.
Details about ranking with BM25
In information retrieval, Okapi BM25 is a popular ranking function used by search engines to estimate the relevance of documents to a given search query. The BM25 method has become the default scoring formula in Lucene, and is also the relevance scoring formula used by CrateDB.
The article BM25: The Next Generation of Lucene Relevance compares traditional TF/IDF to BM25, including illustrative graphs. To learn more details about what’s inside, please also refer to Similarity in Elasticsearch and BM25 vs. Lucene Default Similarity.
Synopsis¶
Populate and query a Lucene full-text index using SQL.
DDL
CREATE TABLE documents (
name STRING PRIMARY KEY,
description TEXT,
INDEX ft_english
USING FULLTEXT(description) WITH (
analyzer = 'english'
),
INDEX ft_german
USING FULLTEXT(description) WITH (
analyzer = 'german'
)
);
DML
INSERT INTO documents (name, description)
VALUES
('Quick fox', 'The quick brown fox jumps over the lazy dog.'),
('Franz jagt', 'Franz jagt im komplett verwahrlosten Taxi quer durch Bayern.')
;
DQL
SELECT name, _score
FROM documents
WHERE
MATCH(
(ft_english, ft_german),
'jump OR verwahrlost'
)
ORDER BY _score DESC;
Result
+------------+------------+
| name | _score |
+------------+------------+
| Franz jagt | 0.13076457 |
| Quick fox | 0.13076457 |
+------------+------------+
SELECT 2 rows in set (0.034 sec)
More Examples
Tweak fuzziness to get approximate matches.
SELECT _score,
city,
country,
population
FROM cities
WHERE MATCH(city_ascii, 'nw yurk') USING best_fields
WITH (fuzziness = 1)
ORDER BY
_score DESC;
Usage¶
Full-text search in CrateDB means using the MATCH predicate, and optionally configuring analyzers.
MATCH predicate
CrateDB’s MATCH predicate performs a fulltext search on one or more indexed columns or indices and supports different matching techniques. In order to use fulltext searches on a column, a fulltext index with an analyzer must be created for this column.
Analyzers, Tokenizers, and Filters
Analyzers consist of two parts, tokenizers and filters. With CrateDB, you can define custom analyzers, or configure the standard analyzers according to your needs.
Learn¶
Learn how to set up your database for full-text search, how to create the relevant indices, and how to query your text data efficiently. A few must-reads for anyone looking to make sense of large volumes of unstructured text data.
Advanced Features
FTS Options
Learn about the stack of options relevant for full-text search, like applying Fuzzy Search, or using Synonyms.
Introduction
FTS Options
Fuzzy Matching
Synonyms
Tutorials
Articles
Indexing and Storage in CrateDB
This article explores the internal workings of the storage layer in CrateDB, with a focus on Lucene’s indexing strategies.
The CrateDB storage layer is based on Lucene indexes. Lucene offers scalable and high-performance indexing which enables efficient search and aggregations over documents and rapid updates to the existing documents. We will look at the three main Lucene structures that are used within CrateDB: Inverted Indexes for text values, BKD-Trees for numeric values, and Doc Values.
- Inverted Index:
You will learn how inverted indexes are implemented in Lucene and CrateDB.
- BKD Tree:
Better understand the BKD tree, starting from KD trees, and how this data structure supports range queries in CrateDB.
- Doc Values:
This data structure supports more efficient querying document fields by id, performs column-oriented retrieval of data, and improves the performance of aggregation and sorting operations.
Introduction
Lucene Indexing
Indexing Text for Both Effective Search and Accurate Analysis
This article explores how Qualtrics uses CrateDB in Text iQ to provide text analysis services for everything from sentiment analysis to identifying key topics, and powerful search-based data exploration.
CrateDB uses Elasticsearch technology under the hood to manage cluster creation and communication, and also exposes an Elasticsearch API that provides access to all the indexing capabilities in Elasticsearch that Qualtrics needed.
The articles explains integral parts of an FTS text processing pipeline, including analyzers, optionally using tokenizers or character filters, and how they can be customized to specific needs, using plugins for CrateDB.
Introduction
Analyzer Tokenizer
Plugin