Full-Text Search¶

BM25 term search based on Apache Lucene, using SQL: CrateDB is all you need.

Overview

CrateDB can be used as a database to conduct full-text search operations building upon Apache Lucene.

CrateDB is an exceptional choice for handling complex queries and large-scale data sets. One of its standout features are its full-text search capabilities, built on top of the powerful Lucene library. This makes it a great fit for organizing, searching, and analyzing extensive datasets.

About

Full-text search leverages the BM25 search ranking algorithm, effectively implementing the storage and retrieval parts of a search engine.

For managing a full-text search index for text values, Lucene uses an inverted index data structure, and the Okapi BM25 search ranking algorithm.

The inverted index data structure is a central component of a typical search engine indexing algorithm. Together with ranking, which enables search result relevance features, both effectively provide the storage and retrieval parts of a search engine.

Reference Manual

SQL Full-Text Search Okapi BM25

Full-text search using SQL

CrateDB uses Lucene as a storage layer, so it inherits the implementation and concepts of Lucene, in the same spirit as the Apache Solr search server, and Elasticsearch. While Elasticsearch uses a query DSL based on JSON, in CrateDB, you can work with text search using SQL, using a PostgreSQL-compatible interface.

Details about the inverted index

Lucene’s indexing strategy for text fields relies on a data structure called [inverted index], which enables very efficient search over textual data, and is defined as a “data structure storing a mapping from content, such as words and numbers, to its location in the database file, document or set of documents”.

Depending on the configuration of a column, the index can be plain (default) or full-text. An index of type “plain” indexes content of one or more fields without analyzing and tokenizing their values into terms.

To create a “full-text” index, the field value is first analyzed and, based on the used analyzer, split into smaller units, such as individual words, a processing step called tokenization. A full-text index is then created for each text unit separately.

Details about ranking with BM25

In information retrieval, Okapi BM25 is a popular ranking function used by search engines to estimate the relevance of documents to a given search query. The BM25 method has become the default scoring formula in Lucene, and is also the relevance scoring formula used by CrateDB.

The article BM25: The Next Generation of Lucene Relevance compares traditional TF/IDF to BM25, including illustrative graphs. To learn more details about what’s inside, please also refer to Similarity in Elasticsearch and BM25 vs. Lucene Default Similarity.

Synopsis¶

Populate and query a Lucene full-text index using SQL.

DDL

CREATE TABLE documents (
  name STRING PRIMARY KEY,
  description TEXT,
  INDEX ft_english
    USING FULLTEXT(description) WITH (
      analyzer = 'english'
    ),
  INDEX ft_german
    USING FULLTEXT(description) WITH (
      analyzer = 'german'
    )
);

DML

INSERT INTO documents (name, description)
VALUES
  ('Quick fox', 'The quick brown fox jumps over the lazy dog.'),
  ('Franz jagt', 'Franz jagt im komplett verwahrlosten Taxi quer durch Bayern.')
;

DQL

SELECT name, _score
FROM documents
WHERE
  MATCH(
    (ft_english, ft_german),
    'jump OR verwahrlost'
  )
ORDER BY _score DESC;

Result

+------------+------------+
| name       |     _score |
+------------+------------+
| Franz jagt | 0.13076457 |
| Quick fox  | 0.13076457 |
+------------+------------+
SELECT 2 rows in set (0.034 sec)

More Examples

Tweak fuzziness to get approximate matches.

SELECT _score,
       city,
       country,
       population
FROM cities
WHERE MATCH(city_ascii, 'nw yurk') USING best_fields
WITH (fuzziness = 1)
ORDER BY
    _score DESC;

Usage¶

Full-text search in CrateDB means using the MATCH predicate, and optionally configuring analyzers.

MATCH predicate

CrateDB’s MATCH predicate performs a fulltext search on one or more indexed columns or indices and supports different matching techniques. In order to use fulltext searches on a column, a fulltext index with an analyzer must be created for this column.

Analyzers, Tokenizers, and Filters

Analyzers consist of two parts, tokenizers and filters. With CrateDB, you can define custom analyzers, or configure the standard analyzers according to your needs.

Learn¶

Learn how to set up your database for full-text search, how to create the relevant indices, and how to query your text data efficiently. A few must-reads for anyone looking to make sense of large volumes of unstructured text data.

Advanced Features

FTS Options

Learn about the stack of options relevant for full-text search, like applying Fuzzy Search, or using Synonyms.

Introduction
FTS Options
Fuzzy Matching
Synonyms

Custom Analyzers

This page illustrates how to define custom analyzers using the CREATE ANALYZER SQL command, for example to use fuzzy searching, how to use synonym files, and corresponding technical backgrounds about their implementations.

Introduction
Full-Text Search
Lucene Analyzer

Tutorials

Exploring the Netflix catalog using full-text search

The usage guide illustrates the BM25 ranking algorithm for information retrieval, by exploring how to manage a dataset of Netflix titles.

Introduction
Full-Text Search
BM25

Explanations

Indexing and Storage in CrateDB

This article explores the internal workings of the storage layer in CrateDB, with a focus on Lucene’s indexing strategies.

The CrateDB storage layer is based on Lucene indexes. Lucene offers scalable and high-performance indexing, which enables efficient search and aggregations over documents and rapid updates to the existing documents.

CrateDB uses three important Lucene data structures: Inverted indexes for text values, BKD trees for numeric values, and doc values.

Introduction Lucene Indexing

Indexing and storage in CrateDB

Indexing Text for Both Effective Search and Accurate Analysis

This article explores how Qualtrics uses CrateDB in their Text iQ product to provide text analysis services for everything from sentiment analysis to identifying key topics, and powerful search-based data exploration.

It explains integral parts of an FTS text processing pipeline, including analyzers, optionally using tokenizers or character filters, and how they can be customized to specific needs, using plugins for CrateDB.

CrateDB uses Elasticsearch technology under the hood to manage cluster creation and communication, and also exposes an Elasticsearch API that provides access to all the indexing capabilities in Elasticsearch that Qualtrics needed.

Introduction Analyzer Tokenizer Plugin

Indexing Text for Both Effective Search and Accurate Analysis