Contents Menu Expand Light mode Dark mode Skip to content

Join our webinar on Modern Data Pipelines for Smart Factories​ on Jun 26th

Register Now
  • Product
    • Database
      • Overview
      • SQL examples
      • Integrations
      • Security
    • Data models
      • Time-series
      • Document/JSON
      • Vector
      • Full-text
      • Spatial
      • Relational
  • Use cases
    • Real-time analytics
    • Hybrid search
    • AI/ML integration
    • AI chatbots
    • Internet of Things
    • Geospatial analytics
    • Log & event analysis
  • Industries
    • Energy
    • Financial Services
    • FMCG
    • Logistics
    • Manufacturing
    • Oil, gas & mining
    • Smart city solutions
    • Technology platforms
    • Telco
    • Transportation
  • Resources
    • Customer stories
    • Academy
    • Asset library
    • Blog
    • Events
  • Developer
    • Documentation
    • Drivers and tools
    • Community
    • GitHub
    • Support
  • Pricing
  • Login
  • Get Started
  • Overview
  • CrateDB Cloud
  • Guides and Tutorials
    • Installation
    • Getting Started
    • All Features
      • SQL
      • Connectivity
      • Document Store
      • Relational / JOINs
      • Search: FTS, Geo, Vector, Hybrid
        • Full-Text Search
          • Synopsis
          • Usage
          • Learn
        • Geospatial Search
        • Vector Search
        • Hybrid Search
      • BLOB Store
      • Clustering
      • Snapshots
      • Cloud Native
      • Storage Layer
      • Hybrid Index
      • Advanced Querying
      • Generated Columns
      • Server-Side Cursors
      • Foreign Data Wrapper
      • User-Defined Functions
      • Cross-Cluster Replication
    • Administration
    • Performance Guides
    • Application Domains
    • Integrations
    • Migrations
    • Reference Architectures
  • Reference Manual
  • Admin UI
  • CrateDB CLI
  • Cloud CLI
  • Drivers and Integrations
  • Support
  • Community
  • Integration Tutorials
  • Sample Applications
  • Academy

Full-Text Search¶

BM25 term search based on Apache Lucene, using SQL: CrateDB is all you need.

Overview

CrateDB can be used as a database to conduct full-text search operations building upon Apache Lucene.

CrateDB is an exceptional choice for handling complex queries and large-scale data sets. One of its standout features are its full-text search capabilities, built on top of the powerful Lucene library. This makes it a great fit for organizing, searching, and analyzing extensive datasets.

About

Full-text search leverages the BM25 search ranking algorithm, effectively implementing the storage and retrieval parts of a search engine.

For managing a full-text search index for text values, Lucene uses an inverted index data structure, and the Okapi BM25 search ranking algorithm.

The inverted index data structure is a central component of a typical search engine indexing algorithm. Together with ranking, which enables search result relevance features, both effectively provide the storage and retrieval parts of a search engine.

Reference Manual

  • Fulltext search

  • Fulltext indices

  • MATCH Predicate

  • CREATE ANALYZER

Related

  • SQL

  • Geospatial Search

  • Vector Search

  • Hybrid Search

  • Advanced Querying

SQL Full-Text Search Okapi BM25

Full-text search using SQL

CrateDB uses Lucene as a storage layer, so it inherits the implementation and concepts of Lucene, in the same spirit as the Apache Solr search server, and Elasticsearch. While Elasticsearch uses a query DSL based on JSON, in CrateDB, you can work with text search using SQL, using a PostgreSQL-compatible interface.

Details about the inverted index

Lucene’s indexing strategy for text fields relies on a data structure called [inverted index], which enables very efficient search over textual data, and is defined as a “data structure storing a mapping from content, such as words and numbers, to its location in the database file, document or set of documents”.

Depending on the configuration of a column, the index can be plain (default) or full-text. An index of type “plain” indexes content of one or more fields without analyzing and tokenizing their values into terms.

To create a “full-text” index, the field value is first analyzed and, based on the used analyzer, split into smaller units, such as individual words, a processing step called tokenization. A full-text index is then created for each text unit separately.

Details about ranking with BM25

In information retrieval, Okapi BM25 is a popular ranking function used by search engines to estimate the relevance of documents to a given search query. The BM25 method has become the default scoring formula in Lucene, and is also the relevance scoring formula used by CrateDB.

The article BM25: The Next Generation of Lucene Relevance compares traditional TF/IDF to BM25, including illustrative graphs. To learn more details about what’s inside, please also refer to Similarity in Elasticsearch and BM25 vs. Lucene Default Similarity.

Synopsis¶

Populate and query a Lucene full-text index using SQL.

DDL

CREATE TABLE documents (
  name STRING PRIMARY KEY,
  description TEXT,
  INDEX ft_english
    USING FULLTEXT(description) WITH (
      analyzer = 'english'
    ),
  INDEX ft_german
    USING FULLTEXT(description) WITH (
      analyzer = 'german'
    )
);

DML

INSERT INTO documents (name, description)
VALUES
  ('Quick fox', 'The quick brown fox jumps over the lazy dog.'),
  ('Franz jagt', 'Franz jagt im komplett verwahrlosten Taxi quer durch Bayern.')
;

DQL

SELECT name, _score
FROM documents
WHERE
  MATCH(
    (ft_english, ft_german),
    'jump OR verwahrlost'
  )
ORDER BY _score DESC;

Result

+------------+------------+
| name       |     _score |
+------------+------------+
| Franz jagt | 0.13076457 |
| Quick fox  | 0.13076457 |
+------------+------------+
SELECT 2 rows in set (0.034 sec)

More Examples

Tweak fuzziness to get approximate matches.

SELECT _score,
       city,
       country,
       population
FROM cities
WHERE MATCH(city_ascii, 'nw yurk') USING best_fields
WITH (fuzziness = 1)
ORDER BY
    _score DESC;

Usage¶

Full-text search in CrateDB means using the MATCH predicate, and optionally configuring analyzers.

MATCH predicate

CrateDB’s MATCH predicate performs a fulltext search on one or more indexed columns or indices and supports different matching techniques. In order to use fulltext searches on a column, a fulltext index with an analyzer must be created for this column.

Analyzers, Tokenizers, and Filters

Analyzers consist of two parts, tokenizers and filters. With CrateDB, you can define custom analyzers, or configure the standard analyzers according to your needs.

Learn¶

Learn how to set up your database for full-text search, how to create the relevant indices, and how to query your text data efficiently. A few must-reads for anyone looking to make sense of large volumes of unstructured text data.

Advanced Features

FTS Options

Learn about the stack of options relevant for full-text search, like applying Fuzzy Search, or using Synonyms.

Full-Text Search Options

Introduction
FTS Options
Fuzzy Matching
Synonyms

Custom Analyzers

This tutorial illustrates how to define custom analyzers using the CREATE ANALYZER SQL command, for example to use fuzzy searching, how to use synonym files, and corresponding technical backgrounds about their implementations.

Analyzers, Tokenizers, and Filters

Introduction
Full-Text Search
Lucene Analyzer

Tutorials

Exploring the Netflix catalog using full-text search

The tutorial illustrates the BM25 ranking algorithm for information retrieval, by exploring how to manage a dataset of Netflix titles.

Netflix Tutorial

Introduction
Full-Text Search
BM25

Articles

Indexing and Storage in CrateDB

This article explores the internal workings of the storage layer in CrateDB, with a focus on Lucene’s indexing strategies.

Indexing and Storage in CrateDB

The CrateDB storage layer is based on Lucene indexes. Lucene offers scalable and high-performance indexing which enables efficient search and aggregations over documents and rapid updates to the existing documents. We will look at the three main Lucene structures that are used within CrateDB: Inverted Indexes for text values, BKD-Trees for numeric values, and Doc Values.

Inverted Index:

You will learn how inverted indexes are implemented in Lucene and CrateDB.

BKD Tree:

Better understand the BKD tree, starting from KD trees, and how this data structure supports range queries in CrateDB.

Doc Values:

This data structure supports more efficient querying document fields by id, performs column-oriented retrieval of data, and improves the performance of aggregation and sorting operations.

Introduction
Lucene Indexing

Indexing Text for Both Effective Search and Accurate Analysis

This article explores how Qualtrics uses CrateDB in Text iQ to provide text analysis services for everything from sentiment analysis to identifying key topics, and powerful search-based data exploration.

Indexing Text for Both Effective Search and Accurate Analysis

CrateDB uses Elasticsearch technology under the hood to manage cluster creation and communication, and also exposes an Elasticsearch API that provides access to all the indexing capabilities in Elasticsearch that Qualtrics needed.

The articles explains integral parts of an FTS text processing pipeline, including analyzers, optionally using tokenizers or character filters, and how they can be customized to specific needs, using plugins for CrateDB.

Introduction
Analyzer Tokenizer
Plugin

Next
Full-text Search Options
Previous
Search
  Feedback

  Suggest improvement

  Edit page

  View page source

On this page
  • Full-Text Search
    • Synopsis
    • Usage
    • Learn

Subscribe to the CrateDB Newsletter now

  • Imprint
  • Contact
  • Legal
Follow us
Follow us on X Follow us on LinkedIn Follow us on Facebook Follow us on Instagram Follow us on Facebook