Contents Menu Expand Light mode Dark mode Skip to content
  • Product
    • Editions
      • CrateDB Cloud
      • CrateDB Enterprise
      • CrateDB OSS
    • Features
      • Overview
      • High cardinality
      • SQL syntax
      • Integrations
      • Security
    • Data models
      • Time-series
      • Document/JSON
      • Vector
      • Full-text
      • Spatial
      • Relational
  • Solutions
    • By use cases | Real-time
      • Industrial Analytics
      • AI operations
      • Application analytics
    • By industry
      • Manufacturing
      • Energy
      • FMCG
      • Logistics
      • Oil, Gas & Mining
      • Transportation
      • SaaS
      • Media & Entertainment
  • Resources
    • Customer stories
    • Academy
    • Asset library
    • Blog
    • Guides
    • Events
  • Developer
    • Documentation
    • Drivers and tools
    • Community
    • GitHub
    • Support
  • Pricing
  • Login
  • Get Started
    • Overview
      • Solutions and use cases
        • Time series data
          • Fundamentals
            • Generate time series data
              • Generate time series data from the command line
              • Generate time series data using Python
              • Generate time series data using Node.js
              • Generate time series data using Go
            • Normalize time series data intervals
            • Analyzing weather data
            • Analyzing device readings with metadata integration
          • Advanced analysis
          • Video tutorials
        • Industrial big data
          • Azure IoT
          • Machine Learning
          • ABB insights
          • Rauch insights
          • SPGo! insights
          • TGW insights
        • Long-term store
          • Automatic retention and expiration
        • Real-time raw-data analytics
          • Bitmovin insights
        • Machine learning
    • Getting Started
      • Video learning
      • Data modelling
        • Relational data
        • JSON data
        • Time series data
        • Geospatial data
        • Full-text data
        • Vector data
        • Primary key strategies
      • Query capabilities
        • Aggregations
        • Ad-hoc queries
        • Search
        • AI integration
        • Performance
      • Import data
      • Sample applications

    Build

    • Load data into CrateDB
      • Load and Export (ETL)
      • Change Data Capture (CDC)
      • Metrics, telemetry, and logs
    • Connect / Drivers
      • General information
      • Applications
      • Software Testing
      • C#
      • Elixir
      • Erlang
        • Erlang ODBC
        • Erlang epgsql
      • F#
      • Go
        • pgx
        • pq
        • KSQL
      • Groovy
      • Java
        • PostgreSQL JDBC
        • CrateDB JDBC
        • Hibernate / JPA
        • jOOQ
        • Software testing
      • JavaScript
        • node-postgres
        • node-crate
      • Julia
      • Kotlin
      • Perl
      • PHP
        • AMPHP PostgreSQL
        • PostgreSQL PDO
        • CrateDB PDO
        • CrateDB DBAL
      • Python
        • crate-python
        • sqlalchemy-cratedb
        • Conecta
        • cratedb-async
        • micropython-cratedb
        • psycopg2
        • psycopg3
        • aiopg
        • asyncpg
        • ConnectorX
        • Records
        • turbodbc
      • R
      • Ruby
      • Rust
      • Scala
      • ODBC
        • C#
        • Erlang
        • Python
        • Visual Basic
      • Visual Basic
      • Zig
      • Natural language
    • Integrations
      • Categories
        • Business Intelligence
        • Data Lineage
        • Data Visualization
        • Programming Frameworks
        • Migrations
          • Rockset
            • Migrate Queries
      • Airflow / Astronomer
        • Getting started
        • Import Parquet files
        • Import stock market data
        • Export to S3
        • Data retention policy
        • Hot/cold data retention
      • AMQP
        • Usage
      • Arrow
        • Import Parquet files
      • Atlan
      • AWS Lambda
      • Azure Functions
        • Tutorial
      • Balena
        • Usage
      • Cluvio
        • Usage
      • collectd
        • Usage with collectd
        • Usage with Telegraf
      • Conecta
      • Coreflux
        • Usage
      • Dapr
        • Usage
      • Dask
        • Usage
      • Databricks
        • Azure Databricks
      • DataGrip
      • Datashader
      • DBeaver
      • dbt
        • Usage
      • Debezium
        • Tutorial
      • Django
        • Settings
        • Models
        • Fields
        • Scalar functions
      • dlt
        • Usage
      • DMS (AWS Database Migration Service)
      • DynamoDB
      • Estuary
      • Explo
      • Flink
      • Gradio
      • Grafana
        • Tutorial
      • HiveMQ
        • Usage
      • Hop
      • Iceberg
      • InfluxDB
        • Usage
        • Cloud to Cloud
        • Data Model
      • ingestr
      • JMeter
      • Kafka
        • Using Kafka with Python
        • Using Confluent Kafka Connect
      • Kestra
        • Usage
      • Kinesis
      • LangChain
        • Usage
      • LlamaIndex
        • Text-to-SQL synopsis
        • Text-to-SQL usage
      • Locust
        • Tutorial
      • Marquez
        • Usage
      • Model Context Protocol (MCP)
        • CrateDB MCP Server
        • Community servers
      • Meltano
      • Metabase
        • Usage
      • MindsDB
      • MLflow
      • MongoDB
        • Usage
        • Cloud to Cloud
        • MongoDB’s data model
      • Mosquitto
        • Usage
      • MQTT
      • MySQL and MariaDB
        • Usage
        • Use CSV
      • n8n
      • NiFi
        • Usage
      • Node-RED
        • Tutorial
      • OpenTelemetry
        • Collector Usage
        • Telegraf Usage
      • Oracle
        • Usage
      • pandas
        • Starter tutorial
        • Jupyter tutorial
        • Efficient ingest
      • Plotly and Dash
      • Polars
      • PostgreSQL
        • Usage
      • Power BI
        • Power BI Desktop
        • Power BI Service
      • Prefect
        • Usage
      • Prometheus
        • Usage
      • PyCaret
      • PyViz
      • QueryZen
      • R
        • Tutorial
      • Rill
        • Usage
      • RisingWave
        • Stream processing from Iceberg tables to CrateDB using RisingWave
      • rsyslog
        • Usage
      • scikit-learn
      • Spark
        • Usage
      • SQL Server
      • StatsD
        • Usage
      • Streamlit
      • StreamSets
        • Usage
      • Superset / Preset
        • Usage
        • Sandbox
      • Tableau
      • Telegraf
        • Usage
      • TensorFlow
        • Tutorial
      • Terraform
        • Usage
      • Trino
        • Usage
    • All Features
      • Highlights
      • SQL
      • Document Store
        • Tutorial
      • Relational / JOINs
      • Search: FTS, Geo, Vector, Hybrid
        • Full-Text Search
          • Full-text Search Options
          • Analyzers, Tokenizers, and Filters
          • Tutorial
          • Indexing Text for Both Effective Search and Accurate Analysis
        • Geospatial Search
        • Vector Search
        • Hybrid Search
      • BLOB Store
      • Clustering
      • Snapshots
      • Cloud Native
      • Storage Layer
        • Indexing and storage in CrateDB
      • Hybrid Index
      • Advanced Querying
        • Recurrent queries
      • Generated Columns
      • Server-Side Cursors
      • Foreign data wrappers
      • User-Defined Functions
      • Cross-Cluster Replication
        • Usage

    Operations

    • Installation
      • Debian, Ubuntu
      • Red Hat, SUSE
      • Windows
      • Tarball
      • Container setup
        • Docker
        • Kubernetes
          • CrateDB and Kubernetes
          • Run CrateDB with Kubernetes Operator
      • Cloud hosting
        • Amazon AWS
          • CrateDB on Amazon EC2
          • Deploy using Terraform
          • Using Amazon S3 as a snapshot repository
        • Microsoft Azure
          • CrateDB on Azure VMs
          • Deploy using Terraform
      • Configuration settings
      • Multi-node setup
      • Multi-zone setup
    • Administration
      • Bootstrap checks
      • User management
      • Going into production
      • Monitoring and diagnostics
        • Prometheus and Grafana
        • Prometheus JMX Exporter
        • Prometheus SQL Exporter
      • Memory configuration
      • Circuit breaker
      • Troubleshooting
        • System Tables
        • CrateDB Flight Recorder (CFR)
        • Java Flight Recorder (JFR)
        • The jcmd Utility
          • Using jcmd with CrateDB on Docker
          • Java Flight Recorder (JFR)
        • The crate-node command
      • Scaling
        • Expand
        • On-Demand
        • Autoscale
        • On Kubernetes
      • Upgrading
        • Guidelines
        • Rolling Upgrade
        • Full Restart Upgrade
    • Performance guides
      • Sharding and partitioning 101
      • Sharding recommendations
      • Scaling
      • Storage
      • Fast Inserts
        • Insert Methods
        • Bulk Inserts
        • Parallel Inserts
        • Configuration Tuning for Inserts
        • Testing Insert Performance
      • Fast Selects
      • Query Optimization 101

    References

  • CrateDB Cloud
    • CrateDB
      • Tools

      • Admin UI
        • CrateDB CLI
          • Cloud CLI
            • CrateDB MCP
            • CrateDB Toolkit
            • Support
            • Community

            Full-Text Search¶

            BM25 term search based on Apache Lucene, using SQL: CrateDB is all you need.

            Overview

            CrateDB can be used as a database to conduct full-text search operations building upon Apache Lucene.

            CrateDB is an exceptional choice for handling complex queries and large-scale data sets. One of its standout features are its full-text search capabilities, built on top of the powerful Lucene library. This makes it a great fit for organizing, searching, and analyzing extensive datasets.

            About

            Full-text search leverages the BM25 search ranking algorithm, effectively implementing the storage and retrieval parts of a search engine.

            For managing a full-text search index for text values, Lucene uses an inverted index data structure, and the Okapi BM25 search ranking algorithm.

            The inverted index data structure is a central component of a typical search engine indexing algorithm. Together with ranking, which enables search result relevance features, both effectively provide the storage and retrieval parts of a search engine.

            Reference Manual

            • Fulltext search

            • Fulltext indices

            • MATCH Predicate

            • CREATE ANALYZER

            Related

            • SQL

            • Geospatial Search

            • Vector Search

            • Hybrid Search

            • Advanced Querying

            • Storage Layer

            SQL Full-Text Search Okapi BM25

            Full-text search using SQL

            CrateDB uses Lucene as a storage layer, so it inherits the implementation and concepts of Lucene, in the same spirit as the Apache Solr search server, and Elasticsearch. While Elasticsearch uses a query DSL based on JSON, in CrateDB, you can work with text search using SQL, using a PostgreSQL-compatible interface.

            Details about the inverted index

            Lucene’s indexing strategy for text fields relies on a data structure called [inverted index], which enables very efficient search over textual data, and is defined as a “data structure storing a mapping from content, such as words and numbers, to its location in the database file, document or set of documents”.

            Depending on the configuration of a column, the index can be plain (default) or full-text. An index of type “plain” indexes content of one or more fields without analyzing and tokenizing their values into terms.

            To create a “full-text” index, the field value is first analyzed and, based on the used analyzer, split into smaller units, such as individual words, a processing step called tokenization. A full-text index is then created for each text unit separately.

            Details about ranking with BM25

            In information retrieval, Okapi BM25 is a popular ranking function used by search engines to estimate the relevance of documents to a given search query. The BM25 method has become the default scoring formula in Lucene, and is also the relevance scoring formula used by CrateDB.

            The article BM25: The Next Generation of Lucene Relevance compares traditional TF/IDF to BM25, including illustrative graphs. To learn more details about what’s inside, please also refer to Similarity in Elasticsearch and BM25 vs. Lucene Default Similarity.

            Synopsis¶

            Populate and query a Lucene full-text index using SQL.

            DDL

            CREATE TABLE documents (
              name STRING PRIMARY KEY,
              description TEXT,
              INDEX ft_english
                USING FULLTEXT(description) WITH (
                  analyzer = 'english'
                ),
              INDEX ft_german
                USING FULLTEXT(description) WITH (
                  analyzer = 'german'
                )
            );
            

            DML

            INSERT INTO documents (name, description)
            VALUES
              ('Quick fox', 'The quick brown fox jumps over the lazy dog.'),
              ('Franz jagt', 'Franz jagt im komplett verwahrlosten Taxi quer durch Bayern.')
            ;
            

            DQL

            SELECT name, _score
            FROM documents
            WHERE
              MATCH(
                (ft_english, ft_german),
                'jump OR verwahrlost'
              )
            ORDER BY _score DESC;
            

            Result

            +------------+------------+
            | name       |     _score |
            +------------+------------+
            | Franz jagt | 0.13076457 |
            | Quick fox  | 0.13076457 |
            +------------+------------+
            SELECT 2 rows in set (0.034 sec)
            

            More Examples

            Tweak fuzziness to get approximate matches.

            SELECT _score,
                   city,
                   country,
                   population
            FROM cities
            WHERE MATCH(city_ascii, 'nw yurk') USING best_fields
            WITH (fuzziness = 1)
            ORDER BY
                _score DESC;
            

            Usage¶

            Full-text search in CrateDB means using the MATCH predicate, and optionally configuring analyzers.

            MATCH predicate

            CrateDB’s MATCH predicate performs a fulltext search on one or more indexed columns or indices and supports different matching techniques. In order to use fulltext searches on a column, a fulltext index with an analyzer must be created for this column.

            Analyzers, Tokenizers, and Filters

            Analyzers consist of two parts, tokenizers and filters. With CrateDB, you can define custom analyzers, or configure the standard analyzers according to your needs.

            Learn¶

            Learn how to set up your database for full-text search, how to create the relevant indices, and how to query your text data efficiently. A few must-reads for anyone looking to make sense of large volumes of unstructured text data.

            Advanced Features

            FTS Options

            Learn about the stack of options relevant for full-text search, like applying Fuzzy Search, or using Synonyms.

            Full-Text Search Options

            Introduction
            FTS Options
            Fuzzy Matching
            Synonyms

            Custom Analyzers

            This page illustrates how to define custom analyzers using the CREATE ANALYZER SQL command, for example to use fuzzy searching, how to use synonym files, and corresponding technical backgrounds about their implementations.

            Analyzers, Tokenizers, and Filters

            Introduction
            Full-Text Search
            Lucene Analyzer

            Tutorials

            Exploring the Netflix catalog using full-text search

            The usage guide illustrates the BM25 ranking algorithm for information retrieval, by exploring how to manage a dataset of Netflix titles.

            Netflix Tutorial

            Introduction
            Full-Text Search
            BM25

            Explanations

            Indexing and Storage in CrateDB

            This article explores the internal workings of the storage layer in CrateDB, with a focus on Lucene’s indexing strategies.

            The CrateDB storage layer is based on Lucene indexes. Lucene offers scalable and high-performance indexing, which enables efficient search and aggregations over documents and rapid updates to the existing documents.

            CrateDB uses three important Lucene data structures: Inverted indexes for text values, BKD trees for numeric values, and doc values.

            Introduction Lucene Indexing

            Indexing and storage in CrateDB
            Indexing Text for Both Effective Search and Accurate Analysis

            This article explores how Qualtrics uses CrateDB in their Text iQ product to provide text analysis services for everything from sentiment analysis to identifying key topics, and powerful search-based data exploration.

            It explains integral parts of an FTS text processing pipeline, including analyzers, optionally using tokenizers or character filters, and how they can be customized to specific needs, using plugins for CrateDB.

            CrateDB uses Elasticsearch technology under the hood to manage cluster creation and communication, and also exposes an Elasticsearch API that provides access to all the indexing capabilities in Elasticsearch that Qualtrics needed.

            Introduction Analyzer Tokenizer Plugin

            Indexing Text for Both Effective Search and Accurate Analysis
            Next
            Full-text Search Options
            Previous
            Search
              Feedback

              Suggest improvement

              Edit page

              View page source

            On this page
            • Full-Text Search
              • Synopsis
              • Usage
              • Learn
            • Imprint
            • Contact
            • Legal
            Follow us
            Follow us on X Follow us on LinkedIn Follow us on Facebook Follow us on Instagram Follow us on Facebook