Contents Menu Expand Light mode Dark mode Skip to content
  • Product
    • Editions
      • CrateDB Cloud
      • CrateDB Enterprise
      • CrateDB OSS
    • Features
      • Overview
      • High cardinality
      • SQL syntax
      • Integrations
      • Security
    • Data models
      • Time-series
      • Document/JSON
      • Vector
      • Full-text
      • Spatial
      • Relational
  • Solutions
    • By use cases | Real-time
      • Industrial Analytics
      • AI operations
      • Application analytics
    • By industry
      • Manufacturing
      • Energy
      • FMCG
      • Logistics
      • Oil, Gas & Mining
      • Transportation
      • SaaS
      • Media & Entertainment
  • Resources
    • Customer stories
    • Academy
    • Asset library
    • Blog
    • Guides
    • Events
  • Developer
    • Documentation
    • Drivers and tools
    • Community
    • GitHub
    • Support
  • Pricing
  • Login
  • Get Started
    • Overview
      • Solutions and use cases
        • Time series data
          • Fundamentals
            • Generate time series data
              • Generate time series data from the command line
              • Generate time series data using Python
              • Generate time series data using Node.js
              • Generate time series data using Go
            • Normalize time series data intervals
            • Analyzing weather data
            • Analyzing device readings with metadata integration
          • Advanced analysis
          • Video tutorials
        • Industrial big data
          • Azure IoT
          • Machine Learning
          • ABB insights
          • Rauch insights
          • SPGo! insights
          • TGW insights
        • Long-term store
          • Automatic retention and expiration
        • Real-time raw-data analytics
          • Bitmovin insights
        • Machine learning
    • Getting Started
      • Video learning
      • Data modelling
        • Relational data
        • JSON data
        • Time series data
        • Geospatial data
        • Full-text data
        • Vector data
        • Primary key strategies
      • Query capabilities
        • Aggregations
        • Ad-hoc queries
        • Search
        • AI integration
        • Performance
      • Import data
      • Sample applications

    Build

    • Load data into CrateDB
      • Load and Export (ETL)
      • Change Data Capture (CDC)
      • Metrics, telemetry, and logs
    • Connect / Drivers
      • General information
      • Applications
      • Software Testing
      • C#
      • Elixir
      • Erlang
        • Erlang ODBC
        • Erlang epgsql
      • F#
      • Go
        • pgx
        • pq
        • KSQL
      • Groovy
      • Java
        • PostgreSQL JDBC
        • CrateDB JDBC
        • Hibernate / JPA
        • jOOQ
        • Software testing
      • JavaScript
        • node-postgres
        • node-crate
      • Julia
      • Kotlin
      • Perl
      • PHP
        • AMPHP PostgreSQL
        • PostgreSQL PDO
        • CrateDB PDO
        • CrateDB DBAL
      • Python
        • crate-python
        • sqlalchemy-cratedb
        • Conecta
        • cratedb-async
        • micropython-cratedb
        • psycopg2
        • psycopg3
        • aiopg
        • asyncpg
        • ConnectorX
        • Records
        • turbodbc
      • R
      • Ruby
      • Rust
      • Scala
      • ODBC
        • C#
        • Erlang
        • Python
        • Visual Basic
      • Visual Basic
      • Zig
      • Natural language
    • Integrations
      • Categories
        • Business Intelligence
        • Data Lineage
        • Data Visualization
        • Programming Frameworks
        • Migrations
          • Rockset
            • Migrate Queries
      • Airflow / Astronomer
        • Getting started
        • Import Parquet files
        • Import stock market data
        • Export to S3
        • Data retention policy
        • Hot/cold data retention
      • AMQP
        • Usage
      • Arrow
        • Import Parquet files
      • Atlan
      • AWS Lambda
      • Azure Functions
        • Tutorial
      • Balena
        • Usage
      • Cluvio
        • Usage
      • collectd
        • Usage with collectd
        • Usage with Telegraf
      • Conecta
      • Coreflux
        • Usage
      • Dapr
        • Usage
      • Dask
        • Usage
      • Databricks
        • Azure Databricks
      • DataGrip
      • Datashader
      • DBeaver
      • dbt
        • Usage
      • Debezium
        • Tutorial
      • Django
        • Settings
        • Models
        • Fields
        • Scalar functions
      • dlt
        • Usage
      • DMS (AWS Database Migration Service)
      • DynamoDB
      • Estuary
      • Explo
      • Flink
      • Gradio
      • Grafana
        • Tutorial
      • HiveMQ
        • Usage
      • Hop
      • Iceberg
      • InfluxDB
        • Usage
        • Cloud to Cloud
        • Data Model
      • ingestr
      • JMeter
      • Kafka
        • Using Kafka with Python
        • Using Confluent Kafka Connect
      • Kestra
        • Usage
      • Kinesis
      • LangChain
        • Usage
      • LlamaIndex
        • Text-to-SQL synopsis
        • Text-to-SQL usage
      • Locust
        • Tutorial
      • Marquez
        • Usage
      • Model Context Protocol (MCP)
        • CrateDB MCP Server
        • Community servers
      • Meltano
      • Metabase
        • Usage
      • MindsDB
      • MLflow
      • MongoDB
        • Usage
        • Cloud to Cloud
        • MongoDB’s data model
      • Mosquitto
        • Usage
      • MQTT
      • MySQL and MariaDB
        • Usage
        • Use CSV
      • n8n
      • NiFi
        • Usage
      • Node-RED
        • Tutorial
      • OpenTelemetry
        • Collector Usage
        • Telegraf Usage
      • Oracle
        • Usage
      • pandas
        • Starter tutorial
        • Jupyter tutorial
        • Efficient ingest
      • Plotly and Dash
      • Polars
      • PostgreSQL
        • Usage
      • Power BI
        • Power BI Desktop
        • Power BI Service
      • Prefect
        • Usage
      • Prometheus
        • Usage
      • PyCaret
      • PyViz
      • QueryZen
      • R
        • Tutorial
      • Rill
        • Usage
      • RisingWave
        • Stream processing from Iceberg tables to CrateDB using RisingWave
      • rsyslog
        • Usage
      • scikit-learn
      • Spark
        • Usage
      • SQL Server
      • StatsD
        • Usage
      • Streamlit
      • StreamSets
        • Usage
      • Superset / Preset
        • Usage
        • Sandbox
      • Tableau
      • Telegraf
        • Usage
      • TensorFlow
        • Tutorial
      • Terraform
        • Usage
      • Trino
        • Usage
    • All Features
      • Highlights
      • SQL
      • Document Store
        • Tutorial
      • Relational / JOINs
      • Search: FTS, Geo, Vector, Hybrid
        • Full-Text Search
          • Full-text Search Options
          • Analyzers, Tokenizers, and Filters
          • Tutorial
          • Indexing Text for Both Effective Search and Accurate Analysis
        • Geospatial Search
        • Vector Search
        • Hybrid Search
      • BLOB Store
      • Clustering
      • Snapshots
      • Cloud Native
      • Storage Layer
        • Indexing and storage in CrateDB
      • Hybrid Index
      • Advanced Querying
        • Recurrent queries
      • Generated Columns
      • Server-Side Cursors
      • Foreign data wrappers
      • User-Defined Functions
      • Cross-Cluster Replication
        • Usage

    Operations

    • Installation
      • Debian, Ubuntu
      • Red Hat, SUSE
      • Windows
      • Tarball
      • Container setup
        • Docker
        • Kubernetes
          • CrateDB and Kubernetes
          • Run CrateDB with Kubernetes Operator
      • Cloud hosting
        • Amazon AWS
          • CrateDB on Amazon EC2
          • Deploy using Terraform
          • Using Amazon S3 as a snapshot repository
        • Microsoft Azure
          • CrateDB on Azure VMs
          • Deploy using Terraform
      • Configuration settings
      • Multi-node setup
      • Multi-zone setup
    • Administration
      • Bootstrap checks
      • User management
      • Going into production
      • Monitoring and diagnostics
        • Prometheus and Grafana
        • Prometheus JMX Exporter
        • Prometheus SQL Exporter
      • Memory configuration
      • Circuit breaker
      • Troubleshooting
        • System Tables
        • CrateDB Flight Recorder (CFR)
        • Java Flight Recorder (JFR)
        • The jcmd Utility
          • Using jcmd with CrateDB on Docker
          • Java Flight Recorder (JFR)
        • The crate-node command
      • Scaling
        • Expand
        • On-Demand
        • Autoscale
        • On Kubernetes
      • Upgrading
        • Guidelines
        • Rolling Upgrade
        • Full Restart Upgrade
    • Performance guides
      • Sharding and partitioning 101
      • Sharding recommendations
      • Scaling
      • Storage
      • Fast Inserts
        • Insert Methods
        • Bulk Inserts
        • Parallel Inserts
        • Configuration Tuning for Inserts
        • Testing Insert Performance
      • Fast Selects
      • Query Optimization 101

    References

  • CrateDB Cloud
    • CrateDB
      • Tools

      • Admin UI
        • CrateDB CLI
          • Cloud CLI
            • CrateDB MCP
            • CrateDB Toolkit
            • Support
            • Community

            Hybrid Search¶

            Combined BM25 term search and vector search based on Apache Lucene, using SQL: CrateDB is all you need.

            Overview

            The capabilities of Vector Search are impressive, but it isn’t a perfect technology. Without domain-specific datasets to fine-tune models on, a traditional term-based Full-Text Search still has a few advantages.

            About

            Vector search unlocks semantic search. Based on existing models, it provides incredible and intelligent data retrieval, but struggles when it comes to adapting to new domains.

            Combining both approaches, in order to leverage the best from both worlds, is called hybrid search, aiming to use the performance potential of vector search, and the zero-shot adaptability of traditional search.

            Details

            Hybrid search as a technique enhances relevancy and accuracy by combining the results of two or more search algorithms, achieving better accuracy and relevancy than each algorithm would individually.

            CrateDB supports three search functions:

            • kNN search, using KNN_MATCH

            • Okapi BM25 similarity scoring, using MATCH

            • Geospatial search, using MATCH

            Reference Manual

            • Fulltext search

            • Fulltext indices

            • MATCH

            • KNN_MATCH

            • FLOAT_VECTOR

            Related

            • SQL

            • Full-Text Search

            • Vector Search

            • Advanced Querying

            Full-Text Search Semantic Search Vector Search

            SQL BM25 kNN HNSW

            Synopsis¶

            A quick impression how a single-query hybrid search looks like.

            WITH 
                vector_search as (vector_query),
                bm25_search as (bm25_query)
            
            SELECT
                (CONVEX or RRF) as hybrid_score
            
            FROM
                search_method_1, search_method_2
            
            WHERE
                search_method_1.id = search_method_2.id;
            

            The SQL expression uses common table expressions for a better structure, and an inner-join to join results from both search methods into a single unified result, based on the application requirements at hand.

            The blog article referenced below features a full example that explores hybrid scoring using convex combination, and in the final version assigns a rank to every row, using a window function and reciprocal rank fusion (RRF).

            Example Results

            Example result for a search with convex combination scoring, yielding individual scores for bm25 vs. vector, and a synthesized hybrid score. The search expression was MATCH("content", 'knn search').

            +----------------------+-------------+--------------+----------------------------------------------------------------+
            | hybrid_score         | bm25_score  | vector_score | title                                                          |
            |----------------------|-------------|--------------|----------------------------------------------------------------|
            | 0.7440367221832276   |  1          | 0.57339454   | knn_match(float_vector, float_vector, int)                     |
            | 0.4868442595005036   |  0.5512639  | 0.4438978    | Searching On Multiple Columns                                  |
            | 0.4716400563716888   |  0.56939983 | 0.40646687   | array_position(anycompatiblearray, anycompatible [, integer ] )|
            | 0.4702456831932068   |  0.55290174 | 0.41514164   | Text search functions and operators                            |
            | 0.4677474081516266   |  0.5523509  | 0.4113451    | Synopsis                                                       |
            +----------------------+-------------+--------------+----------------------------------------------------------------+
            

            Example result for a search with reciprocal rank fusion, assigning a rank to every result row, also yielding ranks for individual components bm25 vs. vector, and a synthesized final rank. The search expression was MATCH("content", 'knn search').

            +------------+-----------+-------------+----------------------------------------------+
            | final_rank | bm25_rank | vector_rank | title                                        |
            |------------|-----------|-------------|----------------------------------------------|
            | 0.032786   |  1        |     1       |   knn_match(float_vector, float_vector, int) |
            | 0.031054   |  7        |     2       |   Searching On Multiple Columns              |
            | 0.030578   |  8        |     3       |   Usage                                      |
            | 0.028717   |  5        |     15      |   Text search functions and operators        |
            | 0.02837    |  10       |     11      |   Synopsis                                   |
            +------------+-----------+-------------+----------------------------------------------+
            

            Usage¶

            Working with hybrid search in CrateDB.

            Pure SQL

            Querying both CrateDB’s inverted index with BM25 scoring for FTS, and navigating the vector space of machine learning embeddings, are available through SQL and can be used by any application speaking it.

            Learn¶

            Learn how to use hybrid search techniques in CrateDB, using pure SQL.

            Articles

            Blog: Hybrid Search in CrateDB

            A common scenario is to combine semantic search (vector search) with lexical/term/keyword search (inverted index + BM25).

            Semantic search excels at understanding the context of a phrase. Lexical search is great at finding how many times a keyword or phrase appears in a document, taking into account the length and the average length of your documents (TF–IDF).

            The article will go through both search methods. You will learn how to combine them, and how to apply different scoring and re-ranking techniques, all using CrateDB and pure SQL.

            Doing Hybrid Search in CrateDB

            Introduction
            Hybrid Search
            Pure SQL

            What’s Inside

            • Full-Text Search (BM25)

            • Vector Search (kNN/HNSW)

            • Convex Combination

            • Reciprocal Rank Fusion (RRF)

            • SQL: CTE, JOIN, RANK

            Next
            BLOB Store
            Previous
            Vector Search
              Feedback

              Suggest improvement

              Edit page

              View page source

            On this page
            • Hybrid Search
              • Synopsis
              • Usage
              • Learn
            • Imprint
            • Contact
            • Legal
            Follow us
            Follow us on X Follow us on LinkedIn Follow us on Facebook Follow us on Instagram Follow us on Facebook