Skip to content
Explore

Try CrateDB Live

Scenario: Hybrid Search

Dataset

In this step, we load a sample IoT dataset into CrateDB using the COPY FROM statement.

The dataset is provided as a JSON file hosted:
https://cdn.crate.io/downloads/datasets/cratedb-datasets/academy/chicago-data/chicago_community_areas_with_vectors.json

CrateDB can ingest data directly from HTTP(S) endpoints, so there is no need to download the file locally.

1. Create the target table:

Before importing the data, create a table that reflects the structure of the dataset.

CREATE TABLE demo.community_areas (
   areanumber INTEGER PRIMARY KEY,
   name TEXT,
   details OBJECT(DYNAMIC) AS (
       description TEXT INDEX USING fulltext WITH (analyzer = 'english'),
       description_vec FLOAT_VECTOR(2048),
       population BIGINT
   ),
   boundaries GEO_SHAPE
);

2. Understanding the schema:

  • areanumber is the unique identifier of each area and therefore our primary key.
  • name is the human-readable name of the area.
  • details is a dynamic object that groups additional attributes.
    • description is a text field elaborating on the particularities of the respective community area. Text columns are indexed by default using a plain index. We need to opt-in to a full-text index by enabling it explicitly.
    • description_vec is a 2048-dimensional vector generated from description. Vectors have been pre-computed during generation of the data set.
  • boundaries is a GEO_SHAPE representing the border of the area.

Using an OBJECT(DYNAMIC) column allows you to group related measurements while still querying them using standard SQL.

3. Import the data using COPY FROM:

Run the following statement to load the data:

COPY demo.community_areas
FROM 'https://cdn.crate.io/downloads/datasets/cratedb-datasets/academy/chicago-data/chicago_community_areas_with_vectors.json'
RETURN SUMMARY;

The success_count column of the returned result will indicate the number of imported rows.

4. Validate the data:

Check that the data has been loaded:

SELECT COUNT(*) FROM demo.community_areas;

Preview a few rows:

SELECT * FROM demo.community_areas LIMIT 10; 

You can also query nested fields directly:

SELECT
    areanumber,
    name,
    details['description'],
    details['description_vec']
FROM demo.community_areas
LIMIT 10;

Notes and considerations

  • CrateDB distributes ingestion across nodes automatically in a cluster
  • The OBJECT(DYNAMIC)column allows additional fields to be added later without upfront schema changes
  • Even without specifying indexes, all columns are indexed by default

Summary

In this step, you:

  • Created a table with a structured + semi-structured schema
  • Imported JSON data directly from an external source
  • Verified that both top-level and nested fields were correctly ingested

This dataset will be used in the next sections to explore queries and search in CrateDB.