Skip to content
Explore

Try CrateDB Live: Import Data

Scenario: Weather Monitoring
  • 1. Choose Scenario
  • 2. Get Ready
  • 3. Run CrateDB
  • 4. Import Data
  • 5. Explore Queries
  • 6. More Queries
  • 7. Connect
  • 8. Next Steps

Sensor Dataset – 260k readings, 5 metrics

In this step, we load the sample weather dataset into three tables in CrateDB using the COPY FROM statements.

The dataset is provided as JSON files hosted in S3:

  • https://guided-path.s3.us-east-1.amazonaws.com/export-demo_climate_data_large_v2.json
  • https://guided-path.s3.us-east-1.amazonaws.com/geo_points.json
  • https://guided-path.s3.us-east-1.amazonaws.com/german_regions.json

It is derived from a combination of several publicly available climate datasets published by Copernicus Climate Data Store.

CrateDB can ingest data directly from HTTP(S) endpoints, so there is no need to download the file locally.

1. Create the target tables:

Before importing the data, create the tables that reflect the structure of the dataset.

The first table is climate_data. It stores around 264,990 raw sensor readings from multiple locations:

CREATE TABLE IF NOT EXISTS demo.climate_data (
   measurement_time TIMESTAMP WITHOUT TIME ZONE,
   geo_location GEO_POINT,
   data OBJECT(DYNAMIC) AS (
      temperature DOUBLE PRECISION,
      pressure DOUBLE PRECISION,
      u10 DOUBLE PRECISION,
      v10 DOUBLE PRECISION,
      latitude DOUBLE PRECISION,
      longitude DOUBLE PRECISION
   )
);

measurement_time stores the time of the observation

  • geo_location is a GEO_POINT used for geospatial queries
  • data is a dynamic object that groups all sensor readings:
    • temperature, pressure
    • wind components (u10, v10)
    • raw latitude and longitude values

Using an OBJECT(DYNAMIC) column allows you to group related measurements while still querying them using standard SQL.

The second table is geo_points. It maps sensor locations to their nearest town, and has 726 rows when loaded:

CREATE TABLE IF NOT EXISTS  demo.geo_points (
    latitude     DOUBLE PRECISION,
    longitude    DOUBLE PRECISION,
    geo_location GEO_POINT,
    nearest_town TEXT,
    PRIMARY KEY (latitude, longitude)
);
The third table is 'german_regions'. It stores 16 regions.
CREATE TABLE IF NOT EXISTS demo.german_regions (
    region_name        TEXT PRIMARY KEY,
    geo_coords         GEO_SHAPE,
    tourism_info       TEXT INDEX USING FULLTEXT WITH (analyzer = 'english'),
    transportation     TEXT INDEX USING FULLTEXT WITH (analyzer = 'english'),
    economics          TEXT INDEX USING FULLTEXT WITH (analyzer = 'english'),
    introduced_species TEXT INDEX USING FULLTEXT WITH (analyzer = 'english'),
    embedding          FLOAT_VECTOR(1536)
);

For each region we store:

  • region_name, such as 'Berlin'
  • geo_coords, a shape that defines the outline of region in around 100 points.
  • tourism_info, transportation, economics and introduced_species. These each contain around 2,000 words of text on the subject in question for each region. Used for text search.
  • embedding, a vector created by looking at the above fields. Used for vector search.

2. Understanding the schema.

The schema is a hierarchy:

  • Germany is divided into german_regions.
  • Each region contains 1 or more geo_points. Note we don't have a traditional foreign key relationship. We use the WITHIN operator instead.
  • The lowest level is climate_data. A row is a reading, for a geo_point, for a specific time.

3. Import the data using COPY FROM:

Run the following statements to load the data:

COPY demo.climate_data
FROM 'https://guided-path.s3.us-east-1.amazonaws.com/export-demo_climate_data_large_v2.json'
WITH (
    format = 'json',
    header = true
) return summary;
COPY demo.geo_points
FROM 'https://guided-path.s3.us-east-1.amazonaws.com/geo_points.json'
WITH (
    format = 'json',
    header = true
) return summary;
COPY demo.german_regions
FROM 'https://guided-path.s3.us-east-1.amazonaws.com/german_regions.json'
WITH (
    format = 'json',
    header = true
) return summary;

Note that if you are using a Crate Cloud free tier account, this will take a few seconds, as we use a shared environment with limited resources. For information on the performance aspect of loading lots of data, see here.

4. Validate the data:

Check that the data has been loaded:

SELECT COUNT(*) FROM demo.climate_data;
SELECT COUNT(*) FROM demo.geo_points;
SELECT COUNT(*) FROM demo.german_regions;
Preview a few rows:
SELECT * FROM demo.climate_data LIMIT 10;
You can also query nested fields directly:
SELECT
    measurement_time,
    data['temperature'],
    data['pressure']
FROM demo.climate_data
LIMIT 10;

Notes and considerations

  • CrateDB distributes ingestion across nodes automatically in a cluster
  • The OBJECT(DYNAMIC) column allows additional fields to be added later without schema changes
  • Storing both geo_location and raw latitude/longitude enables flexible geospatial queries

Summary

In this step, you:
  • Created tables with structured + semi-structured schemas
  • Imported JSON data directly from an external source
  • Verified that both top-level and nested fields were correctly ingested

This dataset will be used in the next sections to explore querying, analytics, and geospatial operations in CrateDB.