Try CrateDB Live: Import Data
Sensor Dataset – 260k readings, 5 metrics
In this step, we load the sample weather dataset into three tables in CrateDB using the COPY FROM statements.
The dataset is provided as JSON files hosted in S3:
https://guided-path.s3.us-east-1.amazonaws.com/export-demo_climate_data_large_v2.jsonhttps://guided-path.s3.us-east-1.amazonaws.com/geo_points.jsonhttps://guided-path.s3.us-east-1.amazonaws.com/german_regions.json
It is derived from a combination of several publicly available climate datasets published by Copernicus Climate Data Store.
CrateDB can ingest data directly from HTTP(S) endpoints, so there is no need to download the file locally.
1. Create the target tables:
Before importing the data, create the tables that reflect the structure of the dataset.
The first table is climate_data. It stores around 264,990 raw sensor readings from multiple locations:
CREATE TABLE IF NOT EXISTS demo.climate_data ( measurement_time TIMESTAMP WITHOUT TIME ZONE, geo_location GEO_POINT, data OBJECT(DYNAMIC) AS ( temperature DOUBLE PRECISION, pressure DOUBLE PRECISION, u10 DOUBLE PRECISION, v10 DOUBLE PRECISION, latitude DOUBLE PRECISION, longitude DOUBLE PRECISION ) );
measurement_time stores the time of the observation
geo_locationis a GEO_POINT used for geospatial queriesdatais a dynamic object that groups all sensor readings:- temperature, pressure
- wind components (u10, v10)
- raw latitude and longitude values
Using an OBJECT(DYNAMIC) column allows you to group related measurements while still querying them using standard SQL.
The second table is geo_points. It maps sensor locations to their nearest town, and has 726 rows when loaded:
CREATE TABLE IF NOT EXISTS demo.geo_points ( latitude DOUBLE PRECISION, longitude DOUBLE PRECISION, geo_location GEO_POINT, nearest_town TEXT, PRIMARY KEY (latitude, longitude) );
CREATE TABLE IF NOT EXISTS demo.german_regions ( region_name TEXT PRIMARY KEY, geo_coords GEO_SHAPE, tourism_info TEXT INDEX USING FULLTEXT WITH (analyzer = 'english'), transportation TEXT INDEX USING FULLTEXT WITH (analyzer = 'english'), economics TEXT INDEX USING FULLTEXT WITH (analyzer = 'english'), introduced_species TEXT INDEX USING FULLTEXT WITH (analyzer = 'english'), embedding FLOAT_VECTOR(1536) );
For each region we store:
region_name, such as 'Berlin'geo_coords, a shape that defines the outline of region in around 100 points.tourism_info,transportation,economicsandintroduced_species. These each contain around 2,000 words of text on the subject in question for each region. Used for text search.embedding, a vector created by looking at the above fields. Used for vector search.
2. Understanding the schema.
The schema is a hierarchy:
- Germany is divided into
german_regions. - Each region contains 1 or more
geo_points. Note we don't have a traditional foreign key relationship. We use the WITHIN operator instead. - The lowest level is
climate_data. A row is a reading, for ageo_point, for a specific time.
3. Import the data using COPY FROM:
Run the following statements to load the data:
COPY demo.climate_data FROM 'https://guided-path.s3.us-east-1.amazonaws.com/export-demo_climate_data_large_v2.json' WITH ( format = 'json', header = true ) return summary;
COPY demo.geo_points FROM 'https://guided-path.s3.us-east-1.amazonaws.com/geo_points.json' WITH ( format = 'json', header = true ) return summary;
COPY demo.german_regions FROM 'https://guided-path.s3.us-east-1.amazonaws.com/german_regions.json' WITH ( format = 'json', header = true ) return summary;
Note that if you are using a Crate Cloud free tier account, this will take a few seconds, as we use a shared environment with limited resources. For information on the performance aspect of loading lots of data, see here.
4. Validate the data:
Check that the data has been loaded:
SELECT COUNT(*) FROM demo.climate_data;
SELECT COUNT(*) FROM demo.geo_points;
SELECT COUNT(*) FROM demo.german_regions;
SELECT * FROM demo.climate_data LIMIT 10;
SELECT measurement_time, data['temperature'], data['pressure'] FROM demo.climate_data LIMIT 10;
Notes and considerations
- CrateDB distributes ingestion across nodes automatically in a cluster
- The
OBJECT(DYNAMIC)column allows additional fields to be added later without schema changes - Storing both geo_location and raw latitude/longitude enables flexible geospatial queries
Summary
In this step, you:- Created tables with structured + semi-structured schemas
- Imported JSON data directly from an external source
- Verified that both top-level and nested fields were correctly ingested
This dataset will be used in the next sections to explore querying, analytics, and geospatial operations in CrateDB.