Time-Series: Analyzing Weather Data

CrateDB is a powerful database designed to handle various use cases, one of which is managing time series data. Time series data refers to collections of data points recorded at specific intervals over time, like the hourly temperature of a city or the daily sales of a store.

For this tutorial, imagine a dataset that captures weather readings from CrateDB offices across the globe. Each record includes:

  • timestamp: The exact time of the recording.

  • location: The location of the weather station.

  • temperature: The temperature in degrees Celsius.

  • humidity: The humidity in percentage.

  • wind_speed: The wind speed in km/h.

Creating the Table

CrateDB uses SQL, a powerful and familiar language for database management. To store the weather data, create a table with columns tailored to the dataset using the CREATE TABLE command:

CREATE TABLE "weather_data" (
    "timestamp" TIMESTAMP,
    "location" VARCHAR,
    "temperature" DOUBLE,
    "humidity" DOUBLE,
    "wind_speed" DOUBLE

Run the above SQL command in CrateDB to set up your table. With the table ready, you are now set to insert the dataset.

Inserting Data

Insert the data using the COPY FROM SQL statement.

COPY weather_data
FROM 'https://github.com/crate/cratedb-datasets/raw/main/cloud-tutorials/data_weather.csv.gz'
WITH (format='csv', compression='gzip', empty_string_as_null=true);

Analyzing Data

Start with a basic SELECT statement on all columns, and limit the output to display only 10 records, in order to quickly explore a few samples worth of data.

FROM weather_data

CrateDB is built for fast aggregation using the columnar storage to speed up queries. For example, calculate the average temperature for each location by using the AVG aggregation function:

SELECT location, AVG(temperature) AS avg_temp
FROM weather_data
GROUP BY location;

Computing basic averages is nothing special, but what if you need to answer more detailed questions? For example, if you want to know the highest temperature for each place and when it occurred.

Simple groupings might not be enough, but thankfully, CrateDB has enhanced tools for time series data. You can use the max_by(returned_value, maximized_value) function, which gives you a value (like the time) when another value (like the temperature) is at its highest.

Let’s put this to use with the following query:

SELECT location,
       max(temperature) AS highest_temp,
       max_by(timestamp, temperature) AS time_of_highest_temp
FROM weather_data
GROUP BY location;

You have probably observed by now, that there are gaps in the dataset for certain metrics. Such occurrences are common, perhaps due to a sensor malfunction or disconnection. To address this, the missing values need to be filled in. You can employ another useful tool: window functions paired with the IGNORE NULLS feature. Within a Common Table Expression (CTE), we utilize window functions to spot the next and prior non-null temperature recordings, and then compute the arithmetic mean to bridge the gap:

WITH OrderedData AS (
    SELECT timestamp,
           LAG(temperature, 1) IGNORE NULLS OVER (ORDER BY timestamp) AS prev_temp,
           LEAD(temperature, 1) IGNORE NULLS OVER (ORDER BY timestamp) AS next_temp
    FROM weather_data
SELECT timestamp,
       (prev_temp + next_temp) / 2 AS interpolated_temperature
FROM OrderedData
ORDER BY location, timestamp

This query retrieves the missing temperature values and interpolates them by averaging the previous and next available temperature readings.