Time Series Data Analysis with SQL

SELECT location, COUNT(*)
FROM weather_data_json
GROUP BY 1;

+-------------+----------+---------------+
| count(name) | count(*) | date          |
+-------------+----------+---------------+
|      7      |    8     | 1373932800000 |
|      4      |    4     |  308534400000 |
|      1      |    1     | 1367366400000 |
+-------------+----------+---------------+
SELECT 3 rows in set (... sec)

SELECT location, AVG(vals['humidity']) AS avg_humidity
FROM weather_data_json
GROUP BY 1;

+---------------+--------------+
| location      | avg_humidity |
+---------------+--------------+
| San Francisco |         60.0 |
| Sydney        |         41.0 |
+---------------+--------------+
SELECT 2 rows in set (0.003 sec)

SELECT location, MAX(ts) AS latest_reading, MAX_BY(vals['humidity'], ts) AS latest_humidity
FROM weather_data_json
GROUP BY 1;

+---------------+----------------+-----------------+
| location      | latest_reading | latest_humidity |
+---------------+----------------+-----------------+
| San Francisco |  1705069860000 |            60.0 |
| Sydney        |  1705069860000 |            40.0 |
+---------------+----------------+-----------------+
SELECT 2 rows in set (0.052 sec)

SELECT location, MIN(ts) AS earliest_reading, MIN_BY(vals['humidity'], ts) AS earliest_humidity
FROM weather_data_json
GROUP BY 1;

+---------------+------------------+-------------------+
| location      | earliest_reading | earliest_humidity |
+---------------+------------------+-------------------+
| San Francisco |    1705069800000 |              60.0 |
| Sydney        |    1705069770000 |              41.0 |
+---------------+------------------+-------------------+
SELECT 2 rows in set (0.005 sec)

SELECT location, GEOMETRIC_MEAN(vals['humidity']) AS geo_mean_humidity
FROM weather_data_json
GROUP BY 1;

+---------------+--------------------+
| location      |  geo_mean_humidity |
+---------------+--------------------+
| San Francisco | 59.999999999999986 |
| Sydney        | 40.991868306014425 |
+---------------+--------------------+
SELECT 2 rows in set (0.052 sec)

SELECT location, ARBITRARY(vals['humidity']) AS any_humidity
FROM weather_data_json
GROUP BY 1;

+---------------+--------------+
| location      | any_humidity |
+---------------+--------------+
| San Francisco |         60.0 |
| Sydney        |         42.0 |
+---------------+--------------+
SELECT 2 rows in set (0.003 sec)

WITH hourly_stats AS (
  SELECT location,
         DATE_TRUNC('hour', ts) AS ts_hour,
         AVG(vals['humidity']) AS avg_humidity,
         AVG(vals['temperature']) AS avg_temperature,
         AVG(vals['wind_speed']) AS avg_wind_speed
  FROM weather_data_json
  GROUP BY 1, 2
)
SELECT 
FROM hourly_stats s
JOIN warning_thresholds t ON s.avg_wind_speed >= t.max_wind_speed
WHERE ts_hour >= NOW() - '5 hours'::INTERVAL;

+--------------+-------------+--------------+-----------------+----------------+----------------+
| location     |   ts_hour   | avg_humidity | avg_temperature | avg_wind_speed | max_wind_speed |
+--------------+-------------+--------------+-----------------+----------------+----------------+
|San Francisco |1705068000000|         60.0 |            17.5 |           35.6 |           35.0 |
+--------------+-------------+--------------+-----------------+----------------+----------------+

Window Functions

CrateDB utilizes window functions to effectively manage time series data. With the OVER clause, these functions identify previous and next non-null temperature readings and calculate the mean to fill gaps. The IGNORE NULLS clause ensures null values are excluded, enhancing the accuracy and efficiency of data processing.

For the list of the window functions available, check out the documentation here.

Interpolation of Missing Values

CrateDB uses the LAG and LEAD query functions to find "previous" and "next" non-empty values for variables like temperature, humidity, and wind speed within a data window defined by location and ordered by timestamp.

When a previous or next value is unavailable, typically at the start or end of each data window, the COALESCE clause provides the "current" value, ensuring data continuity in handling complex datasets.

SELECT locations.location,
       s.ts,
       vals['temperature'] AS current_temperature,
       LAG(vals['temperature']) IGNORE NULLS OVER w AS prev_temperature,
       LEAD(vals['temperature']) IGNORE NULLS OVER w AS next_temperature
FROM 
       GENERATE_SERIES('2024-01-12 14:30:00', '2024-01-12 14:32:00',   '1 minute'::INTERVAL) s("ts")
CROSS JOIN 
       UNNEST(['San Francisco', 'Sydney']) AS locationns("location")
LEFT JOIN 
       weather_data_json w ON locations.location = w.location AND w.ts = s.ts
WINDOW w 
       AS (PARTITION BY locations.location ORDER BY s.ts)
ORDER BY 
       1, 2 DESC;
       
+---------------+---------------+---------------------+------------------+------------------+
| location      |            ts | current_temperature | prev_temperature | next_temperature |
+---------------+---------------+---------------------+------------------+------------------+
| San Francisco | 1705069920000 |              NULL   |             17.5 |           NULL   |
| San Francisco | 1705069860000 |                17.5 |             16.0 |           NULL   |
| San Francisco | 1705069800000 |                16.0 |           NULL   |             17.5 |
| Sydney        | 1705069920000 |              NULL   |             27.8 |             27.8 |
| Sydney        | 1705069860000 |                27.8 |             29.1 |           NULL   |
| Sydney        | 1705069800000 |                29.1 |           NULL   |             27.8 |
+---------------+---------------+---------------------+------------------+------------------+
SELECT 6 rows in set (0.004 sec)

SELECT ts,
       vals['humidity'] AS current_humidity,
       AVG(vals['humidity']) OVER last_2 AS avg_humidity_last_2_readings,
       AVG(vals['humidity']) OVER last_20 AS avg_humidity_last_20_readings
FROM weather_data_json w
WHERE location = 'San Francisco'
WINDOW 
      last_2 AS (PARTITION BY location ORDER BY ts ROWS BETWEEN 2 PRECEDING AND CURRENT ROW),
      last_20 AS (PARTITION BY location ORDER BY ts ROWS BETWEEN 20 PRECEDING AND CURRENT ROW)
ORDER BY 1, 2 ASC
LIMIT 5;

+---------------+------------------+------------------------------+-------------------------------+
|            ts | current_humidity | avg_humidity_last_2_readings | avg_humidity_last_20_readings |
+---------------+------------------+------------------------------+-------------------------------+
| 1705069800000 |             60.0 |           60.0               |            60.0               |
| 1705069860000 |             60.0 |           60.0               |            60.0               |
| 1717686351524 |             20.0 |           46.666666666666664 |            46.666666666666664 |
| 1717686353859 |             50.0 |           43.333333333333336 |            a                  |
| 1717686355629 |             55.0 |           41.666666666666664 |            49.0               |
+---------------+------------------+------------------------------+-------------------------------+
SELECT 5 rows in set (0.054 sec)

SELECT w.ts,
       w.location,
       s.software_version,
       s.geo_location,
       w.vals['temperature'] AS temperature,
       w.vals['humidity'] AS humidity
FROM UNNEST(['San Francisco', 'Sydney']) AS locations("location")
LEFT JOIN weather_data_json w USING (location)
LEFT JOIN weather_station s ON s.location = locations.location;

+---------------+---------------+----------------+-------------------------------------------+-----------+--------+
|            ts | location      |software_version| geo_location                              |temperature|humidity|
+---------------+---------------+----------------+-------------------------------------------+-----------+--------+
| 1705069800000 | San Francisco | C-5.rev2       | [122.23999997600913, 37.45999999344349]   |      16.0 |   60.0 |
| 1705069860000 | San Francisco | C-5.rev2       | [122.23999997600913, 37.45999999344349]   |      17.5 |   60.0 |
| 1705069800000 | Sydney        | A-1            | [151.11999998800457, -33.520000032149255] |      29.1 |   42.0 |
| 1705069770000 | Sydney        | A-1            | [151.11999998800457, -33.520000032149255] |      28.5 |   41.0 |
| 1705069860000 | Sydney        | A-1            | [151.11999998800457, -33.520000032149255] |      27.8 |   40.0 |
+---------------+---------------+----------------+-------------------------------------------+-----------+--------+
SELECT 5 rows in set (0.003 sec)

Additionally, CrateDB provides a host of other useful functions that are essential for data analysis:

Data binning: The DATE_BIN function is useful for grouping data by time-based intervals, e.g., every 10 seconds. It is also a handy tool for harmonizing non-equidistant events to correlate them to equidistant events.
Sampling of the dataset: In some use cases, it might be necessary to sample the dataset. This can easily be achieved with a modulo operator on the internal doc id attribute.
User Defined Functions: CrateDB allows for domain-specific logic to be put into user-defined functions, making it reusable in an easy and consistent manner. For example, we can calculate the temperature in Fahrenheit from degrees Celsius using a dedicated function.

Time Series Data Analysis with SQL

Aggregations

Count() function

AVG() function

MAX_BY and MIN_BY functions

Geometric_mean() function

Arbitrary() function

Common Table Expressions (CTEs)

Window Functions

Interpolation of Missing Values

Moving Averages

Joins

Want to read more?

Related blog posts

Blog

Challenges when Analyzing Time Series Data

Blog

Checklist for Time Series Analysis

Talk

Not All Time Series Are Equal - Challenges of Storing and Analyzing Industrial Data

Company

Ecosystem

Contact