Skip to content
Data models > Time series

Time Series Data Analysis with SQL

Using native SQL language is very convenient when accessing and manipulating various types of data, specifically time-series data, allowing for efficient and robust data analysis.

For hands-on practice with SQL queries, follow this Jupyter notebook. It will guide you through creating tables, importing sample data, querying the data with SQL, and visualizing the results using pandas DataFrames and Plotly.
cr-quote-img-white

The extensive SQL querying capabilities of CrateDB include multiple functions:

 Aggregations

CrateDB is built for fast aggregation using a columnar storage engine to speed up queries.

Count() function

The count(column) aggregate function simply returns the number of rows that match the query. When used with a column name as a parameter, it returns the count of rows where the specified column has a non-NULL value. 

SELECT location, COUNT(*)
FROM weather_data_json
GROUP BY 1;

+-------------+----------+---------------+
| count(name) | count(*) | date          |
+-------------+----------+---------------+
|      7      |    8     | 1373932800000 |
|      4      |    4     |  308534400000 |
|      1      |    1     | 1367366400000 |
+-------------+----------+---------------+
SELECT 3 rows in set (... sec)

AVG() function

CrateDB employs the AVG function to calculate averages, such as temperature per location.

SELECT location, AVG(vals['humidity']) AS avg_humidity
FROM weather_data_json
GROUP BY 1;

+---------------+--------------+
| location      | avg_humidity |
+---------------+--------------+
| San Francisco |         60.0 |
| Sydney        |         41.0 |
+---------------+--------------+
SELECT 2 rows in set (0.003 sec)

MAX_BY and MIN_BY functions

For advanced queries, for example finding the highest and lowest temperatures with their corresponding times, CrateDB offers functions like MAX_BY and MIN_BY, demonstrating its facility to handle complex data aggregations.

SELECT location, MAX(ts) AS latest_reading, MAX_BY(vals['humidity'], ts) AS latest_humidity
FROM weather_data_json
GROUP BY 1;

+---------------+----------------+-----------------+
| location      | latest_reading | latest_humidity |
+---------------+----------------+-----------------+
| San Francisco |  1705069860000 |            60.0 |
| Sydney        |  1705069860000 |            40.0 |
+---------------+----------------+-----------------+
SELECT 2 rows in set (0.052 sec)
SELECT location, MIN(ts) AS earliest_reading, MIN_BY(vals['humidity'], ts) AS earliest_humidity
FROM weather_data_json
GROUP BY 1;

+---------------+------------------+-------------------+
| location      | earliest_reading | earliest_humidity |
+---------------+------------------+-------------------+
| San Francisco |    1705069800000 |              60.0 |
| Sydney        |    1705069770000 |              41.0 |
+---------------+------------------+-------------------+
SELECT 2 rows in set (0.005 sec)

Geometric_mean() function

The aggregate function geometric_mean(column) calculates the geometric mean, which is a type of average used for positive numbers. This function is applicable on all numeric types as well as timestamps and consistently returns double values.

SELECT location, GEOMETRIC_MEAN(vals['humidity']) AS geo_mean_humidity
FROM weather_data_json
GROUP BY 1;

+---------------+--------------------+
| location      |  geo_mean_humidity |
+---------------+--------------------+
| San Francisco | 59.999999999999986 |
| Sydney        | 40.991868306014425 |
+---------------+--------------------+
SELECT 2 rows in set (0.052 sec)

Arbitrary() function

The arbitrary(column) function produces a single value from a specified column, however, it does not define which value is returned. The output type of this function matches the type of the parameter column used.

SELECT location, ARBITRARY(vals['humidity']) AS any_humidity
FROM weather_data_json
GROUP BY 1;

+---------------+--------------+
| location      | any_humidity |
+---------------+--------------+
| San Francisco |         60.0 |
| Sydney        |         42.0 |
+---------------+--------------+
SELECT 2 rows in set (0.003 sec)

You can find the list of all available aggregation functions in the documentation here.

Common Table Expressions (CTEs)

Common Table Expressions provide a way to reference named sub-queries within the primary query. These sub-queries serve as temporary tables or views for the duration of the primary query, improving the readability of SQL code by breaking down complex queries into smaller, more manageable parts.
 
WITH hourly_stats AS (
  SELECT location,
         DATE_TRUNC('hour', ts) AS ts_hour,
         AVG(vals['humidity']) AS avg_humidity,
         AVG(vals['temperature']) AS avg_temperature,
         AVG(vals['wind_speed']) AS avg_wind_speed
  FROM weather_data_json
  GROUP BY 1, 2
)
SELECT 
FROM hourly_stats s
JOIN warning_thresholds t ON s.avg_wind_speed >= t.max_wind_speed
WHERE ts_hour >= NOW() - '5 hours'::INTERVAL;

+--------------+-------------+--------------+-----------------+----------------+----------------+
| location     |   ts_hour   | avg_humidity | avg_temperature | avg_wind_speed | max_wind_speed |
+--------------+-------------+--------------+-----------------+----------------+----------------+
|San Francisco |1705068000000|         60.0 |            17.5 |           35.6 |           35.0 |
+--------------+-------------+--------------+-----------------+----------------+----------------+

Window Functions

CrateDB utilizes window functions to effectively manage time series data. With the OVER clause, these functions identify previous and next non-null temperature readings and calculate the mean to fill gaps. The IGNORE NULLS clause ensures null values are excluded, enhancing the accuracy and efficiency of data processing.

For the list of the window functions available, check out the documentation here.

Interpolation of Missing Values

CrateDB uses the LAG and LEAD query functions to find "previous" and "next" non-empty values for variables like temperature, humidity, and wind speed within a data window defined by location and ordered by timestamp.
When a previous or next value is unavailable, typically at the start or end of each data window, the COALESCE clause provides the "current" value, ensuring data continuity in handling complex datasets.
 
SELECT locations.location,
       s.ts,
       vals['temperature'] AS current_temperature,
       LAG(vals['temperature']) IGNORE NULLS OVER w AS prev_temperature,
       LEAD(vals['temperature']) IGNORE NULLS OVER w AS next_temperature
FROM 
       GENERATE_SERIES('2024-01-12 14:30:00', '2024-01-12 14:32:00',   '1 minute'::INTERVAL) s("ts")
CROSS JOIN 
       UNNEST(['San Francisco', 'Sydney']) AS locationns("location")
LEFT JOIN 
       weather_data_json w ON locations.location = w.location AND w.ts = s.ts
WINDOW w 
       AS (PARTITION BY locations.location ORDER BY s.ts)
ORDER BY 
       1, 2 DESC;
       
+---------------+---------------+---------------------+------------------+------------------+
| location      |            ts | current_temperature | prev_temperature | next_temperature |
+---------------+---------------+---------------------+------------------+------------------+
| San Francisco | 1705069920000 |              NULL   |             17.5 |           NULL   |
| San Francisco | 1705069860000 |                17.5 |             16.0 |           NULL   |
| San Francisco | 1705069800000 |                16.0 |           NULL   |             17.5 |
| Sydney        | 1705069920000 |              NULL   |             27.8 |             27.8 |
| Sydney        | 1705069860000 |                27.8 |             29.1 |           NULL   |
| Sydney        | 1705069800000 |                29.1 |           NULL   |             27.8 |
+---------------+---------------+---------------------+------------------+------------------+
SELECT 6 rows in set (0.004 sec)

Moving Averages

Moving averages are a widely used statistical technique in time series analysis, for smoothing out short-term fluctuations and for highlighting long-term trends and cycles. With CrateDB, you can efficiently compute moving averages, such as the last 10 and 20 readings, to gain valuable insights from your data.
 
SELECT ts,
       vals['humidity'] AS current_humidity,
       AVG(vals['humidity']) OVER last_2 AS avg_humidity_last_2_readings,
       AVG(vals['humidity']) OVER last_20 AS avg_humidity_last_20_readings
FROM weather_data_json w
WHERE location = 'San Francisco'
WINDOW 
      last_2 AS (PARTITION BY location ORDER BY ts ROWS BETWEEN 2 PRECEDING AND CURRENT ROW),
      last_20 AS (PARTITION BY location ORDER BY ts ROWS BETWEEN 20 PRECEDING AND CURRENT ROW)
ORDER BY 1, 2 ASC
LIMIT 5;

+---------------+------------------+------------------------------+-------------------------------+
|            ts | current_humidity | avg_humidity_last_2_readings | avg_humidity_last_20_readings |
+---------------+------------------+------------------------------+-------------------------------+
| 1705069800000 |             60.0 |           60.0               |            60.0               |
| 1705069860000 |             60.0 |           60.0               |            60.0               |
| 1717686351524 |             20.0 |           46.666666666666664 |            46.666666666666664 |
| 1717686353859 |             50.0 |           43.333333333333336 |            a                  |
| 1717686355629 |             55.0 |           41.666666666666664 |            49.0               |
+---------------+------------------+------------------------------+-------------------------------+
SELECT 5 rows in set (0.054 sec)

Joins

Moving averages are a widely used statistical technique in time series analysis, for smoothing out short-term fluctuations and for highlighting long-term trends and cycles. With CrateDB, you can efficiently compute moving averages, such as the last 10 and 20 readings, to gain valuable insights from your data.
 
SELECT w.ts,
       w.location,
       s.software_version,
       s.geo_location,
       w.vals['temperature'] AS temperature,
       w.vals['humidity'] AS humidity
FROM UNNEST(['San Francisco', 'Sydney']) AS locations("location")
LEFT JOIN weather_data_json w USING (location)
LEFT JOIN weather_station s ON s.location = locations.location;

+---------------+---------------+----------------+-------------------------------------------+-----------+--------+
|            ts | location      |software_version| geo_location                              |temperature|humidity|
+---------------+---------------+----------------+-------------------------------------------+-----------+--------+
| 1705069800000 | San Francisco | C-5.rev2       | [122.23999997600913, 37.45999999344349]   |      16.0 |   60.0 |
| 1705069860000 | San Francisco | C-5.rev2       | [122.23999997600913, 37.45999999344349]   |      17.5 |   60.0 |
| 1705069800000 | Sydney        | A-1            | [151.11999998800457, -33.520000032149255] |      29.1 |   42.0 |
| 1705069770000 | Sydney        | A-1            | [151.11999998800457, -33.520000032149255] |      28.5 |   41.0 |
| 1705069860000 | Sydney        | A-1            | [151.11999998800457, -33.520000032149255] |      27.8 |   40.0 |
+---------------+---------------+----------------+-------------------------------------------+-----------+--------+
SELECT 5 rows in set (0.003 sec)

Additionally, CrateDB provides a host of other useful functions that are essential for data analysis:

  • Data binning: The DATE_BIN function is useful for grouping data by time-based intervals, e.g., every 10 seconds. It is also a handy tool for harmonizing non-equidistant events to correlate them to equidistant events.
  • Sampling of the dataset: In some use cases, it might be necessary to sample the dataset. This can easily be achieved with a modulo operator on the internal doc id attribute.
  • User Defined Functions: CrateDB allows for domain-specific logic to be put into user-defined functions, making it reusable in an easy and consistent manner. For example, we can calculate the temperature in Fahrenheit from degrees Celsius using a dedicated function.
 
CREATE FUNCTION to_fahrenheit (c DOUBLE)
RETURNS DOUBLE
LANGUAGE JAVASCRIPT AS
'function to_fahrenheit(c) {
    return (c * (9/5)) + 32;
}';

query-explanation

Want to read more?

Whitepaper" Guide for Time Series Data Projects