Skip to content
Blog

Time Series Forecasting with SQL: DATE_BIN, Window Functions, and Rolling Aggregations

Every time series feature pipeline has the same shape: readings land in the database, a job exports a CSV or DataFrame to Python, pandas computes rolling means and standard deviations, and the result is written back as a training dataset. That loop adds latency, adds a failure point, and runs on a schedule rather than on live data. The aggregations (bucketing, rolling averages, deviation scoring) are SQL operations. They belong in the database.

CrateDB's DATE_BIN, window functions, and COUNT(*) FILTER (WHERE ...) cover the majority of what most data engineers write in pandas for time series feature engineering. Here are three patterns and how to compose them into a feature table.

Pattern 1: Time Bucketing with DATE_BIN

DATE_BIN groups timestamps into equal-width intervals without writing bucketing logic in application code. It takes three arguments: the stride (interval width), the source timestamp column, and an origin (the epoch to align bucket boundaries to).

-- Group sensor readings into 5-minute intervals per device
SELECT
    DATE_BIN('5 minutes'::INTERVAL, ts, '2024-01-01') AS bucket,
    device_id,
    AVG(temperature) AS avg_temp,
    MAX(temperature) AS max_temp,
    MIN(temperature) AS min_temp,
    COUNT(*)         AS reading_count
FROM sensor_readings
WHERE ts > NOW() - INTERVAL '24 hours'
GROUP BY bucket, device_id
ORDER BY device_id, bucket;

Every row in the result represents one device for one 5-minute window, which is the direct input format for most time series models, with no reshaping required.

COUNT(*) FILTER (WHERE ...) adds conditional counts within each bucket without a subquery:

-- Count out-of-range readings per bucket alongside totals
SELECT
    DATE_BIN('5 minutes'::INTERVAL, ts, '2024-01-01') AS bucket,
    device_id,
    COUNT(*) AS total_readings,
    COUNT(*) FILTER (WHERE temperature > 80) AS high_temp_count,
    COUNT(*) FILTER (WHERE temperature < 10) AS low_temp_count
FROM sensor_readings
WHERE ts > NOW() - INTERVAL '24 hours'
GROUP BY bucket, device_id
ORDER BY device_id, bucket;

A reading outside the normal range within a 5-minute window is itself a feature; the count of out-of-range readings per bucket tells a model something a plain average does not.

Pattern 2: Rolling Aggregates with Window Functions

Window functions compute a value for each row based on a sliding window of surrounding rows without collapsing the result into groups. For time series, you want one row per reading with the rolling context attached, not one row per window.

The rate of change between consecutive readings — the first derivative of the signal — is the simplest window computation:

-- Delta between consecutive readings per device
SELECT
    device_id,
    ts,
    temperature,
    LAG(temperature) OVER (
        PARTITION BY device_id
        ORDER BY ts
    ) AS prev_temp,
    temperature - LAG(temperature) OVER (
        PARTITION BY device_id
        ORDER BY ts
    ) AS delta
FROM sensor_readings
WHERE ts > NOW() - INTERVAL '1 hour'
ORDER BY device_id, ts;

LAG returns the previous row's value within the partition. LEAD returns the next row, it's useful for labeling: if you know what follows a reading pattern, you can label the current row as a precursor to that event.

A rolling average smooths noise and reveals trends:

-- 10-reading rolling average per device, ordered by time
SELECT
    device_id,
    ts,
    temperature,
    AVG(temperature) OVER (
        PARTITION BY device_id
        ORDER BY ts
        ROWS BETWEEN 9 PRECEDING AND CURRENT ROW
    ) AS rolling_avg_10
FROM sensor_readings
WHERE ts > NOW() - INTERVAL '6 hours'
ORDER BY device_id, ts;

ROWS BETWEEN 9 PRECEDING AND CURRENT ROW defines a 10-row window: the current reading plus the nine that preceded it, in time order within each device partition. Replace AVG with STDDEV to get rolling volatility. This is a separate feature that captures how variable a device's output has been recently, independent of its absolute value.

Pattern 3: Deviation Scoring for Anomaly Flagging

A z-score measures how far a reading sits from the historical mean in units of standard deviation. A CTE computes the baseline statistics; a join applies them to current readings.

-- Flag readings more than 2 standard deviations from the 24-hour mean
WITH device_baseline AS (
    SELECT
        device_id,
        AVG(temperature)    AS mean_temp,
        STDDEV(temperature) AS stddev_temp
    FROM sensor_readings
    WHERE ts > NOW() - INTERVAL '24 hours'
    GROUP BY device_id
)
SELECT
    r.device_id,
    r.ts,
    r.temperature,
    b.mean_temp,
    b.stddev_temp,
    ABS(r.temperature - b.mean_temp)
        / NULLIF(b.stddev_temp, 0) AS z_score
FROM sensor_readings r
JOIN device_baseline b ON r.device_id = b.device_id
WHERE r.ts > NOW() - INTERVAL '1 hour'
  AND ABS(r.temperature - b.mean_temp)
      / NULLIF(b.stddev_temp, 0) > 2
ORDER BY z_score DESC;

NULLIF(b.stddev_temp, 0) avoids a division-by-zero error for devices whose readings have not varied. Any device with zero standard deviation over 24 hours either has a stuck sensor or is running in a completely stable state — worth flagging separately from the anomaly detection path.

This query runs against live data. The baseline window (NOW() - INTERVAL '24 hours') updates every time the query executes. No scheduled job, no stale threshold file.

Composing a Feature Table

The three patterns compose directly into a CTE chain. The result is a feature table: bucketed aggregates, rolling context, and a deviation score, with one row per device per 5-minute bucket.

-- Feature table: bucketed readings with rolling averages and anomaly scores
WITH bucketed AS (
    SELECT
        DATE_BIN('5 minutes'::INTERVAL, ts, '2024-01-01') AS bucket,
        device_id,
        AVG(temperature)    AS avg_temp,
        MAX(temperature)    AS max_temp,
        COUNT(*)            AS reading_count
    FROM sensor_readings
    WHERE ts > NOW() - INTERVAL '7 days'
    GROUP BY bucket, device_id
),
baseline AS (
    SELECT
        device_id,
        AVG(temperature)    AS baseline_mean,
        STDDEV(temperature) AS baseline_stddev
    FROM sensor_readings
    WHERE ts BETWEEN NOW() - INTERVAL '30 days'
                 AND NOW() - INTERVAL '7 days'
    GROUP BY device_id
)
SELECT
    b.bucket,
    b.device_id,
    b.avg_temp,
    b.max_temp,
    b.reading_count,
    AVG(b.avg_temp) OVER (
        PARTITION BY b.device_id
        ORDER BY b.bucket
        ROWS BETWEEN 11 PRECEDING AND CURRENT ROW
    ) AS rolling_avg_12_buckets,
    ABS(b.avg_temp - d.baseline_mean)
        / NULLIF(d.baseline_stddev, 0) AS z_score
FROM bucketed b
JOIN baseline d ON b.device_id = d.device_id
ORDER BY b.device_id, b.bucket;

The baseline window uses the 30-to-7-day range rather than the full 30 days. That separation ensures recent anomalies do not inflate the baseline mean and distort the z-score for the current window — a deliberate modeling choice that matters at scale when sensor behavior shifts over time.

The output of this query is ready to pass to a model training run, a gradient boosting classifier, or a threshold alert. The data-layer work is done.

What This Is and What It Is Not

These SQL patterns compute statistical features. They do not produce a forecast. A forecast requires a trained model (ARIMA, Prophet, gradient boosting) that learns a trend and projects future values. That step runs in Python or R.

What SQL handles well is the preparation layer: the step between raw readings and model-ready input. Moving that step into the database means it runs against live data, updates continuously, and does not require a scheduled export job to stay current. The data engineer owns this step. The data scientist owns what comes next.

For the model training side of the pipeline — how to structure SQL queries as feature engineering within a machine learning workflow — see the time series context guide and the SQL feature engineering for ML post.

Run These Queries on Live Data

DATE_BIN and window functions are available in all CrateDB editions, including CrateDB Cloud, which starts free. The time series data model documentation covers partitioning and sharding strategies for high-cardinality sensor datasets.

Run queries on live data ->