Time Series Forecasting with SQL: DATE_BIN, Window Functions, and Rolling Aggregations

Every time series feature pipeline has the same shape: readings land in the database, a job exports a CSV or DataFrame to Python, pandas computes rolling means and standard deviations, and the result is written back as a training dataset. That loop adds latency, adds a failure point, and runs on a schedule rather than on live data. The aggregations (bucketing, rolling averages, deviation scoring) are SQL operations. They belong in the database.

CrateDB's DATE_BIN, window functions, and COUNT(*) FILTER (WHERE ...) cover the majority of what most data engineers write in pandas for time series feature engineering. Here are three patterns and how to compose them into a feature table.

Pattern 1: Time Bucketing with DATE_BIN

DATE_BIN groups timestamps into equal-width intervals without writing bucketing logic in application code. It takes three arguments: the stride (interval width), the source timestamp column, and an origin (the epoch to align bucket boundaries to).

-- Group sensor readings into 5-minute intervals per device
SELECT
    DATE_BIN('5 minutes'::INTERVAL, ts, '2024-01-01') AS bucket,
    device_id,
    AVG(temperature) AS avg_temp,
    MAX(temperature) AS max_temp,
    MIN(temperature) AS min_temp,
    COUNT(*)         AS reading_count
FROM sensor_readings
WHERE ts > NOW() - INTERVAL '24 hours'
GROUP BY bucket, device_id
ORDER BY device_id, bucket;

Every row in the result represents one device for one 5-minute window, which is the direct input format for most time series models, with no reshaping required.

COUNT(*) FILTER (WHERE ...) adds conditional counts within each bucket without a subquery:

-- Count out-of-range readings per bucket alongside totals
SELECT
    DATE_BIN('5 minutes'::INTERVAL, ts, '2024-01-01') AS bucket,
    device_id,
    COUNT(*) AS total_readings,
    COUNT(*) FILTER (WHERE temperature > 80) AS high_temp_count,
    COUNT(*) FILTER (WHERE temperature < 10) AS low_temp_count
FROM sensor_readings
WHERE ts > NOW() - INTERVAL '24 hours'
GROUP BY bucket, device_id
ORDER BY device_id, bucket;

A reading outside the normal range within a 5-minute window is itself a feature; the count of out-of-range readings per bucket tells a model something a plain average does not.

Pattern 2: Rolling Aggregates with Window Functions

Window functions compute a value for each row based on a sliding window of surrounding rows without collapsing the result into groups. For time series, you want one row per reading with the rolling context attached, not one row per window.

The rate of change between consecutive readings — the first derivative of the signal — is the simplest window computation:

-- Delta between consecutive readings per device
SELECT
    device_id,
    ts,
    temperature,
    LAG(temperature) OVER (
        PARTITION BY device_id
        ORDER BY ts
    ) AS prev_temp,
    temperature - LAG(temperature) OVER (
        PARTITION BY device_id
        ORDER BY ts
    ) AS delta
FROM sensor_readings
WHERE ts > NOW() - INTERVAL '1 hour'
ORDER BY device_id, ts;

LAG returns the previous row's value within the partition. LEAD returns the next row, it's useful for labeling: if you know what follows a reading pattern, you can label the current row as a precursor to that event.

A rolling average smooths noise and reveals trends:

-- 10-reading rolling average per device, ordered by time
SELECT
    device_id,
    ts,
    temperature,
    AVG(temperature) OVER (
        PARTITION BY device_id
        ORDER BY ts
        ROWS BETWEEN 9 PRECEDING AND CURRENT ROW
    ) AS rolling_avg_10
FROM sensor_readings
WHERE ts > NOW() - INTERVAL '6 hours'
ORDER BY device_id, ts;

ROWS BETWEEN 9 PRECEDING AND CURRENT ROW defines a 10-row window: the current reading plus the nine that preceded it, in time order within each device partition. Replace AVG with STDDEV to get rolling volatility. This is a separate feature that captures how variable a device's output has been recently, independent of its absolute value.

Pattern 3: Deviation Scoring for Anomaly Flagging

A z-score measures how far a reading sits from the historical mean in units of standard deviation. A CTE computes the baseline statistics; a join applies them to current readings.

-- Flag readings more than 2 standard deviations from the 24-hour mean
WITH device_baseline AS (
    SELECT
        device_id,
        AVG(temperature)    AS mean_temp,
        STDDEV(temperature) AS stddev_temp
    FROM sensor_readings
    WHERE ts > NOW() - INTERVAL '24 hours'
    GROUP BY device_id
)
SELECT
    r.device_id,
    r.ts,
    r.temperature,
    b.mean_temp,
    b.stddev_temp,
    ABS(r.temperature - b.mean_temp)
        / NULLIF(b.stddev_temp, 0) AS z_score
FROM sensor_readings r
JOIN device_baseline b ON r.device_id = b.device_id
WHERE r.ts > NOW() - INTERVAL '1 hour'
  AND ABS(r.temperature - b.mean_temp)
      / NULLIF(b.stddev_temp, 0) > 2
ORDER BY z_score DESC;

NULLIF(b.stddev_temp, 0) avoids a division-by-zero error for devices whose readings have not varied. Any device with zero standard deviation over 24 hours either has a stuck sensor or is running in a completely stable state — worth flagging separately from the anomaly detection path.

This query runs against live data. The baseline window (NOW() - INTERVAL '24 hours') updates every time the query executes. No scheduled job, no stale threshold file.

Composing a Feature Table

The three patterns compose directly into a CTE chain. The result is a feature table: bucketed aggregates, rolling context, and a deviation score, with one row per device per 5-minute bucket.

-- Feature table: bucketed readings with rolling averages and anomaly scores
WITH bucketed AS (
    SELECT
        DATE_BIN('5 minutes'::INTERVAL, ts, '2024-01-01') AS bucket,
        device_id,
        AVG(temperature)    AS avg_temp,
        MAX(temperature)    AS max_temp,
        COUNT(*)            AS reading_count
    FROM sensor_readings
    WHERE ts > NOW() - INTERVAL '7 days'
    GROUP BY bucket, device_id
),
baseline AS (
    SELECT
        device_id,
        AVG(temperature)    AS baseline_mean,
        STDDEV(temperature) AS baseline_stddev
    FROM sensor_readings
    WHERE ts BETWEEN NOW() - INTERVAL '30 days'
                 AND NOW() - INTERVAL '7 days'
    GROUP BY device_id
)
SELECT
    b.bucket,
    b.device_id,
    b.avg_temp,
    b.max_temp,
    b.reading_count,
    AVG(b.avg_temp) OVER (
        PARTITION BY b.device_id
        ORDER BY b.bucket
        ROWS BETWEEN 11 PRECEDING AND CURRENT ROW
    ) AS rolling_avg_12_buckets,
    ABS(b.avg_temp - d.baseline_mean)
        / NULLIF(d.baseline_stddev, 0) AS z_score
FROM bucketed b
JOIN baseline d ON b.device_id = d.device_id
ORDER BY b.device_id, b.bucket;

The baseline window uses the 30-to-7-day range rather than the full 30 days. That separation ensures recent anomalies do not inflate the baseline mean and distort the z-score for the current window — a deliberate modeling choice that matters at scale when sensor behavior shifts over time.

The output of this query is ready to pass to a model training run, a gradient boosting classifier, or a threshold alert. The data-layer work is done.

What This Is and What It Is Not

These SQL patterns compute statistical features. They do not produce a forecast. A forecast requires a trained model (ARIMA, Prophet, gradient boosting) that learns a trend and projects future values. That step runs in Python or R.

What SQL handles well is the preparation layer: the step between raw readings and model-ready input. Moving that step into the database means it runs against live data, updates continuously, and does not require a scheduled export job to stay current. The data engineer owns this step. The data scientist owns what comes next.

For the model training side of the pipeline — how to structure SQL queries as feature engineering within a machine learning workflow — see the time series context guide and the SQL feature engineering for ML post.

Run These Queries on Live Data

DATE_BIN and window functions are available in all CrateDB editions, including CrateDB Cloud, which starts free. The time series data model documentation covers partitioning and sharding strategies for high-cardinality sensor datasets.

Run queries on live data ->

Frequently Asked Questions

SQL handles the data preparation layer: bucketing raw readings into equal-width intervals, computing rolling averages and standard deviations, and flagging statistical anomalies. In CrateDB, DATE_BIN groups timestamps into fixed intervals and window functions with ROWS BETWEEN compute rolling aggregates per device. The forecasting model itself runs in Python or R and consumes the SQL-produced features.

DATE_BIN groups timestamps into equal-width time intervals. It takes three arguments: the stride (interval width such as 5 minutes or 1 hour), the source timestamp column, and an origin timestamp for bucket alignment. DATE_BIN('5 minutes', ts, '2024-01-01')assigns every timestamp to its nearest 5-minute boundary. Use it in a GROUP BY to produce one row per device per time interval.

Use a window function with a ROWS BETWEEN frame clause. AVG(column) OVER (PARTITION BY device_id ORDER BY ts ROWS BETWEEN 9 PRECEDING AND CURRENT ROW) computes a 10-row rolling average per device in time order. PARTITION BY ensures the window resets at each device boundary. Replace AVG with STDDEV for rolling volatility over the same window.

CrateDB supports LAG, LEAD, AVG, SUM, MIN, MAX, COUNT, STDDEV, ROW_NUMBER, RANK, and DENSE_RANKover window frames.

Window frames are defined with ROWS BETWEEN (row-count-based) or RANGE BETWEEN (value-based).

PARTITION BY groups rows by device_id or entity.

These functions run directly against live data with no pre-aggregation step required.

Anomaly detection in SQL compares a current reading to a statistical baseline and flags deviations beyond a threshold. It is purely statistical, and runs entirely in SQL.

Time series forecasting projects future values based on learned patterns and requires a trained model such as ARIMA, Prophet, or gradient boosting running in Python or R.

SQL covers data preparation for both.

Time Series Forecasting with SQL: DATE_BIN, Window Functions, and Rolling Aggregations

Pattern 1: Time Bucketing with DATE_BIN

Pattern 2: Rolling Aggregates with Window Functions

Pattern 3: Deviation Scoring for Anomaly Flagging

Composing a Feature Table

Run These Queries on Live Data

Frequently Asked Questions

Product

Developers

Company

Community

Time Series Forecasting with SQL: DATE_BIN, Window Functions, and Rolling Aggregations

Pattern 1: Time Bucketing with DATE_BIN

Pattern 2: Rolling Aggregates with Window Functions

Pattern 3: Deviation Scoring for Anomaly Flagging

Composing a Feature Table

Run These Queries on Live Data

Frequently Asked Questions

Related Posts

Why Industrial Teams Are Moving from Flux and InfluxQL to Standard SQL

How to Add a New Sensor Type to Your Industrial Database Without Pipeline Downtime

How to Ingest OPC-UA and MQTT Data into SQL with Telegraf and CrateDB