Every time series feature pipeline has the same shape: readings land in the database, a job exports a CSV or DataFrame to Python, pandas computes rolling means and standard deviations, and the result is written back as a training dataset. That loop adds latency, adds a failure point, and runs on a schedule rather than on live data. The aggregations (bucketing, rolling averages, deviation scoring) are SQL operations. They belong in the database.
CrateDB's DATE_BIN, window functions, and COUNT(*) FILTER (WHERE ...) cover the majority of what most data engineers write in pandas for time series feature engineering. Here are three patterns and how to compose them into a feature table.
Pattern 1: Time Bucketing with DATE_BIN
DATE_BIN groups timestamps into equal-width intervals without writing bucketing logic in application code. It takes three arguments: the stride (interval width), the source timestamp column, and an origin (the epoch to align bucket boundaries to).
-- Group sensor readings into 5-minute intervals per device SELECT DATE_BIN('5 minutes'::INTERVAL, ts, '2024-01-01') AS bucket, device_id, AVG(temperature) AS avg_temp, MAX(temperature) AS max_temp, MIN(temperature) AS min_temp, COUNT(*) AS reading_count FROM sensor_readings WHERE ts > NOW() - INTERVAL '24 hours' GROUP BY bucket, device_id ORDER BY device_id, bucket;
Every row in the result represents one device for one 5-minute window, which is the direct input format for most time series models, with no reshaping required.
COUNT(*) FILTER (WHERE ...) adds conditional counts within each bucket without a subquery:
-- Count out-of-range readings per bucket alongside totals SELECT DATE_BIN('5 minutes'::INTERVAL, ts, '2024-01-01') AS bucket, device_id, COUNT(*) AS total_readings, COUNT(*) FILTER (WHERE temperature > 80) AS high_temp_count, COUNT(*) FILTER (WHERE temperature < 10) AS low_temp_count FROM sensor_readings WHERE ts > NOW() - INTERVAL '24 hours' GROUP BY bucket, device_id ORDER BY device_id, bucket;
A reading outside the normal range within a 5-minute window is itself a feature; the count of out-of-range readings per bucket tells a model something a plain average does not.
Pattern 2: Rolling Aggregates with Window Functions
Window functions compute a value for each row based on a sliding window of surrounding rows without collapsing the result into groups. For time series, you want one row per reading with the rolling context attached, not one row per window.
The rate of change between consecutive readings — the first derivative of the signal — is the simplest window computation:
-- Delta between consecutive readings per device SELECT device_id, ts, temperature, LAG(temperature) OVER ( PARTITION BY device_id ORDER BY ts ) AS prev_temp, temperature - LAG(temperature) OVER ( PARTITION BY device_id ORDER BY ts ) AS delta FROM sensor_readings WHERE ts > NOW() - INTERVAL '1 hour' ORDER BY device_id, ts;
LAG returns the previous row's value within the partition. LEAD returns the next row, it's useful for labeling: if you know what follows a reading pattern, you can label the current row as a precursor to that event.
A rolling average smooths noise and reveals trends:
-- 10-reading rolling average per device, ordered by time SELECT device_id, ts, temperature, AVG(temperature) OVER ( PARTITION BY device_id ORDER BY ts ROWS BETWEEN 9 PRECEDING AND CURRENT ROW ) AS rolling_avg_10 FROM sensor_readings WHERE ts > NOW() - INTERVAL '6 hours' ORDER BY device_id, ts;
ROWS BETWEEN 9 PRECEDING AND CURRENT ROW defines a 10-row window: the current reading plus the nine that preceded it, in time order within each device partition. Replace AVG with STDDEV to get rolling volatility. This is a separate feature that captures how variable a device's output has been recently, independent of its absolute value.
Pattern 3: Deviation Scoring for Anomaly Flagging
A z-score measures how far a reading sits from the historical mean in units of standard deviation. A CTE computes the baseline statistics; a join applies them to current readings.
-- Flag readings more than 2 standard deviations from the 24-hour mean WITH device_baseline AS ( SELECT device_id, AVG(temperature) AS mean_temp, STDDEV(temperature) AS stddev_temp FROM sensor_readings WHERE ts > NOW() - INTERVAL '24 hours' GROUP BY device_id ) SELECT r.device_id, r.ts, r.temperature, b.mean_temp, b.stddev_temp, ABS(r.temperature - b.mean_temp) / NULLIF(b.stddev_temp, 0) AS z_score FROM sensor_readings r JOIN device_baseline b ON r.device_id = b.device_id WHERE r.ts > NOW() - INTERVAL '1 hour' AND ABS(r.temperature - b.mean_temp) / NULLIF(b.stddev_temp, 0) > 2 ORDER BY z_score DESC;
NULLIF(b.stddev_temp, 0) avoids a division-by-zero error for devices whose readings have not varied. Any device with zero standard deviation over 24 hours either has a stuck sensor or is running in a completely stable state — worth flagging separately from the anomaly detection path.
This query runs against live data. The baseline window (NOW() - INTERVAL '24 hours') updates every time the query executes. No scheduled job, no stale threshold file.
Composing a Feature Table
The three patterns compose directly into a CTE chain. The result is a feature table: bucketed aggregates, rolling context, and a deviation score, with one row per device per 5-minute bucket.
-- Feature table: bucketed readings with rolling averages and anomaly scores WITH bucketed AS ( SELECT DATE_BIN('5 minutes'::INTERVAL, ts, '2024-01-01') AS bucket, device_id, AVG(temperature) AS avg_temp, MAX(temperature) AS max_temp, COUNT(*) AS reading_count FROM sensor_readings WHERE ts > NOW() - INTERVAL '7 days' GROUP BY bucket, device_id ), baseline AS ( SELECT device_id, AVG(temperature) AS baseline_mean, STDDEV(temperature) AS baseline_stddev FROM sensor_readings WHERE ts BETWEEN NOW() - INTERVAL '30 days' AND NOW() - INTERVAL '7 days' GROUP BY device_id ) SELECT b.bucket, b.device_id, b.avg_temp, b.max_temp, b.reading_count, AVG(b.avg_temp) OVER ( PARTITION BY b.device_id ORDER BY b.bucket ROWS BETWEEN 11 PRECEDING AND CURRENT ROW ) AS rolling_avg_12_buckets, ABS(b.avg_temp - d.baseline_mean) / NULLIF(d.baseline_stddev, 0) AS z_score FROM bucketed b JOIN baseline d ON b.device_id = d.device_id ORDER BY b.device_id, b.bucket;
The baseline window uses the 30-to-7-day range rather than the full 30 days. That separation ensures recent anomalies do not inflate the baseline mean and distort the z-score for the current window — a deliberate modeling choice that matters at scale when sensor behavior shifts over time.
The output of this query is ready to pass to a model training run, a gradient boosting classifier, or a threshold alert. The data-layer work is done.
What This Is and What It Is Not
These SQL patterns compute statistical features. They do not produce a forecast. A forecast requires a trained model (ARIMA, Prophet, gradient boosting) that learns a trend and projects future values. That step runs in Python or R.
What SQL handles well is the preparation layer: the step between raw readings and model-ready input. Moving that step into the database means it runs against live data, updates continuously, and does not require a scheduled export job to stay current. The data engineer owns this step. The data scientist owns what comes next.
For the model training side of the pipeline — how to structure SQL queries as feature engineering within a machine learning workflow — see the time series context guide and the SQL feature engineering for ML post.
Run These Queries on Live Data
DATE_BIN and window functions are available in all CrateDB editions, including CrateDB Cloud, which starts free. The time series data model documentation covers partitioning and sharding strategies for high-cardinality sensor datasets.