Skip to content
Blog

How to Use SQL for ML Feature Engineering on Time-Series Sensor Data

You have 500,000 sensor readings across 500 devices. You need per-device rolling averages, standard deviations, and fault rates as input features for a fault classifier. The standard approach: load the dataset into pandas, group by device, compute rolling windows. At the feature engineering step, memory usage peaks and the model trains. Then new readings arrive and you need to retrain. That means exporting the dataset again first.

This is not a Python problem. It is a feature engineering architecture problem.

Why rolling features carry more signal than raw readings

A fault classifier trained only on raw sensor values performs poorly in production. The reason is structural: a device reading metric_value = 85 means something different depending on whether it has held steady at 85 for the past 20 readings or climbed from 30 in the past six hours. The raw number cannot tell you which is happening. Rolling statistics can.

The features that consistently outrank raw readings in importance scores for industrial sensor models:

Feature Window What it captures
metric_mean_5 / 10 / 20 5, 10, 20 readings Recent average: the trend baseline
metric_std_5 / 10 / 20 5, 10, 20 readings Volatility: rising variance precedes faults in degrading devices
quality_mean_5 / 10 / 20 5, 10, 20 readings Quality score trend: a leading indicator before metric_value shifts
fault_rate_5 / 10 / 20 5, 10, 20 readings Fraction of recent readings in warning or critical status
metric_delta n/a Point-to-point rate of change: direction and speed right now

When trained on a 500-device, 500,000-reading dataset, fault_rate_5 is the strongest single predictor. If 3 of the last 5 readings for a device were warning or critical, the device is on a fault trajectory with high confidence. Rolling mean and standard deviation rank second and fourth. The raw metric_value at reading time ranks considerably lower.

The point: context over time is the signal. Point-in-time values are noise.

The pandas approach and where it breaks

Here is what building these features looks like in Python against a JSON dataset:

import pandas as pd, json

# Load all 500,000 rows — peaks at ~4 GB RAM (internal benchmark)
with open('dataset/iot_demo_dataset.json', 'r') as f:
    df = pd.DataFrame([json.loads(line) for line in f])

df = df.sort_values(['device_id', 'timestamp'])

# Compute rolling features per device
df = df.groupby('device_id', group_keys=False).apply(lambda g: g.assign(
    metric_mean_5  = g['metric_value'].rolling(5,  min_periods=1).mean(),
    metric_std_5   = g['metric_value'].rolling(5,  min_periods=1).std(),
    metric_mean_10 = g['metric_value'].rolling(10, min_periods=1).mean(),
    metric_std_10  = g['metric_value'].rolling(10, min_periods=1).std(),
    fault_rate_5   = (g['status'].isin(['warning','critical'])
                      .astype(int).rolling(5,  min_periods=1).mean()),
    fault_rate_10  = (g['status'].isin(['warning','critical'])
                      .astype(int).rolling(10, min_periods=1).mean()),
    metric_delta   = g['metric_value'].diff(),
))

This works for a one-time run on a static file. At production scale, the approach breaks in three specific ways: memory, re-export before retraining, and feature freshness.

  pandas rolling approach CrateDB SQL approach
Memory required Loads the full dataset into memory; memory usage scales with dataset size Returns pre-aggregated feature table
Re-export before retraining Every scheduled retraining run must first re-export the dataset; if that step is missed or stale, the model trains on old data silently No: query runs against live data
Feature freshness at training time Features are accurate only as of the last export, not as of training time Current as of query time
Lines of Python transformation code ~40–60 (groupby, rolling, shift, fillna, merge) ~5 (connect, read_sql, done)

The re-export requirement is the highest-risk failure point. A scheduled retraining job that forgets to re-export first trains the model on data that is days or weeks stale. The model's performance degrades silently without surfacing a pipeline error.

Pre-aggregating in CrateDB SQL

CrateDB's DATE_BIN function bins timestamps into fixed-width time windows. Combined with STDDEV, AVG, and the SQL FILTER clause for conditional aggregation, it pre-computes the same rolling statistics in the database, before any data crosses the network.

-- Feature engineering: per-device hourly windows with fault rate
-- DATE_BIN groups each reading into its 1-hour window
-- COUNT(*) FILTER computes fault_rate without a separate join or Python transform
-- Returns one feature row per device per hour: compact, ready for training

SELECT
    device_id,
    device_type,
    plant_id,
    DATE_BIN('1 hour'::INTERVAL, "timestamp", TIMESTAMP '2025-09-01') AS window_start,
    AVG(metric_value) AS metric_mean,
    STDDEV(metric_value) AS metric_std,
    AVG(quality_score) AS quality_mean,
    COUNT(*) FILTER (WHERE status IN ('warning', 'critical'))
        * 1.0 / NULLIF(COUNT(*), 0) AS fault_rate,
    MAX(CASE WHEN status IN ('warning', 'critical') THEN 1 ELSE 0 END) AS had_fault
FROM iot_data
WHERE "timestamp" >= TIMESTAMP '2025-09-01'
GROUP BY device_id, device_type, plant_id, window_start
ORDER BY device_id, window_start;

Three SQL constructs do the work that the pandas code does in 40+ lines:

DATE_BIN('1 hour'::INTERVAL, "timestamp", origin): bins each reading into its one-hour window. The origin anchor aligns bins consistently across devices and across retraining runs. Change '1 hour' to '15 minutes' or '1 day' to adjust window granularity without changing any other part of the query.

STDDEV(metric_value): standard deviation per window per device. Volatile devices show rising STDDEV in the windows before a fault. This maps directly to metric_std_5 and metric_std_10 in the pandas approach, expressed over time windows rather than rolling row counts.

COUNT(*) FILTER (WHERE status IN ('warning', 'critical')): counts only the readings in a fault state within that window, then divides by total readings to produce a fault rate. This is the window-based equivalent of fault_rate_5. The FILTER clause is standard SQL and avoids a separate CASE WHEN subquery or a Python-side computation.

The NULLIF(COUNT(*), 0) guard prevents division by zero on windows that contain only one reading.

The query returns one row per device per hour. Each row is a feature vector. No rolling transform needed in Python. had_fault is the training target: 1 if any reading in that window was in warning or critical status.

From SQL feature table to trained model

pip install crate sqlalchemy pandas xgboost scikit-learn
import pandas as pd
from sqlalchemy import create_engine
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split

# Point at your CrateDB instance
engine = create_engine("crate://localhost:4200")
# CrateDB Cloud: "crate://admin:@.cratedb.net:4200"

# Pull the feature table: compact, pre-aggregated
df = pd.read_sql(sql, engine)  # sql = the query above

# Train
X = df[['metric_mean', 'metric_std', 'quality_mean', 'fault_rate']]
y = df['had_fault']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model = XGBClassifier(scale_pos_weight=(y_train == 0).sum() / (y_train == 1).sum())
model.fit(X_train, y_train)

The DataFrame arrives compact. At 500 devices with hourly readings over 90 days, the feature table contains 500 × 24 × 90 = 1,080,000 rows, one per device-hour window. That is the same information derived from 500,000 raw readings, returned as a pre-aggregated DataFrame that fits in memory without issue.

Scheduled retraining without the export step

Scope the SQL query by time window and the pattern becomes a scheduled retraining pipeline:

-- Same feature query scoped to the last 90 days
-- Run on a cron schedule: no export required

SELECT
    device_id,
    device_type,
    plant_id,
    DATE_BIN('1 hour'::INTERVAL, "timestamp", TIMESTAMP '2025-09-01') AS window_start,
    AVG(metric_value) AS metric_mean,
    STDDEV(metric_value) AS metric_std,
    AVG(quality_score) AS quality_mean,
    COUNT(*) FILTER (WHERE status IN ('warning', 'critical'))
        * 1.0 / NULLIF(COUNT(*), 0) AS fault_rate,
    MAX(CASE WHEN status IN ('warning', 'critical') THEN 1 ELSE 0 END) AS had_fault
FROM iot_data
WHERE "timestamp" >= NOW() - INTERVAL '90' DAY
GROUP BY device_id, device_type, plant_id, window_start
ORDER BY device_id, window_start;

A weekly retraining job runs this query, trains the model on the fresh feature windows, and saves the updated model file. No stale snapshot. No export job to schedule separately. The model reflects the fleet's current fault patterns as of training time, not the fleet's behavior as of the last time someone ran an export.

CrateDB distributes the aggregation across shards before returning results. The data ingestion architecture that feeds iot_data in real time (Telegraf, Kafka, MQTT) is covered in the IoT Analytics Architecture Guide. For multi-plant deployments, the cross-plant query pattern from the Cross-Plant Visibility guide extends to training a per-plant or plant-group model using the same DATE_BIN + FILTER pattern with a WHERE plant_id = 'PLANT_007' filter.

Taking the trained model to production

Feature engineering in SQL removes the export step and the memory bottleneck from the training side. The inference side has a different problem: when a new reading arrives, the model needs the last 50 readings for that device as rolling-window context before it can score. That context lives in CrateDB, and because iot_data is CLUSTERED BY (device_id), fetching those 50 rows hits a single shard and returns in single-digit milliseconds.

The full inference architecture (FastAPI service, per-device scoring endpoint, fault_probability written back to CrateDB and JOIN-able against live sensor readings) is in Real-Time ML Inference on Live Sensor Data: Architecture with CrateDB and FastAPI.

For the anomaly detection model (Isolation Forest trained on healthy-fleet readings to score statistical deviation independent of fault labels), see Anomaly Detection on Industrial Sensor Data: Beyond Threshold Alerts.

The database architecture that supports both pipelines, including the fault_predictions table schema and the predictive maintenance SQL patterns that run continuously against live data, is in the Predictive Maintenance Database Architecture guide.

Feature engineering that lives in SQL rather than Python removes the export step, the memory bottleneck, and the freshness gap between sensor stream and training data. Run queries on live sensor data to see how DATE_BIN and conditional aggregates perform on a working dataset.

 

A raw sensor reading taken at a single point in time carries no context about what the device was doing before that moment. A device at metric_value = 85 that has been stable for 20 readings is in a different state than one that climbed from 30 in six hours. Rolling statistics (mean, standard deviation, fault rate over the last 5, 10, and 20 readings) capture that trajectory. When trained on a 500-device industrial sensor dataset, the rolling fault rate over the last 5 readings (fault_rate_5) is the strongest single predictor of an upcoming fault, outranking the raw metric value by a significant margin.

DATE_BIN is a CrateDB SQL function that groups timestamps into fixed-width time windows. For ML feature engineering, it replaces the per-device rolling window computation done in pandas: DATE_BIN('1 hour'::INTERVAL, "timestamp", origin) assigns each sensor reading to its one-hour bucket, and aggregate functions like AVG and STDDEV then compute per-device statistics within each bucket. Changing the interval from '1 hour' to '15 minutes' or '1 day' adjusts the feature granularity without any other changes to the query.

Loading a large sensor dataset into pandas for rolling window computation requires the full raw dataset to be in memory at once — memory usage scales with the number of rows. Pushing the same computation into CrateDB SQL using DATE_BIN and aggregate functions returns a pre-aggregated feature table rather than raw rows, because the database performs the aggregation before any data crosses the network. The result set is compact regardless of how large the underlying dataset is.

Store the sensor data in CrateDB and query it directly using a time-scoped SQL feature query instead of reading from a file. Scoping the query with WHERE "timestamp" >= NOW() - INTERVAL '90' DAY means a scheduled retraining job always pulls the last 90 days of feature windows from live data with no separate export step. The model trains on fresh data every time the job runs, and there is no stale snapshot to manage.

COUNT() FILTER (WHERE condition) is a standard SQL conditional aggregate that counts only the rows matching the condition within each GROUP BY window. In a sensor feature query, COUNT() FILTER (WHERE status IN ('warning', 'critical')) * 1.0 / NULLIF(COUNT(), 0) produces the fault rate for a device in a given time window: the fraction of readings that were in a fault state. It replaces a pandas isin().astype(int).rolling().mean() chain in a single SQL expression, without a separate subquery or Python-side computation.

A pandas rolling window operates on ordered rows within a DataFrame: it looks back N rows per device and computes statistics over that sliding window. A CrateDB DATE_BIN aggregate groups readings into fixed calendar windows (e.g., one-hour buckets) and computes statistics over all readings in each bucket. The pandas approach requires loading all raw readings into memory first; the CrateDB approach returns one pre-aggregated row per device per time window, which is compact enough to pass directly to a model without any further transformation in Python.

Install the crate and sqlalchemy packages, then create a SQLAlchemy engine pointing at your CrateDB instance: create_engine("crate://localhost:4200") for a local or Docker deployment, or create_engine("crate://admin:{password}@{cluster}.cratedb.net:4200") for CrateDB Cloud. Pass the feature engineering SQL query to pandas.read_sql(sql, engine) and the result is a DataFrame with one pre-aggregated feature row per device per time window, ready for train_test_split and model.fit() with no additional transformation.

CrateDB distributes the aggregation across shards before returning results to the client. For a 500-device dataset with hourly readings over 90 days, the raw data contains millions of individual sensor readings. The DATE_BIN + GROUP BY query computes averages, standard deviations, and fault rates for each device-hour window inside the database, returning roughly 1,080,000 compact rows (500 devices × 24 hours × 90 days) rather than transferring millions of raw rows to Python for in-memory processing. The aggregation work happens where the data lives, not in the application layer.

The query uses MAX(CASE WHEN status IN ('warning', 'critical') THEN 1 ELSE 0 END) AS had_fault as the training target. This produces 1 for any device-hour window that contained at least one reading in warning or critical status, and 0 for windows where all readings were healthy. The model learns to predict fault-incoming windows from the feature columns (metric_mean, metric_std, quality_mean, fault_rate) computed in the same query.