
Time Series Anomaly Detection

Anomaly detection in time series data identifies patterns or outliers that deviate significantly from the norm over time. It uses statistical and machine learning algorithms to analyze data over specific time intervals, examining patterns, trends, and seasonality to pinpoint anomalies. A detected deviation may indicate an error or another significant event that requires attention.

This technique finds wide-ranging applications across various fields and industries:

  • In cybersecurity, it helps identify unusual network activity patterns, potentially indicating a breach;
  • In the finance sector, anomaly detection is pivotal for identifying fraudulent activities in credit card transactions;
  • In the realm of IoT, it is employed for detecting malfunctioning sensors and machines;
  • In healthcare, it is useful for monitoring patient vital signs for unusual readings;
  • In predictive maintenance, it aids the early identification of abnormal machine behavior to prevent system failures.
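
As a minimal illustration of what "deviating significantly from the norm" can mean, the sketch below flags readings whose z-score exceeds a threshold. This is a deliberately simple statistical baseline, not the model used later in this tutorial; the function name `zscore_outliers`, the threshold of 3, and the sample readings are assumptions made for the example.

```python
from statistics import mean, stdev


def zscore_outliers(values, threshold=3.0):
    """Return the indices of values lying more than `threshold`
    sample standard deviations away from the mean."""
    mu = mean(values)
    sigma = stdev(values)
    if sigma == 0:
        return []  # all values identical: nothing deviates
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]


# Hypothetical temperature readings: a steady signal with one spike at index 17
readings = [70.0 + 0.1 * (i % 5) for i in range(30)]
readings[17] = 95.0

outliers = zscore_outliers(readings)
print(outliers)
```

A global z-score ignores trend and seasonality, which is one reason dedicated anomaly detection models are usually preferred for real sensor data.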

Practical Application of Anomaly Detection with PyCaret and CrateDB

In the following Jupyter Notebook, we illustrate anomaly detection with CrateDB and PyCaret, using temperature readings from the Numenta Anomaly Benchmark dataset. The objective of this exercise is to detect anomalies that could indicate equipment malfunctions, using a practical example that simulates real-world machine measurements.

 

Step 1. Connecting to CrateDB

Repeat Step 1 from the previous chapters to set up the CrateDB connection.

Step 2. Loading Time Series Data with SELECT DATE_BIN

We import the data into CrateDB and aggregate it into evenly spaced time intervals for anomaly detection. The DATE_BIN function groups the readings into 5-minute bins, counted from origin 0 (the Unix epoch), and AVG computes the average value within each bin. The bin timestamps, returned as epoch milliseconds, are then converted into datetime objects.

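import pandas as pd
import sqlalchemy as sa

# Bin the readings into 5-minute intervals and average the values in each bin
query = "SELECT DATE_BIN('5 min'::INTERVAL, timestamp, 0) AS timestamp, AVG(value) AS avg_value FROM machine_data GROUP BY timestamp ORDER BY timestamp ASC;"

# `engine` is the SQLAlchemy engine created in Step 1
with engine.connect() as conn:
    df = pd.read_sql(sql=sa.text(query), con=conn)

# The bin starts arrive as epoch milliseconds; interpret them as UTC-based datetimes
df['timestamp'] = pd.to_datetime(df['timestamp'], unit='ms')

df = df.set_index('timestamp')

df.head()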
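
To make the binning step concrete, here is a small pure-Python sketch of the idea behind DATE_BIN with origin 0 followed by AVG: each timestamp is floored to the start of its 5-minute interval, and the values in each interval are averaged. The helpers `date_bin` and `bin_and_average` and the sample rows are hypothetical and only mirror the concept; they are not CrateDB's implementation.

```python
from collections import defaultdict

FIVE_MIN_MS = 5 * 60 * 1000  # 5 minutes in milliseconds


def date_bin(interval_ms, ts_ms, origin_ms=0):
    """Floor ts_ms to the start of its interval, counted from origin_ms."""
    return ts_ms - (ts_ms - origin_ms) % interval_ms


def bin_and_average(rows, interval_ms=FIVE_MIN_MS):
    """rows: iterable of (timestamp_ms, value) pairs.
    Returns a list of (bin_start_ms, average_value), sorted by bin start."""
    buckets = defaultdict(list)
    for ts_ms, value in rows:
        buckets[date_bin(interval_ms, ts_ms)].append(value)
    return [(start, sum(vals) / len(vals)) for start, vals in sorted(buckets.items())]


# Hypothetical readings: two fall into the first bin, two into the second
rows = [(0, 10.0), (120_000, 20.0), (300_000, 30.0), (599_999, 50.0)]
binned = bin_and_average(rows)
print(binned)  # [(0, 15.0), (300000, 40.0)]
```

Doing this aggregation in the database keeps the amount of data transferred to the notebook small and yields the evenly spaced series the later steps work with.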

Step 3. Plotting Time Series Data

To understand our time series data and identify any apparent anomalies, we plot the temperature readings over time. Notable anomalies, such as periods of planned shutdowns and instances of catastrophic machine failure, are recorded and shown as blue-shaded areas in the plot.

import matplotlib.pyplot as plt
import matplotlib.dates as mdates

# Known anomalies in the data
anomalies = [
    ["2013-12-15 17:50:00.000000", "2013-12-17 17:00:00.000000"],
    ["2014-01-27 14:20:00.000000", "2014-01-29 13:30:00.000000"],
    ["2014-02-07 14:55:00.000000", "2014-02-09 14:05:00.000000"]
]

plt.figure(figsize=(12,7))
line = plt.plot(df.index, df['avg_value'], linestyle='solid', color='black', label='Temperature')

# Highlight the known anomaly windows
for ctr, timeframe in enumerate(anomalies, start=1):
    plt.axvspan(pd.to_datetime(timeframe[0]), pd.to_datetime(timeframe[1]),
                color='blue', alpha=0.3, label=f'Anomaly {ctr}')

# Format the x-axis: one tick per week, shown as YYYY/MM/DD
plt.gca().xaxis.set_major_formatter(mdates.DateFormatter('%Y/%m/%d'))
plt.gca().xaxis.set_major_locator(mdates.DayLocator(interval=7))

# Rotate and align the x labels for a better view
plt.gcf().autofmt_xdate()

# Title and y-axis label
plt.title('Temperature Over Time', fontsize=20, fontweight='bold', pad=30)
plt.ylabel('Temperature')

# Add legend to the right
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')

plt.tight_layout()
plt.show()
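
Beyond shading them in a plot, the known anomaly windows can also be turned into a ground-truth flag per reading, which is handy for checking a model's output later. The sketch below is one way to do that with pandas; the helper `label_known_anomalies` and the short demo index are assumptions for illustration, while the window boundaries are the ones listed above.

```python
import pandas as pd

# Known anomaly windows, as (start, end) pairs
known_windows = [
    ("2013-12-15 17:50:00", "2013-12-17 17:00:00"),
    ("2014-01-27 14:20:00", "2014-01-29 13:30:00"),
    ("2014-02-07 14:55:00", "2014-02-09 14:05:00"),
]


def label_known_anomalies(index, windows):
    """Return a boolean Series that is True where a timestamp in `index`
    falls inside any of the (start, end) windows, bounds included."""
    mask = pd.Series(False, index=index)
    for start, end in windows:
        mask |= (index >= pd.Timestamp(start)) & (index <= pd.Timestamp(end))
    return mask


# Hypothetical index: four readings around the start of the first window
demo_index = pd.date_range("2013-12-15 17:00", periods=4, freq="30min")
mask = label_known_anomalies(demo_index, known_windows)
print(mask)
```

Applied to the real data, the same call would take `df.index` in place of `demo_index`.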

Step 4. Defining the Anomaly Detection Model

Next, we set up the environment for the machine learning workflow in PyCaret. The setup() function creates the transformation pipeline, and the session_id fixes the random seed so that results are reproducible. The models() function lists the available algorithms; from these, we choose the Minimum Covariance Determinant (MCD) model for its effectiveness in identifying outliers.

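from pycaret.anomaly import setup, models, create_model, assign_model

s = setup(df, session_id=123)  # session_id fixes the random seed
models()  # lists the available anomaly detection algorithms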

Step 5. Running the Unsupervised Anomaly Detection Model

The create_model() function instantiates and trains an MCD model, with the fraction parameter set to the expected proportion of outliers (here 2.5%). After training, the model is applied to label the anomalies: the assign_model() function adds the Anomaly label and Anomaly_Score columns to our DataFrame.

mcd = create_model('mcd', fraction=0.025)  # expect about 2.5% of readings to be outliers
mcd_results = assign_model(mcd)            # adds the Anomaly and Anomaly_Score columns
mcd_results[mcd_results['Anomaly'] == 1].head()
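
To build intuition for the fraction parameter, the following sketch labels roughly the top fraction of anomaly scores as outliers. It illustrates the general idea of turning scores into labels through an expected outlier proportion; it is not PyCaret's or the MCD model's actual implementation, and the helper `label_top_fraction` and the sample scores are hypothetical.

```python
def label_top_fraction(scores, fraction):
    """Label the highest-scoring `fraction` of entries as anomalies (1)
    and the rest as normal (0). Ties at the cutoff are all flagged."""
    k = max(1, round(len(scores) * fraction))   # number of readings to flag
    cutoff = sorted(scores, reverse=True)[k - 1]
    return [1 if s >= cutoff else 0 for s in scores]


# Hypothetical anomaly scores: 39 low scores and one clearly high score
scores = [i * 0.01 for i in range(39)] + [5.0]

labels_strict = label_top_fraction(scores, fraction=0.025)  # flags 1 of 40
labels_loose = label_top_fraction(scores, fraction=0.1)     # flags 4 of 40
print(sum(labels_strict), sum(labels_loose))
```

A larger fraction flags more readings, which tends to catch more true anomalies at the cost of more false positives.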

Step 6. Plotting the Results

Lastly, we use the Plotly library to plot all readings and highlight the anomalies. The anomalies flagged by the model are shown as red markers.

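import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio

# Render figures as static PNG images
pio.renderers.default = 'png'

# Plot the value on the y-axis and the date on the x-axis
fig = px.line(mcd_results, x=mcd_results.index, y="avg_value", title='MACHINE DATA - UNSUPERVISED ANOMALY DETECTION', template='plotly_dark')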

# Timestamps of the readings flagged as anomalies
outlier_dates = mcd_results[mcd_results['Anomaly'] == 1].index

# Corresponding y values to plot
y_values = mcd_results.loc[outlier_dates, 'avg_value']

fig.add_trace(go.Scatter(x=outlier_dates, y=y_values, mode = 'markers',
                name = 'Anomaly',
                marker=dict(color='red',size=10)))

fig.show()


The model flagged several anomalies, along with some false positives. Its performance could be improved with further tuning, for example of the fraction parameter, or by using a different model.
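
One way to put numbers on "some false positives" is to compare the model's labels with a ground-truth flag derived from the known anomaly windows and compute precision and recall. The sketch below assumes two equally long sequences of 0/1 flags; the function `precision_recall` and the sample flags are hypothetical and do not come from the actual model run.

```python
def precision_recall(predicted, actual):
    """Compute precision and recall from two equally long sequences
    of 0/1 (or False/True) flags."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)      # true positives
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)  # false positives
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)  # missed anomalies
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical flags for eight readings
predicted = [0, 1, 1, 0, 1, 0, 0, 1]  # e.g. the model's Anomaly column
actual = [0, 1, 1, 1, 0, 0, 0, 0]     # e.g. inside a known anomaly window

precision, recall = precision_recall(predicted, actual)
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

Tracking both values while tuning shows whether a change reduces false positives (higher precision) without missing more of the known anomalies (lower recall).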
