Anomaly detection in time series data is a technique for identifying patterns or outliers that deviate significantly from the norm within a given dataset over time. It employs statistical and machine learning algorithms to analyze large datasets over specific time intervals, examining patterns, trends, and seasonality to pinpoint anomalies. Detected deviations can signify errors, faults, or other significant events requiring attention.
This technique finds wide-ranging applications across various fields and industries:
- In cybersecurity, it helps identify unusual network activity patterns, potentially indicating a breach;
- In the finance sector, anomaly detection is pivotal for identifying fraudulent activities in credit card transactions;
- In the realm of IoT, it is employed for detecting malfunctioning sensors and machines;
- In healthcare, anomaly detection is useful for monitoring unusual patient vital signs;
- In predictive maintenance, it aids in the early identification of abnormal machine behavior to prevent system failures.
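Before turning to the tooling, the core idea can be illustrated in a few lines of plain Python: flag any point that strays too far from a rolling mean. This is a minimal, self-contained sketch on synthetic data; the function name, window size, and threshold are illustrative choices of ours, not part of the tutorial's pipeline.

```python
import numpy as np
import pandas as pd

def rolling_zscore_anomalies(series, window=10, threshold=2.5):
    """Flag points whose distance from the rolling mean exceeds
    `threshold` rolling standard deviations."""
    mean = series.rolling(window, min_periods=2).mean()
    std = series.rolling(window, min_periods=2).std()
    z = (series - mean) / std
    return z.abs() > threshold

# A smooth signal with one injected spike at position 50
values = pd.Series(np.sin(np.linspace(0, 6, 100)))
values.iloc[50] += 10.0

mask = rolling_zscore_anomalies(values)
flagged = values[mask].index.tolist()
print(flagged)  # the injected spike stands out
```

Real detectors such as the one used below are more robust than this sketch, but the principle is the same: model "normal" behavior and score each point's distance from it.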
Practical Application of Anomaly Detection with PyCaret and CrateDB
Step 1. Setting Up the CrateDB Connection
Repeat Step 1 from the previous chapters to set up the CrateDB connection.
Step 2. Loading Time Series Data with SELECT DATE_BIN
We import the data into CrateDB and aggregate it for anomaly detection, focusing on evenly spaced time intervals. The DATE_BIN function in CrateDB groups the rows into 5-minute intervals, and the average value within each interval is calculated. The timestamps, stored as milliseconds since the epoch, are then converted into Python datetime objects.
query = """
SELECT DATE_BIN('5 min'::INTERVAL, timestamp, 0) AS timestamp,
       AVG(value) AS avg_value
FROM machine_data
GROUP BY timestamp
ORDER BY timestamp ASC;
"""

with engine.connect() as conn:
    df = pd.read_sql(sql=sa.text(query), con=conn)

# Convert epoch milliseconds to Python datetime objects
df['timestamp'] = df['timestamp'].transform(lambda x: datetime.fromtimestamp(x / 1000))
df = df.set_index('timestamp')
df.head()
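For reference, the same 5-minute binning can be reproduced client-side with pandas' resample. The snippet below is an illustrative stand-in on synthetic data, not part of the CrateDB pipeline.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for raw machine_data readings at 1-minute resolution
rng = pd.date_range("2013-12-15 00:00", periods=20, freq="1min")
raw = pd.DataFrame({"value": np.arange(20.0)}, index=rng)

# Equivalent of DATE_BIN('5 min', ...) + AVG(value):
# left-aligned 5-minute buckets, averaged per bucket
binned = raw["value"].resample("5min").mean()
print(binned)
```

Doing the aggregation in the database, as the tutorial does, avoids transferring every raw reading to the client.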
Step 3. Plotting Time Series Data
To understand our time series data and identify any apparent anomalies, we plot the temperature readings over time. Notable anomalies, such as periods of planned shutdowns and instances of catastrophic machine failure, are recorded and shown as blue-shaded areas in the plot.
# Known anomalies in the data
anomalies = [
    ["2013-12-15 17:50:00.000000", "2013-12-17 17:00:00.000000"],
    ["2014-01-27 14:20:00.000000", "2014-01-29 13:30:00.000000"],
    ["2014-02-07 14:55:00.000000", "2014-02-09 14:05:00.000000"]
]

plt.figure(figsize=(12, 7))
plt.plot(df.index, df['avg_value'], linestyle='solid', color='black', label='Temperature')

# Highlight known anomalies as shaded areas
for ctr, timeframe in enumerate(anomalies, start=1):
    plt.axvspan(pd.to_datetime(timeframe[0]),
                pd.to_datetime(timeframe[1]),
                color='blue', alpha=0.3,
                label=f'Anomaly {ctr}')
# Formatting x-axis for better readability
plt.gca().xaxis.set_major_formatter(mdates.DateFormatter('%Y/%m/%d'))
plt.gca().xaxis.set_major_locator(mdates.DayLocator(interval=7))
plt.gcf().autofmt_xdate()  # rotate & align the x labels for better readability
plt.title('Temperature Over Time', fontsize=20, fontweight='bold', pad=30)
plt.ylabel('Temperature')
# Add legend to the right
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()
Step 4. Defining the Anomaly Detection Model
Next, we set up the environment for the machine learning workflow in PyCaret. We create a transformation pipeline with the setup() function and specify a session_id so that results can be reproduced. From the available algorithms, we choose the Minimum Covariance Determinant (MCD) model for its effectiveness in identifying outliers.
from pycaret.anomaly import setup, models, create_model, assign_model

s = setup(df, session_id=123)
models()  # list the available anomaly detection algorithms
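For intuition about what the MCD detector does: it fits a robust estimate of the data's covariance and flags points with a large Mahalanobis distance from the bulk of the data. The sketch below illustrates the same idea with scikit-learn's EllipticEnvelope, which is built on the MCD estimator; the synthetic data and parameter values are ours, and its contamination argument plays a role analogous to PyCaret's fraction parameter.

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope

rng = np.random.default_rng(123)

# Tight inlier cloud plus a few far-away outliers
inliers = rng.normal(loc=20.0, scale=1.0, size=(200, 1))
outliers = np.array([[80.0], [90.0], [100.0]])
X = np.vstack([inliers, outliers])

# Fit a robust (MCD-based) covariance model; ~2.5% of points expected anomalous
detector = EllipticEnvelope(contamination=0.025, random_state=123)
labels = detector.fit_predict(X)  # -1 marks outliers, +1 inliers

flagged_idx = np.where(labels == -1)[0]
print(flagged_idx)  # the three injected outliers are among the flagged points
```

Because the covariance estimate is robust, the gross outliers barely influence the fitted "normal" region, which is what makes MCD effective for contaminated data.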
Step 5. Running the Unsupervised Anomaly Detection Model
The create_model() function is used to instantiate and train an MCD model, with the fraction parameter adjusted to specify the expected proportion of outliers. After training, the model is applied to label the anomalies. The assign_model() function enriches our DataFrame with anomaly labels and scores.
# Train an MCD model, expecting ~2.5% of the points to be outliers
mcd = create_model('mcd', fraction=0.025)

# Enrich the DataFrame: 'Anomaly' = 1 marks an outlier, 'Anomaly_Score' gives its score
mcd_results = assign_model(mcd)
mcd_results[mcd_results['Anomaly'] == 1].head()
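One way to sanity-check the flagged rows is to test whether their timestamps fall inside the known anomaly windows from Step 3. Below is a sketch of such a check on stand-in timestamps; the helper name in_known_window is ours, not part of PyCaret.

```python
import pandas as pd

# Known anomaly windows (same format as in Step 3)
known = [
    ("2013-12-15 17:50:00", "2013-12-17 17:00:00"),
    ("2014-01-27 14:20:00", "2014-01-29 13:30:00"),
]

def in_known_window(ts, windows):
    """True if timestamp ts falls inside any known anomaly window."""
    ts = pd.Timestamp(ts)
    return any(pd.Timestamp(start) <= ts <= pd.Timestamp(end)
               for start, end in windows)

# Stand-in for mcd_results[mcd_results['Anomaly'] == 1].index
flagged = pd.to_datetime(["2013-12-16 03:00:00",   # inside a window -> true positive
                          "2014-01-01 00:00:00"])  # outside         -> false positive
hits = [in_known_window(ts, known) for ts in flagged]
print(hits)
```

Counting hits versus misses this way gives a rough precision estimate when ground-truth windows are available, as they are in this dataset.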
Step 6. Plotting the Results
Lastly, we use the Plotly library to plot all readings and highlight the anomalies. The anomalies flagged by the model are shown as red markers.
# plot value on y-axis and date on x-axis
pio.renderers.default = 'png'
fig = px.line(mcd_results, x=mcd_results.index, y="avg_value",
              title='MACHINE DATA - UNSUPERVISED ANOMALY DETECTION',
              template='plotly_dark')

# create list of outlier dates
outlier_dates = mcd_results[mcd_results['Anomaly'] == 1].index

# obtain y values of the anomalies to plot
y_values = [mcd_results.loc[i]['avg_value'] for i in outlier_dates]

fig.add_trace(go.Scatter(x=outlier_dates, y=y_values,
                         mode='markers', name='Anomaly',
                         marker=dict(color='red', size=10)))
fig.show()
The model effectively identified several anomalies, despite some false positives. Its performance could be improved with further tuning or by using a different model.
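One concrete tuning knob is the expected outlier share (the fraction parameter in PyCaret). The sketch below sweeps the analogous contamination parameter of scikit-learn's MCD-based EllipticEnvelope on synthetic data to show how the number of flagged points scales with that choice; it is an illustration of the trade-off, not part of the tutorial's pipeline.

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope

rng = np.random.default_rng(0)
# 400 normal readings plus two gross outliers
X = np.vstack([rng.normal(20.0, 1.0, size=(400, 1)),
               np.array([[60.0], [70.0]])])

# Sweep the expected outlier share; larger values flag more points
counts = {}
for frac in (0.005, 0.025, 0.1):
    labels = EllipticEnvelope(contamination=frac, random_state=0).fit_predict(X)
    counts[frac] = int((labels == -1).sum())
print(counts)
```

Setting the share too low misses real anomalies; setting it too high inflates false positives, so checking the flagged points against known anomaly windows (as in Step 3) is a practical way to pick a value.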