Anomaly detection in time series data involves identifying unusual patterns or outliers that deviate significantly from the norm within a dataset over time.
This video explores how anomaly detection on time series data stored in CrateDB can be carried out efficiently with PyCaret, leveraging its simple, low-code setup to automatically train and evaluate multiple models.
In this Jupyter Notebook, we explore the integration of CrateDB and PyCaret for anomaly detection in machine data. Our focus is on leveraging the strengths of both tools: CrateDB’s ability to manage large-scale datasets and PyCaret’s streamlined approach to applying machine learning techniques. We will rely on the Numenta Anomaly Benchmark dataset, specifically temperature readings from a machine room, and demonstrate how to extract data from CrateDB and apply PyCaret’s anomaly detection algorithms. The objective is to detect anomalies that could indicate equipment malfunctions, using a practical example that simulates real-world machine measurements.
As a first step, you need to install the necessary dependencies by running the provided `pip install` command in your Jupyter Notebook. If you are using a cloud environment like Google Colab, remember to specify the absolute path to the `requirements.txt` file.
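For reference, the installation cell might look like the following sketch. The exact package list is pinned in the notebook's `requirements.txt`, so the names below are assumptions:

```python
# Install the dependencies used in this walkthrough (package names are an
# assumption; the notebook's requirements.txt holds the exact pins).
%pip install 'crate[sqlalchemy]' pycaret pandas plotly matplotlib

# On Google Colab, reference the requirements file by its absolute path:
# %pip install -r /path/to/requirements.txt
```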
Once the installation is complete, we need to import the required libraries. These include pandas for data manipulation, SQLAlchemy for interacting with CrateDB, PyCaret for anomaly detection, and various plotting libraries for visualization, such as Plotly and Matplotlib.
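A typical import cell for this walkthrough could look as follows (a sketch; adjust it to the libraries you actually use):

```python
import pandas as pd
import sqlalchemy as sa
import plotly.express as px

# PyCaret's functional API for its anomaly-detection module.
from pycaret.anomaly import setup, models, create_model, assign_model
```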
Before writing data to CrateDB, you'll need to adjust the `CONNECTION_STRING` variable to match your CrateDB instance's credentials. The provided code assumes a local CrateDB setup, but it can easily be adapted for CrateDB Cloud or other configurations.
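For a local instance running with default settings, the connection code might look like this; host, port, and credentials are placeholders for your own deployment:

```python
import sqlalchemy as sa

# Local CrateDB instance with the default superuser and no password.
CONNECTION_STRING = "crate://crate@localhost:4200/"

# For CrateDB Cloud, the string typically carries credentials and TLS, e.g.
# "crate://user:password@<cluster-host>:4200/?ssl=true".
engine = sa.create_engine(CONNECTION_STRING)
```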
We then import the data into CrateDB. Once the dataset is successfully loaded, the next step is to access and aggregate the data in a form suitable for anomaly detection, processing it into evenly spaced time intervals. We use CrateDB's `DATE_BIN` function to group the data into 5-minute buckets and calculate the average value within each interval. Finally, the timestamps are converted into Python `datetime` objects.
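The aggregation step could be sketched as follows, reusing the engine created above; the table and column names (`machine_data`, `timestamp`, `value`) are assumptions standing in for the actual schema:

```python
import pandas as pd

# Resample the raw readings into 5-minute buckets and average each bucket.
query = """
    SELECT DATE_BIN('5 minutes'::INTERVAL, "timestamp", 0) AS ts,
           AVG("value") AS value
    FROM machine_data
    GROUP BY ts
    ORDER BY ts;
"""
df = pd.read_sql(query, engine)

# CrateDB returns timestamps as epoch milliseconds; convert to datetimes.
df["ts"] = pd.to_datetime(df["ts"], unit="ms")
```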
To better understand the characteristics of our time series dataset and identify any apparent anomalies, we plot the temperature readings over time. We also recorded the periods in which anomalies were observed on the machines. These anomalies, represented as blue-shaded areas in our plot, include periods of planned shutdowns and instances of catastrophic machine failure; we will use them later to evaluate the model. You can also see some unusual spikes in the data, which make true anomalies hard to distinguish from ordinary measurements.
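A minimal sketch of this exploratory plot with Plotly; `known_anomalies` is a hypothetical list of (start, end) windows standing in for the benchmark's actual label files:

```python
import plotly.express as px

# Known anomaly windows (illustrative placeholders; substitute the timestamps
# from the Numenta Anomaly Benchmark labels).
known_anomalies = [
    ("2013-12-15 17:50:00", "2013-12-17 17:00:00"),  # planned shutdown
    ("2014-02-07 14:55:00", "2014-02-09 14:05:00"),  # machine failure
]

fig = px.line(df, x="ts", y="value", title="Machine temperature readings")
for start, end in known_anomalies:
    # Shade each known anomaly period in blue.
    fig.add_vrect(x0=start, x1=end, fillcolor="blue", opacity=0.2, line_width=0)
fig.show()
```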
In the next step, we initialize the environment for the machine learning workflow. In PyCaret, this is done by creating the transformation pipeline with the `setup()` function and specifying a `session_id` to make results reproducible and traceable. Inspecting the available models via the `models()` function reveals a variety of algorithms suited for anomaly detection. For our purposes, we select the Minimum Covariance Determinant (MCD) model, known for its effectiveness in identifying outliers.
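In code, the initialization might look like this; the `session_id` value is arbitrary and only fixes the random seed for reproducibility:

```python
from pycaret.anomaly import setup, models

# Use the timestamp as index so that only the measurement value is modeled.
setup(data=df.set_index("ts"), session_id=123)

# List the available anomaly-detection algorithms; 'mcd' is among them.
models()
```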
We use the `create_model()` function to instantiate and train an MCD model, adjusting the `fraction` parameter to specify the proportion of outliers we expect in the dataset. After training, we apply the model to our dataset to label the anomalies. The `assign_model()` function enriches our DataFrame with anomaly labels and scores, facilitating the identification of anomalous data points.
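Training and labeling then reduce to two calls; the `fraction` value of 0.025 is an illustrative guess at the share of outliers, not a recommendation:

```python
from pycaret.anomaly import create_model, assign_model

# Train the Minimum Covariance Determinant detector; `fraction` sets the
# expected proportion of outliers in the data.
mcd = create_model("mcd", fraction=0.025)

# Enrich the data with an 'Anomaly' flag (0/1) and an 'Anomaly_Score' column.
results = assign_model(mcd)
results[results["Anomaly"] == 1].head()
```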
Finally, we visually assess the effectiveness of our anomaly detection model by plotting the temperature readings over time, with red markers corresponding to the anomalies flagged by the model. This visualization enables a direct comparison between the anomalies detected by the model and those initially observed, offering insights into the model's accuracy and potential areas for tuning.
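A sketch of this comparison plot, overlaying the model's flags as red markers on the raw series together with the known anomaly windows from before; it assumes `assign_model()` preserved the timestamp index:

```python
import plotly.express as px

anomalies = results[results["Anomaly"] == 1]

fig = px.line(results.reset_index(), x="ts", y="value",
              title="Anomalies detected by the MCD model")
fig.add_scatter(x=anomalies.index, y=anomalies["value"], mode="markers",
                marker=dict(color="red"), name="anomaly")
for start, end in known_anomalies:
    fig.add_vrect(x0=start, x1=end, fillcolor="blue", opacity=0.2, line_width=0)
fig.show()
```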
Our exploration highlights the potential of advanced analytics for monitoring and maintaining system health. With the demonstrated approach, we can surface insights from machine data and detect anomalies with just a few lines of code, enabling developers to build new analyses and applications within hours.