Time series forecasting stands at the heart of decision making in many industries, whether it is forecasting sales, analyzing supply chains, predicting stock market trends, or anticipating weather patterns. In today’s video, we will walk through a practical example of an AutoML approach that leverages CrateDB, PyCaret, and MLflow to transform data into insightful predictions.
Our choice of PyCaret, CrateDB, and MLflow for building a time series forecasting model is driven by their unique features and strengths, especially when dealing with large datasets and aiming for efficient development cycles. PyCaret stands out for its AutoML capabilities, automating the tedious process of model selection and hyperparameter tuning. With just a few lines of code, we can compare, evaluate, and refine dozens of models to find the best performer. MLflow helps to operationalize machine learning initiatives. It tracks models, parameters, and outcomes, so that we can iterate rapidly and efficiently towards the optimal forecasting solution while also monitoring the model in production use.
Before we start, we need to ensure our toolbox is ready. This involves installing all necessary Python libraries. If you're using an environment like Google Colab, make sure to use the absolute URL for the requirements.txt file. Next, we need a running CrateDB instance. The quickest way to get started is with CrateDB Cloud. It's free to sign up, and you can deploy a cluster with a few clicks. Please choose the correct connection string, depending on whether you're connecting to CrateDB Cloud or a local instance. If you are using CrateDB Cloud, you will need to specify the username, password, and hostname of your cluster.
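To make the two connection variants concrete, here is a minimal sketch of how such a connection string could be assembled. It assumes the SQLAlchemy-style `crate://` scheme and CrateDB's default HTTP port 4200. The helper name `cratedb_uri`, the hostname, and the credentials are hypothetical placeholders, not values from the video.

```python
from urllib.parse import quote


def cratedb_uri(host="localhost", user="crate", password=None, port=4200, ssl=False):
    """Build a SQLAlchemy-style CrateDB connection string (illustrative helper)."""
    # Percent-encode credentials so special characters do not break the URI.
    credentials = quote(user, safe="")
    if password:
        credentials += ":" + quote(password, safe="")
    uri = f"crate://{credentials}@{host}:{port}"
    if ssl:
        # Assumption: a cloud cluster is reached over TLS.
        uri += "?ssl=true"
    return uri


# Local instance: default user, no password, no TLS.
local_uri = cratedb_uri()

# Cloud cluster: hypothetical hostname and credentials.
cloud_uri = cratedb_uri(
    host="my-cluster.example.com",
    user="admin",
    password="s3cret/pw",
    ssl=True,
)

print(local_uri)
print(cloud_uri)
```

Keeping the credentials out of the notebook itself, for example by reading them from environment variables, would be a sensible next step.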
With our database connectivity set up, we'll import the necessary Python modules, and pull and merge the two datasets. We also introduce a new column, ‘total_sales’, calculated by multiplying the number of items sold by their unit price. The prepared dataset is then imported into a new table in CrateDB. This makes our data accessible and ready for the analysis we aim to perform. Having set our data foundation, we shift gears towards model creation and ensure that our machine learning experiments are trackable by configuring MLflow to use our CrateDB instance. This integration is crucial for maintaining a clear record of our modelling journey.
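The merge and the derived column can be sketched in plain Python as follows. The sample rows and the column names (`month`, `item`, `units_sold`, `unit_price`) are hypothetical, since the real datasets are not shown here. In the notebook, the same step would more likely be done with a dataframe library.

```python
# Hypothetical sample rows standing in for the two source datasets.
units = [
    {"month": "2014-01", "item": "A", "units_sold": 10},
    {"month": "2014-01", "item": "B", "units_sold": 4},
    {"month": "2014-02", "item": "A", "units_sold": 7},
]
prices = [
    {"month": "2014-01", "item": "A", "unit_price": 2.5},
    {"month": "2014-01", "item": "B", "unit_price": 5.0},
    {"month": "2014-02", "item": "A", "unit_price": 3.0},
]

# Index the price rows by their join key for fast lookup.
price_by_key = {(p["month"], p["item"]): p["unit_price"] for p in prices}

# Inner join on (month, item), adding the derived total_sales column.
merged = []
for row in units:
    key = (row["month"], row["item"])
    if key in price_by_key:
        unit_price = price_by_key[key]
        merged.append(
            {
                **row,
                "unit_price": unit_price,
                "total_sales": row["units_sold"] * unit_price,
            }
        )

for row in merged:
    print(row["month"], row["item"], row["total_sales"])
```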
As a first step, we want to understand our data. By plotting our total sales over time, alongside a trendline, we can see the underlying patterns and fluctuations in sales. The blue line represents the total sales, and it shows considerable fluctuations throughout the period. There are notable peaks and troughs, suggesting periods of high sales followed by periods of lower sales. The orange trendline provides an overview of the general sales trend over the years. It shows a gradual upward trend in sales from 2014 to 2021. This indicates that, despite the periodic fluctuations, overall sales have been rising over the years. To help understand the key components of our time series data, we have outlined more detailed observations here.
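How the trendline in the plot was computed is not stated, so the following is only a sketch under the assumption of a straight-line least-squares fit over the observation index. The sales figures are made up for illustration.

```python
def linear_trend(values):
    """Fit y = intercept + slope * x by ordinary least squares, with x = 0, 1, 2, ..."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations to fit a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical sales values with fluctuations around a rising trend.
sales = [100.0, 120.0, 90.0, 130.0, 110.0, 150.0]
slope, intercept = linear_trend(sales)

# One trendline value per observation, ready to plot next to the raw series.
trendline = [intercept + slope * x for x in range(len(sales))]

print(f"slope per period: {slope:.2f}")
```

A positive slope corresponds to the upward trend described above, even when individual periods drop below their predecessors.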
Using PyCaret, we train a model to predict future sales from past data. First, we tell PyCaret what we want to predict: the total sales. We use the month to organize our data over time. We also enable experiment tracking with MLflow and set the forecast horizon to the next 12 months. Next, PyCaret compares 28 models to find which one predicts best. It turns out that the Extra Trees model, which adjusts for seasonal changes, is the top choice based on the Mean Absolute Scaled Error (MASE) metric.
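For readers unfamiliar with the Mean Absolute Scaled Error, here is a small sketch of its standard textbook definition: the forecast's mean absolute error divided by the in-sample mean absolute error of a naive forecast. The function name, the `season` parameter, and the numbers are illustrative and may differ in detail from how PyCaret computes the metric internally.

```python
def mase(y_train, y_true, y_pred, season=1):
    """Mean Absolute Scaled Error: forecast MAE scaled by the naive in-sample MAE."""
    # In-sample errors of the naive forecast "repeat the value from `season` steps ago".
    naive_errors = [
        abs(y_train[i] - y_train[i - season]) for i in range(season, len(y_train))
    ]
    scale = sum(naive_errors) / len(naive_errors)
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)
    return mae / scale


# Hypothetical training history, actual future values, and model forecasts.
y_train = [10, 12, 14, 13, 15]
y_true = [16, 18]
y_pred = [15, 20]

score = mase(y_train, y_true, y_pred)
print(f"MASE: {score:.3f}")
```

A value below 1 means the forecast beats the naive baseline on average, which is why the metric is convenient for ranking models across differently scaled series.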
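However, your exact numbers might vary a bit from run to run, since some models, Extra Trees included, involve randomness unless a fixed seed is set. To test each model's accuracy, PyCaret evaluates it on several different train/test splits of the data. This method, called cross-validation, helps make sure our model is reliable and doesn’t just memorize the data.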
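For time series, cross-validation has to respect time order, so the training data always comes before the test data. The sketch below assumes an expanding-window scheme, a common choice for forecasting. The function name and parameters are illustrative, not PyCaret's API.

```python
def expanding_window_splits(n_obs, n_folds, horizon):
    """Return (train_indices, test_indices) pairs where training always precedes testing."""
    splits = []
    for fold in range(n_folds):
        # The last fold's test window ends at the final observation;
        # earlier folds end one horizon earlier each.
        test_end = n_obs - (n_folds - 1 - fold) * horizon
        test_start = test_end - horizon
        if test_start <= 0:
            raise ValueError("not enough observations for this many folds")
        train = list(range(test_start))
        test = list(range(test_start, test_end))
        splits.append((train, test))
    return splits


# Example: 48 monthly observations, 3 folds, 12-month forecast horizon.
splits = expanding_window_splits(n_obs=48, n_folds=3, horizon=12)
for train, test in splits:
    print(f"train: 0..{train[-1]}  test: {test[0]}..{test[-1]}")
```

Each fold trains on a longer history than the one before, and the reported score is typically the average over all folds.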
Before continuing, it's best to have a look at the performance of the models. The above plots show that all three of the selected models seem to predict quite well, with the first one, at least at a glance, probably performing best. Now, we optimize these top models via hyperparameter tuning, which is just one line of code in PyCaret. Hyperparameter tuning tries to improve the performance of the models, and its results need to be reviewed carefully to make sure the tuned models do not overfit.
Now, let’s move to a powerful technique in modern machine learning: model blending, which combines the results of multiple models into the final prediction to outperform the capabilities of a single model. As you can observe, we've trained 28 models, then tuned and blended the top three. In PyCaret, all of this is done with just a few lines of code, whereas a manual approach would have taken hours or days. This makes AutoML not only powerful but also easy to apply to many forecasting tasks.
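The idea behind blending can be illustrated with a small sketch that averages the forecasts of several models, optionally with weights. This is an assumption about the simplest form of blending. The function name and the forecast values are hypothetical, and the exact combination method used in the video is not stated.

```python
def blend_forecasts(forecasts, weights=None):
    """Combine several forecasts step by step using a (weighted) mean."""
    if weights is None:
        weights = [1.0] * len(forecasts)
    horizon = len(forecasts[0])
    if any(len(f) != horizon for f in forecasts):
        raise ValueError("all forecasts must cover the same horizon")
    total = sum(weights)
    return [
        sum(w * f[t] for w, f in zip(weights, forecasts)) / total
        for t in range(horizon)
    ]


# Hypothetical three-step forecasts from the top three models.
model_a = [100.0, 110.0, 120.0]
model_b = [90.0, 115.0, 130.0]
model_c = [110.0, 105.0, 125.0]

blended = blend_forecasts([model_a, model_b, model_c])
print(blended)
```

Averaging tends to cancel out the individual errors of the models, which is why a blend can outperform each of its members.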
Until now, we've kept some data aside for testing. In the final step, we train our blended model on the entire dataset. After that, it can be used to make predictions on new, unseen data. The predict_model method gives us these predictions, helping us to see how our model thinks sales will look in the future.
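The difference between evaluating on held-out data and refitting on everything can be sketched with a deliberately simple stand-in model. The seasonal naive forecaster below is not the model from the video. It is swapped in only because it fits in a few lines, and the sales figures are made up.

```python
def seasonal_naive_forecast(history, horizon, season=12):
    """Forecast by repeating the last full season of `history`."""
    last_season = history[-season:]
    return [last_season[i % season] for i in range(horizon)]


# 36 months of hypothetical sales: a linear rise plus a bump every December.
sales = [100 + 2 * m + 10 * (m % 12 == 11) for m in range(36)]

# Evaluation phase: the last 12 months are held out for testing.
train, holdout = sales[:-12], sales[-12:]
holdout_forecast = seasonal_naive_forecast(train, horizon=12)
holdout_mae = sum(abs(a - f) for a, f in zip(holdout, holdout_forecast)) / len(holdout)

# Final phase: use the entire dataset, then forecast the next 12 unseen months.
future_forecast = seasonal_naive_forecast(sales, horizon=12)

print(f"holdout MAE: {holdout_mae}")
print(future_forecast)
```

Refitting on all available data lets the final forecast start from the most recent observations instead of ignoring the held-out year.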
In this video, we turned data into predictions, showing how easy and powerful an AutoML approach can be. With just a few steps, anyone can turn historical data into future insights, making time series forecasting easy and available to many users – even without deep machine learning and programming skills.