Time series forecasting involves making predictions using historical, time-stamped data. It stands at the heart of decision making in many industries, whether forecasting sales, analyzing supply chains, predicting stock market trends, or anticipating weather patterns.
Practical Application of Time Series Forecasting with AutoML, MLflow, PyCaret, and CrateDB
The following Jupyter Notebook walks through a practical AutoML workflow that uses CrateDB, PyCaret, and MLflow to transform raw data into actionable predictions. These tools were chosen for their complementary strengths, particularly when managing large datasets and aiming for rapid development cycles.
Step 1
To follow this tutorial, repeat Step 1 from the previous examples to establish the connection to CrateDB. Then continue with the next steps.
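The later code blocks reference the connection strings `DBURI_DATA` and `DBURI_MLFLOW`. A minimal sketch of how they might be defined, assuming a local CrateDB instance; the host, port, credentials, and the `schema=mlflow` parameter are illustrative placeholders that you should adapt to your own setup:

```python
# Hypothetical connection strings; adjust host, port, and credentials
# to match your own CrateDB instance or CrateDB Cloud cluster.
DBURI_DATA = "crate://crate@localhost:4200/"

# Tracking store for MLflow experiment metadata (used in Step 4).
DBURI_MLFLOW = "crate://crate@localhost:4200/?schema=mlflow"
```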
Step 2. Getting started and importing modules
First, we ensure our toolbox is ready by installing all necessary Python libraries.
import pandas as pd
import sqlalchemy as sa
import os
import plotly
import plotly.graph_objects as go
import mlflow_cratedb  # Required to enable the CrateDB MLflow adapter.

from dotenv import load_dotenv
from pycaret.time_series import setup, compare_models, tune_model, \
    blend_models, finalize_model, save_model, predict_model, plot_model

if os.path.isfile(".env"):
    load_dotenv(".env", override=True)
Step 3. Creating demo data
After setting up database connectivity, we load the two raw datasets and merge them. We introduce a 'total_sales' column, calculated by multiplying the number of items sold by their unit price. The prepared dataset is then imported into a CrateDB table, making our data accessible and ready for analysis.
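On toy stand-in data, the merge and the derived `total_sales` column can be illustrated as follows. The column names match the real datasets; the values are made up:

```python
import pandas as pd

# Toy stand-ins for the two downloaded CSV files.
target_data = pd.DataFrame({
    "item": ["A", "A"], "org": [1, 1],
    "date": ["2023-01-01", "2023-02-01"], "quantity": [3, 5],
})
related_data = pd.DataFrame({
    "item": ["A", "A"], "org": [1, 1],
    "date": ["2023-01-01", "2023-02-01"], "unit_price": [2.0, 4.0],
})

# Join on the shared key columns, then derive total sales.
data = target_data.merge(related_data, on=["item", "org", "date"])
data["total_sales"] = data["unit_price"] * data["quantity"]
print(data["total_sales"].tolist())  # → [6.0, 20.0]
```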
target_data = pd.read_csv(
    "https://data.4tu.nl/file/539debdb-a325-412d-b024-593f70cba15b/a801f5d4-5dfe-412a-ace2-a64f93ad0010"
)
related_data = pd.read_csv(
    "https://data.4tu.nl/file/539debdb-a325-412d-b024-593f70cba15b/f2bd27bd-deeb-4933-bed7-29325ee05c2e",
    header=None,
)
related_data.columns = ["item", "org", "date", "unit_price"]

data = target_data.merge(related_data, on=["item", "org", "date"])
data["total_sales"] = data["unit_price"] * data["quantity"]
data["date"] = pd.to_datetime(data["date"])

# Connect to database.
engine = sa.create_engine(DBURI_DATA, echo=bool(os.environ.get("DEBUG")))

# Import data.
data.to_sql(
    "sales_data_for_forecast",
    engine,
    index=False,
    chunksize=1000,
    if_exists="replace",
)

# CrateDB is eventually consistent, so synchronize write operations.
with engine.connect() as conn:
    conn.execute(sa.text("REFRESH TABLE sales_data_for_forecast"))
Step 4. Model creation
Our first step towards model creation is understanding the data. By plotting total sales over time alongside a trendline, we reveal underlying sales patterns and fluctuations. PyCaret is then used to train a model that predicts future sales from past data. We aggregate the data by month and forecast the coming months; the forecast horizon is set to 15 periods in the setup call below. PyCaret compares 28 models to find the best predictor.
query = """
SELECT DATE_TRUNC('month', date) AS month,
       SUM(total_sales) AS total_sales
FROM sales_data_for_forecast
GROUP BY month
ORDER BY month
"""

with engine.connect() as conn:
    with conn.execute(sa.text(query)) as cursor:
        data = pd.DataFrame(cursor.fetchall(), columns=cursor.keys())

# CrateDB returns timestamps as milliseconds since the epoch.
data["month"] = pd.to_datetime(data["month"], unit="ms")

# Configure MLflow to use CrateDB.
os.environ["MLFLOW_TRACKING_URI"] = DBURI_MLFLOW
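For reference, the monthly aggregation performed by the SQL query above (`DATE_TRUNC('month', ...)` plus `SUM`) can be reproduced in pandas. The DataFrame below is a toy illustration, not the real dataset:

```python
import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime(["2023-01-05", "2023-01-20", "2023-02-10"]),
    "total_sales": [100.0, 50.0, 80.0],
})

# Equivalent of DATE_TRUNC('month', date) + SUM(total_sales):
# truncate each date to its month, then sum within each month.
monthly = df.groupby(df["date"].dt.to_period("M"))["total_sales"].sum()
print(monthly.tolist())  # → [150.0, 80.0]
```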
# Set plotly to PNG render mode, to render static PNGs instead of interactive plots.
# Note: Remove this line to get back interactivity.
plotly.io.renderers.default = "png"

df = data.copy()

fig = go.Figure()
fig.add_trace(
    go.Scatter(
        x=df["month"],
        y=df["total_sales"],
        mode="lines+markers",
        name="Total Sales",
    )
)
fig.update_traces(
    line=dict(color="royalblue", width=2),
    selector=dict(mode="lines+markers"),
)
fig.add_trace(
    go.Scatter(
        x=df["month"],
        y=df["total_sales"].rolling(window=12).mean(),
        mode="lines",
        name="Trendline",
    )
)
fig.update_layout(
    title="Total Sales Over Time with Trendline",
    xaxis_title="Period",
    yaxis_title="Total Sales",
    height=1000,
    width=1500,
)
fig.show()
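The trendline in the plot is a 12-month rolling mean, so its first eleven points are undefined. A small sketch of the rolling-mean behavior, using a window of 2 on toy values:

```python
import pandas as pd

sales = pd.Series([10.0, 20.0, 30.0, 40.0])

# A rolling mean smooths short-term fluctuations; the first
# window - 1 positions have no full window and come out as NaN.
trend = sales.rolling(window=2).mean()
print(trend.tolist())  # → [nan, 15.0, 25.0, 35.0]
```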
Step 5. Model training
Next, we train and blend the top-performing models using PyCaret. This demonstrates the power of model blending, an ensembling technique that combines the outputs of multiple models into the final prediction, often outperforming any single model.
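The idea behind blending can be sketched in plain Python: average the per-period forecasts of several models. The model names and numbers below are made up for illustration; PyCaret's `blend_models` applies the same principle with proper cross-validation and weighting options:

```python
# Hypothetical per-period forecasts from three candidate models.
forecasts = {
    "ets": [100.0, 110.0],
    "arima": [90.0, 120.0],
    "naive": [95.0, 100.0],
}

# Blend by taking the mean across models for each forecast period.
blended = [
    sum(values) / len(values)
    for values in zip(*forecasts.values())
]
print(blended)  # → [95.0, 110.0]
```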
s = setup(data, fh=15, target="total_sales", index="month", log_experiment=True)
# Invoke PyCaret's main workhorse function, `compare_models`.
# - "sort" defines which metric is used to rate the models.
# - "n_select" defines how many models are selected.
# - "exclude" optionally defines which models are excluded from the comparison.
# - "include" optionally defines which models are included in the comparison
#   (all available models are included by default).
# - "fold" defines the number of folds to use for cross-validation.

# Note: This restriction is only relevant when executing automated tests.
if "PYTEST_CURRENT_TEST" in os.environ:
    best_models = compare_models(sort="MASE", include=["ets", "et_cds_dt", "naive"], n_select=3)
# Outside of automated tests, compare all available models.
else:
    best_models = compare_models(sort="MASE", n_select=3)
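Since the comparison is ranked by MASE (Mean Absolute Scaled Error), here is a minimal sketch of how that metric works, in its simplified non-seasonal form with toy numbers: the mean absolute forecast error, scaled by the mean absolute error of a one-step naive forecast on the training data. A MASE below 1 means the model beats the naive baseline.

```python
# Toy training series and a two-step out-of-sample forecast.
y_train = [10.0, 12.0, 11.0, 13.0]
y_true = [14.0, 15.0]
y_pred = [13.0, 16.0]

# Scale: mean absolute error of the in-sample one-step naive forecast.
naive_scale = sum(
    abs(curr - prev) for prev, curr in zip(y_train, y_train[1:])
) / (len(y_train) - 1)

# MASE: mean absolute forecast error divided by the naive scale.
mase = (
    sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)
) / naive_scale
print(mase)  # → 0.6
```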
Step 6. Predictions on unseen data
In the final step, we train our created model on the entire dataset, then apply it to make predictions on new, unseen data. The predict_model method provides predictions for our data, offering a glimpse into our model's future sales forecasts.
# Blend the three best models into a single ensemble model.
best_model = blend_models(best_models)

final_model = finalize_model(best_model)

if not os.path.exists("model"):
    os.makedirs("model")

# Save the model to disk.
_ = save_model(final_model, "model/timeseriesforecast_model")

predict_model(final_model, s.X_test)
We have turned data into predictions, showcasing the ease and power of an AutoML approach. With just a few steps, historical data can be transformed into future insights, making time series forecasting accessible to many users, even without deep machine learning and programming skills.