Live Stream on Jan 23rd: Unlocking Real Time Insights in the Renewable Energy Sector with CrateDB

Register now
Skip to content
Data models > Time series

AutoML for Time Series Forecasting

Time series forecasting involves making predictions using historical time-stamped data. This method stands at the heart of decision making in many industries, whether it is forecasting sales, supply chain analysis, predicting stock market trends, or anticipating weather patterns.

Practical Application of Time Series Forecasting with AutoML, MLflow, PyCaret, and CrateDB

The following Jupyter Notebook example presents a practical demonstration of an AutoML approach using CrateDB, PyCaret, and MLflow to transform raw data into actionable predictions. The choice of PyCaret, CrateDB, and MLflow for creating a time series forecasting model is driven by their unique strengths, particularly when managing large datasets and aiming for rapid development cycles.

Step 1

To follow this tutorial, similarly to previous examples, repeat Step 1 to ensure your connection with CrateDB. Then, follow the next steps.

Step 2. Getting started and Importing modules

First, we ensure our toolbox is ready by installing all necessary Python libraries.

import pandas as pd
import sqlalchemy as sa
import os
import plotly
import plotly.graph_objects as go
import mlflow_cratedb  # Required to enable the CrateDB MLflow adapter.
from dotenv import load_dotenv
from pycaret.time_series import setup, compare_models, tune_model,\
    blend_models, finalize_model, save_model, predict_model, plot_model

if os.path.isfile(".env"):
    load_dotenv(".env", override=True)

Step 3. Creating demo data

After setting up the database connectivity, we import the necessary Python modules and merge the datasets. We introduce a ‘total_sales’ column, calculated by multiplying the number of items sold by their unit price. The prepared dataset is then imported into a CrateDB table, making our data accessible and ready for analysis.

target_data = pd.read_csv(
    "https://data.4tu.nl/file/539debdb-a325-412d-b024-593f70cba15b/a801f5d4-5dfe-412a-ace2-a64f93ad0010"
)
related_data = pd.read_csv(
    "https://data.4tu.nl/file/539debdb-a325-412d-b024-593f70cba15b/f2bd27bd-deeb-4933-bed7-29325ee05c2e",
    header=None,
)
related_data.columns = ["item", "org", "date", "unit_price"]
data = target_data.merge(related_data, on=["item", "org", "date"])
data["total_sales"] = data["unit_price"] * data["quantity"]
data["date"] = pd.to_datetime(data["date"])

# Connect to database.
engine = sa.create_engine(DBURI_DATA, echo=bool(os.environ.get("DEBUG")))

# Import data.
data.to_sql(
    "sales_data_for_forecast",
    engine,
    index=False,
    chunksize=1000,
    if_exists="replace",
)

# CrateDB is eventually consistent, so synchronize write operations.
with engine.connect() as conn:
    conn.execute(sa.text("REFRESH TABLE sales_data_for_forecast"))

Step 4. Model creation

Our first step towards model creation is understanding our data. By plotting total sales over time alongside a trendline, we reveal underlying sales patterns and fluctuations. PyCaret is then used to train a model to predict future sales using past data. We use the month for time organization and plan to predict sales for the next 12 months. PyCaret compares 28 models to find the best predictor.

query = """
    SELECT
        DATE_TRUNC('month', DATE) AS MONTH,
        SUM(total_sales) AS total_sales
    from sales_data_for_forecast
    group by month
    order by month
"""

with engine.connect() as conn:
    with conn.execute(sa.text(query)) as cursor:
        data = pd.DataFrame(cursor.fetchall(), columns=cursor.keys())

data["month"] = pd.to_datetime(data['month'], unit='ms')

# Configure MLflow to use CrateDB.
os.environ["MLFLOW_TRACKING_URI"] = DBURI_MLFLOW
# Set plotly to PNG render mode, to render static PNGs instead of interactive plots.
# Note: Remove this line to get back interactivity.
plotly.io.renderers.default = 'png'

df = data.copy()
fig = go.Figure()
fig.add_trace(go.Scatter(x=df["month"], y=df["total_sales"], mode='lines+markers', name='Total Sales'))

fig.update_traces(line=dict(color='royalblue', width=2), selector=dict(mode='lines+markers'))
fig.add_trace(go.Scatter(x=df["month"], y=df["total_sales"].rolling(window=12).mean(), mode='lines', name='Trendline'))
fig.update_layout(title="Total sales Over Time with Trendline", xaxis_title="Period", yaxis_title="Total Sales", height=1000, width=1500)

fig.show()

total_sales_over_time

Step 5. Model training

Then, we train and blend the top-performing models using PyCaret. This demonstrates the power of model blending, a modern machine learning technique that combines the results of multiple models for the final prediction, outperforming the capabilities of a single model.

s = setup(data, fh=15, target="total_sales", index="month", log_experiment=True)
# Invoke PyCaret's main workhorse function, `compare_models`.

# - "sort" defines which metric is used to rate the models.
# - "n_select" defines how many models are selected.
# - "exclude" optionally defines which models are excluded from the comparison.
# - "include" optionally defines which models are included in the comparison. (
#    all available models are included by default)
# - "fold" defines the number of folds to use for cross-validation.

# Note: This is only relevant if we are executing automated tests
if "PYTEST_CURRENT_TEST" in os.environ:
    best_models = compare_models(sort="MASE",
                                 include=["ets", "et_cds_dt", 
                                 "naive"],
                                 n_select=3)
# If we are not in an automated test, compare all available models
else:
    best_models = compare_models(sort="MASE", n_select=3)

Step 6. Predictions on unseen data

In the final step, we train our created model on the entire dataset, then apply it to make predictions on new, unseen data. The predict_model method provides predictions for our data, offering a glimpse into our model's future sales forecasts.

final_model = finalize_model(best_model)

if not os.path.exists("model"):
    os.makedirs("model")

# Save the model to disk
_ = save_model(final_model, "model/timeseriesforecast_model")

predict_model(final_model, s.X_test)

We have turned data into predictions, showcasing the ease and power of an AutoML approach. With just a few steps, historical data can be transformed into future insights, making time series forecasting accessible to many users, even without deep machine learning and programming skills.

Want to read more?

Whitepaper" Guide for Time Series Data Projects