Time series forecasting involves making predictions using historical, time-stamped data. It stands at the heart of decision making in many industries, whether forecasting sales, analyzing supply chains, predicting stock market trends, or anticipating weather patterns.
Practical Application of Time Series Forecasting with AutoML, MLflow, PyCaret, and CrateDB
The following Jupyter Notebook walks through a practical AutoML workflow that uses CrateDB, PyCaret, and MLflow to transform raw data into actionable predictions. These tools were chosen for their complementary strengths, particularly when managing large datasets and aiming for rapid development cycles.
Step 1
To follow this tutorial, repeat Step 1 from the previous examples to establish the connection to CrateDB. Then continue with the next steps.
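The later code blocks reference the connection strings `DBURI_DATA` and `DBURI_MLFLOW`. A minimal sketch of how they might be defined, assuming a local CrateDB instance; the host, port, credentials, and the `schema=mlflow` parameter are illustrative placeholders that you should adapt to your own setup:

```python
# Hypothetical connection strings; adjust host, port, and credentials
# to match your own CrateDB instance or CrateDB Cloud cluster.
DBURI_DATA = "crate://crate@localhost:4200/"

# Tracking store for MLflow experiment metadata (used in Step 4).
DBURI_MLFLOW = "crate://crate@localhost:4200/?schema=mlflow"
```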
Step 2. Getting started and importing modules
First, we ensure our toolbox is ready by installing all necessary Python libraries.
import pandas as pd
import sqlalchemy as sa
import os
import plotly
import plotly.graph_objects as go
import mlflow_cratedb  # Required to enable the CrateDB MLflow adapter.

from dotenv import load_dotenv
from pycaret.time_series import setup, compare_models, tune_model, \
    blend_models, finalize_model, save_model, predict_model, plot_model

if os.path.isfile(".env"):
    load_dotenv(".env", override=True)
Step 3. Creating demo data
After setting up database connectivity, we load the two raw datasets and merge them. We introduce a 'total_sales' column, calculated by multiplying the number of items sold by their unit price. The prepared dataset is then imported into a CrateDB table, making our data accessible and ready for analysis.
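On toy stand-in data, the merge and the derived `total_sales` column can be illustrated as follows. The column names match the real datasets; the values are made up:

```python
import pandas as pd

# Toy stand-ins for the two downloaded CSV files.
target_data = pd.DataFrame({
    "item": ["A", "A"], "org": [1, 1],
    "date": ["2023-01-01", "2023-02-01"], "quantity": [3, 5],
})
related_data = pd.DataFrame({
    "item": ["A", "A"], "org": [1, 1],
    "date": ["2023-01-01", "2023-02-01"], "unit_price": [2.0, 4.0],
})

# Join on the shared key columns, then derive total sales.
data = target_data.merge(related_data, on=["item", "org", "date"])
data["total_sales"] = data["unit_price"] * data["quantity"]
print(data["total_sales"].tolist())  # → [6.0, 20.0]
```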
target_data = pd.read_csv(
    "https://data.4tu.nl/file/539debdb-a325-412d-b024-593f70cba15b/a801f5d4-5dfe-412a-ace2-a64f93ad0010"
)
related_data = pd.read_csv(
    "https://data.4tu.nl/file/539debdb-a325-412d-b024-593f70cba15b/f2bd27bd-deeb-4933-bed7-29325ee05c2e",
    header=None,
)
related_data.columns = ["item", "org", "date", "unit_price"]

data = target_data.merge(related_data, on=["item", "org", "date"])
data["total_sales"] = data["unit_price"] * data["quantity"]
data["date"] = pd.to_datetime(data["date"])

# Connect to database.
engine = sa.create_engine(DBURI_DATA, echo=bool(os.environ.get("DEBUG")))

# Import data.
data.to_sql(
    "sales_data_for_forecast",
    engine,
    index=False,
    chunksize=1000,
    if_exists="replace",
)

# CrateDB is eventually consistent, so synchronize write operations.
with engine.connect() as conn:
    conn.execute(sa.text("REFRESH TABLE sales_data_for_forecast"))
Step 4. Model creation
Our first step towards model creation is understanding the data. By plotting total sales over time alongside a trendline, we reveal underlying sales patterns and fluctuations. PyCaret is then used to train a model that predicts future sales from past data. We aggregate the data by month and forecast the coming months; the forecast horizon is set to 15 periods in the setup call below. PyCaret compares 28 models to find the best predictor.
query = """
SELECT DATE_TRUNC('month', date) AS month,
       SUM(total_sales) AS total_sales
FROM sales_data_for_forecast
GROUP BY month
ORDER BY month
"""

with engine.connect() as conn:
    with conn.execute(sa.text(query)) as cursor:
        data = pd.DataFrame(cursor.fetchall(), columns=cursor.keys())

# CrateDB returns timestamps as milliseconds since the epoch.
data["month"] = pd.to_datetime(data["month"], unit="ms")

# Configure MLflow to use CrateDB.
os.environ["MLFLOW_TRACKING_URI"] = DBURI_MLFLOW
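For reference, the monthly aggregation performed by the SQL query above (`DATE_TRUNC('month', ...)` plus `SUM`) can be reproduced in pandas. The DataFrame below is a toy illustration, not the real dataset:

```python
import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime(["2023-01-05", "2023-01-20", "2023-02-10"]),
    "total_sales": [100.0, 50.0, 80.0],
})

# Equivalent of DATE_TRUNC('month', date) + SUM(total_sales):
# truncate each date to its month, then sum within each month.
monthly = df.groupby(df["date"].dt.to_period("M"))["total_sales"].sum()
print(monthly.tolist())  # → [150.0, 80.0]
```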
# Set plotly to PNG render mode, to render static PNGs instead of interactive plots.
# Note: Remove this line to get back interactivity.
plotly.io.renderers.default = "png"

df = data.copy()

fig = go.Figure()
fig.add_trace(
    go.Scatter(
        x=df["month"],
        y=df["total_sales"],
        mode="lines+markers",
        name="Total Sales",
    )
)
fig.update_traces(
    line=dict(color="royalblue", width=2),
    selector=dict(mode="lines+markers"),
)
fig.add_trace(
    go.Scatter(
        x=df["month"],
        y=df["total_sales"].rolling(window=12).mean(),
        mode="lines",
        name="Trendline",
    )
)
fig.update_layout(
    title="Total Sales Over Time with Trendline",
    xaxis_title="Period",
    yaxis_title="Total Sales",
    height=1000,
    width=1500,
)
fig.show()
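The trendline in the plot is a 12-month rolling mean, so its first eleven points are undefined. A small sketch of the rolling-mean behavior, using a window of 2 on toy values:

```python
import pandas as pd

sales = pd.Series([10.0, 20.0, 30.0, 40.0])

# A rolling mean smooths short-term fluctuations; the first
# window - 1 positions have no full window and come out as NaN.
trend = sales.rolling(window=2).mean()
print(trend.tolist())  # → [nan, 15.0, 25.0, 35.0]
```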
Step 5. Model training
Next, we train and blend the top-performing models using PyCaret. This demonstrates the power of model blending, an ensembling technique that combines the outputs of multiple models into the final prediction, often outperforming any single model.
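The idea behind blending can be sketched in plain Python: average the per-period forecasts of several models. The model names and numbers below are made up for illustration; PyCaret's `blend_models` applies the same principle with proper cross-validation and weighting options:

```python
# Hypothetical per-period forecasts from three candidate models.
forecasts = {
    "ets": [100.0, 110.0],
    "arima": [90.0, 120.0],
    "naive": [95.0, 100.0],
}

# Blend by taking the mean across models for each forecast period.
blended = [
    sum(values) / len(values)
    for values in zip(*forecasts.values())
]
print(blended)  # → [95.0, 110.0]
```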
s = setup(data, fh=15, target="total_sales", index="month", log_experiment=True)
# Invoke PyCaret's main workhorse function, `compare_models`.
# - "sort" defines which metric is used to rate the models.
# - "n_select" defines how many models are selected.
# - "exclude" optionally defines which models are excluded from the comparison.
# - "include" optionally defines which models are included in the comparison
#   (all available models are included by default).
# - "fold" defines the number of folds to use for cross-validation.

# Note: This restriction is only relevant when executing automated tests.
if "PYTEST_CURRENT_TEST" in os.environ:
    best_models = compare_models(sort="MASE", include=["ets", "et_cds_dt", "naive"], n_select=3)
# Outside of automated tests, compare all available models.
else:
    best_models = compare_models(sort="MASE", n_select=3)
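Since the comparison is ranked by MASE (Mean Absolute Scaled Error), here is a minimal sketch of how that metric works, in its simplified non-seasonal form with toy numbers: the mean absolute forecast error, scaled by the mean absolute error of a one-step naive forecast on the training data. A MASE below 1 means the model beats the naive baseline.

```python
# Toy training series and a two-step out-of-sample forecast.
y_train = [10.0, 12.0, 11.0, 13.0]
y_true = [14.0, 15.0]
y_pred = [13.0, 16.0]

# Scale: mean absolute error of the in-sample one-step naive forecast.
naive_scale = sum(
    abs(curr - prev) for prev, curr in zip(y_train, y_train[1:])
) / (len(y_train) - 1)

# MASE: mean absolute forecast error divided by the naive scale.
mase = (
    sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)
) / naive_scale
print(mase)  # → 0.6
```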
Step 6. Predictions on unseen data
In the final step, we train our created model on the entire dataset, then apply it to make predictions on new, unseen data. The predict_model method provides predictions for our data, offering a glimpse into our model's future sales forecasts.
# Blend the three best models into a single ensemble model.
best_model = blend_models(best_models)

final_model = finalize_model(best_model)

if not os.path.exists("model"):
    os.makedirs("model")

# Save the model to disk.
_ = save_model(final_model, "model/timeseriesforecast_model")

predict_model(final_model, s.X_test)
We have turned data into predictions, showcasing the ease and power of an AutoML approach. With just a few steps, historical data can be transformed into future insights, making time series forecasting accessible to many users, even without deep machine learning and programming skills.