Forecasting Workflow

In this tutorial, we will evaluate two forecasters using temporal cross-validation, search for the best hyperparameters, and inspect residuals to diagnose model weaknesses.

Prerequisites

Setup

We use the monthly tourism dataset: 187 months of visitor arrivals to a single Australian region (T1). First, load the data and define a 12-month forecasting horizon:

from yohou.datasets import fetch_tourism_monthly
from yohou.model_selection import train_test_split

bunch = fetch_tourism_monthly()
y = (
    bunch.frame
    .select("time", "T1__tourists")
    .drop_nulls()
    .rename({"T1__tourists": "tourists"})
)

forecasting_horizon = 12
y_train, y_test = train_test_split(y, test_size=forecasting_horizon)

Next, fit a SeasonalNaive baseline:

from yohou.point import SeasonalNaive

baseline = SeasonalNaive(seasonality=12)
baseline.fit(y_train, forecasting_horizon=forecasting_horizon)
y_pred_baseline = baseline.predict(forecasting_horizon=forecasting_horizon)
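To make the baseline concrete, here is a minimal sketch of the seasonal naive rule in plain Python. It illustrates the general technique (repeat the last observed season), not Yohou's implementation; the function name and toy values are hypothetical.

```python
def seasonal_naive_forecast(history, seasonality, horizon):
    """Repeat the last full season of `history` for `horizon` steps."""
    if len(history) < seasonality:
        raise ValueError("history must cover at least one full season")
    last_season = history[-seasonality:]
    # Step h (0-based) reuses the value seen at the same position in the last season.
    return [last_season[h % seasonality] for h in range(horizon)]


# Toy series with a period of 4 (hypothetical values, not the tourism data).
toy_history = [10, 20, 30, 40, 12, 22, 32, 42]
toy_forecast = seasonal_naive_forecast(toy_history, seasonality=4, horizon=6)
print(toy_forecast)  # [12, 22, 32, 42, 12, 22]
```

With seasonality=12 on monthly data, the same rule predicts each month with the value from the same month one year earlier.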

Now build a Ridge pipeline with SeasonalDifferencing and lag features. If the pipeline looks unfamiliar, see Getting Started for a step-by-step walkthrough:

from sklearn.linear_model import Ridge
from yohou.compose import FeaturePipeline
from yohou.point import PointReductionForecaster
from yohou.preprocessing import LagTransformer
from yohou.stationarity import SeasonalDifferencing

forecaster = PointReductionForecaster(
    estimator=Ridge(),
    target_transformer=SeasonalDifferencing(seasonality=12),
    feature_transformer=FeaturePipeline([
        ("lags", LagTransformer(lag=[1, 2, 3, 12])),
    ]),
)
forecaster.fit(y_train, forecasting_horizon=forecasting_horizon)
y_pred_ridge = forecaster.predict(forecasting_horizon=forecasting_horizon)
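The two transformations in this pipeline can be sketched by hand. The snippet below is a plain-Python illustration of seasonal differencing followed by lag-feature construction, under the assumption that the reduction approach turns the differenced series into (lagged values, target) rows for the regressor. The helper names and toy values are hypothetical, and Yohou's transformers may handle edges and horizons differently.

```python
def seasonal_difference(series, seasonality):
    """Return series[t] - series[t - seasonality] for every t that has a predecessor."""
    return [series[t] - series[t - seasonality] for t in range(seasonality, len(series))]


def make_lag_rows(series, lags):
    """Build (features, target) pairs; features are the values `lag` steps back."""
    max_lag = max(lags)
    rows = []
    for t in range(max_lag, len(series)):
        features = [series[t - lag] for lag in lags]
        rows.append((features, series[t]))
    return rows


# Toy series with a period of 4 and a slowly growing level (hypothetical values).
toy_series = [10, 20, 30, 40, 13, 24, 35, 46, 17, 29, 41, 53]
diffed = seasonal_difference(toy_series, seasonality=4)
rows = make_lag_rows(diffed, lags=[1, 2])

print(diffed)   # [3, 4, 5, 6, 4, 5, 6, 7]
print(rows[0])  # ([4, 3], 5)
```

Differencing removes the repeating seasonal level, so the regressor only has to model the year-over-year change; the first `seasonality` observations and the first `max(lags)` differenced values are consumed as warm-up.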

Score with Multiple Metrics

Now score both models on the single train/test split. Scorers in Yohou are stateful: call fit(y_train) first so that scaled metrics like MeanAbsoluteScaledError can compute their normalisation factor from the training data:

from yohou.metrics import MeanAbsoluteError, MeanAbsoluteScaledError

mae = MeanAbsoluteError()
mae.fit(y_train)
mase = MeanAbsoluteScaledError(seasonality=12)
mase.fit(y_train)

for name, y_pred in [("SeasonalNaive", y_pred_baseline), ("Ridge", y_pred_ridge)]:
    print(f"{name:15s}  MAE={mae.score(y_test, y_pred):.2f}  MASE={mase.score(y_test, y_pred):.2f}")
SeasonalNaive    MAE=302.05  MASE=1.65
Ridge            MAE=214.35  MASE=1.17

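Notice that both MASE values are above 1.0. MASE divides the forecast error by the in-sample error of a seasonal naive forecast on the training data, so a value above 1.0 means the holdout error is larger than that in-sample reference. Ridge has the lower error of the two on this holdout, but neither model beats the reference. Cross-validation across multiple folds will tell us whether this pattern holds.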
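The scaling behind MASE can be worked through by hand. The sketch below uses the standard textbook definition (holdout MAE divided by the in-sample MAE of a seasonal naive forecast); it is an assumption that Yohou's MeanAbsoluteScaledError matches this in every detail, and the function names and toy values are hypothetical.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mean_absolute_scaled_error(y_train, y_test, y_pred, seasonality):
    """Holdout MAE divided by the in-sample seasonal naive MAE on y_train."""
    naive_errors = [
        abs(y_train[t] - y_train[t - seasonality])
        for t in range(seasonality, len(y_train))
    ]
    scale = sum(naive_errors) / len(naive_errors)
    return mean_absolute_error(y_test, y_pred) / scale


# Toy data with a period of 4 (hypothetical values).
toy_train = [10, 20, 30, 40, 12, 22, 32, 42]  # in-sample seasonal naive MAE = 2
toy_test = [15, 25, 35, 45]

naive_pred = [12, 22, 32, 42]   # last season repeated
better_pred = [14, 24, 34, 44]  # some other model's forecast

print(mean_absolute_scaled_error(toy_train, toy_test, naive_pred, 4))   # 1.5
print(mean_absolute_scaled_error(toy_train, toy_test, better_pred, 4))  # 0.5
```

The seasonal naive forecast itself scores 1.5 here because its holdout error (3) is larger than its in-sample error (2), which is how a seasonal naive model can end up with a MASE above 1.0.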

ExpandingWindowSplitter creates multiple temporal train/test folds by growing the training window. GridSearchCV evaluates each parameter combination across all folds and selects the best:

from yohou.model_selection import ExpandingWindowSplitter, GridSearchCV

cv = ExpandingWindowSplitter(n_splits=3, test_size=forecasting_horizon)

search = GridSearchCV(
    forecaster=forecaster,
    param_grid={"estimator__alpha": [0.1, 1.0, 10.0, 100.0]},
    scoring=MeanAbsoluteScaledError(seasonality=12),
    cv=cv,
)
search.fit(y, forecasting_horizon=forecasting_horizon)

print(f"Best params:  {search.best_params_}")
print(f"CV MASE:      {-search.best_score_:.2f}")
Best params:  {'estimator__alpha': 0.1}
CV MASE:      0.87

Notice that best_score_ is negative. Yohou follows scikit-learn's convention of negating scores so that higher is always better. Negate it to recover the actual MASE.
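The sign convention can be illustrated without any library. In this sketch the candidate names and error values are hypothetical: errors are negated into scores, the candidate with the highest score wins, and negating the winning score recovers the error.

```python
# Hypothetical mean CV errors per candidate (lower error is better).
cv_errors = {"candidate_a": 2.0, "candidate_b": 0.5, "candidate_c": 1.25}

# Negate errors so that "higher is better" holds for every metric.
cv_scores = {name: -error for name, error in cv_errors.items()}

# Selecting the maximum score is the same as selecting the minimum error.
best_name = max(cv_scores, key=cv_scores.get)
best_score = cv_scores[best_name]

print(best_name)    # candidate_b
print(best_score)   # -0.5
print(-best_score)  # 0.5, the actual error
```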

The CV MASE of 0.87 is below 1.0, so averaged over the three folds Ridge outperforms the seasonal naive reference. This suggests the single holdout was harder than average.
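To make the fold structure concrete, the sketch below generates expanding-window index splits in plain Python. It assumes the test blocks are contiguous, non-overlapping, and aligned to the end of the series; Yohou's ExpandingWindowSplitter may place its folds differently, and the function name is hypothetical.

```python
def expanding_window_splits(n_samples, n_splits, test_size):
    """Yield (train_indices, test_indices); each training window starts at 0 and grows."""
    first_test_start = n_samples - n_splits * test_size
    if first_test_start <= 0:
        raise ValueError("not enough samples for the requested folds")
    for k in range(n_splits):
        test_start = first_test_start + k * test_size
        train_idx = list(range(0, test_start))
        test_idx = list(range(test_start, test_start + test_size))
        yield train_idx, test_idx


# Same shape as the tutorial: 187 observations, 3 folds, 12-step test windows.
folds = list(expanding_window_splits(n_samples=187, n_splits=3, test_size=12))
for train_idx, test_idx in folds:
    print(len(train_idx), test_idx[0], test_idx[-1])
# 151 151 162
# 163 163 174
# 175 175 186
```

Every test window lies strictly after its training window, so no fold trains on observations from its own future.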

Inspect Residuals

Let's refit the best forecaster from the search on the training data only, then use plot_residuals to inspect where its forecasts miss:

from yohou.plotting import plot_residuals

best = search.best_forecaster_
best.fit(y_train, forecasting_horizon=forecasting_horizon)
y_pred_tuned = best.predict(forecasting_horizon=forecasting_horizon)

plot_residuals(y_pred_tuned, y_test, title="Residuals: Ridge (Tuned)")

You should see a scatter of residuals over the test period. If the residuals cluster near zero with no obvious pattern, the model is capturing the main signal. Spikes at seasonal lags or a visible trend suggest missing structure. See Residual Diagnostics for a full interpretation guide.
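The same checks can be approximated numerically. The sketch below computes residuals, their mean (a measure of bias), and a lag autocorrelation in plain Python. It assumes residuals are defined as actual minus predicted; plot_residuals may use a different sign convention, and the helper names and toy values are hypothetical.

```python
def compute_residuals(actual, predicted):
    """Residual = actual - predicted; positive values mean the model under-forecast."""
    return [a - p for a, p in zip(actual, predicted)]


def autocorrelation(values, lag):
    """Sample autocorrelation of `values` at the given lag."""
    n = len(values)
    mean = sum(values) / n
    denom = sum((v - mean) ** 2 for v in values)
    if denom == 0:
        return 0.0  # constant series: no variation to correlate
    num = sum((values[t] - mean) * (values[t - lag] - mean) for t in range(lag, n))
    return num / denom


# Toy example: a flat forecast against a rising series (hypothetical values).
toy_actual = [10, 12, 14, 16, 18, 20]
toy_predicted = [10, 10, 10, 10, 10, 10]

toy_residuals = compute_residuals(toy_actual, toy_predicted)
mean_residual = sum(toy_residuals) / len(toy_residuals)
lag1 = autocorrelation(toy_residuals, lag=1)

print(toy_residuals)  # [0, 2, 4, 6, 8, 10]
print(mean_residual)  # 5.0
print(lag1)           # 0.5
```

A mean far from zero points to systematic bias, and a large autocorrelation at lag 1 or at the seasonal lag points to structure the model has not captured.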

Compare Models Visually

Now plot both forecasts against the actual test values:

from yohou.plotting import plot_forecast, plot_score_summary

plot_forecast(
    y_test,
    {"SeasonalNaive": y_pred_baseline, "Ridge (Tuned)": y_pred_tuned},
    y_train=y_train,
    n_history=36,
    title="Model Comparison: Tourism Forecast",
    y_label="Monthly visitors",
)

plot_score_summary(
    {"MAE": mae, "MASE": mase},
    y_test,
    {"SeasonalNaive": y_pred_baseline, "Ridge (Tuned)": y_pred_tuned},
    title="Score Comparison",
)

Notice how plot_forecast overlays predicted and actual values so you can spot where each model over- or under-shoots. plot_score_summary condenses the comparison into a single bar chart.

What You Built

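You have completed the full evaluation workflow: fitting a SeasonalNaive baseline and a Ridge pipeline, scoring both with MAE and MASE, tuning with GridSearchCV over expanding-window folds, inspecting residuals, and comparing the forecasts visually.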

Next Steps