How to Evaluate Forecast Accuracy

This guide shows you how to measure and compare forecast performance using yohou's scorers, cross-validation, and baseline comparisons.

Prerequisites

Yohou installed (Getting Started)
Familiarity with the fit-predict workflow (Getting Started)

Try it interactively

Cross-Validation for Time Series

Evaluate forecasters with cross_val_score, cross_validate, and cross_val_predict using temporal splitters.

How to Aggregate Scorer Results

Demonstrate all scorer aggregation strategies (stepwise, vintagewise, componentwise, groupwise, coveragewise, all) on panel data with weighted group aggregation.

How to Evaluate Interval Forecasts

Evaluate prediction intervals with EmpiricalCoverage, IntervalScore, MeanIntervalWidth, PinballLoss, and CalibrationError across coverage levels.

How to Use Point Forecast Metrics

Compare MAE, MAPE, MASE, RMSE, and other point metrics across multiple forecasters with componentwise and groupwise aggregation.


1. Score a Single Forecast

Every scorer follows a two-step pattern: call fit on the training data to set internal state (for example, the in-sample scaling factor that scaled metrics need), then call score with the test set and the predictions:

from sklearn.linear_model import Ridge
from yohou.point import PointReductionForecaster
from yohou.metrics import MeanAbsoluteError
from yohou.datasets import fetch_electricity_demand
from yohou.model_selection import train_test_split

data = fetch_electricity_demand()
y = data.frame.select("time", "vic__demand").drop_nulls()

y_train, y_test = train_test_split(y, test_size=48)

forecaster = PointReductionForecaster(estimator=Ridge())
forecaster.fit(y_train, forecasting_horizon=24)
y_pred = forecaster.predict()

scorer = MeanAbsoluteError()
scorer.fit(y_train)
mae = scorer.score(y_test, y_pred)

If you need to compare across series with different scales, use MeanAbsoluteScaledError instead. See Forecast Accuracy for guidance on choosing the right metric, and the metrics API reference for the complete list.
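
To make the fit-then-score pattern concrete, here is a minimal, library-independent sketch of a scaled metric. It is not yohou's implementation: the NaiveScaledMAE class is hypothetical, and it assumes the scale is the in-sample MAE of a one-step naive forecast, which is the usual definition of MASE for non-seasonal data.

```python
from statistics import fmean


class NaiveScaledMAE:
    """Toy scaled scorer (hypothetical): fit() stores a scale, score() divides by it."""

    def fit(self, y_train):
        # Scale = in-sample MAE of the one-step naive forecast y[t-1].
        diffs = [abs(b - a) for a, b in zip(y_train, y_train[1:])]
        self.scale_ = fmean(diffs)
        return self

    def score(self, y_true, y_pred):
        # Plain MAE on the test window, divided by the training scale.
        mae = fmean([abs(t - p) for t, p in zip(y_true, y_pred)])
        return mae / self.scale_


y_train = [10.0, 12.0, 11.0, 13.0]
y_test = [14.0, 15.0]
y_pred = [13.0, 17.0]

scaled = NaiveScaledMAE().fit(y_train).score(y_test, y_pred)
print(f"Scaled MAE: {scaled:.2f}")
```

A value below 1 means the forecast beats the in-sample naive forecast on average. Because the score is unit-free, it can be compared across series with different scales.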

2. Evaluate with Cross-Validation

Use cross_validate with a temporal splitter to get robust estimates across multiple train-test folds:

from yohou.model_selection import cross_validate, ExpandingWindowSplitter
from yohou.metrics import MeanAbsoluteError

cv = ExpandingWindowSplitter(n_splits=5, test_size=14)

results = cross_validate(
    forecaster=PointReductionForecaster(estimator=Ridge()),
    y=y,
    scoring=MeanAbsoluteError(),
    cv=cv,
    forecasting_horizon=14,
)

print(results)  # DataFrame with split, test_score, fit_time, score_time
print(f"Mean MAE: {results['test_score'].mean():.2f}")

To collect predictions per forecast origin rather than aggregate scores, see Multi-vintage Scoring.
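
The idea behind an expanding-window split can be sketched in plain Python. The layout here is an assumption made for illustration (test windows packed at the end of the series, with no gap and no overlap). The expanding_window_splits helper is hypothetical, and yohou's ExpandingWindowSplitter may place its folds differently.

```python
def expanding_window_splits(n_samples, n_splits, test_size):
    """Yield (train_indices, test_indices) with a growing training window."""
    first_test_start = n_samples - n_splits * test_size
    if first_test_start <= 0:
        raise ValueError("not enough observations for the requested folds")
    for k in range(n_splits):
        test_start = first_test_start + k * test_size
        # Training always starts at index 0 and ends right before the test window.
        yield range(0, test_start), range(test_start, test_start + test_size)


splits = list(expanding_window_splits(n_samples=100, n_splits=5, test_size=14))
for train, test in splits:
    print(f"train 0..{train[-1]}  test {test[0]}..{test[-1]}")
```

Each fold trains only on observations that precede its test window, which is what keeps the evaluation free of look-ahead leakage.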

3. Compare Against a Naive Baseline

Evaluate a SeasonalNaive forecaster on the same splits to confirm your model outperforms simple benchmarks:

from yohou.point import SeasonalNaive
from yohou.model_selection import cross_val_score

cv = ExpandingWindowSplitter(n_splits=5, test_size=14)
scorer = MeanAbsoluteError()

model_scores = cross_val_score(
    PointReductionForecaster(estimator=Ridge()),
    y,
    scoring=scorer,
    cv=cv,
    forecasting_horizon=14,
)

baseline_scores = cross_val_score(
    SeasonalNaive(seasonality=7),
    y,
    scoring=scorer,
    cv=cv,
    forecasting_horizon=14,
)

print(f"Model MAE: {model_scores['score'].mean():.2f}")
print(f"Baseline MAE: {baseline_scores['score'].mean():.2f}")
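
For intuition about what the baseline does, the sketch below implements a seasonal naive forecast in plain Python and turns the two MAE values into a relative skill figure. The seasonal_naive and mae helpers and the model_mae value are hypothetical placeholders, not yohou code or real results.

```python
def seasonal_naive(history, seasonality, horizon):
    """Repeat the last full season of `history` for `horizon` steps."""
    last_season = history[-seasonality:]
    return [last_season[h % seasonality] for h in range(horizon)]


def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


history = [1.0, 2.0, 3.0, 1.0, 2.0, 4.0]
actual = [1.0, 3.0, 4.0, 2.0]

baseline_pred = seasonal_naive(history, seasonality=3, horizon=4)
baseline_mae = mae(actual, baseline_pred)

model_mae = 0.25  # hypothetical model score on the same window
skill = 1 - model_mae / baseline_mae  # > 0 means the model beats the baseline
print(f"Baseline MAE: {baseline_mae:.2f}, skill vs. baseline: {skill:.2f}")
```

A skill near zero or below zero suggests the extra model complexity is not yet paying for itself.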

4. Obtain Out-of-Fold Predictions

Use cross_val_predict to collect predictions from each fold rather than scores. The returned DataFrame contains a split column identifying which fold produced each prediction, which is useful for diagnostics and visualization:

from yohou.model_selection import cross_val_predict

predictions = cross_val_predict(
    PointReductionForecaster(estimator=Ridge()),
    y,
    cv=ExpandingWindowSplitter(n_splits=5, test_size=14),
    forecasting_horizon=14,
)
print(predictions.head())

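For interval forecasts, pass method="predict_interval" and the desired coverage_rates. Here interval_forecaster stands for any interval forecaster, such as the SplitConformalForecaster constructed in section 6: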

interval_predictions = cross_val_predict(
    interval_forecaster,
    y,
    cv=ExpandingWindowSplitter(n_splits=5, test_size=14),
    forecasting_horizon=14,
    method="predict_interval",
    coverage_rates=[0.90],
)

5. Use Multiple Metrics Simultaneously

Pass a dictionary of scorers to evaluate on several metrics at once:

from sklearn.linear_model import Ridge
from yohou.metrics import MeanAbsoluteError, RootMeanSquaredError, MeanAbsoluteScaledError
from yohou.model_selection import GridSearchCV, ExpandingWindowSplitter
from yohou.point import PointReductionForecaster

search = GridSearchCV(
    forecaster=PointReductionForecaster(estimator=Ridge()),
    param_grid={"estimator__alpha": [0.1, 1.0, 10.0]},
    scoring={
        "mae": MeanAbsoluteError(),
        "rmse": RootMeanSquaredError(),
        "mase": MeanAbsoluteScaledError(),
    },
    refit="mae",
    cv=ExpandingWindowSplitter(n_splits=5, test_size=14),
)
search.fit(y, forecasting_horizon=14)

The refit parameter specifies which metric determines the best configuration. All metrics appear in cv_results_.
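
The choice of refit metric matters because metrics can rank the same candidates differently. The plain-Python sketch below uses two made-up error vectors (config_a and config_b are hypothetical names, not yohou output) to show MAE and RMSE disagreeing: RMSE penalizes the single large miss much more heavily.

```python
import math


def mae(errors):
    return sum(abs(e) for e in errors) / len(errors)


def rmse(errors):
    return math.sqrt(sum(e * e for e in errors) / len(errors))


# Illustrative forecast errors: steady small misses vs. one large miss.
errors_by_config = {
    "config_a": [1.0, 1.0, 1.0, 1.0],
    "config_b": [0.0, 0.0, 0.0, 3.0],
}

results = {
    name: {"mae": mae(errs), "rmse": rmse(errs)}
    for name, errs in errors_by_config.items()
}
best_by_mae = min(results, key=lambda name: results[name]["mae"])
best_by_rmse = min(results, key=lambda name: results[name]["rmse"])
print(f"Best by MAE: {best_by_mae}, best by RMSE: {best_by_rmse}")
```

If large individual misses are costly in your application, refitting on an RMSE-style metric is usually the safer choice. Otherwise MAE tends to be easier to interpret.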

6. Evaluate Interval Forecasts

Use EmpiricalCoverage to check whether intervals contain the true values at the claimed rate, and IntervalScore to penalize both under-coverage and unnecessarily wide intervals:

from yohou.metrics import EmpiricalCoverage, IntervalScore
from yohou.interval import SplitConformalForecaster

interval_forecaster = SplitConformalForecaster(
    point_forecaster=PointReductionForecaster(estimator=Ridge()),
)
interval_forecaster.fit(y_train, forecasting_horizon=24, coverage_rates=[0.90])
y_pred_interval = interval_forecaster.predict_interval()

coverage_scorer = EmpiricalCoverage()
coverage_scorer.fit(y_train)
print(coverage_scorer.score(y_test, y_pred_interval))

interval_scorer = IntervalScore()
interval_scorer.fit(y_train)
print(interval_scorer.score(y_test, y_pred_interval))

A well-calibrated 90% interval should achieve empirical coverage close to 0.9. If coverage is substantially lower, the intervals are too narrow. See Produce Prediction Intervals for the full interval forecasting workflow.
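
The two quantities can be written out by hand as follows. This is a sketch of the standard definitions (empirical coverage, and the interval score of Gneiting and Raftery for a central interval), using hypothetical helper names and toy numbers. How yohou's EmpiricalCoverage and IntervalScore aggregate or normalize their results is not shown here and may differ.

```python
def empirical_coverage(y_true, lower, upper):
    """Fraction of observations that fall inside their interval."""
    hits = [lo <= y <= hi for y, lo, hi in zip(y_true, lower, upper)]
    return sum(hits) / len(hits)


def interval_score(y_true, lower, upper, coverage_rate):
    """Mean interval width plus a 2/alpha penalty for each miss."""
    alpha = 1 - coverage_rate
    total = 0.0
    for y, lo, hi in zip(y_true, lower, upper):
        total += hi - lo  # width term: wide intervals cost more
        if y < lo:
            total += 2 / alpha * (lo - y)  # observation below the interval
        elif y > hi:
            total += 2 / alpha * (y - hi)  # observation above the interval
    return total / len(y_true)


y_true = [10.0, 12.0, 15.0, 9.0]
lower = [8.0, 11.0, 10.0, 10.0]
upper = [12.0, 14.0, 14.0, 13.0]

coverage = empirical_coverage(y_true, lower, upper)
score = interval_score(y_true, lower, upper, coverage_rate=0.9)
print(f"Coverage: {coverage:.2f}, interval score: {score:.2f}")
```

In this toy case only half of the observations are covered, far below the nominal 0.9, and the two misses dominate the interval score.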

7. Apply Time Weighting

Weight recent errors more heavily by constructing the scorer with an ExponentialDecayWeighter. Weighting is a constructor parameter, so it is part of the scorer's configuration rather than a per-call argument:

from yohou.metrics import MeanAbsoluteError
from yohou.weighting import ExponentialDecayWeighter

weighted_scorer = MeanAbsoluteError(time_weighter=ExponentialDecayWeighter(half_life=365))
weighted_scorer.fit(y_train)
weighted_mae = weighted_scorer.score(y_test, y_pred)

Because the weighter is passed to the scorer's constructor, its parameters are tunable (for example, time_weighter__half_life), and the weighted scorer can be used directly as a cross-validation objective. See Time Weighting for the full guide.
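
As a rough illustration of what exponential time weighting does to a score, the sketch below assumes the newest observation gets weight 1 and the weight halves every half_life steps into the past. That weighting formula and both helper names are assumptions for illustration. The exact definition used by ExponentialDecayWeighter may differ.

```python
def exponential_decay_weights(n_steps, half_life):
    """Weight 1.0 for the newest step, halving every `half_life` steps back."""
    return [0.5 ** ((n_steps - 1 - i) / half_life) for i in range(n_steps)]


def weighted_mae(y_true, y_pred, weights):
    weighted_errors = sum(
        w * abs(t - p) for t, p, w in zip(y_true, y_pred, weights)
    )
    return weighted_errors / sum(weights)


y_true = [10.0, 10.0, 10.0]
y_pred = [14.0, 12.0, 11.0]  # absolute errors 4, 2, 1 (oldest to newest)

weights = exponential_decay_weights(n_steps=3, half_life=1)
score = weighted_mae(y_true, y_pred, weights)
unweighted = weighted_mae(y_true, y_pred, [1.0, 1.0, 1.0])
print(f"Weighted MAE: {score:.3f}, unweighted MAE: {unweighted:.3f}")
```

Because the largest error is also the oldest, the weighted score comes out lower than the unweighted one. A larger half_life flattens the weights and moves the two scores closer together.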

8. Evaluate Classification Forecasts

For class-probability forecasts, use proper scoring rules such as LogLoss and BrierScore. See Forecast with Class Probabilities for the full classification workflow and scoring examples.
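
For reference, the two scoring rules can be computed by hand as in the sketch below. It uses the standard textbook definitions (multi-class Brier score as a sum of squared differences over classes, and log loss as the mean negative log probability of the true class). The function names, labels, and probabilities are illustrative, and yohou's LogLoss and BrierScore may use different input formats or normalization.

```python
import math


def brier_score(y_true, proba):
    """Mean squared distance between one-hot truth and predicted probabilities."""
    total = 0.0
    for label, p in zip(y_true, proba):
        total += sum(
            (p_k - (1.0 if k == label else 0.0)) ** 2 for k, p_k in p.items()
        )
    return total / len(y_true)


def log_loss(y_true, proba, eps=1e-15):
    """Mean negative log probability assigned to the true class."""
    logs = [math.log(max(p[label], eps)) for label, p in zip(y_true, proba)]
    return -sum(logs) / len(logs)


y_true = ["up", "down"]
proba = [
    {"up": 0.8, "down": 0.2},  # confident and correct
    {"up": 0.5, "down": 0.5},  # uninformative
]

brier = brier_score(y_true, proba)
ll = log_loss(y_true, proba)
print(f"Brier score: {brier:.3f}, log loss: {ll:.3f}")
```

Both rules are minimized in expectation by reporting the true class probabilities, which is what makes them proper scoring rules. Lower is better for both.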

9. Score Panel Forecasts

Scorers handle panel data automatically. Set aggregation_method="groupwise" to get one score per group, which makes underperforming entities easy to spot. The example assumes y_train, y_test, and y_pred hold panel data:

from yohou.metrics import MeanAbsoluteError

scorer = MeanAbsoluteError(aggregation_method="groupwise")
scorer.fit(y_train)
scores = scorer.score(y_test, y_pred)  # one row per group

See Work with Panel Data for the full panel forecasting workflow and Forecast Accuracy for aggregation mode details.
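
Conceptually, groupwise aggregation computes the metric separately for each entity instead of pooling all errors. The plain-Python sketch below shows the idea on toy data. The groupwise_mae helper and the group labels are hypothetical and say nothing about how yohou represents panel data internally.

```python
from collections import defaultdict


def groupwise_mae(groups, y_true, y_pred):
    """Return one MAE per group label."""
    errors = defaultdict(list)
    for group, t, p in zip(groups, y_true, y_pred):
        errors[group].append(abs(t - p))
    return {group: sum(e) / len(e) for group, e in errors.items()}


groups = ["vic", "vic", "nsw", "nsw"]
y_true = [10.0, 12.0, 20.0, 22.0]
y_pred = [11.0, 12.0, 16.0, 26.0]

scores = groupwise_mae(groups, y_true, y_pred)
worst = max(scores, key=scores.get)
print(f"Scores per group: {scores}, worst group: {worst}")
```

A pooled MAE over all four rows would hide the fact that nearly all of the error comes from a single group.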