
How to Evaluate Forecast Accuracy

This guide shows you how to measure and compare forecast performance using yohou's scorers, cross-validation, and baseline comparisons.

Prerequisites

Try it interactively

How to Aggregate Scorer Results

Demonstrate all scorer aggregation strategies (stepwise, vintagewise, componentwise, groupwise, coveragewise, all) on panel data with weighted group aggregation.

Cross-Validation for Time Series

Evaluate forecasters with cross_val_score, cross_validate, and cross_val_predict using temporal splitters.

How to Evaluate Interval Forecasts

Evaluate prediction intervals with EmpiricalCoverage, IntervalScore, MeanIntervalWidth, PinballLoss, and CalibrationError across coverage levels.

How to Use Point Forecast Metrics

Compare MAE, MAPE, MASE, RMSE, and other point metrics across multiple forecasters with componentwise and groupwise aggregation.


1. Score a Single Forecast

Every scorer follows a two-step pattern: fit on training data (to set internal state such as the training mean for scaled metrics), then score with the test set and predictions:

from sklearn.linear_model import Ridge
from yohou.point import PointReductionForecaster
from yohou.metrics import MeanAbsoluteError
from yohou.datasets import fetch_electricity_demand
from yohou.model_selection import train_test_split

data = fetch_electricity_demand()
y = data.frame.select("time", "vic__demand").drop_nulls()

y_train, y_test = train_test_split(y, test_size=48)

forecaster = PointReductionForecaster(estimator=Ridge())
forecaster.fit(y_train, forecasting_horizon=24)
y_pred = forecaster.predict()

scorer = MeanAbsoluteError()
scorer.fit(y_train)
mae = scorer.score(y_test, y_pred)

If you need to compare across series with different scales, use MeanAbsoluteScaledError instead. See Forecast Accuracy for guidance on choosing the right metric, and the metrics API reference for the complete list.
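
To see why a scaled metric helps when series differ in magnitude, here is a minimal plain-Python sketch. It is not yohou's implementation: it assumes the textbook definition of MASE (test MAE divided by the in-sample MAE of a naive forecast on the training data), and the helper names are illustrative only.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between actuals and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def naive_scale(y_train, seasonality=1):
    """In-sample MAE of the (seasonal) naive forecast.

    This plays the role of the state a scaled scorer learns during fit
    (an assumption about the usual MASE definition).
    """
    diffs = [
        abs(y_train[i] - y_train[i - seasonality])
        for i in range(seasonality, len(y_train))
    ]
    return sum(diffs) / len(diffs)


def mean_absolute_scaled_error(y_true, y_pred, scale):
    """Test MAE expressed in units of the naive in-sample error."""
    return mean_absolute_error(y_true, y_pred) / scale


# Two toy series with the same shape; the second is 100 times larger.
small_train, small_test, small_pred = [1.0, 2.0, 3.0, 4.0], [5.0, 6.0], [4.5, 6.5]
large_train = [100 * v for v in small_train]
large_test = [100 * v for v in small_test]
large_pred = [100 * v for v in small_pred]

mae_small = mean_absolute_error(small_test, small_pred)
mae_large = mean_absolute_error(large_test, large_pred)
mase_small = mean_absolute_scaled_error(small_test, small_pred, naive_scale(small_train))
mase_large = mean_absolute_scaled_error(large_test, large_pred, naive_scale(large_train))

print(mae_small, mae_large)    # MAE differs by the scale factor
print(mase_small, mase_large)  # MASE is the same for both series
```

The raw MAE values are not comparable across the two series, while the scaled values are, which is the property that makes a scaled metric suitable for cross-series comparison.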

2. Evaluate with Cross-Validation

Use cross_validate with a temporal splitter to get robust estimates across multiple train-test folds:

from yohou.model_selection import cross_validate, ExpandingWindowSplitter
from yohou.metrics import MeanAbsoluteError

cv = ExpandingWindowSplitter(n_splits=5, test_size=14)

results = cross_validate(
    forecaster=PointReductionForecaster(estimator=Ridge()),
    y=y,
    scoring=MeanAbsoluteError(),
    cv=cv,
    forecasting_horizon=14,
)

print(results)  # DataFrame with split, test_score, fit_time, score_time
print(f"Mean MAE: {results['test_score'].mean():.2f}")
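
To build intuition for what a temporal splitter does, the sketch below generates expanding-window folds over integer positions in plain Python. The fold layout (test blocks of equal size, with the last block ending at the final observation) is an assumption for illustration; yohou's ExpandingWindowSplitter may arrange folds differently.

```python
def expanding_window_splits(n_samples, n_splits, test_size):
    """Yield (train_range, test_range) pairs with a growing training window.

    Assumed layout: the last test block ends at the final observation and
    each earlier fold steps back by test_size.
    """
    first_test_start = n_samples - n_splits * test_size
    if first_test_start <= 0:
        raise ValueError("not enough samples for the requested splits")
    for k in range(n_splits):
        test_start = first_test_start + k * test_size
        yield range(0, test_start), range(test_start, test_start + test_size)


folds = list(expanding_window_splits(n_samples=100, n_splits=5, test_size=14))
for i, (train, test) in enumerate(folds):
    print(f"fold {i}: train [0, {train.stop})  test [{test.start}, {test.stop})")
```

Every training window ends exactly where its test block begins, so no fold trains on observations from its own future.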

3. Compare Against a Naive Baseline

Evaluate a SeasonalNaive forecaster on the same splits to confirm your model outperforms simple benchmarks:

from yohou.point import SeasonalNaive
from yohou.model_selection import cross_val_score

cv = ExpandingWindowSplitter(n_splits=5, test_size=14)
scorer = MeanAbsoluteError()

model_scores = cross_val_score(
    PointReductionForecaster(estimator=Ridge()),
    y,
    scoring=scorer,
    cv=cv,
    forecasting_horizon=14,
)

baseline_scores = cross_val_score(
    SeasonalNaive(seasonality=7),
    y,
    scoring=scorer,
    cv=cv,
    forecasting_horizon=14,
)

print(f"Model MAE: {model_scores['score'].mean():.2f}")
print(f"Baseline MAE: {baseline_scores['score'].mean():.2f}")
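
As a rough illustration of what the baseline does and how to read the comparison, the following plain-Python sketch builds a seasonal naive forecast by repeating the last observed season, then reports the model's MAE relative to the baseline. The data and the "model" predictions are made up for the example, and the helpers are not part of yohou.

```python
def seasonal_naive_forecast(y_train, horizon, seasonality):
    """Repeat the last full season of the training data over the horizon."""
    last_season = y_train[-seasonality:]
    return [last_season[h % seasonality] for h in range(horizon)]


def mean_absolute_error(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy daily series with a weekly pattern (hypothetical values).
pattern = [10.0, 12.0, 14.0, 13.0, 11.0, 6.0, 5.0]
y_train = pattern * 4
y_test = [v + 1.0 for v in pattern * 2]  # next two weeks, shifted up by 1

baseline_pred = seasonal_naive_forecast(y_train, horizon=14, seasonality=7)
baseline_mae = mean_absolute_error(y_test, baseline_pred)

# Stand-in for a fitted model's predictions (hypothetical).
model_pred = [v + 0.5 for v in pattern * 2]
model_mae = mean_absolute_error(y_test, model_pred)

relative_mae = model_mae / baseline_mae
print(f"Relative MAE: {relative_mae:.2f}")  # below 1 means the model beats the baseline
```

A ratio at or above 1 suggests the model adds no value over repeating last season's values on these splits.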

4. Obtain Out-of-Fold Predictions

Use cross_val_predict to collect predictions from each fold rather than scores. The returned DataFrame contains a split column identifying which fold produced each prediction, which is useful for diagnostics and visualization:

from yohou.model_selection import cross_val_predict

predictions = cross_val_predict(
    PointReductionForecaster(estimator=Ridge()),
    y,
    cv=ExpandingWindowSplitter(n_splits=5, test_size=14),
    forecasting_horizon=14,
)
print(predictions.head())

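For interval forecasts, pass method="predict_interval" and the desired coverage_rates. Here interval_forecaster is any interval forecaster, such as the SplitConformalForecaster constructed in section 6: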

interval_predictions = cross_val_predict(
    interval_forecaster,
    y,
    cv=ExpandingWindowSplitter(n_splits=5, test_size=14),
    forecasting_horizon=14,
    method="predict_interval",
    coverage_rates=[0.90],
)

5. Use Multiple Metrics Simultaneously

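Pass a dictionary of scorers as the scoring argument to evaluate several metrics at once. The example below does this in a grid search:

# GridSearchCV is assumed to live in yohou.model_selection alongside the other CV utilities.
from yohou.model_selection import GridSearchCV
from yohou.metrics import MeanAbsoluteError, RootMeanSquaredError, MeanAbsoluteScaledError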

search = GridSearchCV(
    forecaster=PointReductionForecaster(estimator=Ridge()),
    param_grid={"estimator__alpha": [0.1, 1.0, 10.0]},
    scoring={
        "mae": MeanAbsoluteError(),
        "rmse": RootMeanSquaredError(),
        "mase": MeanAbsoluteScaledError(),
    },
    refit="mae",
    cv=ExpandingWindowSplitter(n_splits=5, test_size=14),
)
search.fit(y, forecasting_horizon=14)

The refit parameter specifies which metric determines the best configuration. All metrics appear in cv_results_.

6. Evaluate Interval Forecasts

Use EmpiricalCoverage to check whether intervals contain the true values at the claimed rate, and IntervalScore to penalize both under-coverage and unnecessarily wide intervals:

from yohou.metrics import EmpiricalCoverage, IntervalScore
from yohou.interval import SplitConformalForecaster

interval_forecaster = SplitConformalForecaster(
    point_forecaster=PointReductionForecaster(estimator=Ridge()),
)
interval_forecaster.fit(y_train, forecasting_horizon=24, coverage_rates=[0.90])
y_pred_interval = interval_forecaster.predict_interval()

coverage_scorer = EmpiricalCoverage()
coverage_scorer.fit(y_train)
print(coverage_scorer.score(y_test, y_pred_interval))

interval_scorer = IntervalScore()
interval_scorer.fit(y_train)
print(interval_scorer.score(y_test, y_pred_interval))

A well-calibrated 90% interval should achieve empirical coverage close to 0.9. Substantially lower coverage means the intervals are too narrow; substantially higher coverage means they are wider than necessary, which IntervalScore penalizes. See Produce Prediction Intervals for the full interval forecasting workflow.
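
The two quantities can be sketched in plain Python as follows. This assumes the standard definitions: coverage is the fraction of actuals inside the interval, and the interval score is the interval width plus a penalty of 2/alpha times the distance by which an actual falls outside, where alpha = 1 - coverage rate. The function names and toy data are illustrative, and yohou's scorers may differ in detail.

```python
def empirical_coverage(y_true, lower, upper):
    """Fraction of actuals that fall inside [lower, upper]."""
    inside = [lo <= y <= hi for y, lo, hi in zip(y_true, lower, upper)]
    return sum(inside) / len(inside)


def interval_score(y_true, lower, upper, coverage_rate):
    """Mean interval score: width plus a penalty for each miss."""
    alpha = 1.0 - coverage_rate
    total = 0.0
    for y, lo, hi in zip(y_true, lower, upper):
        total += hi - lo  # wide intervals cost more
        if y < lo:
            total += (2.0 / alpha) * (lo - y)  # actual below the interval
        elif y > hi:
            total += (2.0 / alpha) * (y - hi)  # actual above the interval
    return total / len(y_true)


# Toy data: the third actual falls 1 unit below its interval.
y_true = [10.0, 12.0, 9.0, 15.0, 11.0]
lower = [8.0, 10.0, 10.0, 12.0, 9.0]
upper = [12.0, 14.0, 13.0, 16.0, 13.0]

coverage = empirical_coverage(y_true, lower, upper)
score = interval_score(y_true, lower, upper, coverage_rate=0.90)
print(f"coverage={coverage:.2f}, interval score={score:.2f}")
```

Note how a single miss dominates the score at a 90% level: the penalty multiplier is 20, so narrow intervals that miss are punished far more than slightly wider ones that cover.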

7. Apply Time Weighting

Weight recent errors more heavily using exponential_decay_weight:

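from yohou.utils.weighting import exponential_decay_weight

# Use a scorer fitted on y_train (section 3 rebinds `scorer` to a fresh instance).
scorer = MeanAbsoluteError()
scorer.fit(y_train)
weight_fn = exponential_decay_weight(half_life=365)
weighted_mae = scorer.score(y_test, y_pred, time_weight=weight_fn)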

See Time Weighting for the full guide on weight functions.
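
For intuition, here is a minimal plain-Python sketch of half-life weighting. It assumes a weight of 0.5 raised to (age / half_life), with age counted in steps back from the most recent observation, and a weighted mean normalized by the sum of weights. How yohou's exponential_decay_weight measures age and normalizes is not shown in this guide, so treat these details as assumptions.

```python
def exponential_decay_weights(n_steps, half_life):
    """Weights ordered oldest first; the most recent step gets weight 1."""
    return [0.5 ** ((n_steps - 1 - i) / half_life) for i in range(n_steps)]


def weighted_mean_absolute_error(y_true, y_pred, weights):
    """Weighted average of absolute errors, normalized by total weight."""
    total = sum(w * abs(t - p) for t, p, w in zip(y_true, y_pred, weights))
    return total / sum(weights)


# One large error on the oldest step, perfect predictions afterwards.
y_true = [10.0, 10.0, 10.0, 10.0]
y_pred = [14.0, 10.0, 10.0, 10.0]

weights = exponential_decay_weights(n_steps=4, half_life=1)
unweighted = weighted_mean_absolute_error(y_true, y_pred, [1.0] * 4)
weighted = weighted_mean_absolute_error(y_true, y_pred, weights)
print(weights)
print(unweighted, weighted)  # the old error counts for less once weighted
```

With a half-life of one step, each step back halves the weight, so the error on the oldest observation contributes much less to the weighted score than to the plain mean.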

8. Evaluate Classification Forecasts

For class-probability forecasts, use proper scoring rules such as LogLoss and BrierScore. See Forecast with Class Probabilities for the full classification workflow and scoring examples.
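
As a quick reference for what these scoring rules compute, the sketch below implements both in plain Python. It assumes the standard definitions: log loss is the mean negative log probability assigned to the observed class, and the multiclass Brier score sums squared errors over classes (the binary convention is half this value). The names, labels, and data are illustrative, and yohou's LogLoss and BrierScore may use different conventions.

```python
import math


def log_loss(y_true, proba, eps=1e-15):
    """Mean negative log probability of the observed class.

    y_true: list of labels; proba: list of dicts mapping label -> probability.
    """
    total = 0.0
    for label, p in zip(y_true, proba):
        # Clip to avoid log(0) when a forecast assigns zero probability.
        total -= math.log(min(max(p[label], eps), 1.0))
    return total / len(y_true)


def brier_score(y_true, proba):
    """Multiclass Brier score: squared error summed over classes, averaged over steps."""
    total = 0.0
    for label, p in zip(y_true, proba):
        total += sum(
            (prob - (1.0 if cls == label else 0.0)) ** 2 for cls, prob in p.items()
        )
    return total / len(y_true)


y_true = ["high", "low"]
confident = [{"high": 0.9, "low": 0.1}, {"high": 0.2, "low": 0.8}]
uninformed = [{"high": 0.5, "low": 0.5}, {"high": 0.5, "low": 0.5}]

print(log_loss(y_true, confident), log_loss(y_true, uninformed))
print(brier_score(y_true, confident), brier_score(y_true, uninformed))
```

Both rules are lower for the forecast that places more probability on the observed classes, and both are minimized in expectation by reporting the true probabilities, which is what makes them proper.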

9. Score Panel Forecasts

Scorers handle panel data automatically. Use aggregation_method="groupwise" to get one score per group so you can spot underperforming entities:

from yohou.metrics import MeanAbsoluteError

scorer = MeanAbsoluteError(aggregation_method="groupwise")
scorer.fit(y_train)
scores = scorer.score(y_test, y_pred)  # one row per group

See Work with Panel Data for the full panel forecasting workflow and Forecast Accuracy for aggregation mode details.
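
Conceptually, groupwise aggregation computes the metric separately for each group instead of pooling all errors. The plain-Python sketch below illustrates the idea with hypothetical group names and values; it does not reproduce yohou's panel data format or its scorer internals.

```python
from collections import defaultdict


def groupwise_mae(rows):
    """Return {group: MAE} from an iterable of (group, y_true, y_pred) tuples."""
    abs_errors = defaultdict(list)
    for group, y_true, y_pred in rows:
        abs_errors[group].append(abs(y_true - y_pred))
    return {group: sum(errs) / len(errs) for group, errs in abs_errors.items()}


# Hypothetical panel with two groups.
rows = [
    ("vic", 100.0, 98.0),
    ("vic", 110.0, 112.0),
    ("nsw", 200.0, 190.0),
    ("nsw", 210.0, 225.0),
]

scores = groupwise_mae(rows)
pooled = sum(abs(t - p) for _, t, p in rows) / len(rows)
worst_group = max(scores, key=scores.get)

print(scores)       # one score per group
print(pooled)       # a single pooled score hides the difference
print(worst_group)  # the group with the largest error
```

The pooled score sits between the two group scores, so on its own it would not reveal which entity is underperforming.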

See Also