
IntervalScore

yohou.metrics.interval.IntervalScore

Bases: BaseIntervalScorer

Interval Score (Winkler Score) for prediction intervals.

Combines interval width with penalties for observations falling outside the interval. Balances sharpness and coverage in a single metric.

The interval score for coverage rate α is:

\[\text{IS}(\alpha) = \frac{1}{n}\sum_{i=1}^{n} \left[|U_i - L_i| + \frac{2}{\alpha}(L_i - y_i)\mathbb{1}(y_i < L_i) + \frac{2}{\alpha}(y_i - U_i)\mathbb{1}(y_i > U_i)\right]\]

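where L_i and U_i are the lower and upper interval bounds for observation y_i. The first term is the interval width; the second and third terms penalize observations that fall below the lower bound or above the upper bound, respectively.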
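
The formula can be sketched in plain Python for a single coverage rate. This is a minimal illustration, not the library's implementation: the helper name `interval_score` and its list-based arguments are assumptions made for this example, and the penalty is scaled by 2 / rate as in the formula above.

```python
def interval_score(y_true, lower, upper, rate):
    """Mean interval score for one coverage rate (hypothetical helper, not part of yohou)."""
    if rate == 0:
        raise ValueError("rate must be non-zero: the penalty divides by it")
    total = 0.0
    for y, lo, hi in zip(y_true, lower, upper):
        width = abs(hi - lo)                                # first term: interval width
        below = (2.0 / rate) * (lo - y) if y < lo else 0.0  # observation under the lower bound
        above = (2.0 / rate) * (y - hi) if y > hi else 0.0  # observation over the upper bound
        total += width + below + above
    return total / len(y_true)


# Both observations inside their intervals: only the widths contribute.
covered = interval_score([10.0, 20.0], [8.0, 18.0], [12.0, 22.0], rate=0.9)

# Second observation (25.0) lies 3.0 above its upper bound (22.0), so a penalty is added.
missed = interval_score([10.0, 25.0], [8.0, 18.0], [12.0, 22.0], rate=0.9)

print(covered, missed)
```

With these inputs the covered case reduces to the mean width, while the missed case is larger by the penalty on the second observation.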

Parameters

aggregation_method : list of str or str, default="all"

Dimensions to collapse when aggregating scores. Orthogonal modes:

  • "stepwise": Collapse the forecasting-step dimension.
  • "vintagewise": Collapse the vintage/observed-time dimension.
  • "componentwise": Collapse components, return per-timestep scores.
  • "groupwise": Collapse panel groups (panel data only).
  • "coveragewise": Collapse coverage rates (return average interval score).
  • "all": Collapse all dimensions (returns scalar).

coverage_rates : list of float, dict of float to float, or None, default=None

Coverage rate filter (list) or filter with weights (dict).

groups : list of str, dict of str to float, or None, default=None

Panel group filter (list) or filter with weights (dict).

components : list of str, dict of str to float, or None, default=None

Component filter (list) or filter with weights (dict).

Attributes

lower_is_better : bool

True for interval score (lower is better).

Examples

>>> import polars as pl
>>> from datetime import datetime
>>> from yohou.metrics import IntervalScore
>>> y_true = pl.DataFrame({"time": [datetime(2020, 1, 1), datetime(2020, 1, 2)], "value": [10.0, 20.0]})
>>> y_pred = pl.DataFrame({
...     "vintage_time": [datetime(2019, 12, 31)] * 2,
...     "time": [datetime(2020, 1, 1), datetime(2020, 1, 2)],
...     "value_lower_0.9": [8.0, 18.0],
...     "value_upper_0.9": [12.0, 22.0],
... })
>>> scorer = IntervalScore()
>>> _ = scorer.fit(y_true)
>>> scorer.score(y_true, y_pred)
4.0
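
The result can be checked by hand. Both observations lie inside their intervals, so only the width terms of the formula contribute:

\[\text{IS}(0.9) = \frac{1}{2}\left[(12 - 8) + (22 - 18)\right] = \frac{1}{2}(4 + 4) = 4.0\]

As a hypothetical variation (not part of the example above), if the first observation were 7.0 instead of 10.0, it would fall 1.0 below the lower bound of 8.0, and the same formula would give:

\[\text{IS}(0.9) = \frac{1}{2}\left[\left(4 + \frac{2}{0.9}\cdot 1\right) + 4\right] = 4 + \frac{1}{0.9} \approx 5.11\]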

Notes

  • Lower is better
  • Penalizes both wide intervals and poor coverage
  • Scale-dependent (same units as target)
  • Also known as "Winkler score" in literature
  • Used in scaled form in forecasting competitions (e.g. the Mean Scaled Interval Score, MSIS, in M4)
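
The scale-dependence noted above can be checked numerically. The sketch below uses a hypothetical single-observation helper, `row_score` (not part of yohou), that follows the formula on this page. Under that assumption, multiplying the target and both bounds by a constant multiplies the score by the same constant.

```python
def row_score(y, lower, upper, rate):
    """Interval score for one observation (hypothetical helper for illustration)."""
    width = abs(upper - lower)
    below = (2.0 / rate) * (lower - y) if y < lower else 0.0
    above = (2.0 / rate) * (y - upper) if y > upper else 0.0
    return width + below + above


# Same forecast expressed in two units, e.g. thousands vs. single units.
base = row_score(25.0, 18.0, 22.0, rate=0.9)
scaled = row_score(25_000.0, 18_000.0, 22_000.0, rate=0.9)

# The ratio of the two scores matches the unit conversion factor.
ratio = scaled / base
print(ratio)
```

Because of this, scores are only comparable across series measured in the same units.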

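See Also

  • EmpiricalCoverage (yohou.metrics.interval.EmpiricalCoverage): Coverage-only metric
  • MeanIntervalWidth (yohou.metrics.interval.MeanIntervalWidth): Width-only metric
  • PinballLoss (yohou.metrics.interval.PinballLoss): Asymmetric quantile-based metric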

Source Code

class IntervalScore(BaseIntervalScorer):
    r"""Interval Score (Winkler Score) for prediction intervals.

    Combines interval width with penalties for observations falling outside
    the interval. Balances sharpness and coverage in a single metric.

    The interval score for coverage rate α is:

    $$\text{IS}(\alpha) = \frac{1}{n}\sum_{i=1}^{n} \left[|U_i - L_i| + \frac{2}{\alpha}(L_i - y_i)\mathbb{1}(y_i < L_i) + \frac{2}{\alpha}(y_i - U_i)\mathbb{1}(y_i > U_i)\right]$$

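    where L_i and U_i are the lower and upper interval bounds for observation y_i.
    The first term is the interval width; the second and third terms penalize
    observations that fall below the lower bound or above the upper bound,
    respectively.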

    Parameters
    ----------
    aggregation_method : list of str or str, default="all"
        Dimensions to collapse when aggregating scores. Orthogonal modes:

        - "stepwise": Collapse the forecasting-step dimension.
        - "vintagewise": Collapse the vintage/observed-time dimension.
        - "componentwise": Collapse components, return per-timestep scores.
        - "groupwise": Collapse panel groups (panel data only).
        - "coveragewise": Collapse coverage rates (return average interval score).

        - "all": Collapse all dimensions (returns scalar).
    coverage_rates : list of float, dict of float to float, or None, default=None
        Coverage rate filter (list) or filter with weights (dict).
    groups : list of str, dict of str to float, or None, default=None
        Panel group filter (list) or filter with weights (dict).
    components : list of str, dict of str to float, or None, default=None
        Component filter (list) or filter with weights (dict).

    Attributes
    ----------
    lower_is_better : bool
        True for interval score (lower is better).

    Examples
    --------
    >>> import polars as pl
    >>> from datetime import datetime
    >>> from yohou.metrics import IntervalScore
    >>> y_true = pl.DataFrame({"time": [datetime(2020, 1, 1), datetime(2020, 1, 2)], "value": [10.0, 20.0]})
    >>> y_pred = pl.DataFrame({
    ...     "vintage_time": [datetime(2019, 12, 31)] * 2,
    ...     "time": [datetime(2020, 1, 1), datetime(2020, 1, 2)],
    ...     "value_lower_0.9": [8.0, 18.0],
    ...     "value_upper_0.9": [12.0, 22.0],
    ... })
    >>> scorer = IntervalScore()
    >>> _ = scorer.fit(y_true)
    >>> scorer.score(y_true, y_pred)
    4.0

    Notes
    -----
    - Lower is better
    - Penalizes both wide intervals and poor coverage
    - Scale-dependent (same units as target)
    - Also known as "Winkler score" in literature
    - Used in scaled form in forecasting competitions (e.g. the Mean Scaled Interval Score, MSIS, in M4)

    See Also
    --------
    - [`EmpiricalCoverage`][yohou.metrics.interval.EmpiricalCoverage] : Coverage-only metric
    - [`MeanIntervalWidth`][yohou.metrics.interval.MeanIntervalWidth] : Width-only metric
    - [`PinballLoss`][yohou.metrics.interval.PinballLoss] : Asymmetric quantile-based metric

    """

    _parameter_constraints: dict = {
        **BaseIntervalScorer._parameter_constraints,
    }

    _metric_name = "interval_score"

    def __init__(
        self,
        aggregation_method: list[str] | str = "all",
        coverage_rates: list[float] | dict[float, float] | None = None,
        groups: list[str] | dict[str, float] | None = None,
        components: list[str] | dict[str, float] | None = None,
    ) -> None:
        agg_list = aggregation_method
        if aggregation_method == "all":
            agg_list = ["stepwise", "vintagewise", "componentwise", "groupwise", "coveragewise"]

        super().__init__(
            aggregation_method=agg_list,
            coverage_rates=coverage_rates,
            groups=groups,
            components=components,
        )

    def _compute_raw_scores(self, y_truth, y_pred, coverage_rates, target_columns):
        """Compute per-row interval score values."""
        frames = []
        for rate in coverage_rates:
            if rate == 0:
                raise ValueError(
                    "IntervalScore is undefined for coverage_rate=0 "
                    "(the penalty term requires division by the coverage rate)."
                )
            rate_data = {}
            for col in target_columns:
                lower_col = f"{col}_lower_{rate}"
                upper_col = f"{col}_upper_{rate}"
                if lower_col in y_pred.columns and upper_col in y_pred.columns:
                    width = (y_pred[upper_col] - y_pred[lower_col]).abs()

                    lower_penalty = (
                        pl
                        .when(y_truth[col] < y_pred[lower_col])
                        .then((2.0 / rate) * (y_pred[lower_col] - y_truth[col]))
                        .otherwise(0.0)
                    )
                    upper_penalty = (
                        pl
                        .when(y_truth[col] > y_pred[upper_col])
                        .then((2.0 / rate) * (y_truth[col] - y_pred[upper_col]))
                        .otherwise(0.0)
                    )

                    lower_penalty_series = y_pred.select(lower_penalty.alias("lp"))["lp"]
                    upper_penalty_series = y_pred.select(upper_penalty.alias("up"))["up"]

                    rate_data[col] = width + lower_penalty_series + upper_penalty_series
            frames.append(pl.DataFrame(rate_data).with_columns(pl.lit(rate).alias("coverage_rate")))
        return pl.concat(frames)

Tutorials

The following example notebooks use this component:

  • How to Use Conformity Scorers


    Evaluation-Search

    Compare Residual, AbsoluteResidual, GammaResidual, and AbsoluteGammaResidual conformity scorers with coverage/width analysis and DistanceSimilarity interaction.


  • How to Evaluate Interval Forecasts


    Evaluation-Search

    Evaluate prediction intervals with EmpiricalCoverage, IntervalScore, MeanIntervalWidth, PinballLoss, and CalibrationError across coverage levels.


  • How to Search Interval Forecaster Hyperparameters


    Evaluation-Search

    Tune interval forecaster parameters directly with interval metrics in GridSearchCV, including mixed point+interval multimetric search.


  • How to Forecast Intervals with CatBoost Multiquantile


    Forecasting-Models

    Use IntervalReductionForecaster with CatBoost's native multiquantile objective for simultaneous lower and upper bound estimation.


  • How to Use Distance-Based Similarity for Intervals


    Forecasting-Models

    Adaptive prediction intervals via similarity-weighted conformal prediction using DistanceSimilarity with configurable distance metrics and bandwidths.


  • How to Build Interval Forecasts with Reduction


    Forecasting-Models

    Wrap any quantile-capable sklearn estimator with IntervalReductionForecaster to produce calibrated prediction intervals across multiple horizons.


  • Conformal Prediction Intervals


    Getting-Started

    Build distribution-free prediction intervals with SplitConformalForecaster using calibration holdouts and configurable conformity scoring functions.
