IntervalScore

`yohou.metrics.interval.IntervalScore`

Bases: BaseIntervalScorer

Interval Score (Winkler Score) for prediction intervals.

Combines interval width with penalties for observations falling outside the interval. Balances sharpness and coverage in a single metric.

The interval score for coverage rate α is:

\[\text{IS}(\alpha) = \frac{1}{n}\sum_{i=1}^{n} \left[|U_i - L_i| + \frac{2}{\alpha}(L_i - y_i)\mathbb{1}(y_i < L_i) + \frac{2}{\alpha}(y_i - U_i)\mathbb{1}(y_i > U_i)\right]\]

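where L_i and U_i are the lower and upper interval bounds and y_i is the observed value. The first term is the interval width, the second term penalizes observations that fall below the lower bound, and the third term penalizes observations that fall above the upper bound.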
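
To make the three terms concrete, here is a minimal pure-Python sketch of the formula as it is written on this page, with the penalty factor 2 divided by the coverage rate. The function name `interval_score` and its arguments are illustrative assumptions, not part of the yohou API. Many references instead define α as the miscoverage level (one minus the coverage), so results may differ from other implementations.

```python
from typing import Sequence


def interval_score(
    y: Sequence[float],
    lower: Sequence[float],
    upper: Sequence[float],
    rate: float,
) -> float:
    """Mean interval score, following the formula on this page (factor 2 / rate)."""
    if rate == 0:
        raise ValueError("rate must be non-zero")
    total = 0.0
    for y_i, l_i, u_i in zip(y, lower, upper):
        score = abs(u_i - l_i)  # width term
        if y_i < l_i:  # observation below the lower bound
            score += (2.0 / rate) * (l_i - y_i)
        if y_i > u_i:  # observation above the upper bound
            score += (2.0 / rate) * (y_i - u_i)
        total += score
    return total / len(y)


# Both observations inside their intervals: only the width (4.0) contributes.
covered = interval_score([10.0, 20.0], [8.0, 18.0], [12.0, 22.0], rate=0.9)

# Second observation is 3 units above its upper bound, adding (2 / 0.9) * 3.
missed = interval_score([10.0, 25.0], [8.0, 18.0], [12.0, 22.0], rate=0.9)

print(covered)           # 4.0
print(round(missed, 4))  # 7.3333
```

A single miss of 3 units raises the average from 4.0 to roughly 7.33, which shows how quickly the penalty terms dominate the width term.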

Parameters

Name	Type	Description	Default
`aggregation_method`	`list of str or str`	Dimensions to collapse when aggregating scores. Orthogonal modes: "stepwise": Collapse the forecasting-step dimension. "vintagewise": Collapse the vintage/observed-time dimension. "componentwise": Collapse components, return per-timestep scores. "groupwise": Collapse panel groups (panel data only). "coveragewise": Collapse coverage rates (return average interval score). "all": Collapse all dimensions (returns scalar).	`"all"`
`coverage_rates`	`list of float, dict of float to float, or None`	Coverage rate filter (list) or filter with weights (dict).	`None`
`groups`	`list of str, dict of str to float, or None`	Panel group filter (list) or filter with weights (dict).	`None`
`components`	`list of str, dict of str to float, or None`	Component filter (list) or filter with weights (dict).	`None`

Attributes

Name	Type	Description
`lower_is_better`	`bool`	True for interval score (lower is better).

Examples

>>> import polars as pl
>>> from datetime import datetime
>>> from yohou.metrics import IntervalScore
>>> y_true = pl.DataFrame({"time": [datetime(2020, 1, 1), datetime(2020, 1, 2)], "value": [10.0, 20.0]})
>>> y_pred = pl.DataFrame({
...     "vintage_time": [datetime(2019, 12, 31)] * 2,
...     "time": [datetime(2020, 1, 1), datetime(2020, 1, 2)],
...     "value_lower_0.9": [8.0, 18.0],
...     "value_upper_0.9": [12.0, 22.0],
... })
>>> scorer = IntervalScore()
>>> _ = scorer.fit(y_true)
>>> scorer.score(y_true, y_pred)
4.0

Notes

- Lower is better
- Penalizes both wide intervals and poor coverage
- Scale-dependent (same units as target)
- Also known as "Winkler score" in literature
- A scaled variant (MSIS) was used to evaluate prediction intervals in the M4 forecasting competition
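
The scale dependence noted above can be checked with a short pure-Python sketch. It reimplements the formula from this page in compact form (the helper `interval_score` is an illustrative assumption, not the library's scorer) and rescales the same data by a constant factor.

```python
def interval_score(y, lower, upper, rate):
    """Mean interval score with penalty factor 2 / rate, as in the formula on this page."""
    rows = [
        abs(u - l) + (2.0 / rate) * (max(l - t, 0.0) + max(t - u, 0.0))
        for t, l, u in zip(y, lower, upper)
    ]
    return sum(rows) / len(rows)


y, lower, upper = [10.0, 25.0], [8.0, 18.0], [12.0, 22.0]
base = interval_score(y, lower, upper, rate=0.9)

# Express the same observations and bounds in units 1000 times smaller.
scale = 1000.0
scaled = interval_score(
    [v * scale for v in y],
    [v * scale for v in lower],
    [v * scale for v in upper],
    rate=0.9,
)

# The score grows by the same factor as the data.
ratio = scaled / base
print(ratio)
```

Because the score carries the units of the target, values are only comparable across series that share a scale.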

Source Code

class IntervalScore(BaseIntervalScorer):
    r"""Interval Score (Winkler Score) for prediction intervals.

    Combines interval width with penalties for observations falling outside
    the interval. Balances sharpness and coverage in a single metric.

    The interval score for coverage rate α is:

    $$\text{IS}(\alpha) = \frac{1}{n}\sum_{i=1}^{n} \left[|U_i - L_i| + \frac{2}{\alpha}(L_i - y_i)\mathbb{1}(y_i < L_i) + \frac{2}{\alpha}(y_i - U_i)\mathbb{1}(y_i > U_i)\right]$$

    where the first term is the interval width, the second term penalizes
    observations below the lower bound, and the third term penalizes
    observations above the upper bound.

    Parameters
    ----------
    aggregation_method : list of str or str, default="all"
        Dimensions to collapse when aggregating scores. Orthogonal modes:

        - "stepwise": Collapse the forecasting-step dimension.
        - "vintagewise": Collapse the vintage/observed-time dimension.
        - "componentwise": Collapse components, return per-timestep scores.
        - "groupwise": Collapse panel groups (panel data only).
        - "coveragewise": Collapse coverage rates (return average interval score).

        - "all": Collapse all dimensions (returns scalar).
    coverage_rates : list of float, dict of float to float, or None, default=None
        Coverage rate filter (list) or filter with weights (dict).
    groups : list of str, dict of str to float, or None, default=None
        Panel group filter (list) or filter with weights (dict).
    components : list of str, dict of str to float, or None, default=None
        Component filter (list) or filter with weights (dict).

    Attributes
    ----------
    lower_is_better : bool
        True for interval score (lower is better).

    Examples
    --------
    >>> import polars as pl
    >>> from datetime import datetime
    >>> from yohou.metrics import IntervalScore
    >>> y_true = pl.DataFrame({"time": [datetime(2020, 1, 1), datetime(2020, 1, 2)], "value": [10.0, 20.0]})
    >>> y_pred = pl.DataFrame({
    ...     "vintage_time": [datetime(2019, 12, 31)] * 2,
    ...     "time": [datetime(2020, 1, 1), datetime(2020, 1, 2)],
    ...     "value_lower_0.9": [8.0, 18.0],
    ...     "value_upper_0.9": [12.0, 22.0],
    ... })
    >>> scorer = IntervalScore()
    >>> _ = scorer.fit(y_true)
    >>> scorer.score(y_true, y_pred)
    4.0

    Notes
    -----
    - Lower is better
    - Penalizes both wide intervals and poor coverage
    - Scale-dependent (same units as target)
    - Also known as "Winkler score" in literature
    - A scaled variant (MSIS) was used to evaluate prediction intervals in the
      M4 forecasting competition

    See Also
    --------
    - [`EmpiricalCoverage`][yohou.metrics.interval.EmpiricalCoverage] : Coverage-only metric
    - [`MeanIntervalWidth`][yohou.metrics.interval.MeanIntervalWidth] : Width-only metric
    - [`PinballLoss`][yohou.metrics.interval.PinballLoss] : Asymmetric quantile-based metric

    """

    _parameter_constraints: dict = {
        **BaseIntervalScorer._parameter_constraints,
    }

    _metric_name = "interval_score"

    def __init__(
        self,
        aggregation_method: list[str] | str = "all",
        coverage_rates: list[float] | dict[float, float] | None = None,
        groups: list[str] | dict[str, float] | None = None,
        components: list[str] | dict[str, float] | None = None,
    ) -> None:
        agg_list = aggregation_method
        if aggregation_method == "all":
            agg_list = ["stepwise", "vintagewise", "componentwise", "groupwise", "coveragewise"]

        super().__init__(
            aggregation_method=agg_list,
            coverage_rates=coverage_rates,
            groups=groups,
            components=components,
        )

    def _compute_raw_scores(self, y_truth, y_pred, coverage_rates, target_columns):
        """Compute per-row interval score values."""
        frames = []
        for rate in coverage_rates:
            if rate == 0:
                raise ValueError(
                    "IntervalScore is undefined for coverage_rate=0 "
                    "(the penalty term requires division by the coverage rate)."
                )
            rate_data = {}
            for col in target_columns:
                lower_col = f"{col}_lower_{rate}"
                upper_col = f"{col}_upper_{rate}"
                if lower_col in y_pred.columns and upper_col in y_pred.columns:
                    width = (y_pred[upper_col] - y_pred[lower_col]).abs()

                    lower_penalty = (
                        pl
                        .when(y_truth[col] < y_pred[lower_col])
                        .then((2.0 / rate) * (y_pred[lower_col] - y_truth[col]))
                        .otherwise(0.0)
                    )
                    upper_penalty = (
                        pl
                        .when(y_truth[col] > y_pred[upper_col])
                        .then((2.0 / rate) * (y_truth[col] - y_pred[upper_col]))
                        .otherwise(0.0)
                    )

                    lower_penalty_series = y_pred.select(lower_penalty.alias("lp"))["lp"]
                    upper_penalty_series = y_pred.select(upper_penalty.alias("up"))["up"]

                    rate_data[col] = width + lower_penalty_series + upper_penalty_series
            frames.append(pl.DataFrame(rate_data).with_columns(pl.lit(rate).alias("coverage_rate")))
        return pl.concat(frames)
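
The listing above looks up bound columns by interpolating the coverage rate into an f-string, and as written it skips a target whose columns are not found. The sketch below shows what that naming rule implies for floating-point rates. The helper `interval_columns` and the `available` set are hypothetical, introduced only for illustration, and validation elsewhere in the library may catch mismatches earlier.

```python
def interval_columns(target: str, rate: float) -> tuple[str, str]:
    """Build lower/upper column names the same way the f-strings in the listing do."""
    return f"{target}_lower_{rate}", f"{target}_upper_{rate}"


# Hypothetical prediction columns for a 0.3 coverage interval.
available = {"value_lower_0.3", "value_upper_0.3"}

for rate in (0.3, 0.1 + 0.2):
    lower_col, upper_col = interval_columns("value", rate)
    found = lower_col in available and upper_col in available
    print(lower_col, found)

# value_lower_0.3 True
# value_lower_0.30000000000000004 False
```

A rate produced by arithmetic may not format to the same string as the literal used in the column name, so passing rates as literals that match the column suffix avoids the mismatch.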

Tutorials

The following example notebooks use this component:

How to Use Conformity Scorers

Evaluation-Search

Compare Residual, AbsoluteResidual, GammaResidual, and AbsoluteGammaResidual conformity scorers with coverage/width analysis and DistanceSimilarity interaction.

How to Evaluate Interval Forecasts

Evaluation-Search

Evaluate prediction intervals with EmpiricalCoverage, IntervalScore, MeanIntervalWidth, PinballLoss, and CalibrationError across coverage levels.

How to Search Interval Forecaster Hyperparameters

Evaluation-Search

Tune interval forecaster parameters directly with interval metrics in GridSearchCV, including mixed point+interval multimetric search.

How to Forecast Intervals with CatBoost Multiquantile

Forecasting-Models

Use IntervalReductionForecaster with CatBoost's native multiquantile objective for simultaneous lower and upper bound estimation.

How to Use Distance-Based Similarity for Intervals

Forecasting-Models

Adaptive prediction intervals via similarity-weighted conformal prediction using DistanceSimilarity with configurable distance metrics and bandwidths.

How to Build Interval Forecasts with Reduction

Forecasting-Models

Wrap any quantile-capable sklearn estimator with IntervalReductionForecaster to produce calibrated prediction intervals across multiple horizons.

Conformal Prediction Intervals

Getting-Started

Build distribution-free prediction intervals with SplitConformalForecaster using calibration holdouts and configurable conformity scoring functions.
