
CalibrationError

yohou.metrics.interval.CalibrationError

Bases: BaseIntervalScorer

Calibration Error for prediction intervals.

Measures the discrepancy between the nominal coverage rate and the empirical coverage across several coverage rates, indicating whether the intervals are well calibrated.

The calibration error is:

\[\text{CalibError} = \frac{1}{K}\sum_{k=1}^{K} |\text{Coverage}(\alpha_k) - \alpha_k|\]

where K is the number of coverage rates.
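
As a minimal sketch of the formula above, written in plain Python and independent of yohou's internals (the helper names `empirical_coverage` and `calibration_error` are hypothetical), the per-rate coverage gaps can be averaged like this:

```python
def empirical_coverage(y_true, lower, upper):
    """Fraction of observations that fall inside [lower, upper]."""
    hits = [lo <= y <= up for y, lo, up in zip(y_true, lower, upper)]
    return sum(hits) / len(hits)


def calibration_error(y_true, intervals):
    """Average |Coverage(alpha_k) - alpha_k| over the K nominal rates.

    `intervals` maps each nominal rate alpha_k to a (lower, upper) pair of lists.
    """
    gaps = [
        abs(empirical_coverage(y_true, lower, upper) - rate)
        for rate, (lower, upper) in intervals.items()
    ]
    return sum(gaps) / len(gaps)


y_true = [10.0, 20.0]
intervals = {
    0.9: ([8.0, 18.0], [12.0, 22.0]),
    0.95: ([7.0, 17.0], [13.0, 23.0]),
}
result = calibration_error(y_true, intervals)
```

Both intervals cover every observation here, so the empirical coverage is 1.0 at each rate and the result is approximately (0.1 + 0.05) / 2 = 0.075.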

Parameters

Name Type Description Default
aggregation_method list of str or str

Dimensions to collapse when aggregating scores. Orthogonal modes:

  • "stepwise": Collapse the forecasting-step dimension.
  • "vintagewise": Collapse the vintage/observed-time dimension.
  • "componentwise": Collapse components, return per-timestep scores.
  • "groupwise": Collapse panel groups (panel data only).
  • "coveragewise": Collapse coverage rates (return average calibration error).
  • "all": Collapse all dimensions (returns a scalar).

"all"
coverage_rates list of float, dict of float to float, or None

Coverage rate filter (list) or filter with weights (dict).

None
groups list of str, dict of str to float, or None

Panel group filter (list) or filter with weights (dict).

None
components list of str, dict of str to float, or None

Component filter (list) or filter with weights (dict).

None
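
The parameter descriptions do not spell out how dict weights are applied. One plausible reading, sketched below as an assumption rather than yohou's confirmed behaviour, is that a list keeps only the listed rates with equal weight, while a dict keeps its keys and uses the values as relative weights in a weighted mean. The helper `weighted_mean_error` is hypothetical.

```python
def weighted_mean_error(per_rate_error, coverage_rates=None):
    """Combine per-rate errors using a list filter or a dict of weights.

    ASSUMPTION: dict values act as relative weights in a weighted mean.
    This illustrates one reading of the docs, not yohou's implementation.
    """
    if coverage_rates is None:
        # No filter: every available rate gets equal weight
        weights = {rate: 1.0 for rate in per_rate_error}
    elif isinstance(coverage_rates, dict):
        # Dict: keys select the rates, values weight them
        weights = dict(coverage_rates)
    else:
        # List: keep only the listed rates, equally weighted
        weights = {rate: 1.0 for rate in coverage_rates}
    total = sum(weights.values())
    return sum(per_rate_error[rate] * w for rate, w in weights.items()) / total


per_rate_error = {0.5: 0.2, 0.9: 0.1, 0.95: 0.05}
unfiltered = weighted_mean_error(per_rate_error)
filtered = weighted_mean_error(per_rate_error, [0.9, 0.95])
weighted = weighted_mean_error(per_rate_error, {0.9: 3.0, 0.95: 1.0})
```

Under this reading, the list filter averages the 0.9 and 0.95 errors equally, while the dict pulls the result towards the 0.9 error because it carries three times the weight.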

Attributes

Name Type Description
lower_is_better bool

True for calibration error (lower is better).

Examples

>>> import polars as pl
>>> from datetime import datetime
>>> from yohou.metrics import CalibrationError
>>> y_true = pl.DataFrame({"time": [datetime(2020, 1, 1), datetime(2020, 1, 2)], "value": [10.0, 20.0]})
>>> y_pred = pl.DataFrame({
...     "vintage_time": [datetime(2019, 12, 31)] * 2,
...     "time": [datetime(2020, 1, 1), datetime(2020, 1, 2)],
...     "value_lower_0.9": [8.0, 18.0],
...     "value_upper_0.9": [12.0, 22.0],
...     "value_lower_0.95": [7.0, 17.0],
...     "value_upper_0.95": [13.0, 23.0],
... })
>>> error = CalibrationError()
>>> _ = error.fit(y_true)
>>> error.score(y_true, y_pred)
0.0...

Notes

  • Lower is better (0 = perfect calibration)
  • Aggregates coverage errors across all rates
  • Scale-independent (always between 0 and 1)
  • Requires at least 2 coverage rates; score raises ValueError otherwise
  • Missing values are excluded from computation

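See Also

  • EmpiricalCoverage : Per-rate coverage metric
  • IntervalScore : Combined coverage and sharpness metric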

Source Code

class CalibrationError(BaseIntervalScorer):
    r"""Calibration Error for prediction intervals.

    Measures the discrepancy between nominal coverage rate and empirical coverage
    across different rates. Indicates if intervals are well-calibrated.

    The calibration error is:

    $$\text{CalibError} = \frac{1}{K}\sum_{k=1}^{K} |\text{Coverage}(\alpha_k) - \alpha_k|$$

    where K is the number of coverage rates.

    Parameters
    ----------
    aggregation_method : list of str or str, default="all"
        Dimensions to collapse when aggregating scores. Orthogonal modes:

        - "stepwise": Collapse the forecasting-step dimension.
        - "vintagewise": Collapse the vintage/observed-time dimension.
        - "componentwise": Collapse components, return per-timestep scores.
        - "groupwise": Collapse panel groups (panel data only).
        - "coveragewise": Collapse coverage rates (return average calibration error).

        - "all": Collapse all dimensions (returns scalar).
    coverage_rates : list of float, dict of float to float, or None, default=None
        Coverage rate filter (list) or filter with weights (dict).
    groups : list of str, dict of str to float, or None, default=None
        Panel group filter (list) or filter with weights (dict).
    components : list of str, dict of str to float, or None, default=None
        Component filter (list) or filter with weights (dict).

    Attributes
    ----------
    lower_is_better : bool
        True for calibration error (lower is better).

    Examples
    --------
    >>> import polars as pl
    >>> from datetime import datetime
    >>> from yohou.metrics import CalibrationError
    >>> y_true = pl.DataFrame({"time": [datetime(2020, 1, 1), datetime(2020, 1, 2)], "value": [10.0, 20.0]})
    >>> y_pred = pl.DataFrame({
    ...     "vintage_time": [datetime(2019, 12, 31)] * 2,
    ...     "time": [datetime(2020, 1, 1), datetime(2020, 1, 2)],
    ...     "value_lower_0.9": [8.0, 18.0],
    ...     "value_upper_0.9": [12.0, 22.0],
    ...     "value_lower_0.95": [7.0, 17.0],
    ...     "value_upper_0.95": [13.0, 23.0],
    ... })
    >>> error = CalibrationError()
    >>> _ = error.fit(y_true)
    >>> error.score(y_true, y_pred)  # doctest: +ELLIPSIS
    0.0...

    Notes
    -----
    - Lower is better (0 = perfect calibration)
    - Aggregates coverage errors across all rates
    - Scale-independent (always between 0 and 1)
    - Requires at least 2 coverage rates; ``score`` raises ValueError otherwise
    - Missing values are excluded from computation

    See Also
    --------
    - [`EmpiricalCoverage`][yohou.metrics.interval.EmpiricalCoverage] : Per-rate coverage metric
    - [`IntervalScore`][yohou.metrics.interval.IntervalScore] : Combined coverage and sharpness metric

    """

    _parameter_constraints: dict = {
        **BaseIntervalScorer._parameter_constraints,
    }

    _metric_name = "calibration_error"

    def __init__(
        self,
        aggregation_method: list[str] | str = "all",
        coverage_rates: list[float] | dict[float, float] | None = None,
        groups: list[str] | dict[str, float] | None = None,
        components: list[str] | dict[str, float] | None = None,
    ) -> None:
        agg_list = aggregation_method
        if aggregation_method == "all":
            agg_list = ["stepwise", "vintagewise", "componentwise", "groupwise", "coveragewise"]

        super().__init__(
            aggregation_method=agg_list,
            coverage_rates=coverage_rates,
            groups=groups,
            components=components,
        )

    def _compute_raw_scores(self, y_truth, y_pred, coverage_rates, target_columns):
        """Compute per-row calibration error values."""
        frames = []
        for rate in coverage_rates:
            rate_data = {}
            for col in target_columns:
                lower_col = f"{col}_lower_{rate}"
                upper_col = f"{col}_upper_{rate}"
                if lower_col in y_pred.columns and upper_col in y_pred.columns:
                    in_interval = (y_truth[col] >= y_pred[lower_col]) & (y_truth[col] <= y_pred[upper_col])
                    rate_data[col] = (in_interval.cast(pl.Float64) - rate).abs()
            frames.append(pl.DataFrame(rate_data).with_columns(pl.lit(rate).alias("coverage_rate")))
        return pl.concat(frames)

    def score(  # type: ignore
        self, y_truth: pl.DataFrame, y_pred: pl.DataFrame, /, **params
    ) -> pl.DataFrame | float | dict[str | float, float | pl.DataFrame]:
        """Compute calibration error.

        Parameters
        ----------
        y_truth : pl.DataFrame
            True values with "time" column.
        y_pred : pl.DataFrame
            Predicted intervals with "{col}_lower_{rate}", "{col}_upper_{rate}" columns.
        **params : dict
            Metadata to route to nested estimators.

        Returns
        -------
        float or pl.DataFrame
            Calibration error score.

        Raises
        ------
        ValueError
            If fewer than 2 coverage rates are provided.

        """
        # Validate minimum coverage rates before delegating
        rates = self._extract_coverage_rates(y_pred)
        if len(rates) < 2:
            msg = (
                f"CalibrationError requires at least 2 coverage rates, "
                f"but only {len(rates)} provided. "
                f"Use multiple coverage rates to compute calibration error."
            )
            raise ValueError(msg)

        return super().score(y_truth, y_pred, **params)
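
To make the per-row step in `_compute_raw_scores` concrete, here is a plain-Python sketch of the same arithmetic for a single target column and a single rate. The function name `raw_scores` is hypothetical, and how `BaseIntervalScorer` later aggregates these rows is not shown on this page.

```python
def raw_scores(y_true, lower, upper, rate):
    """Per-row |1[lower <= y <= upper] - rate|, mirroring the polars expression."""
    return [
        abs(float(lo <= y <= up) - rate)
        for y, lo, up in zip(y_true, lower, upper)
    ]


y_true = [10.0, 20.0, 30.0]
lower = [8.0, 18.0, 31.0]  # the third interval misses the observation
upper = [12.0, 22.0, 35.0]
scores = raw_scores(y_true, lower, upper, rate=0.9)
```

A covered row contributes `1 - rate` and a missed row contributes `rate`, so the three rows above yield roughly 0.1, 0.1 and 0.9.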

Methods

score(y_truth, y_pred, /, **params)

Compute calibration error.

Parameters
Name Type Description Default
y_truth DataFrame

True values with "time" column.

required
y_pred DataFrame

Predicted intervals with "{col}_lower_{rate}", "{col}_upper_{rate}" columns.

required
**params dict

Metadata to route to nested estimators.

{}
Returns
Type Description
float or DataFrame

Calibration error score.

Raises
Type Description
ValueError

If fewer than 2 coverage rates are provided.
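
The body of `_extract_coverage_rates` is not shown on this page. As a rough sketch of how the rates could be recovered from the `"{col}_lower_{rate}"` / `"{col}_upper_{rate}"` naming convention and then checked against the two-rate minimum, assuming a simple regular expression is enough (the helpers `extract_rates` and `check_min_rates` are hypothetical):

```python
import re

# Matches names such as "value_lower_0.9" or "value_upper_0.95"
_INTERVAL_COLUMN = re.compile(r"^(?P<col>.+)_(?:lower|upper)_(?P<rate>\d*\.?\d+)$")


def extract_rates(columns):
    """Return the sorted, de-duplicated coverage rates found in column names."""
    rates = set()
    for name in columns:
        match = _INTERVAL_COLUMN.match(name)
        if match:
            rates.add(float(match.group("rate")))
    return sorted(rates)


def check_min_rates(columns, minimum=2):
    """Raise ValueError when fewer than `minimum` coverage rates are present."""
    rates = extract_rates(columns)
    if len(rates) < minimum:
        raise ValueError(
            f"requires at least {minimum} coverage rates, but only {len(rates)} provided"
        )
    return rates


columns = [
    "vintage_time", "time",
    "value_lower_0.9", "value_upper_0.9",
    "value_lower_0.95", "value_upper_0.95",
]
rates = check_min_rates(columns)
```

Columns that do not follow the convention, such as `"time"` and `"vintage_time"`, are simply ignored. A frame with only the 0.9 interval columns would trigger the `ValueError`.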

Source Code
def score(  # type: ignore
    self, y_truth: pl.DataFrame, y_pred: pl.DataFrame, /, **params
) -> pl.DataFrame | float | dict[str | float, float | pl.DataFrame]:
    """Compute calibration error.

    Parameters
    ----------
    y_truth : pl.DataFrame
        True values with "time" column.
    y_pred : pl.DataFrame
        Predicted intervals with "{col}_lower_{rate}", "{col}_upper_{rate}" columns.
    **params : dict
        Metadata to route to nested estimators.

    Returns
    -------
    float or pl.DataFrame
        Calibration error score.

    Raises
    ------
    ValueError
        If fewer than 2 coverage rates are provided.

    """
    # Validate minimum coverage rates before delegating
    rates = self._extract_coverage_rates(y_pred)
    if len(rates) < 2:
        msg = (
            f"CalibrationError requires at least 2 coverage rates, "
            f"but only {len(rates)} provided. "
            f"Use multiple coverage rates to compute calibration error."
        )
        raise ValueError(msg)

    return super().score(y_truth, y_pred, **params)

Tutorials

The following example notebooks use this component:

  • How to Evaluate Interval Forecasts


    Evaluation-Search

    Evaluate prediction intervals with EmpiricalCoverage, IntervalScore, MeanIntervalWidth, PinballLoss, and CalibrationError across coverage levels.
