plot_score_per_step

`yohou.plotting.evaluation.plot_score_per_step(scorer, y_truth, y_pred, *, kind='line', compare_by='scorer', show_trend=False, columns=None, groups=None, facet_by='member', facet_n_cols=2, color_palette=None, show_legend=True, title=None, x_label=None, y_label=None, width=None, height=None, line_width=2.0, marker_size=8.0, marker_opacity=0.8, bar_opacity=0.85)`

Plot scorer value by forecast horizon step.

For each step h in the forecast window, compute the scorer between `y_truth` and `y_pred` at row h and plot the result. The resulting curve shows how forecast accuracy changes, and typically degrades, as the horizon increases.
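
The per-step idea can be illustrated without the library. The following is a minimal plain-Python sketch, not yohou's implementation: it uses absolute error as a stand-in for the scorer, assumes a single vintage whose rows are already ordered by horizon step, and the names `truth`, `pred_rows`, and `abs_error_per_step` are hypothetical.

```python
from datetime import datetime

# Toy data: one forecast vintage covering five horizon steps.
truth = {
    datetime(2020, 1, d): v
    for d, v in zip(range(1, 6), [10.0, 20.0, 30.0, 40.0, 50.0])
}
pred_rows = [  # (time, predicted value), ordered by horizon step
    (datetime(2020, 1, d), v)
    for d, v in zip(range(1, 6), [12.0, 19.0, 28.0, 42.0, 48.0])
]


def abs_error_per_step(truth, pred_rows):
    """Return [(step, |truth - pred|)] with steps counted from 1."""
    return [
        (h, abs(truth[t] - p))
        for h, (t, p) in enumerate(pred_rows, start=1)
    ]


per_step = abs_error_per_step(truth, pred_rows)
for step, err in per_step:
    print(step, err)
```

Each `(step, err)` pair corresponds to one point on the x-axis ("Horizon Step") and y-axis of the plot described above.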

Parameters

Name	Type	Description	Default
`scorer`	`BaseScorer or dict[str, BaseScorer]`	Yohou scorer instance. Will be cloned with `aggregation_method="componentwise"`. If BaseScorer: single scorer to evaluate. If dict: keys are scorer names, values are scorer instances. When combined with dict `y_pred`, the `compare_by` parameter controls which dimension is faceted vs overlaid.	required
`y_truth`	`DataFrame`	Ground truth with `"time"` column.	required
`y_pred`	`DataFrame or dict[str, DataFrame]`	Predictions with `"vintage_time"` and `"time"` columns. If DataFrame: single forecast. If dict: keys are model names, values are prediction DataFrames.	required
`kind`	`str`	Plot kind: `"line"` or `"bar"`. `"line"`: per-step line chart with markers. `"bar"`: per-step bar chart.	`"line"`
`compare_by`	`str`	When both `scorer` and `y_pred` are dicts, controls which dimension is overlaid (colored traces) vs faceted (subplots): `"scorer"`: overlay scorers, facet by model. `"model"`: overlay models, facet by scorer. Ignored when either `scorer` or `y_pred` is not a dict.	`"scorer"`
`show_trend`	`bool`	Overlay a dashed linear trend line (`np.polyfit` degree 1) on each trace. The trend is drawn only when at least two steps are scored.	`False`
`columns`	`str \| list[str] \| None`	Target column name(s) to score. When groups is set this acts as a member postfix filter. `None` uses all columns.	`None`
`groups`	`list[str] \| None`	Panel group prefixes to plot (faceted layout). `None` plots every group detected in `y_truth`. Panel data supports a single scorer only.	`None`
`facet_by`	`Literal['group', 'member'] \| None`	Faceting axis for panel data. `"group"` creates one subplot per group, `"member"` one per member. `None` disables faceting. Ignored for non-panel data.	`"member"`
`facet_n_cols`	`int`	Columns in the faceted grid.	`2`
`color_palette`	`list[str] \| None`	Custom colour palette.	`None`
`show_legend`	`bool`	Whether to show the legend.	`True`
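`title`	`str \| None`	Plot title. Defaults to `"<ScorerName> by Horizon Step"`, or `"Score by Horizon Step"` when several scorers are passed.	`None`
`x_label`	`str \| None`	X-axis label. Defaults to `"Horizon Step"`.	`None`
`y_label`	`str \| None`	Y-axis label. Defaults to the scorer class name, or `"Score"` when several scorers are passed.	`None`
`width`	`int \| None`	Plot width in pixels.	`None`
`height`	`int \| None`	Plot height in pixels. When `None`, the scorer-by-model and per-column facet grids use `max(300 * n_rows, 400)`.	`None`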
`line_width`	`float`	Width of score lines.	`2.0`
`marker_size`	`float`	Marker size for line+marker traces.	`8.0`
`marker_opacity`	`float`	Opacity of scatter markers.	`0.8`
`bar_opacity`	`float`	Opacity of bars when `kind="bar"`.	`0.85`
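
The interaction between `compare_by` and `facet_n_cols` can be sketched as plain arithmetic. The snippet below is an illustrative approximation of the layout rule, not the library's code: `facet_layout` and the scorer and model labels are hypothetical names, and the grid is assumed to be filled row by row.

```python
def facet_layout(scorer_names, model_names, compare_by="scorer", facet_n_cols=2):
    """Map each facet label to a 1-based (row, col) cell plus its overlaid traces."""
    if compare_by == "model":
        # One subplot per scorer, models drawn as coloured traces.
        facets, overlays = scorer_names, model_names
    else:
        # One subplot per model, scorers drawn as coloured traces.
        facets, overlays = model_names, scorer_names

    n_cols = min(facet_n_cols, len(facets))
    n_rows = (len(facets) + n_cols - 1) // n_cols  # ceiling division

    cells = {
        facet: (i // n_cols + 1, i % n_cols + 1, list(overlays))
        for i, facet in enumerate(facets)
    }
    return n_rows, n_cols, cells


scorers = ["MAE", "RMSE"]            # hypothetical dict keys for `scorer`
models = ["naive", "ridge", "gbm"]   # hypothetical dict keys for `y_pred`

n_rows, n_cols, cells = facet_layout(scorers, models, compare_by="scorer")
print(n_rows, n_cols)   # 2 2
print(cells["gbm"])     # (2, 1, ['MAE', 'RMSE'])

by_model = facet_layout(scorers, models, compare_by="model")
print(by_model[0], by_model[1])  # 1 2
```

With three models and `facet_n_cols=2`, `compare_by="scorer"` yields a 2x2 grid with one empty cell, while `compare_by="model"` yields a single row with one subplot per scorer.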

Returns

Type	Description
`Figure`	Plotly figure object.

Raises

Type	Description
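`TypeError`	If y_truth or y_pred (including any value of a `y_pred` dict) is not a Polars DataFrame, or if a scorer does not return a DataFrame under componentwise aggregation.
`ValueError`	If kind is not `"line"` or `"bar"`, or if several scorers are passed together with panel data.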

Examples

>>> import polars as pl
>>> from datetime import datetime
>>> from yohou.metrics import MeanAbsoluteError
>>> from yohou.plotting import plot_score_per_step

>>> y_truth = pl.DataFrame({
...     "time": [datetime(2020, 1, i) for i in range(1, 6)],
...     "value": [10.0, 20.0, 30.0, 40.0, 50.0],
... })
>>> y_pred = pl.DataFrame({
...     "vintage_time": [datetime(2019, 12, 31)] * 5,
...     "time": [datetime(2020, 1, i) for i in range(1, 6)],
...     "value": [12.0, 19.0, 28.0, 42.0, 48.0],
... })

>>> fig = plot_score_per_step(MeanAbsoluteError(), y_truth, y_pred)
>>> len(fig.data) >= 1
True
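
The `show_trend` option is documented as a degree-1 `np.polyfit` overlay. The sketch below reproduces that fit with NumPy alone. The score values are hand-computed absolute errors for the example data above, which is assumed (not verified against the library) to match what a mean-absolute-error scorer reports when each step has a single observation.

```python
import numpy as np

steps = np.arange(1, 6)                         # horizon steps 1..5
scores = np.array([2.0, 1.0, 2.0, 2.0, 2.0])    # |truth - pred| per step (assumed)

# Least-squares line through (step, score); coefficients are highest degree first.
slope, intercept = np.polyfit(steps, scores, 1)
trend = np.polyval([slope, intercept], steps)

print(round(slope, 3), round(intercept, 3))  # 0.1 1.5
print(np.round(trend, 2))                    # [1.6 1.7 1.8 1.9 2. ]
```

A positive slope indicates that the score grows with the horizon. With fewer than two scored steps a line cannot be fitted, which is consistent with the trend being skipped in that case.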

Source Code

def plot_score_per_step(
    scorer: BaseScorer | dict[str, BaseScorer],
    y_truth: pl.DataFrame,
    y_pred: pl.DataFrame | dict[str, pl.DataFrame],
    *,
    kind: Literal["line", "bar"] = "line",
    compare_by: Literal["scorer", "model"] = "scorer",
    show_trend: bool = False,
    columns: str | list[str] | None = None,
    groups: list[str] | None = None,
    facet_by: Literal["group", "member"] | None = "member",
    facet_n_cols: int = 2,
    color_palette: list[str] | None = None,
    show_legend: bool = True,
    title: str | None = None,
    x_label: str | None = None,
    y_label: str | None = None,
    width: int | None = None,
    height: int | None = None,
    line_width: float = 2.0,
    marker_size: float = 8.0,
    marker_opacity: float = 0.8,
    bar_opacity: float = 0.85,
) -> go.Figure:
    """Plot scorer value by forecast horizon step.

    For each step *h* in the forecast window, compute the scorer between
    ``y_truth`` and ``y_pred`` at row *h* and plot the result. This
    reveals how forecast accuracy degrades as the horizon increases.

    Parameters
    ----------
    scorer : BaseScorer or dict[str, BaseScorer]
        Yohou scorer instance.  Will be cloned with
        ``aggregation_method="componentwise"``.

        - If BaseScorer: single scorer to evaluate.
        - If dict: keys are scorer names, values are scorer instances.
          When combined with dict ``y_pred``, the ``compare_by``
          parameter controls which dimension is faceted vs overlaid.
    y_truth : pl.DataFrame
        Ground truth with ``"time"`` column.
    y_pred : pl.DataFrame or dict[str, pl.DataFrame]
        Predictions with ``"vintage_time"`` and ``"time"`` columns.

        - If DataFrame: single forecast.
        - If dict: keys are model names, values are prediction DataFrames.
    kind : str, default="line"
        Plot kind: ``"line"`` or ``"bar"``.

        - ``"line"``: per-step line chart with markers.
        - ``"bar"``: per-step bar chart.
    compare_by : str, default="scorer"
        When both ``scorer`` and ``y_pred`` are dicts, controls which
        dimension is overlaid (colored traces) vs faceted (subplots):

        - ``"scorer"``: overlay scorers, facet by model.
        - ``"model"``: overlay models, facet by scorer.

        Ignored when either ``scorer`` or ``y_pred`` is not a dict.
    show_trend : bool, default=False
        Overlay a linear trend line (``np.polyfit`` degree 1).
    columns : str | list[str] | None, default=None
        Target column name(s) to score.  When *groups* is set
        this acts as a member postfix filter.  ``None`` uses all columns.
    groups : list[str] | None, default=None
        Panel group prefixes to plot (faceted layout).
    facet_by : Literal["group", "member"] | None, default="member"
        Faceting axis for panel data. ``"group"`` creates one subplot per
        group, ``"member"`` one per member. ``None`` disables faceting.
        Ignored for non-panel data.
    facet_n_cols : int, default=2
        Columns in the faceted grid.
    color_palette : list[str] | None, default=None
        Custom colour palette.
    show_legend : bool, default=True
        Whether to show the legend.
    title : str | None, default=None
        Plot title. Defaults to ``"<ScorerName> by Horizon Step"``.
    x_label : str | None, default=None
        X-axis label. Defaults to ``"Horizon Step"``.
    y_label : str | None, default=None
        Y-axis label. Defaults to the scorer class name.
    width : int | None, default=None
        Plot width in pixels.
    height : int | None, default=None
        Plot height in pixels.
    line_width : float, default=2.0
        Width of score lines.
    marker_size : float, default=8.0
        Marker size for line+marker traces.
    marker_opacity : float, default=0.8
        Opacity of scatter markers.
    bar_opacity : float, default=0.85
        Opacity of bars when ``kind="bar"``.

    Returns
    -------
    go.Figure
        Plotly figure object.

    Raises
    ------
    TypeError
        If *y_truth* or *y_pred* is not a Polars DataFrame.
    ValueError
        If *kind* is not ``"line"`` or ``"bar"``.

    Examples
    --------
    >>> import polars as pl
    >>> from datetime import datetime
    >>> from yohou.metrics import MeanAbsoluteError
    >>> from yohou.plotting import plot_score_per_step

    >>> y_truth = pl.DataFrame({
    ...     "time": [datetime(2020, 1, i) for i in range(1, 6)],
    ...     "value": [10.0, 20.0, 30.0, 40.0, 50.0],
    ... })
    >>> y_pred = pl.DataFrame({
    ...     "vintage_time": [datetime(2019, 12, 31)] * 5,
    ...     "time": [datetime(2020, 1, i) for i in range(1, 6)],
    ...     "value": [12.0, 19.0, 28.0, 42.0, 48.0],
    ... })

    >>> fig = plot_score_per_step(MeanAbsoluteError(), y_truth, y_pred)
    >>> len(fig.data) >= 1
    True

    See Also
    --------
    [`plot_score_summary`][yohou.plotting.plot_score_summary] : Grouped bar chart of aggregate scores.
    [`plot_score_time_series`][yohou.plotting.plot_score_time_series] : Score values over time.
    [`plot_score_distribution`][yohou.plotting.plot_score_distribution] : Score distribution histogram/KDE.
    """
    validate_plotting_data(y_truth)

    validate_plotting_params(kind=kind, valid_kinds={"line", "bar"}, width=width, height=height)

    y_pred_dict: dict[str, pl.DataFrame] = _normalize_y_pred(y_pred)
    scorer_dict = _normalize_scorers(scorer)

    # Prepare and fit each scorer for componentwise aggregation
    scorer_cw_dict: dict[str, BaseScorer] = {}
    for s_name, s in scorer_dict.items():
        s_cw = _prepare_scorer_for_componentwise(s)
        s_cw.fit(y_truth)
        scorer_cw_dict[s_name] = s_cw

    n_scorers = len(scorer_cw_dict)
    n_models = len(y_pred_dict)
    multi_scorer = n_scorers > 1

    def _render_horizon(
        fig: go.Figure,
        y_truth_sub: pl.DataFrame,
        y_pred_dict_sub: dict[str, pl.DataFrame],
        _colors: list[str],
        scorer_cw_r: BaseScorer,
        _show_legend: bool = True,
        *,
        row: int | None = None,
        col: int | None = None,
        legendgroup: str | None = None,
    ) -> None:
        """Render per-horizon score traces onto *fig*."""
        for idx, (mname, y_pred_m) in enumerate(y_pred_dict_sub.items()):
            validate_plotting_data(y_pred_m)
            scores_df = scorer_cw_r.score(y_truth_sub, y_pred_m)
            if not isinstance(scores_df, pl.DataFrame):
                msg_ = f"Scorer must return DataFrame for componentwise aggregation, got {type(scores_df).__name__}"
                raise TypeError(msg_)

            score_cols = [c for c in scores_df.columns if c not in _SCORER_META_COLS]
            if len(score_cols) == 1:
                score_vals = scores_df[score_cols[0]].drop_nulls().to_numpy()
            else:
                # average across components at each timestep
                score_vals = scores_df.select(score_cols).mean_horizontal().to_numpy()
                score_vals = score_vals[~np.isnan(score_vals)]

            n_steps = len(score_vals)
            steps = np.arange(1, n_steps + 1)
            c = _colors[idx % len(_colors)]

            lg_kwargs: dict = {}
            if legendgroup is not None:
                lg_kwargs["legendgroup"] = legendgroup

            if kind == "line":
                fig.add_trace(
                    go.Scatter(
                        x=steps,
                        y=score_vals,
                        mode="lines+markers",
                        name=mname,
                        showlegend=_show_legend,
                        line={"color": c, "width": line_width},
                        marker={"size": marker_size, "color": c, "opacity": marker_opacity},
                        **lg_kwargs,
                    ),
                    row=row,
                    col=col,
                )
            else:
                fig.add_trace(
                    go.Bar(
                        x=steps,
                        y=score_vals,
                        name=mname,
                        showlegend=_show_legend,
                        marker_color=c,
                        opacity=bar_opacity,
                        **lg_kwargs,
                    ),
                    row=row,
                    col=col,
                )

            if show_trend and n_steps >= 2:
                coeffs = np.polyfit(steps, score_vals, 1)
                trend_y = np.polyval(coeffs, steps)
                fig.add_trace(
                    go.Scatter(
                        x=steps,
                        y=trend_y,
                        mode="lines",
                        name=f"{mname} trend",
                        line={"color": c, "width": 1.5, "dash": "dash"},
                        showlegend=_show_legend,
                        **lg_kwargs,
                    ),
                    row=row,
                    col=col,
                )

    _col_filter: set[str] | None = None
    if columns is not None:
        _col_filter = set([columns] if isinstance(columns, str) else columns)

    # Panel dispatch
    _, _panel_groups = inspect_panel(y_truth)
    _effective_groups: list[str] | None = None
    if groups is not None:
        _effective_groups = groups
    elif _panel_groups:
        _effective_groups = list(_panel_groups)
    if _effective_groups:
        if multi_scorer:
            msg = "Multi-scorer is not supported with panel data in plot_score_per_step. Pass a single scorer instead."
            raise ValueError(msg)

        first_cw = next(iter(scorer_cw_dict.values()))
        colors = resolve_color_palette(color_palette, n_models)

        n_cols_grid = min(len(_effective_groups), facet_n_cols)
        n_rows_grid = (len(_effective_groups) + n_cols_grid - 1) // n_cols_grid
        pfig = make_subplots(
            rows=n_rows_grid,
            cols=n_cols_grid,
            subplot_titles=_effective_groups,
            vertical_spacing=max(0.04, 0.3 / n_rows_grid),
        )
        for g_idx, gname in enumerate(_effective_groups):
            r = g_idx // n_cols_grid + 1
            c_i = g_idx % n_cols_grid + 1
            g_cols_truth = [
                cn
                for cn in y_truth.columns
                if cn == "time"
                or (cn.startswith(f"{gname}__") and (_col_filter is None or _member_name(cn) in _col_filter))
            ]
            y_truth_g = y_truth.select(g_cols_truth) if len(g_cols_truth) > 1 else y_truth
            y_pred_dict_g: dict[str, pl.DataFrame] = {}
            for mname, y_pred_m in y_pred_dict.items():
                gp_cols = [
                    cn
                    for cn in y_pred_m.columns
                    if cn in ("time", "vintage_time")
                    or (cn.startswith(f"{gname}__") and (_col_filter is None or _member_name(cn) in _col_filter))
                ]
                y_pred_dict_g[mname] = y_pred_m.select(gp_cols) if len(gp_cols) > 2 else y_pred_m
            _render_horizon(
                pfig,
                y_truth_g,
                y_pred_dict_g,
                colors,
                first_cw,
                show_legend and g_idx == 0,
                row=r,
                col=c_i,
            )

        first_scorer = next(iter(scorer_dict.values()))
        scorer_name = first_scorer.__class__.__name__
        pfig = apply_default_layout(
            pfig,
            title=title or f"{scorer_name} by Horizon Step",
            x_label=x_label or "Horizon Step",
            y_label=y_label or scorer_name,
            width=width,
            height=height,
        )
        if kind == "bar" and n_models > 1:
            pfig.update_layout(barmode="group")
        pfig.update_layout(showlegend=show_legend)
        return pfig

    # Multi-scorer + multi-model -> faceted subplots
    if multi_scorer and n_models > 1:
        _warn_large_grid(n_scorers, n_models)

        if compare_by == "model":
            facet_labels = list(scorer_cw_dict.keys())
            overlay_labels = list(y_pred_dict.keys())
        else:
            facet_labels = list(y_pred_dict.keys())
            overlay_labels = list(scorer_cw_dict.keys())

        n_facets = len(facet_labels)
        n_cols_f = min(facet_n_cols, n_facets)
        n_rows_f = (n_facets + n_cols_f - 1) // n_cols_f
        colors = resolve_color_palette(color_palette, len(overlay_labels))

        pfig = make_subplots(
            rows=n_rows_f,
            cols=n_cols_f,
            subplot_titles=facet_labels,
            vertical_spacing=_subplot_spacing(n_rows_f),
        )

        legend_tracker = LegendTracker()
        for facet_idx, facet_label in enumerate(facet_labels):
            r = facet_idx // n_cols_f + 1
            c_i = facet_idx % n_cols_f + 1
            for overlay_idx, overlay_label in enumerate(overlay_labels):
                if compare_by == "model":
                    s_cw = scorer_cw_dict[facet_label]
                    ypd_sub = {overlay_label: y_pred_dict[overlay_label]}
                else:
                    s_cw = scorer_cw_dict[overlay_label]
                    ypd_sub = {overlay_label: y_pred_dict[facet_label]}
                _render_horizon(
                    pfig,
                    y_truth,
                    ypd_sub,
                    [colors[overlay_idx]],
                    s_cw,
                    legend_tracker.should_show(overlay_label),
                    row=r,
                    col=c_i,
                )

        pfig = apply_default_layout(
            pfig,
            title=title or "Score by Horizon Step",
            x_label=x_label or "Horizon Step",
            y_label=y_label or "Score",
            width=width,
            height=height or max(300 * n_rows_f, 400),
        )
        if kind == "bar":
            pfig.update_layout(barmode="group")
        pfig.update_layout(showlegend=show_legend)
        return pfig

    # Non-panel single figure

    # Determine effective truth/pred for column filtering
    if _col_filter is not None:
        _keep_truth = ["time"] + [c for c in y_truth.columns if c != "time" and c in _col_filter]
        _yt_eff = y_truth.select(_keep_truth)
        _ypd_eff = {
            k: v.select([c for c in v.columns if c in ("time", "vintage_time") or c in _col_filter])
            for k, v in y_pred_dict.items()
        }
    else:
        _yt_eff = y_truth
        _ypd_eff = y_pred_dict

    # Detect number of score components
    _score_cols = [c for c in _yt_eff.columns if c != "time"]

    if multi_scorer:
        # Overlay scorers (single model)
        colors = resolve_color_palette(color_palette, n_scorers)
        y_pred_single = next(iter(_ypd_eff.values()))
        fig = go.Figure()
        for idx, (s_name, s_cw) in enumerate(scorer_cw_dict.items()):
            _render_horizon(
                fig,
                _yt_eff,
                {s_name: y_pred_single},
                [colors[idx]],
                s_cw,
                _show_legend=show_legend,
            )
    elif len(_score_cols) > 1:
        # Multi-component: create subplots per component, group legend per model
        first_cw = next(iter(scorer_cw_dict.values()))
        colors = resolve_color_palette(color_palette, n_models)
        n_comps = len(_score_cols)
        n_cols_c = min(facet_n_cols, n_comps)
        n_rows_c = (n_comps + n_cols_c - 1) // n_cols_c

        fig = make_subplots(
            rows=n_rows_c,
            cols=n_cols_c,
            subplot_titles=_score_cols,
            vertical_spacing=_subplot_spacing(n_rows_c),
        )

        legend_tracker = LegendTracker()
        for comp_idx, comp_col in enumerate(_score_cols):
            r = comp_idx // n_cols_c + 1
            c_i = comp_idx % n_cols_c + 1
            yt_comp = _yt_eff.select(["time", comp_col])
            ypd_comp = {
                k: v.select([c for c in v.columns if c in ("time", "vintage_time", comp_col)])
                for k, v in _ypd_eff.items()
            }
            for m_idx, (mname, y_pred_m) in enumerate(ypd_comp.items()):
                _render_horizon(
                    fig,
                    yt_comp,
                    {mname: y_pred_m},
                    [colors[m_idx]],
                    first_cw,
                    legend_tracker.should_show(mname),
                    row=r,
                    col=c_i,
                    legendgroup=mname,
                )
    else:
        # Single component, overlay models (original behavior)
        first_cw = next(iter(scorer_cw_dict.values()))
        colors = resolve_color_palette(color_palette, n_models)
        fig = go.Figure()
        _render_horizon(fig, _yt_eff, _ypd_eff, colors, first_cw, show_legend)

    if multi_scorer:
        default_title = title or "Score by Horizon Step"
        default_y = y_label or "Score"
        default_height = height
    elif len(_score_cols) > 1:
        first_scorer = next(iter(scorer_dict.values()))
        scorer_name = first_scorer.__class__.__name__
        default_title = title or f"{scorer_name} by Horizon Step"
        default_y = y_label or scorer_name
        n_rows_c = (len(_score_cols) + min(facet_n_cols, len(_score_cols)) - 1) // min(facet_n_cols, len(_score_cols))
        default_height = height or max(300 * n_rows_c, 400)
    else:
        first_scorer = next(iter(scorer_dict.values()))
        scorer_name = first_scorer.__class__.__name__
        default_title = title or f"{scorer_name} by Horizon Step"
        default_y = y_label or scorer_name
        default_height = height

    fig = apply_default_layout(
        fig,
        title=default_title,
        x_label=x_label or "Horizon Step",
        y_label=default_y,
        width=width,
        height=default_height,
    )

    if kind == "bar" and (n_models > 1 or multi_scorer):
        fig.update_layout(barmode="group")

    fig.update_layout(showlegend=show_legend)

    return fig

Tutorials

The following example notebooks use this component:

Decomposition

Data-Features

Chain PolynomialTrendForecaster, PatternSeasonalityForecaster, and FourierSeasonalityForecaster inside DecompositionPipeline with component visualisation.

How to Use Conformity Scorers

Evaluation-Search

Compare Residual, AbsoluteResidual, GammaResidual, and AbsoluteGammaResidual conformity scorers with coverage/width analysis and DistanceSimilarity interaction.

How to Run Hyperparameter Search

Evaluation-Search

Tune forecaster hyperparameters with GridSearchCV and RandomizedSearchCV using temporal cross-validation splitters and result scatter visualisation.

How to Score Multi-Vintage Forecasts

Evaluation-Search

Generate multi-vintage predictions with observe_predict, score per step and per vintage, and visualize with heatmap, per-step, and per-vintage plots.

How to Forecast Intervals with CatBoost Multiquantile

Forecasting-Models

Use IntervalReductionForecaster with CatBoost's native multiquantile objective for simultaneous lower and upper bound estimation.

How to Use Distance-Based Similarity for Intervals

Forecasting-Models

Adaptive prediction intervals via similarity-weighted conformal prediction using DistanceSimilarity with configurable distance metrics and bandwidths.

How to Apply Time-Weighted Training

Forecasting-Models

Use time_weight and sample_weight_alignment to emphasise recent or seasonal training samples in PointReductionForecaster, with visualisation of weight curves and alignment strategy comparison.

How to Combine Forecasters with VotingPointForecaster

Forecasting-Models

Build point ensembles with VotingPointForecaster using mean, weighted, and median aggregation strategies.

Naive Forecasters

Getting-Started

Baseline forecasting (the first portion of the First Forecast tutorial) with SeasonalNaive using different seasonality periods, the observe/predict streaming workflow, and rolling evaluation patterns.

Direct, Recursive, and MIMO Strategies

Getting-Started

Compare direct, recursive, and MIMO reduction strategies across forecasting horizons to understand the trade-offs for your use case.

How to Visualize Forecast Evaluation Results

Visualization

Use plot_calibration, plot_score_per_step, and plot_forecast to diagnose forecast accuracy and interval calibration visually.
