plot_score_distribution

`yohou.plotting.evaluation.plot_score_distribution(scorer, y_truth, y_pred, *, kind='histogram', compare_by='scorer', n_bins=30, show_mean=True, show_zero=True, columns=None, groups=None, facet_by='member', facet_n_cols=2, color_palette=None, show_legend=True, title=None, x_label=None, y_label=None, width=None, height=None, bar_opacity=0.6, line_width=2.0, kde_points=200)`

Plot the distribution of per-timestep scorer values.

Evaluates forecast quality at each timestep (using componentwise aggregation) and visualises the resulting score distribution as a histogram, a KDE curve, or both. Supports multi-model and multi-scorer comparison via overlaid distributions, with faceted subplots when both several scorers and several models are passed.
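To make the idea of a per-timestep score distribution concrete, the following minimal sketch computes one score per timestep by hand with NumPy and bins the result. It reuses the values from the example on this page but does not call yohou: the absolute-error formula and the choice of three bins are illustrative assumptions, not the scorer's actual implementation.

```python
import numpy as np

# Same truth/prediction values as the example on this page.
y_truth = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
y_pred = np.array([12.0, 19.0, 28.0, 42.0, 48.0])

# One score per timestep (here: absolute error) instead of a single aggregate.
scores = np.abs(y_truth - y_pred)

# Bin the per-timestep scores; 3 bins is an arbitrary choice for this tiny sample.
counts, edges = np.histogram(scores, bins=3)

print(scores)         # per-timestep scores
print(scores.mean())  # the value a mean line would mark
print(counts, edges)  # histogram counts and bin edges
```

The plotted histogram is the visual counterpart of `counts` and `edges`, and the mean line sits at `scores.mean()`.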

Parameters

Name	Type	Description	Default
`scorer`	`BaseScorer or dict[str, BaseScorer]`	Yohou scorer instance (e.g., `MeanAbsoluteError`). Will be cloned and configured with `aggregation_method="componentwise"`. If BaseScorer: single scorer to evaluate. If dict: keys are scorer names, values are scorer instances. When combined with dict `y_pred`, the `compare_by` parameter controls which dimension is faceted vs overlaid.	required
`y_truth`	`DataFrame`	Ground truth values with `"time"` column.	required
`y_pred`	`DataFrame or dict[str, DataFrame]`	Predicted values with `"vintage_time"` and `"time"` columns. If DataFrame: single forecast. If dict: keys are model names, values are prediction DataFrames.	required
`kind`	`str`	Distribution visualisation style: `"histogram"`, `"kde"` or `"both"`.	`"histogram"`
`compare_by`	`str`	When both `scorer` and `y_pred` are dicts, controls which dimension is overlaid (colored traces) vs faceted (subplots): `"scorer"`: overlay scorers, facet by model. `"model"`: overlay models, facet by scorer. Ignored when either `scorer` or `y_pred` is not a dict.	`"scorer"`
`n_bins`	`int`	Number of histogram bins (ignored for `kind="kde"`).	`30`
`show_mean`	`bool`	Add a vertical line at the mean score.	`True`
`show_zero`	`bool`	Add a vertical dashed line at zero (useful as a perfect-forecast reference for symmetric scorers).	`True`
`columns`	`str \| list[str] \| None`	Target column name(s) to score. For panel data, this filters by member name, i.e. the part of the column name after the `<group>__` prefix. `None` uses all columns.	`None`
`groups`	`list[str] \| None`	Panel group prefixes to plot (faceted layout). `None` uses every group detected in `y_truth`.	`None`
`facet_by`	`Literal['group', 'member'] \| None`	Faceting axis for panel data. `"group"` creates one subplot per group, `"member"` one per member. `None` disables faceting. Ignored for non-panel data.	`"member"`
`facet_n_cols`	`int`	Number of columns in the faceted grid.	`2`
`color_palette`	`list[str] \| None`	Custom colour palette.	`None`
`show_legend`	`bool`	Whether to show the legend.	`True`
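`title`	`str \| None`	Plot title. Defaults to `"<ScorerName> Distribution"` for a single scorer and to `"Score Distribution"` when several scorers are passed.	`None`
`x_label`	`str \| None`	X-axis label. Defaults to the scorer class name for a single scorer and to `"Score"` when several scorers are passed.	`None`
`y_label`	`str \| None`	Y-axis label. Defaults to `"Density"` for `kind="kde"` or `kind="both"`, and to `"Count"` for `kind="histogram"`.	`None`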
`width`	`int \| None`	Plot width in pixels.	`None`
`height`	`int \| None`	Plot height in pixels.	`None`
`bar_opacity`	`float`	Opacity of histogram bars.	`0.6`
`line_width`	`float`	Width of KDE lines.	`2.0`
`kde_points`	`int`	Number of points for KDE evaluation.	`200`
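As a rough illustration of how `facet_n_cols` shapes a faceted figure, the sketch below reproduces the grid arithmetic visible in the source listing on this page: the column count is capped at the number of facets, the row count uses ceiling division, and facets are filled row by row. The function name `facet_grid` is hypothetical and the input check is an addition that the listing does not have.

```python
def facet_grid(n_facets: int, facet_n_cols: int = 2) -> tuple[int, int, list[tuple[int, int]]]:
    """Return (n_rows, n_cols, positions) with 1-based (row, col) positions."""
    if n_facets < 1 or facet_n_cols < 1:
        raise ValueError("n_facets and facet_n_cols must be positive")
    n_cols = min(n_facets, facet_n_cols)        # never more columns than facets
    n_rows = (n_facets + n_cols - 1) // n_cols  # ceiling division
    positions = [(idx // n_cols + 1, idx % n_cols + 1) for idx in range(n_facets)]
    return n_rows, n_cols, positions


n_rows, n_cols, positions = facet_grid(5)
print(n_rows, n_cols)  # 3 2
print(positions)       # [(1, 1), (1, 2), (2, 1), (2, 2), (3, 1)]
```

With five facets and the default of two columns, the last row holds a single subplot.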

Returns

Type	Description
`Figure`	Plotly figure object.

Raises

Type	Description
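`TypeError`	If y_truth or y_pred is not a Polars DataFrame, or if the scorer does not return a DataFrame under componentwise aggregation.
`ValueError`	If kind is not one of `"histogram"`, `"kde"` or `"both"`, or if several scorers are passed together with panel data or with `groups` set.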

Examples

>>> import polars as pl
>>> from datetime import datetime
>>> from yohou.metrics import MeanAbsoluteError
>>> from yohou.plotting import plot_score_distribution

>>> y_truth = pl.DataFrame({
...     "time": [datetime(2020, 1, i) for i in range(1, 6)],
...     "value": [10.0, 20.0, 30.0, 40.0, 50.0],
... })
>>> y_pred = pl.DataFrame({
...     "vintage_time": [datetime(2019, 12, 31)] * 5,
...     "time": [datetime(2020, 1, i) for i in range(1, 6)],
...     "value": [12.0, 19.0, 28.0, 42.0, 48.0],
... })

>>> fig = plot_score_distribution(MeanAbsoluteError(), y_truth, y_pred)
>>> len(fig.data) >= 1
True
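The example above uses the default histogram. For `kind="kde"` or `kind="both"`, the source listing on this page evaluates `scipy.stats.gaussian_kde` on `kde_points` evenly spaced values between the smallest and largest score. The sketch below is a NumPy-only stand-in for that step, not the library's code: the function name `gaussian_kde_1d` is hypothetical, and the bandwidth uses Scott's rule, which is assumed here to correspond to SciPy's default.

```python
import numpy as np


def gaussian_kde_1d(values, n_points: int = 200):
    """Evaluate a 1-D Gaussian KDE on an evenly spaced grid over [min, max]."""
    values = np.asarray(values, dtype=float)
    n = values.size
    if n < 2:
        raise ValueError("need at least two values")
    std = values.std(ddof=1)
    if std == 0.0:
        raise ValueError("all values are identical; no bandwidth can be derived")
    h = std * n ** (-1.0 / 5.0)  # Scott's rule bandwidth in one dimension (assumption)
    grid = np.linspace(values.min(), values.max(), n_points)
    z = (grid[:, None] - values[None, :]) / h
    # Average of Gaussian kernels centred on each observed score.
    density = np.exp(-0.5 * z**2).sum(axis=1) / (n * h * np.sqrt(2.0 * np.pi))
    return grid, density


scores = np.array([2.0, 1.0, 2.0, 2.0, 2.0])  # absolute errors from the example data
grid, density = gaussian_kde_1d(scores)
print(grid[0], grid[-1], density.shape)
```

Because the grid stops at the minimum and maximum score, the drawn curve omits the tails of the estimate, so the area under it is somewhat below 1.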

Source Code

def plot_score_distribution(
    scorer: BaseScorer | dict[str, BaseScorer],
    y_truth: pl.DataFrame,
    y_pred: pl.DataFrame | dict[str, pl.DataFrame],
    *,
    kind: Literal["histogram", "kde", "both"] = "histogram",
    compare_by: Literal["scorer", "model"] = "scorer",
    n_bins: int = 30,
    show_mean: bool = True,
    show_zero: bool = True,
    columns: str | list[str] | None = None,
    groups: list[str] | None = None,
    facet_by: Literal["group", "member"] | None = "member",
    facet_n_cols: int = 2,
    color_palette: list[str] | None = None,
    show_legend: bool = True,
    title: str | None = None,
    x_label: str | None = None,
    y_label: str | None = None,
    width: int | None = None,
    height: int | None = None,
    bar_opacity: float = 0.6,
    line_width: float = 2.0,
    kde_points: int = 200,
) -> go.Figure:
    """Plot the distribution of per-timestep scorer values.

    Evaluates forecast quality at each timestep (using componentwise
    aggregation) and visualises the resulting score distribution as a
    histogram, KDE, or both.  Supports multi-model comparison via
    overlaid distributions.

    Parameters
    ----------
    scorer : BaseScorer or dict[str, BaseScorer]
        Yohou scorer instance (e.g., ``MeanAbsoluteError``).  Will be
        cloned and configured with ``aggregation_method="componentwise"``.

        - If BaseScorer: single scorer to evaluate.
        - If dict: keys are scorer names, values are scorer instances.
          When combined with dict ``y_pred``, the ``compare_by`` parameter
          controls which dimension is faceted vs overlaid.
    y_truth : pl.DataFrame
        Ground truth values with ``"time"`` column.
    y_pred : pl.DataFrame or dict[str, pl.DataFrame]
        Predicted values with ``"vintage_time"`` and ``"time"`` columns.

        - If DataFrame: single forecast.
        - If dict: keys are model names, values are prediction DataFrames.
    kind : str, default="histogram"
        Distribution visualisation style: ``"histogram"``, ``"kde"`` or
        ``"both"``.
    compare_by : str, default="scorer"
        When both ``scorer`` and ``y_pred`` are dicts, controls which
        dimension is overlaid (colored traces) vs faceted (subplots):

        - ``"scorer"``: overlay scorers, facet by model.
        - ``"model"``: overlay models, facet by scorer.

        Ignored when either ``scorer`` or ``y_pred`` is not a dict.
    n_bins : int, default=30
        Number of histogram bins (ignored for ``kind="kde"``).
    show_mean : bool, default=True
        Add a vertical line at the mean score.
    show_zero : bool, default=True
        Add a vertical dashed line at zero (useful as a perfect-forecast
        reference for symmetric scorers).
    columns : str | list[str] | None, default=None
        Target column name(s) to score.  When *groups* is set
        this acts as a member postfix filter.  ``None`` uses all columns.
    groups : list[str] | None, default=None
        Panel group prefixes to plot (faceted layout).
    facet_by : Literal["group", "member"] | None, default="member"
        Faceting axis for panel data. ``"group"`` creates one subplot per
        group, ``"member"`` one per member. ``None`` disables faceting.
        Ignored for non-panel data.
    facet_n_cols : int, default=2
        Number of columns in the faceted grid.
    color_palette : list[str] | None, default=None
        Custom colour palette.
    show_legend : bool, default=True
        Whether to show the legend.
    title : str | None, default=None
        Plot title.  Defaults to ``"<ScorerName> Distribution"``.
    x_label : str | None, default=None
        X-axis label.  Defaults to the scorer class name.
    y_label : str | None, default=None
        Y-axis label.  Defaults to ``"Count"`` or ``"Density"``
        depending on *kind*.
    width : int | None, default=None
        Plot width in pixels.
    height : int | None, default=None
        Plot height in pixels.
    bar_opacity : float, default=0.6
        Opacity of histogram bars.
    line_width : float, default=2.0
        Width of KDE lines.
    kde_points : int, default=200
        Number of points for KDE evaluation.

    Returns
    -------
    go.Figure
        Plotly figure object.

    Raises
    ------
    TypeError
        If *y_truth* or *y_pred* is not a Polars DataFrame.
    ValueError
        If *kind* is not one of ``"histogram"``, ``"kde"`` or ``"both"``.

    Examples
    --------
    >>> import polars as pl
    >>> from datetime import datetime
    >>> from yohou.metrics import MeanAbsoluteError
    >>> from yohou.plotting import plot_score_distribution

    >>> y_truth = pl.DataFrame({
    ...     "time": [datetime(2020, 1, i) for i in range(1, 6)],
    ...     "value": [10.0, 20.0, 30.0, 40.0, 50.0],
    ... })
    >>> y_pred = pl.DataFrame({
    ...     "vintage_time": [datetime(2019, 12, 31)] * 5,
    ...     "time": [datetime(2020, 1, i) for i in range(1, 6)],
    ...     "value": [12.0, 19.0, 28.0, 42.0, 48.0],
    ... })

    >>> fig = plot_score_distribution(MeanAbsoluteError(), y_truth, y_pred)
    >>> len(fig.data) >= 1
    True

    See Also
    --------
    [`plot_score_time_series`][yohou.plotting.plot_score_time_series] : Score values over time.
    [`plot_score_per_step`][yohou.plotting.plot_score_per_step] : Score by forecast step.
    """
    from scipy.stats import gaussian_kde  # noqa: PLC0415

    validate_plotting_data(y_truth)

    validate_plotting_params(kind=kind, valid_kinds={"histogram", "kde", "both"}, width=width, height=height)

    y_pred_dict: dict[str, pl.DataFrame] = _normalize_y_pred(y_pred)
    scorer_dict = _normalize_scorers(scorer)

    # Prepare and fit each scorer for componentwise aggregation
    scorer_cw_dict: dict[str, BaseScorer] = {}
    for s_name, s in scorer_dict.items():
        s_cw = _prepare_scorer_for_componentwise(s)
        s_cw.fit(y_truth)
        scorer_cw_dict[s_name] = s_cw

    n_scorers = len(scorer_cw_dict)
    n_models = len(y_pred_dict)
    multi_scorer = n_scorers > 1

    def _render(
        fig: go.Figure,
        y_truth_sub: pl.DataFrame,
        y_pred_dict_sub: dict[str, pl.DataFrame],
        _colors: list[str],
        scorer_cw_r: BaseScorer,
        _show_legend: bool = True,
        *,
        row: int | None = None,
        col: int | None = None,
    ) -> None:
        """Render score distribution traces onto *fig*."""
        for idx, (mname, y_pred_m) in enumerate(y_pred_dict_sub.items()):
            validate_plotting_data(y_pred_m)
            scores_df = scorer_cw_r.score(y_truth_sub, y_pred_m)
            if not isinstance(scores_df, pl.DataFrame):
                msg_ = f"Scorer must return DataFrame for componentwise aggregation, got {type(scores_df).__name__}"
                raise TypeError(msg_)

            score_cols = [c for c in scores_df.columns if c not in _SCORER_META_COLS]
            if len(score_cols) == 1:
                score_vals = scores_df[score_cols[0]].drop_nulls().to_numpy()
            else:
                score_vals = scores_df.select(score_cols).to_numpy().flatten()
                score_vals = score_vals[~np.isnan(score_vals)]

            c = _colors[idx % len(_colors)]

            if kind in ("histogram", "both"):
                hist_norm = "probability density" if kind == "both" else ""
                fig.add_trace(
                    go.Histogram(
                        x=score_vals,
                        nbinsx=n_bins,
                        marker_color=c,
                        opacity=bar_opacity,
                        name=mname,
                        legendgroup=mname,
                        showlegend=_show_legend if kind != "both" else False,
                        histnorm=hist_norm,
                        hoverinfo="skip",
                    ),
                    row=row,
                    col=col,
                )

            if kind in ("kde", "both") and len(score_vals) > 1:
                try:
                    kde = gaussian_kde(score_vals)
                except np.linalg.LinAlgError:
                    pass
                else:
                    x_grid = np.linspace(
                        float(score_vals.min()),
                        float(score_vals.max()),
                        kde_points,
                    )
                    fig.add_trace(
                        go.Scatter(
                            x=x_grid,
                            y=kde(x_grid),
                            mode="lines",
                            line={"color": c, "width": line_width},
                            name=mname,
                            legendgroup=mname,
                            showlegend=_show_legend,
                            hoverinfo="skip",
                        ),
                        row=row,
                        col=col,
                    )

            if show_mean and len(score_vals) > 0:
                mean_val = float(np.mean(score_vals))
                fig.add_vline(
                    x=mean_val,
                    line_dash="dash",
                    line_color=c,
                    line_width=1.5,
                    row=row,
                    col=col,
                )
                if row is None:
                    fig.add_annotation(
                        x=mean_val,
                        y=1.0,
                        yref="paper",
                        text=f"\u03bc={mean_val:.3f}",
                        font={"color": c, "size": 11},
                        showarrow=False,
                        yanchor="bottom",
                    )

        if show_zero:
            fig.add_vline(
                x=0.0,
                line_dash="dot",
                line_color="grey",
            )

    # Column filter
    _col_filter: set[str] | None = None
    if columns is not None:
        _col_filter = set([columns] if isinstance(columns, str) else columns)

    # Panel dispatch
    _, _panel_groups = inspect_panel(y_truth)
    _effective_groups: list[str] | None = None
    if groups is not None:
        _effective_groups = groups
    elif _panel_groups:
        _effective_groups = list(_panel_groups)
    if _effective_groups:
        if multi_scorer:
            msg = (
                "Multi-scorer is not supported with panel data in "
                "plot_score_distribution. Pass a single scorer instead."
            )
            raise ValueError(msg)

        first_cw = next(iter(scorer_cw_dict.values()))
        colors = resolve_color_palette(color_palette, n_models)

        n_cols_grid = min(len(_effective_groups), facet_n_cols)
        n_rows_grid = (len(_effective_groups) + n_cols_grid - 1) // n_cols_grid
        pfig = make_subplots(
            rows=n_rows_grid,
            cols=n_cols_grid,
            subplot_titles=_effective_groups,
            vertical_spacing=max(0.04, 0.3 / n_rows_grid),
        )
        for g_idx, gname in enumerate(_effective_groups):
            r = g_idx // n_cols_grid + 1
            c_i = g_idx % n_cols_grid + 1
            g_cols_truth = [
                cn
                for cn in y_truth.columns
                if cn == "time"
                or (cn.startswith(f"{gname}__") and (_col_filter is None or _member_name(cn) in _col_filter))
            ]
            y_truth_g = y_truth.select(g_cols_truth) if len(g_cols_truth) > 1 else y_truth
            y_pred_dict_g: dict[str, pl.DataFrame] = {}
            for mname, y_pred_m in y_pred_dict.items():
                gp_cols = [
                    cn
                    for cn in y_pred_m.columns
                    if cn in ("time", "vintage_time")
                    or (cn.startswith(f"{gname}__") and (_col_filter is None or _member_name(cn) in _col_filter))
                ]
                y_pred_dict_g[mname] = y_pred_m.select(gp_cols) if len(gp_cols) > 2 else y_pred_m
            _render(pfig, y_truth_g, y_pred_dict_g, colors, first_cw, show_legend and g_idx == 0, row=r, col=c_i)

        first_scorer = next(iter(scorer_dict.values()))
        scorer_name = first_scorer.__class__.__name__
        pfig = apply_default_layout(
            pfig,
            title=title or f"{scorer_name} Distribution",
            x_label=x_label or scorer_name,
            y_label=y_label or ("Density" if kind in ("kde", "both") else "Count"),
            width=width,
            height=height,
        )
        pfig.update_layout(barmode="overlay" if n_models > 1 else "relative", showlegend=show_legend)
        return pfig

    # Multi-scorer + multi-model -> faceted subplots
    if multi_scorer and n_models > 1:
        _warn_large_grid(n_scorers, n_models)

        if compare_by == "model":
            facet_labels = list(scorer_cw_dict.keys())
            overlay_labels = list(y_pred_dict.keys())
        else:
            facet_labels = list(y_pred_dict.keys())
            overlay_labels = list(scorer_cw_dict.keys())

        n_facets = len(facet_labels)
        n_cols_f = min(facet_n_cols, n_facets)
        n_rows_f = (n_facets + n_cols_f - 1) // n_cols_f
        colors = resolve_color_palette(color_palette, len(overlay_labels))

        pfig = make_subplots(
            rows=n_rows_f,
            cols=n_cols_f,
            subplot_titles=facet_labels,
            vertical_spacing=_subplot_spacing(n_rows_f),
        )

        legend_tracker = LegendTracker()
        for facet_idx, facet_label in enumerate(facet_labels):
            r = facet_idx // n_cols_f + 1
            c_i = facet_idx % n_cols_f + 1
            for overlay_idx, overlay_label in enumerate(overlay_labels):
                if compare_by == "model":
                    s_cw = scorer_cw_dict[facet_label]
                    ypd_sub = {overlay_label: y_pred_dict[overlay_label]}
                else:
                    s_cw = scorer_cw_dict[overlay_label]
                    ypd_sub = {overlay_label: y_pred_dict[facet_label]}
                _render(
                    pfig,
                    y_truth,
                    ypd_sub,
                    [colors[overlay_idx]],
                    s_cw,
                    legend_tracker.should_show(overlay_label),
                    row=r,
                    col=c_i,
                )

        pfig = apply_default_layout(
            pfig,
            title=title or "Score Distribution",
            x_label=x_label or "Score",
            y_label=y_label or ("Density" if kind in ("kde", "both") else "Count"),
            width=width,
            height=height or max(300 * n_rows_f, 400),
        )
        pfig.update_layout(barmode="overlay", showlegend=show_legend)
        return pfig

    # Non-panel single figure
    fig = go.Figure()

    if multi_scorer:
        # Overlay scorers (single model)
        colors = resolve_color_palette(color_palette, n_scorers)
        y_pred_single = next(iter(y_pred_dict.values()))
        for idx, (s_name, s_cw) in enumerate(scorer_cw_dict.items()):
            _render(fig, y_truth, {s_name: y_pred_single}, [colors[idx]], s_cw, _show_legend=show_legend)
    else:
        # Overlay models (single scorer - original behavior)
        first_cw = next(iter(scorer_cw_dict.values()))
        colors = resolve_color_palette(color_palette, n_models)
        if _col_filter is not None:
            _keep_truth = ["time"] + [c for c in y_truth.columns if c != "time" and c in _col_filter]
            y_truth_filt = y_truth.select(_keep_truth)
            y_pred_dict_filt = {
                k: v.select([c for c in v.columns if c in ("time", "vintage_time") or c in _col_filter])
                for k, v in y_pred_dict.items()
            }
            _render(fig, y_truth_filt, y_pred_dict_filt, colors, first_cw, _show_legend=show_legend)
        else:
            _render(fig, y_truth, y_pred_dict, colors, first_cw, _show_legend=show_legend)

    if multi_scorer:
        default_title = title or "Score Distribution"
        default_x = x_label or "Score"
    else:
        first_scorer = next(iter(scorer_dict.values()))
        scorer_name = first_scorer.__class__.__name__
        default_title = title or f"{scorer_name} Distribution"
        default_x = x_label or scorer_name
    default_y = y_label or ("Density" if kind in ("kde", "both") else "Count")

    fig = apply_default_layout(
        fig,
        title=default_title,
        x_label=default_x,
        y_label=default_y,
        width=width,
        height=height,
    )

    if n_models > 1 or multi_scorer:
        fig.update_layout(barmode="overlay", showlegend=show_legend)
    else:
        fig.update_layout(showlegend=show_legend)

    return fig
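
When both `scorer` and `y_pred` are dicts, the listing above builds one subplot per facet label and overlays one trace per overlay label. The following sketch mirrors only that pairing logic in plain Python so the effect of `compare_by` is easier to see. The function name `facet_overlay_pairs` and the scorer and model names are hypothetical.

```python
def facet_overlay_pairs(scorer_names, model_names, compare_by="scorer"):
    """Map each facet label to the (scorer, model) pairs overlaid in that subplot."""
    layout = {}
    if compare_by == "model":
        # Overlay models, one subplot per scorer.
        for scorer in scorer_names:
            layout[scorer] = [(scorer, model) for model in model_names]
    else:
        # Overlay scorers, one subplot per model. As in the listing,
        # any value other than "model" takes this branch.
        for model in model_names:
            layout[model] = [(scorer, model) for scorer in scorer_names]
    return layout


scorers = ["mae", "rmse"]
models = ["model_a", "model_b", "model_c"]
by_scorer = facet_overlay_pairs(scorers, models)                    # 3 subplots, 2 traces each
by_model = facet_overlay_pairs(scorers, models, compare_by="model")  # 2 subplots, 3 traces each
print(list(by_scorer), list(by_model))
```

Overlaying scorers is most readable when they share a scale, since all traces in a subplot use the same x-axis.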

Tutorials

The following example notebooks use this component:

How to Visualize Forecast Evaluation Results

Visualization

Use plot_calibration, plot_score_per_step, and plot_forecast to diagnose forecast accuracy and interval calibration visually.