validate_column_names

`yohou.utils.validation.validate_column_names(df)`

Validate that __ separator is used only for panel data group names.

The __ separator is reserved for panel data groups following the pattern
`<GROUP>__<SERIES>` (e.g., "sales__store_1"). This function ensures column names either:

- Don't contain __ at all (global columns), OR
- Contain __ exactly once, separating a non-empty group part from a non-empty series part, with no underscore adjacent to the separator (group columns)
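The rule can also be sketched as a non-raising check, which is handy for listing every offending column at once rather than stopping at the first one. This is a minimal sketch that mirrors the checks in the source listing on this page; `group_pattern_problem` is a hypothetical helper name, not part of yohou, and it works on plain strings so it needs no DataFrame.

```python
from typing import Optional


def group_pattern_problem(name: str) -> Optional[str]:
    """Return a short reason if `name` breaks the rule, else None.

    Hypothetical helper (not part of yohou): it re-implements the same
    three checks as the source listing on this page.
    """
    if name == "time" or "__" not in name:
        return None  # time column or global column: nothing to check
    parts = name.split("__")
    if len(parts) != 2:
        return "more than one __ separator"
    group, series = parts
    if not group or not series:
        return "__ at the beginning or end"
    if group.endswith("_") or series.startswith("_"):
        return "underscore adjacent to __"
    return None


columns = [
    "time",
    "value",
    "sales__store_1",
    "store_1__sales",
    "my__bad__col",
    "__sales",
    "store___sales",
]
# Keep only the names that fail, together with the reason.
offenders = {c: p for c in columns if (p := group_pattern_problem(c)) is not None}
for name, reason in offenders.items():
    print(f"{name}: {reason}")
```

In practice the list of names would come from `df.columns`; the hard-coded list here only keeps the sketch self-contained.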

Parameters

Name	Type	Description	Default
`df`	`DataFrame`	DataFrame to validate.	required

Raises

Type	Description
`ValueError`	If any column name contains __ but doesn't match the group pattern: __ appears more than once, sits at the beginning or end of the name, or has an underscore adjacent to it.

Examples

>>> import polars as pl
>>> # Valid: no __ separator
>>> df = pl.DataFrame({"time": [1, 2], "value": [10, 20]})
>>> validate_column_names(df)  # No error

>>> # Valid: proper group pattern
>>> df = pl.DataFrame({"time": [1, 2], "sales__store_1": [100, 110]})
>>> validate_column_names(df)  # No error

>>> # Invalid: __ without proper pattern
>>> df = pl.DataFrame({"time": [1, 2], "my__bad__col": [10, 20]})
>>> validate_column_names(df)
Traceback (most recent call last):
    ...
ValueError: Column 'my__bad__col' contains multiple __ separators...
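When a column such as `my__bad__col` is not meant to be a panel column, one way to follow the error message's advice is to rename it before validation. The sketch below is an assumption about how one might do that, not something yohou provides: `collapse_double_underscores` is a hypothetical helper that replaces every run of two or more underscores with a single one.

```python
import re


def collapse_double_underscores(name: str) -> str:
    """Hypothetical helper: replace each run of 2+ underscores with one."""
    return re.sub(r"_{2,}", "_", name)


columns = ["time", "value", "my__bad__col"]
# Build a rename mapping only for the columns that contain __.
mapping = {c: collapse_double_underscores(c) for c in columns if "__" in c}
print(mapping)  # {'my__bad__col': 'my_bad_col'}

# With polars, the mapping could then be applied as:
#     df = df.rename(mapping)
```

Note that this turns every affected column into a global column, so it should not be applied to genuine panel columns like `sales__store_1`. It is also worth checking that the renamed columns do not collide with existing names.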

Source Code

def validate_column_names(df: pl.DataFrame) -> None:
    """Validate that __ separator is used only for panel data group names.

    The __ separator is reserved for panel data groups following the pattern
    <GROUP>__<SERIES> (e.g., "sales__store_1"). This function ensures column
    names either:
    - Don't contain __ at all (global columns), OR
    - Contain __ exactly once, between a non-empty group part and a non-empty
      series part, with no underscore adjacent to it (group columns)

    Parameters
    ----------
    df : pl.DataFrame
        DataFrame to validate.

    Raises
    ------
    ValueError
        If any column name contains __ but doesn't match the group pattern:
        __ appears more than once, sits at the beginning or end of the name,
        or has an underscore adjacent to it.

    Examples
    --------
    >>> import polars as pl
    >>> # Valid: no __ separator
    >>> df = pl.DataFrame({"time": [1, 2], "value": [10, 20]})
    >>> validate_column_names(df)  # No error

    >>> # Valid: proper group pattern
    >>> df = pl.DataFrame({"time": [1, 2], "sales__store_1": [100, 110]})
    >>> validate_column_names(df)  # No error

    >>> # Invalid: __ without proper pattern
    >>> df = pl.DataFrame({"time": [1, 2], "my__bad__col": [10, 20]})
    >>> validate_column_names(df)  # doctest: +SKIP
    Traceback (most recent call last):
        ...
    ValueError: Column 'my__bad__col' contains multiple __ separators...

    See Also
    --------
    - [`check_inputs`][yohou.utils.validation.check_inputs] : Validates time intervals and calls this function

    """
    # Pattern: allows underscores in group/series names, but not adjacent to __
    # Valid: store_1__sales, my_store__my_sales
    # Invalid: store___sales (underscore adjacent to __), __sales, store__
    # Strategy: split on __, check both parts are non-empty and that no
    # underscore touches the separator

    # Handle None case
    if df is None:
        return

    for col_name in df.columns:
        if col_name == "time":
            continue

        if "__" not in col_name:
            # No __ separator - valid global column
            continue

        # Column contains __ - validate it follows the pattern
        parts = col_name.split("__")

        # Check for common issues to provide helpful error messages
        if len(parts) != 2:
            raise ValueError(
                f"Column '{col_name}' contains multiple __ separators. "
                f"The __ separator is reserved for panel data groups and must appear "
                f"exactly once, following the pattern '<GROUP>__<SERIES>' "
                f"(e.g., 'sales__store_1'). If this column was produced by a "
                f"meta-transformer (FeatureUnion, ColumnTransformer), ensure it "
                f"uses panel-safe prefixing with single underscore separators. "
                f"Please rename columns to avoid using __ "
                f"or use it only for panel data groups."
            )

        group, series = parts

        # Check for empty parts
        if not group or not series:
            raise ValueError(
                f"Column '{col_name}' has __ at the beginning or end. "
                f"The __ separator must separate a non-empty group prefix from a "
                f"non-empty series suffix (e.g., 'sales__store_1')."
            )

        # Check for underscores adjacent to __
        if group.endswith("_") or series.startswith("_"):
            raise ValueError(
                f"Column '{col_name}' has underscores adjacent to the __ separator. "
                f"The pattern '<GROUP>__<SERIES>' requires that the group part doesn't "
                f"end with _ and the series part doesn't start with _ "
                f"(e.g., 'store_1__sales' is valid, but 'store_1___sales' or 'store1_"
                "__sales' are not)."
            )