validate_search_data

yohou.utils.validation.validate_search_data(y, X_actual)

Validate input data for hyperparameter search (GridSearchCV, RandomizedSearchCV).

Performs comprehensive validation of time series data for cross-validation:

- Checks that y is not None
- Validates time column presence, dtype, nulls, and sorting
- Validates panel data internal consistency
- Validates panel data group matching between y and X_actual
- Validates consistent time intervals across DataFrames

This function is designed for SearchCV contexts where we validate data without modifying forecaster state (unlike validate_forecaster_data).
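The time-column checks listed above can be illustrated without yohou or polars. What follows is a minimal stdlib-only sketch, not the library's implementation: `check_time_values` and `describe_problem` are hypothetical helpers that operate on a plain list of timestamps, and the exact rules (for example, whether duplicate timestamps are allowed) are assumptions.

```python
from datetime import datetime


def check_time_values(times, name):
    """Hypothetical stand-in for a time-column check on a plain list."""
    if not times:
        raise ValueError(f"`{name}` has no time values")
    if any(t is None for t in times):
        raise ValueError(f"`{name}` time values contain nulls")
    if any(not isinstance(t, datetime) for t in times):
        raise ValueError(f"`{name}` time values must be datetimes")
    # Compare each timestamp with its successor; any decrease means unsorted.
    if any(later < earlier for earlier, later in zip(times, times[1:])):
        raise ValueError(f"`{name}` time values are not sorted")


def describe_problem(times, name="y"):
    """Return the error message, or None when the values pass every check."""
    try:
        check_time_values(times, name)
    except ValueError as exc:
        return str(exc)
    return None
```

Raising `ValueError` on the first failed check mirrors the fail-fast behavior described in this page: invalid data is rejected before any cross-validation work begins.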

Parameters

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| y | DataFrame | Target time series with "time" column. | required |
| X_actual | DataFrame or None | Exogenous feature time series with "time" column, or None. | required |

Returns

| Type | Description |
| --- | --- |
| str | The common time interval shared by all provided DataFrames (e.g., "1d", "1mo"). |
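To make the idea of a "common time interval" concrete, here is a hedged stdlib-only sketch. It is not how yohou computes the interval: `infer_interval` and `common_interval` are hypothetical helpers, they work on plain lists of `datetime` values, and they return a `timedelta` rather than a string such as "1d". Calendar intervals such as "1mo" have no fixed length, so this sketch does not cover them.

```python
from datetime import datetime, timedelta


def infer_interval(times):
    """Return the single fixed step between consecutive timestamps."""
    steps = {later - earlier for earlier, later in zip(times, times[1:])}
    if len(steps) != 1:
        raise ValueError("time values are not evenly spaced")
    return steps.pop()


def common_interval(*series):
    """Return the interval shared by every series that is not None."""
    intervals = {infer_interval(times) for times in series if times is not None}
    if len(intervals) != 1:
        raise ValueError("time intervals do not match across inputs")
    return intervals.pop()


start = datetime(2020, 1, 1)
daily = [start + timedelta(days=i) for i in range(5)]
hourly = [start + timedelta(hours=i) for i in range(5)]
```

With these definitions, `common_interval(daily, daily)` and `common_interval(daily, None)` both return a one-day `timedelta`, while `common_interval(daily, hourly)` raises `ValueError` because the two series are spaced differently.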

Raises

| Type | Description |
| --- | --- |
| ValueError | If y is None, time columns are invalid, panel data is inconsistent, or intervals don't match across DataFrames. |

Examples

>>> import polars as pl
>>> from datetime import datetime
>>> from yohou.utils.validation import validate_search_data
>>> time_index = pl.datetime_range(
...     start=datetime(2020, 1, 1), end=datetime(2020, 1, 5), interval="1d", eager=True
... )
>>> y = pl.DataFrame({"time": time_index, "sales": [100, 110, 120, 130, 140]})
>>> X_actual = pl.DataFrame({"time": time_index, "holiday": [0, 0, 1, 0, 0]})
>>> interval = validate_search_data(y, X_actual)
>>> interval
'1d'

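See Also

- validate_forecaster_data : Data validation with forecaster state management
- check_inputs : Validates consistent time intervals
- check_time_column : Validates time column properties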

Source Code

def validate_search_data(y: pl.DataFrame, X_actual: pl.DataFrame | None) -> str:
    """Validate input data for hyperparameter search (GridSearchCV, RandomizedSearchCV).

    Performs comprehensive validation of time series data for cross-validation:
    - Checks that y is not None
    - Validates time column presence, dtype, nulls, and sorting
    - Validates panel data internal consistency
    - Validates panel data group matching between y and X_actual
    - Validates consistent time intervals across DataFrames

    This function is designed for SearchCV contexts where we validate data
    without modifying forecaster state (unlike validate_forecaster_data).

    Parameters
    ----------
    y : pl.DataFrame
        Target time series with "time" column.

    X_actual : pl.DataFrame or None
        Exogenous feature time series with "time" column, or None.

    Returns
    -------
    str
        The common time interval shared by all provided DataFrames (e.g., "1d", "1mo").

    Raises
    ------
    ValueError
        If y is None, time columns are invalid, panel data is inconsistent,
        or intervals don't match across DataFrames.

    Examples
    --------
    >>> import polars as pl
    >>> from datetime import datetime
    >>> from yohou.utils.validation import validate_search_data
    >>> time_index = pl.datetime_range(
    ...     start=datetime(2020, 1, 1), end=datetime(2020, 1, 5), interval="1d", eager=True
    ... )
    >>> y = pl.DataFrame({"time": time_index, "sales": [100, 110, 120, 130, 140]})
    >>> X_actual = pl.DataFrame({"time": time_index, "holiday": [0, 0, 1, 0, 0]})
    >>> interval = validate_search_data(y, X_actual)
    >>> interval
    '1d'

    See Also
    --------
    - [`validate_forecaster_data`][yohou.utils.validate_data.validate_forecaster_data] : Data validation with forecaster state management
    - [`check_inputs`][yohou.utils.validation.check_inputs] : Validates consistent time intervals
    - [`check_time_column`][yohou.utils.validation.check_time_column] : Validates time column properties

    """
    if y is None:
        raise ValueError("`y` cannot be None")

    # Validate time columns
    check_time_column(y, "y")
    if X_actual is not None:
        check_time_column(X_actual, "X_actual")

    # Validate panel data internal consistency
    check_panel_internal_consistency(y, "y")
    if X_actual is not None:
        check_panel_internal_consistency(X_actual, "X_actual")
        # Validate panel data groups match
        check_panel_groups_match(y, X_actual)

    # Validate consistent time intervals and return the interval
    return check_inputs(y, X_actual)