ExpandingWindowSplitter

yohou.model_selection.split.ExpandingWindowSplitter

Bases: BaseSplitter

Expanding window time series cross-validation splitter.

Provides train/test indices for splitting time series samples, observed at fixed time intervals, into train and test sets. The test indices in each split are higher than those in the previous split, so shuffling in the cross-validator is inappropriate.

The training set grows with each split (expanding window), meaning successive training sets are supersets of those that come before them. This is useful when more data generally leads to better models and when you want to simulate accumulating historical data over time.
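To make the superset property concrete, here is a minimal pure-Python sketch of the index arithmetic. It is not the library's implementation: the helper name `expanding_train_ranges` is hypothetical, and the sketch assumes an explicit `test_size` and no `max_train_size` cap.

```python
def expanding_train_ranges(n_samples, n_splits, test_size):
    """Return the training index range for each split of an expanding window.

    Hypothetical helper for illustration only; it is not part of yohou.
    """
    # The test windows are packed against the end of the series.
    first_test_start = n_samples - n_splits * test_size
    # Each training range starts at 0 and ends where its test window begins.
    return [range(0, first_test_start + k * test_size) for k in range(n_splits)]


trains = expanding_train_ranges(n_samples=100, n_splits=3, test_size=10)

# Every training set is a strict superset of the one before it.
for earlier, later in zip(trains, trains[1:]):
    assert set(earlier) < set(later)

print([len(t) for t in trains])  # [70, 80, 90]
```

The training start stays pinned at index 0 while the end advances by `test_size` per split, which is what distinguishes an expanding window from a fixed-size rolling one.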

Parameters

n_splits : int, default=3
    Number of splits. Must be at least 2.
max_train_size : int or None, default=None
    Maximum size for a single training set. If None, all available training data is used.
test_size : int or None, default=None
    Number of samples in each test set. If None, defaults to n_samples // (n_splits + 1), which leaves at least one fold of that size for the first training set.
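How `test_size` and `max_train_size` interact can be sketched in plain Python. The two helpers below are hypothetical names that mirror the arithmetic in the source listing on this page; they are an illustration under that assumption, not the library's API.

```python
def resolve_test_size(n_samples, n_splits, test_size=None):
    """Fall back to n_samples // (n_splits + 1) when test_size is None."""
    return test_size if test_size is not None else n_samples // (n_splits + 1)


def cap_train(train, max_train_size=None):
    """Keep only the most recent max_train_size training indices."""
    if max_train_size is not None and len(train) > max_train_size:
        return train[-max_train_size:]
    return train


n_samples, n_splits = 100, 3
size = resolve_test_size(n_samples, n_splits)  # 100 // 4
starts = range(n_samples - n_splits * size, n_samples, size)

uncapped = [range(0, s) for s in starts]
capped = [cap_train(t, max_train_size=40) for t in uncapped]

print(size)                        # 25
print([len(t) for t in uncapped])  # [25, 50, 75]
print([len(t) for t in capped])    # [25, 40, 40]
print(capped[2])                   # range(35, 75)
```

Once the cap is reached, the oldest rows are dropped, so the window behaves like a rolling window of at most `max_train_size` rows from that split onward.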

Examples

>>> import polars as pl
>>> from datetime import datetime, timedelta
>>> from yohou.model_selection import ExpandingWindowSplitter
>>>
>>> # Create time series
>>> time = [datetime(2020, 1, 1) + timedelta(days=i) for i in range(100)]
>>> y = pl.DataFrame({"time": time, "value": range(100)})
>>>
>>> # 3 splits with 10-day test windows
>>> splitter = ExpandingWindowSplitter(n_splits=3, test_size=10)
>>> splits = list(splitter.split(y))
>>> len(splits)
3
>>>
>>> # First split: train on [0:70], test on [70:80]
>>> train, test = splits[0]
>>> len(train), len(test)
(70, 10)
>>>
>>> # Second split: train on [0:80], test on [80:90] (training set grows)
>>> train, test = splits[1]
>>> len(train), len(test)
(80, 10)
>>>

Notes

  • Training sets grow with each split (expanding window), unless max_train_size caps them
  • Test sets do not overlap
  • Training indices always precede test indices, so temporal order is preserved
  • For panel data, splits all groups together using row indices
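The properties listed in these notes can be checked with a simplified stand-in for `split()`. The generator below is a sketch in plain Python with a hypothetical name; it omits input validation and `max_train_size`, and assumes `n_splits * test_size` is smaller than the series length.

```python
def expanding_window_split(n_samples, n_splits=3, test_size=None):
    """Yield (train, test) index ranges; a simplified stand-in for split()."""
    if test_size is None:
        test_size = n_samples // (n_splits + 1)
    for start in range(n_samples - n_splits * test_size, n_samples, test_size):
        yield range(0, start), range(start, start + test_size)


splits = list(expanding_window_split(n_samples=100, n_splits=3, test_size=10))

# Test sets do not overlap.
seen = set()
for _, test in splits:
    assert seen.isdisjoint(test)
    seen.update(test)

# Each training set ends right where its test set begins.
assert all(train.stop == test.start for train, test in splits)

# The last test window ends at the final sample.
print(splits[-1][1])  # range(90, 100)
```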

See Also

SlidingWindowSplitter : Fixed-size rolling window splitter

Source Code

class ExpandingWindowSplitter(BaseSplitter):
    """Expanding window time series cross-validation splitter.

    Provides train/test indices to split time series data samples
    that are observed at fixed time intervals, in train/test sets.
    In each split, test indices must be higher than before, and thus
    shuffling in cross validator is inappropriate.

    The training set grows with each split (expanding window), meaning
    successive training sets are supersets of those that come before them.
    This is useful when more data generally leads to better models and
    when you want to simulate accumulating historical data over time.

    Parameters
    ----------
    n_splits : int, default=3
        Number of splits. Must be at least 2.
    max_train_size : int, default=None
        Maximum size for a single training set. If None, all available
        training data is used.
    test_size : int, default=None
        Used to limit the size of the test set. Defaults to
        ``n_samples // (n_splits + 1)``, which is the maximum allowed
        value with no overlap between test sets.

    Examples
    --------
    >>> import polars as pl
    >>> from datetime import datetime, timedelta
    >>> from yohou.model_selection import ExpandingWindowSplitter
    >>>
    >>> # Create time series
    >>> time = [datetime(2020, 1, 1) + timedelta(days=i) for i in range(100)]
    >>> y = pl.DataFrame({"time": time, "value": range(100)})
    >>>
    >>> # 3 splits with 10-day test windows
    >>> splitter = ExpandingWindowSplitter(n_splits=3, test_size=10)
    >>> splits = list(splitter.split(y))
    >>> len(splits)
    3
    >>>
    >>> # First split: train on [0:70], test on [70:80]
    >>> train, test = splits[0]
    >>> len(train), len(test)
    (70, 10)
    >>>
    >>> # Second split: train on [0:80], test on [80:90] (training set grows)
    >>> train, test = splits[1]
    >>> len(train), len(test)
    (80, 10)
    >>>

    Notes
    -----
    - Training sets grow with each split (expanding window)
    - Test sets do not overlap
    - All data is used in temporal order
    - For panel data, splits all groups together using row indices

    See Also
    --------
    - [`SlidingWindowSplitter`][yohou.model_selection.split.SlidingWindowSplitter] : Fixed-size rolling window splitter

    """

    _parameter_constraints: dict = {
        "n_splits": [Interval(numbers.Integral, 2, None, closed="left")],
        "max_train_size": [Interval(numbers.Integral, 1, None, closed="left"), None],
        "test_size": [Interval(numbers.Integral, 1, None, closed="left"), None],
    }

    _tags: ClassVar[dict[str, Any]] = {"splitter_type": "expanding"}

    def __init__(
        self,
        n_splits: int = 3,
        *,
        max_train_size: int | None = None,
        test_size: int | None = None,
    ) -> None:
        self.n_splits = n_splits
        self.max_train_size = max_train_size
        self.test_size = test_size

        # Validate parameters
        self._validate_params()

    def split(
        self,
        y: pl.DataFrame,
        X_actual: pl.DataFrame | None = None,
    ) -> Iterator[tuple[np.ndarray[Any, np.dtype[np.intp]], np.ndarray[Any, np.dtype[np.intp]]]]:
        """Generate indices to split time series data with expanding windows.

        Parameters
        ----------
        y : pl.DataFrame
            Target time series used to generate train/test split indices.
            Must have a ``"time"`` column.
        X_actual : pl.DataFrame or None, default=None
            Actual features.  Not used for splitting but accepted for
            API consistency.

        Yields
        ------
        train : ndarray
            Training set row indices for that split.
        test : ndarray
            Test set row indices for that split.

        """
        # Validate data
        y, X_actual = validate_splitter_data(self, y=y, X_actual=X_actual)

        n_samples = len(y)
        indices = np.arange(n_samples)
        max_train_size = self.max_train_size

        # Delegate to concrete implementation
        for test_index in self._iter_test_indices(y, X_actual):
            train_end = test_index[0]
            train_index = indices[indices < train_end]

            # Apply max_train_size if specified
            if max_train_size is not None and len(train_index) > max_train_size:
                train_index = train_index[-max_train_size:]

            yield train_index, test_index

    def _iter_test_indices(
        self,
        y: pl.DataFrame,
        X_actual: pl.DataFrame | None = None,
    ) -> Iterator[np.ndarray[Any, np.dtype[np.intp]]]:
        """Generate test indices for expanding window splits.

        Parameters
        ----------
        y : pl.DataFrame
            Target time series.
        X_actual : pl.DataFrame or None, default=None
            Actual features. Not used for splitting but accepted for
            API consistency.

        Yields
        ------
        test : ndarray
            Test set indices for this split.

        """
        n_samples = len(y)
        n_splits = self.n_splits
        n_folds = n_splits + 1
        test_size = self.test_size if self.test_size is not None else n_samples // n_folds

        if n_folds > n_samples:
            raise ValueError(f"Cannot have number of folds={n_folds} greater than the number of samples={n_samples}.")

        if test_size >= n_samples:
            raise ValueError(f"test_size={test_size} should be less than the number of samples={n_samples}.")

        test_starts = range(n_samples - n_splits * test_size, n_samples, test_size)

        for test_start in test_starts:
            if test_start < 0:
                continue
            yield np.arange(test_start, test_start + test_size, dtype=np.intp)

    def get_n_splits(
        self,
        y: pl.DataFrame | None = None,
        X_actual: pl.DataFrame | None = None,
    ) -> int:
        """Return the number of cross-validation folds.

        Parameters
        ----------
        y : pl.DataFrame or None, default=None
            Not used.  Accepted for API consistency.
        X_actual : pl.DataFrame or None, default=None
            Not used.  Accepted for API consistency.

        Returns
        -------
        int
            The number of cross-validation folds.

        """
        return self.n_splits

Methods

split(y, X_actual=None)

Generate indices to split time series data with expanding windows.

Parameters
y : DataFrame
    Target time series used to generate train/test split indices. Must have a "time" column.
X_actual : DataFrame or None, default=None
    Actual features. Not used for splitting but accepted for API consistency.

Yields

train : ndarray
    Training set row indices for that split.
test : ndarray
    Test set row indices for that split.
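The yielded index pairs are typically consumed in a fit-then-score loop. The sketch below uses plain Python lists and hard-coded index ranges that match the class example (`n_splits=3`, `test_size=10`) instead of calling the library. The naive mean forecaster is a stand-in chosen for illustration, not something yohou provides under that name.

```python
from statistics import fmean

values = list(range(100))  # stand-in for the "value" column

# Index ranges matching the class example (n_splits=3, test_size=10).
splits = [
    (range(0, 70), range(70, 80)),
    (range(0, 80), range(80, 90)),
    (range(0, 90), range(90, 100)),
]

errors = []
for train_idx, test_idx in splits:
    # Fit: a naive forecaster that predicts the training mean.
    forecast = fmean([values[i] for i in train_idx])
    # Score: mean absolute error on the held-out window.
    mae = fmean([abs(values[i] - forecast) for i in test_idx])
    errors.append(mae)

print(errors)  # [40.0, 45.0, 50.0]
```

Because every training window ends before its test window starts, no future observation leaks into the fit.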

get_n_splits(y=None, X_actual=None)

Return the number of cross-validation folds.

Parameters
y : DataFrame or None, default=None
    Not used. Accepted for API consistency.
X_actual : DataFrame or None, default=None
    Not used. Accepted for API consistency.

Returns

int
    The number of cross-validation folds.
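One detail is worth checking against the `_iter_test_indices` source in the class listing. `get_n_splits` returns `n_splits` without looking at the data, while the generator skips any test window whose start would be negative. The sketch below reproduces only that start arithmetic in plain Python. The helper name is hypothetical, and the sketch assumes no other validation rejects the input first. Under those assumptions, fewer splits may be yielded than reported when `n_splits * test_size` exceeds the series length.

```python
def yielded_test_starts(n_samples, n_splits, test_size):
    """Return the test-window starts that survive the `test_start < 0` skip."""
    starts = range(n_samples - n_splits * test_size, n_samples, test_size)
    return [s for s in starts if s >= 0]


# Enough data: all three requested windows fit.
print(yielded_test_starts(n_samples=100, n_splits=3, test_size=10))  # [70, 80, 90]

# Too little data: the first window would start at -2 and is skipped.
print(yielded_test_starts(n_samples=10, n_splits=3, test_size=4))    # [2, 6]
```

If the reported count matters, for example when preallocating a score array, comparing it with `len(list(splitter.split(y)))` on the actual data is a cheap safeguard.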

Tutorials

The following example notebooks use this component:

  • How to Tune Fourier Seasonality Terms (Data-Features): Explore how Fourier harmonic count affects seasonal fit quality, compare Fourier vs Pattern seasonality, and tune harmonics jointly with GridSearchCV.
  • How to Handle Short Series (Data-Features): Use Fourier seasonality, simple train/test splits, and panel pooling when individual series are too short for standard approaches.
  • Cross-Validation for Time Series (Evaluation-Search): Evaluate forecasters with cross_val_score, cross_validate, and cross_val_predict using temporal splitters.
  • How to Run Panel Cross-Validation (Panel-Data): Time series cross-validation on panel data with GridSearchCV, selective group observation, rewind operations, and groupwise performance comparison.