SklearnTransformer

yohou.preprocessing.sklearn_base.SklearnTransformer

Bases: BaseClassWrapper, BaseTransformer

Wrapper to integrate sklearn transformers into the Yohou pipeline.

Preserves the polars DataFrame structure and "time" column while applying sklearn scaling transformations to numeric columns.

This class can be used to:

  1. Wrap any sklearn-compatible transformer for use in yohou pipelines
  2. Serve as a base class for creating yohou transformer extensions

Parameters

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| transformer | type | The sklearn transformer class to wrap. Must be a subclass of sklearn.base.TransformerMixin. If not provided, _estimator_default_class is used (subclasses define this). | None |
| **params | dict | Parameters passed to the underlying sklearn transformer constructor. See the documentation of the specific transformer for available parameters. | {} |
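
The fallback to a class-level default described for the transformer parameter can be sketched in plain Python. This is only an illustration of the pattern, not yohou's implementation: `ClassWrapper`, `Doubler`, `DoublerWrapper`, and `build` are hypothetical names, and only the `_estimator_default_class` attribute name is taken from the documentation above.

```python
class Doubler:
    """Hypothetical stand-in for an sklearn-style transformer class."""

    def __init__(self, factor=2.0):
        self.factor = factor


class ClassWrapper:
    """Hypothetical, simplified stand-in for a class-wrapping base class."""

    # Subclasses may override this hook to pre-configure the wrapped class.
    _estimator_default_class = None

    def __init__(self, transformer=None, **params):
        # Fall back to the class-level default when no class is passed.
        cls = transformer if transformer is not None else self._estimator_default_class
        if cls is None:
            raise ValueError("No transformer class given and no default defined.")
        self.transformer = cls
        self.params = params

    def build(self):
        # Instantiate the wrapped class with the stored constructor parameters.
        return self.transformer(**self.params)


class DoublerWrapper(ClassWrapper):
    """Pre-configured subclass: only the default class needs to be set."""

    _estimator_default_class = Doubler


# Explicit class vs. pre-configured subclass.
explicit = ClassWrapper(transformer=Doubler, factor=3.0).build()
preconfigured = DoublerWrapper(factor=5.0).build()
```

Under these assumptions, a pre-configured wrapper only has to set the class attribute, and callers pass constructor parameters as keyword arguments.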

Attributes

| Name | Type | Description |
| --- | --- | --- |
| instance_ | TransformerMixin | The fitted sklearn transformer instance (created by BaseClassWrapper). |

Examples

>>> import polars as pl
>>> from datetime import datetime
>>> from sklearn.preprocessing import StandardScaler as SklearnStandardScaler
>>> from yohou.preprocessing import SklearnTransformer
>>> X = pl.DataFrame({
...     "time": [datetime(2024, 1, i) for i in range(1, 6)],
...     "value": [10.0, 20.0, 30.0, 40.0, 50.0],
... })
>>> transformer = SklearnTransformer(transformer=SklearnStandardScaler, with_mean=True)
>>> transformer.fit(X)
SklearnTransformer(...)
>>> X_transformed = transformer.transform(X)
>>> "time" in X_transformed.columns
True

See Also

  • StandardScaler : Pre-configured wrapper for sklearn's StandardScaler.
  • MinMaxScaler : Pre-configured wrapper for sklearn's MinMaxScaler.
  • RobustScaler : Pre-configured wrapper for sklearn's RobustScaler.
  • MaxAbsScaler : Pre-configured wrapper for sklearn's MaxAbsScaler.

Source Code

class SklearnTransformer(BaseClassWrapper, BaseTransformer):
    """Wrapper to integrate sklearn transformers into the Yohou pipeline.

    Preserves the polars DataFrame structure and "time" column while applying
    sklearn scaling transformations to numeric columns.

    This class can be used to:

    1. Wrap any sklearn-compatible transformer for use in yohou pipelines
    2. Serve as a base class for creating yohou transformer extensions

    Parameters
    ----------
    transformer : type, default=None
        The sklearn transformer class to wrap. Must be a subclass of
        ``sklearn.base.TransformerMixin``. If not provided,
        ``_estimator_default_class`` is used (subclasses define this).

    **params : dict
        Parameters passed to the underlying sklearn transformer constructor.
        See the documentation of the specific transformer for available parameters.

    Attributes
    ----------
    instance_ : TransformerMixin
        The fitted sklearn transformer instance (created by ``BaseClassWrapper``).

    Examples
    --------
    >>> import polars as pl
    >>> from datetime import datetime
    >>> from sklearn.preprocessing import StandardScaler as SklearnStandardScaler
    >>> from yohou.preprocessing import SklearnTransformer
    >>> X = pl.DataFrame({
    ...     "time": [datetime(2024, 1, i) for i in range(1, 6)],
    ...     "value": [10.0, 20.0, 30.0, 40.0, 50.0],
    ... })
    >>> transformer = SklearnTransformer(transformer=SklearnStandardScaler, with_mean=True)
    >>> transformer.fit(X)  # doctest: +ELLIPSIS
    SklearnTransformer(...)
    >>> X_transformed = transformer.transform(X)
    >>> "time" in X_transformed.columns
    True

    See Also
    --------
    - [`StandardScaler`][yohou.preprocessing.sklearn_wrappers.StandardScaler] : Pre-configured wrapper for sklearn's StandardScaler.
    - [`MinMaxScaler`][yohou.preprocessing.sklearn_wrappers.MinMaxScaler] : Pre-configured wrapper for sklearn's MinMaxScaler.
    - [`RobustScaler`][yohou.preprocessing.sklearn_wrappers.RobustScaler] : Pre-configured wrapper for sklearn's RobustScaler.
    - [`MaxAbsScaler`][yohou.preprocessing.sklearn_wrappers.MaxAbsScaler] : Pre-configured wrapper for sklearn's MaxAbsScaler.

    """

    _estimator_name = "transformer"
    _estimator_base_class = TransformerMixin
    _estimator_default_class: type | None = None

    _parameter_constraints: dict = {
        "transformer": [HasMethods(["fit", "transform"]), None],
    }

    def __init__(self, transformer=None, **params):
        if transformer is not None:
            super().__init__(transformer=transformer, **params)
        else:
            super().__init__(**params)

    def __sklearn_tags__(self):
        """Get estimator tags.

        Override to ensure stateful=False before and after fit. The invertible tag
        is set dynamically based on whether the wrapped transformer has inverse_transform.

        Returns
        -------
        Tags
            Estimator tags with stateful=False and invertible based on underlying transformer.

        """
        tags = super().__sklearn_tags__()
        # sklearn transformers are always stateless (no memory / observation horizon)
        if tags.transformer_tags is not None:
            tags.transformer_tags.stateful = False
            # Invertible only if underlying transformer has inverse_transform
            tags.transformer_tags.invertible = _transformer_has_inverse(self)
        return tags

    @_fit_context(prefer_skip_nested_validation=True)
    def fit(self, X: pl.DataFrame, y: pl.DataFrame | None = None, **params) -> "SklearnTransformer":
        """Fit the transformer to the data.

        Computes scaling parameters (e.g., mean, std, min, max) from the
        training data, excluding the "time" column.

        Parameters
        ----------
        X : pl.DataFrame
            Input time series with "time" column.

        y : pl.DataFrame or None, default=None
            Target time series. Ignored and only present for API consistency.

        **params : dict
            Metadata to route to nested estimators.

        Returns
        -------
        self
            Fitted transformer.

        Raises
        ------
        ValueError
            If X does not have a "time" column.

        """
        # Validate input data (checks time column, schema, etc.)
        X = validate_transformer_data(self, X=X, reset=True)

        # Call parent fit (stores schema, memory, etc.)
        BaseTransformer.fit(self, X, y, **params)

        # Strip time column before fitting sklearn transformer
        X_no_time = X.select(~cs.by_name("time"))

        # Configure transformer output and fit (instance_ created by _fit_context)
        self.instance_.set_output(transform="polars")
        self.instance_.fit(X_no_time)

        return self

    def transform(self, X: pl.DataFrame, **params) -> pl.DataFrame:
        """Transform the input time series.

        Applies the learned scaling transformation to each feature.

        Parameters
        ----------
        X : pl.DataFrame
            Feature time series with "time" column.

        **params : dict
            Metadata to route to nested estimators.

        Returns
        -------
        pl.DataFrame
            Transformed time series with "time" column preserved.

        """
        check_is_fitted(self, ["instance_", "X_schema_", "feature_names_in_"])

        # Validate input data
        X = validate_transformer_data(self, X=X, reset=False, check_continuity=False)

        # Strip time column before transforming
        time = X.select(cs.by_name("time"))
        X_no_time = X.select(~cs.by_name("time"))

        # Apply scaling transformation
        X_scaled_no_time = self.instance_.transform(X_no_time)

        # Reattach time column to the scaled features
        return pl.concat([time, X_scaled_no_time], how="horizontal")

    @available_if(_transformer_has_inverse)
    def inverse_transform(self, X_t: pl.DataFrame, X_p: pl.DataFrame | None = None, **params) -> pl.DataFrame:
        """Apply the inverse transformation to the data.

        This method is only available if the underlying sklearn transformer
        supports inverse_transform (e.g., StandardScaler, PowerTransformer).

        Reverts the scaling transformation, restoring the original data scale.

        Parameters
        ----------
        X_t : pl.DataFrame
            Scaled features with "time" column.

        X_p : pl.DataFrame or None, default=None
            Past observations for stateful inverse transformation. Ignored for
            sklearn wrappers since sklearn transformers are stateless.

        **params : dict
            Metadata to route to nested estimators.

        Returns
        -------
        pl.DataFrame
            Unscaled features with "time" column preserved.

        """
        check_is_fitted(self, ["instance_"])
        X_t, _ = validate_transformer_data(self, X=X_t, reset=False, inverse=True, check_continuity=False)

        # Strip time column before inverse transforming
        time = X_t.select(cs.by_name("time"))
        X_no_time = X_t.select(~cs.by_name("time"))

        # Apply inverse scaling transformation (returns numpy array)
        X_unscaled_array = self.instance_.inverse_transform(X_no_time)

        # Convert back to DataFrame with original column names
        X_unscaled_no_time = pl.DataFrame(X_unscaled_array, schema=X_no_time.columns, orient="row")

        # Reattach time column to the unscaled features
        return pl.concat([time, X_unscaled_no_time], how="horizontal")

    def get_feature_names_out(self, input_features: list[str] | None = None) -> list[str]:
        """Get output feature names for transformation.

        Parameters
        ----------
        input_features : list of str or None, default=None
            Input features. If None, uses feature names from fit.

        Returns
        -------
        list of str
            Transformed feature names (same as input features for transformers).

        """
        check_is_fitted(self, ["instance_"])
        return list(self.instance_.get_feature_names_out(input_features))

Methods

__sklearn_tags__()

Get estimator tags.

Override to ensure stateful=False before and after fit. The invertible tag is set dynamically based on whether the wrapped transformer has inverse_transform.

Returns
| Type | Description |
| --- | --- |
| Tags | Estimator tags with stateful=False and invertible based on underlying transformer. |

Source Code
def __sklearn_tags__(self):
    """Get estimator tags.

    Override to ensure stateful=False before and after fit. The invertible tag
    is set dynamically based on whether the wrapped transformer has inverse_transform.

    Returns
    -------
    Tags
        Estimator tags with stateful=False and invertible based on underlying transformer.

    """
    tags = super().__sklearn_tags__()
    # sklearn transformers are always stateless (no memory / observation horizon)
    if tags.transformer_tags is not None:
        tags.transformer_tags.stateful = False
        # Invertible only if underlying transformer has inverse_transform
        tags.transformer_tags.invertible = _transformer_has_inverse(self)
    return tags

fit(X, y=None, **params)

Fit the transformer to the data.

Computes scaling parameters (e.g., mean, std, min, max) from the training data, excluding the "time" column.
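
As a rough illustration of what "computing scaling parameters while excluding the time column" means, the sketch below uses only the standard library. It is not the wrapper's code: `fit_column_stats` is a hypothetical helper, columns are modelled as a plain dict of lists, and the population standard deviation is used because that is what sklearn's StandardScaler computes.

```python
from datetime import datetime
from statistics import fmean, pstdev


def fit_column_stats(columns):
    """Return {name: (mean, std)} for every column except "time".

    Hypothetical helper: `columns` maps column names to lists of values.
    """
    if "time" not in columns:
        raise ValueError('Input must contain a "time" column.')
    return {
        name: (fmean(values), pstdev(values))  # population std (ddof=0)
        for name, values in columns.items()
        if name != "time"  # the time column never contributes statistics
    }


X = {
    "time": [datetime(2024, 1, i) for i in range(1, 6)],
    "value": [10.0, 20.0, 30.0, 40.0, 50.0],
}
stats = fit_column_stats(X)
```

Here `stats` holds one `(mean, std)` pair per feature column and no entry for "time".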

Parameters
| Name | Type | Description | Default |
| --- | --- | --- | --- |
| X | DataFrame | Input time series with "time" column. | required |
| y | DataFrame or None | Target time series. Ignored and only present for API consistency. | None |
| **params | dict | Metadata to route to nested estimators. | {} |
Returns
| Type | Description |
| --- | --- |
| self | Fitted transformer. |

Raises
| Type | Description |
| --- | --- |
| ValueError | If X does not have a "time" column. |

Source Code
@_fit_context(prefer_skip_nested_validation=True)
def fit(self, X: pl.DataFrame, y: pl.DataFrame | None = None, **params) -> "SklearnTransformer":
    """Fit the transformer to the data.

    Computes scaling parameters (e.g., mean, std, min, max) from the
    training data, excluding the "time" column.

    Parameters
    ----------
    X : pl.DataFrame
        Input time series with "time" column.

    y : pl.DataFrame or None, default=None
        Target time series. Ignored and only present for API consistency.

    **params : dict
        Metadata to route to nested estimators.

    Returns
    -------
    self
        Fitted transformer.

    Raises
    ------
    ValueError
        If X does not have a "time" column.

    """
    # Validate input data (checks time column, schema, etc.)
    X = validate_transformer_data(self, X=X, reset=True)

    # Call parent fit (stores schema, memory, etc.)
    BaseTransformer.fit(self, X, y, **params)

    # Strip time column before fitting sklearn transformer
    X_no_time = X.select(~cs.by_name("time"))

    # Configure transformer output and fit (instance_ created by _fit_context)
    self.instance_.set_output(transform="polars")
    self.instance_.fit(X_no_time)

    return self

transform(X, **params)

Transform the input time series.

Applies the learned scaling transformation to each feature.
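
The strip, transform, reattach flow can be sketched without any dependencies. This is a simplified illustration rather than the actual method: `transform_keep_time` is a hypothetical helper, columns are a plain dict of lists, and the `(mean, std)` pair in `stats` is made up for the example.

```python
from datetime import datetime


def transform_keep_time(columns, stats):
    """Standardize every non-time column and put "time" back in front.

    Hypothetical helper: `stats` maps feature names to (mean, std) pairs.
    """
    # 1. Set the time column aside.
    time = columns["time"]
    features = {name: values for name, values in columns.items() if name != "time"}

    # 2. Apply the learned transformation to the feature columns only.
    scaled = {
        name: [(x - stats[name][0]) / stats[name][1] for x in values]
        for name, values in features.items()
    }

    # 3. Reattach the untouched time column as the first column.
    return {"time": time, **scaled}


X = {
    "time": [datetime(2024, 1, i) for i in range(1, 6)],
    "value": [10.0, 20.0, 30.0, 40.0, 50.0],
}
stats = {"value": (30.0, 10.0)}  # illustrative (mean, std), not fitted values
X_transformed = transform_keep_time(X, stats)
```

The time values pass through unchanged, and only the feature columns are rescaled.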

Parameters
| Name | Type | Description | Default |
| --- | --- | --- | --- |
| X | DataFrame | Feature time series with "time" column. | required |
| **params | dict | Metadata to route to nested estimators. | {} |
Returns
| Type | Description |
| --- | --- |
| DataFrame | Transformed time series with "time" column preserved. |

Source Code
def transform(self, X: pl.DataFrame, **params) -> pl.DataFrame:
    """Transform the input time series.

    Applies the learned scaling transformation to each feature.

    Parameters
    ----------
    X : pl.DataFrame
        Feature time series with "time" column.

    **params : dict
        Metadata to route to nested estimators.

    Returns
    -------
    pl.DataFrame
        Transformed time series with "time" column preserved.

    """
    check_is_fitted(self, ["instance_", "X_schema_", "feature_names_in_"])

    # Validate input data
    X = validate_transformer_data(self, X=X, reset=False, check_continuity=False)

    # Strip time column before transforming
    time = X.select(cs.by_name("time"))
    X_no_time = X.select(~cs.by_name("time"))

    # Apply scaling transformation
    X_scaled_no_time = self.instance_.transform(X_no_time)

    # Reattach time column to the scaled features
    return pl.concat([time, X_scaled_no_time], how="horizontal")

inverse_transform(X_t, X_p=None, **params)

Apply the inverse transformation to the data.

This method is only available if the underlying sklearn transformer supports inverse_transform (e.g., StandardScaler, PowerTransformer).

Reverts the scaling transformation, restoring the original data scale.
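
Conditional availability of a method can be illustrated with a small descriptor. The sketch below is a simplified stand-in for the idea behind sklearn's `available_if`, not its actual implementation. `available_if_sketch`, `Wrapper`, `Shift`, and `Absolute` are hypothetical names.

```python
class available_if_sketch:
    """Descriptor that exposes a method only when check(obj) is true."""

    def __init__(self, check):
        self.check = check
        self.fn = None

    def __call__(self, fn):
        # Used as a decorator: remember the function, return the descriptor.
        self.fn = fn
        return self

    def __get__(self, obj, owner=None):
        if obj is None:
            return self.fn
        if not self.check(obj):
            # Raising AttributeError makes hasattr() return False.
            raise AttributeError(
                f"{type(obj).__name__!r} object has no attribute {self.fn.__name__!r}"
            )
        return self.fn.__get__(obj, owner)


class Shift:
    """Invertible toy transformer."""

    def transform(self, values):
        return [v + 1 for v in values]

    def inverse_transform(self, values):
        return [v - 1 for v in values]


class Absolute:
    """Toy transformer without an inverse."""

    def transform(self, values):
        return [abs(v) for v in values]


class Wrapper:
    def __init__(self, inner):
        self.inner = inner

    @available_if_sketch(lambda self: hasattr(self.inner, "inverse_transform"))
    def inverse_transform(self, values):
        return self.inner.inverse_transform(values)


invertible = Wrapper(Shift())
not_invertible = Wrapper(Absolute())
```

With this pattern, `hasattr(wrapper, "inverse_transform")` reflects whether the wrapped object supports inversion, which is the kind of check a dynamically set invertible tag could rely on.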

Parameters
| Name | Type | Description | Default |
| --- | --- | --- | --- |
| X_t | DataFrame | Scaled features with "time" column. | required |
| X_p | DataFrame or None | Past observations for stateful inverse transformation. Ignored for sklearn wrappers since sklearn transformers are stateless. | None |
| **params | dict | Metadata to route to nested estimators. | {} |
Returns
| Type | Description |
| --- | --- |
| DataFrame | Unscaled features with "time" column preserved. |

Source Code
@available_if(_transformer_has_inverse)
def inverse_transform(self, X_t: pl.DataFrame, X_p: pl.DataFrame | None = None, **params) -> pl.DataFrame:
    """Apply the inverse transformation to the data.

    This method is only available if the underlying sklearn transformer
    supports inverse_transform (e.g., StandardScaler, PowerTransformer).

    Reverts the scaling transformation, restoring the original data scale.

    Parameters
    ----------
    X_t : pl.DataFrame
        Scaled features with "time" column.

    X_p : pl.DataFrame or None, default=None
        Past observations for stateful inverse transformation. Ignored for
        sklearn wrappers since sklearn transformers are stateless.

    **params : dict
        Metadata to route to nested estimators.

    Returns
    -------
    pl.DataFrame
        Unscaled features with "time" column preserved.

    """
    check_is_fitted(self, ["instance_"])
    X_t, _ = validate_transformer_data(self, X=X_t, reset=False, inverse=True, check_continuity=False)

    # Strip time column before inverse transforming
    time = X_t.select(cs.by_name("time"))
    X_no_time = X_t.select(~cs.by_name("time"))

    # Apply inverse scaling transformation (returns numpy array)
    X_unscaled_array = self.instance_.inverse_transform(X_no_time)

    # Convert back to DataFrame with original column names
    X_unscaled_no_time = pl.DataFrame(X_unscaled_array, schema=X_no_time.columns, orient="row")

    # Reattach time column to the unscaled features
    return pl.concat([time, X_unscaled_no_time], how="horizontal")

get_feature_names_out(input_features=None)

Get output feature names for transformation.

Parameters
| Name | Type | Description | Default |
| --- | --- | --- | --- |
| input_features | list of str or None | Input features. If None, uses feature names from fit. | None |
Returns
| Type | Description |
| --- | --- |
| list of str | Transformed feature names (same as input features for transformers). |

Source Code
def get_feature_names_out(self, input_features: list[str] | None = None) -> list[str]:
    """Get output feature names for transformation.

    Parameters
    ----------
    input_features : list of str or None, default=None
        Input features. If None, uses feature names from fit.

    Returns
    -------
    list of str
        Transformed feature names (same as input features for transformers).

    """
    check_is_fitted(self, ["instance_"])
    return list(self.instance_.get_feature_names_out(input_features))