TransformedSpaceKNNImputer

yohou.preprocessing.imputation.TransformedSpaceKNNImputer

Bases: BaseTransformer

K-nearest neighbors imputation in a transformed feature space.

Projects the data through an optional transformer before performing KNN imputation. Both the neighbor search and the imputation run in the transformed representation, and transform returns the imputed data in that representation (no inverse transform is applied). This makes the result fundamentally different from composing a transformer and a KNN imputer sequentially in a pipeline.

When transformer=None, imputation happens directly on the raw features. Setting transformer=LagTransformer(lag=k) subsumes a window-based KNN imputer: neighbors are then lag-feature vectors (temporally similar windows) rather than individual time points. Any projection (PolynomialFeatures, SplineTransformer, PCA, …) can be used as the imputation space.
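
The idea of searching neighbors among lag windows rather than single points can be illustrated with a small standard-library sketch. The helper `lag_windows` is hypothetical and assumes a simple embedding that stacks the previous `lag` values into one row; the columns that LagTransformer actually emits may differ.

```python
import math


def lag_windows(values, lag):
    """Hypothetical lag embedding: one row [x[t-1], ..., x[t-lag]] per t."""
    return [
        [values[t - j] for j in range(1, lag + 1)]
        for t in range(lag, len(values))
    ]


def euclidean(a, b):
    """Plain Euclidean distance between two equally long rows."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# A short repeating series; the last point breaks the pattern.
series = [1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 9.0]
lag = 2
windows = lag_windows(series, lag)

# Nearest neighbor of the first window among all later windows.
nearest = min(
    range(1, len(windows)),
    key=lambda i: euclidean(windows[0], windows[i]),
)

# The value that followed the neighboring window is a plausible donor
# for the value that follows the first window.
neighbor_value = series[lag + nearest]
print(windows[0], windows[nearest], neighbor_value)
```

In this toy series the closest window has the same recent history as the query window, so its following value matches the pattern, which is the intuition behind calling the lag-space variant a window-based KNN imputer.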

Parameters

n_neighbors : int, default=5
    Number of neighboring samples to use for imputation.

weights : {"uniform", "distance"}, default="uniform"
    Weight function used in prediction:

      • "uniform": All points in the neighborhood weighted equally.
      • "distance": Closer neighbors have greater influence.

metric : {"nan_euclidean"}, default="nan_euclidean"
    Distance metric for searching neighbors. Only nan_euclidean is supported as it handles missing values.

transformer : BaseTransformer or None, default=None
    An optional yohou transformer used to project the data before KNN imputation. Must implement fit / transform. If None, imputation is performed directly on the raw features.
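
To make the interplay of n_neighbors, weights, and metric concrete, here is a minimal standard-library sketch of imputing a single missing value. It follows my understanding of how scikit-learn's KNNImputer behaves with the nan_euclidean metric; it is not the library's code, and the function names `nan_euclidean` and `knn_impute_value` are illustrative only.

```python
import math


def nan_euclidean(a, b):
    """Distance over coordinates observed in both rows, rescaled by
    (total coordinates / observed coordinates); NaN if nothing overlaps."""
    pairs = [(x, y) for x, y in zip(a, b)
             if not (math.isnan(x) or math.isnan(y))]
    if not pairs:
        return math.nan
    scale = len(a) / len(pairs)
    return math.sqrt(scale * sum((x - y) ** 2 for x, y in pairs))


def knn_impute_value(row, col, donors, n_neighbors=5, weights="uniform"):
    """Impute row[col] from the nearest donor rows that observe `col`."""
    candidates = []
    for donor in donors:
        if math.isnan(donor[col]):
            continue  # a donor must have the target feature observed
        d = nan_euclidean(row, donor)
        if not math.isnan(d):
            candidates.append((d, donor[col]))
    candidates.sort(key=lambda pair: pair[0])
    nearest = candidates[:n_neighbors]
    if not nearest:
        return math.nan
    if weights == "uniform":
        return sum(v for _, v in nearest) / len(nearest)
    # "distance": weight by 1/d; exact matches take all the weight.
    exact = [v for d, v in nearest if d == 0.0]
    if exact:
        return sum(exact) / len(exact)
    total = sum(1.0 / d for d, _ in nearest)
    return sum(v / d for d, v in nearest) / total


donors = [[1.0, 10.0], [2.0, 20.0], [4.0, 40.0]]
row = [3.0, math.nan]

uniform_2 = knn_impute_value(row, 1, donors, n_neighbors=2)
distance_3 = knn_impute_value(row, 1, donors, n_neighbors=3,
                              weights="distance")
print(uniform_2, distance_3)
```

With two neighbors and uniform weights the two closest donors are averaged; with three neighbors and distance weights the farther donor still contributes, but with half the weight of the two closer ones.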

Attributes

imputer_ : sklearn KNNImputer
    The fitted sklearn KNNImputer instance (fitted in transformed space).

transformer_ : BaseTransformer or None
    A deep-copied and fitted instance of the transformer (or None).

Examples

>>> import polars as pl
>>> from datetime import datetime
>>> import numpy as np
>>> from yohou.preprocessing import TransformedSpaceKNNImputer

Basic usage (no transformer, raw-feature KNN):

>>> X = pl.DataFrame({
...     "time": [datetime(2020, 1, i) for i in range(1, 11)],
...     "value": [1.0, 2.0, np.nan, 4.0, 5.0, 6.0, np.nan, 8.0, 9.0, 10.0],
... })
>>> imputer = TransformedSpaceKNNImputer(n_neighbors=3)
>>> imputer.fit(X)
TransformedSpaceKNNImputer(...)
>>> X_imputed = imputer.transform(X)
>>> # np.nan is stored as NaN (not null) in polars, so count NaN values directly
>>> X_imputed["value"].is_nan().sum()
0

With a lag transformer (window-based KNN):

>>> from yohou.preprocessing import LagTransformer
>>> X = pl.DataFrame({
...     "time": [datetime(2020, 1, i) for i in range(1, 21)],
...     "value": [float(i) for i in range(1, 21)],
... })
>>> imputer = TransformedSpaceKNNImputer(
...     n_neighbors=3,
...     transformer=LagTransformer(lag=3),
... )
>>> imputer.fit(X)
TransformedSpaceKNNImputer(...)
>>> X_t = imputer.transform(X)
>>> X_t["value_lag_3"].null_count()
0

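See Also

  • LagTransformer (yohou.preprocessing.window.LagTransformer): Creates lagged features from time series.
  • SimpleTimeImputer (yohou.preprocessing.imputation.SimpleTimeImputer): Interpolation-based imputation.
  • SimpleImputer (yohou.preprocessing.imputation.SimpleImputer): Simple constant-strategy imputation.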

Source Code

class TransformedSpaceKNNImputer(BaseTransformer):
    """K-nearest neighbors imputation in a transformed feature space.

    Projects the data through an optional transformer before performing KNN
    imputation.  Neighbor search *and* imputation are both performed in the
    transformed representation, making the result fundamentally different from
    composing a transformer and a KNN imputer sequentially in a pipeline.

    When ``transformer=None``, imputation happens directly on the raw features.
    Setting ``transformer=LagTransformer(lag=k)`` subsumes a window-based KNN
    imputer because neighbors are now lag-feature vectors, i.e. temporally
    similar windows, rather than individual time points.  Any projection
    (``PolynomialFeatures``, ``SplineTransformer``, PCA, …) can be used as the
    imputation space.

    Parameters
    ----------
    n_neighbors : int, default=5
        Number of neighboring samples to use for imputation.
    weights : {"uniform", "distance"}, default="uniform"
        Weight function used in prediction:

        - ``"uniform"``: All points in the neighborhood weighted equally.
        - ``"distance"``: Closer neighbors have greater influence.
    metric : {"nan_euclidean"}, default="nan_euclidean"
        Distance metric for searching neighbors.  Only ``nan_euclidean`` is
        supported as it handles missing values.
    transformer : BaseTransformer or None, default=None
        An optional yohou transformer used to project the data before KNN
        imputation.  Must implement ``fit`` / ``transform``.  If ``None``,
        imputation is performed directly on the raw features.

    Attributes
    ----------
    imputer_ : sklearn KNNImputer
        The fitted sklearn KNNImputer instance (fitted in transformed space).
    transformer_ : BaseTransformer or None
        A deep-copied and fitted instance of the transformer (or ``None``).

    Examples
    --------
    >>> import polars as pl
    >>> from datetime import datetime
    >>> import numpy as np
    >>> from yohou.preprocessing import TransformedSpaceKNNImputer

    Basic usage (no transformer, raw-feature KNN):

    >>> X = pl.DataFrame({
    ...     "time": [datetime(2020, 1, i) for i in range(1, 11)],
    ...     "value": [1.0, 2.0, np.nan, 4.0, 5.0, 6.0, np.nan, 8.0, 9.0, 10.0],
    ... })
    >>> imputer = TransformedSpaceKNNImputer(n_neighbors=3)
    >>> imputer.fit(X)
    TransformedSpaceKNNImputer(...)
    >>> X_imputed = imputer.transform(X)
    >>> X_imputed["value"].null_count()
    0

    With a lag transformer (window-based KNN):

    >>> from yohou.preprocessing import LagTransformer
    >>> X = pl.DataFrame({
    ...     "time": [datetime(2020, 1, i) for i in range(1, 21)],
    ...     "value": [float(i) for i in range(1, 21)],
    ... })
    >>> imputer = TransformedSpaceKNNImputer(
    ...     n_neighbors=3,
    ...     transformer=LagTransformer(lag=3),
    ... )
    >>> imputer.fit(X)
    TransformedSpaceKNNImputer(...)
    >>> X_t = imputer.transform(X)
    >>> X_t["value_lag_3"].null_count()
    0

    See Also
    --------
    - [`LagTransformer`][yohou.preprocessing.window.LagTransformer] : Creates lagged features from time series.
    - [`SimpleTimeImputer`][yohou.preprocessing.imputation.SimpleTimeImputer] : Interpolation-based imputation.
    - [`SimpleImputer`][yohou.preprocessing.imputation.SimpleImputer] : Simple constant-strategy imputation.

    """

    _parameter_constraints: dict = {
        "n_neighbors": [Interval(numbers.Integral, 1, None, closed="left")],
        "weights": [StrOptions({"uniform", "distance"})],
        "metric": [StrOptions({"nan_euclidean"})],
        "transformer": [None, BaseTransformer],
    }

    def __init__(
        self,
        n_neighbors: int = 5,
        weights: str = "uniform",
        metric: str = "nan_euclidean",
        transformer: BaseTransformer | None = None,
    ):
        self.n_neighbors = n_neighbors
        self.weights = weights
        self.metric = metric
        self.transformer = transformer

    def __sklearn_tags__(self) -> Tags:
        """Get estimator tags.

        Returns
        -------
        Tags
            Estimator tags.

        """
        tags = super().__sklearn_tags__()
        assert tags.transformer_tags is not None
        # Stateful when the inner transformer declares itself stateful.
        # We query the *parameter* object's tags (not fitted state) so the
        # tag is stable before and after fit (check_tags_static_after_fit).
        if self.transformer is not None:
            inner_tags = self.transformer.__sklearn_tags__()
            if inner_tags.transformer_tags is not None:
                tags.transformer_tags.stateful = inner_tags.transformer_tags.stateful
        else:
            tags.transformer_tags.stateful = False
        tags.transformer_tags.invertible = False
        return tags

    @_fit_context(prefer_skip_nested_validation=True)
    def fit(self, X: pl.DataFrame, y: pl.DataFrame | None = None, **params) -> "TransformedSpaceKNNImputer":
        """Fit the imputer, optionally projecting through a transformer first.

        Parameters
        ----------
        X : pl.DataFrame
            Input time series with a ``"time"`` column (datetime) and one or
            more numeric columns.
        y : pl.DataFrame or None, default=None
            Ignored.  Present for API compatibility.
        **params : dict
            Metadata to route to nested estimators.

        Returns
        -------
        self
            The fitted imputer instance.

        """
        X = validate_transformer_data(self, X=X, reset=True)

        # Fit and apply the inner transformer (if any)
        if self.transformer is not None:
            self.transformer_ = deepcopy(self.transformer)
            self.transformer_.fit(X)
            X_projected = self.transformer_.transform(X)
            # Inherit observation_horizon from the inner transformer
            if hasattr(self.transformer_, "_observation_horizon"):
                self._observation_horizon = self.transformer_.observation_horizon
        else:
            self.transformer_ = None
            X_projected = X

        BaseTransformer.fit(self, X, y, **params)

        # Fit sklearn KNNImputer on the (optionally transformed) data
        X_no_time = X_projected.select(~cs.by_name("time"))
        self.imputer_ = sklearn_KNNImputer(
            n_neighbors=self.n_neighbors,
            weights=self.weights,
            metric=self.metric,
        )
        self.imputer_.fit(X_no_time.to_numpy())

        # Store output schema for transform
        self._output_schema = X_projected.schema

        return self

    def _transform(self, X: pl.DataFrame) -> pl.DataFrame:
        """Impute missing values, optionally in a transformed feature space.

        Parameters
        ----------
        X : pl.DataFrame
            Validated input time series.

        Returns
        -------
        pl.DataFrame
            Imputed time series.

        """
        # Project via inner transformer (if any)
        X_projected = self.transformer_.transform(X) if self.transformer_ is not None else X

        # Apply sklearn KNNImputer in the (optionally transformed) space
        time = X_projected.select(cs.by_name("time"))
        X_no_time = X_projected.select(~cs.by_name("time"))
        data_cols = X_no_time.columns

        X_imputed_np = self.imputer_.transform(X_no_time.to_numpy())
        X_imputed = pl.DataFrame(X_imputed_np, schema=data_cols)

        return pl.concat([time, X_imputed], how="horizontal")

    def get_feature_names_out(self, input_features: list[str] | None = None) -> list[str]:
        """Get output feature names for transformation.

        Parameters
        ----------
        input_features : list of str or None, default=None
            Column names of the input features.  If ``None``, uses the
            feature names seen during ``fit``.

        Returns
        -------
        list of str
            Output feature names after transformation.

        """
        check_is_fitted(self, ["imputer_"])
        if self.transformer_ is not None:
            return self.transformer_.get_feature_names_out(input_features)
        input_features = _check_feature_names_in(self, input_features)
        return list(input_features)

Methods

__sklearn_tags__()

Get estimator tags.

Returns

Tags
    Estimator tags.

Source Code
def __sklearn_tags__(self) -> Tags:
    """Get estimator tags.

    Returns
    -------
    Tags
        Estimator tags.

    """
    tags = super().__sklearn_tags__()
    assert tags.transformer_tags is not None
    # Stateful when the inner transformer declares itself stateful.
    # We query the *parameter* object's tags (not fitted state) so the
    # tag is stable before and after fit (check_tags_static_after_fit).
    if self.transformer is not None:
        inner_tags = self.transformer.__sklearn_tags__()
        if inner_tags.transformer_tags is not None:
            tags.transformer_tags.stateful = inner_tags.transformer_tags.stateful
    else:
        tags.transformer_tags.stateful = False
    tags.transformer_tags.invertible = False
    return tags
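
The tag logic above can be mirrored in a small standalone sketch. The classes `TransformerTags`, `Inner`, and `Wrapper` are hypothetical stand-ins, not the yohou or scikit-learn types; the point is only that the wrapper reads the tags of the unfitted parameter object, so the answer does not change after fit.

```python
from dataclasses import dataclass


@dataclass
class TransformerTags:
    """Hypothetical stand-in for the transformer tag container."""
    stateful: bool = False
    invertible: bool = True


class Inner:
    """Hypothetical inner transformer that declares its own statefulness."""

    def __init__(self, stateful):
        self._stateful = stateful

    def tags(self):
        return TransformerTags(stateful=self._stateful)


class Wrapper:
    """Copies `stateful` from the inner transformer; never invertible."""

    def __init__(self, transformer=None):
        self.transformer = transformer

    def tags(self):
        tags = TransformerTags()
        if self.transformer is not None:
            # Query the constructor parameter, not any fitted copy.
            tags.stateful = self.transformer.tags().stateful
        else:
            tags.stateful = False
        tags.invertible = False
        return tags


with_stateful = Wrapper(Inner(stateful=True)).tags()
without_inner = Wrapper().tags()
print(with_stateful, without_inner)
```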

fit(X, y=None, **params)

Fit the imputer, optionally projecting through a transformer first.

Parameters

X : DataFrame, required
    Input time series with a "time" column (datetime) and one or more numeric columns.

y : DataFrame or None, default=None
    Ignored. Present for API compatibility.

**params : dict, default={}
    Metadata to route to nested estimators.

Returns

self
    The fitted imputer instance.

Source Code
@_fit_context(prefer_skip_nested_validation=True)
def fit(self, X: pl.DataFrame, y: pl.DataFrame | None = None, **params) -> "TransformedSpaceKNNImputer":
    """Fit the imputer, optionally projecting through a transformer first.

    Parameters
    ----------
    X : pl.DataFrame
        Input time series with a ``"time"`` column (datetime) and one or
        more numeric columns.
    y : pl.DataFrame or None, default=None
        Ignored.  Present for API compatibility.
    **params : dict
        Metadata to route to nested estimators.

    Returns
    -------
    self
        The fitted imputer instance.

    """
    X = validate_transformer_data(self, X=X, reset=True)

    # Fit and apply the inner transformer (if any)
    if self.transformer is not None:
        self.transformer_ = deepcopy(self.transformer)
        self.transformer_.fit(X)
        X_projected = self.transformer_.transform(X)
        # Inherit observation_horizon from the inner transformer
        if hasattr(self.transformer_, "_observation_horizon"):
            self._observation_horizon = self.transformer_.observation_horizon
    else:
        self.transformer_ = None
        X_projected = X

    BaseTransformer.fit(self, X, y, **params)

    # Fit sklearn KNNImputer on the (optionally transformed) data
    X_no_time = X_projected.select(~cs.by_name("time"))
    self.imputer_ = sklearn_KNNImputer(
        n_neighbors=self.n_neighbors,
        weights=self.weights,
        metric=self.metric,
    )
    self.imputer_.fit(X_no_time.to_numpy())

    # Store output schema for transform
    self._output_schema = X_projected.schema

    return self
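
The clone-then-fit pattern used in fit can be sketched without any dependencies. `MeanCenter` and `Wrapper` below are hypothetical stand-ins that operate on plain lists; they only illustrate that the constructor parameter stays unfitted while the deep copy holds the learned state and defines the space the downstream step is fitted on.

```python
from copy import deepcopy


class MeanCenter:
    """Hypothetical transformer: subtracts the mean learned in fit."""

    def fit(self, X):
        self.mean_ = sum(X) / len(X)
        return self

    def transform(self, X):
        return [x - self.mean_ for x in X]


class Wrapper:
    """Fits a deep copy of the inner transformer, then works on its output."""

    def __init__(self, transformer=None):
        self.transformer = transformer

    def fit(self, X):
        if self.transformer is not None:
            # Copy first so the user's parameter object is never mutated.
            self.transformer_ = deepcopy(self.transformer)
            self.transformer_.fit(X)
            X_projected = self.transformer_.transform(X)
        else:
            self.transformer_ = None
            X_projected = X
        # A real imputer would be fitted here; this sketch just records
        # the data the downstream step would see.
        self.fitted_on_ = X_projected
        return self


inner = MeanCenter()
wrapper = Wrapper(transformer=inner).fit([1.0, 2.0, 3.0])
print(hasattr(inner, "mean_"), wrapper.transformer_.mean_, wrapper.fitted_on_)
```

Keeping the parameter object untouched is what allows the estimator to be cloned and refitted with the same settings.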

get_feature_names_out(input_features=None)

Get output feature names for transformation.

Parameters

input_features : list of str or None, default=None
    Column names of the input features. If None, uses the feature names seen during fit.

Returns

list of str
    Output feature names after transformation.

Source Code
def get_feature_names_out(self, input_features: list[str] | None = None) -> list[str]:
    """Get output feature names for transformation.

    Parameters
    ----------
    input_features : list of str or None, default=None
        Column names of the input features.  If ``None``, uses the
        feature names seen during ``fit``.

    Returns
    -------
    list of str
        Output feature names after transformation.

    """
    check_is_fitted(self, ["imputer_"])
    if self.transformer_ is not None:
        return self.transformer_.get_feature_names_out(input_features)
    input_features = _check_feature_names_in(self, input_features)
    return list(input_features)

Tutorials

The following example notebooks use this component:

  • How to Handle Missing Data


    Data-Features

    Compare SimpleTimeImputer, SeasonalImputer, SimpleImputer, and TransformedSpaceKNNImputer on synthetic block and scattered gaps in monthly tourism data.
