
RobustScaler

yohou.preprocessing.sklearn_wrappers.RobustScaler

Bases: SklearnScaler

Scale features using statistics that are robust to outliers.

This scaler removes the median and scales the data according to the quantile range (by default the IQR, or interquartile range). The IQR is the range between the 1st quartile (25th percentile) and the 3rd quartile (75th percentile).

Centering and scaling happen independently on each feature by computing the relevant statistics on the samples in the training set. The median and interquartile range are then stored and applied to later data by the transform() method.
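The fit-then-transform behaviour described above can be sketched in plain Python. This is an illustration only, not the Yohou or scikit-learn implementation: the helper names `fit_robust_stats` and `apply_robust_scaling` are hypothetical, and the quartiles come from the standard library's inclusive method, which for this data matches linear interpolation between sorted values.

```python
from statistics import median, quantiles


def fit_robust_stats(train):
    """Return (center, scale): the median and the interquartile range."""
    q1, _, q3 = quantiles(train, n=4, method="inclusive")
    return median(train), q3 - q1


def apply_robust_scaling(values, center, scale):
    """Subtract the stored median and divide by the stored IQR."""
    return [(v - center) / scale for v in values]


# Statistics are computed once, on the training data only.
train = [10.0, 20.0, 30.0, 100.0, 50.0]
center, scale = fit_robust_stats(train)
print(center, scale)  # 30.0 30.0

# The same stored statistics are reused for the training data and later data.
scaled_train = apply_robust_scaling(train, center, scale)
later = [30.0, 60.0]
scaled_later = apply_robust_scaling(later, center, scale)
print(scaled_later)  # [0.0, 1.0]
```

Note that the later data is scaled with the training median and IQR rather than its own, which is what keeps transformed values comparable across datasets.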

Standardization of a dataset is a common preprocessing step for many machine learning estimators. Typically this is done by removing the mean and scaling to unit variance. However, outliers can often distort the sample mean and variance. In such cases, using the median and the interquartile range often gives better results.
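To make the effect of outliers concrete, the following sketch compares the mean and standard deviation with the median and IQR on two small samples that differ in a single value. The data and the `iqr` helper are illustrative assumptions, and only the Python standard library is used.

```python
from statistics import mean, median, pstdev, quantiles


def iqr(values):
    """Interquartile range using the inclusive quartile method."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return q3 - q1


clean = [10.0, 20.0, 30.0, 40.0, 50.0]
with_outlier = [10.0, 20.0, 30.0, 100.0, 50.0]  # 40 replaced by 100

for name, data in [("clean", clean), ("with outlier", with_outlier)]:
    print(
        f"{name:>12}: mean={mean(data):.1f} std={pstdev(data):.1f} "
        f"median={median(data):.1f} iqr={iqr(data):.1f}"
    )
```

In this sample the single outlier moves the mean from 30 to 42 and more than doubles the standard deviation, while the median stays at 30 and the IQR grows only from 20 to 30.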

This is a Yohou wrapper that preserves the polars DataFrame structure and "time" column.

Parameters

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| with_centering | bool | If True, center the data before scaling. | True |
| with_scaling | bool | If True, scale the data to interquartile range. | True |
| quantile_range | tuple (q_min, q_max), 0.0 < q_min < q_max < 100.0 | Quantile range used to calculate scale_. By default this is equal to the IQR, i.e., q_min is the first quartile and q_max is the third quartile. | (25.0, 75.0) |
| unit_variance | bool | If True, scale data so that normally distributed features have a variance of 1. | False |
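The unit_variance option can be illustrated with a short sketch. It assumes the adjustment works as in scikit-learn, where the quantile range is divided by the width of the same quantile range of a standard normal distribution; this is not taken from the Yohou source, and the helper name `unit_variance_adjustment` is hypothetical.

```python
from statistics import NormalDist


def unit_variance_adjustment(q_min=25.0, q_max=75.0):
    """Width of the (q_min, q_max) quantile range of a standard normal."""
    std_normal = NormalDist()
    return std_normal.inv_cdf(q_max / 100.0) - std_normal.inv_cdf(q_min / 100.0)


# For the default (25.0, 75.0) range this is the IQR of a standard normal,
# roughly 1.349.
adjust = unit_variance_adjustment()
print(round(adjust, 3))

# Dividing an observed IQR by this factor estimates the standard deviation
# the feature would have if it were normally distributed.
iqr = 30.0
adjusted_scale = iqr / adjust
print(round(adjusted_scale, 3))
```

Under this assumption, a normally distributed feature divided by the adjusted scale ends up with a variance close to 1, which is what the parameter description promises.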

Attributes

| Name | Type | Description |
| --- | --- | --- |
| instance_ | RobustScaler | The fitted sklearn RobustScaler instance. |
| center_ | array of floats | The median value for each feature in the training set. |
| scale_ | array of floats | The (scaled) interquartile range for each feature in the training set. |

Examples

>>> import polars as pl
>>> from datetime import datetime
>>> from yohou.preprocessing import RobustScaler
>>> X = pl.DataFrame({
...     "time": [datetime(2024, 1, i) for i in range(1, 6)],
...     "value": [10.0, 20.0, 30.0, 100.0, 50.0],  # 100 is an outlier
... })
>>> scaler = RobustScaler()
>>> scaler.fit(X)
RobustScaler(...)
>>> X_scaled = scaler.transform(X)
>>> # Median-centered and scaled by IQR
>>> "time" in X_scaled.columns
True

See Also

  • StandardScaler : Scale using mean and standard deviation (sensitive to outliers).

Source Code

class RobustScaler(SklearnScaler):
    """Scale features using statistics that are robust to outliers.

    This scaler removes the median and scales the data according to the
    quantile range (defaults to IQR: Interquartile Range). The IQR is the range
    between the 1st quartile (25th percentile) and the 3rd quartile (75th
    percentile).

    Centering and scaling happen independently on each feature by computing the
    relevant statistics on the samples in the training set. Median and
    interquartile range are then stored to be used on later data using the
    ``transform()`` method.

    Standardization of a dataset is a common preprocessing step for many
    machine learning estimators. Typically this is done by removing the mean
    and scaling to unit variance. However, outliers can often distort the
    sample mean and variance. In such cases, using the median and the
    interquartile range often gives better results.

    This is a Yohou wrapper that preserves the polars DataFrame structure and
    "time" column.

    Parameters
    ----------
    with_centering : bool, default=True
        If True, center the data before scaling.

    with_scaling : bool, default=True
        If True, scale the data to interquartile range.

    quantile_range : tuple (q_min, q_max), 0.0 < q_min < q_max < 100.0, default=(25.0, 75.0)
        Quantile range used to calculate ``scale_``. By default this is equal
        to the IQR, i.e., ``q_min`` is the first quartile and ``q_max`` is the
        third quartile.

    unit_variance : bool, default=False
        If True, scale data so that normally distributed features have a
        variance of 1.

    Attributes
    ----------
    instance_ : sklearn.preprocessing.RobustScaler
        The fitted sklearn RobustScaler instance.

    center_ : array of floats
        The median value for each feature in the training set.

    scale_ : array of floats
        The (scaled) interquartile range for each feature in the training set.

    Examples
    --------
    >>> import polars as pl
    >>> from datetime import datetime
    >>> from yohou.preprocessing import RobustScaler
    >>> X = pl.DataFrame({
    ...     "time": [datetime(2024, 1, i) for i in range(1, 6)],
    ...     "value": [10.0, 20.0, 30.0, 100.0, 50.0],  # 100 is an outlier
    ... })
    >>> scaler = RobustScaler()
    >>> scaler.fit(X)  # doctest: +ELLIPSIS
    RobustScaler(...)
    >>> X_scaled = scaler.transform(X)
    >>> # Median-centered and scaled by IQR
    >>> "time" in X_scaled.columns
    True

    See Also
    --------
    - [`StandardScaler`][yohou.preprocessing.sklearn_wrappers.StandardScaler] : Scale using mean and standard deviation (sensitive to outliers).

    """

    _estimator_default_class = sklearn_RobustScaler

    def __init__(
        self,
        with_centering=True,
        with_scaling=True,
        quantile_range=(25.0, 75.0),
        copy=True,
        unit_variance=False,
        **kwargs,
    ):
        super().__init__(
            with_centering=with_centering,
            with_scaling=with_scaling,
            quantile_range=quantile_range,
            copy=copy,
            unit_variance=unit_variance,
            **kwargs,
        )

    @property
    def center_(self) -> np.ndarray:
        """The median value for each feature in the training set."""
        check_is_fitted(self, ["instance_"])
        return self.instance_.center_

    @property
    def scale_(self) -> np.ndarray:
        """The (scaled) interquartile range for each feature."""
        check_is_fitted(self, ["instance_"])
        return self.instance_.scale_

Methods

center_ property

The median value for each feature in the training set.

scale_ property

The (scaled) interquartile range for each feature.

Tutorials

The following example notebooks use this component:

  • How to Use Scikit-learn Scalers


    Data-Features

    Wrap sklearn scalers (StandardScaler, MinMaxScaler, RobustScaler, PowerTransformer, PolynomialFeatures) for polars DataFrames with inverse transforms.
