OutlierThresholdHandler

yohou.preprocessing.outlier.OutlierThresholdHandler

Bases: BaseTransformer

Handle outliers based on fixed threshold values.

Values outside the specified thresholds are either clipped to the threshold values or set to NaN. This is useful for removing known invalid readings or physical impossibilities from sensor data.

Parameters

| Name | Type | Description | Default |
|---|---|---|---|
| low | float or None | Lower threshold. Values below this are handled according to strategy. If None, no lower bound is applied. | None |
| high | float or None | Upper threshold. Values above this are handled according to strategy. If None, no upper bound is applied. | None |
| strategy | {"clip", "nan"} | How to handle outliers: "clip" replaces outliers with the threshold values; "nan" replaces outliers with NaN. | "clip" |
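The parameter semantics above can be illustrated with a minimal pure-Python sketch. The helper `handle_thresholds` is hypothetical and is not part of yohou; it only mirrors the documented behaviour on a plain list, assuming that values exactly equal to a threshold count as in range.

```python
import math


def handle_thresholds(values, low=None, high=None, strategy="clip"):
    """Hypothetical stand-in for the documented per-value behaviour."""
    if strategy not in {"clip", "nan"}:
        raise ValueError(f"unknown strategy: {strategy!r}")
    result = []
    for v in values:
        below = low is not None and v < low    # None disables the lower bound
        above = high is not None and v > high  # None disables the upper bound
        if not (below or above):
            result.append(v)                   # in range: left untouched
        elif strategy == "nan":
            result.append(math.nan)            # out of range: marked as NaN
        else:
            result.append(low if below else high)  # out of range: clipped
    return result


readings = [-100.0, 50.0, 100.0, 150.0, 999.0]
clipped = handle_thresholds(readings, low=0.0, high=200.0)
upper_only = handle_thresholds(readings, high=200.0)
masked = handle_thresholds(readings, low=0.0, high=200.0, strategy="nan")
```

With `high` alone, the reading of -100.0 passes through unchanged because no lower bound is applied.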

Attributes

| Name | Type | Description |
|---|---|---|
| low_ | float or None | Validated lower threshold. |
| high_ | float or None | Validated upper threshold. |

Examples

>>> import polars as pl
>>> from datetime import datetime
>>> from yohou.preprocessing import OutlierThresholdHandler
>>> X = pl.DataFrame({
...     "time": [datetime(2020, 1, i) for i in range(1, 6)],
...     "value": [-100.0, 50.0, 100.0, 150.0, 999.0],
... })
>>> # Clip values to [0, 200]
>>> handler = OutlierThresholdHandler(low=0.0, high=200.0, strategy="clip")
>>> handler.fit(X)
OutlierThresholdHandler(high=200.0, low=0.0)
>>> X_handled = handler.transform(X)
>>> X_handled["value"].to_list()
[0.0, 50.0, 100.0, 150.0, 200.0]
>>> # Set out-of-range values to NaN
>>> handler = OutlierThresholdHandler(low=0.0, high=200.0, strategy="nan")
>>> handler.fit(X)
OutlierThresholdHandler(...)
>>> X_handled = handler.transform(X)
>>> X_handled["value"].null_count()
2

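See Also

OutlierPercentileHandler (yohou.preprocessing.outlier.OutlierPercentileHandler): Handle outliers based on percentiles.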

Source Code

class OutlierThresholdHandler(BaseTransformer):
    """Handle outliers based on fixed threshold values.

    Values outside the specified thresholds are either clipped to the threshold
    values or set to NaN. This is useful for removing known invalid readings
    or physical impossibilities from sensor data.

    Parameters
    ----------
    low : float or None, default=None
        Lower threshold. Values below this are handled according to strategy.
        If None, no lower bound is applied.
    high : float or None, default=None
        Upper threshold. Values above this are handled according to strategy.
        If None, no upper bound is applied.
    strategy : {"clip", "nan"}, default="clip"
        How to handle outliers:
        - "clip": Replace outliers with threshold values
        - "nan": Replace outliers with NaN

    Attributes
    ----------
    low_ : float or None
        Validated lower threshold.
    high_ : float or None
        Validated upper threshold.

    Examples
    --------
    >>> import polars as pl
    >>> from datetime import datetime
    >>> from yohou.preprocessing import OutlierThresholdHandler

    >>> X = pl.DataFrame({
    ...     "time": [datetime(2020, 1, i) for i in range(1, 6)],
    ...     "value": [-100.0, 50.0, 100.0, 150.0, 999.0],
    ... })

    >>> # Clip values to [0, 200]
    >>> handler = OutlierThresholdHandler(low=0.0, high=200.0, strategy="clip")
    >>> handler.fit(X)
    OutlierThresholdHandler(high=200.0, low=0.0)
    >>> X_handled = handler.transform(X)
    >>> X_handled["value"].to_list()
    [0.0, 50.0, 100.0, 150.0, 200.0]

    >>> # Set out-of-range values to NaN
    >>> handler = OutlierThresholdHandler(low=0.0, high=200.0, strategy="nan")
    >>> handler.fit(X)  # doctest: +ELLIPSIS
    OutlierThresholdHandler(...)
    >>> X_handled = handler.transform(X)
    >>> X_handled["value"].null_count()
    2

    See Also
    --------
    - [`OutlierPercentileHandler`][yohou.preprocessing.outlier.OutlierPercentileHandler] : Handle outliers based on percentiles.

    """

    _valid_strategies = {"clip", "nan"}

    _parameter_constraints: dict = {
        "low": [numbers.Real, None],
        "high": [numbers.Real, None],
        "strategy": [StrOptions(_valid_strategies)],
    }

    _tags = {"stateful": False, "invertible": False}

    def __init__(
        self,
        low: float | None = None,
        high: float | None = None,
        strategy: str = "clip",
    ):
        self.low = low
        self.high = high
        self.strategy = strategy

    def _fit(self, X: pl.DataFrame, y: pl.DataFrame | None = None) -> None:
        """Store the thresholds and validate that low <= high."""
        self.low_ = self.low
        self.high_ = self.high

        # Validate threshold ordering
        if self.low_ is not None and self.high_ is not None and self.low_ > self.high_:
            msg = f"low ({self.low_}) must be <= high ({self.high_})"
            raise ValueError(msg)

    def _transform(self, X: pl.DataFrame) -> pl.DataFrame:
        """Handle outliers in time series.

        Parameters
        ----------
        X : pl.DataFrame
            Validated input time series.

        Returns
        -------
        pl.DataFrame
            Transformed time series.

        """
        return _apply_outlier_handling(
            X,
            self.strategy,
            lambda _col_name: (self.low_, self.high_),
        )

    def get_feature_names_out(self, input_features: list[str] | None = None) -> list[str]:
        """Get output feature names for transformation.

        Parameters
        ----------
        input_features : list of str or None, default=None
            Column names of the input features.  If ``None``, uses the
            feature names seen during ``fit``.

        Returns
        -------
        list of str
            Output feature names after transformation.

        """
        check_is_fitted(self, ["feature_names_in_"])
        input_features = _check_feature_names_in(self, input_features)
        return list(input_features)
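The source above does two things worth spelling out: the `low <= high` check runs in `_fit` rather than in `__init__`, and `_transform` passes a callable that returns the same `(low, high)` pair for every column name. The sketch below imitates both steps on a plain dict of lists. `validate_thresholds` and `clip_columns` are hypothetical names, and `clip_columns` is only an assumed analogue of `_apply_outlier_handling`, whose implementation is not shown here.

```python
def validate_thresholds(low, high):
    """Mirror the ordering check performed in ``_fit``."""
    if low is not None and high is not None and low > high:
        raise ValueError(f"low ({low}) must be <= high ({high})")
    return low, high


def clip_columns(columns, bounds_for):
    """Clip each column using the (low, high) pair that ``bounds_for`` returns."""
    out = {}
    for name, values in columns.items():
        low, high = bounds_for(name)  # looked up per column name
        clipped = []
        for v in values:
            if low is not None and v < low:
                v = low
            elif high is not None and v > high:
                v = high
            clipped.append(v)
        out[name] = clipped
    return out


low, high = validate_thresholds(0.0, 200.0)
data = {"temp": [-5.0, 20.0], "load": [150.0, 250.0]}
# The column name is ignored, so every column shares the same fixed bounds.
result = clip_columns(data, lambda _name: (low, high))

try:
    validate_thresholds(10.0, 1.0)
    error = None
except ValueError as exc:
    error = str(exc)
```

Equal thresholds pass the check, since only `low > high` is rejected. A handler with per-column bounds could reuse the same callable interface by returning a different pair for each name.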

Methods

get_feature_names_out(input_features=None)

Get output feature names for transformation.

Parameters

| Name | Type | Description | Default |
|---|---|---|---|
| input_features | list of str or None | Column names of the input features. If None, uses the feature names seen during fit. | None |

Returns

| Type | Description |
|---|---|
| list of str | Output feature names after transformation. |
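Because the handler neither adds nor renames columns, the output names are the input names. A minimal sketch of that pass-through, using a hypothetical stub class rather than the real transformer, and omitting the fitted-state and name validation that the real method delegates to `check_is_fitted` and `_check_feature_names_in`:

```python
class PassThroughNames:
    """Hypothetical stub that only remembers the names seen during fit."""

    def __init__(self, feature_names_in):
        self.feature_names_in_ = list(feature_names_in)

    def get_feature_names_out(self, input_features=None):
        # Fall back to the fitted names when none are supplied.
        if input_features is None:
            input_features = self.feature_names_in_
        return list(input_features)


stub = PassThroughNames(["value"])
default_names = stub.get_feature_names_out()
explicit_names = stub.get_feature_names_out(["value"])
```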


Tutorials

The following example notebooks use this component:

  • How to Clean Time Series Data (Data-Features): End-to-end data cleaning pipeline combining SimpleTimeImputer and SeasonalImputer for missing values with OutlierThresholdHandler for anomaly clipping.

  • How to Handle Outliers in a Forecasting Pipeline (Data-Features): Detect and clip outliers with OutlierThresholdHandler and OutlierPercentileHandler, then see how outliers affect conformal prediction intervals.