QuantileTransformer

yohou.preprocessing.sklearn_wrappers.QuantileTransformer

Bases: SklearnTransformer

Transform features using quantile information.

This method transforms the features to follow a uniform or a normal distribution. Therefore, for a given feature, this transformation tends to spread out the most frequent values. It also reduces the impact of (marginal) outliers: this is therefore a robust preprocessing scheme.

The transformation is applied on each feature independently. First an estimate of the cumulative distribution function of a feature is used to map the original values to a uniform distribution. The obtained values are then mapped to the desired output distribution using the associated quantile function.
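
The two-step mapping described above can be sketched with NumPy alone. This is a minimal illustration under stated assumptions, not the Yohou or scikit-learn implementation: the function name `quantile_transform_sketch` and the clipping constant `eps` are hypothetical, ties and subsampling are ignored, and the normal quantile function is taken from Python's `statistics.NormalDist`.

```python
from statistics import NormalDist

import numpy as np


def quantile_transform_sketch(x, n_quantiles=10, output_distribution="uniform"):
    """Map a 1-D feature to a uniform or normal distribution (illustrative only)."""
    x = np.asarray(x, dtype=float)
    # The number of landmarks cannot exceed the number of samples.
    n_quantiles = min(n_quantiles, x.size)
    # Landmarks: evenly spaced probability levels and the matching empirical quantiles.
    references = np.linspace(0.0, 1.0, n_quantiles)
    quantiles = np.quantile(x, references)
    # Step 1: estimate the CDF by linear interpolation, giving values in [0, 1].
    u = np.interp(x, quantiles, references)
    if output_distribution == "uniform":
        return u
    # Step 2: apply the quantile function (inverse CDF) of the standard normal.
    # Clip away from 0 and 1 so the inverse CDF stays finite (eps is an assumption).
    eps = 1e-7
    u = np.clip(u, eps, 1.0 - eps)
    inv_cdf = NormalDist().inv_cdf
    return np.array([inv_cdf(p) for p in u])


values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 100.0]
uniform = quantile_transform_sketch(values, output_distribution="uniform")
normal = quantile_transform_sketch(values, output_distribution="normal")
```

With these ten distinct values and ten landmarks, the uniform output is evenly spaced on [0, 1], so 100.0 ends up only one step above 9.0.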

This is a Yohou wrapper that preserves the polars DataFrame structure and "time" column.

Parameters

| Name | Type | Description | Default |
|---|---|---|---|
| n_quantiles | int | Number of quantiles to be computed. It corresponds to the number of landmarks used to discretize the cumulative distribution function. If n_quantiles is larger than the number of samples, n_quantiles is set to the number of samples. | 1000 or n_samples |
| output_distribution | {'uniform', 'normal'} | Marginal distribution for the transformed data. The choices are 'uniform' (default) or 'normal'. | 'uniform' |
| ignore_implicit_zeros | bool | Only applies to sparse matrices. If True, the sparse entries of the matrix are discarded to compute the quantile statistics. | False |
| subsample | int | Maximum number of samples used to estimate the quantiles for computational efficiency. | 10_000 |
| random_state | int, RandomState instance or None | Determines random number generation for subsampling and smoothing noise. | None |
| copy | bool | Forwarded to the underlying sklearn QuantileTransformer. Set to False to perform the transformation in place and avoid a copy (if the input is already a numpy array). | True |

Attributes

| Name | Type | Description |
|---|---|---|
| instance_ | QuantileTransformer | The fitted sklearn QuantileTransformer instance. |
| n_quantiles_ | int | The actual number of quantiles used to discretize the cumulative distribution function. |
| quantiles_ | ndarray of shape (n_quantiles, n_features) | The values corresponding to the quantiles of reference. |
| references_ | ndarray of shape (n_quantiles,) | Quantile levels in [0, 1] at which quantiles_ is evaluated. |
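
To make the shapes of the fitted attributes concrete, the following NumPy-only sketch builds arrays that play the roles of `references_` and `quantiles_`. It assumes, as in scikit-learn, that the reference levels are evenly spaced on [0, 1] and that the quantiles are computed per column; the second feature column and the variable names are hypothetical, and no Yohou or scikit-learn code is called.

```python
import numpy as np

# Two features with ten samples each (the second column is a made-up extra feature).
X = np.column_stack([
    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 100.0],
    np.arange(10.0, 110.0, 10.0),
])
n_quantiles = 5

# Probability levels shared by all features: shape (n_quantiles,).
references = np.linspace(0.0, 1.0, n_quantiles)
# Empirical quantiles of each column at those levels: shape (n_quantiles, n_features).
quantiles = np.percentile(X, references * 100, axis=0)

print(references.shape)  # (5,)
print(quantiles.shape)   # (5, 2)
```

Each column of `quantiles` is non-decreasing, and its first and last rows are the column minimum and maximum.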

Examples

>>> import polars as pl
>>> from datetime import datetime
>>> from yohou.preprocessing import QuantileTransformer
>>> X = pl.DataFrame({
...     "time": [datetime(2024, 1, i) for i in range(1, 11)],
...     "value": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 100.0],  # 100 is outlier
... })
>>> qt = QuantileTransformer(n_quantiles=10, output_distribution="uniform")
>>> qt.fit(X)
QuantileTransformer(...)
>>> X_transformed = qt.transform(X)
>>> # Outlier impact is reduced
>>> "time" in X_transformed.columns
True
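
The comment "Outlier impact is reduced" can be checked numerically without either library. The sketch below compares min-max scaling with a rank-based uniform mapping, which is what a quantile transform to the uniform distribution amounts to on its own training data when every value is distinct and `n_quantiles` equals the number of samples. That equivalence is the stated assumption here, and the variable names are illustrative.

```python
import numpy as np

values = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 100.0])

# Min-max scaling: the outlier stretches the range and squeezes 1..9 near zero.
min_max = (values - values.min()) / (values.max() - values.min())

# Rank-based uniform mapping: each point lands at rank / (n - 1).
ranks = np.argsort(np.argsort(values))
quantile_uniform = ranks / (values.size - 1)

print(min_max[-2])           # about 0.081: 9.0 is pushed far from the outlier
print(quantile_uniform[-2])  # about 0.889: 9.0 sits one step below the outlier
```

Under min-max scaling the nine regular values occupy less than a tenth of the output range, whereas the rank-based mapping spaces all ten values evenly.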

See Also

PowerTransformer: Apply a power transform to make data more Gaussian-like.

Source Code

class QuantileTransformer(SklearnTransformer):
    """Transform features using quantile information.

    This method transforms the features to follow a uniform or a normal
    distribution. Therefore, for a given feature, this transformation tends
    to spread out the most frequent values. It also reduces the impact of
    (marginal) outliers: this is therefore a robust preprocessing scheme.

    The transformation is applied on each feature independently. First an
    estimate of the cumulative distribution function of a feature is used to
    map the original values to a uniform distribution. The obtained values are
    then mapped to the desired output distribution using the associated
    quantile function.

    This is a Yohou wrapper that preserves the polars DataFrame structure and
    "time" column.

    Parameters
    ----------
    n_quantiles : int, default=1000 or n_samples
        Number of quantiles to be computed. It corresponds to the number of
        landmarks used to discretize the cumulative distribution function.
        If n_quantiles is larger than the number of samples, n_quantiles is set
        to the number of samples.

    output_distribution : {'uniform', 'normal'}, default='uniform'
        Marginal distribution for the transformed data. The choices are
        'uniform' (default) or 'normal'.

    ignore_implicit_zeros : bool, default=False
        Only applies to sparse matrices. If True, the sparse entries of the
        matrix are discarded to compute the quantile statistics.

    subsample : int, default=10_000
        Maximum number of samples used to estimate the quantiles for
        computational efficiency.

    random_state : int, RandomState instance or None, default=None
        Determines random number generation for subsampling and smoothing
        noise.

    copy : bool, default=True
        Forwarded to the underlying sklearn QuantileTransformer. Set to False
        to perform the transformation in place and avoid a copy (if the input
        is already a numpy array).

    Attributes
    ----------
    instance_ : sklearn.preprocessing.QuantileTransformer
        The fitted sklearn QuantileTransformer instance.

    n_quantiles_ : int
        The actual number of quantiles used to discretize the cumulative
        distribution function.

    quantiles_ : ndarray of shape (n_quantiles, n_features)
        The values corresponding to the quantiles of reference.

    references_ : ndarray of shape (n_quantiles,)
        Quantile levels in [0, 1] at which quantiles_ is evaluated.

    Examples
    --------
    >>> import polars as pl
    >>> from datetime import datetime
    >>> from yohou.preprocessing import QuantileTransformer
    >>> X = pl.DataFrame({
    ...     "time": [datetime(2024, 1, i) for i in range(1, 11)],
    ...     "value": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 100.0],  # 100 is outlier
    ... })
    >>> qt = QuantileTransformer(n_quantiles=10, output_distribution="uniform")
    >>> qt.fit(X)  # doctest: +ELLIPSIS
    QuantileTransformer(...)
    >>> X_transformed = qt.transform(X)
    >>> # Outlier impact is reduced
    >>> "time" in X_transformed.columns
    True

    See Also
    --------
    - [`PowerTransformer`][yohou.preprocessing.sklearn_wrappers.PowerTransformer] : Apply a power transform to make data more Gaussian-like.

    """

    _estimator_default_class = sklearn_QuantileTransformer

    def __init__(
        self,
        n_quantiles=1000,
        output_distribution="uniform",
        ignore_implicit_zeros=False,
        subsample=10_000,
        random_state=None,
        copy=True,
        **kwargs,
    ):
        super().__init__(
            n_quantiles=n_quantiles,
            output_distribution=output_distribution,
            ignore_implicit_zeros=ignore_implicit_zeros,
            subsample=subsample,
            random_state=random_state,
            copy=copy,
            **kwargs,
        )

    @property
    def n_quantiles_(self) -> int:
        """The actual number of quantiles used."""
        check_is_fitted(self, ["instance_"])
        return self.instance_.n_quantiles_

    @property
    def quantiles_(self) -> np.ndarray:
        """The values corresponding to the quantiles of reference."""
        check_is_fitted(self, ["instance_"])
        return self.instance_.quantiles_

    @property
    def references_(self) -> np.ndarray:
        """Quantile levels in [0, 1] at which quantiles_ is evaluated."""
        check_is_fitted(self, ["instance_"])
        return self.instance_.references_

Methods

n_quantiles_ property

The actual number of quantiles used.

quantiles_ property

The values corresponding to the quantiles of reference.

references_ property

Quantile levels in [0, 1] at which quantiles_ is evaluated.