
fetch_air_quality_classification

yohou.datasets._fetchers.fetch_air_quality_classification(*, data_home=None, download_if_missing=True, n_retries=3, delay=1.0)

Fetch a categorical air quality dataset derived from KDD Cup 2018.

Downloads the KDD Cup 2018 dataset (single station) and bins its PM2.5 values into four WHO air quality categories using guideline thresholds of 15, 35, and 75 µg/m³: good (below 15), moderate (15 to below 35), unhealthy (35 to below 75), and hazardous (75 and above). The remaining five pollutant measurements (PM10, NO2, CO, O3, SO2) become exogenous features. Rows with null PM2.5 values are dropped.
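
The binning rule can be reproduced with the standard library alone. The following is a minimal sketch, assuming the same boundary handling as the strict `<` comparisons in the source listing further down (a reading exactly on a threshold falls into the higher category). `THRESHOLDS`, `LABELS`, and `categorize_pm25` are illustrative names, not part of yohou, which implements the rule as a Polars expression.

```python
from bisect import bisect_right

# Thresholds (ug/m3) and labels taken from the description above.
THRESHOLDS = (15, 35, 75)
LABELS = ("good", "moderate", "unhealthy", "hazardous")


def categorize_pm25(value: float) -> str:
    """Map one PM2.5 reading to its category label (illustrative helper)."""
    # bisect_right counts the thresholds that are <= value, so a reading
    # exactly on a threshold lands in the higher category.
    return LABELS[bisect_right(THRESHOLDS, value)]


readings = [8.0, 15.0, 34.9, 75.0, 120.5]
labels = [categorize_pm25(v) for v in readings]
```

Using `bisect_right` rather than `bisect_left` is what keeps the boundary behaviour aligned with a chain of strict less-than tests.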

Parameters

data_home : str, PathLike, or None, default=None
    Specify another download and cache folder for the datasets. By default all yohou data is stored in ~/yohou_data/.
download_if_missing : bool, default=True
    If False, raise an OSError if the data is not locally available instead of trying to download it.
n_retries : int, default=3
    Number of retries when HTTP errors are encountered.
delay : float, default=1.0
    Number of seconds between retries.
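
For offline or CI use, `download_if_missing=False` turns a missing local copy into an `OSError` instead of a network request. The sketch below wraps that behaviour in a cache-only loader. `load_cached_or_none` is a hypothetical helper, not a yohou function, and `_fake_fetcher` is a stand-in used only so the example runs without yohou or network access.

```python
from typing import Any, Callable, Optional


def load_cached_or_none(fetcher: Callable[..., Any], **kwargs: Any) -> Optional[Any]:
    """Return the fetcher's result from the local cache, or None if absent.

    Hypothetical helper: it relies only on the documented behaviour that
    ``download_if_missing=False`` raises ``OSError`` when data is not local.
    """
    try:
        return fetcher(download_if_missing=False, **kwargs)
    except OSError:
        return None


# Stand-in fetcher (an assumption for illustration) that mimics the
# documented keyword-only signature and the "not cached" failure mode.
def _fake_fetcher(*, data_home=None, download_if_missing=True):
    if not download_if_missing:
        raise OSError("data not found locally")
    return {"data_home": data_home}


result = load_cached_or_none(_fake_fetcher, data_home="/tmp/yohou_cache")
```

With yohou installed, the same helper could presumably be called as `load_cached_or_none(fetch_air_quality_classification, data_home=...)`.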

Returns

Type Description
Bunch

Dictionary-like object with the following attributes:

y : pl.DataFrame
    DataFrame with "time" (Datetime) and "air_quality" (Utf8) columns. The "air_quality" column contains one of "good", "moderate", "unhealthy", or "hazardous".
X_actual : pl.DataFrame
    DataFrame with "time" and 5 pollutant feature columns ("pm10", "no2", "co", "o3", "so2").
feature_names : list of str
    Feature column names (excludes "time").
target_names : list of str
    ["air_quality"].
classes : list of str
    ["good", "hazardous", "moderate", "unhealthy"] (sorted).
DESCR : str
    Human-readable dataset description.
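
Because `classes` is sorted alphabetically, its order does not follow pollution severity ("hazardous" precedes "moderate"). Where an ordinal encoding is wanted, one option is an explicit mapping. This is a minimal sketch: `SEVERITY_ORDER`, `LABEL_TO_RANK`, and `encode_ordinal` are illustrative names, not part of yohou, and `sample` stands in for `data.y["air_quality"].to_list()`.

```python
# Labels ordered from least to most severe (an explicit, assumed ordering).
SEVERITY_ORDER = ["good", "moderate", "unhealthy", "hazardous"]
LABEL_TO_RANK = {label: rank for rank, label in enumerate(SEVERITY_ORDER)}


def encode_ordinal(labels):
    """Convert category labels to integer severity ranks (illustrative)."""
    return [LABEL_TO_RANK[label] for label in labels]


# Alphabetical order, as documented for the `classes` attribute.
classes = sorted(SEVERITY_ORDER)

# Stand-in for the label column of the returned `y` frame.
sample = ["moderate", "good", "hazardous", "unhealthy"]
ranks = encode_ordinal(sample)
```

An unknown label raises `KeyError` here, which surfaces unexpected categories early instead of silently encoding them.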

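See Also

fetch_kdd_cup : Full KDD Cup 2018 air quality dataset.
fetch_demand_classification : Categorical electricity demand dataset.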

Examples

>>> from yohou.datasets import fetch_air_quality_classification
>>> data = fetch_air_quality_classification()
>>> data.y.columns
['time', 'air_quality']
>>> sorted(data.classes)
['good', 'hazardous', 'moderate', 'unhealthy']

Source Code

def fetch_air_quality_classification(
    *,
    data_home: str | os.PathLike | None = None,
    download_if_missing: bool = True,
    n_retries: int = 3,
    delay: float = 1.0,
) -> Bunch:
    """Fetch a categorical air quality dataset derived from KDD Cup 2018.

    Downloads the KDD Cup 2018 dataset (single station) and bins its
    PM2.5 values into four WHO air quality categories using guideline
    thresholds of 15, 35, and 75 ug/m3. The remaining five pollutant
    measurements (PM10, NO2, CO, O3, SO2) become exogenous features.
    Rows with null PM2.5 values are dropped.

    Parameters
    ----------
    data_home : str, PathLike, or None
        Specify another download and cache folder for the datasets.
        By default all yohou data is stored in ``~/yohou_data/``.
    download_if_missing : bool, default=True
        If ``False``, raise an ``OSError`` if the data is not locally
        available instead of trying to download it.
    n_retries : int, default=3
        Number of retries when HTTP errors are encountered.
    delay : float, default=1.0
        Number of seconds between retries.

    Returns
    -------
    Bunch
        Dictionary-like object with the following attributes:

        y : pl.DataFrame
            DataFrame with ``"time"`` (Datetime) and ``"air_quality"``
            (Utf8) columns. The ``"air_quality"`` column contains one
            of ``"good"``, ``"moderate"``, ``"unhealthy"``, or
            ``"hazardous"``.
        X_actual : pl.DataFrame
            DataFrame with ``"time"`` and 5 pollutant feature columns
            (``"pm10"``, ``"no2"``, ``"co"``, ``"o3"``, ``"so2"``).
        feature_names : list of str
            Feature column names (excludes ``"time"``).
        target_names : list of str
            ``["air_quality"]``.
        classes : list of str
            ``["good", "hazardous", "moderate", "unhealthy"]`` (sorted).
        DESCR : str
            Human-readable dataset description.

    See Also
    --------
    - [`fetch_kdd_cup`][yohou.datasets._fetchers.fetch_kdd_cup] : Full KDD Cup 2018 air quality dataset.
    - [`fetch_demand_classification`][yohou.datasets._fetchers.fetch_demand_classification] : Categorical electricity demand dataset.

    Examples
    --------
    >>> from yohou.datasets import fetch_air_quality_classification
    >>> data = fetch_air_quality_classification()  # doctest: +SKIP
    >>> data.y.columns  # doctest: +SKIP
    ['time', 'air_quality']
    >>> sorted(data.classes)  # doctest: +SKIP
    ['good', 'hazardous', 'moderate', 'unhealthy']

    """
    if _is_wasm():
        return _fetch_classification_wasm("air_quality_classification")

    bunch = fetch_kdd_cup(
        n_groups=1,
        data_home=data_home,
        download_if_missing=download_if_missing,
        n_retries=n_retries,
        delay=delay,
    )
    frame = bunch.frame

    # Identify station prefix from first non-time column
    non_time = [c for c in frame.columns if c != "time"]
    station_prefix = non_time[0].split("__")[0]

    pm25_col = f"{station_prefix}__pm2.5"
    feature_measurements = ["pm10", "no2", "co", "o3", "so2"]
    feature_cols = [f"{station_prefix}__{m}" for m in feature_measurements]

    # Drop rows with null PM2.5
    frame = frame.drop_nulls(subset=[pm25_col])

    # Bin PM2.5 into WHO categories
    air_quality = (
        pl
        .when(pl.col(pm25_col) < _WHO_PM25_THRESHOLDS[0])
        .then(pl.lit("good"))
        .when(pl.col(pm25_col) < _WHO_PM25_THRESHOLDS[1])
        .then(pl.lit("moderate"))
        .when(pl.col(pm25_col) < _WHO_PM25_THRESHOLDS[2])
        .then(pl.lit("unhealthy"))
        .otherwise(pl.lit("hazardous"))
    )

    y = frame.select("time", air_quality.alias("air_quality"))
    X_actual = frame.select("time", *feature_cols).rename({c: c.split("__")[1] for c in feature_cols})

    classes = sorted(set(y["air_quality"].to_list()))

    return Bunch(
        y=y,
        X_actual=X_actual,
        feature_names=[c for c in X_actual.columns if c != "time"],
        target_names=["air_quality"],
        classes=classes,
        DESCR=(
            "Air quality classification dataset derived from KDD Cup 2018. "
            "PM2.5 values are binned into four WHO air quality categories: "
            "good (<15 ug/m3), moderate (15-35), unhealthy (35-75), "
            "hazardous (>=75). Features are the remaining five pollutant "
            "measurements (PM10, NO2, CO, O3, SO2) from the same station."
        ),
    )