fetch_kdd_cup

yohou.datasets._fetchers.fetch_kdd_cup(*, n_groups=5, data_home=None, download_if_missing=True, n_retries=3, delay=1.0)

Fetch the KDD Cup 2018 air quality dataset from Monash/Zenodo.

Hourly time series of air quality measurements (PM2.5, PM10, NO2, CO, O3, SO2) from 59 monitoring stations in Beijing and London. This is a multivariate panel dataset: each station (panel group) contains multiple measurement columns.

Column names use yohou's __ separator convention with the station as group prefix and the measurement as member suffix, e.g. "beijing_dongsi_aq__pm2.5".
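
The naming convention above can be unpacked with plain string handling. The following is a minimal sketch, not part of yohou: `split_column` is a hypothetical helper, and the column list is hand-written from names that appear on this page rather than loaded from the dataset. It assumes the first `__` in a name marks the boundary between station and measurement.

```python
from collections import defaultdict


def split_column(name: str) -> tuple[str, str]:
    """Split '<station>__<measurement>' at the first '__' separator."""
    station, sep, measurement = name.partition("__")
    if not sep:
        raise ValueError(f"{name!r} does not contain the '__' separator")
    return station, measurement


# Hand-written sample of column names (illustrative, not fetched data).
columns = [
    "beijing_aotizhongxin_aq__pm2.5",
    "beijing_aotizhongxin_aq__pm10",
    "beijing_dongsi_aq__pm2.5",
]

# Collect the measurements available for each station (panel group).
by_station: dict[str, list[str]] = defaultdict(list)
for column in columns:
    station, measurement = split_column(column)
    by_station[station].append(measurement)

print(dict(by_station))
# {'beijing_aotizhongxin_aq': ['pm2.5', 'pm10'], 'beijing_dongsi_aq': ['pm2.5']}
```

The same grouping could be applied to `bunch.feature_names`, since that list excludes the "time" column.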

Parameters

n_groups : int or None, default=5
    Maximum number of station groups to include. Each station has up to 6 measurement series (PM2.5, PM10, NO2, CO, O3, SO2), so n_groups=5 loads at most 30 raw series. None loads all 59 stations (270 series).
data_home : str, PathLike, or None, default=None
    Specify another download and cache folder for the datasets. By default all yohou data is stored in ~/yohou_data/.
download_if_missing : bool, default=True
    If False, raise an OSError if the data is not locally available instead of trying to download it.
n_retries : int, default=3
    Number of retries when HTTP errors are encountered.
delay : float, default=1.0
    Number of seconds between retries.
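
How `n_retries` and `delay` interact can be sketched with a generic retry loop. This is an illustration under stated assumptions, not yohou's implementation: `call_with_retries` is a hypothetical helper, and it assumes that "retries" means extra attempts after the first one, so `n_retries=3` allows up to four attempts. It catches `OSError` because urllib's `URLError` and `HTTPError` derive from it.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(
    action: Callable[[], T], *, n_retries: int = 3, delay: float = 1.0
) -> T:
    """Run ``action``; on OSError, retry up to ``n_retries`` more times."""
    attempt = 0
    while True:
        try:
            return action()
        except OSError:
            if attempt >= n_retries:
                raise  # retries exhausted: surface the last error
            attempt += 1
            time.sleep(delay)  # wait before the next attempt


# Simulated download that fails twice, then succeeds (no network involved).
calls = {"count": 0}


def flaky() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise OSError("simulated HTTP failure")
    return "ok"


result = call_with_retries(flaky, n_retries=3, delay=0.0)
print(result, calls["count"])  # ok 3
```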

Returns

Type Description
Bunch

Dictionary-like object with the following attributes:

frame : pl.DataFrame
    DataFrame with "time" (Datetime) and up to 270 series columns using the __ separator convention (e.g. "beijing_dongsi_aq__pm2.5").
feature_names : list of str
    Non-time column names.
DESCR : str
    Full description of the dataset.
frequency : str
    "1h".
n_series : int
    Number of series actually loaded.
filename : str
    Path to the cached parquet file.

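See Also

fetch_electricity_demand : Half-hourly electricity demand series.
fetch_pedestrian_counts : Hourly pedestrian sensor series.
get_data_home : Return the path of the data directory.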

References

[1] Godahewa, R., Bergmeir, C., Webb, G. I., Hyndman, R. J., & Montero-Manso, P. (2021). "Monash Time Series Forecasting Archive." Neural Information Processing Systems Track on Datasets and Benchmarks. https://doi.org/10.5281/zenodo.4656756

Examples

>>> from yohou.datasets import fetch_kdd_cup
>>> bunch = fetch_kdd_cup()
>>> bunch.frame.columns[:3]
['time', 'beijing_aotizhongxin_aq__pm2.5', 'beijing_aotizhongxin_aq__pm10']

Source Code

def fetch_kdd_cup(
    *,
    n_groups: int | None = 5,
    data_home: str | os.PathLike | None = None,
    download_if_missing: bool = True,
    n_retries: int = 3,
    delay: float = 1.0,
) -> Bunch:
    """Fetch the KDD Cup 2018 air quality dataset from Monash/Zenodo.

    Hourly time series of air quality measurements (PM2.5, PM10, NO2,
    CO, O3, SO2) from 59 monitoring stations in Beijing and London.
    This is a multivariate panel dataset: each station (panel group)
    contains multiple measurement columns.

    Column names use yohou's ``__`` separator convention with the
    station as group prefix and the measurement as member suffix,
    e.g. ``"beijing_dongsi_aq__pm2.5"``.

    Parameters
    ----------
    n_groups : int or None, default=5
        Maximum number of station groups to include. Each station has
        up to 6 measurement series (PM2.5, PM10, NO2, CO, O3, SO2), so
        ``n_groups=5`` loads at most 30 raw series. ``None`` loads all
        59 stations (270 series).
    data_home : str, PathLike, or None, default=None
        Specify another download and cache folder for the datasets.
        By default all yohou data is stored in ``~/yohou_data/``.
    download_if_missing : bool, default=True
        If ``False``, raise an ``OSError`` if the data is not locally
        available instead of trying to download it.
    n_retries : int, default=3
        Number of retries when HTTP errors are encountered.
    delay : float, default=1.0
        Number of seconds between retries.

    Returns
    -------
    Bunch
        Dictionary-like object with the following attributes:

        frame : pl.DataFrame
            DataFrame with ``"time"`` (Datetime) and up to 270 series
            columns using the ``__`` separator convention
            (e.g. ``"beijing_dongsi_aq__pm2.5"``).
        feature_names : list of str
            Non-time column names.
        DESCR : str
            Full description of the dataset.
        frequency : str
            ``"1h"``.
        n_series : int
            Number of series actually loaded.
        filename : str
            Path to the cached parquet file.

    See Also
    --------
    - [`fetch_electricity_demand`][yohou.datasets._fetchers.fetch_electricity_demand] : Half-hourly electricity demand series.
    - [`fetch_pedestrian_counts`][yohou.datasets._fetchers.fetch_pedestrian_counts] : Hourly pedestrian sensor series.
    - [`get_data_home`][yohou.datasets._fetchers.get_data_home] : Return the path of the data directory.

    References
    ----------
    [1] Godahewa, R., Bergmeir, C., Webb, G. I., Hyndman, R. J., &
        Montero-Manso, P. (2021). "Monash Time Series Forecasting Archive."
        Neural Information Processing Systems Track on Datasets and
        Benchmarks. https://doi.org/10.5281/zenodo.4656756

    Examples
    --------
    >>> from yohou.datasets import fetch_kdd_cup
    >>> bunch = fetch_kdd_cup()  # doctest: +SKIP
    >>> bunch.frame.columns[:3]  # doctest: +SKIP
    ['time', 'beijing_aotizhongxin_aq__pm2.5', 'beijing_aotizhongxin_aq__pm10']

    """
    _n_measurements = len(_KDD_CUP_MEASUREMENTS)
    n_series = n_groups * _n_measurements if n_groups is not None else None
    bunch = _fetch_dataset(
        metadata=KDD_CUP_2018,
        dataset_name="kdd_cup_2018",
        value_column_name="value",
        n_series=n_series,
        data_home=data_home,
        download_if_missing=download_if_missing,
        n_retries=n_retries,
        delay=delay,
    )
    bunch.frame = _restructure_kdd_cup_columns(bunch.frame)
    bunch.feature_names = [c for c in bunch.frame.columns if c != "time"]
    return bunch
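
The series cap computed in the first two lines of the function body can be checked in isolation. This is a minimal sketch: `series_cap` is a hypothetical name, and `N_MEASUREMENTS = 6` assumes that `_KDD_CUP_MEASUREMENTS` holds the six measurements listed in the docstring.

```python
from __future__ import annotations

# Assumption: one entry per measurement (PM2.5, PM10, NO2, CO, O3, SO2).
N_MEASUREMENTS = 6


def series_cap(n_groups: int | None) -> int | None:
    """Mirror the n_series expression: groups times measurements, or no cap."""
    return n_groups * N_MEASUREMENTS if n_groups is not None else None


print(series_cap(5))     # 30
print(series_cap(None))  # None
print(series_cap(59))    # 354
```

Since 59 × 6 = 354 exceeds the 270 series the docstring reports for the full dataset, the product is best read as an upper bound on the number of raw series rather than an exact count.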

Tutorials

The following example notebooks use this component:

  • How to Aggregate Scorer Results
    Evaluation-Search
    Demonstrate all scorer aggregation strategies (stepwise, vintagewise, componentwise, groupwise, coveragewise, all) on panel data with weighted group aggregation.

  • How to Forecast Panel Prediction Intervals
    Panel-Data
    Combine conformal and quantile regression intervals on panel data with per-group coverage analysis, calibration plots, and groupwise interval scoring.