make_exogenous_classification

yohou.datasets._generators.make_exogenous_classification(*, n_samples=400, forecasting_horizon=6, noise=0.1, forecast_bias=0.3, random_state=42)

Generate a synthetic classification dataset with exogenous features.

Creates hourly air quality readings classified into three categories based on pollutant concentration thresholds.

Three exogenous feature types are produced:

  • X_actual (observation features): realized pollutant readings with a 24 hour sinusoidal cycle.
  • X_future (known future): a deterministic is_weekend indicator covering the full time range.
  • X_forecast (external forecasts): pollutant concentration forecasts with one vintage per observation, each covering the next forecasting_horizon steps.
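
The vintage layout of X_forecast can be sketched with the standard library alone. This is a minimal illustration modeled on the loop in the source listing further down this page, not the library's own code: `forecast_index_pairs` is a hypothetical helper. It assumes, as that loop does, that vintages start at index forecasting_horizon and that steps falling past the last observation are dropped, so the latest vintages are shorter than forecasting_horizon.

```python
from datetime import datetime, timedelta


def forecast_index_pairs(n_samples: int, forecasting_horizon: int) -> list[tuple[int, int]]:
    """Return (vintage_index, target_index) pairs, mirroring the source loop."""
    pairs = []
    for i in range(forecasting_horizon, n_samples):
        for step in range(1, forecasting_horizon + 1):
            if i + step < n_samples:  # drop steps past the last observation
                pairs.append((i, i + step))
    return pairs


start = datetime(2024, 1, 1)  # first timestamp used by the generator
pairs = forecast_index_pairs(n_samples=10, forecasting_horizon=3)

# Show the first vintage as (vintage_time, time) timestamps.
for vintage, target in pairs[:3]:
    print(start + timedelta(hours=vintage), "->", start + timedelta(hours=target))
```

Each pair corresponds to one row of X_forecast: the first element maps to vintage_time and the second to time.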

Classification thresholds on the continuous pollutant signal:

  • "good": pollutant < 40
  • "moderate": 40 <= pollutant < 60
  • "poor": pollutant >= 60
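
As a worked illustration of these thresholds, the sketch below labels a single reading. Per the source listing on this page, the cut-offs are applied after subtracting a 5.0 weekend offset and adding a small boundary noise term; the noise is omitted here to keep the example deterministic. `is_weekend` and `label_reading` are hypothetical helpers, not part of yohou.

```python
from datetime import datetime, timedelta

START = datetime(2024, 1, 1)  # a Monday; first timestamp used by the generator


def is_weekend(hour_index: int) -> float:
    """Return 1.0 on Saturday or Sunday, else 0.0, for the given hourly sample."""
    return 1.0 if (START + timedelta(hours=hour_index)).weekday() >= 5 else 0.0


def label_reading(pollutant: float, hour_index: int) -> str:
    """Apply the documented thresholds to the weekend-adjusted reading."""
    effective = pollutant - 5.0 * is_weekend(hour_index)
    if effective < 40:
        return "good"
    if effective < 60:
        return "moderate"
    return "poor"


print(label_reading(42.0, hour_index=0))    # weekday: 42 stays in the middle band
print(label_reading(42.0, hour_index=120))  # Saturday 00:00: 42 - 5 = 37
```

The same raw reading can therefore fall into different classes on weekdays and weekends, which is why is_weekend is informative as a known-future feature.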

Parameters

| Name | Type | Description | Default |
|---|---|---|---|
| n_samples | int | Number of hourly observations. | 400 |
| forecasting_horizon | int | Number of forward steps per X_forecast vintage. | 6 |
| noise | float | Standard deviation of the classification boundary noise. | 0.1 |
| forecast_bias | float | Systematic bias added to pollutant forecasts relative to actuals. | 0.3 |
| random_state | int | Seed for reproducibility. | 42 |
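
To make the role of forecast_bias concrete: in the source listing on this page, each forecast value is the realized pollutant plus forecast_bias plus Gaussian noise with standard deviation 2.0, so the average forecast error settles near forecast_bias. The following is a standard-library sketch of that relationship only. The constant actual values are made up for illustration, and `simulate_forecasts` is a hypothetical helper that uses Python's `random` module rather than the NumPy generator the library uses.

```python
import random
from statistics import fmean


def simulate_forecasts(actuals, forecast_bias=0.3, noise_sd=2.0, seed=42):
    """Return actual + bias + N(0, noise_sd) for each actual value."""
    rng = random.Random(seed)
    return [a + forecast_bias + rng.gauss(0.0, noise_sd) for a in actuals]


actuals = [50.0] * 10_000  # illustrative constant readings, not generator output
forecasts = simulate_forecasts(actuals)

# Mean signed error of the forecasts relative to the actuals.
mean_error = fmean(f - a for f, a in zip(forecasts, actuals))
print(round(mean_error, 2))  # close to forecast_bias
```

Setting forecast_bias to 0.0 in this sketch leaves only the zero-mean noise, which is a quick way to compare biased and unbiased external forecasts.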

Returns

Type Description
Bunch

Dictionary-like object with the following attributes:

  • y (pl.DataFrame): Target with columns ["time", "air_quality"]. Values are one of "good", "moderate", "poor".
  • X_actual (pl.DataFrame): Observation features with columns ["time", "pollutant"].
  • X_future (pl.DataFrame): Known future features with columns ["time", "is_weekend"].
  • X_forecast (pl.DataFrame): External forecasts with columns ["vintage_time", "time", "pollutant_forecast"].
  • frame (pl.DataFrame): y, X_actual, and X_future joined on "time".
  • feature_names (list of str): ["pollutant", "is_weekend", "pollutant_forecast"].
  • target_names (list of str): ["air_quality"].
  • classes (list of str): ["good", "moderate", "poor"].
  • frequency (str): "1h".
  • DESCR (str): Human readable description.

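See Also

  • make_exogenous_regression: Regression variant with continuous target.
  • fetch_air_quality_classification: Real air quality classification dataset.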

Examples

>>> from yohou.datasets import make_exogenous_classification
>>> data = make_exogenous_classification(n_samples=200)
>>> data.y.columns
['time', 'air_quality']
>>> sorted(data.classes)
['good', 'moderate', 'poor']

Source Code

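from datetime import datetime, timedelta

import numpy as np
import polars as pl

# Bunch is assumed to be imported elsewhere in the module; its origin is not shown here.


def make_exogenous_classification(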
    *,
    n_samples: int = 400,
    forecasting_horizon: int = 6,
    noise: float = 0.1,
    forecast_bias: float = 0.3,
    random_state: int = 42,
) -> Bunch:
    """Generate a synthetic classification dataset with exogenous features.

    Creates hourly air quality readings classified into three categories
    based on pollutant concentration thresholds.

    Three exogenous feature types are produced:

    - **X_actual** (observation features): realized pollutant readings
      with a 24 hour sinusoidal cycle.
    - **X_future** (known future): a deterministic ``is_weekend``
      indicator covering the full time range.
    - **X_forecast** (external forecasts): pollutant concentration
      forecasts with one vintage per observation, each covering the next
      ``forecasting_horizon`` steps.

    Classification thresholds on the continuous pollutant signal:

    - ``"good"``: pollutant < 40
    - ``"moderate"``: 40 <= pollutant < 60
    - ``"poor"``: pollutant >= 60

    Parameters
    ----------
    n_samples : int, default=400
        Number of hourly observations.
    forecasting_horizon : int, default=6
        Number of forward steps per X_forecast vintage.
    noise : float, default=0.1
        Standard deviation of the classification boundary noise.
    forecast_bias : float, default=0.3
        Systematic bias added to pollutant forecasts relative to actuals.
    random_state : int, default=42
        Seed for reproducibility.

    Returns
    -------
    Bunch
        Dictionary-like object with the following attributes:

        y : pl.DataFrame
            Target with columns ``["time", "air_quality"]``. Values are
            one of ``"good"``, ``"moderate"``, ``"poor"``.
        X_actual : pl.DataFrame
            Observation features with columns ``["time", "pollutant"]``.
        X_future : pl.DataFrame
            Known future features with columns ``["time", "is_weekend"]``.
        X_forecast : pl.DataFrame
            External forecasts with columns
            ``["vintage_time", "time", "pollutant_forecast"]``.
        frame : pl.DataFrame
            ``y``, ``X_actual``, and ``X_future`` joined on ``"time"``.
        feature_names : list of str
            ``["pollutant", "is_weekend", "pollutant_forecast"]``.
        target_names : list of str
            ``["air_quality"]``.
        classes : list of str
            ``["good", "moderate", "poor"]``.
        frequency : str
            ``"1h"``.
        DESCR : str
            Human readable description.

    See Also
    --------
    - [`make_exogenous_regression`][yohou.datasets._generators.make_exogenous_regression] : Regression variant with continuous target.
    - [`fetch_air_quality_classification`][yohou.datasets._fetchers.fetch_air_quality_classification] : Real air quality classification dataset.

    Examples
    --------
    >>> from yohou.datasets import make_exogenous_classification
    >>> data = make_exogenous_classification(n_samples=200)
    >>> data.y.columns
    ['time', 'air_quality']
    >>> sorted(data.classes)
    ['good', 'moderate', 'poor']

    """
    rng = np.random.default_rng(random_state)
    times = pl.Series(
        "time",
        [datetime(2024, 1, 1) + timedelta(hours=i) for i in range(n_samples)],
    )
    t = np.arange(n_samples, dtype=float)

    # Pollutant with 24h cycle centered at 50, amplitude 20
    pollutant = 50.0 + 20.0 * np.sin(2 * np.pi * t / 24) + rng.normal(0, 5.0, n_samples)

    is_weekend = np.array([
        1.0 if (datetime(2024, 1, 1) + timedelta(hours=i)).weekday() >= 5 else 0.0 for i in range(n_samples)
    ])

    # Weekend effect: lower pollutant readings
    effective_pollutant = pollutant - 5.0 * is_weekend + rng.normal(0, noise, n_samples)

    # Classify
    classes = ["good", "moderate", "poor"]
    labels = np.where(
        effective_pollutant < 40,
        "good",
        np.where(effective_pollutant < 60, "moderate", "poor"),
    )

    y = pl.DataFrame({"time": times, "air_quality": labels})
    X_actual = pl.DataFrame({"time": times, "pollutant": pollutant})
    X_future = pl.DataFrame({"time": times, "is_weekend": is_weekend})

    forecast_rows: list[dict[str, object]] = []
    for i in range(forecasting_horizon, n_samples):
        for step in range(1, forecasting_horizon + 1):
            if i + step < n_samples:
                forecast_rows.append({
                    "vintage_time": times[i],
                    "time": times[i + step],
                    "pollutant_forecast": float(pollutant[i + step] + forecast_bias + rng.normal(0, 2.0)),
                })
    X_forecast = pl.DataFrame(forecast_rows)

    frame = y.join(X_actual, on="time").join(X_future, on="time")

    return Bunch(
        y=y,
        X_actual=X_actual,
        X_future=X_future,
        X_forecast=X_forecast,
        frame=frame,
        feature_names=["pollutant", "is_weekend", "pollutant_forecast"],
        target_names=["air_quality"],
        classes=classes,
        frequency="1h",
        DESCR=(
            "Synthetic hourly air quality classification with exogenous features.\n"
            "Target: air_quality in {good, moderate, poor} based on pollutant thresholds.\n"
            "X_actual: realized pollutant readings (sinusoidal 24h cycle + noise).\n"
            "X_future: is_weekend indicator.\n"
            "X_forecast: pollutant concentration forecasts with systematic bias."
        ),
    )