Data Catalog¶
Bundled datasets available in yohou.datasets. Each dataset is downloaded on first use and cached locally as Parquet files. The cache directory defaults to ~/yohou_data/ and can be changed via the data_home parameter or the YOHOU_DATA environment variable.
Quick Reference¶
| Function | Shape | Frequency | Series | Observations | Domain |
|---|---|---|---|---|---|
| fetch_sunspot() | Univariate | Daily | 1 | 73,924 | Astronomy |
| fetch_tourism_monthly() | Panel | Monthly | 366 | ~80/series | Tourism |
| fetch_tourism_quarterly() | Panel | Quarterly | 427 | ~32/series | Tourism |
| fetch_hospital() | Panel | Monthly | 767 | 84/series | Healthcare |
| fetch_pedestrian_counts() | Panel | Hourly | 66 | ~10,000/series | Transport |
| fetch_kdd_cup() | Panel | Hourly | 354 | ~8,760/series | Environment |
| fetch_electricity_demand() | Multivariate | Half-hourly | 5 | ~46,000/series | Energy |
| fetch_dominick() | Panel | Weekly | 115,704 | ~412/series | Retail |
| fetch_air_quality_classification() | Classification | Hourly | 1 | ~8,000 | Environment |
| fetch_demand_classification() | Classification | Half-hourly | 1 | ~46,000 | Energy |
Return Type¶
All numeric fetch functions return a sklearn.utils.Bunch (a dict subclass with attribute access) containing:
| Attribute | Type | Description |
|---|---|---|
| frame | pl.DataFrame | Time column (time, Datetime) followed by one or more target columns (Float64) |
| feature_names | list[str] | Names of all non-time columns |
| DESCR | str | Full dataset description |
| frequency | str | Polars duration string ("1d", "1mo", "3mo", "1h", "30m", "1w") |
| n_series | int | Number of series loaded |
| filename | str | Path to the cached Parquet file |
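The "dict subclass with attribute access" behaviour can be illustrated without installing anything. The sketch below is a minimal stand-in, not scikit-learn's actual Bunch class; the name MiniBunch and the sample values are hypothetical and exist only for illustration.

```python
class MiniBunch(dict):
    """Hypothetical stand-in for sklearn.utils.Bunch: dict keys readable as attributes."""

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails, so dict methods still work.
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name) from None

    def __setattr__(self, name, value):
        # Attribute assignment writes through to the underlying dict.
        self[name] = value


# Illustrative values only; a real fetch function fills these in.
bunch = MiniBunch(frequency="1mo", n_series=10, feature_names=["T1__tourists"])

same_value = bunch.frequency == bunch["frequency"]  # both access styles agree
bunch.n_series = 5  # visible afterwards as bunch["n_series"]
```

Either access style can be used interchangeably on the returned object, so bunch.frame and bunch["frame"] refer to the same DataFrame.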
Classification fetch functions return a Bunch with different attributes:
| Attribute | Type | Description |
|---|---|---|
| y | pl.DataFrame | Time column + target column (Utf8 class labels) |
| X_actual | pl.DataFrame | Time column + numeric feature columns (Float64) |
| feature_names | list[str] | Names of feature columns in X_actual |
| target_names | list[str] | Names of target columns in y |
| classes | list[str] | Unique class labels |
| DESCR | str | Full dataset description |
Common Parameters¶
All fetch functions accept the following keyword-only parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| data_home | str, PathLike, or None | None | Cache directory. None resolves to the YOHOU_DATA environment variable if it is set, otherwise to ~/yohou_data/ |
| download_if_missing | bool | True | If False, raises OSError when data is not cached locally |
| n_retries | int | 3 | Number of download retry attempts |
| delay | float | 1.0 | Seconds between retries |
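The interaction between n_retries and delay can be pictured as a simple retry loop. This is only a sketch under assumptions: fetch_with_retries and flaky_download are hypothetical names, and whether yohou counts the first attempt as a retry, or which exceptions it retries on, is not stated on this page.

```python
import time


def fetch_with_retries(download, *, n_retries=3, delay=1.0):
    """Call download(); on OSError, retry up to n_retries more times.

    Hypothetical helper: the exact retry semantics inside yohou are an assumption.
    """
    attempts_left = n_retries
    while True:
        try:
            return download()
        except OSError:
            if attempts_left == 0:
                raise  # retries exhausted, surface the last error
            attempts_left -= 1
            time.sleep(delay)  # wait before the next attempt


# Simulated flaky download: fails twice, then succeeds.
calls = {"count": 0}


def flaky_download():
    calls["count"] += 1
    if calls["count"] < 3:
        raise OSError("temporary network error")
    return b"parquet-bytes"


payload = fetch_with_retries(flaky_download, n_retries=3, delay=0.0)
```

With delay=0.0 the loop retries immediately, which is convenient in tests; the default of 1.0 spaces the attempts out by one second.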
Panel datasets also accept:
| Parameter | Type | Default | Description |
|---|---|---|---|
| n_series | int or None | varies | Maximum number of series to load. None loads all series |
Numeric Datasets¶
fetch_sunspot()¶
Daily sunspot numbers, 1818 to 2020. Single univariate series with 73,924 observations.
| Property | Value |
|---|---|
| Frequency | "1d" (Daily) |
| Time column | time (Datetime) |
| Target column | sunspot_number (Float64) |
| Series count | 1 |
| Shape | Univariate |
fetch_tourism_monthly()¶
Monthly Australian tourism visitor counts, 1992 to 2011. Up to 366 panel series, ~80 observations each.
| Property | Value |
|---|---|
| Frequency | "1mo" (Monthly) |
| Time column | time (Datetime) |
| Target columns | T1__tourists, T2__tourists, ... (Float64) |
| Default n_series | None (all 366) |
| Shape | Panel |
from yohou.datasets import fetch_tourism_monthly
bunch = fetch_tourism_monthly(n_series=10)
bunch.n_series # 10
fetch_tourism_quarterly()¶
Quarterly Australian tourism visitor counts, 1992 to 2011. Up to 427 panel series, ~32 observations each.
| Property | Value |
|---|---|
| Frequency | "3mo" (Quarterly) |
| Time column | time (Datetime) |
| Target columns | T1__tourists, T2__tourists, ... (Float64) |
| Default n_series | None (all 427) |
| Shape | Panel |
from yohou.datasets import fetch_tourism_quarterly
bunch = fetch_tourism_quarterly(n_series=5)
bunch.frame.columns[:3] # ['time', 'T1__tourists', 'T2__tourists']
fetch_hospital()¶
Monthly hospital patient counts, 2000 to 2006. Up to 767 panel series, 84 observations each.
| Property | Value |
|---|---|
| Frequency | "1mo" (Monthly) |
| Time column | time (Datetime) |
| Target columns | T1__patients, T2__patients, ... (Float64) |
| Default n_series | None (all 767) |
| Shape | Panel |
from yohou.datasets import fetch_hospital
bunch = fetch_hospital(n_series=20)
bunch.feature_names[:2] # ['T1__patients', 'T2__patients']
fetch_pedestrian_counts()¶
Hourly Melbourne pedestrian sensor counts, 2009 to 2020. Up to 66 sensors, ~10,000 observations each.
| Property | Value |
|---|---|
| Frequency | "1h" (Hourly) |
| Time column | time (Datetime) |
| Target columns | T1__count, T2__count, ... (Float64) |
| Default n_series | 20 |
| Shape | Panel |
from yohou.datasets import fetch_pedestrian_counts
bunch = fetch_pedestrian_counts() # default 20 sensors
bunch_all = fetch_pedestrian_counts(n_series=None) # all 66 sensors
fetch_kdd_cup()¶
Hourly air quality measurements from 59 monitoring stations (Beijing and London), 2017 to 2018. Each station reports 6 pollutants (PM2.5, PM10, NO2, CO, O3, SO2), so the total series count equals n_groups × 6. Column names follow the station__measurement format.
| Property | Value |
|---|---|
| Frequency | "1h" (Hourly) |
| Time column | time (Datetime) |
| Target columns | beijing_dongsi_aq__pm2.5, beijing_dongsi_aq__pm10, ... (Float64) |
| Default n_groups | 5 (30 series) |
| Shape | Panel |
This function accepts n_groups instead of n_series. Each group is one station with all 6 pollutant measurements.
from yohou.datasets import fetch_kdd_cup
bunch = fetch_kdd_cup(n_groups=2)
bunch.n_series # 12 (2 stations × 6 pollutants)
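Because column names follow the station__measurement format, they can be regrouped by station with plain string handling. The sketch below works on a hand-written list rather than a downloaded frame; only the two beijing_dongsi_aq names appear on this page, and the station_b names are made up for illustration.

```python
from collections import defaultdict

# Example column names in the station__measurement format (time column excluded).
# The "station_b" entries are hypothetical placeholders.
columns = [
    "beijing_dongsi_aq__pm2.5",
    "beijing_dongsi_aq__pm10",
    "station_b__pm2.5",
    "station_b__pm10",
]

by_station = defaultdict(list)
for name in columns:
    # rpartition splits on the last "__" in the name.
    station, _, measurement = name.rpartition("__")
    by_station[station].append(measurement)

by_station = dict(by_station)
n_groups = len(by_station)  # number of distinct stations in the list
```

On a real result the same loop could be run over bunch.feature_names, in which case n_groups should match the value passed to fetch_kdd_cup().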
fetch_electricity_demand()¶
Half-hourly electricity demand for 5 Australian states, 2008 to 2015. Fixed multivariate dataset with ~46,000 observations per series.
| Property | Value |
|---|---|
| Frequency | "30m" (Half-hourly) |
| Time column | time (Datetime) |
| Target columns | nsw__demand, qun__demand, sa__demand, tas__demand, vic__demand (Float64) |
| Series count | 5 (fixed) |
| Shape | Multivariate |
from yohou.datasets import fetch_electricity_demand
bunch = fetch_electricity_demand()
bunch.feature_names # ['nsw__demand', 'qun__demand', 'sa__demand', 'tas__demand', 'vic__demand']
fetch_dominick()¶
Weekly store-level profit data from Dominick's Finer Foods, 1989 to 1997. Up to 115,704 panel series, ~412 observations each. Default n_series=50 limits memory usage. Pass n_series=None to load all series (several GB).
| Property | Value |
|---|---|
| Frequency | "1w" (Weekly) |
| Time column | time (Datetime) |
| Target columns | T1__profit, T2__profit, ... (Float64) |
| Default n_series | 50 |
| Shape | Panel |
Classification Datasets¶
Classification fetch functions return y (target labels) and X_actual (numeric features) instead of frame. Both are pl.DataFrame instances with a shared time column.
fetch_air_quality_classification()¶
Hourly air quality classification derived from fetch_kdd_cup(n_groups=1). PM2.5 values are binned into 4 WHO categories using thresholds at 15, 35, and 75 µg/m³. Rows with null PM2.5 are dropped. ~8,000 observations.
| Property | Value |
|---|---|
| Frequency | Hourly |
| Target column (y) | air_quality (Utf8: "good", "moderate", "unhealthy", "hazardous") |
| Feature columns (X_actual) | pm10, no2, co, o3, so2 (Float64) |
| Shape | Multivariate |
from yohou.datasets import fetch_air_quality_classification
bunch = fetch_air_quality_classification()
bunch.classes # ['good', 'hazardous', 'moderate', 'unhealthy']
bunch.y.head()
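The threshold binning described for this dataset can be sketched in a few lines of standard-library Python. This is not the library's implementation: pm25_category is a hypothetical helper, the assignment of labels to bins in ascending order is inferred from the label list, and the handling of readings that sit exactly on a threshold is an assumption.

```python
from bisect import bisect_left

THRESHOLDS = [15, 35, 75]  # µg/m³, as stated in the dataset description
LABELS = ["good", "moderate", "unhealthy", "hazardous"]


def pm25_category(value):
    """Map a PM2.5 reading to a label.

    Assumption: a reading equal to a threshold falls in the lower bin
    (value <= 15 is "good"). The real boundary rule may differ.
    """
    return LABELS[bisect_left(THRESHOLDS, value)]


readings = [8.0, 15.0, 20.5, 60.0, 120.0]
labels = [pm25_category(v) for v in readings]
classes = sorted(set(labels))  # alphabetical order of the distinct labels
```

Sorting the distinct labels alphabetically gives the same ordering as the bunch.classes output shown in the example for this dataset.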
fetch_demand_classification()¶
Half-hourly electricity demand classification derived from fetch_electricity_demand(). Victoria's demand is binned into 3 tercile-based levels. Features are the remaining 4 states. ~46,000 observations.
| Property | Value |
|---|---|
| Frequency | Half-hourly |
| Target column (y) | demand_level (Utf8: "low", "medium", "high") |
| Feature columns (X_actual) | nsw__demand, qun__demand, sa__demand, tas__demand (Float64) |
| Shape | Multivariate |
from yohou.datasets import fetch_demand_classification
bunch = fetch_demand_classification()
bunch.classes # ['high', 'low', 'medium']
bunch.X_actual.columns # ['time', 'nsw__demand', 'qun__demand', 'sa__demand', 'tas__demand']
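Tercile-based binning means the two cut points are taken from the data itself, so each level receives roughly a third of the observations. The sketch below shows the idea on a toy list using the standard library; demand_level is a hypothetical helper, and the quantile method and tie handling used by yohou are assumptions.

```python
from statistics import quantiles

# Toy demand values standing in for Victoria's demand series.
demand = [3.0, 9.0, 1.0, 7.0, 5.0, 8.0, 2.0, 6.0, 4.0]

# n=3 returns the two cut points that split the data into three groups.
low_cut, high_cut = quantiles(demand, n=3)


def demand_level(value):
    # Assumption: values equal to a cut point go to the lower level.
    if value <= low_cut:
        return "low"
    if value <= high_cut:
        return "medium"
    return "high"


levels = [demand_level(v) for v in demand]
```

Unlike fixed thresholds, the cut points here move with the data, so the three classes stay approximately balanced.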
Synthetic Generators¶
Generator functions create synthetic datasets locally, so no download is required. They are useful for testing and for examples involving exogenous features.
make_exogenous_regression()¶
Generates synthetic hourly electricity prices driven by temperature, holiday, and weather forecast features.
def make_exogenous_regression(
*,
n_samples: int = 200,
forecasting_horizon: int = 6,
noise: float = 0.1,
forecast_bias: float = 0.5,
random_state: int = 42,
) -> Bunch
| Parameter | Type | Default | Description |
|---|---|---|---|
| n_samples | int | 200 | Number of time steps |
| forecasting_horizon | int | 6 | Length of the forecast period |
| noise | float | 0.1 | Scale of random noise |
| forecast_bias | float | 0.5 | Bias added to forecast features |
| random_state | int | 42 | Random seed for reproducibility |
Returns a Bunch with:
| Attribute | Description |
|---|---|
| y | pl.DataFrame with time and price (Float64) |
| X_actual | pl.DataFrame with time and temperature (Float64) |
| X_future | pl.DataFrame with time and is_holiday (Float64) |
| X_forecast | pl.DataFrame with vintage_time, time, and wx_temp (Float64) |
| frame | pl.DataFrame joining y, X_actual, and X_future on time |
| feature_names | ["temperature", "is_holiday", "wx_temp"] |
| target_names | ["price"] |
| frequency | "1h" |
from yohou.datasets import make_exogenous_regression
bunch = make_exogenous_regression(n_samples=500)
bunch.y.shape # (500, 2)
bunch.X_actual.shape # (500, 2)
make_exogenous_classification()¶
Generates synthetic hourly air quality labels driven by pollutant level, weekend indicator, and pollutant forecast features.
def make_exogenous_classification(
*,
n_samples: int = 400,
forecasting_horizon: int = 6,
noise: float = 0.1,
forecast_bias: float = 0.3,
random_state: int = 42,
) -> Bunch
| Parameter | Type | Default | Description |
|---|---|---|---|
| n_samples | int | 400 | Number of time steps |
| forecasting_horizon | int | 6 | Length of the forecast period |
| noise | float | 0.1 | Scale of random noise |
| forecast_bias | float | 0.3 | Bias added to forecast features |
| random_state | int | 42 | Random seed for reproducibility |
Returns a Bunch with:
| Attribute | Description |
|---|---|
| y | pl.DataFrame with time and air_quality (Utf8: "good", "moderate", "poor") |
| X_actual | pl.DataFrame with time and pollutant (Float64) |
| X_future | pl.DataFrame with time and is_weekend (Float64) |
| X_forecast | pl.DataFrame with vintage_time, time, and pollutant_forecast (Float64) |
| frame | pl.DataFrame joining y, X_actual, and X_future on time |
| classes | ["good", "moderate", "poor"] |
| frequency | "1h" |
from yohou.datasets import make_exogenous_classification
bunch = make_exogenous_classification(n_samples=300)
bunch.classes # ['good', 'moderate', 'poor']
Utility Functions¶
get_data_home()¶
from yohou.datasets import get_data_home
get_data_home() # ~/yohou_data/ (default)
get_data_home("/tmp/data") # /tmp/data (custom path, created if missing)
Returns the path to the data cache directory. If data_home is None, resolves to the YOHOU_DATA environment variable or ~/yohou_data/. Creates the directory if it does not exist.
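The resolution order described above can be sketched with the standard library alone. This is a hypothetical re-implementation for illustration, not yohou's source: resolve_data_home is an invented name, and the assumption that the environment variable takes precedence over the built-in default is labeled in the code.

```python
import os
import tempfile
from pathlib import Path


def resolve_data_home(data_home=None):
    """Hypothetical re-implementation of the resolution order described above."""
    if data_home is None:
        # Assumption: YOHOU_DATA, when set, overrides the ~/yohou_data default.
        data_home = os.environ.get("YOHOU_DATA", "~/yohou_data")
    path = Path(data_home).expanduser()
    path.mkdir(parents=True, exist_ok=True)  # create the directory if it is missing
    return path


# Demonstrate inside a temporary directory so nothing is written to the home folder.
with tempfile.TemporaryDirectory() as tmp:
    explicit = resolve_data_home(Path(tmp) / "explicit")
    explicit_created = explicit.is_dir()

    previous = os.environ.get("YOHOU_DATA")
    os.environ["YOHOU_DATA"] = str(Path(tmp) / "from_env")
    try:
        from_env = resolve_data_home()
        from_env_created = from_env.is_dir()
    finally:
        # Restore the environment to its earlier state.
        if previous is None:
            os.environ.pop("YOHOU_DATA", None)
        else:
            os.environ["YOHOU_DATA"] = previous
```

An explicit data_home argument always wins in this sketch; the environment variable is consulted only when the argument is None.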
clear_data_home()¶
Deletes all cached data files in the data home directory.
See Also¶
- Extensions: base classes and extension packages for custom components
- Tags: tag system for declaring component capabilities