Exogenous Features¶

Forecasting rarely happens in isolation. Electricity prices depend on weather, retail demand responds to holidays, and industrial output tracks commodity indices. These external signals are exogenous features, and getting them into a forecasting model correctly is surprisingly subtle. A single X parameter cannot capture the temporal semantics that matter in production forecasting, so Yohou separates exogenous data into three parameters: X_actual, X_future, and X_forecast. All three appear in fit() and observe(); only X_future and X_forecast appear in predict(), because observation features are not available for future time steps.

The Three Categories¶

External data that feeds a forecasting model falls into exactly three categories, each with distinct temporal availability:

X_actual: Observation Features¶

Actual measurements available up to the current observation point. Temperature readings, sensor data, realized demand, settled prices. These values are historical by definition: you cannot know tomorrow's actual temperature today.

X_actual flows through the actual_transformer pipeline. Lag features, rolling statistics, and other time-dependent transformations apply to it just as they do to the target variable. Because X_actual is unavailable for future time steps, it does not appear in the predict() signature. The forecaster stores the most recent observation window internally and uses it automatically at predict time.

X_future: Known-Future Features¶

Values you can look up for a future timestamp but cannot derive from the observation point. Holiday calendars, scheduled auction prices, planned maintenance windows, announced promotions. Looking up whether December 25th is a holiday gives the same answer whether you check in January or November, but no amount of inspecting today's date tells you when Easter falls next year. That takes a calendar.

X_future bypasses the actual_transformer entirely. Instead, the framework windows it forward from each observation point to produce step-indexed columns (is_holiday_step_1, is_holiday_step_2, ..., is_holiday_step_H). Each step column tells the estimator what the holiday status will be at that specific forecast horizon.

What Belongs Here¶

actual_transformer only ever runs on the observation frame, which stops at the observation point T. It can compute a feature for T, for T-1, for any timestamp already observed, and for none beyond. That single fact decides the channel:

Does the feature's value at T determine its value at T+h?

If it does, actual_transformer already suffices and X_future is the wrong home. A Fourier pair is the clearest case: sin(2*pi*(t+h)/S) is a fixed linear combination of the sine and cosine at t, so once the estimator sees the pair at T it can express the value at every horizon. Windowing it forward across H steps adds columns without adding information, and those columns are exactly collinear with the pair the model already had.
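The collinearity claim follows from the angle-addition identity sin(a + b) = sin(a)cos(b) + cos(a)sin(b). The following is a minimal numeric check using only the standard library; the function names, the period S = 24, and the horizon range are illustrative choices, not part of Yohou.

```python
import math


def fourier_pair(t, period):
    """Return (sin, cos) of the seasonal angle at time t."""
    angle = 2 * math.pi * t / period
    return math.sin(angle), math.cos(angle)


def shifted_sine_from_pair(sin_t, cos_t, h, period):
    """Rebuild sin(2*pi*(t+h)/S) as a linear combination of the pair at t.

    The weights cos(phase) and sin(phase) depend only on h and S, not on t,
    which is why a forward-windowed column adds no new information.
    """
    phase = 2 * math.pi * h / period
    return sin_t * math.cos(phase) + cos_t * math.sin(phase)


S = 24.0  # illustrative seasonal period
max_error = 0.0
for t in range(100):
    sin_t, cos_t = fourier_pair(t, S)
    for h in range(1, 13):
        direct = math.sin(2 * math.pi * (t + h) / S)
        rebuilt = shifted_sine_from_pair(sin_t, cos_t, h, S)
        max_error = max(max_error, abs(direct - rebuilt))
```

The rebuilt value matches the direct one up to floating-point error for every t and h tried, so a linear estimator given the pair at T can already express the sine at any horizon.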

If it does not, nothing computed at T can stand in, and the forward window is the only way to put the value in front of the estimator. Whether next Tuesday is a public holiday is not a function of anything measurable today. It is a lookup.

The same test, in the form you can apply while writing the fit() call: can you compute it from the timestamp alone, or do you need an external table? A timestamp alone means a clock feature, which belongs in actual_transformer (see FourierFeatureTransformer and CalendarFeatureTransformer). An external table means an event feature, which belongs in X_future.

Note that "deterministic and knowable in advance" does not separate the two. A day-of-week indicator and a holiday calendar are both perfectly deterministic and both knowable for any future date, yet they belong in different channels. Determinism is not the question; derivability from T is.

Holidays Sit On Both Sides¶

Holidays deserve a note, because they can legitimately appear on either channel and the two uses are complementary rather than competing.

HolidayFeatureTransformer on X_actual gives past holiday effects: whether recently observed timestamps were holidays, which is what you want when demand rebounds the day after a closure. It runs on the observation frame, so it cannot say anything about a holiday next week.

A holiday calendar passed through X_future gives future holiday effects: whether each forecast step lands on a holiday, which is what you want when the holiday itself moves demand.

Reaching for one does not rule out the other. A model that needs both the closure and the rebound uses both.

X_forecast: External Forecasts¶

Predictions from external models, each issued at a specific time (the vintage). Weather model output, demand projections, competitor price forecasts. The 6:00 AM weather forecast and the 9:30 AM forecast for the same target hour typically differ because the model was updated with newer data.

X_forecast requires a vintage_time column that identifies when each forecast was issued. Like X_future, it bypasses actual_transformer and produces step-indexed columns. Unlike X_future, different vintages produce different step values, enabling multi-vintage prediction from a single observation state.

Bypassing actual_transformer is a statement about that slot, not about the frame. actual_transformer operates on single-axis X_actual data, and an X_forecast frame carries a second time axis it cannot read. The frame is still transformable, by a transformer built for its shape, and that transformer has a slot of its own: forecast_transformer. See Transformer Kinds for the two transformer kinds and why the vintage axis needs its own.

The Three Transformer Slots¶

A forecaster holds three transformer slots, each named for what it consumes:

Slot	Consumes	Kind
`target_transformer`	the target series	actual (single-axis)
`actual_transformer`	the feature frame built from `y` and `X_actual`	actual (single-axis)
`forecast_transformer`	`X_forecast`	forecast (vintage-indexed)

The split follows the frame shape rather than the role. An actual-kind transformer reads one time axis; a forecast-kind transformer reads (vintage_time, time). Passing one where the other belongs raises a ValueError naming the slot, and the message points at the slot that does accept it.

X_future has no slot. It is single-axis, so it needs no new base class, but a stateful actual transformer applied to it would hold future rows in its buffer across observe calls, which is a leak with no guard today. So one arm of the step-column derivation arrives transformed and the other does not. This is a known gap rather than a design position; clock features that would otherwise tempt you toward X_future belong in actual_transformer, which does have a slot.

Benefits of the Three-Parameter API¶

Separating exogenous data into three parameters provides four capabilities:

Leakage-free walk-forward evaluation. The observe/predict loop separates X_actual (observation-only, never passed to predict) from X_future and X_forecast (both available at predict time). This eliminates data leakage where future actual measurements (e.g., tomorrow's temperature) would otherwise appear in each prediction step.

Partial features at predict time. predict(X_future=holidays) or predict(X_forecast=weather) works without providing observation features. The API accepts only the data categories relevant at prediction time, so schema validation passes cleanly. Both overrides are optional and can be used independently or together.

Explicit predict-time semantics. predict() accepts X_future and X_forecast but not X_actual. The estimator reads observation-derived features from the stored observation buffer (_X_t_observed), which is built at fit and extended by observe(). Step columns from X_future and X_forecast are the only features that can be overridden at predict time, so there is no ambiguity about which features contribute to a given prediction.

Native support for vintage-indexed data. X_forecast accepts tidy tables with a vintage_time column directly. The framework handles the pivot from [vintage_time, time, col] to step-indexed format internally, removing manual preprocessing.

Step-Indexed Columns¶

Both X_future and X_forecast become step-indexed columns in the internal feature matrix. This pivoting transforms temporal data into the tabular format that sklearn estimators expect.

For a forecasting horizon \(H\), a feature column temperature, and each step \(h = 1, \dots, H\):

\[ \text{temperature\_step\_}h = \text{temperature}(T + h \cdot \Delta t) \]

where \(T\) is the observation time and \(\Delta t\) is the time series frequency.

The resulting feature matrix has columns temperature_step_1, temperature_step_2, ..., temperature_step_H alongside any transformer-derived features (target_lag_1, temp_rolling_mean_7, etc.).
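As a rough illustration of the formula, the sketch below builds step columns for a known-future feature by looking forward from each observation time. It is not the library's implementation: `window_future_sketch` is a hypothetical helper, rows are plain dictionaries rather than data frames, and a missing future timestamp is assumed to yield None.

```python
from datetime import datetime, timedelta


def window_future_sketch(observation_times, future_values, column, horizon, freq):
    """Build {column}_step_h values by looking forward from each observation.

    future_values maps timestamp -> value. A timestamp that is absent from
    the mapping produces None for that step (an assumption of this sketch).
    """
    rows = []
    for obs_time in observation_times:
        row = {"time": obs_time}
        for h in range(1, horizon + 1):
            # Step h is the feature value at T + h * delta_t.
            row[f"{column}_step_{h}"] = future_values.get(obs_time + h * freq)
        rows.append(row)
    return rows


freq = timedelta(days=1)
start = datetime(2024, 12, 22)
# Toy calendar covering one week: only December 25 is flagged.
is_holiday = {
    start + i * freq: int((start + i * freq).day == 25) for i in range(7)
}
rows = window_future_sketch([start, start + freq], is_holiday, "is_holiday", 3, freq)
```

For the observation on December 22 the holiday shows up in is_holiday_step_3; one day later the same holiday has moved to is_holiday_step_2.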

Two public utilities handle this pivoting:

window_forecasts() selects the latest vintage at or before each observation time (as-of matching) and converts tidy [vintage_time, time, col1, col2] to wide [time, col1_step_1, col1_step_2, ...]
window_futures() converts flat [time, col1, col2] to wide format by looking forward from each observation time

Both are called internally by _derive_step_columns() but are available as public utilities for data preparation workflows.
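The as-of matching described for window_forecasts() can be sketched in a few lines. This is a simplified stand-in, not the library function: `window_forecast_sketch` is a hypothetical name, time is an integer step index, and the forecast archive is a nested dictionary instead of a tidy table.

```python
from bisect import bisect_right


def window_forecast_sketch(observation_times, forecasts, column, horizon):
    """Pivot vintage-indexed forecasts into step columns with as-of matching.

    forecasts maps vintage_time -> {target_time: value}, using integer
    time steps. For each observation the newest vintage at or before it is
    selected; targets that vintage does not carry become None.
    """
    vintages = sorted(forecasts)
    rows = []
    for obs_time in observation_times:
        idx = bisect_right(vintages, obs_time) - 1  # newest vintage <= obs_time
        values = forecasts[vintages[idx]] if idx >= 0 else {}
        row = {"time": obs_time}
        for h in range(1, horizon + 1):
            row[f"{column}_step_{h}"] = values.get(obs_time + h)
        rows.append(row)
    return rows


# Two vintages of a toy wind forecast, issued at steps 0 and 2.
wind = {
    0: {1: 5.0, 2: 6.0, 3: 7.0},
    2: {3: 7.5, 4: 8.0},
}
rows = window_forecast_sketch([0, 1, 2], wind, "wind", 2)
```

Observations 0 and 1 both resolve against the vintage issued at 0, while observation 2 picks up the refreshed vintage, so the value for target time 3 changes from 7.0 to 7.5 between consecutive rows.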

The Bypass Principle¶

Step-indexed columns bypass actual_transformer entirely. The actual_transformer operates on X_actual (and optionally on the target via target_as_feature) to produce lags, rolling statistics, and other observation-derived features. Step columns from X_future and X_forecast are already forward-looking by construction: is_holiday_step_3 is the feature for horizon 3. Applying lag or rolling transformations to them would be meaningless.

This bypass also brings a practical benefit: at predict time, the framework can swap step columns without re-running the transformer. Five different weather forecast vintages produce five different predictions, one predict() call per vintage, with no deepcopy and no transformer refit.

Step Feature Alignment¶

When using the "direct" reduction strategy (which fits \(H\) independent estimators, one per forecast horizon), the step_feature_alignment parameter controls which step columns each estimator sees. This parameter is available on point, interval, and class-probability reduction forecasters.

Mode	Estimator \(h\) receives	Use case
`"all"` (default)	All step columns `*_step_1..H`	Maximum information, backward compatible
`"matched"`	Only `*_step_h`	Cleanest signal: each estimator sees only its horizon's forecast
`"cumulative"`	`*_step_1..h`	All information up to and including horizon \(h\)

For an electricity pricing use case, step_feature_alignment="matched" means estimator \(h\) trains on (wind_step_h, price_step_h): the weather forecast for time \(T+h\) predicting the price at \(T+h\). This avoids cross-horizon information that could confuse simpler estimators.
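The three modes amount to a column filter applied per estimator. The sketch below shows one way to express that filter; `select_step_columns` is a hypothetical helper, and it assumes that columns not matching the `*_step_N` pattern (lags, rolling statistics) are always kept, which the text implies but does not state.

```python
import re

STEP_PATTERN = re.compile(r"^(?P<base>.+)_step_(?P<step>\d+)$")


def select_step_columns(columns, h, mode="all"):
    """Return the columns the estimator for horizon h would see.

    Assumption of this sketch: non-step columns pass through unfiltered.
    """
    if mode not in {"all", "matched", "cumulative"}:
        raise ValueError(f"unknown mode: {mode!r}")
    kept = []
    for name in columns:
        match = STEP_PATTERN.match(name)
        if match is None:
            kept.append(name)  # transformer-derived feature, not a step column
            continue
        step = int(match.group("step"))
        if (
            mode == "all"
            or (mode == "matched" and step == h)
            or (mode == "cumulative" and step <= h)
        ):
            kept.append(name)
    return kept


columns = ["target_lag_1", "wind_step_1", "wind_step_2", "wind_step_3"]
matched = select_step_columns(columns, 2, "matched")
cumulative = select_step_columns(columns, 2, "cumulative")
```

With h = 2, "matched" keeps target_lag_1 and wind_step_2 only, while "cumulative" also keeps wind_step_1.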

Why Only Direct¶

The parameter applies to "direct" alone. Setting it on another strategy emits a warning at fit and has no effect. The two other strategies are excluded for different reasons.

"multi-output" cannot filter. One estimator predicts every horizon from a single feature vector, and output \(h\) reads wind_step_h from that same vector, so every step column has to be present for some output. There is no per-estimator view to narrow, because there is only one estimator. Since "multi-output" is the default, this is the case most users are in.

"dir-rec" could filter. It fits \(H\) estimators, one per step, exactly as "direct" does, and a per-step column filter applied before each step's feature augmentation would be well defined. It does not, and that is a deliberate scope decision rather than a structural limit: combining a filtered base with the accumulating augmentation columns raises design questions that have not been worked through. Recording it here is what stops the exclusion from being read as an oversight.

Predict-Time Override (Column Swap)¶

When predict(X_forecast=...) or predict(X_future=...) is called with new data, the framework temporarily replaces all step columns in _X_t_observed with freshly derived values. The save-swap-restore flow:

Resolve effective raws: use the provided override, or fall back to the stored _X_future_raw_ / _X_forecast_raw_ from fit
Re-derive all step columns via _derive_step_columns()
Save the current step columns and raws from _X_t_observed
Swap the new raws and step columns into _X_t_observed
Call the underlying estimator's predict
Restore saved raws and step columns (in a finally block, so state is always restored even on error)

The forecaster's state is unchanged after the call. Five consecutive predict() calls with five different X_forecast values return five different results, all independent.
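The save-swap-restore flow maps naturally onto a try/finally block. The toy class below is a sketch of that pattern only: `SwapSketch` and its attributes are hypothetical, the re-derivation step is omitted, and the "estimator" is just a sum over the step values.

```python
class SwapSketch:
    """Toy holder of step columns; a stand-in, not the Yohou forecaster."""

    def __init__(self, step_columns):
        self.step_columns = dict(step_columns)

    def predict(self, override=None):
        # Resolve: use the override if given, else the stored values.
        effective = override if override is not None else self.step_columns
        saved = self.step_columns            # save
        self.step_columns = dict(effective)  # swap
        try:
            return self._estimate()
        finally:
            self.step_columns = saved        # restore, even if _estimate raises

    def _estimate(self):
        # Placeholder for the underlying estimator's predict.
        return sum(self.step_columns.values())


toy = SwapSketch({"wind_step_1": 1.0, "wind_step_2": 2.0})
default_prediction = toy.predict()
override_prediction = toy.predict({"wind_step_1": 10.0, "wind_step_2": 20.0})
state_after = dict(toy.step_columns)
```

After the override call, state_after still holds the original values, and a later call without an override returns the original prediction again.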

Thread Safety

The column-swap mechanism mutates and restores _X_t_observed in place. For parallel multi-vintage predictions, copy.deepcopy(forecaster) once per thread.
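A sketch of the copy-before-parallel pattern, under stated assumptions: `ToyForecaster` is a hypothetical stand-in whose predict() mutates and restores its own state, and the example takes one private copy per submitted task, which is a stricter variant of one copy per thread.

```python
import copy
from concurrent.futures import ThreadPoolExecutor


class ToyForecaster:
    """Stand-in whose predict() swaps and restores state in place."""

    def __init__(self):
        self.step_value = 0.0

    def predict(self, X_forecast):
        saved = self.step_value
        self.step_value = X_forecast
        try:
            return self.step_value * 2
        finally:
            self.step_value = saved


def predict_vintage(forecaster, vintage_value):
    # Work on a private copy so no mutable state is shared across threads.
    private = copy.deepcopy(forecaster)
    return private.predict(vintage_value)


forecaster = ToyForecaster()
vintages = [1.0, 2.0, 3.0, 4.0, 5.0]
with ThreadPoolExecutor(max_workers=5) as pool:
    results = list(pool.map(lambda v: predict_vintage(forecaster, v), vintages))
```

The shared forecaster is only ever read (by deepcopy), never mutated, so the parallel calls cannot observe each other's swapped columns.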

Partial Coverage and Null Handling¶

Not every X_forecast vintage covers the full forecast horizon. If the weather model issues a 12-step forecast but the forecaster was trained with H=24, the left join produces null step columns for steps 13 through 24. This is by design: tree-based estimators (XGBoost, LightGBM, HistGradientBoosting) handle null values natively. Linear models require imputation or complete coverage.

Similarly, X_forecast may not cover all training observation times. Rows without matching forecast data produce null step columns. This is common when forecast archives start later than the target series.

Step columns always span exactly 1..H per value column because derivation extracts \(H\) steps per observation, anchored to the observation time: step \(k\) is the value at \(T + k \cdot \Delta t\), taken from the newest vintage at or before \(T\). Vintages are not trimmed to a fixed window. A value at a target time beyond one observation's horizon is not discarded; it simply serves an earlier observation whose horizon does reach it. Where the resolved vintage carries no value at a step's target time, that step column is padded with null and a UserWarning is emitted.

Coverage is measured per observation and per base column. At fit, a column that fails to cover some training observations is named in a per-column UserWarning, with the count of observations it misses, so a channel that is dead for most of the batch is visible rather than hidden behind a single covered row. On the observe and predict paths the per-call warning reports the worst-covered observation, distinguishing zero coverage (the channel contributes nothing) from partial coverage (a short-range forecast).

Heterogeneous Vintage Cadence¶

A single X_forecast frame can carry channels issued on different schedules: a weather forecast refreshed daily next to a demand projection refreshed weekly. Each base column resolves against its own newest applicable vintage. The daily channel resolves from its daily vintage; the weekly channel resolves from its own most recent vintage, even when newer vintages carrying only the daily channel exist.

This resolution is realized by densifying the vintage axis before step columns are derived. For each vintage and each base column, the values come from the newest vintage at or before it that carries that column, so every vintage row carries a value for every column. All of a column's steps therefore originate in a single source vintage: a forecast trajectory is never spliced across vintages, and where a source vintage does not reach a target time the value stays null. A frame where every vintage already carries every column (uniform cadence) is unchanged, so this is transparent to the common case.
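The densification rule can be illustrated on a nested dictionary. This is a sketch of the idea only: `densify_vintages` is a hypothetical helper, time is an integer step index, and a column that no earlier vintage carries is assumed to stay empty.

```python
def densify_vintages(frame, columns):
    """Fill each vintage with every column's newest available trajectory.

    frame maps vintage_time -> {column: {target_time: value}}. For each
    vintage and column, the whole trajectory is taken from the newest
    vintage at or before it that carries the column. Trajectories are
    copied whole, never spliced across vintages.
    """
    dense = {}
    latest = {}  # column -> trajectory from the newest vintage seen so far
    for vintage in sorted(frame):
        for column in columns:
            if column in frame[vintage]:
                latest[column] = frame[vintage][column]
        dense[vintage] = {
            column: dict(latest.get(column, {})) for column in columns
        }
    return dense


# Weather refreshes every step; demand was issued once, at vintage 0.
frame = {
    0: {"weather": {1: 10.0, 2: 11.0}, "demand": {1: 100.0, 2: 101.0, 3: 102.0}},
    1: {"weather": {2: 11.5, 3: 12.0}},
    2: {"weather": {3: 12.5, 4: 13.0}},
}
dense = densify_vintages(frame, ["weather", "demand"])
```

In the dense result, vintage 2 carries the demand trajectory issued at vintage 0 alongside its own weather values. Demand has no value at target time 4, because its source vintage never reached that far, so the corresponding step would stay null.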

Densification runs for every forecaster that consumes X_forecast, whether or not a forecast_transformer is set. That matters for two reasons. A plain forecaster (including an ordinary panel forecaster) has no forecast transformer, and without densification its slower channels would collapse to null under the frame-wide as-of. And when a forecast_transformer is present, it operates row-wise on the vintage-indexed frame, so a transformer that combines columns from differently-scheduled sources (summing several series, then deriving a further quantity from that sum and two others) needs all of its inputs on one row. The dense frame is the contract that makes such cross-source transformers work; a downstream consumer can rely on the frame reaching its forecast_transformer being dense.

Bounded Retention Across Observations¶

observe and rewind retain, for each base column, the newest vintage still covering the observation point, rather than collapsing the frame to a single vintage. This keeps channels on different schedules alive in the cache the fallback path reads, while retained state stays bounded by the number of base columns (at most one vintage per channel). A vintage is evicted once its own latest target no longer reaches past the observation point, so it covers nothing; that channel's step features then become null and the coverage diagnostic fires. For a vintage issued to cover the forecasting_horizon periods after its own vintage time, the common case, that happens once the vintage is a full horizon old; a longer-range vintage survives proportionally longer. Either way, eviction reads the vintage's own reach rather than a tunable age, so no staleness parameter is introduced.
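The eviction rule can be sketched as a per-column filter. The helper below is hypothetical and makes two assumptions beyond the text: only vintages issued at or before the observation point are candidates, and each vintage is summarized by its latest target time on an integer time axis.

```python
def retain_vintages(vintages, observation_point):
    """Pick, per column, the newest vintage still covering the observation.

    vintages maps column -> {vintage_time: latest_target_time}. A vintage
    covers the observation point while its latest target reaches past it;
    a column with no covering vintage maps to None (channel goes null).
    """
    retained = {}
    for column, reach_by_vintage in vintages.items():
        covering = [
            vintage
            for vintage, latest_target in reach_by_vintage.items()
            if vintage <= observation_point and latest_target > observation_point
        ]
        retained[column] = max(covering) if covering else None
    return retained


# Horizon of 3 steps: weather is reissued every step, demand only at step 0.
vintages = {
    "weather": {1: 4, 2: 5, 3: 6, 4: 7},
    "demand": {0: 3},
}
at_2 = retain_vintages(vintages, 2)
at_3 = retain_vintages(vintages, 3)
```

At observation point 2 both channels retain a vintage. At observation point 3 the demand vintage is a full horizon old, its latest target no longer reaches past the observation, and the channel is evicted. Retained state never exceeds one vintage per column.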

Expressing a Publication Lag¶

A forecast whose issuance boundary differs from its label (for example a day-ahead product available some fixed number of hours before the target boundary) is expressed by shifting the emitted vintage_time, not through a new API parameter. As-of selection reads whatever labels the frame carries, per column, and horizon steps are measured from the label. Because resolution is per column, sources with different lags compose naturally: shift each source's vintage_time by its own amount and each still resolves against its own vintages.
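A sketch of the relabeling, assuming the frame is a list of row dictionaries: `shift_vintage_time` is a hypothetical helper, the 12-hour offset is illustrative, and the sign of the offset depends on how a given source labels its issuance.

```python
from datetime import datetime, timedelta


def shift_vintage_time(rows, offset):
    """Return new rows with vintage_time moved by a signed offset.

    Target times and values are left untouched, and the input rows are not
    mutated.
    """
    return [{**row, "vintage_time": row["vintage_time"] + offset} for row in rows]


# Toy day-ahead product labelled at midnight of the delivery day.
day_ahead = [
    {
        "vintage_time": datetime(2024, 1, 2),
        "time": datetime(2024, 1, 2, hour),
        "price": 50.0 + hour,
    }
    for hour in range(3)
]
# Illustrative assumption: the product is usable 12 hours before its label.
shifted = shift_vintage_time(day_ahead, -timedelta(hours=12))
```

Each source gets its own call with its own offset before the frames are combined, which keeps the per-column resolution described above intact.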

The Observe-Predict Loop¶

The observe() method accepts X_actual, X_future, and X_forecast, matching the fit() signature. When new data becomes available after fitting, observe() extends the internal observation buffer (_X_t_observed) with new X_actual data processed through actual_transformer, and re-derives step columns from X_future and X_forecast.

A typical walk-forward evaluation alternates between observe() (to feed new actual data) and predict() (to produce forecasts from the updated state). The three-parameter separation ensures that observe() can update the observation window with X_actual without any of that data leaking into the prediction step, because predict() only accepts X_future and X_forecast.

Cross-Validation with Exogenous Data¶

In cross-validation, the three parameters receive different splitting treatment:

X_actual is split by time indices, same as the target y
X_future is passed in full to both training and testing (known-future data can be looked up for any date, so no filtering is needed)
X_forecast is split by vintage_time range: training receives vintages where vintage_time \(\leq T\), testing receives vintages where \(T <\) vintage_time \(\leq T_\text{test\_end}\), with \(T\) as the fold's training cutoff

The vintage_time filter on X_forecast prevents future forecast vintages from leaking into training folds. A forecast issued on Wednesday cannot train a model whose observation point is Monday. The test fold only sees vintages issued within the test window, matching what would be available in a real walk-forward deployment.
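The vintage_time rule for one fold reduces to two range filters. The sketch below uses plain dictionaries and integer times; `split_forecast_by_vintage` is a hypothetical helper, not the library's splitter.

```python
def split_forecast_by_vintage(rows, train_cutoff, test_end):
    """Split forecast rows by when they were issued.

    Training receives vintages with vintage_time <= train_cutoff; testing
    receives vintages with train_cutoff < vintage_time <= test_end.
    """
    train = [row for row in rows if row["vintage_time"] <= train_cutoff]
    test = [
        row for row in rows if train_cutoff < row["vintage_time"] <= test_end
    ]
    return train, test


# One row per vintage, each forecasting the next time step.
rows = [{"vintage_time": v, "time": v + 1, "wind": float(v)} for v in range(6)]
train, test = split_forecast_by_vintage(rows, train_cutoff=2, test_end=4)
train_vintages = [row["vintage_time"] for row in train]
test_vintages = [row["vintage_time"] for row in test]
```

Vintages 0 through 2 land in training, 3 and 4 in testing, and vintage 5 is excluded from both because it was issued after the test window closed.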

The same splitting logic is available via train_test_split for a single temporal split. Positional arrays (y, X_actual) are split by row position, X_future is not split (pass it to both fit and predict directly), and X_forecast is split by vintage_time range using the cutoff inferred from the first array's "time" column.

Composition Forecasters¶

All composition forecasters propagate the three parameters to their children:

ColumnForecaster routes X_actual, X_future, X_forecast to each child forecaster. Children that don't use exogenous features ignore the parameters via the requires_exogenous tag.
DecompositionPipeline passes all three parameters to the residual forecaster after trend/seasonality removal.
ForecastedFeatureForecaster uses X_actual as the target for the feature forecaster (training it to predict the exogenous series). At predict time the feature forecaster produces a forecast that is passed to the target forecaster as X_forecast (merged with any caller-supplied X_forecast), so the forecasted features become step columns the target consumes. X_future is forwarded to the target unchanged. Its feature_stride parameter sets how often the feature forecaster regenerates this forecast (every step by default), for cases where the feature model is too expensive to re-run on every step.
VotingPointForecaster, VotingIntervalForecaster, and VotingClassProbaForecaster each pass all three parameters to every ensemble member.
SplitConformalForecaster forwards all parameters to the wrapped point forecaster. It declares no transformer slots of its own, so the inner forecaster's slots are the configuration surface: reach them through the nested path, as in point_forecaster__forecast_transformer. Configuring the transform there rather than outside the forecaster keeps it tunable through search and applied consistently across fit, calibration, and prediction, since all three go through the same inner.

Connections¶

Exogenous Features Tutorial provides a hands-on introduction with synthetic data
How to Use Exogenous Features covers production workflow recipes
Forecaster Composition describes ForecastedFeatureForecaster, which automates the two-stage pattern of forecasting exogenous features before the target
Reduction Forecasting explains the direct reduction strategy and how step_feature_alignment fits in
window_forecasts and window_futures handle vintage pivoting and known-future windowing respectively