ML and DL Forecasting
Core Idea
Machine learning methods treat time series forecasting as supervised learning: construct features from historical observations (lags, rolling statistics, calendar features) and train a standard model — gradient boosting, random forest, or a neural network — to predict the next value(s). Deep learning approaches learn the temporal structure directly from raw sequences.
This contrasts with classical statistical methods (ARIMA, ETS) which model the time series process directly.
Mathematical Formulation
Tabular Forecasting with Gradient Boosting
Transform a univariate time series into a supervised dataset by windowing:
This is the lag feature approach. Gradient boosting (XGBoost, LightGBM) excels here due to its robustness to scale, handling of missing lags, and built-in non-linearity.
import pandas as pd
import numpy as np
from lightgbm import LGBMRegressor
def make_lag_features(y: pd.Series, lags: list[int],
rolling_windows: list[int]) -> pd.DataFrame:
df = pd.DataFrame({'y': y})
for lag in lags:
df[f'lag_{lag}'] = y.shift(lag)
for w in rolling_windows:
df[f'roll_mean_{w}'] = y.shift(1).rolling(w).mean()
df[f'roll_std_{w}'] = y.shift(1).rolling(w).std()
df['month'] = y.index.month
df['day_of_week'] = y.index.dayofweek
return df.dropna()
df_feat = make_lag_features(train_series, lags=[1,2,3,7,14], rolling_windows=[7, 28])
X_train, y_train = df_feat.drop('y', axis=1), df_feat['y']
model = LGBMRegressor(n_estimators=500, learning_rate=0.05, num_leaves=31)
model.fit(X_train, y_train)Multi-step forecasting strategies:
- Recursive: predict , append to history, repeat.
- Direct: train a separate model for each horizon .
- MIMO (multi-output): a single model predicts all horizons simultaneously.
LSTM-Based Forecasting
LSTMs process the entire history as a sequence, learning what to remember vs forget via gating:
import torch
import torch.nn as nn
class LSTMForecaster(nn.Module):
def __init__(self, input_size, hidden_size, num_layers, forecast_horizon):
super().__init__()
self.lstm = nn.LSTM(input_size, hidden_size, num_layers, batch_first=True)
self.linear = nn.Linear(hidden_size, forecast_horizon)
def forward(self, x):
# x: (batch, seq_len, input_size)
out, _ = self.lstm(x)
return self.linear(out[:, -1, :]) # last hidden state → forecastN-BEATS (Neural Basis Expansion Analysis)
N-BEATS (Oreshkin et al., 2020) is a pure DL architecture without LSTM or attention, based on residual stacks of fully connected layers with basis function outputs:
- Each block produces a backcast (fitted to history) and forecast (output).
- Residuals from the backcast are passed to the next block.
- Interpretable variant uses Fourier and polynomial basis functions.
N-BEATS consistently outperforms traditional methods on the M3/M4 benchmarks.
Temporal Fusion Transformer (TFT)
TFT (Lim et al., 2021) is a multi-horizon attention-based architecture:
- Handles static covariates, known future covariates, and observed past covariates.
- Uses a Variable Selection Network to identify informative inputs.
- Interpretable: attention weights show which past time steps are most relevant per forecast step.
Inductive Bias
- Lag-feature + GBM: assumes the relationship between past observations and future values is well-captured by a finite set of handcrafted lags; can model complex non-linear patterns.
- LSTM: assumes temporal dependencies are captured by a compact hidden state; good at long-range dependencies but requires large data.
- N-BEATS/TFT: architectures designed explicitly for time series; less inductive bias than classical models.
Training Objective
| Approach | Typical loss | Notes |
|---|---|---|
| Point forecasting | MAE, RMSE, MAPE | Choose based on domain (asymmetric errors → MAPE) |
| Quantile forecasting | Pinball/quantile loss | For prediction intervals |
| Probabilistic | NLL (Gaussian, NegBin) | Calibrated distributions |
Strengths
- GBM lag approach is fast, interpretable, and robust; works well for tabular/panel data.
- Deep learning models capture complex interactions and scale to many series.
- Can incorporate exogenous covariates naturally.
Weaknesses
- Lag features require careful engineering; wrong lags → poor performance.
- LSTMs are hard to train for very long sequences; prone to vanishing gradients.
- All ML approaches require careful temporal cross-validation to avoid data leakage (future test points must not appear in training features).
Variants
- Prophet (Meta): additive decomposition model (trend + Fourier seasonality + holidays); see Time Series Implementation for implementation details.
- Temporal Convolutional Network (TCN): dilated causal convolutions for long-range dependencies without recurrence.
- DeepAR (Amazon): autoregressive RNN that outputs full probability distributions.
- PatchTST: treats time series as patches fed to a standard Transformer.
References
- Oreshkin, B. et al. (2020). “N-BEATS: Neural basis expansion analysis for interpretable time series forecasting.” ICLR.
- Lim, B. et al. (2021). “Temporal Fusion Transformers for interpretable multi-horizon time series forecasting.” Int. J. Forecasting.
- Godahewa, R. et al. (2021). “Monash Time Series Forecasting Archive.” NeurIPS.
Links
- Time Series Models — ARIMA and statistical baseline
- Tree Ensembles — gradient boosting backbone
- Recurrent Networks — LSTM architecture
- Feature Engineering — lag and rolling window features
- Backpropagation Through Time — gradient flow in RNNs and LSTMs used for sequence forecasting