Feature Engineering
Definition
The process of constructing new input features from raw data to improve model performance by making relevant patterns more explicit and accessible to the learning algorithm.
Intuition
Features are the language in which your model thinks. A poor feature set forces the model to “discover” structure that could have been provided directly; good features encode domain knowledge and reduce the hypothesis space. Tree-based models can learn non-linear interactions, but encoding them as features speeds up learning and reduces the need for large trees.
Formal Description
Numerical Feature Transformations
Polynomial and interaction features:
from sklearn.preprocessing import PolynomialFeatures
# Adds x1², x2², x1*x2 (degree=2, no bias term)
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)
Use with regularised linear models (ridge, lasso) to capture non-linear effects.
Binning / discretisation:
from sklearn.preprocessing import KBinsDiscretizer
# Equal-frequency bins (quantile-based)
kbd = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='quantile')
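X_binned = kbd.fit_transform(X[['age']])
Useful when the relationship between the feature and the target is highly non-linear. Ordinal encoding keeps the bins in order, so the binned feature stays monotonic in the original value for tree models.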
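To see what quantile binning produces, the following sketch fits the discretiser on a small, evenly spread array and inspects the learned edges. The `ages` array and the `_demo` names are hypothetical, chosen only so the bins are easy to check by eye.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Hypothetical ages, evenly spread so the quantile edges are easy to read.
ages = np.arange(20, 70).reshape(-1, 1)  # 50 values: 20..69

kbd_demo = KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="quantile")
age_bins = kbd_demo.fit_transform(ages).ravel()

# Six edges delimit the five bins learned from the data.
edges = kbd_demo.bin_edges_[0]

# Number of observations that fell into each bin.
counts = np.bincount(age_bins.astype(int))
```

With `strategy='quantile'` each bin receives roughly the same number of observations, whereas `strategy='uniform'` would give bins of equal width instead.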
Ratio and difference features:
# E.g., for insurance: premium per unit of exposure
df['premium_per_vehicle'] = df['total_premium'] / df['n_vehicles']
df['age_diff'] = df['policy_end_year'] - df['policy_start_year']
Log, sqrt, and power transforms: applied to right-skewed count/amount features (see data_preprocessing).
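A minimal sketch of a log transform on a right-skewed amount, assuming a synthetic lognormal sample stands in for real claim amounts. The name `claims` and the distribution parameters are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed claim amounts (lognormal draw, illustrative only).
rng = np.random.default_rng(0)
claims = pd.Series(
    rng.lognormal(mean=7.0, sigma=1.2, size=5000), name="claim_amount"
)

# log1p(x) = log(1 + x) is defined at zero, which matters for counts and amounts.
claims_log = np.log1p(claims)

# Sample skewness before and after the transform.
skew_before = claims.skew()
skew_after = claims_log.skew()

# expm1 inverts log1p, so predictions can be mapped back to the original scale.
claims_back = np.expm1(claims_log)
```

Comparing `skew_before` with `skew_after` gives a quick check on whether the transform actually made the distribution more symmetric for the feature at hand.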
Temporal Features
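import pandas as pd
df['day_of_week'] = df['date'].dt.dayofweek  # Monday=0, Sunday=6
df['month'] = df['date'].dt.month
df['is_weekend'] = df['day_of_week'].isin([5, 6]).astype(int)
# Uses the current date; substitute a fixed reference date for reproducible features
df['days_since'] = (pd.Timestamp.today() - df['date']).dt.days
Cyclical encoding (a sine/cosine pair preserves periodicity, so December and January end up adjacent):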
import numpy as np
df['month_sin'] = np.sin(2 * np.pi * df['month'] / 12)
df['month_cos'] = np.cos(2 * np.pi * df['month'] / 12)
Categorical Feature Engineering
Frequency encoding:
freq = X_train[col].value_counts(normalize=True)
X_train[col + '_freq'] = X_train[col].map(freq)
Target encoding (with cross-validation to avoid leakage):
from sklearn.model_selection import KFold
import numpy as np
X_train['target_enc'] = np.nan
kf = KFold(n_splits=5, shuffle=True, random_state=42)
for tr_idx, val_idx in kf.split(X_train):
    means = y_train.iloc[tr_idx].groupby(X_train.iloc[tr_idx][col]).mean()
    X_train.iloc[val_idx, X_train.columns.get_loc('target_enc')] = \
        X_train.iloc[val_idx][col].map(means)
Groupby aggregates:
# Statistics of claim amounts per policyholder
agg = df.groupby('policy_id')['claim_amount'].agg(['mean', 'max', 'count'])
agg = agg.add_prefix('claim_').reset_index()  # claim_mean, claim_max, claim_count
df = df.merge(agg, on='policy_id', how='left')
Text Features
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(max_features=10000, ngram_range=(1,2), sublinear_tf=True)
X_text = tfidf.fit_transform(df['description'])
For deep learning: use pre-trained embeddings (BERT, sentence-transformers) to encode semantic content.
Feature Selection
Filter methods:
from sklearn.feature_selection import mutual_info_classif, SelectKBest
selector = SelectKBest(mutual_info_classif, k=20)
X_selected = selector.fit_transform(X_train, y_train)
Embedded methods: lasso (L1 regularisation) shrinks the coefficients of unimportant features to exactly zero; random forests expose feature importances via mean decrease in impurity.
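One way to use lasso as an embedded selector is through scikit-learn's `SelectFromModel`, sketched below on synthetic data where only two of ten features carry signal. The data, the `_demo` names, and `alpha=0.1` are assumptions for illustration; in practice the penalty strength would be tuned.

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic data (illustrative only): only the first 2 of 10 features matter.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 10))
y_demo = (
    3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=300)
)

# L1 penalties are scale-sensitive, so standardise the inputs first.
X_scaled = StandardScaler().fit_transform(X_demo)

# Fit the lasso, then keep only features with a non-zero coefficient.
lasso = Lasso(alpha=0.1).fit(X_scaled, y_demo)
selector_demo = SelectFromModel(lasso, prefit=True)

kept = np.flatnonzero(selector_demo.get_support())  # indices of retained features
X_reduced = selector_demo.transform(X_scaled)
```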
Variance threshold: remove near-constant features:
from sklearn.feature_selection import VarianceThreshold
sel = VarianceThreshold(threshold=0.01)
X_filtered = sel.fit_transform(X)
Feature Leakage
A feature is leaky if it encodes information that is not available at prediction time:
- Target leakage: a feature derived from the target (e.g., a “claim indicator” in a model predicting claim amount, which only exists after the claim is filed).
- Temporal leakage: using future data to predict past events (e.g., including a field updated at claim settlement when the prediction is at policy inception).
Detection: implausibly high feature importance or correlation with the target; performance that collapses on live data.
Prevention: define a precise cutoff time and ensure all features are derived only from information available at that point.
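The cutoff rule can be sketched as a filter applied before any aggregation. The `events` table, its column names, and the cutoff date below are hypothetical, chosen only to show the order of operations.

```python
import pandas as pd

# Hypothetical event log: one row per claim event for each policy.
events = pd.DataFrame({
    "policy_id": [1, 1, 1, 2, 2],
    "event_date": pd.to_datetime(
        ["2023-01-10", "2023-06-01", "2024-02-15", "2023-03-05", "2024-01-20"]
    ),
    "claim_amount": [500.0, 1200.0, 300.0, 800.0, 950.0],
})

# Prediction time: anything on or after this date counts as the future.
cutoff = pd.Timestamp("2024-01-01")

# Filter BEFORE aggregating so later events cannot leak into the features.
past = events[events["event_date"] < cutoff]
features = (
    past.groupby("policy_id")["claim_amount"]
    .agg(claim_mean="mean", claim_count="count")
    .reset_index()
)
```

Aggregating the full `events` table and filtering afterwards would silently include the 2024 claims in each policy's statistics, which is exactly the temporal leakage described above.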
Applications
- GLMs for insurance pricing: log(exposure), vehicle age × driver age interaction
- Gradient boosting on tabular data: ratio features, target encoding, lag features
- Fraud detection: velocity features (n_transactions in last 24h)
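As an example of the velocity features mentioned for fraud detection, the sketch below counts each card's transactions in a trailing 24-hour window using a time-based rolling window in pandas. The `tx` table and its column names are hypothetical.

```python
import pandas as pd

# Hypothetical transaction log for two cards, sorted by card and time.
tx = pd.DataFrame({
    "card_id": ["A", "A", "A", "A", "B"],
    "timestamp": pd.to_datetime([
        "2024-03-01 09:00", "2024-03-01 15:00", "2024-03-02 08:00",
        "2024-03-04 10:00", "2024-03-05 12:00",
    ]),
    "amount": [20.0, 35.0, 15.0, 400.0, 60.0],
}).sort_values(["card_id", "timestamp"]).reset_index(drop=True)

# Count transactions per card in the trailing 24h window,
# including the current transaction itself.
counts = (
    tx.set_index("timestamp")
    .groupby("card_id")["amount"]
    .rolling("24h")
    .count()
)

# Rows come back grouped by card in time order, matching the sorted frame.
tx["n_tx_24h"] = counts.to_numpy().astype(int)
```

Because the window only looks backwards from each transaction, the feature uses nothing that would be unavailable at scoring time. Assigning by position relies on `tx` already being sorted by card and timestamp, as it is here.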
Trade-offs
- More features increase model complexity and can cause overfitting; apply feature selection or regularisation.
- Leakage is the most dangerous failure mode; always validate features against the timeline.
- Automated feature engineering tools (featuretools, tsfresh) can generate thousands of features; filtering is essential.