Linear Models and GLMs — Implementation
Goal
Implement linear and generalised linear models end-to-end: data prep, fitting, evaluation, and serialisation.
Conceptual Counterpart
- Linear and Generalised Linear Models — bias-variance, regularisation paths, GLM families
- Convex Optimization — normal equations, gradient descent convergence
- Credit Scoring — canonical application of logistic regression and GLMs
Purpose
Practical implementation of linear and generalised linear models with scikit-learn and statsmodels. Synthesised from Linear Models & GLMs.
Examples
- OLS, Ridge, Lasso, ElasticNet regression
- Logistic regression (binary and multi-class)
- Tweedie/Gamma GLM for insurance-style count/severity targets
Architecture
Raw features → ColumnTransformer (scale / encode)
→ Linear estimator (sklearn or statsmodels)
→ Threshold / inverse-link → predictions
Implementation
Setup
pip install scikit-learn statsmodelsOLS and Regularised Regression
import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import cross_val_score
X, y = fetch_california_housing(return_X_y=True)
# Ridge (L2) — automatically handles multicollinearity
ridge = Pipeline([("scaler", StandardScaler()), ("model", Ridge(alpha=1.0))])
cv_r2 = cross_val_score(ridge, X, y, cv=5, scoring="r2")
print(f"Ridge CV R²: {cv_r2.mean():.3f} ± {cv_r2.std():.3f}")
# Lasso (L1) — sparse solutions
lasso = Pipeline([("scaler", StandardScaler()), ("model", Lasso(alpha=0.01))])
# ElasticNet — combined
enet = Pipeline([
("scaler", StandardScaler()),
("model", ElasticNet(alpha=0.01, l1_ratio=0.5))
])Selecting alpha with cross-validation:
from sklearn.linear_model import RidgeCV, LassoCV
lasso_cv = Pipeline([
("scaler", StandardScaler()),
("model", LassoCV(cv=5, alphas=np.logspace(-4, 2, 50)))
])
lasso_cv.fit(X, y)
print("Best alpha:", lasso_cv["model"].alpha_)
print("Non-zero coefficients:", np.sum(lasso_cv["model"].coef_ != 0))Logistic Regression
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import classification_report, roc_auc_score
X, y = load_breast_cancer(return_X_y=True)
clf = Pipeline([
("scaler", StandardScaler()),
("model", LogisticRegression(
C=1.0, # inverse regularisation strength
penalty="l2", # or "l1", "elasticnet"
solver="lbfgs",
max_iter=200
))
])
clf.fit(X_train, y_train)
proba = clf.predict_proba(X_test)[:, 1]
print(f"AUC: {roc_auc_score(y_test, proba):.3f}")
print(classification_report(y_test, clf.predict(X_test)))Multi-class (multinomial):
LogisticRegression(multi_class="multinomial", solver="lbfgs")GLMs with statsmodels (Tweedie / Gamma)
import statsmodels.api as sm
# Gamma GLM for continuous positive target (e.g., claim severity)
X_sm = sm.add_constant(X_train)
glm = sm.GLM(y_train, X_sm, family=sm.families.Gamma(sm.families.links.Log()))
res = glm.fit()
print(res.summary())
y_pred = res.predict(sm.add_constant(X_test))
# Tweedie (p=1.5 → compound Poisson-Gamma for pure premium)
glm_tw = sm.GLM(
y_train, X_sm,
family=sm.families.Tweedie(var_power=1.5, link=sm.families.links.Log())
)
res_tw = glm_tw.fit()Coefficient Inspection
import pandas as pd
coef_df = pd.DataFrame({
"feature": feature_names,
"coefficient": clf["model"].coef_[0],
"abs": np.abs(clf["model"].coef_[0])
}).sort_values("abs", ascending=False)
print(coef_df.head(10))Trade-offs
- Ridge never produces exactly zero coefficients; Lasso does — prefer Lasso when feature selection matters.
- Logistic regression is fast, interpretable, well-calibrated with Platt scaling; use as a strong baseline.
- Statsmodels GLM provides inference (p-values, CIs) that sklearn does not — important for regulated use.