Lessons Learned — CitiBike NYC Risk & Net Flow Analysis

Working with Large Tabular Data (>80 M rows)

DuckDB over pandas for initial aggregation: loading 80 M trip rows into a DataFrame is unnecessary and slow. DuckDB reads Parquet files column-selectively with SQL, keeping memory usage low. Convert to pandas only after aggregation to the required granularity (e.g., station-day).
Parquet as the canonical intermediate format: PyArrow’s Parquet files compress ~5× vs CSV and support column pruning. All cleaned data should be written to Parquet immediately after preprocessing.
Schema enforcement early: define column types and expected ranges before any joins to catch silent casting errors (e.g., station IDs stored as float in one dataset, string in another).

The net-flow prediction target had ~80% balanced / ~10% importer / ~10% exporter — classic long-tail multi-class imbalance.

Key lessons:

Macro-F1, not accuracy: accuracy is meaningless when 80% of labels are one class. Macro-F1 treats all classes equally and directly measures minority-class performance.
Recall is the right optimisation target for operational use: missing an exporter day (station empties) is more costly than a false alarm. Favour recall on minority classes.
CatBoost outperformed logistic regression on recall (60% vs 50% for exporters) because it learns non-linear interaction effects (e.g., station × day-of-week) without manual feature crossing.
Persistence baseline is a strong floor: yesterday’s class is often correct for a stationary series; any ML model must beat it meaningfully to justify deployment.

Building a risk-per-trip measure at fine granularity (station × time-of-day) introduced sparsity:

Many cells have zero crashes over the study period — not because they are safe, but because they are rarely visited.
Exposure-normalised rates: always express risk as crashes per trip (or per 1,000 trips), not raw crash counts. High-traffic stations appear dangerous in absolute terms but may be safe relative to volume.
Minimum exposure threshold: cells with fewer than a configurable minimum number of trips should be reported as “insufficient data” rather than zero risk, to avoid false confidence.
Smoothing/regularisation: Bayesian shrinkage toward the overall mean risk is appropriate for low-count cells (analogous to MAP estimation). Not implemented here but flagged as future work.

The net-flow classifier depended entirely on engineered features from the raw time series:

Lag features (lag-1 net flow) were the single strongest predictor — station-days are serially correlated.
Rolling statistics (7-day mean net flow) captured weekly seasonality.
Calendar features (day of week, month) captured systematic patterns without stationarity assumptions.
Station identity as a categorical feature (natively handled by CatBoost) captured station-level fixed effects that would otherwise require one-hot encoding.
Pitfall: all lag features must be computed before the train/validation/test split to avoid leaking future information. Split first on time axis, then compute lags within each split’s lookback window — or compute globally with strict temporal ordering.

CitiBike stations are identified by name (not always a stable ID across years). Station name deduplication and coordinate standardisation via median lat/lon is more robust than a direct string join.
Collision data contains GPS coordinates but not station IDs — a spatial join (nearest station within a configurable radius) is required. Using a KD-tree (scipy.spatial.cKDTree) is ~100× faster than brute-force pairwise distance for this join.
Assignment radius matters: a large radius inflates risk at hub stations; a small radius misses collisions on nearby roads. Sensitivity analysis across radii should be reported.

Folium choropleth maps (station-level risk, net flow imbalance by station) communicate spatial patterns far better than tables to stakeholders.
Plotly interactive charts allow stakeholders to drill down into time-of-day patterns without needing to run notebooks.
The static HTML report (index.html with embedded Plotly/folium) is portable and shareable without any server infrastructure.

Tree Ensembles — CatBoost, gradient boosting for tabular classification
Feature Engineering — lag features, rolling statistics, calendar features
Exploratory Data Analysis — exposure-normalised rates, sparse count data patterns
Evaluation and Validation — Macro-F1, class imbalance metrics