CitiBike NYC — Demand, Risk & Net Flow Analysis
A data science study combining CitiBike trip data (2023–2025) with NYPD collision data to produce demand analysis, station-level risk scoring, and a net-flow imbalance predictor. The outputs support insurance pricing, user safety warnings, and operational interventions at the station level.
Published report: https://armoutihansen.xyz/DSC/
Repository: https://github.com/armoutihansen/DSC
Goal
Derive actionable, data-driven signals from CitiBike trip and collision data to help an insurer (AXA) price micro-mobility risk and flag high-risk contexts to users in real time.
Three concrete deliverables:
- Demand characterisation — seasonal, weekly, and hourly patterns; member vs casual split.
- Risk measure — transparent, interpretable risk-per-trip by station, time of day, and their interaction.
- Net-flow prediction — identify tomorrow’s over-supplied (importer) and under-supplied (exporter) stations.
Scope (In / Out)
In:
- CitiBike trip data (2023–2025): ~80 M trips, station coordinates, bike type, membership status, duration, distance
- NYPD Motor Vehicle Collision data filtered to cyclist involvement
- Station-level and time-of-day demand EDA
- Risk score construction: crashes-per-trip, severity weighting, station × time interaction
- Three-class net-flow imbalance classification (importer / balanced / exporter)
- Interactive HTML report published at armoutihansen.xyz/DSC
Out:
- Real-time pipeline / API
- Individual trip-level accident probability model
- Weather-conditioned demand forecasting (exploratory only via meteostat)
- External validation with insurance claim data
Deliverables
| Artefact | Description |
|---|---|
| notebooks/EDA_citibike.ipynb | Demand, usage patterns, net flow exploratory analysis |
| notebooks/clean_citibike.ipynb | Cleaning workflow for raw CitiBike CSVs |
| notebooks/clean_collision_data.ipynb | Collision parsing: cyclist-involvement flag, severity |
| notebooks/risk_analysis.ipynb | Station-level, time-of-day, and interaction risk scores |
| notebooks/net_flow_analysis.ipynb | Baseline → logistic regression → CatBoost net-flow classifier |
| src/download_citibike.py | Helper to bulk-download trip data from S3 |
| src/clean_citibike_csv.py | Batch cleaning + Parquet export |
| index.html | Self-contained report with all figures (Plotly, folium maps) |
Data
| Dataset | Source | Size |
|---|---|---|
| CitiBike trip data | S3 bucket | ~80 M rows, 2023–2025 |
| NYPD Motor Vehicle Collisions | NYC Open Data | ~2 M rows (filtered to cyclist involvement) |
Key preprocessing steps:
- Parquet conversion via PyArrow for efficient columnar reads
- DuckDB for in-process SQL over Parquet files (avoids pandas memory ceiling)
- Cyclist-involvement flag parsed from free-text contributing factor columns
- Station-day aggregation: total trips, net flow (arrivals − departures), crash count, crash rate
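The station-day aggregation in the last step can be illustrated without any dependencies. The sketch below is a library-free stand-in for the DuckDB query, not the project's actual code: the record layouts, the `station_day_table` name, the definition of total trips as arrivals plus departures, and the assumption that each crash has already been matched to a station are all hypothetical.

```python
from collections import defaultdict
from datetime import date

# Hypothetical trip records: (day, start_station, end_station).
trips = [
    (date(2024, 5, 1), "A", "B"),
    (date(2024, 5, 1), "A", "B"),
    (date(2024, 5, 1), "B", "A"),
    (date(2024, 5, 1), "C", "A"),
]
# Hypothetical crash records, already matched to a station: (day, station).
crashes = [(date(2024, 5, 1), "A")]


def station_day_table(trips, crashes):
    """Aggregate trips and crashes to one row per (day, station)."""
    departures = defaultdict(int)
    arrivals = defaultdict(int)
    crash_count = defaultdict(int)
    for day, start, end in trips:
        departures[(day, start)] += 1
        arrivals[(day, end)] += 1
    for day, station in crashes:
        crash_count[(day, station)] += 1

    rows = {}
    for key in set(departures) | set(arrivals):
        # Assumption: total trips counts both arrivals and departures.
        total = departures[key] + arrivals[key]
        rows[key] = {
            "total_trips": total,
            "net_flow": arrivals[key] - departures[key],  # arrivals − departures
            "crash_count": crash_count[key],
            "crash_rate": crash_count[key] / total,
        }
    return rows


table = station_day_table(trips, crashes)
```

Because every trip adds one departure and one arrival, net flow sums to zero across stations on any given day, which is a cheap sanity check on the aggregated table.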
Modeling
Risk Score
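A transparent risk-per-trip measure, computed as the number of cyclist-involved crashes divided by the number of trips (crashes-per-trip).
The score is computed at three granularities: station level, time of day, and the station × time-of-day interaction. Crashes are severity-weighted (fatal > injury > property damage).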
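A minimal sketch of a severity-weighted crashes-per-trip score follows. The weight values, the `risk_per_trip` name, the `min_trips` guard, and the time-band labels are assumptions for illustration; the source only fixes the ordering fatal > injury > property damage.

```python
# Hypothetical weights: only the ordering fatal > injury > property damage
# comes from the analysis; the numeric values are placeholders.
SEVERITY_WEIGHTS = {"fatal": 10.0, "injury": 3.0, "property_damage": 1.0}


def risk_per_trip(crashes_by_severity, trips, min_trips=1):
    """Severity-weighted crashes divided by trips for one exposure cell.

    Returns None when the cell has fewer than `min_trips` trips, since a
    rate over almost no exposure is not meaningful.
    """
    if trips < min_trips:
        return None
    weighted = sum(
        SEVERITY_WEIGHTS[severity] * count
        for severity, count in crashes_by_severity.items()
    )
    return weighted / trips


# The same function serves every granularity; only the grouping key changes.
# Here the cells are keyed by (station, time band), i.e. the interaction level.
cells = {
    ("A", "morning"): ({"injury": 2, "property_damage": 4}, 1000),
    ("A", "night"): ({"fatal": 1}, 200),
    ("B", "morning"): ({}, 0),
}
scores = {
    key: risk_per_trip(crash_counts, trips)
    for key, (crash_counts, trips) in cells.items()
}
```

Keeping the score as a plain weighted ratio is what makes it easy to explain to a non-technical audience: each cell's value can be traced back to a handful of crash counts and a trip count.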
Net Flow Imbalance Classifier
Task: predict tomorrow’s net-flow class for each station — importer (arrivals > departures, class +1), balanced (class 0), exporter (departures > arrivals, class −1).
Class distribution: ~80% balanced, ~10% each importer/exporter — severe imbalance.
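The labelling step can be sketched as follows. The threshold that separates "balanced" from the other two classes is not stated here, so `threshold=5` is a placeholder; with `threshold=0` the function reduces to the literal definition above (any surplus of arrivals is an importer). The function names are hypothetical.

```python
from collections import Counter


def net_flow_class(net_flow, threshold=5):
    """Map a daily net flow to +1 (importer), 0 (balanced) or -1 (exporter).

    `threshold` is a hypothetical dead band around zero.
    """
    if net_flow > threshold:
        return 1
    if net_flow < -threshold:
        return -1
    return 0


def next_day_targets(daily_net_flow, threshold=5):
    """Pair each day's net flow with the class observed on the following day."""
    labels = [net_flow_class(nf, threshold) for nf in daily_net_flow]
    return list(zip(daily_net_flow[:-1], labels[1:]))


# One station's net flow over five consecutive days (made-up numbers).
flows = [2, 8, -9, 0, 3]
pairs = next_day_targets(flows)
class_counts = Counter(label for _, label in pairs)
```

Counting labels with `class_counts` is a quick way to check how a chosen threshold shifts the class balance before any model is trained.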
| Model | Macro-F1 | Exporter Recall |
|---|---|---|
| Trivial baseline (predict majority) | 0.17 | 0.00 |
| Persistence baseline (yesterday = today) | 0.28 | 0.19 |
| Logistic Regression | 0.40 | 0.50 |
| CatBoost | 0.51 | 0.60 |
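For reference, the two metrics in the table can be computed by hand. The sketch below is a dependency-free illustration on a made-up label sequence, not the evaluation code behind the reported numbers; it also shows how the persistence baseline is formed by shifting the label series by one day.

```python
def _counts(y_true, y_pred, label):
    """True positives, false positives and false negatives for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    return tp, fp, fn


def recall(y_true, y_pred, label):
    tp, _, fn = _counts(y_true, y_pred, label)
    return tp / (tp + fn) if tp + fn else 0.0


def f1(y_true, y_pred, label):
    tp, fp, fn = _counts(y_true, y_pred, label)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred, labels=(-1, 0, 1)):
    """Unweighted mean of per-class F1, so minority classes count equally."""
    return sum(f1(y_true, y_pred, label) for label in labels) / len(labels)


# Made-up daily classes for one station: -1 exporter, 0 balanced, +1 importer.
classes = [0, 1, 0, -1, -1, 0]
y_true = classes[1:]

majority_pred = [0] * len(y_true)   # trivial baseline: always "balanced"
persistence_pred = classes[:-1]     # persistence: repeat the previous day

majority_exporter_recall = recall(y_true, majority_pred, -1)
persistence_exporter_recall = recall(y_true, persistence_pred, -1)
persistence_macro_f1 = macro_f1(y_true, persistence_pred)
```

Macro-F1 averages the per-class scores without weighting by class frequency, which is why a model that ignores the minority classes scores poorly even when its accuracy is high.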
Features: lag-1 net flow, rolling mean (7-day), day of week, month, station capacity, historical mean net flow per station.
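One way to build these features for a single station without leaking tomorrow's value is sketched below. The function name, the row layout, the use of an expanding mean for the historical mean, and taking day of week and month from the target day are assumptions; the input is assumed to be one row per consecutive day, sorted by date.

```python
from datetime import date, timedelta


def build_features(series, capacity, window=7):
    """Build one feature row per predictable day for a single station.

    `series` is a list of (day, net_flow) tuples on consecutive days.
    Every feature uses data up to and including "today"; the target is
    tomorrow's net flow.
    """
    rows = []
    running_sum = 0.0
    for i, (day, flow) in enumerate(series):
        running_sum += flow
        if i + 1 < window or i + 1 >= len(series):
            continue  # need a full rolling window and a following day
        target_day, target_flow = series[i + 1]
        recent = [f for _, f in series[i + 1 - window : i + 1]]
        rows.append({
            "target_day": target_day,
            "lag1_net_flow": flow,                       # today's net flow
            "rolling_mean": sum(recent) / window,        # last `window` days
            "day_of_week": target_day.weekday(),         # Monday = 0
            "month": target_day.month,
            "capacity": capacity,
            "hist_mean_net_flow": running_sum / (i + 1),  # expanding mean
            "target_net_flow": target_flow,
        })
    return rows


# Nine consecutive days starting Monday 2024-01-01 with made-up net flows.
start = date(2024, 1, 1)
series = [(start + timedelta(days=k), k + 1) for k in range(9)]
rows = build_features(series, capacity=30)
```

Using an expanding mean rather than a mean over the full history keeps each row limited to information that would have been available on the day of prediction.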
CatBoost was selected for its native handling of categorical features (station ID) and because it outperformed logistic regression on both macro-F1 (0.51 vs 0.40) and exporter recall (0.60 vs 0.50).
Engineering
| Concern | Choice |
|---|---|
| Language | Python 3.11 |
| Data query | DuckDB (in-process SQL over Parquet) |
| Data frames | pandas 2 + PyArrow backend |
| Modelling | scikit-learn, CatBoost 1.2 |
| Visualisation | matplotlib, seaborn, Plotly, folium |
| Weather features | meteostat |
| Env management | conda (environment.yml) |
| Report | Static HTML (Plotly + folium inline) |
Raw data is not included in the repository; it must be downloaded via the provided helper scripts (~100 GB uncompressed).
Timeline
Completed (AXA data science challenge project).