CitiBike NYC — Demand, Risk & Net Flow Analysis

A data science study combining CitiBike trip data (2023–2025) with NYPD collision data to produce demand analysis, station-level risk scoring, and a net-flow imbalance predictor. The outputs support insurance pricing, user safety warnings, and operational interventions at the station level.

Published report: https://armoutihansen.xyz/DSC/
Repository: https://github.com/armoutihansen/DSC

Goal

Derive actionable, data-driven signals from CitiBike trip and collision data to help an insurer (AXA) price micro-mobility risk and flag high-risk contexts to users in real time.

Three concrete deliverables:

  1. Demand characterisation — seasonal, weekly, and hourly patterns; member vs casual split.
  2. Risk measure — transparent, interpretable risk-per-trip by station, time of day, and their interaction.
  3. Net-flow prediction — identify tomorrow’s over-supplied (importer) and under-supplied (exporter) stations.

Scope (In / Out)

In:

  • CitiBike trip data (2023–2025): ~80 M trips, station coordinates, bike type, membership status, duration, distance
  • NYPD Motor Vehicle Collision data filtered to cyclist involvement
  • Station-level and time-of-day demand EDA
  • Risk score construction: crashes-per-trip, severity weighting, station × time interaction
  • Three-class net-flow imbalance classification (importer / balanced / exporter)
  • Interactive HTML report published at armoutihansen.xyz/DSC

Out:

  • Real-time pipeline / API
  • Individual trip-level accident probability model
  • Weather-conditioned demand forecasting (exploratory only via meteostat)
  • External validation with insurance claim data

Deliverables

  Artefact                              Description
  notebooks/EDA_citibike.ipynb          Demand, usage patterns, net-flow exploratory analysis
  notebooks/clean_citibike.ipynb        Cleaning workflow for raw CitiBike CSVs
  notebooks/clean_collision_data.ipynb  Collision parsing — cyclist-involvement flag, severity
  notebooks/risk_analysis.ipynb         Station-level, time-of-day, and interaction risk scores
  notebooks/net_flow_analysis.ipynb     Baseline → logistic regression → CatBoost net-flow classifier
  src/download_citibike.py              Helper to bulk-download trip data from S3
  src/clean_citibike_csv.py             Batch cleaning + Parquet export
  index.html                            Self-contained report with all figures (Plotly, folium maps)

Data

  Dataset                        Source         Size
  CitiBike trip data             S3 bucket      ~80 M rows, 2023–2025
  NYPD Motor Vehicle Collisions  NYC Open Data  ~2 M rows (filtered to cyclist involvement)

Key preprocessing steps:

  • Parquet conversion via PyArrow for efficient columnar reads
  • DuckDB for in-process SQL over Parquet files (avoids pandas memory ceiling)
  • Cyclist-involvement flag parsed from free-text contributing factor columns
  • Station-day aggregation: total trips, net flow (arrivals − departures), crash count, crash rate

Modelling

Risk Score

A transparent risk-per-trip measure, computed as:

  risk = Σ (severity weight × crash count) / total trips

Three granularities are computed: station-level, time-of-day, and station × time-of-day interaction. Severity weighting is applied (fatal > injury > property damage).
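A minimal pandas sketch of the station-level variant; the weight values here are illustrative assumptions, not the ones used in risk_analysis.ipynb:

```python
import pandas as pd

# Illustrative severity weights (fatal > injury > property damage).
SEVERITY_WEIGHTS = {"fatal": 10.0, "injury": 3.0, "property": 1.0}

def risk_per_trip(crashes: pd.DataFrame, trips: pd.DataFrame) -> pd.Series:
    """Severity-weighted crashes per trip, per station."""
    weighted = (
        crashes.assign(w=crashes["severity"].map(SEVERITY_WEIGHTS))
        .groupby("station_id")["w"].sum()
    )
    trip_counts = trips.groupby("station_id").size()
    # Stations with trips but no crashes get risk 0.
    return (weighted / trip_counts).fillna(0.0)

crashes = pd.DataFrame({"station_id": ["A", "A", "B"],
                        "severity": ["injury", "property", "fatal"]})
trips = pd.DataFrame({"station_id": ["A"] * 8 + ["B"] * 2})
score = risk_per_trip(crashes, trips)
# A: (3 + 1) / 8 = 0.5   B: 10 / 2 = 5.0
```

The time-of-day and interaction variants follow by swapping the groupby key for an hour bucket or a (station_id, hour) pair.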

Net Flow Imbalance Classifier

Task: predict tomorrow’s net-flow class for each station — importer (arrivals > departures, class +1), balanced (class 0), exporter (departures > arrivals, class −1).

Class distribution: ~80% balanced, ~10% each importer/exporter — severe imbalance.
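The three-way labelling can be sketched as a simple thresholding of the station-day net flow. The deadband width below is an illustrative assumption; the cutoff actually used is a modelling choice made in net_flow_analysis.ipynb:

```python
def net_flow_class(net_flow: int, deadband: int = 5) -> int:
    """Map a station-day net flow (arrivals - departures) to a class.

    +1 importer, 0 balanced, -1 exporter. The +/-5-trip deadband is
    illustrative only.
    """
    if net_flow > deadband:
        return 1
    if net_flow < -deadband:
        return -1
    return 0
```

With any reasonable deadband, most station-days fall in the balanced band, which is what produces the ~80/10/10 class skew noted above.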

  Model                                     Macro-F1  Exporter Recall
  Trivial baseline (predict majority)       0.17      0.00
  Persistence baseline (yesterday = today)  0.28      0.19
  Logistic Regression                       0.40      0.50
  CatBoost                                  0.51      0.60

Features: lag-1 net flow, rolling mean (7-day), day of week, month, station capacity, historical mean net flow per station.
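The lag and rolling features can be built with a grouped pandas pipeline. A sketch under assumed column names (station_id, day, net_flow); note the shift before the rolling mean so that no same-day information leaks into the features:

```python
import pandas as pd

def add_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add lag/rolling/calendar features to a station-day frame."""
    df = df.sort_values(["station_id", "day"]).copy()
    g = df.groupby("station_id")["net_flow"]
    df["net_flow_lag1"] = g.shift(1)              # yesterday's net flow
    df["net_flow_roll7"] = g.transform(
        lambda s: s.shift(1).rolling(7, min_periods=1).mean()
    )                                             # trailing 7-day mean
    day = pd.to_datetime(df["day"])
    df["day_of_week"] = day.dt.dayofweek
    df["month"] = day.dt.month
    # Historical mean per station; in a real fit this would be computed on
    # the training window only, to avoid leakage into the test period.
    df["station_mean"] = g.transform("mean")
    return df
```

Station capacity, the remaining feature listed above, would come from a static station-metadata join rather than from the trip frame.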

CatBoost was selected for its native handling of categorical features (station ID) and superior recall on the minority classes.

Engineering

  Concern           Choice
  Language          Python 3.11
  Data query        DuckDB (in-process SQL over Parquet)
  Data frames       pandas 2 + PyArrow backend
  Modelling         scikit-learn, CatBoost 1.2
  Visualisation     matplotlib, seaborn, Plotly, folium
  Weather features  meteostat
  Env management    conda (environment.yml)
  Report            Static HTML (Plotly + folium inline)
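The table above implies an environment.yml roughly like the following sketch. The package list is taken from the table; the file name matches the repo's, but the channel choice and unpinned versions are assumptions:

```yaml
name: dsc-citibike
channels:
  - conda-forge
dependencies:
  - python=3.11
  - pandas>=2
  - pyarrow
  - duckdb
  - scikit-learn
  - catboost=1.2
  - matplotlib
  - seaborn
  - plotly
  - folium
  - pip
  - pip:
      - meteostat
```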

Raw data is not included in the repo; it must be downloaded via the provided helper scripts (~100 GB uncompressed).

Timeline

Completed (AXA data science challenge project).