Predictive Maintenance
Problem
Predict when a vehicle, machine, or piece of equipment is likely to fail so that maintenance can be scheduled before the failure occurs, avoiding unplanned downtime, safety incidents, and emergency repair costs. Predictive maintenance (PdM) sits between reactive maintenance (fix after failure) and preventive maintenance (service on a fixed schedule regardless of condition), using sensor data and ML to identify degradation signatures before failure.
Users / Stakeholders
| Role | Decision |
|---|---|
| Fleet manager | Schedule maintenance; vehicle availability planning |
| Maintenance engineer | Plan parts and labour for upcoming interventions |
| Operations director | Total cost of ownership; fleet reliability KPIs |
| Safety officer | Prevent safety-critical failures |
| Customer (vehicle owner) | Warranty claims; service scheduling |
Domain Context
- Failure rarity: Equipment failures are rare events. The data is highly imbalanced, and rare failure modes often have very few historical examples.
- Sensor data streams: OBD-II (automotive), CAN bus, IoT sensors provide continuous telemetry. Volume is high; relevant signal extraction is critical.
- Right-censoring: Vehicles still in service have not yet failed, so their time to failure is known only to exceed the time observed so far. Survival analysis methods account for this censoring; discarding these vehicles or treating them as failures biases time-to-failure estimates.
- Maintenance records: Planned maintenance changes the degradation trajectory. Models must condition on maintenance history to avoid confounding.
- Physical domain knowledge: Vibration frequency analysis (FFT) for bearing wear, temperature trends for electrical components, oil viscosity for engine health — physical models inform feature engineering.
- Edge deployment: Vehicles are mobile. Inference may need to run on-vehicle (embedded system, ECU) or at fleet depot.
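The right-censoring point above can be made concrete with a small sketch. It assumes a hypothetical list of `(duration_days, failed)` pairs, where `failed=False` means the vehicle was still in service when last observed. The Kaplan-Meier estimator used here is a standard non-parametric survival estimate written out by hand for illustration; it is not a method named in this note, and the fleet data is invented.

```python
from collections import Counter


def kaplan_meier(observations):
    """Return [(t, S(t))] at each distinct failure time.

    observations: iterable of (duration, failed) pairs. failed=False marks a
    right-censored unit (still in service when last observed).
    """
    observations = list(observations)
    failures = Counter(t for t, failed in observations if failed)
    exits = Counter(t for t, _ in observations)  # failures + censorings per time
    at_risk = len(observations)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = failures.get(t, 0)
        if d:
            # Censored units still count in the risk set up to their exit time.
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        at_risk -= exits[t]
    return curve


# Hypothetical fleet: durations in days, True = failed, False = censored.
fleet = [(100, True), (150, False), (200, True), (200, False), (300, True), (400, False)]
km_curve = kaplan_meier(fleet)
```

In this toy fleet the estimated probability of surviving past day 200 is 0.625. Dropping the three censored vehicles would leave only one of three units surviving past day 200 (about 0.33), which understates reliability.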
Inputs and Outputs
Telemetry features:
Engine: oil_temperature, coolant_temperature, RPM, torque, fuel_consumption
Transmission: gear_shifts_per_km, clutch_slip_events, gear_oil_temperature
Brakes: brake_pad_wear_sensor, brake_temperature, ABS_events_per_km
Electrical: battery_voltage, alternator_current, fault_codes_count
Usage: mileage, hours_operated, load_factor, terrain_type
Maintenance: days_since_last_service, km_since_oil_change
Output:
failure_probability_30d: P(failure within 30 days) ∈ [0, 1]
rul_estimate: Remaining Useful Life in days or km
failure_mode: Component most likely to fail
maintenance_urgency: SCHEDULE_NEXT_WINDOW / SCHEDULE_WITHIN_7D / URGENT / CRITICAL
recommended_action: Oil change, brake pad replacement, battery replacement, etc.
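One way to tie the output fields together is sketched below. The field names follow the list above, but the probability cut-offs in `URGENCY_THRESHOLDS` and the unit suffix in `rul_estimate_days` are assumptions made for illustration; the note does not specify how `maintenance_urgency` is derived.

```python
from dataclasses import dataclass

# Hypothetical cut-offs on P(failure within 30 days); not specified in the note.
URGENCY_THRESHOLDS = [
    (0.80, "CRITICAL"),
    (0.50, "URGENT"),
    (0.20, "SCHEDULE_WITHIN_7D"),
]


def urgency_from_probability(p):
    """Map a 30-day failure probability to a maintenance_urgency label."""
    if not 0.0 <= p <= 1.0:
        raise ValueError(f"probability out of range: {p}")
    for cutoff, label in URGENCY_THRESHOLDS:
        if p >= cutoff:
            return label
    return "SCHEDULE_NEXT_WINDOW"


@dataclass(frozen=True)
class MaintenancePrediction:
    failure_probability_30d: float
    rul_estimate_days: float  # assumed unit; the note allows days or km
    failure_mode: str
    recommended_action: str

    @property
    def maintenance_urgency(self):
        return urgency_from_probability(self.failure_probability_30d)


example = MaintenancePrediction(0.62, 18.0, "battery", "battery replacement")
```

Deriving the urgency from the probability, rather than storing it separately, keeps the two fields from drifting apart when thresholds are retuned.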
Decision or Workflow Role
Telemetry ingested (real-time OBD-II stream or daily sync)
↓
Feature engineering: rolling statistics, FFT features, deviation from fleet average
↓
Failure probability model: P(failure within 30/60/90 days)
↓
CRITICAL / URGENT → fleet manager alert → immediate scheduling
SCHEDULE_WITHIN_7D → added to maintenance planner queue
SCHEDULE_NEXT_WINDOW → next scheduled service
↓
Maintenance performed → maintenance record logged → model update
↓
Failure events (if they occur) → confirmed label → retraining
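The feature engineering step in this workflow (rolling statistics and deviation from fleet average) might look roughly like the sketch below. It uses only the standard library; the window length, the feature names, and the sample readings are assumptions, and a production pipeline would more likely use a dataframe library.

```python
from collections import deque
from statistics import mean, pstdev


def rolling_features(readings, window=3):
    """Rolling mean, std, and change over a trailing window.

    Only past and current readings enter each window, so the features are
    safe to use as point-in-time model inputs.
    """
    buf = deque(maxlen=window)
    features = []
    for x in readings:
        buf.append(x)
        if len(buf) == window:
            features.append(
                {"mean": mean(buf), "std": pstdev(buf), "delta": buf[-1] - buf[0]}
            )
    return features


def fleet_deviation(latest_by_vehicle):
    """z-score of each vehicle's latest value relative to the whole fleet."""
    values = list(latest_by_vehicle.values())
    mu, sigma = mean(values), pstdev(values)
    return {
        vid: (v - mu) / sigma if sigma > 0 else 0.0
        for vid, v in latest_by_vehicle.items()
    }


# Hypothetical daily mean oil_temperature readings for one vehicle (deg C).
feats = rolling_features([90, 92, 94, 99], window=3)
# Hypothetical latest readings across three vehicles.
z = fleet_deviation({"A": 90, "B": 90, "C": 96})
```

Here vehicle `C` sits about 1.4 standard deviations above the fleet mean, the kind of signal a per-vehicle threshold alone would not surface.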
Modeling / System Options
| Approach | Strength | Weakness | When to use |
|---|---|---|---|
| Threshold rules on sensor values | Simple; no training; immediate | Fragile; misses gradual degradation | Baseline; safety-critical hard limits |
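| LightGBM on rolling features | Handles tabular sensor features; interpretable via feature importance | No built-in notion of time; each row is a point-in-time snapshot, so temporal context must be encoded in the features | Most maintenance prediction use cases |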
| LSTM Autoencoder anomaly detection | Detects novel degradation patterns | Threshold calibration; limited labels | Limited failure labels; novel failure mode detection |
| Survival analysis (Cox PH, AFT) | Handles censoring; estimates time-to-failure | Cox PH relies on the proportional hazards assumption; AFT requires choosing a parametric time-to-failure distribution | Time-to-failure estimation |
| Physics-informed ML | Incorporates domain knowledge; extrapolates beyond data | Domain expertise required | Safety-critical with physics models available |
Recommended: LightGBM for failure probability classification. Cox PH for remaining useful life estimation. Threshold rules as safety net.
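The "threshold rules as safety net" part of this recommendation can be sketched as a layer that escalates the model's urgency whenever a hard sensor limit is breached. The limits in `HARD_LIMITS` are placeholders rather than real specifications, and the function names are hypothetical.

```python
# Hypothetical hard limits; real values would come from manufacturer specs.
HARD_LIMITS = {
    "coolant_temperature": ("max", 110.0),  # deg C, assumed
    "battery_voltage": ("min", 11.5),       # volts, assumed
}


def violated_limits(telemetry):
    """Return the names of signals outside their hard limits."""
    hits = []
    for name, (kind, limit) in HARD_LIMITS.items():
        value = telemetry.get(name)
        if value is None:
            continue  # missing readings are a sensor-health issue, handled elsewhere
        if (kind == "max" and value > limit) or (kind == "min" and value < limit):
            hits.append(name)
    return hits


def final_urgency(model_urgency, telemetry):
    """Escalate to CRITICAL on any hard-limit breach; never de-escalate."""
    if violated_limits(telemetry):
        return "CRITICAL"
    return model_urgency


overheating = {"coolant_temperature": 118.0, "battery_voltage": 12.6}
healthy = {"coolant_temperature": 92.0, "battery_voltage": 12.6}
```

Because the rules only ever escalate, a miscalibrated model cannot suppress a safety-critical alert, and the rule layer needs no training data.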
Deployment Constraints
- Edge vs cloud: Fleet vehicles may be offline. Critical alerts must be triggered locally. Batch sync when connected.
- Latency: Non-critical predictions can run as a daily batch. Safety-critical conditions (for example, engine overheating) need real-time detection and an immediate alert.
- Sensor calibration: Sensor drift over time causes model performance degradation. Periodic sensor calibration checks needed.
- False alarm cost: Unnecessary maintenance = wasted cost + vehicle downtime. Balance with failure cost.
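The balance between false alarm cost and failure cost can be expressed as an expected-cost rule. The sketch below assumes that an intervention costs `maintenance_cost` and fully prevents the failure, while waiting costs `failure_cost` with probability `p_failure`; under those simplifying assumptions, scheduling pays off when `p_failure > maintenance_cost / failure_cost`. The cost figures are invented.

```python
def alert_threshold(maintenance_cost, failure_cost):
    """Break-even failure probability above which scheduling is cheaper.

    Assumes maintenance fully prevents the failure, so:
      cost(schedule) = maintenance_cost
      cost(wait)     = p_failure * failure_cost
    """
    if maintenance_cost <= 0 or failure_cost <= maintenance_cost:
        raise ValueError("expected 0 < maintenance_cost < failure_cost")
    return maintenance_cost / failure_cost


def should_schedule(p_failure, maintenance_cost, failure_cost):
    return p_failure > alert_threshold(maintenance_cost, failure_cost)


# Hypothetical costs: 500 for a planned intervention, 5000 for an unplanned failure.
threshold = alert_threshold(500, 5000)
```

With these numbers the break-even probability is 0.1, so even a fairly unlikely failure justifies an intervention when the failure is ten times as expensive as the repair.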
Risks and Failure Modes
| Risk | Description | Mitigation |
|---|---|---|
| Rare failure events | Few historical failures → poor model training | Transfer learning; fleet pooling; synthetic augmentation |
| Sensor failure | Sensor malfunctions → missing data → incorrect prediction | Sensor health monitoring; missing value imputation |
| Failure mode shift | New vehicle generation has different failure patterns | Model per vehicle generation; retraining on new fleet |
| Planned maintenance confounding | Maintenance resets degradation clock; model unaware | Include maintenance event features |
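As an illustration of the "include maintenance event features" mitigation, the sketch below derives `days_since_last_service` from a vehicle's service log as of a given prediction date. The function names and dates are hypothetical; the key point is that only services on or before the prediction date are used, so the feature does not leak future information.

```python
from bisect import bisect_right
from datetime import date


def days_since_last_service(service_dates, as_of):
    """Days between as_of and the latest service on or before it.

    service_dates must be sorted ascending. Returns None when the vehicle
    has no service on record before as_of.
    """
    i = bisect_right(service_dates, as_of)
    if i == 0:
        return None
    return (as_of - service_dates[i - 1]).days


def services_in_window(service_dates, as_of, window_days):
    """Count services in the trailing window ending at as_of (inclusive)."""
    return sum(0 <= (as_of - d).days <= window_days for d in service_dates)


# Hypothetical service log for one vehicle.
log = [date(2024, 1, 10), date(2024, 3, 1)]
gap = days_since_last_service(log, date(2024, 2, 20))
```

For a prediction made on 2024-02-20 the March service is ignored and the gap is measured from the January one, giving 41 days.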
Success Metrics
| Metric | Target | Notes |
|---|---|---|
| Unplanned downtime reduction | > 20% | Primary business metric |
| Failure detection recall | > 85% | Fraction of failures predicted before occurrence |
| False alarm rate | < 15% | Share of maintenance triggers that prove unnecessary |
| Maintenance cost reduction | > 10% | Labour + parts savings |
| Vehicle availability | > 95% | Share of time vehicles are available for service |
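The two detection metrics in this table can be computed from per-vehicle outcomes as sketched below. The sketch takes false alarm rate to mean the share of alerts not followed by a failure, which is one common reading (the false positive rate over healthy vehicles is another); the outcome data is invented.

```python
def detection_metrics(alerted, failed):
    """Compute recall and false alarm rate from parallel boolean lists.

    alerted[i]: an alert was raised for vehicle-period i.
    failed[i]:  a failure actually occurred in vehicle-period i.
    """
    pairs = list(zip(alerted, failed))
    tp = sum(a and f for a, f in pairs)
    fn = sum((not a) and f for a, f in pairs)
    fp = sum(a and (not f) for a, f in pairs)
    recall = tp / (tp + fn) if (tp + fn) else None
    # Share of alerts that were not followed by a failure (1 - precision).
    false_alarm_rate = fp / (tp + fp) if (tp + fp) else None
    return {"recall": recall, "false_alarm_rate": false_alarm_rate}


# Hypothetical outcomes for six vehicle-periods.
alerted = [True, True, True, False, False, True]
failed = [True, True, False, True, False, True]
metrics = detection_metrics(alerted, failed)
```

In this example three of four failures were flagged in advance (recall 0.75) and one of four alerts was unnecessary (false alarm rate 0.25), so neither target in the table would be met.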
References
- Saxena, A. et al. (2008). Damage Propagation Modeling for Aircraft Engine Run-to-Failure Simulation. International Conference on Prognostics and Health Management (PHM 2008).
- Carvalho, T. et al. (2019). A systematic literature review of machine learning methods applied to predictive maintenance. Computers & Industrial Engineering.
Links
Modeling
- Time Series Models — temporal degradation patterns
- Unsupervised Learning — anomaly detection
Reference Implementations
Adjacent Applications