ML Platform Architecture

Purpose

As ML teams scale beyond a handful of models, ad-hoc scripts and manual processes become the primary bottleneck. An ML platform provides shared infrastructure that abstracts the operational complexity of the ML lifecycle — from data ingestion to model serving — so data scientists can focus on modeling rather than plumbing. A mature platform reduces mean time to deploy a model from weeks to days and enables reproducibility, governance, and operational reliability at scale.

Architecture

Core Components

| Component | Responsibility | Examples |
|---|---|---|
| Data store | Versioned access to raw and processed data | S3 + Delta Lake, BigQuery, Snowflake |
| Feature store | Compute, store, and serve features with train/serve parity | Feast, Tecton, Vertex AI Feature Store |
| Experiment tracking | Log parameters, metrics, artifacts per run | MLflow Tracking, W&B, Neptune |
| Training service | Managed compute for training jobs (GPU/CPU autoscaling) | SageMaker Training, Vertex AI Training, Ray Train |
| Model registry | Versioned model artifacts + lifecycle state machine | MLflow Registry, SageMaker Registry |
| Serving infrastructure | Real-time and batch inference endpoints | TorchServe, Triton, KServe, SageMaker Endpoints |
| Monitoring | Data drift, prediction drift, business metric tracking | Evidently AI, Arize, WhyLogs |
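
The train/serve parity that a feature store enforces can be illustrated with a toy sketch (a standalone stand-in, not a real Feast or Tecton API): a single feature-computation function backs both the offline training path and the online serving path, so skew cannot creep in between them.

```python
# Toy illustration of train/serve parity: one feature definition serves
# both the offline (training) and online (serving) paths. All names here
# are illustrative, not any feature store's real API.
from math import log1p

def compute_features(txn: dict) -> dict:
    # Single source of truth for the transform.
    return {
        "amount_log": log1p(txn["amount"]),
        "is_foreign": txn["country"] != "US",
    }

def offline_batch(txns: list[dict]) -> list[dict]:
    # Training path: materialize features over historical transactions.
    return [compute_features(t) for t in txns]

def online_lookup(txn: dict) -> dict:
    # Serving path: compute features for a single live transaction.
    return compute_features(txn)
```

A real feature store adds storage, backfills, and low-latency serving around this core idea, but the parity guarantee comes from sharing the definition.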

Orchestration Tools

Pipelines glue components together into reproducible, schedulable workflows:

  • Apache Airflow: DAG-based scheduler; Python operators; ubiquitous in data engineering; verbose for ML-specific patterns.
  • Prefect: Python-native; dynamic task graphs; better ergonomics than Airflow; supports hybrid execution.
  • Kubeflow Pipelines: Kubernetes-native; containerized steps; strong reproducibility; steep learning curve; good for large ML-specific orgs.
  • Metaflow: Data-scientist-focused; linear step notation; native versioning of data artifacts; developed at Netflix; lowest barrier to adoption for DS teams.
  • ZenML: Framework-agnostic; stack-based configuration; designed specifically for ML pipelines with built-in artifact lineage.
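
The DAG model shared by all of these tools can be sketched in a few lines of plain Python (an illustrative stand-in, not any tool's actual API): each step names its upstream dependencies, and a runner executes steps in topological order, wiring outputs to inputs.

```python
# Minimal sketch of the DAG idea behind orchestrators like Airflow or
# Kubeflow Pipelines. Step and function names are illustrative.
from graphlib import TopologicalSorter

def ingest():
    return [0.2, 0.5, 0.9]                   # stand-in for data ingestion

def featurize(raw):
    return [x * 10 for x in raw]             # stand-in for feature computation

def train(features):
    return sum(features) / len(features)     # stand-in for model training

# step name -> (callable, upstream step names)
dag = {
    "ingest": (ingest, []),
    "featurize": (featurize, ["ingest"]),
    "train": (train, ["featurize"]),
}

def run(dag):
    results = {}
    order = TopologicalSorter({name: set(deps) for name, (_, deps) in dag.items()})
    for name in order.static_order():        # dependencies always come first
        fn, deps = dag[name]
        results[name] = fn(*(results[d] for d in deps))
    return results
```

Real orchestrators add scheduling, retries, containerized execution, and artifact caching on top of exactly this dependency-ordered execution.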

Model Registry Lifecycle

A model registry tracks model versions through a promotion lifecycle. Each promotion should be gated by automated evaluation (e.g., shadow test passage, A/B test result) and require human approval for production promotion. MLflow’s registry exposes this via UI and REST API; SageMaker Model Registry integrates with CodePipeline for CI/CD gating.
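
A promotion gate of the kind described above can be sketched as a simple predicate (names and thresholds are hypothetical, not part of any registry's API): automated evaluation must clear a minimum lift over the incumbent, and a human must have approved the production promotion.

```python
# Hypothetical promotion gate: candidate is promoted only if it beats the
# incumbent champion by a minimum margin AND a human has signed off.
def should_promote(candidate_auc: float,
                   champion_auc: float,
                   human_approved: bool,
                   min_lift: float = 0.01) -> bool:
    beats_champion = candidate_auc >= champion_auc + min_lift
    return beats_champion and human_approved
```

In practice this check would run in CI against registry metadata, with the approval recorded as a review step rather than a boolean flag.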

Older MLflow releases modeled the lifecycle as a fixed state machine: Staging → Production → Archived. Newer releases deprecate fixed stages in favour of named model aliases (e.g., champion, challenger, production-v2). Aliases allow multiple concurrent production variants and remove the implicit single-slot constraint of fixed stages:

import mlflow
from mlflow.tracking import MlflowClient

client = MlflowClient()
# Point the "champion" alias at version 7 of the registered model
client.set_registered_model_alias("fraud-detector", "champion", version=7)
# Resolve the alias at load time instead of hard-coding a version number
model = mlflow.pyfunc.load_model("models:/fraud-detector@champion")

See Experiment Tracking for the full alias API.

Implementation Notes

Build vs. Buy Decision Framework

| Criterion | Managed platform (SageMaker, Vertex AI) | Open-source / custom |
|---|---|---|
| Team size | < 10 ML engineers | > 20 ML engineers |
| Time to value | Weeks | Months |
| Customizability | Limited | Full |
| Cost at scale | High (vendor markup) | Lower (infra cost only) |
| Vendor lock-in | High | Low |

Typical recommendation: start with a managed platform, extract components to open-source alternatives as specific pain points emerge (e.g., replace SageMaker Experiments with MLflow once experiment volume grows).

Reference Architectures

  • AWS SageMaker: Studio (IDE) + Pipelines (orchestration) + Feature Store + Model Registry + Endpoints. Tightest integration; highest lock-in.
  • GCP Vertex AI: Unified Data & AI platform; AutoML and custom training; Vertex Pipelines (Kubeflow-compatible); Vertex Feature Store; Model Monitoring.
  • Azure ML: Designer (GUI) + Pipelines + Model Registry + Managed Endpoints; strong integration with Azure DevOps.
  • Open-source (Kubeflow + MLflow + Feast): Full control; requires dedicated platform team; typical stack at mid-to-large tech companies.

Artifact Management

Every artifact — dataset snapshot, trained model, evaluation report — should be stored with a content-addressable URI (e.g., s3://ml-artifacts/model/abc123/model.pkl), linked to its producing pipeline run, and never mutated in place. MLflow and DVC both implement this pattern. Immutable artifacts are a prerequisite for reproducible model re-evaluation and audit.
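
The content-addressable naming scheme described above can be sketched directly (bucket and path layout are illustrative): the URI is derived from a hash of the artifact's bytes, so identical content always maps to the same location, and any change to the content yields a new URI rather than a mutation.

```python
# Sketch of content-addressable artifact naming: the URI embeds a digest
# of the artifact's bytes, making mutation-in-place structurally
# impossible. Bucket name and path layout are illustrative.
import hashlib

def artifact_uri(payload: bytes, bucket: str = "ml-artifacts") -> str:
    digest = hashlib.sha256(payload).hexdigest()
    return f"s3://{bucket}/model/{digest[:12]}/model.pkl"
```

Tools like DVC apply the same idea by storing files under their content hash and committing only the hash pointer to version control.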

Trade-offs

  • Monolithic platform vs. best-of-breed: A single integrated platform (SageMaker) reduces integration cost but creates lock-in and may lag behind the ecosystem. Best-of-breed tools (MLflow + Feast + Ray) offer flexibility but require significant glue code.
  • Abstraction level: High-level abstractions (e.g., Metaflow’s @step) reduce cognitive overhead but can obscure what’s happening under the hood, making debugging harder.
  • Centralized vs. federated: A centralized platform team owning infra creates a bottleneck; federated ownership (each team runs its own stack) leads to duplication. The recommended pattern is a thin platform team owning shared primitives with self-service access.

References

  • Sculley et al., “Hidden Technical Debt in Machine Learning Systems” (NeurIPS 2015)
  • Kleppmann, Designing Data-Intensive Applications, Chapter 10 (O’Reilly, 2017)
  • Huyen, Designing Machine Learning Systems (O’Reilly, 2022), Chapter 10
  • MLflow documentation: https://mlflow.org/docs/latest/
  • Kubeflow documentation: https://www.kubeflow.org/docs/