ML Platform Architecture

Purpose

As ML teams scale beyond a handful of models, ad-hoc scripts and manual processes become the primary bottleneck. An ML platform provides shared infrastructure that abstracts the operational complexity of the ML lifecycle — from data ingestion to model serving — so data scientists can focus on modeling rather than plumbing. A mature platform reduces mean time to deploy a model from weeks to days and enables reproducibility, governance, and operational reliability at scale.

Architecture

Core Components

| Component | Responsibility | Examples |
|---|---|---|
| Data store | Versioned access to raw and processed data | S3 + Delta Lake, BigQuery, Snowflake |
| Feature store | Compute, store, and serve features with train/serve parity | Feast, Tecton, Vertex AI Feature Store |
| Experiment tracking | Log parameters, metrics, artifacts per run | MLflow Tracking, W&B, Neptune |
| Training service | Managed compute for training jobs (GPU/CPU autoscaling) | SageMaker Training, Vertex AI Training, Ray Train |
| Model registry | Versioned model artifacts + lifecycle state machine | MLflow Registry, SageMaker Registry |
| Serving infrastructure | Real-time and batch inference endpoints | TorchServe, Triton, KServe, SageMaker Endpoints |
| Monitoring | Data drift, prediction drift, business metric tracking | Evidently AI, Arize, WhyLogs |
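
The train/serve parity that a feature store enforces can be illustrated with a toy sketch (a standalone stand-in, not a real Feast or Tecton API): a single feature-computation function backs both the offline training path and the online serving path, so skew cannot creep in between them.

```python
# Toy illustration of train/serve parity: one feature definition serves
# both the offline (training) and online (serving) paths. All names here
# are illustrative, not any feature store's real API.
from math import log1p

def compute_features(txn: dict) -> dict:
    # Single source of truth for the transform.
    return {
        "amount_log": log1p(txn["amount"]),
        "is_foreign": txn["country"] != "US",
    }

def offline_batch(txns: list[dict]) -> list[dict]:
    # Training path: materialize features over historical transactions.
    return [compute_features(t) for t in txns]

def online_lookup(txn: dict) -> dict:
    # Serving path: compute features for a single live transaction.
    return compute_features(txn)
```

A real feature store adds storage, backfills, and low-latency serving around this core idea, but the parity guarantee comes from sharing the definition.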

Orchestration Tools

Pipelines glue components together into reproducible, schedulable workflows:

  • Apache Airflow: DAG-based scheduler; Python operators; ubiquitous in data engineering; verbose for ML-specific patterns.
  • Prefect: Python-native; dynamic task graphs; better ergonomics than Airflow; supports hybrid execution.
  • Kubeflow Pipelines: Kubernetes-native; containerized steps; strong reproducibility; steep learning curve; good for large ML-specific orgs.
  • Metaflow: Data-scientist-focused; linear step notation; native versioning of data artifacts; developed at Netflix; lowest barrier to adoption for DS teams.
  • ZenML: Framework-agnostic; stack-based configuration; designed specifically for ML pipelines with built-in artifact lineage.
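
The DAG model shared by all of these tools can be sketched in a few lines of plain Python (an illustrative stand-in, not any tool's actual API): each step names its upstream dependencies, and a runner executes steps in topological order, wiring outputs to inputs.

```python
# Minimal sketch of the DAG idea behind orchestrators like Airflow or
# Kubeflow Pipelines. Step and function names are illustrative.
from graphlib import TopologicalSorter

def ingest():
    return [0.2, 0.5, 0.9]                   # stand-in for data ingestion

def featurize(raw):
    return [x * 10 for x in raw]             # stand-in for feature computation

def train(features):
    return sum(features) / len(features)     # stand-in for model training

# step name -> (callable, upstream step names)
dag = {
    "ingest": (ingest, []),
    "featurize": (featurize, ["ingest"]),
    "train": (train, ["featurize"]),
}

def run(dag):
    results = {}
    order = TopologicalSorter({name: set(deps) for name, (_, deps) in dag.items()})
    for name in order.static_order():        # dependencies always come first
        fn, deps = dag[name]
        results[name] = fn(*(results[d] for d in deps))
    return results
```

Real orchestrators add scheduling, retries, containerized execution, and artifact caching on top of exactly this dependency-ordered execution.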

Model Registry Lifecycle

A model registry tracks model versions through a promotion lifecycle. Each promotion should be gated by automated evaluation (e.g., shadow test passage, A/B test result) and require human approval for production promotion. MLflow’s registry exposes this via UI and REST API; SageMaker Model Registry integrates with CodePipeline for CI/CD gating.
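
A promotion gate of the kind described above can be sketched as a simple predicate (names and thresholds are hypothetical, not part of any registry's API): automated evaluation must clear a minimum lift over the incumbent, and a human must have approved the production promotion.

```python
# Hypothetical promotion gate: candidate is promoted only if it beats the
# incumbent champion by a minimum margin AND a human has signed off.
def should_promote(candidate_auc: float,
                   champion_auc: float,
                   human_approved: bool,
                   min_lift: float = 0.01) -> bool:
    beats_champion = candidate_auc >= champion_auc + min_lift
    return beats_champion and human_approved
```

In practice this check would run in CI against registry metadata, with the approval recorded as a review step rather than a boolean flag.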

Older MLflow releases modeled the lifecycle as a fixed state machine: Staging → Production → Archived. Newer releases deprecate fixed stages in favour of named model aliases (e.g., champion, challenger, production-v2). Aliases allow multiple concurrent production variants and remove the implicit single-slot constraint of fixed stages:

import mlflow
from mlflow.tracking import MlflowClient

client = MlflowClient()
# Point the "champion" alias at version 7 of the registered model
client.set_registered_model_alias("fraud-detector", "champion", version=7)
# Resolve the alias at load time instead of hard-coding a version number
model = mlflow.pyfunc.load_model("models:/fraud-detector@champion")

See Experiment Tracking for the full alias API.

Implementation Notes

Build vs. Buy Decision Framework

| Criterion | Managed platform (SageMaker, Vertex AI) | Open-source / custom |
|---|---|---|
| Team size | < 10 ML engineers | > 20 ML engineers |
| Time to value | Weeks | Months |
| Customizability | Limited | Full |
| Cost at scale | High (vendor markup) | Lower (infra cost only) |
| Vendor lock-in | High | Low |

Typical recommendation: start with a managed platform, extract components to open-source alternatives as specific pain points emerge (e.g., replace SageMaker Experiments with MLflow once experiment volume grows).

Reference Architectures

  • AWS SageMaker: Studio (IDE) + Pipelines (orchestration) + Feature Store + Model Registry + Endpoints. Tightest integration; highest lock-in.
  • GCP Vertex AI: Unified Data & AI platform; AutoML and custom training; Vertex Pipelines (Kubeflow-compatible); Vertex Feature Store; Model Monitoring.
  • Azure ML: Designer (GUI) + Pipelines + Model Registry + Managed Endpoints; strong integration with Azure DevOps.
  • Open-source (Kubeflow + MLflow + Feast): Full control; requires dedicated platform team; typical stack at mid-to-large tech companies.

Artifact Management

Every artifact — dataset snapshot, trained model, evaluation report — should be stored with a content-addressable URI (e.g., s3://ml-artifacts/model/abc123/model.pkl), linked to its producing pipeline run, and never mutated in place. MLflow and DVC both implement this pattern. Immutable artifacts are a prerequisite for reproducible model re-evaluation and audit.
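
The content-addressable naming scheme described above can be sketched directly (bucket and path layout are illustrative): the URI is derived from a hash of the artifact's bytes, so identical content always maps to the same location, and any change to the content yields a new URI rather than a mutation.

```python
# Sketch of content-addressable artifact naming: the URI embeds a digest
# of the artifact's bytes, making mutation-in-place structurally
# impossible. Bucket name and path layout are illustrative.
import hashlib

def artifact_uri(payload: bytes, bucket: str = "ml-artifacts") -> str:
    digest = hashlib.sha256(payload).hexdigest()
    return f"s3://{bucket}/model/{digest[:12]}/model.pkl"
```

Tools like DVC apply the same idea by storing files under their content hash and committing only the hash pointer to version control.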

Trade-offs

  • Monolithic platform vs. best-of-breed: A single integrated platform (SageMaker) reduces integration cost but creates lock-in and may lag behind the ecosystem. Best-of-breed tools (MLflow + Feast + Ray) offer flexibility but require significant glue code.
  • Abstraction level: High-level abstractions (e.g., Metaflow’s @step) reduce cognitive overhead but can obscure what’s happening under the hood, making debugging harder.
  • Centralized vs. federated: A centralized platform team owning infra creates a bottleneck; federated ownership (each team runs its own stack) leads to duplication. The recommended pattern is a thin platform team owning shared primitives with self-service access.

References

  • Sculley et al., “Hidden Technical Debt in Machine Learning Systems” (NeurIPS 2015)
  • Kleppmann, Designing Data-Intensive Applications, Chapter 10 (O’Reilly, 2017)
  • Huyen, Designing Machine Learning Systems (O’Reilly, 2022), Chapter 10
  • MLflow documentation: https://mlflow.org/docs/latest/
  • Kubeflow documentation: https://www.kubeflow.org/docs/