DVC Dataset Versioning Pattern
Purpose
Implements dataset versioning, data pipeline reproducibility, and artifact lineage tracking using DVC (Data Version Control). DVC extends Git to track large data files and ML artifacts without storing them in the repository, using remote storage backends for the actual bytes.
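The idea of keeping only a small pointer in Git while the bytes live elsewhere can be illustrated with a short Python sketch. It is a simplified model, not DVC's actual implementation: the function name `make_pointer` and the layout of the returned dictionary are assumptions, chosen to resemble the `md5`, `size`, and `path` fields of a `.dvc` file.

```python
import hashlib
import os
import tempfile


def make_pointer(path, chunk_size=1 << 20):
    """Return a small metadata record describing the file at `path`.

    Hypothetical helper: it mimics the kind of information a .dvc file
    holds (content hash, size, file name), not DVC's real code.
    """
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read in chunks so a large dataset never has to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return {
        "md5": digest.hexdigest(),
        "size": os.path.getsize(path),
        "path": os.path.basename(path),
    }


if __name__ == "__main__":
    # Demonstrate on a throwaway file rather than a real dataset.
    with tempfile.TemporaryDirectory() as tmp:
        sample = os.path.join(tmp, "train.parquet")
        with open(sample, "wb") as f:
            f.write(b"example bytes")
        print(make_pointer(sample))
```

Only this small record would need to be committed to Git. Files with identical bytes yield the same hash, which is what lets a commit identify a dataset version without containing the data.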
Examples
Training data versioning: Track evolving training sets across preprocessing iterations; reproduce any experiment’s exact dataset with git checkout + dvc pull.
Data pipeline automation: Define multi-stage preprocessing pipelines in dvc.yaml; dvc repro executes only the stages whose inputs have changed, much as make does for build targets.
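The make-like behaviour in the second example can be sketched as follows. This is a toy model rather than DVC's algorithm: `fingerprint`, `stale_stages`, the `stages` mapping, and the `recorded` hash table are hypothetical names, and real DVC also takes parameters, outputs, and its lock file into account.

```python
import hashlib


def fingerprint(data: bytes) -> str:
    """Hash the content of one dependency."""
    return hashlib.md5(data).hexdigest()


def stale_stages(stages, contents, recorded):
    """Return the names of stages that need to rerun, in pipeline order.

    stages:   ordered dict of stage name -> {"deps": [...], "outs": [...]}
    contents: dict of path -> current bytes of that file
    recorded: dict of stage name -> {dep path: hash seen at the last run}
    """
    stale = []
    dirty_outputs = set()
    for name, stage in stages.items():
        seen = recorded.get(name, {})
        changed = any(
            dep in dirty_outputs or fingerprint(contents[dep]) != seen.get(dep)
            for dep in stage["deps"]
        )
        if changed:
            stale.append(name)
            # Conservative simplification: treat every output of a rerun
            # stage as changed, so downstream stages are rerun as well.
            dirty_outputs.update(stage["outs"])
    return stale


if __name__ == "__main__":
    stages = {
        "preprocess": {"deps": ["raw"], "outs": ["clean"]},
        "featurise": {"deps": ["clean"], "outs": ["features"]},
    }
    contents = {"raw": b"v2", "clean": b"c1"}
    recorded = {
        "preprocess": {"raw": fingerprint(b"v1")},
        "featurise": {"clean": fingerprint(b"c1")},
    }
    print(stale_stages(stages, contents, recorded))
```

When no dependency hash differs from the recorded one, the function returns an empty list, which corresponds to a run in which nothing needs to execute.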
Architecture
Installation and Remote Setup
pip install dvc dvc-s3 # or dvc-gs, dvc-azure, dvc-ssh
cd my-ml-project
git init
dvc init # creates .dvc/ directory
# Configure S3 remote
dvc remote add -d myremote s3://my-bucket/dvc-store
dvc remote modify myremote region eu-west-1
# Commit DVC config to git
git add .dvc/config
git commit -m "Add DVC remote"
Tracking Data Files
# Add a dataset to DVC tracking
dvc add data/raw/train.parquet
# DVC creates data/raw/train.parquet.dvc:
#   outs:
#   - md5: a3f8b2c1d9e4...
#     size: 1073741824
#     path: train.parquet
git add data/raw/train.parquet.dvc data/raw/.gitignore
git commit -m "Add training dataset v1"
# Push data to remote
dvc push
Reproducing a Past Dataset Version
# Restore code to an earlier commit
git checkout <earlier-commit>
# Pull the corresponding dataset version
dvc pull
# data/raw/train.parquet is now the version from that commit
Defining a Reproducible Data Pipeline
# dvc.yaml
stages:
  preprocess:
    cmd: python scripts/preprocess.py
    deps:
      - scripts/preprocess.py
      - data/raw/train.parquet
    outs:
      - data/processed/train_clean.parquet
    params:
      - params.yaml:
          - preprocess.drop_nulls_threshold
          - preprocess.date_cutoff
  featurise:
    cmd: python scripts/featurise.py
    deps:
      - scripts/featurise.py
      - data/processed/train_clean.parquet
    outs:
      - data/features/train_features.parquet
  split:
    cmd: python scripts/split.py
    deps:
      - scripts/split.py
      - data/features/train_features.parquet
    outs:
      - data/splits/train.parquet
      - data/splits/val.parquet
    params:
      - params.yaml:
          - split.val_fraction
          - split.random_seed
# Run only stages with changed inputs
dvc repro
# Force rerun all stages
dvc repro --force
Experiment Tracking with DVC
# Run an experiment (captures params, metrics, and artifacts)
dvc exp run --set-param model.learning_rate=0.01
# Show all experiments
dvc exp show
# Compare two experiments
dvc exp diff exp-abc123 exp-def456
# Promote best experiment to workspace
dvc exp apply exp-abc123
Linking Dataset Version to MLflow Run
import subprocess
import mlflow
import yaml  # PyYAML, used to parse the .dvc file instead of string splitting
# Read the hash that DVC recorded for the tracked dataset
with open("data/raw/train.parquet.dvc") as f:
    data_version = yaml.safe_load(f)["outs"][0]["md5"]
# Record the Git commit so that code and data versions are logged together
git_commit = subprocess.check_output(["git", "rev-parse", "HEAD"], text=True).strip()
with mlflow.start_run():
    mlflow.log_param("data_version", data_version)
    mlflow.log_param("git_commit", git_commit)
    # ... rest of training
CI Integration
# .github/workflows/data-pipeline.yml
name: Validate Data Pipeline
on: [push]
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Setup DVC
        run: pip install dvc dvc-s3
      - name: Pull data
        run: dvc pull
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
      - name: Reproduce pipeline
        run: dvc repro --dry  # list the stages that would rerun, without executing them
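A dry run only reports what would execute, so a CI job that should fail on a stale pipeline needs an explicit check. The sketch below is one possible approach, assuming that `dvc status --json` prints an empty JSON object when everything is up to date; `pipeline_is_up_to_date` and `main` are hypothetical names, and the script is not part of DVC.

```python
import json
import subprocess
import sys


def pipeline_is_up_to_date(status_json: str) -> bool:
    """Interpret the JSON text printed by `dvc status --json`.

    Assumption: an empty JSON object means that nothing has changed.
    """
    return json.loads(status_json) == {}


def main() -> int:
    """Return 0 if the pipeline is up to date, 1 otherwise."""
    result = subprocess.run(
        ["dvc", "status", "--json"],
        capture_output=True,
        text=True,
        check=True,  # raise if the dvc command itself fails
    )
    if pipeline_is_up_to_date(result.stdout):
        print("Pipeline is up to date.")
        return 0
    print("Out-of-date stages or data:", result.stdout, file=sys.stderr)
    return 1
```

A workflow step could call `main()` and pass its return value to `sys.exit`, so that a non-empty status fails the job.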