Distributed Training with HuggingFace Accelerate
Purpose
Implements multi-GPU and multi-node distributed training using HuggingFace Accelerate, which provides a unified API over PyTorch DDP, DeepSpeed ZeRO, and FSDP. The same training script runs on a single GPU or across a cluster by changing a configuration file, without code changes.
Examples
Fine-tuning a 7B LLM on 4×A100s: Use Accelerate + DeepSpeed ZeRO-3 with BF16 to distribute parameters, gradients, and optimizer states across GPUs, reducing peak memory from ~56 GB to ~14 GB per GPU (see the memory sketch after these examples).
Multi-GPU training on a single machine: Run standard PyTorch training across 8 GPUs with DDP using 4 lines of code and accelerate launch.
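A back-of-envelope sketch of that memory arithmetic, assuming roughly 8 bytes of aggregate training state per parameter (an illustration; real footprints vary with optimizer choice and activation memory):
params = 7e9          # 7B-parameter model
bytes_per_param = 8   # assumption: BF16 weights + grads + sharded optimizer state
num_gpus = 4
total_gb = params * bytes_per_param / 1e9   # ≈ 56 GB unsharded
per_gpu_gb = total_gb / num_gpus            # ≈ 14 GB per GPU under ZeRO-3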
Architecture
Installation and Configuration
pip install "accelerate>=0.27.0"   # quoted so the shell doesn't treat >= as a redirect
# Interactive configuration wizard
accelerate config
# Prompts: machine type, num GPUs, mixed precision, DeepSpeed/FSDP?
# Saves to ~/.cache/huggingface/accelerate/default_config.yaml
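For reference, a minimal single-machine multi-GPU config looks roughly like the sketch below (illustrative; exact fields vary by Accelerate version):
# default_config.yaml (illustrative)
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
mixed_precision: bf16
num_machines: 1
num_processes: 8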
Minimal 4-Line Conversion
import torch
from torch.utils.data import DataLoader
from accelerate import Accelerator
accelerator = Accelerator(mixed_precision="bf16") # 1
model = MyModel()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
dataloader = DataLoader(dataset, batch_size=8)
model, optimizer, dataloader = accelerator.prepare(          # 2
    model, optimizer, dataloader
)
for batch in dataloader:
    optimizer.zero_grad()
    loss = model(**batch).loss
    accelerator.backward(loss)                               # 3
    optimizer.step()

accelerator.wait_for_everyone()                              # 4
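Evaluation follows the same pattern; the sketch below assumes eval_dataloader was also passed through accelerator.prepare() and uses gather_for_metrics to collect predictions across processes:
model.eval()
all_preds = []
for batch in eval_dataloader:
    with torch.no_grad():
        logits = model(**batch).logits
    # gather_for_metrics also drops samples duplicated to pad the last batch
    all_preds.append(accelerator.gather_for_metrics(logits.argmax(dim=-1)))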
# Launch on all available GPUs
accelerate launch train.py
# Launch on N specific GPUs
accelerate launch --num_processes 4 train.py
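For multi-node runs, launch the same script once per machine, passing that machine's rank (the IP and port below are placeholders):
# On machine 0 (repeat on machine 1 with --machine_rank 1)
accelerate launch --num_machines 2 --machine_rank 0 \
    --main_process_ip 10.0.0.1 --main_process_port 29500 \
    --num_processes 16 train.py   # 16 = total processes across all machines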
DeepSpeed ZeRO-3 Configuration
For models that don’t fit on a single GPU, use ZeRO-3 to shard parameters, gradients, and optimizer states:
# accelerate_config.yaml
compute_environment: LOCAL_MACHINE
distributed_type: DEEPSPEED
deepspeed_config:
  deepspeed_multinode_launcher: standard
  gradient_accumulation_steps: 4
  gradient_clipping: 1.0
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: true
  zero3_save_16bit_model: true
  zero_stage: 3
mixed_precision: bf16
num_machines: 1
num_processes: 4

# train.py — identical to the DDP script; Accelerate handles the ZeRO sharding
accelerator = Accelerator()
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)
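To use this config file without making it the default, pass it explicitly at launch:
accelerate launch --config_file accelerate_config.yaml train.py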
FSDP Configuration (PyTorch Native)
# accelerate_config_fsdp.yaml
compute_environment: LOCAL_MACHINE
distributed_type: FSDP
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch_policy: BACKWARD_PRE
  fsdp_forward_prefetch: false
  fsdp_offload_params: false
  fsdp_sharding_strategy: 1  # FULL_SHARD = ZeRO-3 equivalent
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer
mixed_precision: bf16
num_processes: 4
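The numeric fsdp_sharding_strategy maps onto PyTorch's ShardingStrategy enum; the commonly used values are sketched below (check your Accelerate version's config wizard for the authoritative list):
# fsdp_sharding_strategy → torch.distributed.fsdp.ShardingStrategy
#   1 = FULL_SHARD     shard params, grads, and optimizer states (ZeRO-3-like)
#   2 = SHARD_GRAD_OP  shard grads and optimizer states only (ZeRO-2-like)
#   3 = NO_SHARD       no sharding; equivalent to DDP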
Gradient Accumulation
accelerator = Accelerator(gradient_accumulation_steps=4)
for batch in dataloader:
    with accelerator.accumulate(model):  # handles sync/no-sync automatically
        optimizer.zero_grad()
        loss = model(**batch).loss
        accelerator.backward(loss)
        optimizer.step()

# Effective batch = local_batch × grad_accum_steps × num_gpus
# e.g., 8 × 4 × 4 GPUs = 128 samples per optimizer update
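Gradient clipping composes with accumulation; the sketch below uses accelerator.clip_grad_norm_ and the sync_gradients flag so clipping only runs on the step where gradients actually synchronize:
for batch in dataloader:
    with accelerator.accumulate(model):
        loss = model(**batch).loss
        accelerator.backward(loss)
        if accelerator.sync_gradients:   # True only on the syncing step
            accelerator.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
        optimizer.zero_grad()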
Checkpointing
# Save — model, optimizer, scheduler, and RNG states (sharded backends write per-process shards)
accelerator.save_state("checkpoint/")
# Load — all processes synchronize
accelerator.load_state("checkpoint/")
# Save unwrapped model (for inference)
unwrapped = accelerator.unwrap_model(model)
unwrapped.save_pretrained(
    "saved_model/",
    is_main_process=accelerator.is_main_process,  # write only once
    save_function=accelerator.save,
)
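To resume mid-epoch, restore the state and fast-forward the dataloader; completed_steps below is a hypothetical counter you persist alongside the checkpoint:
accelerator.load_state("checkpoint/")
# skip_first_batches skips batches already consumed in the current epoch
resumed = accelerator.skip_first_batches(dataloader, completed_steps)
for batch in resumed:
    ...  # continue the training loop as usual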
Learning Rate Scheduling
from transformers import get_linear_schedule_with_warmup
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=100,
    num_training_steps=len(dataloader) * num_epochs,
)
# Prepare scheduler alongside model and optimizer
model, optimizer, dataloader, scheduler = accelerator.prepare(
    model, optimizer, dataloader, scheduler
)
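The prepared scheduler is then stepped after each optimizer step, as in this minimal loop sketch:
for epoch in range(num_epochs):
    for batch in dataloader:
        optimizer.zero_grad()
        loss = model(**batch).loss
        accelerator.backward(loss)
        optimizer.step()
        scheduler.step()   # Accelerate accounts for multi-process stepping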