Fine-tuning Strategies
Purpose
Adapting pre-trained foundation models to specific tasks or domains is one of the most high-leverage activities in applied AI engineering. Rather than training models from scratch — which requires billions of tokens and enormous compute — fine-tuning starts from a powerful base and redirects its capabilities toward a target distribution: a specific domain vocabulary, a response style, a task format, or a set of safety constraints.
Fine-tuning is not always the right answer. The first decision is RAG vs fine-tune:
| Criterion | Prefer RAG | Prefer Fine-tuning |
|---|---|---|
| Knowledge freshness | Dynamic, frequently updated | Static domain knowledge |
| Citations needed | Yes | No |
| Domain data availability | Limited labeled pairs | Hundreds–thousands of examples |
| Response format/style | Flexible | Critical, highly specific |
| Domain vocabulary | General | Dense, specialized |
| Latency budget | Tolerant of retrieval overhead | Latency-critical path |
These approaches are complementary, not exclusive — a fine-tuned model with RAG is a common production pattern. Fine-tuning teaches the model how to respond; RAG supplies what to respond with.
Architecture
Four major fine-tuning paradigms exist, ordered roughly by compute cost and data requirements:
1. Continued Pre-training Domain adaptation on raw (unlabeled) text before any instruction tuning. Feed the model large corpora in the target domain (e.g., legal filings, clinical notes, scientific papers) to shift the base distribution. Useful when the domain vocabulary or reasoning patterns are far from the pre-training distribution. Requires gigabytes of domain text; typically run on the base model before SFT. See PEFT and LoRA for compute-efficient variants.
2. Instruction Tuning (SFT)
Supervised fine-tuning on (instruction, response) pairs to teach instruction-following behavior. Uses cross-entropy loss on the response tokens. Data formats: Alpaca (single-turn instruction/input/output), ChatML / ShareGPT (multi-turn with role tags). See Instruction Data Design for data considerations.
3. PEFT — Parameter-Efficient Fine-tuning Update fewer than 1% of model parameters using adapter methods (LoRA, QLoRA, prefix tuning, adapter layers). Dramatically reduces GPU memory and training time while preserving most of the performance of full fine-tuning. The standard approach for most practitioners. Detailed in PEFT and LoRA.
4. Full Fine-tuning Update all model weights using a low learning rate (~2e-5). Maximum expressiveness but requires significant GPU memory (e.g., ~80GB+ for a 7B model in BF16 without ZeRO). Risk of catastrophic forgetting — the model loses general capabilities as it over-specializes. Mitigated by mixing general data into the fine-tuning set (data replay).
RL Alignment (RLHF/DPO/GRPO) Post-SFT alignment stage: optimize model behavior against human preference signals. Covered in Reinforcement Learning Fine-tuning.
Implementation Notes
Data scale guidance:
- Fine-tuning requires hundreds to thousands of high-quality examples, not millions (that’s pre-training scale)
- Data quality >> quantity: 1,000 carefully curated pairs routinely outperform 100,000 noisy ones
- Target diversity across task types, lengths, and difficulty levels — see Instruction Data Design
Hyperparameter starting points:
- LoRA learning rate: ~1e-4; full fine-tune learning rate: ~2e-5
- Epochs: 1–3; beyond 3 the model tends to overfit and lose generality
- Warmup ratio: 0.03 is a safe default
- Weight decay: 0.01
Evaluation protocol:
- Task-specific held-out set (primary metric)
- General benchmark suite (MMLU, HellaSwag, GSM8K) to check for catastrophic forgetting
- MT-Bench or LLM-as-judge for open-ended response quality
- Log loss curves — training loss should decrease smoothly; a flat eval loss with rising training loss signals overfitting
Tooling:
- Axolotl: YAML-based, wraps HuggingFace Trainer, supports full/LoRA/QLoRA, 100+ architectures
- LLaMA-Factory: WebUI + CLI, multimodal support, DPO/GRPO built-in
- HuggingFace PEFT + TRL: lower-level, more flexible for custom workflows
Trade-offs
| Strategy | Performance | GPU Cost | Forgetting Risk | Data Needed |
|---|---|---|---|---|
| Full fine-tuning | Highest | Very high | High | Moderate |
| LoRA | Near-full | Low | Low | Moderate |
| QLoRA | Slightly below LoRA | Very low | Low | Moderate |
| Continued pre-training | Domain +, general ~ | Medium | Low | Large (raw text) |
| Instruction tuning only | Task-specific | Low–medium | Medium | Low–medium |
Key design tensions:
- Specialization vs. generality: aggressive fine-tuning improves task performance but risks narrowing the model’s useful range — data replay and gentle LR schedules mitigate this
- Fine-tune vs. RAG: fine-tuning bakes knowledge into weights (fast inference, no retrieval infra, but stale); RAG externalizes knowledge (updatable, citable, but adds latency and infra complexity)
- Adapter merging: LoRA adapters can be merged into base weights at inference time (zero latency cost) or kept separate (hot-swappable but extra compute)
References
- Hu et al. (2021) — LoRA: Low-Rank Adaptation of Large Language Models
- Dettmers et al. (2023) — QLoRA: Efficient Finetuning of Quantized LLMs
- Wei et al. (2021) — Finetuned Language Models Are Zero-Shot Learners (FLAN)
- Ouyang et al. (2022) — Training language models to follow instructions with human feedback (InstructGPT)
- Axolotl: https://github.com/OpenAccess-AI-Collective/axolotl
- LLaMA-Factory: https://github.com/hiyouga/LLaMA-Factory