Fine-Tuning Large Language Models
Before reading: understand the training pipeline, transfer learning, VRAM constraints, quantization, and QLoRA — all covered in Machine Learning.
You’ve been prompting GPT-4 to classify customer support tickets: “urgent,” “billing,” “technical,” “general.” It works on 95% of tickets. But the other 5% (tickets with unusual phrasing, edge-case topics, or mixed intent) are consistently misclassified. You’ve tried elaborate system prompts, few-shot examples, chain-of-thought. The edge cases persist.
This is when you fine-tune. Prompt engineering shifts the distribution; fine-tuning changes the model.
When to Fine-Tune vs Prompt-Engineer
| Factor | Favor Prompt Engineering | Favor Fine-Tuning |
|---|---|---|
| Task complexity | Simple, well-defined | Nuanced, domain-specific |
| Latency requirements | Relaxed; a large model with a long prompt is acceptable | Tight; a smaller fine-tuned model beats a big prompted model |
| Cost per inference | Low call volume; big-model pricing is tolerable | High call volume; a small model is cheaper per call |
| Data availability | <10 examples | 50–1000+ examples |
| Need for constant iteration | Re-deploy prompts instantly | Train → evaluate → deploy cycle |
| Model behavior consistency | Varies with prompt phrasing | Consistent across phrasings |
Decision rule: start with prompt engineering. If you’ve hit diminishing returns after iterating on prompt design, collect the failure cases and fine-tune.
Full Fine-Tuning
Full fine-tuning updates every parameter of the model. For a 7B model at FP16: 14 GB weights + 14 GB gradients + 28 GB optimizer states (Adam) = ~56 GB VRAM. For a 70B model: ~560 GB — you need 8× H100s or a high-end cluster.
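The arithmetic above can be reproduced with a few lines of Python. This is a rough sketch, assuming weights, gradients, and both Adam moments are each stored at 2 bytes per value and ignoring activations; the function name `full_finetune_vram_gb` is made up for this example.

```python
def full_finetune_vram_gb(params_billion: float, bytes_per_value: int = 2) -> dict:
    """Rough VRAM for full fine-tuning with Adam, ignoring activations.

    Assumes weights, gradients, and both Adam moments are all stored at
    `bytes_per_value` bytes each (2 for FP16/BF16), as in the text above.
    """
    weights = params_billion * bytes_per_value   # 1e9 params * bytes = GB
    gradients = weights                          # one gradient per weight
    optimizer = 2 * weights                      # Adam: first and second moment
    return {
        "weights": weights,
        "gradients": gradients,
        "optimizer": optimizer,
        "total": weights + gradients + optimizer,
    }


for size in (7, 70):
    est = full_finetune_vram_gb(size)
    print(f"{size}B: {est['total']:.0f} GB total")
```

Treat the result as a lower bound: activations add to it, and many mixed-precision setups keep optimizer states in FP32, which pushes the total higher.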
Full fine-tuning is the most expressive method — every weight adapts. It’s worth the cost when:
- You’re adapting to a fundamentally different domain (code → medical knowledge).
- You have a large dataset (10K+ examples) and want maximum quality.
- You have the compute budget and the task justifies it.
For most practical use cases, parameter-efficient methods (next section) get you 95% of the quality at 10–30% of the VRAM cost.
LoRA (Low-Rank Adaptation)
LoRA freezes the original model weights and inserts small trainable matrices (“adapters”) into attention layers. Instead of updating a d × d weight matrix, LoRA learns two smaller matrices: d × r and r × d, where r (rank) is typically 8–64.
The math: original output is h = W·x. With LoRA: h = W·x + (B·A)·x where A is r × d and B is d × r. Only A and B are trained. The rank-r update captures the task-specific adaptation while the base model remains frozen.
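The formula can be sketched in plain Python with tiny matrices. This is an illustration only: the dimensions, the `scale` factor, and the helper names are assumptions for the example, and the zero initialization of B follows the LoRA paper rather than anything this section prescribes.

```python
import random


def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]


def lora_forward(W, A, B, x, scale=1.0):
    """h = W·x + scale * B·(A·x); W is frozen, only A and B would be trained."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))   # passes through the rank-r bottleneck
    return [b + scale * u for b, u in zip(base, update)]


random.seed(0)
d, r = 8, 2                                                       # toy sizes
W = [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]     # frozen, d x d
A = [[random.gauss(0, 0.01) for _ in range(d)] for _ in range(r)]  # r x d, small random init
B = [[0.0] * r for _ in range(d)]                                  # d x r, zero init
x = [random.gauss(0, 1) for _ in range(d)]

# With B = 0 the adapter contributes nothing, so training starts at the base model.
assert lora_forward(W, A, B, x) == matvec(W, x)

full_params = d * d
lora_params = r * d + d * r
print(f"trainable: {lora_params} vs {full_params} for the full matrix")
```

At realistic sizes the saving is much larger: for d = 4096 and r = 16, the two adapter matrices hold 131,072 values against 16,777,216 in the full matrix, under 1%.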
Practical numbers for a 7B model (FP16):
- Full fine-tuning: ~56 GB
- LoRA (r=16): ~14 GB (frozen weights) + ~3 GB (adapters with their gradients and optimizer states, plus activations) = ~17 GB, which fits on a single 24 GB consumer GPU
Rank selection: higher rank gives more capacity to adapt, with diminishing returns. r=16 is a safe default. For simple tasks (binary classification), r=4–8 suffices. For complex generation (writing in a specific style), r=32–64 may help.
Target modules: apply LoRA to the query and value projection matrices (q_proj, v_proj) as a minimum. Covering full attention (q_proj, k_proj, v_proj, o_proj) plus the feed-forward layers usually gives the best results, at roughly 2–5× the adapter parameters depending on the architecture.
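To see how the choice of target modules changes adapter size, here is a small parameter-count sketch. The layer shapes (hidden size 4096, feed-forward size 11008, 32 layers, no grouped-query attention) and the feed-forward module names `gate_proj`, `up_proj`, `down_proj` are assumptions for a LLaMA-style 7B model, not values taken from this section.

```python
# Assumed LLaMA-style 7B shapes (not from the text): hidden size 4096,
# feed-forward size 11008, 32 layers, no grouped-query attention.
HIDDEN, FFN, LAYERS = 4096, 11008, 32

MODULE_SHAPES = {                 # (input dim, output dim) of each projection
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, HIDDEN),
    "v_proj": (HIDDEN, HIDDEN),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, FFN),
    "up_proj": (HIDDEN, FFN),
    "down_proj": (FFN, HIDDEN),
}


def lora_param_count(target_modules, rank):
    """Adapter parameters: each module adds rank * (d_in + d_out) per layer."""
    per_layer = sum(rank * sum(MODULE_SHAPES[m]) for m in target_modules)
    return per_layer * LAYERS


minimal = lora_param_count(["q_proj", "v_proj"], rank=16)
attention = lora_param_count(["q_proj", "k_proj", "v_proj", "o_proj"], rank=16)
everything = lora_param_count(list(MODULE_SHAPES), rank=16)

print(f"q,v only:        {minimal:,}")
print(f"full attention:  {attention:,} ({attention / minimal:.1f}x)")
print(f"attention + FFN: {everything:,} ({everything / minimal:.1f}x)")
```

Under these assumed shapes the adapters stay in the tens of millions of parameters even with every projection covered, so the adapter weights themselves are a small part of the VRAM budget. The exact multiple depends on the architecture and on which feed-forward projections are included.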
QLoRA (Quantized LoRA)
QLoRA quantizes the base model to 4-bit (NF4 format) and adds LoRA adapters on top. The base model’s 4-bit weights are dequantized to BF16 on the fly during the forward pass — you never store the full-precision weights in memory.
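The quantize-then-dequantize idea can be illustrated with a simplified block quantizer. Note the substitution: this sketch uses uniform 4-bit levels with absmax scaling, not the NF4 code book that QLoRA actually uses, and it works on plain Python floats rather than BF16 tensors. The function names are made up for the example.

```python
def quantize_block(block, levels=16):
    """Absmax-quantize one block of floats to 4-bit codes (0..15) plus one scale.

    Simplified stand-in: uniform levels in [-1, 1], not the NF4 code book.
    """
    scale = max(abs(v) for v in block) or 1.0   # avoid dividing by zero
    step = 2.0 / (levels - 1)
    codes = [round((v / scale + 1.0) / step) for v in block]
    return codes, scale


def dequantize_block(codes, scale, levels=16):
    """Recover approximate floats on the fly, as done before each matmul."""
    step = 2.0 / (levels - 1)
    return [(c * step - 1.0) * scale for c in codes]


weights = [0.12, -0.53, 0.08, 0.91, -0.27, 0.44, -0.76, 0.03]
codes, scale = quantize_block(weights)
restored = dequantize_block(codes, scale)

max_err = max(abs(w - r) for w, r in zip(weights, restored))
print("codes:", codes)
print(f"max abs error: {max_err:.3f}")
```

Only the integer codes and one scale per block need to stay in memory; the approximate floats exist briefly during computation. The rounding error is what the LoRA adapters, trained in higher precision, have to work around.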
VRAM comparison for fine-tuning:
| Model size | Full FT (BF16) | LoRA (BF16) | QLoRA (4-bit, r=16) |
|---|---|---|---|
| 7B | ~56 GB | ~17 GB | ~9 GB |
| 13B | ~104 GB | ~28 GB | ~14 GB |
| 34B | ~272 GB | ~60 GB | ~24 GB |
| 70B | ~560 GB | ~120 GB | ~40 GB |
QLoRA on a 7B model fits on a $300 consumer GPU. QLoRA on a 70B model fits on a single A100 (80 GB). The quality gap between QLoRA and full fine-tuning is typically 0–3% on downstream task metrics — negligible for most applications.
Unsloth: optimized GPU kernels (written in Triton) that speed up QLoRA training 2–4× and further reduce VRAM by fusing operations. Open-source, supports LLaMA, Mistral, and Phi families. If you’re running QLoRA on consumer hardware, use Unsloth.
Other PEFT Methods
| Method | Mechanism | VRAM vs Full | Quality vs Full | Best For |
|---|---|---|---|---|
| LoRA | Low-rank adapters in attention | ~3× less | 95–98% | General fine-tuning |
| QLoRA | 4-bit base + LoRA | ~6× less | 93–97% | Consumer GPUs, cost-sensitive |
| Prefix Tuning | Learnable prefix vectors prepended to the keys and values of every attention layer | ~10× less | 85–92% | Very low VRAM, simple tasks |
| Prompt Tuning | Learnable soft-prompt embeddings prepended to the input (no architecture change) | ~20× less | 80–90% | Multi-task serving, smallest delta |
| IA3 | Learned scaling vectors on keys, values, FFN | ~10× less | 90–95% | Comparable to LoRA with fewer params |
LoRA is the safe default. QLoRA when VRAM-constrained. Prompt tuning when you need to serve dozens of task-specific variants from one base model (swap soft prompts, not adapters).
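One way to compare these methods is by how many parameters each one trains, which is also the size of the per-task file you would swap at serving time. The sketch below assumes LLaMA-style 7B shapes, rank 16, and 20 virtual tokens; all of those values are illustrative assumptions, and prefix tuning is counted without the reparameterization network some implementations use during training.

```python
# Assumed LLaMA-style 7B shapes and settings, chosen only for illustration.
HIDDEN, FFN, LAYERS = 4096, 11008, 32
RANK, VIRTUAL_TOKENS = 16, 20

trainable = {
    # LoRA on q_proj and v_proj: rank * (d_in + d_out) per module per layer
    "LoRA (q, v)": 2 * RANK * (HIDDEN + HIDDEN) * LAYERS,
    # Prefix tuning: one key and one value vector per virtual token per layer
    "Prefix tuning": VIRTUAL_TOKENS * 2 * HIDDEN * LAYERS,
    # Prompt tuning: virtual token embeddings at the input only
    "Prompt tuning": VIRTUAL_TOKENS * HIDDEN,
    # IA3: one scaling vector each for keys, values, and the FFN activation
    "IA3": (HIDDEN + HIDDEN + FFN) * LAYERS,
}

for method, count in sorted(trainable.items(), key=lambda kv: -kv[1]):
    print(f"{method:<14} {count:>10,} params  ~{count * 2 / 1e6:.1f} MB at 16-bit")
```

Under these assumptions a soft prompt is a fraction of a megabyte, which is why swapping soft prompts per task is practical when serving many variants from one base model.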
Dataset Curation
Quality trumps quantity. The LIMA paper showed that 1,000 carefully curated examples can produce a fine-tune that matches models trained on 50× more data.
Format — Each example should match your deployment prompt format exactly:
{"messages": [
{"role": "system", "content": "You classify support tickets."},
{"role": "user", "content": "My payment won't go through, I've tried 3 cards"},
{"role": "assistant", "content": "billing"}
]}
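A small helper can keep every training example in exactly this shape. This is a minimal sketch: the helper names, the label check, and the second ticket are illustrative assumptions, and the system prompt should be whatever your deployment actually sends.

```python
import json

ALLOWED_LABELS = {"urgent", "billing", "technical", "general"}
SYSTEM_PROMPT = "You classify support tickets."


def make_example(ticket_text: str, label: str) -> dict:
    """Build one chat-format training example; rejects labels outside the task."""
    if label not in ALLOWED_LABELS:
        raise ValueError(f"unknown label: {label!r}")
    return {"messages": [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": ticket_text},
        {"role": "assistant", "content": label},
    ]}


def to_jsonl(examples) -> str:
    """Serialize examples as one JSON object per line (JSONL)."""
    return "\n".join(json.dumps(ex, ensure_ascii=False) for ex in examples)


rows = [
    ("My payment won't go through, I've tried 3 cards", "billing"),
    ("The app crashes every time I open settings", "technical"),  # made-up ticket
]
jsonl = to_jsonl(make_example(text, label) for text, label in rows)
print(jsonl)
```

Building examples through one function means the system prompt and label vocabulary cannot drift between training data and deployment.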
Size guidelines:
- Classification: 50–200 examples of each class
- Structured extraction: 100–500 examples
- Instruction following: 500–1,000 diverse examples
- Style adaptation: 1,000–5,000 examples
Quality checks:
- De-duplicate: near-duplicate examples cause overfitting. Cosine-similarity check your training set.
- Balance classes: with a 10:1 class imbalance, the model tends to default to the majority class.
- Include failures: the most valuable training examples are the ones your current prompt gets wrong.
- Audit for errors: a single mislabeled example in a 100-example dataset pollutes 1% of your training signal.
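The de-duplication and balance checks can be sketched with the standard library. This version compares bag-of-words vectors, which is a crude stand-in for embedding similarity; the 0.9 threshold, the function names, and the sample tickets are assumptions for the example.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def near_duplicates(texts, threshold=0.9):
    """Return index pairs whose similarity meets the (assumed) threshold."""
    bags = [Counter(t.lower().split()) for t in texts]
    return [(i, j)
            for i in range(len(bags))
            for j in range(i + 1, len(bags))
            if cosine(bags[i], bags[j]) >= threshold]


def imbalance_ratio(labels) -> float:
    """Largest class count divided by smallest class count."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


# Made-up tickets, for illustration only.
texts = [
    "I was charged twice for my subscription",
    "I was charged twice for my subscription today",
    "The app crashes when I open settings",
]
labels = ["billing", "billing", "technical"]

print(near_duplicates(texts))
print(imbalance_ratio(labels))
```

The pairwise comparison is quadratic in dataset size, which is fine for the hundreds-to-thousands of examples discussed here. Flagged pairs are candidates for manual review, not automatic deletion.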
Hyperparameters
| Parameter | Full Fine-Tuning | LoRA/QLoRA | Rationale |
|---|---|---|---|
| Learning rate | 1e-5 to 5e-5 | 1e-4 to 5e-4 | The adapter update starts at zero and only a small set of parameters is trained, so a higher LR is needed to converge |
| Epochs | 1–3 | 2–5 | More epochs for LoRA since it’s less expressive |
| Batch size | As large as VRAM allows (4–16) | 8–32 effective via gradient accumulation | Larger = more stable gradients |
| LR schedule | Cosine with warmup (10% of steps) | Cosine with warmup (10% of steps) | Warmup prevents early instability |
| Weight decay | 0.01–0.1 | 0.0–0.01 | The low-rank constraint already regularizes LoRA; little or no weight decay is needed |
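Two rows of the table are easy to make concrete: the cosine schedule with 10% warmup, and the effective batch size reached through gradient accumulation. The sketch below is one common formulation (linear warmup, decay to zero); the function names and the example values are assumptions, and training libraries may differ in details such as the final learning rate.

```python
import math


def lr_at_step(step: int, total_steps: int, peak_lr: float,
               warmup_frac: float = 0.1) -> float:
    """Linear warmup to peak_lr, then cosine decay toward zero."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


def effective_batch_size(per_device: int, grad_accum_steps: int,
                         num_devices: int = 1) -> int:
    """Number of examples contributing to each optimizer update."""
    return per_device * grad_accum_steps * num_devices


total = 100
for step in (0, 9, 10, 55, 99):
    print(step, f"{lr_at_step(step, total, peak_lr=2e-4):.2e}")

print(effective_batch_size(per_device=2, grad_accum_steps=8))
```

Gradient accumulation is what lets a VRAM-limited GPU reach the 8–32 effective batch size in the table: a per-device batch of 2 with 8 accumulation steps updates the weights as if the batch were 16.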
The golden rule for epochs: for fine-tuning datasets under 1,000 examples, 1–3 epochs is almost always enough. Beyond that, the model transitions from generalizing to memorizing — loss on training data keeps dropping while validation performance degrades silently.
Validation during training: log eval loss every N steps. Generate sample outputs on a held-out set and inspect them manually. A model with lower validation loss can still produce worse outputs — the loss only measures next-token prediction, not task quality.
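For the ticket-classification case, a task-level check on held-out outputs might look like the sketch below. The labels and predictions are hypothetical, and exact-match scoring is an assumption that suits single-label classification; generation tasks need a different metric.

```python
from collections import Counter


def evaluate(predictions, references):
    """Exact-match accuracy plus a count of each (expected, predicted) error."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references differ in length")
    correct = sum(p == r for p, r in zip(predictions, references))
    confusions = Counter(
        (r, p) for p, r in zip(predictions, references) if p != r
    )
    return correct / len(references), confusions


# Hypothetical held-out labels and model outputs, for illustration only.
references = ["billing", "urgent", "technical", "general", "billing"]
predictions = ["billing", "technical", "technical", "general", "general"]

accuracy, confusions = evaluate(predictions, references)
print(f"accuracy: {accuracy:.2f}")
for (expected, predicted), n in confusions.most_common():
    print(f"expected {expected!r}, got {predicted!r}: {n}")
```

The confusion counts show which classes get mixed up, which points directly at the examples worth adding in the next round of curation.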
Catastrophic Forgetting
Fine-tuning on a narrow task can overwrite general capabilities. The model becomes great at classifying support tickets but forgets how to summarize text or write code.
Detection: run a small benchmark (MMLU subset, HumanEval subset, or a custom eval) before and after fine-tuning. A drop of more than 5% in general capabilities is a strong sign of catastrophic forgetting.
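The before/after comparison can be automated with a small report function. This sketch reads the 5% threshold as a relative drop, which is an assumption about the intended meaning, and the benchmark names and scores are hypothetical.

```python
def forgetting_report(before: dict, after: dict,
                      max_relative_drop: float = 0.05) -> dict:
    """Flag benchmarks whose score fell by more than max_relative_drop.

    Treats the 5% threshold as a relative drop; that reading is an assumption.
    Baseline scores must be non-zero.
    """
    flagged = {}
    for name, base_score in before.items():
        drop = (base_score - after[name]) / base_score
        if drop > max_relative_drop:
            flagged[name] = round(drop, 3)
    return flagged


# Hypothetical scores, for illustration only.
before = {"mmlu_subset": 0.62, "humaneval_subset": 0.30, "custom_eval": 0.80}
after = {"mmlu_subset": 0.61, "humaneval_subset": 0.24, "custom_eval": 0.79}

print(forgetting_report(before, after))
```

Small benchmark subsets are noisy, so a flagged score is a prompt to rerun or inspect outputs rather than a verdict on its own.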
Mitigation:
- Data mixing: add 5–10% general-domain examples to your fine-tuning dataset. The model stays anchored to its original distribution.
- Lower learning rate: 1e-5 instead of 5e-5 — gentler updates preserve more of the base model.
- Early stopping: stop training the moment task accuracy plateaus, not when training loss reaches zero.
- LoRA implicit regularization: LoRA naturally resists catastrophic forgetting because it only updates a tiny fraction of parameters. This is an underappreciated advantage of PEFT methods.
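Data mixing from the list above can be sketched as follows. The function name, the placeholder examples, and the choice to sample general data without replacement are assumptions for illustration; the fraction is defined relative to the final mixed dataset.

```python
import random


def mix_datasets(task_examples, general_examples,
                 general_fraction=0.1, seed=0):
    """Return a shuffled mix where general data is ~general_fraction of the total."""
    if not 0 <= general_fraction < 1:
        raise ValueError("general_fraction must be in [0, 1)")
    # Solve n_general / (n_task + n_general) = general_fraction for n_general.
    n_general = round(len(task_examples) * general_fraction / (1 - general_fraction))
    n_general = min(n_general, len(general_examples))
    rng = random.Random(seed)
    mixed = list(task_examples) + rng.sample(list(general_examples), n_general)
    rng.shuffle(mixed)
    return mixed


task = [f"ticket-{i}" for i in range(90)]        # placeholder task examples
general = [f"general-{i}" for i in range(500)]   # placeholder general-domain pool

mixed = mix_datasets(task, general, general_fraction=0.1)
n_general = sum(ex.startswith("general-") for ex in mixed)
print(len(mixed), n_general)
```

Fixing the seed keeps the mix reproducible across retraining runs, so a change in benchmark scores can be attributed to the new task examples rather than to a different general-data sample.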
Practical Workflow
Axolotl: YAML-config-based fine-tuning framework. Define model, dataset, LoRA rank, hyperparameters in a config file. Handles data formatting, distributed training, checkpointing. Good for reproducible experiments.
Unsloth: wraps Hugging Face Transformers, PEFT, and TRL with optimized kernels rather than replacing them. 2–4× faster QLoRA, lower VRAM. Use when iterating rapidly on consumer hardware.
HuggingFace TRL: SFTTrainer for supervised fine-tuning, DPOTrainer for DPO, PPOTrainer for RLHF. The standard library for training loops. Works with Axolotl.
Typical iteration cycle: curate 50 examples → QLoRA with Unsloth (r=16, 2 epochs, ~10 minutes on a consumer GPU) → evaluate on held-out set → inspect failure cases → add 20 more examples targeting failures → retrain → repeat until plateau → increase dataset to 500 examples → run overnight with higher rank.
Key Takeaways
- Start with prompting, not fine-tuning. Fine-tune only when prompt engineering has reached diminishing returns.
- QLoRA is the default on consumer hardware. For most use cases it lands within a few percent of full fine-tuning quality. Reserve full FT for specialized domains with 10K+ examples.
- 50 high-quality examples beat 500 noisy ones. Curation time is the bottleneck, not GPU time. Audit every example.
- 1–3 epochs max for datasets under 1,000 examples. Beyond that you’re memorizing, not generalizing.
- Validate beyond loss. Generate sample outputs and inspect them. Loss can decrease while quality degrades.
- Test for catastrophic forgetting. Run a small benchmark suite before and after every fine-tune. Add 5–10% general data if performance drops.
- LoRA rank 16 is the safe default. Higher rank rarely helps for single-task fine-tuning. Lower rank can work for simple tasks.
References
- LoRA: Hu et al., 2021. LoRA: Low-Rank Adaptation of Large Language Models. arXiv:2106.09685
- QLoRA: Dettmers et al., 2023. QLoRA: Efficient Finetuning of Quantized Language Models. arXiv:2305.14314
- LIMA: Zhou et al., 2023. LIMA: Less Is More for Alignment. arXiv:2305.11206
- Unsloth: https://github.com/unslothai/unsloth
- Axolotl: https://github.com/axolotl-ai-cloud/axolotl
- TRL: https://github.com/huggingface/trl
- Prefix Tuning: Li & Liang, 2021. Prefix-Tuning: Optimizing Continuous Prompts for Generation. arXiv:2101.00190
- Prompt Tuning: Lester et al., 2021. The Power of Scale for Parameter-Efficient Prompt Tuning. arXiv:2104.08691
- IA3: Liu et al., 2022. Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning. arXiv:2205.05638