Fine-Tuning & Model Training
When to fine-tune vs prompt vs RAG. LoRA, QLoRA, RLHF, DPO, dataset preparation, and evaluation — the complete guide to model customisation.
The Decision Framework
Before fine-tuning, ask yourself:
| Need | Solution |
|---|---|
| Add knowledge (company docs, recent events) | RAG |
| Improve task performance with examples | Better prompting (few-shot) |
| Enforce a specific output format | Structured output / JSON mode |
| Change the model’s style, tone, or behaviour | Fine-tuning |
| Teach domain-specific terminology/patterns | Fine-tuning |
| Reduce token usage (shorter prompts) | Fine-tuning (distil system prompt into weights) |
Most teams should try prompting and RAG first. Fine-tuning is the last resort — not the first.
LoRA: Efficient Fine-Tuning
The Problem
Full fine-tuning of a 70B parameter model requires enormous compute — multiple high-end GPUs, days of training, and thousands of dollars.
The Solution
LoRA (Low-Rank Adaptation) freezes the original weights and trains small adapter matrices:
$$W' = W + BA$$
Where $W \in \mathbb{R}^{d \times k}$ is the frozen original weight matrix, and $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$ are small trainable matrices whose product has rank at most $r$ (typically 8-64).
Result: train under 1% of the parameters and typically achieve ~95% of full fine-tuning quality.
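The update above can be sketched in a few lines of NumPy (the hidden size, rank, and scaling are illustrative values, not prescriptions):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 512, 8  # hidden size and LoRA rank (illustrative)

W = rng.standard_normal((d, d))     # frozen pretrained weight
# LoRA adapters: B starts at zero so that W' == W before any training
A = rng.standard_normal((r, d)) * 0.01
B = np.zeros((d, r))

def lora_forward(x, W, A, B, alpha=16):
    """y = x @ W'.T where W' = W + (alpha / r) * B @ A."""
    scale = alpha / r
    return x @ W.T + scale * (x @ A.T) @ B.T

x = rng.standard_normal((1, d))
# Before training, the adapter contributes nothing:
assert np.allclose(lora_forward(x, W, A, B), x @ W.T)

# Only A and B are trainable
ratio = (A.size + B.size) / W.size
print(f"LoRA trains {ratio:.2%} of the parameters")  # ~3% at this toy size; <1% at scale
```

Note that the trainable fraction shrinks as the model grows: the adapter cost is linear in $d$ while the frozen weight is quadratic, which is why real LoRA runs train well under 1% of parameters.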
QLoRA: Even More Efficient
QLoRA quantises the frozen base model to 4-bit precision and trains LoRA adapters (kept in higher precision) on top. This lets you fine-tune a 65B-parameter model on a single 48 GB GPU, or a ~30B model on 24 GB of VRAM.
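To see why 4-bit storage preserves enough signal to train against, here is a deliberately simplified absmax int4 quantiser in NumPy (QLoRA itself uses the NF4 data type with double quantisation, so this is only an illustration of the idea):

```python
import numpy as np

def quantize_4bit(w):
    """Simplified symmetric absmax 4-bit quantisation (not NF4)."""
    scale = np.abs(w).max() / 7.0              # int4 symmetric range is [-7, 7]
    q = np.clip(np.round(w / scale), -7, 7).astype(np.int8)
    return q, scale

def dequantize_4bit(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal(1024).astype(np.float32)
q, scale = quantize_4bit(w)
w_hat = dequantize_4bit(q, scale)

# The reconstruction error is modest relative to the weight magnitudes,
# and the LoRA adapters can compensate for it during training.
rel_err = np.abs(w - w_hat).mean() / np.abs(w).mean()
print(f"mean relative error: {rel_err:.3f}")
```

Each weight now needs 4 bits instead of 16 or 32, which is where the roughly 4-8x memory saving on the frozen base model comes from.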
Dataset Preparation
This is where fine-tuning succeeds or fails.
Format
Most fine-tuning uses conversation format:
```json
{
  "messages": [
    {"role": "system", "content": "You are a helpful medical assistant."},
    {"role": "user", "content": "What are common symptoms of the flu?"},
    {"role": "assistant", "content": "The most common flu symptoms include..."}
  ]
}
```
Quality Guidelines
- Minimum: 50-100 high-quality examples (for LoRA)
- Ideal: 500-5,000 examples for robust performance
- Diversity: Cover the full range of expected inputs
- Consistency: All examples should follow the same format and style
- Review: Have domain experts validate every example
Common Mistakes
- Too few examples
- Inconsistent formatting across examples
- Training on examples the model already handles well (wasted compute)
- Not including edge cases and error handling
Alignment: RLHF vs DPO
After supervised fine-tuning, you may want to align the model with human preferences.
RLHF (Reinforcement Learning from Human Feedback)
1. Collect human preference data (A is better than B)
2. Train a reward model on the preferences
3. Use PPO to optimise the LLM against the reward model
Complex but well-established. Used by OpenAI, Anthropic.
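The reward-model step is typically trained with a Bradley-Terry objective: maximise the probability that the chosen response scores higher than the rejected one. A minimal sketch of that loss (function name and toy scores are illustrative):

```python
import numpy as np

def reward_model_loss(r_chosen, r_rejected):
    """Bradley-Terry loss: mean of -log sigmoid(r_chosen - r_rejected)."""
    margin = np.asarray(r_chosen) - np.asarray(r_rejected)
    return float(np.mean(np.log1p(np.exp(-margin))))  # -log σ(m) = log(1 + e^{-m})

# A reward model that ranks the chosen response higher gets a lower loss:
assert reward_model_loss([3.0], [0.0]) < reward_model_loss([0.0], [3.0])
# An indifferent reward model (equal scores) sits at log 2:
assert abs(reward_model_loss([1.0], [1.0]) - np.log(2)) < 1e-9
```

PPO then optimises the policy against this learned reward, usually with a KL penalty towards the original model to prevent reward hacking.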
DPO (Direct Preference Optimisation)
Skip the reward model entirely. Directly optimise the LLM using preference pairs:
$$\mathcal{L}_{\text{DPO}} = -\log \sigma\left(\beta \log \frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)}\right)$$
Where $y_w$ is the preferred response and $y_l$ is the rejected one.
DPO is simpler, more stable, and increasingly preferred for smaller-scale alignment.
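The DPO loss above is straightforward to compute once you have the summed log-probabilities of each response under the policy and the frozen reference model. A sketch for a single preference pair (argument names and toy values are illustrative):

```python
import numpy as np

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one pair: -log σ(β[(logπ_θ(y_w) - logπ_ref(y_w))
    - (logπ_θ(y_l) - logπ_ref(y_l))]), with summed per-response log-probs."""
    logits = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return float(np.log1p(np.exp(-logits)))  # -log σ(logits)

# Policy identical to the reference → zero implicit reward margin → loss = log 2
assert abs(dpo_loss(-10.0, -12.0, -10.0, -12.0) - np.log(2)) < 1e-9
# Policy upweights the preferred response relative to the reference → lower loss
assert dpo_loss(-9.0, -12.0, -10.0, -12.0) < np.log(2)
```

Note there is no reward model and no sampling loop: gradients flow directly through the policy log-probabilities, which is the source of DPO's simplicity and stability.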
Evaluation
Automated Metrics
- Perplexity — how surprised the model is by test data (lower = better)
- BLEU/ROUGE — overlap with reference answers (limited usefulness)
- LLM-as-Judge — use a stronger model to evaluate outputs
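Of these, perplexity is the easiest to compute yourself from the per-token log-probabilities on a held-out set:

```python
import numpy as np

def perplexity(token_logprobs):
    """Perplexity = exp of the mean negative log-probability per token."""
    return float(np.exp(-np.mean(token_logprobs)))

# A model assigning each token probability 0.25 has perplexity exactly 4
assert abs(perplexity(np.log([0.25, 0.25, 0.25])) - 4.0) < 1e-9
# Higher probability on the test data → lower perplexity (better)
assert perplexity(np.log([0.9, 0.8])) < perplexity(np.log([0.2, 0.1]))
```

Track it before and after fine-tuning on both your target task and a general corpus: a drop on the target data with a large rise elsewhere is a sign of catastrophic forgetting.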
Human Evaluation
The gold standard. Have domain experts rate outputs on:
- Accuracy
- Helpfulness
- Format adherence
- Safety
A/B Testing
Deploy the fine-tuned model alongside the base model and compare real-world performance with actual users.
Practical Tips
- Start with the smallest capable model — fine-tuning Llama 8B is much cheaper than 70B
- Use LoRA/QLoRA unless you have a very strong reason for full fine-tuning
- Invest 80% of your time in data quality
- Evaluate rigorously — track metrics before and after fine-tuning
- Version control your datasets — you’ll iterate many times
Key takeaway: Fine-tuning is a powerful tool but an expensive one. The decision of whether to fine-tune is as important as how you fine-tune. Most use cases are better served by RAG or improved prompting.
Quick Quiz
Test what you just learned. Pick the best answer for each question.
Q1 When should you fine-tune instead of using RAG or better prompting?
Q2 What is LoRA (Low-Rank Adaptation)?
Q3 What is DPO (Direct Preference Optimisation)?
Q4 What's the most critical factor in fine-tuning success?