LESSON 5 of 6 · Expert

Fine-Tuning & Model Training

When to fine-tune vs prompt vs RAG. LoRA, QLoRA, RLHF, DPO, dataset preparation, and evaluation — the complete guide to model customisation.

5 min read · 4 quiz questions

The Decision Framework

Before fine-tuning, ask yourself:

| Need | Solution |
| --- | --- |
| Add knowledge (company docs, recent events) | RAG |
| Improve task performance with examples | Better prompting (few-shot) |
| Enforce a specific output format | Structured output / JSON mode |
| Change the model's style, tone, or behaviour | Fine-tuning |
| Teach domain-specific terminology/patterns | Fine-tuning |
| Reduce token usage (shorter prompts) | Fine-tuning (distil system prompt into weights) |

Most teams should try prompting and RAG first. Fine-tuning is the last resort — not the first.

LoRA: Efficient Fine-Tuning

The Problem

Full fine-tuning of a 70B parameter model requires enormous compute — multiple high-end GPUs, days of training, and thousands of dollars.

The Solution

LoRA (Low-Rank Adaptation) freezes the original weights and trains small adapter matrices:

$$W' = W + BA$$

Where $W$ is the frozen original $d \times k$ weight matrix, and $B$ ($d \times r$) and $A$ ($r \times k$) are small trainable matrices with rank $r$ (typically 8-64).

Result: Train <1% of parameters, achieve ~95% of full fine-tuning quality.
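To make the parameter savings concrete, here is a toy numpy sketch of the update $W' = W + BA$. The layer shapes are illustrative, not taken from any particular model:

```python
import numpy as np

# Toy illustration of the LoRA update W' = W + B @ A.
d, k, r = 2048, 2048, 8            # original weight is d x k, adapters have rank r

rng = np.random.default_rng(0)
W = rng.standard_normal((d, k))    # frozen pretrained weight (never updated)
B = np.zeros((d, r))               # trainable; initialised to zero in the LoRA paper
A = rng.standard_normal((r, k)) * 0.01  # trainable

W_adapted = W + B @ A              # effective weight used at inference time

full_params = d * k
lora_params = d * r + r * k
print(f"trainable fraction: {lora_params / full_params:.4%}")
```

Because $B$ starts at zero, $BA = 0$ and the adapted model initially behaves exactly like the pretrained one; training then nudges it away from that starting point.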

QLoRA: Even More Efficient

QLoRA quantises the frozen base model to 4-bit precision and trains LoRA adapters (kept in higher precision) on top. The original QLoRA paper fine-tuned a 65B-parameter model on a single 48GB GPU; models in the 7-13B range fit comfortably on a 24GB consumer card.
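The memory saving comes from storing each weight in 4 bits instead of 32. The following is a simplified uniform-quantisation stand-in for the idea (real QLoRA uses the NF4 data type plus double quantisation, which this sketch does not implement):

```python
import numpy as np

# Quantise a weight vector to 4 bits: store each weight as an index into
# a 16-entry codebook, and dequantise (look up) on use.
rng = np.random.default_rng(1)
w = rng.standard_normal(1024).astype(np.float32)

levels = np.linspace(w.min(), w.max(), 16)   # 16 levels = 4 bits per weight
idx = np.abs(w[:, None] - levels[None, :]).argmin(axis=1).astype(np.uint8)
w_deq = levels[idx]                          # dequantised weights

fp32_bytes = w.nbytes                        # 4 bytes per weight
int4_bytes = len(idx) // 2                   # 2 weights per byte when packed
print(f"compression: {fp32_bytes // int4_bytes}x")
max_err = np.abs(w - w_deq).max()            # bounded by half a quantisation step
```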

Dataset Preparation

This is where fine-tuning succeeds or fails.

Format

Most fine-tuning uses conversation format:

{
  "messages": [
    {"role": "system", "content": "You are a helpful medical assistant."},
    {"role": "user", "content": "What are common symptoms of the flu?"},
    {"role": "assistant", "content": "The most common flu symptoms include..."}
  ]
}
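Training files are usually stored as JSONL: one conversation per line. A minimal sketch of writing and sanity-checking such a file (the validation rules here are illustrative, not a standard):

```python
import json
import os
import tempfile

# Two toy examples in the conversation format shown above.
examples = [
    {"messages": [
        {"role": "system", "content": "You are a helpful medical assistant."},
        {"role": "user", "content": "What are common symptoms of the flu?"},
        {"role": "assistant", "content": "Common flu symptoms include fever, cough, and fatigue."},
    ]},
    {"messages": [
        {"role": "system", "content": "You are a helpful medical assistant."},
        {"role": "user", "content": "How long is the flu contagious?"},
        {"role": "assistant", "content": "Most adults are contagious for about 5-7 days after symptoms begin."},
    ]},
]

def validate(example):
    """Cheap structural checks; real pipelines should do much more."""
    roles = [m["role"] for m in example["messages"]]
    assert all(r in {"system", "user", "assistant"} for r in roles)
    assert roles[-1] == "assistant", "each example must end with the model's reply"

path = os.path.join(tempfile.mkdtemp(), "train.jsonl")
with open(path, "w") as f:
    for ex in examples:
        validate(ex)
        f.write(json.dumps(ex) + "\n")
```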

Quality Guidelines

  • Minimum: 50-100 high-quality examples (for LoRA)
  • Ideal: 500-5,000 examples for robust performance
  • Diversity: Cover the full range of expected inputs
  • Consistency: All examples should follow the same format and style
  • Review: Have domain experts validate every example

Common Mistakes

  • Too few examples
  • Inconsistent formatting across examples
  • Training on examples the model already handles well (wasted compute)
  • Not including edge cases and error handling

Alignment: RLHF vs DPO

After supervised fine-tuning, you may want to align the model with human preferences.

RLHF (Reinforcement Learning from Human Feedback)

  1. Collect human preference data (A is better than B)
  2. Train a reward model on preferences
  3. Use PPO to optimise the LLM against the reward model

Complex but well-established. Used by OpenAI, Anthropic.

DPO (Direct Preference Optimisation)

Skip the reward model entirely. Directly optimise the LLM using preference pairs:

$$\mathcal{L}_{\text{DPO}} = -\log \sigma\left(\beta \log \frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)}\right)$$

Where $y_w$ is the preferred response and $y_l$ is the rejected one.

DPO is simpler, more stable, and increasingly preferred for smaller-scale alignment.
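The loss above can be computed directly from summed per-response log-probabilities under the policy and the frozen reference model. A minimal sketch (the log-prob values below are made up for illustration):

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one preference pair, given total log-probs of the
    preferred (w) and rejected (l) responses under policy and reference."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)

# If the policy favours the preferred response more strongly than the
# reference does, the margin is positive and the loss drops below log(2).
loss = dpo_loss(logp_w=-12.0, logp_l=-15.0, ref_logp_w=-13.0, ref_logp_l=-14.0)
```

At initialisation the policy equals the reference, every margin is zero, and the loss starts at exactly $\log 2$; training pushes it down by widening the margin between preferred and rejected responses.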

Evaluation

Automated Metrics

  • Perplexity — how surprised the model is by test data (lower = better)
  • BLEU/ROUGE — overlap with reference answers (limited usefulness)
  • LLM-as-Judge — use a stronger model to evaluate outputs
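
Perplexity, for instance, is just the exponential of the average negative log-likelihood per token:

```python
import math

def perplexity(token_logprobs):
    """Perplexity over a sequence, given per-token log-probabilities."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# A model that assigns probability 0.5 to every token has perplexity 2:
# it is, on average, "choosing between 2 equally likely options" per token.
ppl = perplexity([math.log(0.5)] * 10)
```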

Human Evaluation

The gold standard. Have domain experts rate outputs on:

  • Accuracy
  • Helpfulness
  • Format adherence
  • Safety

A/B Testing

Deploy the fine-tuned model alongside the base model and compare real-world performance with actual users.
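A quick way to check whether an observed difference is more than noise is a two-proportion z-test on a binary outcome such as "user rated the answer helpful". The counts below are invented for illustration:

```python
import math

def two_proportion_z(success_a, n_a, success_b, n_b):
    """z-score for the difference between two success rates."""
    p_a, p_b = success_a / n_a, success_b / n_b
    p = (success_a + success_b) / (n_a + n_b)           # pooled rate
    se = math.sqrt(p * (1 - p) * (1 / n_a + 1 / n_b))   # standard error
    return (p_b - p_a) / se

# Base model: 420/600 helpful; fine-tuned: 460/600 helpful.
z = two_proportion_z(420, 600, 460, 600)
# |z| > 1.96 means the difference is significant at the 5% level.
```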

Practical Tips

  1. Start with the smallest capable model — fine-tuning Llama 8B is much cheaper than 70B
  2. Use LoRA/QLoRA unless you have a very strong reason for full fine-tuning
  3. Invest 80% of your time in data quality
  4. Evaluate rigorously — track metrics before and after fine-tuning
  5. Version control your datasets — you’ll iterate many times

Key takeaway: Fine-tuning is a powerful tool but an expensive one. The decision of whether to fine-tune is as important as how you fine-tune. Most use cases are better served by RAG or improved prompting.

Quick Quiz

Test what you just learned. Pick the best answer for each question.

Q1 When should you fine-tune instead of using RAG or better prompting?

Q2 What is LoRA (Low-Rank Adaptation)?

Q3 What is DPO (Direct Preference Optimisation)?

Q4 What's the most critical factor in fine-tuning success?