Fine-Tuning & Model Training
When to fine-tune vs prompt vs RAG. LoRA, QLoRA, RLHF, DPO, dataset preparation, and evaluation — the complete guide to model customisation.
The Decision Framework
Before fine-tuning, ask yourself:
| Need | Solution |
|---|---|
| Add knowledge (company docs, recent events) | RAG |
| Improve task performance with examples | Better prompting (few-shot) |
| Enforce a specific output format | Structured output / JSON mode |
| Change the model’s style, tone, or behaviour | Fine-tuning |
| Teach domain-specific terminology/patterns | Fine-tuning |
| Reduce token usage (shorter prompts) | Fine-tuning (distil system prompt into weights) |
Most teams should try prompting and RAG first. Fine-tuning is the last resort — not the first.
LoRA: Efficient Fine-Tuning
The Problem
Full fine-tuning of a 70B parameter model requires enormous compute — multiple high-end GPUs, days of training, and thousands of dollars.
The Solution
LoRA (Low-Rank Adaptation) freezes the original weights and trains small adapter matrices:
$$W' = W + BA$$
Where $W \in \mathbb{R}^{d \times k}$ is the frozen original weight matrix, and $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$ are small trainable matrices whose product has rank at most $r$ (typically 8-64).
Result: train under 1% of the parameters and typically achieve ~95% of full fine-tuning quality.
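The update above can be sketched in a few lines of NumPy (the hidden size, rank, and scaling are illustrative values, not prescriptions):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 512, 8  # hidden size and LoRA rank (illustrative)

W = rng.standard_normal((d, d))     # frozen pretrained weight
# LoRA adapters: B starts at zero so that W' == W before any training
A = rng.standard_normal((r, d)) * 0.01
B = np.zeros((d, r))

def lora_forward(x, W, A, B, alpha=16):
    """y = x @ W'.T where W' = W + (alpha / r) * B @ A."""
    scale = alpha / r
    return x @ W.T + scale * (x @ A.T) @ B.T

x = rng.standard_normal((1, d))
# Before training, the adapter contributes nothing:
assert np.allclose(lora_forward(x, W, A, B), x @ W.T)

# Only A and B are trainable
ratio = (A.size + B.size) / W.size
print(f"LoRA trains {ratio:.2%} of the parameters")  # ~3% at this toy size; <1% at scale
```

Note that the trainable fraction shrinks as the model grows: the adapter cost is linear in $d$ while the frozen weight is quadratic, which is why real LoRA runs train well under 1% of parameters.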
QLoRA: Even More Efficient
QLoRA quantises the frozen base model to 4-bit precision and trains LoRA adapters (kept in higher precision) on top. This lets you fine-tune a 65B-parameter model on a single 48 GB GPU, or a ~30B model on 24 GB of VRAM.
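To see why 4-bit storage preserves enough signal to train against, here is a deliberately simplified absmax int4 quantiser in NumPy (QLoRA itself uses the NF4 data type with double quantisation, so this is only an illustration of the idea):

```python
import numpy as np

def quantize_4bit(w):
    """Simplified symmetric absmax 4-bit quantisation (not NF4)."""
    scale = np.abs(w).max() / 7.0              # int4 symmetric range is [-7, 7]
    q = np.clip(np.round(w / scale), -7, 7).astype(np.int8)
    return q, scale

def dequantize_4bit(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal(1024).astype(np.float32)
q, scale = quantize_4bit(w)
w_hat = dequantize_4bit(q, scale)

# The reconstruction error is modest relative to the weight magnitudes,
# and the LoRA adapters can compensate for it during training.
rel_err = np.abs(w - w_hat).mean() / np.abs(w).mean()
print(f"mean relative error: {rel_err:.3f}")
```

Each weight now needs 4 bits instead of 16 or 32, which is where the roughly 4-8x memory saving on the frozen base model comes from.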
Dataset Preparation
This is where fine-tuning succeeds or fails.
Format
Most fine-tuning uses conversation format:
```json
{
  "messages": [
    {"role": "system", "content": "You are a helpful medical assistant."},
    {"role": "user", "content": "What are common symptoms of the flu?"},
    {"role": "assistant", "content": "The most common flu symptoms include..."}
  ]
}
```
Quality Guidelines
- Minimum: 50-100 high-quality examples (for LoRA)
- Ideal: 500-5,000 examples for robust performance
- Diversity: Cover the full range of expected inputs
- Consistency: All examples should follow the same format and style
- Review: Have domain experts validate every example
Common Mistakes
- Too few examples
- Inconsistent formatting across examples
- Training on examples the model already handles well (wasted compute)
- Not including edge cases and error handling
Alignment: RLHF vs DPO
After supervised fine-tuning, you may want to align the model with human preferences.
RLHF (Reinforcement Learning from Human Feedback)
1. Collect human preference data (A is better than B)
2. Train a reward model on the preferences
3. Use PPO to optimise the LLM against the reward model
Complex but well-established. Used by OpenAI, Anthropic.
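The reward-model step is typically trained with a Bradley-Terry objective: maximise the probability that the chosen response scores higher than the rejected one. A minimal sketch of that loss (function name and toy scores are illustrative):

```python
import numpy as np

def reward_model_loss(r_chosen, r_rejected):
    """Bradley-Terry loss: mean of -log sigmoid(r_chosen - r_rejected)."""
    margin = np.asarray(r_chosen) - np.asarray(r_rejected)
    return float(np.mean(np.log1p(np.exp(-margin))))  # -log σ(m) = log(1 + e^{-m})

# A reward model that ranks the chosen response higher gets a lower loss:
assert reward_model_loss([3.0], [0.0]) < reward_model_loss([0.0], [3.0])
# An indifferent reward model (equal scores) sits at log 2:
assert abs(reward_model_loss([1.0], [1.0]) - np.log(2)) < 1e-9
```

PPO then optimises the policy against this learned reward, usually with a KL penalty towards the original model to prevent reward hacking.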
DPO (Direct Preference Optimisation)
Skip the reward model entirely. Directly optimise the LLM using preference pairs:
$$\mathcal{L}_{\text{DPO}} = -\log \sigma\left(\beta \log \frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)}\right)$$
Where $y_w$ is the preferred response and $y_l$ is the rejected one.
DPO is simpler, more stable, and increasingly preferred for smaller-scale alignment.
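The DPO loss above is straightforward to compute once you have the summed log-probabilities of each response under the policy and the frozen reference model. A sketch for a single preference pair (argument names and toy values are illustrative):

```python
import numpy as np

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one pair: -log σ(β[(logπ_θ(y_w) - logπ_ref(y_w))
    - (logπ_θ(y_l) - logπ_ref(y_l))]), with summed per-response log-probs."""
    logits = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return float(np.log1p(np.exp(-logits)))  # -log σ(logits)

# Policy identical to the reference → zero implicit reward margin → loss = log 2
assert abs(dpo_loss(-10.0, -12.0, -10.0, -12.0) - np.log(2)) < 1e-9
# Policy upweights the preferred response relative to the reference → lower loss
assert dpo_loss(-9.0, -12.0, -10.0, -12.0) < np.log(2)
```

Note there is no reward model and no sampling loop: gradients flow directly through the policy log-probabilities, which is the source of DPO's simplicity and stability.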
Evaluation
Automated Metrics
- Perplexity — how surprised the model is by test data (lower = better)
- BLEU/ROUGE — overlap with reference answers (limited usefulness)
- LLM-as-Judge — use a stronger model to evaluate outputs
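Of these, perplexity is the easiest to compute yourself from the per-token log-probabilities on a held-out set:

```python
import numpy as np

def perplexity(token_logprobs):
    """Perplexity = exp of the mean negative log-probability per token."""
    return float(np.exp(-np.mean(token_logprobs)))

# A model assigning each token probability 0.25 has perplexity exactly 4
assert abs(perplexity(np.log([0.25, 0.25, 0.25])) - 4.0) < 1e-9
# Higher probability on the test data → lower perplexity (better)
assert perplexity(np.log([0.9, 0.8])) < perplexity(np.log([0.2, 0.1]))
```

Track it before and after fine-tuning on both your target task and a general corpus: a drop on the target data with a large rise elsewhere is a sign of catastrophic forgetting.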
Human Evaluation
The gold standard. Have domain experts rate outputs on:
- Accuracy
- Helpfulness
- Format adherence
- Safety
A/B Testing
Deploy the fine-tuned model alongside the base model and compare real-world performance with actual users.
Practical Tips
- Start with the smallest capable model — fine-tuning Llama 8B is much cheaper than 70B
- Use LoRA/QLoRA unless you have a very strong reason for full fine-tuning
- Invest 80% of your time in data quality
- Evaluate rigorously — track metrics before and after fine-tuning
- Version control your datasets — you’ll iterate many times
Key takeaway: Fine-tuning is a powerful tool but an expensive one. The decision of whether to fine-tune is as important as how you fine-tune. Most use cases are better served by RAG or improved prompting.
Quick Quiz
Test what you just learned. Pick the best answer for each question.
Q1 When should you fine-tune instead of using RAG or better prompting?
Q2 What is LoRA (Low-Rank Adaptation)?
Q3 What is DPO (Direct Preference Optimisation)?
Q4 What's the most critical factor in fine-tuning success?