From Base Models to Deployable LLMs: A Complete Fine-Tuning Playbook
The Problem
Most organizations experimenting with large language models hit the same wall: base models don't follow the domain's tone, policy, or output structure. RAG alone doesn't fix behavior: it supplies knowledge, but JSON schema adherence, refusals, and style remain inconsistent. Full fine-tuning is too expensive for most teams, and even a well-tuned model fails in production if latency and VRAM costs aren't budgeted for.
Techniques (what they are + why they matter)
| Technique | What it is | Problem it solves |
|---|---|---|
| Knowledge Distillation | Teacher → student learning with soft targets (the teacher's output probabilities) | Large models are too slow or costly: compress into a smaller model that keeps most of the quality. |
| PEFT (LoRA / QLoRA) | Train tiny adapters, freeze base (QLoRA loads base in 4-bit) | Full fine-tuning is expensive—adapt behavior with minimal compute/VRAM. |
| Instruction Fine-Tuning (SFT) | Train on instruction → input → response examples | Base models don’t follow format/tone—enforce consistent structure and instruction following. |
| Preference Alignment (DPO) | Train on (prompt, chosen, rejected) pairs | Outputs are acceptable but not preferred: shift them toward clearer, more helpful responses. Safety still needs safety data, evals, and guardrails. |
| Quantization | Reduce precision (8-bit/4-bit) | Models too big for production—lower VRAM and faster inference (format/runtime dependent). |
| Domain Adaptation (continued pretraining) | Train on domain text (extracted from docs/PDFs) | Model lacks domain vocabulary/patterns—improve fluency before instruction tuning. |
| Hugging Face ecosystem | Datasets + Transformers + PEFT + TRL + Hub | Avoid scattered workflows—standardize training, versioning, and sharing. |
How each technique works
1) Knowledge Distillation
Use a strong teacher (pretrained or fine-tuned), freeze it, then train a student to match the teacher’s output distribution. Great for lowering latency/cost.
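To make "match the teacher's output distribution" concrete, here is a minimal sketch of a soft-target distillation loss in plain Python. It assumes the common formulation (KL divergence between temperature-softened distributions, scaled by the squared temperature). The logits, the vocabulary size, and the temperature of 2.0 are illustrative values, not settings taken from this playbook.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities, softened by a temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions.

    Multiplied by temperature**2, a common convention that keeps
    gradient magnitudes comparable across temperatures.
    """
    p = softmax(teacher_logits, temperature)  # soft targets from the frozen teacher
    q = softmax(student_logits, temperature)  # student predictions
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2

# Hypothetical logits over a 4-token vocabulary at one position.
teacher = [4.0, 1.0, 0.5, -2.0]
student_far = [0.0, 0.0, 0.0, 0.0]      # untrained student: uniform guess
student_close = [3.8, 1.1, 0.4, -1.9]   # student after some training

loss_far = distillation_loss(teacher, student_far)
loss_close = distillation_loss(teacher, student_close)
```

The loss shrinks as the student's distribution approaches the teacher's and is zero when they match. In a real training loop this term is usually averaged over every token position and often mixed with the ordinary cross-entropy loss on ground-truth labels.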
2) PEFT (LoRA / QLoRA)
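Freeze the base model and train small low-rank adapter matrices attached to selected weight matrices. QLoRA additionally loads the frozen base in 4-bit to reduce memory. Works best when compute or VRAM is the binding constraint.
Common knobs: lora_r (adapter rank), lora_alpha (adapter scaling factor), target_modules (which layers get adapters), cutoff_len (maximum sequence length), gradient_accumulation_steps (micro-batches accumulated per optimizer step).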
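The sketch below shows what a low-rank adapter does to a single linear layer, using plain Python lists in place of tensors. It assumes the standard LoRA formulation, y = Wx + (lora_alpha / lora_r) · B(Ax), with B initialized to zero. The dimensions and the helper names (`matvec`, `lora_forward`, `trainable_fraction`) are illustrative, not part of any library.

```python
import random

def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]

def lora_forward(x, w, a, b, lora_alpha, lora_r):
    """Frozen base projection plus a scaled low-rank update."""
    base = matvec(w, x)                # frozen path: W x
    update = matvec(b, matvec(a, x))   # trainable path: B (A x)
    scale = lora_alpha / lora_r
    return [y + scale * u for y, u in zip(base, update)]

def trainable_fraction(d_out, d_in, lora_r):
    """Adapter parameters as a fraction of the full weight matrix."""
    return lora_r * (d_in + d_out) / (d_out * d_in)

random.seed(0)
d_in, d_out, lora_r, lora_alpha = 8, 6, 2, 4
w = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]      # frozen
a = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(lora_r)]  # trainable
b = [[0.0] * lora_r for _ in range(d_out)]                                 # trainable, zero init
x = [1.0] * d_in

y0 = lora_forward(x, w, a, b, lora_alpha, lora_r)
fraction = trainable_fraction(4096, 4096, 8)
```

Because B starts at zero, the adapted layer initially reproduces the frozen layer exactly, so training begins from the base model's behavior. For a hypothetical 4096 × 4096 projection with rank 8, the adapter holds 1/256 of the layer's parameters (about 0.39%), which is where the compute and VRAM savings come from.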
3) Instruction Fine-Tuning (SFT)
Train on instruction datasets (Alpaca/ShareGPT/custom JSONL) to lock in formatting, tone, and response structure.
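As an illustration of what "instruction → input → response" data looks like once it reaches the trainer, the sketch below parses JSONL records with Alpaca-style field names (`instruction`, `input`, `output`) and renders each into one training string. The prompt template is a simplified assumption; real projects should use whatever template their base model or training tool expects.

```python
import json

# Simplified, assumed prompt templates (not an official format).
PROMPT_WITH_INPUT = (
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n"
)
PROMPT_NO_INPUT = "### Instruction:\n{instruction}\n\n### Response:\n"

def load_jsonl(text):
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]

def format_example(record):
    """Return (prompt, full_text) for one instruction record."""
    if record.get("input"):
        prompt = PROMPT_WITH_INPUT.format(
            instruction=record["instruction"], input=record["input"]
        )
    else:
        prompt = PROMPT_NO_INPUT.format(instruction=record["instruction"])
    return prompt, prompt + record["output"]

# Two hypothetical records, serialized the way a custom JSONL file would be.
jsonl = "\n".join([
    json.dumps({
        "instruction": "Summarize the ticket.",
        "input": "Printer offline since Monday.",
        "output": "Printer has been offline since Monday.",
    }),
    json.dumps({"instruction": "Say hello.", "input": "", "output": "Hello!"}),
])

examples = [format_example(record) for record in load_jsonl(jsonl)]
```

The prompt is returned separately from the full text because many SFT setups compute the loss only on the response tokens. Keeping the boundary explicit makes that masking straightforward.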
4) DPO
Use preference pairs to move from “correct” to “better.” Don’t oversell it: DPO aligns to your preference data; safety requires explicit safety work and eval gates.
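For readers who want to see what DPO optimizes, here is a minimal sketch of the per-pair loss, assuming the standard formulation: the negative log-sigmoid of beta times the difference between how much the policy has raised the chosen response and how much it has raised the rejected one, each measured against a frozen reference model. The log-probabilities and beta = 0.1 are made-up illustrative values.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (prompt, chosen, rejected) pair.

    Each argument is the summed log-probability of a full response
    under the policy being trained or under the frozen reference.
    """
    chosen_gain = policy_chosen_logp - ref_chosen_logp
    rejected_gain = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_gain - rejected_gain)
    # -log(sigmoid(margin)), written to avoid overflow for large |margin|
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Hypothetical sequence log-probabilities.
at_start = dpo_loss(-10.0, -12.0, -10.0, -12.0)   # policy identical to reference
improved = dpo_loss(-8.0, -14.0, -10.0, -12.0)    # chosen up, rejected down
regressed = dpo_loss(-12.0, -10.0, -10.0, -12.0)  # moved the wrong way
```

When the policy equals the reference the loss is log 2. It falls as the policy favors the chosen response more than the reference does and rises in the opposite case. Nothing in the objective knows about safety: it only encodes whatever ordering the preference pairs express.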
5) Quantization
Convert FP16 weights to 8-bit or 4-bit to reduce VRAM and, depending on the format and runtime, speed up inference. Match the artifact to the runtime:
- GPTQ/AWQ → GPU runtimes (support-dependent)
- GGUF → llama.cpp (CPU/edge/Apple Silicon)
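The core idea of reducing precision can be sketched with symmetric absmax round-to-nearest quantization. This is a deliberately simple stand-in, not what GPTQ, AWQ, or the GGUF quantization types actually do (those use calibration data, per-group scales, and other refinements). The weight values are made up.

```python
def quantize_absmax(weights, bits=8):
    """Map floats to signed integers using a single shared scale."""
    qmax = 2 ** (bits - 1) - 1                 # 127 for 8-bit, 7 for 4-bit
    scale = max(abs(w) for w in weights) / qmax
    if scale == 0:
        scale = 1.0                            # all-zero input: any scale works
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate floats from the integers and the scale."""
    return [v * scale for v in q]

def max_error(weights, bits):
    """Largest absolute round-trip error at the given bit width."""
    q, scale = quantize_absmax(weights, bits)
    return max(abs(w - r) for w, r in zip(weights, dequantize(q, scale)))

# Hypothetical weights from one small tensor.
weights = [0.12, -0.53, 0.98, -1.27, 0.04, 0.61]

q8, s8 = quantize_absmax(weights, bits=8)
q4, s4 = quantize_absmax(weights, bits=4)
err8 = max_error(weights, 8)
err4 = max_error(weights, 4)
```

Fewer bits mean a coarser grid and a larger round-trip error, which is why 4-bit artifacts should be re-evaluated on the target task before deployment rather than assumed to match the FP16 model.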
6) Domain adaptation
Continue pretraining on domain text to internalize jargon and conventions. This is not RAG—it changes model priors.
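Continued pretraining uses plain next-token prediction on raw text, so data preparation looks different from instruction tuning: there are no prompt/response pairs, just a token stream cut into fixed-length blocks. The sketch below shows that packing step under loud assumptions: whitespace splitting stands in for a real tokenizer, and `<eos>` is a placeholder boundary marker.

```python
def pack_into_blocks(documents, block_size, eos_token="<eos>"):
    """Concatenate documents into one token stream and cut fixed-size blocks."""
    stream = []
    for doc in documents:
        stream.extend(doc.split())   # assumption: whitespace split, not a real tokenizer
        stream.append(eos_token)     # mark the document boundary
    n_blocks = len(stream) // block_size
    # The trailing partial block is dropped for simplicity.
    return [stream[i * block_size:(i + 1) * block_size] for i in range(n_blocks)]

# Hypothetical domain text, e.g. sentences extracted from contracts.
docs = [
    "The lessee shall indemnify the lessor",
    "Force majeure clauses suspend performance obligations",
]

blocks = pack_into_blocks(docs, block_size=4)
```

Each block becomes one training sequence in which the model predicts every token from the ones before it. That is how domain vocabulary and phrasing end up in the weights instead of being retrieved at query time.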
7) Hugging Face
Use HF tools end-to-end: load data with Datasets, train with Transformers plus PEFT/TRL, and version models and adapters on the Hub for reproducibility.
The Full Pipeline
- Choose base model (e.g., TinyLlama, Gemma)
- Prepare instruction data (Alpaca, ShareGPT, or custom JSONL)
- Configure PEFT (LoRA/QLoRA) and training
- Train with SFTTrainer or LLaMA Factory
- Optional: preference alignment (DPO) for style and helpfulness (safety needs its own data and eval gates)
- Quantize (GPTQ/AWQ/GGUF) for deployment
- Serve via vLLM, llama.cpp, or Inference API
Decision Guide
| Need | Solution |
|---|---|
| Wrong behavior | Fine-tuning (SFT + PEFT) |
| Missing knowledge | RAG |
| Multi-step reasoning / tools | Agents |
| Smaller, faster model | Knowledge distillation |
| Limited GPU memory | QLoRA + quantization |
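The "Limited GPU memory" row comes down to simple arithmetic on weight storage. The sketch below estimates the memory needed for weights alone at different precisions. The 7-billion-parameter model size is a hypothetical example, and the estimate deliberately ignores activations, the KV cache, optimizer state, and the small overhead of quantization scales.

```python
def weight_memory_gb(n_params, bits_per_weight):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

# Hypothetical 7-billion-parameter model at three precisions.
n_params = 7e9
estimates = {bits: weight_memory_gb(n_params, bits) for bits in (16, 8, 4)}

for bits, gigabytes in estimates.items():
    print(f"{bits:>2}-bit weights: about {gigabytes:.1f} GB")
```

Treat these numbers as lower bounds: training with QLoRA or serving with long contexts adds memory on top of the weights, so leave headroom when sizing hardware.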
References
- Hugging Face Transformers, PEFT, TRL
- LLaMA Factory documentation
- Hands-On Large Language Models, O'Reilly
- Sunny Savita LLM Finetuning
