From Base Models to Deployable LLMs: A Complete Fine-Tuning Playbook

The Problem

Most organizations experimenting with large language models hit the same wall: base models don't follow domain tone, policy, or structure. RAG alone doesn't fix behavior, because it supplies knowledge but leaves JSON schemas, refusals, and style inconsistent. Full fine-tuning is often too expensive, and even a well-tuned model fails in production if latency and VRAM costs are ignored.

Techniques (what they are + why they matter)

Technique	What it is	Problem it solves
Knowledge Distillation	Teacher → student learning with soft targets	Large models are too slow/costly—compress to a smaller model with most quality.
PEFT (LoRA / QLoRA)	Train tiny adapters, freeze base (QLoRA loads base in 4-bit)	Full fine-tuning is expensive—adapt behavior with minimal compute/VRAM.
Instruction Fine-Tuning (SFT)	Train on instruction → input → response examples	Base models don’t follow format/tone—enforce consistent structure and instruction following.
Preference Alignment (DPO)	Train on (prompt, chosen, rejected) pairs	Make outputs more preferred (clearer, more helpful); safety needs safety data + eval + guardrails.
Quantization	Reduce precision (8-bit/4-bit)	Models too big for production—lower VRAM and faster inference (format/runtime dependent).
Domain Adaptation (continued pretraining)	Train on domain text (extracted from docs/PDFs)	Model lacks domain vocabulary/patterns—improve fluency before instruction tuning.
Hugging Face ecosystem	Datasets + Transformers + PEFT + TRL + Hub	Avoid scattered workflows—standardize training, versioning, and sharing.

How each technique works

1) Knowledge Distillation

Use a strong teacher (pretrained or fine-tuned), freeze it, then train a student to match the teacher’s output distribution. Great for lowering latency/cost.
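To make "match the teacher's output distribution" concrete, here is a minimal sketch of a soft-target distillation loss in plain Python. It assumes the common formulation (temperature-scaled softmax plus KL divergence, multiplied by T^2); the logits and the temperature value are made up for illustration, and a real training loop would typically compute this on tensors and often mix it with the usual cross-entropy on hard labels.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; a higher temperature gives softer targets."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)  # soft targets from the frozen teacher
    q = softmax(student_logits, temperature)  # student predictions
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2

# Hypothetical logits over a 3-token vocabulary.
teacher = [4.0, 1.0, 0.5]
close_student = [3.5, 1.2, 0.4]
far_student = [0.5, 1.0, 4.0]

print("close student loss:", distillation_loss(close_student, teacher))
print("far student loss:  ", distillation_loss(far_student, teacher))
```

The loss is zero when the student reproduces the teacher's distribution exactly and grows as the two diverge, which is the signal the student is trained on.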

2) PEFT (LoRA / QLoRA)

Freeze the base model and train small low-rank adapter matrices on top of it. QLoRA additionally loads the frozen base in 4-bit to reduce memory. Works best when GPU memory or training budget is the constraint.

Common knobs: lora_r, lora_alpha, target_modules, cutoff_len, gradient_accumulation_steps.
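The sketch below illustrates what lora_r and lora_alpha control, using plain Python lists instead of a tensor library. It assumes the standard LoRA formulation, y = W x + (lora_alpha / lora_r) * B (A x), with B initialized to zero; the matrix sizes and helper names are illustrative and not taken from any particular library.

```python
import random

def matvec(M, x):
    """Multiply matrix M (a list of rows) by vector x."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def lora_forward(W, A, B, x, lora_r, lora_alpha):
    """Frozen base projection plus the scaled low-rank update."""
    base = matvec(W, x)                 # W stays frozen during training
    update = matvec(B, matvec(A, x))    # only A (r x d_in) and B (d_out x r) are trained
    scale = lora_alpha / lora_r
    return [b + scale * u for b, u in zip(base, update)]

def adapter_param_count(d_in, d_out, lora_r):
    """Trainable parameters added by one adapted weight matrix."""
    return lora_r * (d_in + d_out)

random.seed(0)
d_in, d_out, lora_r, lora_alpha = 8, 8, 2, 4
W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
A = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(lora_r)]
B = [[0.0] * lora_r for _ in range(d_out)]  # zero init: the adapter starts as a no-op
x = [1.0] * d_in

# With B at zero, the adapted layer reproduces the base layer exactly.
print(lora_forward(W, A, B, x, lora_r, lora_alpha) == matvec(W, x))

# Parameter savings for a hypothetical 4096 x 4096 projection at rank 16.
full = 4096 * 4096
lora = adapter_param_count(4096, 4096, 16)
print(full, lora, lora / full)
```

Raising lora_r adds capacity (and parameters) linearly, while lora_alpha only rescales the update; target_modules decides which weight matrices receive an adapter like this one.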

3) Instruction Fine-Tuning (SFT)

Train on instruction datasets (Alpaca/ShareGPT/custom JSONL) to lock in formatting, tone, and response structure.
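As a rough illustration of the data side, the sketch below turns Alpaca-style records (instruction, input, output fields) into prompt/completion pairs. The prompt template is a simplified stand-in chosen for this example, not an official format, and the two records are invented; a real run would read the lines from a JSONL file and hand the formatted text to a trainer.

```python
import json

# Simplified Alpaca-style templates (assumed for illustration only).
TEMPLATE_WITH_INPUT = (
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"
)
TEMPLATE_NO_INPUT = "### Instruction:\n{instruction}\n\n### Response:\n"

def format_example(record):
    """Turn one record into (prompt, completion) strings."""
    if record.get("input"):
        prompt = TEMPLATE_WITH_INPUT.format(
            instruction=record["instruction"], input=record["input"]
        )
    else:
        prompt = TEMPLATE_NO_INPUT.format(instruction=record["instruction"])
    return prompt, record["output"]

def load_jsonl(lines):
    """Parse JSONL content: one JSON object per non-empty line."""
    return [json.loads(line) for line in lines if line.strip()]

# Hypothetical records; the first teaches a fixed JSON response structure.
records = [
    {
        "instruction": "Summarize the support ticket as JSON.",
        "input": "Printer on floor 3 has been offline since Monday.",
        "output": json.dumps({"summary": "Printer offline", "priority": "medium"}),
    },
    {"instruction": "Greet the customer politely.", "input": "", "output": "Hello, how can I help you today?"},
]
jsonl_lines = [json.dumps(r) for r in records]  # what a custom JSONL file would hold

examples = [format_example(r) for r in load_jsonl(jsonl_lines)]
for prompt, completion in examples:
    print(prompt + completion)
    print("-" * 40)
```

Keeping one template across every example is what teaches the model a consistent structure; the same template then has to be used at inference time.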

4) DPO

Use preference pairs to move from “correct” to “better.” Don’t oversell it: DPO aligns to your preference data; safety requires explicit safety work and eval gates.
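For intuition, here is a minimal sketch of the DPO loss for a single (prompt, chosen, rejected) pair. It assumes the standard formulation, in which the policy is compared against a frozen reference model through summed log-probabilities of each response; the numbers and the beta value are made up, and a real setup would get the log-probabilities from the two models.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """-log(sigmoid(beta * (chosen log-ratio - rejected log-ratio)))."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(m)) == log(1 + exp(-m))
    return math.log1p(math.exp(-margin))

# Hypothetical sequence log-probabilities.
no_change = dpo_loss(-12.0, -15.0, -12.0, -15.0)  # policy identical to reference
improved = dpo_loss(-10.0, -18.0, -12.0, -15.0)   # policy shifted toward chosen
worse = dpo_loss(-14.0, -13.0, -12.0, -15.0)      # policy shifted toward rejected

print(no_change, improved, worse)
```

The loss only rewards the policy for widening the gap between chosen and rejected relative to the reference, which is why the result can be no better than the preference pairs it is given.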

5) Quantization

Convert FP16 weights to 8-bit/4-bit to reduce VRAM and, depending on format and runtime, speed up inference. Match the artifact to the runtime:

GPTQ/AWQ → GPU runtimes (support-dependent)
GGUF → llama.cpp (CPU/edge/Apple Silicon)
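The sketch below shows the basic idea behind weight quantization with a simple symmetric round-to-nearest scheme and a back-of-the-envelope memory estimate. It is deliberately simpler than GPTQ, AWQ, or the GGUF formats, which use per-group scales and calibration; the function names and the 7-billion-parameter figure are assumptions for illustration.

```python
def quantize_absmax(weights, bits=8):
    """Symmetric round-to-nearest quantization with one scale for all weights."""
    qmax = 2 ** (bits - 1) - 1                        # 127 for 8-bit, 7 for 4-bit
    scale = max(abs(w) for w in weights) / qmax or 1.0  # avoid a zero scale
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Map integer codes back to approximate float weights."""
    return [v * scale for v in q]

def weight_memory_gb(num_params, bits):
    """Weights only; ignores KV cache, activations, and scale overhead."""
    return num_params * bits / 8 / 1e9

weights = [0.42, -1.30, 0.07, 0.91, -0.55]
for bits in (8, 4):
    q, scale = quantize_absmax(weights, bits)
    restored = dequantize(q, scale)
    max_err = max(abs(w - r) for w, r in zip(weights, restored))
    print(bits, "bit codes:", q, "max error:", max_err)

# Rough weight footprint of a hypothetical 7-billion-parameter model.
for bits in (16, 8, 4):
    print(bits, "bit:", weight_memory_gb(7_000_000_000, bits), "GB")
```

Fewer bits mean a coarser grid and larger rounding error, so quantized artifacts should be re-evaluated on the same tests as the full-precision model.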

6) Domain adaptation

Continue pretraining on domain text to internalize jargon and conventions. This is not RAG—it changes model priors.
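On the data-preparation side, continued pretraining usually means turning raw domain text into fixed-length token blocks for next-token prediction. The sketch below shows one common way to do that (concatenate documents, then cut equal-sized blocks); whitespace splitting stands in for a real tokenizer, and the "<eos>" marker, block size, and sample sentences are assumptions made for the example.

```python
def pack_into_blocks(documents, block_size, eos="<eos>"):
    """Concatenate tokenized documents and cut them into equal-length blocks."""
    stream = []
    for doc in documents:
        stream.extend(doc.split())   # stand-in for a real subword tokenizer
        stream.append(eos)           # mark the document boundary
    usable = (len(stream) // block_size) * block_size  # drop the ragged tail
    return [stream[i:i + block_size] for i in range(0, usable, block_size)]

# Hypothetical text extracted from domain documents.
domain_docs = [
    "The claimant must file form 27B within thirty days of the incident.",
    "Subrogation rights transfer to the insurer after settlement.",
]

blocks = pack_into_blocks(domain_docs, block_size=8)
for block in blocks:
    print(block)
```

Unlike instruction data, these blocks carry no prompt/response structure: the model simply learns to predict each next token, which is how domain vocabulary and phrasing end up in its priors.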

7) Hugging Face

Use HF tools end-to-end: load datasets, train with PEFT/TRL, and version models/adapters for reproducibility.

The Full Pipeline

Choose base model (e.g., TinyLlama, Gemma)
Prepare instruction data (Alpaca, ShareGPT, or custom JSONL)
Configure PEFT (LoRA/QLoRA) and training
Train with SFTTrainer or LLaMA Factory
Optional: Preference alignment (DPO) for style and helpfulness (safety needs its own data and eval gates)
Quantize (GPTQ/AWQ/GGUF) for deployment
Serve via vLLM, llama.cpp, or Inference API

Decision Guide

Need	Solution
Wrong behavior	Fine-tuning (SFT + PEFT)
Missing knowledge	RAG
Multi-step reasoning / tools	Agents
Smaller, faster model	Knowledge distillation
Limited GPU memory	QLoRA + quantization

References

Hugging Face Transformers, PEFT, TRL
LLaMA Factory documentation
Hands-On Large Language Models, O'Reilly
Sunny Savita LLM Finetuning
