
From Base Models to Deployable LLMs: A Complete Fine-Tuning Playbook

The Problem

Most organizations experimenting with large language models hit the same wall: base models don't follow domain tone, policy, or structure. RAG alone doesn't fix behavior: JSON schema adherence, refusals, and style remain inconsistent. Full fine-tuning is often too expensive, and even a well-tuned model fails in production if latency and VRAM costs aren't considered.

Techniques (what they are + why they matter)

Technique | What it is | Problem it solves
Knowledge Distillation | Teacher → student learning with soft targets | Large models are too slow/costly; compress to a smaller model that retains most of the quality.
PEFT (LoRA / QLoRA) | Train tiny adapters, freeze base (QLoRA loads base in 4-bit) | Full fine-tuning is expensive; adapt behavior with minimal compute/VRAM.
Instruction Fine-Tuning (SFT) | Train on instruction → input → response examples | Base models don’t follow format/tone; enforce consistent structure and instruction following.
Preference Alignment (DPO) | Train on (prompt, chosen, rejected) pairs | Outputs are correct but not preferred; shift toward clearer, more helpful responses. Safety still needs safety data, evals, and guardrails.
Quantization | Reduce precision (8-bit/4-bit) | Models are too big for production; lowers VRAM use and can speed up inference (format/runtime dependent).
Domain Adaptation (continued pretraining) | Train on domain text (extracted from docs/PDFs) | Model lacks domain vocabulary/patterns; improve fluency before instruction tuning.
Hugging Face ecosystem | Datasets + Transformers + PEFT + TRL + Hub | Avoid scattered workflows; standardize training, versioning, and sharing.

How each technique works

1) Knowledge Distillation

Use a strong teacher (pretrained or fine-tuned), freeze it, then train a student to match the teacher’s output distribution. Great for lowering latency/cost.
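To make "match the teacher's output distribution" concrete, here is a minimal sketch of a soft-target loss in plain Python. It assumes the common formulation: KL divergence between temperature-softened teacher and student distributions, scaled by T². The function names and the default temperature are illustrative, not taken from any particular library, and real training code would compute this per token over batches in a tensor framework.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; higher temperature gives a softer distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2.

    Assumption: the teacher is frozen, so its logits are treated as constants.
    """
    p = softmax(teacher_logits, temperature)  # soft targets
    q = softmax(student_logits, temperature)  # student predictions
    kl = sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2

# Hypothetical logits over a 3-token vocabulary.
teacher = [1.0, 2.0, 3.0]
student = [3.0, 2.0, 1.0]
print(distillation_loss(teacher, student))
```

The loss is zero when the student reproduces the teacher's logits and grows as the two distributions diverge. In practice it is often mixed with an ordinary cross-entropy term on the ground-truth labels.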

2) PEFT (LoRA / QLoRA)

Freeze the base model and train low-rank adapters. QLoRA loads the base in 4-bit to reduce memory. A good fit when GPU memory or budget rules out full fine-tuning.

Common knobs: lora_r, lora_alpha, target_modules, cutoff_len, gradient_accumulation_steps.
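The sketch below illustrates what lora_r and lora_alpha control, using plain Python lists in place of tensors. It assumes the usual LoRA formulation, where a frozen weight W is augmented by a scaled low-rank product, W + (lora_alpha / lora_r) · B·A. The helper names are hypothetical; this is not the PEFT library's API.

```python
def lora_param_counts(d_out, d_in, lora_r):
    """Trainable parameters: full fine-tuning vs. a LoRA adapter on one weight matrix."""
    full = d_out * d_in              # every entry of W is trained
    lora = lora_r * (d_in + d_out)   # A is (r x d_in), B is (d_out x r)
    return full, lora

def matmul(a, b):
    """Naive matrix product for small illustrative matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def lora_effective_weight(w, a, b, lora_alpha, lora_r):
    """Return W + (lora_alpha / lora_r) * B @ A; W itself stays frozen."""
    scale = lora_alpha / lora_r
    delta = matmul(b, a)
    return [[w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
            for i in range(len(w))]

# A 4096 x 4096 projection with rank 8 (sizes chosen only for illustration).
full, lora = lora_param_counts(4096, 4096, 8)
print(full, lora, f"{lora / full:.2%} of the full matrix")
```

The count makes the trade-off visible: raising lora_r adds capacity linearly, while lora_alpha only rescales the adapter's contribution. target_modules decides which weight matrices receive an adapter of this kind.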

3) Instruction Fine-Tuning (SFT)

Train on instruction datasets (Alpaca/ShareGPT/custom JSONL) to lock in formatting, tone, and response structure.
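As an illustration, the snippet below reads custom JSONL records with Alpaca-style fields (instruction, input, output) and renders each one into a single training string. The section markers used here are a simplified template chosen for this sketch; a real run should use whatever prompt or chat template the base model and trainer expect.

```python
import json

def load_jsonl(text):
    """Parse JSONL content: one JSON object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]

def format_example(record):
    """Render one instruction record as a training string.

    Assumption: Alpaca-style keys 'instruction', 'input' (may be empty), 'output'.
    """
    parts = [f"### Instruction:\n{record['instruction']}"]
    if record.get("input"):  # omit the Input section when there is no input
        parts.append(f"### Input:\n{record['input']}")
    parts.append(f"### Response:\n{record['output']}")
    return "\n\n".join(parts)

# Hypothetical two-record dataset.
sample = (
    '{"instruction": "Summarize.", "input": "Long text", "output": "Short."}\n'
    '{"instruction": "Say hi.", "input": "", "output": "Hi."}\n'
)
for rec in load_jsonl(sample):
    print(format_example(rec))
    print("---")
```

Keeping the template identical across every example is what lets the model learn the structure; the same template then has to be used at inference time.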

4) DPO

Use preference pairs to move from “correct” to “better.” Don’t oversell it: DPO aligns to your preference data; safety requires explicit safety work and eval gates.
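For intuition, here is a sketch of the per-pair DPO objective in plain Python. It assumes the standard form, -log σ(β · [(log π(chosen) − log π_ref(chosen)) − (log π(rejected) − log π_ref(rejected))]), with summed sequence log-probabilities already computed elsewhere. The function name and the default beta are illustrative and are not TRL's API.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (prompt, chosen, rejected) pair.

    Inputs are sequence log-probabilities under the trained policy and
    under a frozen reference model.
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    x = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(x)) written in a numerically stable form
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))

# Hypothetical log-probabilities: the policy has shifted toward the chosen answer.
print(dpo_loss(-9.0, -13.0, -10.0, -12.0))
```

The loss only rewards preferring "chosen" over "rejected" relative to the reference model, which is why the result can be no better than the preference pairs themselves.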

5) Quantization

Convert FP16 weights to 8-bit/4-bit to reduce VRAM and, depending on the runtime and its kernels, speed up inference. Match the artifact to the runtime:

  • GPTQ/AWQ → GPU runtimes (support-dependent)
  • GGUF → llama.cpp (CPU/edge/Apple Silicon)
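To show the basic idea behind reduced precision, the sketch below applies simple symmetric round-to-nearest int8 quantization to a list of weights and estimates weight memory at different bit widths. This is deliberately the simplest scheme; GPTQ, AWQ, and the GGUF quantization types use more elaborate methods (per-group scales, calibration data), so treat this as an illustration of the principle only. All function names are hypothetical.

```python
def quantize_int8(weights):
    """Symmetric absmax quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from integers and the shared scale."""
    return [v * scale for v in q]

def weight_memory_gib(n_params, bits):
    """Memory for the weights alone, in GiB.

    Excludes KV cache, activations, and per-group scale overhead.
    """
    return n_params * bits / 8 / 2**30

weights = [0.5, -1.0, 0.25, 0.0]
q, scale = quantize_int8(weights)
print(q, dequantize(q, scale))

# Rough weight footprint of a hypothetical 7-billion-parameter model.
for bits in (16, 8, 4):
    print(bits, "bit:", round(weight_memory_gib(7e9, bits), 2), "GiB")
```

The round-trip error per weight is bounded by half the scale, which is why outlier weights (they inflate the scale) are the main source of quality loss in naive schemes.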

6) Domain adaptation

Continue pretraining on domain text to internalize jargon and conventions. This is not RAG—it changes model priors.
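Continued pretraining uses the same next-token objective as the original pretraining, so the data preparation is mostly about turning raw documents into fixed-length token blocks. The sketch below shows one common approach, concatenating documents with a separator and slicing the stream into blocks. It is an assumption about the workflow rather than a prescribed method: whitespace splitting stands in for the model's real tokenizer, and the "<eos>" marker and function name are placeholders.

```python
def pack_into_blocks(documents, block_size, eos="<eos>"):
    """Concatenate documents into one token stream and cut it into equal blocks.

    Assumption: whitespace tokens stand in for real tokenizer output.
    The trailing partial block is dropped.
    """
    stream = []
    for doc in documents:
        stream.extend(doc.split())
        stream.append(eos)  # mark the document boundary
    n_full = len(stream) // block_size
    return [stream[i * block_size:(i + 1) * block_size] for i in range(n_full)]

# Hypothetical text extracted from domain docs/PDFs.
docs = ["claims must be filed within thirty days", "see policy section four"]
for block in pack_into_blocks(docs, block_size=4):
    print(block)
```

Note that there are no instruction or response fields here: the model simply learns to predict domain text, which is what shifts its priors before instruction tuning.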

7) Hugging Face

Use HF tools end-to-end: load datasets, train with PEFT/TRL, and version models/adapters for reproducibility.


The Full Pipeline

  1. Choose base model (e.g., TinyLlama, Gemma)
  2. Prepare instruction data (Alpaca, ShareGPT, or custom JSONL)
  3. Configure PEFT (LoRA/QLoRA) and training
  4. Train with SFTTrainer or LLaMA Factory
  5. Optional: Preference alignment (DPO) for style and helpfulness (safety needs dedicated data and evals)
  6. Quantize (GPTQ/AWQ/GGUF) for deployment
  7. Serve via vLLM, llama.cpp, or Inference API

Decision Guide

Need | Solution
Wrong behavior | Fine-tuning (SFT + PEFT)
Missing knowledge | RAG
Multi-step reasoning / tools | Agents
Smaller, faster model | Knowledge distillation
Limited GPU memory | QLoRA + quantization

