Semantic ID Recommendation System with Bounded Preference Learning

Recommendation systems are everywhere — e-commerce, streaming, social feeds, search. The hard part isn't retrieving similar items. It's doing it at scale while learning what users actually prefer.

This project builds a complete semantic ID recommendation pipeline, trains hybrid retrieval (prefix routing + ANN) with bounded DPO reranking, and measures not just whether accuracy improved — but whether the system maintained coverage and stability under preference learning.

What is Semantic ID Retrieval?

In traditional recommendation, items are arbitrary integers (1, 2, 3...). The system scores every item, sorts by score, returns top-K. This doesn't scale:

Traditional: User → Compute scores for ALL items → Sort → Top-K (billions of dot products)
Semantic ID: User → Predict semantic codes → Index lookup → Candidates → Rank → Top-K (hundreds of predictions)

The critical difference: retrieval becomes a generation problem. Instead of scoring everything, predict which hierarchical codes the user wants next. That single shift creates three compounding benefits:

Scalability — O(vocab) predictions instead of O(catalog) scoring. Serves billions of items without exhaustive search.

Interpretability — codes [5, 42, 187] capture "Genre/Region/Artist" hierarchy. Debugging shows why an item was retrieved, not just that it scored high.

Cold-start resilience — new items share codes with existing items. Prefix [5, 42] retrieves both seen and unseen items in "Pop/Korean".
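The prefix lookup behind the cold-start point can be sketched in a few lines. This is a minimal illustration, not the project's index: the item names, the 3-level codes, and the `build_prefix_index` helper are all hypothetical.

```python
from collections import defaultdict

def build_prefix_index(item_codes, prefix_len=2):
    """Map each code prefix to the list of item IDs that share it."""
    index = defaultdict(list)
    for item_id, codes in item_codes.items():
        index[tuple(codes[:prefix_len])].append(item_id)
    return index

# Hypothetical catalog: item ID -> 3-level semantic ID
item_codes = {
    "song_a": [5, 42, 187],
    "song_b": [5, 42, 201],
    "song_c": [5, 17, 33],
    "new_song": [5, 42, 9],  # no interaction history yet
}

index = build_prefix_index(item_codes, prefix_len=2)

# One dictionary lookup returns every item under the prefix,
# including the item that has never been interacted with.
candidates = index.get((5, 42), [])
print(candidates)  # ['song_a', 'song_b', 'new_song']
```

Because the new item is indexed by its codes rather than by interaction history, it becomes retrievable as soon as it is encoded.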

Key Design Decisions

Semantic IDs via RQ-VAE — 4-level residual quantization converts 832d embeddings (CLIP 512d + Text-T5 320d) into hierarchical codes [c₁,c₂,c₃,c₄]. Each level has 512 codes. Innovation: K-means initialization prevents codebook collapse (92.4% utilization vs 30% random init).
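To make the residual quantization step concrete, here is a hedged sketch of the encoding pass only. It assumes the codebooks are already learned and uses a toy 2-d vector with two levels of two codes each, where the real system uses 832d, 4 levels, and 512 codes. The function names are illustrative, and the RQ-VAE encoder, decoder, and codebook training are not shown.

```python
def nearest(codebook, vec):
    """Index of the codebook entry closest to vec (squared Euclidean distance)."""
    best_idx, best_dist = 0, float("inf")
    for idx, centroid in enumerate(codebook):
        dist = sum((v - c) ** 2 for v, c in zip(vec, centroid))
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx

def rq_encode(vec, codebooks):
    """Greedy residual quantization: each level quantizes what the previous levels missed."""
    residual = list(vec)
    codes = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        # Subtract the chosen centroid; the next level sees only the leftover error.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes, residual

# Toy codebooks (assumed values, for illustration only)
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],      # level 1: coarse
    [[0.0, 0.0], [0.25, -0.25]],   # level 2: refines the residual
]

codes, residual = rq_encode([1.2, 0.8], codebooks)
print(codes)  # [1, 1]
```

The first code captures the coarse position and each later code refines it, which is why items that share a prefix end up close in embedding space.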

Hybrid retrieval — prefix-based alone gets 76% coverage, ANN-based alone gets 88%. Union achieves 96.5%. The trade-off: prefix is fast but coarse, ANN is precise but expensive. Combining them captures both popular patterns (prefix) and long-tail items (ANN).
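The union effect is easy to reproduce on toy data. The sketch below assumes coverage means "the held-out target item appears in the user's candidate set"; the users, items, and the `coverage` helper are made up for illustration and do not reproduce the 76% / 88% / 96.5% figures.

```python
def coverage(candidate_sets, targets):
    """Fraction of users whose held-out target appears in their candidate set."""
    hits = sum(
        1 for user, target in targets.items()
        if target in candidate_sets.get(user, set())
    )
    return hits / len(targets)

# Hypothetical per-user candidates from each channel
prefix_cands = {"u1": {"a", "b"}, "u2": {"c"}, "u3": {"d"}, "u4": {"e"}}
ann_cands = {"u1": {"b", "x"}, "u2": {"y"}, "u3": {"z"}, "u4": {"q"}}
targets = {"u1": "a", "u2": "y", "u3": "z", "u4": "m"}

# Hybrid retrieval: set union of the two channels per user
hybrid_cands = {u: prefix_cands[u] | ann_cands[u] for u in targets}

print(coverage(prefix_cands, targets))  # 0.25
print(coverage(ann_cands, targets))     # 0.5
print(coverage(hybrid_cands, targets))  # 0.75
```

The union can never cover less than the better single channel, and it gains most when the two channels miss different users.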

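Bounded DPO corrections — standard DPO with an unbounded delta destroyed the rankings (-59% performance). The fix is an architectural constraint: score = base + 0.05 * tanh(delta). Because tanh stays within (-1, 1), the learned correction can shift any score by at most ±0.05 in absolute terms, so the base score stays dominant and two items can only swap places when their base scores differ by less than 0.1. Result: +31% improvement with stable training.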
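A small sketch of the bounded scoring rule shows what the constraint does to a pair of items. Only the formula comes from the text; the base scores and delta values below are invented for illustration, and no training loop is shown.

```python
import math

MAX_CORRECTION = 0.05  # bound taken from the formula in the text

def bounded_score(base, delta, max_correction=MAX_CORRECTION):
    """Base score plus a learned correction squashed into (-max_correction, +max_correction)."""
    return base + max_correction * math.tanh(delta)

# Large base gap (0.20): even extreme deltas cannot flip the order.
far_a = bounded_score(0.80, -100.0)
far_b = bounded_score(0.60, 100.0)
print(far_a > far_b)  # True

# Small base gap (0.05): a moderate delta can reorder the pair.
near_a = bounded_score(0.80, -3.0)
near_b = bounded_score(0.75, 3.0)
print(near_b > near_a)  # True
```

The reranker can therefore only reorder items the base model already considered close, which is what keeps a badly trained delta from wrecking the whole ranking.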

Hard negative sampling — random negatives from 127K items don't match the distribution seen during inference. Solution: sample negatives from the same candidate pool (top-100 of retrieved items). Training distribution now matches evaluation distribution. Result: 81% DPO accuracy vs 60% with random negatives.
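One way to sample negatives from the candidate pool is sketched below. It assumes the retrieved candidates arrive already ranked by base score; the `sample_hard_negatives` helper, the number of negatives per positive, and the item names are hypothetical.

```python
import random

def sample_hard_negatives(ranked_candidates, positive, pool_size=100, n_neg=4, rng=None):
    """Draw negatives from the top of the retrieved list instead of the full catalog."""
    rng = rng or random.Random()
    # Restrict to the top-`pool_size` candidates and never sample the positive itself.
    pool = [item for item in ranked_candidates[:pool_size] if item != positive]
    return rng.sample(pool, min(n_neg, len(pool)))

# Hypothetical ranked candidate list for one user
ranked = [f"item_{i}" for i in range(500)]
negatives = sample_hard_negatives(ranked, positive="item_7", rng=random.Random(0))
print(negatives)
```

Every negative is an item the reranker would actually have to beat at inference time, which is the distribution alignment the paragraph above describes.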

Prefix router — Transformer (4 layers, 256d, 8 heads) predicts top-50 prefixes from user history. Each prefix maps to ~500 items via index lookup. Total: ~10K candidates from prefix channel. Performance: 29.3% Val@50 (the target prefix appears among the top-50 predictions on the validation set).

Results

What the training achieved

The system was trained end to end and evaluated in two modes: baseline (cosine similarity ranking) and DPO (bounded preference learning).

Metric	Baseline	DPO	Δ (relative)
Recall@10	15.54%	20.36%	+31.0%
NDCG@10	8.12%	10.69%	+31.6%
Coverage	96.5%	96.5%	Maintained
Avg candidates	1,996	1,996	Stable

DPO improved ranking quality without changing retrieval. Coverage and candidate pool size remained stable — the reranker operated on the same items, just ordered them better.
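For reference, the two ranking metrics and the relative change can be computed as in the sketch below. It assumes one held-out target item per user (a next-item setup), which the text does not state explicitly; the function names are illustrative.

```python
import math

def recall_at_k(ranked, target, k=10):
    """1.0 if the target is in the top-k, else 0.0 (single relevant item)."""
    return 1.0 if target in ranked[:k] else 0.0

def ndcg_at_k(ranked, target, k=10):
    """With one relevant item the ideal DCG is 1, so NDCG = 1 / log2(rank + 1)."""
    top = ranked[:k]
    if target not in top:
        return 0.0
    rank = top.index(target) + 1  # 1-based position
    return 1.0 / math.log2(rank + 1)

def relative_change(before, after):
    """Relative improvement in percent."""
    return (after - before) / before * 100

ranked = ["a", "b", "c", "d"]
print(recall_at_k(ranked, "c", k=2))  # 0.0
print(ndcg_at_k(ranked, "c", k=10))   # 0.5

# Recall@10 row of the table: 15.54% -> 20.36%
print(round(relative_change(15.54, 20.36), 1))  # 31.0
```

Per-user values are averaged over the evaluation set to get the table entries. The Δ column is a relative change, so the absolute Recall@10 gain is 4.82 percentage points.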

The gap comes down to two design choices: bounded corrections prevent rank collapse, and hard negatives align training with the actual inference distribution.

Why Coverage Alone Is Not Enough

A high coverage number hides the failure modes that matter in production:

What coverage misses	How to measure it
Rank collapse under preference learning	Bounded vs unbounded delta comparison
Training/inference distribution mismatch	Hard vs random negative accuracy
Codebook collapse in quantization	Utilization % across all code levels
Single-method retrieval brittleness	Prefix-only vs ANN-only vs union coverage
Preference model overfitting	Train vs val accuracy gap

This project implements all five. The result isn't just "coverage was high" — it's a diagnostic showing exactly where the architecture succeeds, where it's structurally constrained, and what to fix next.
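As an example of one of these diagnostics, codebook utilization per level can be measured directly from the assigned codes. The sketch assumes utilization means "share of codebook entries assigned to at least one item"; the toy codes and the `codebook_utilization` helper are illustrative.

```python
def codebook_utilization(all_codes, codebook_size):
    """Per-level fraction of codebook entries used by at least one item."""
    n_levels = len(all_codes[0])
    return [
        len({codes[level] for codes in all_codes}) / codebook_size
        for level in range(n_levels)
    ]

# Toy example: 4 items, 2 levels, 4 codes per level
all_codes = [[0, 1], [1, 1], [2, 1], [3, 0]]
utilization = codebook_utilization(all_codes, codebook_size=4)
print(utilization)  # [1.0, 0.5]
```

Reporting the value per level matters: a healthy first level can hide a collapsed deeper level, where most items fall into a handful of codes.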

Real-World Transfer

The same principles — bounded learning, distribution alignment, hybrid strategies — map directly onto production problems:

Search ranking: bounded rerankers prevent query-level rank collapse
Content moderation: hybrid detection (rule-based + ML) improves coverage
Fraud detection: hard negatives from near-miss cases improve precision
Ad serving: semantic IDs enable fast retrieval at billions-of-impressions scale

The music recommendation task is a testbed. The architectural patterns are the transferable artifact.

Production Deployment

This system is designed for scale. Components:

Serving (<50ms total):

OpenSearch: Hosts both ANN (FAISS) and prefix index in single service
DynamoDB: USID lookups at <1ms latency
EKS GPU: DPO ranking inference at ~20ms
API Gateway: Aggregates and serves results

Training pipeline:

SageMaker: Orchestrates multi-stage training (embeddings → RQ-VAE → router → DPO)
Model Registry: Versions all artifacts (codebooks, checkpoints, embeddings)
S3: Stores raw logs for offline training

MLOps loop:

CloudWatch: Monitors Recall@10, codebook utilization, latency
Triggers: Retrain when Recall@10 drops >2% or utilization <80%
Gates: Models must pass eval threshold before production
Rollback: One-click via CodePipeline with pinned configs
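The retrain trigger can be expressed as a small check. This is a sketch under stated assumptions: the text does not say whether the 2% recall drop is absolute or relative, so the code treats it as 2 percentage points, and the `should_retrain` function is hypothetical rather than part of CloudWatch or SageMaker.

```python
def should_retrain(baseline_recall, current_recall, utilization,
                   max_recall_drop=0.02, min_utilization=0.80):
    """Return the list of triggered conditions (an empty list means no retrain)."""
    reasons = []
    # Assumption: the drop is measured in absolute terms (percentage points).
    if baseline_recall - current_recall > max_recall_drop:
        reasons.append("recall_drop")
    if utilization < min_utilization:
        reasons.append("low_utilization")
    return reasons

print(should_retrain(0.2036, 0.18, 0.92))      # ['recall_drop']
print(should_retrain(0.2036, 0.20, 0.75))      # ['low_utilization']
print(should_retrain(0.2036, 0.2036, 0.924))   # []
```

Returning the reasons rather than a bare boolean makes the alert actionable: a recall drop with healthy utilization points at the router or reranker, while low utilization points back at the RQ-VAE stage.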

References

This work builds on:

RQ-VAE — Lee et al. (CVPR 2022): Residual quantization for hierarchical codes
TIGER — Rajput et al. (NeurIPS 2023): Prefix-based generative retrieval
DPO — Rafailov et al. (NeurIPS 2023): Direct preference optimization
ActionPiece — Hou et al. (ICML 2025): Context-aware tokenization
LightGCN — He et al. (SIGIR 2020): Simplified graph collaborative filtering
LIGER — Yang et al. (2024): Hybrid retrieval concept

Leo ooooo
