Semantic ID Recommendation System with Bounded Preference Learning
Recommendation systems are everywhere — e-commerce, streaming, social feeds, search. The hard part isn't retrieving similar items. It's doing it at scale while learning what users actually prefer.
This project builds a complete semantic ID recommendation pipeline, trains hybrid retrieval (prefix routing + ANN) with bounded DPO reranking, and measures not just whether accuracy improved — but whether the system maintained coverage and stability under preference learning.
What is Semantic ID Retrieval?
In traditional recommendation, items are arbitrary integers (1, 2, 3...). The system scores every item, sorts by score, returns top-K. This doesn't scale:
- Traditional: User → Compute scores for ALL items → Sort → Top-K (billions of dot products)
- Semantic ID: User → Predict semantic codes → Index lookup → Candidates → Rank → Top-K (hundreds of predictions)
The critical difference: retrieval becomes a generation problem. Instead of scoring everything, predict which hierarchical codes the user wants next. That single shift creates three compounding benefits:
Scalability — O(vocab) predictions instead of O(catalog) scoring. Serves billions of items without exhaustive search.
Interpretability — codes such as [5, 42, 187] capture a "Genre/Region/Artist" hierarchy. Debugging shows why an item was retrieved, not just that it scored high.
Cold-start resilience — new items share codes with existing items. Prefix [5, 42] retrieves both seen and unseen items in "Pop/Korean".
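The index-lookup step and the cold-start behavior described above can be sketched with a plain in-memory dictionary keyed by code prefix. This is a minimal illustration, not the project's index: the item names, the codes, and the `build_prefix_index` helper are all hypothetical.

```python
from collections import defaultdict

def build_prefix_index(item_codes, prefix_len=2):
    """Map each code prefix (a tuple) to the items whose semantic ID starts with it."""
    index = defaultdict(list)
    for item_id, codes in item_codes.items():
        index[tuple(codes[:prefix_len])].append(item_id)
    return index

# Hypothetical catalog: item name -> 3-level semantic ID.
catalog = {
    "song_a": [5, 42, 187],
    "song_b": [5, 42, 301],
    "song_c": [7, 10, 55],
}

# A brand-new item with no interaction history still gets a semantic ID,
# so it lands in the same bucket as existing items that share its prefix.
catalog["new_song"] = [5, 42, 9]
index = build_prefix_index(catalog)

candidates = index[(5, 42)]
print(candidates)  # ['song_a', 'song_b', 'new_song']
```

A single lookup on prefix `(5, 42)` returns both the established items and the new one, which is the cold-start property in miniature.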
Key Design Decisions
Semantic IDs via RQ-VAE — 4-level residual quantization converts 832-d embeddings (CLIP 512-d + Text-T5 320-d) into hierarchical codes [c₁,c₂,c₃,c₄]. Each level has 512 codes. Initializing the codebooks with K-means prevents codebook collapse (92.4% utilization vs 30% with random initialization).
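To make "residual quantization" concrete, here is a small sketch of the greedy encoding step only, with fixed toy codebooks. Training the codebooks (including the K-means initialization) is omitted, and the 2-d vectors and codebook values are made up for illustration.

```python
def nearest(codebook, vec):
    """Return the index of the codebook entry closest to vec (squared L2 distance)."""
    def dist(entry):
        return sum((e - v) ** 2 for e, v in zip(entry, vec))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))

def residual_quantize(vec, codebooks):
    """Greedy residual quantization: each level encodes what the previous levels left over."""
    residual = list(vec)
    codes = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes, residual

# Toy setup: 2 levels over 2-d vectors (the project uses 4 levels of 512 codes on 832-d inputs).
codebooks = [
    [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]],    # level 1: coarse
    [[0.0, 0.0], [0.25, 0.0], [0.0, -0.25]],  # level 2: fine
]
codes, residual = residual_quantize([1.2, 1.0], codebooks)
print(codes)     # [1, 1]
print(512 ** 4)  # 68719476736 distinct IDs with 4 levels of 512 codes
```

The last line is just arithmetic on the stated configuration: four levels of 512 codes can address about 6.9 × 10¹⁰ distinct IDs, far more than the 127K-item catalog mentioned in the text.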
Hybrid retrieval — prefix-based alone gets 76% coverage, ANN-based alone gets 88%. Union achieves 96.5%. The trade-off: prefix is fast but coarse, ANN is precise but expensive. Combining them captures both popular patterns (prefix) and long-tail items (ANN).
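The union effect can be illustrated with sets. The sketch below assumes coverage means "fraction of users whose held-out target item appears in the candidate set", which the text does not define explicitly; the users, items, and helper names are hypothetical.

```python
def coverage(candidate_sets, targets):
    """Fraction of users whose target item appears in their candidate set."""
    hits = sum(
        1 for user, target in targets.items()
        if target in candidate_sets.get(user, set())
    )
    return hits / len(targets)

def union_candidates(prefix_sets, ann_sets):
    """Merge the two channels per user; duplicates collapse because sets are used."""
    users = set(prefix_sets) | set(ann_sets)
    return {u: prefix_sets.get(u, set()) | ann_sets.get(u, set()) for u in users}

# Hypothetical toy data: four users, one held-out target each.
targets = {"u1": "a", "u2": "b", "u3": "c", "u4": "d"}
prefix_sets = {"u1": {"a", "x"}, "u2": {"b"}, "u3": {"y"}, "u4": {"z"}}
ann_sets = {"u1": {"x"}, "u2": {"b", "q"}, "u3": {"c"}, "u4": {"w"}}

hybrid = union_candidates(prefix_sets, ann_sets)
print(coverage(prefix_sets, targets))  # 0.5
print(coverage(ann_sets, targets))     # 0.5
print(coverage(hybrid, targets))       # 0.75
```

The union beats either channel alone whenever the two channels miss different users, which is the pattern behind the 76% / 88% / 96.5% figures.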
Bounded DPO corrections — standard DPO with an unbounded delta destroyed rankings (a 59% drop in performance). The fix is an architectural constraint: score = base + 0.05 * tanh(delta). Because tanh is bounded in (-1, 1), the learned delta can shift any score by at most ±0.05, so the base score stays dominant. Result: +31% improvement with stable training.
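The bound is easy to check numerically. The sketch below implements only the scoring formula from the text; the function name and the example base scores are assumptions.

```python
import math

SCALE = 0.05  # bound taken from the formula in the text

def bounded_score(base, delta, scale=SCALE):
    """base + scale * tanh(delta): the correction never exceeds +/- scale."""
    return base + scale * math.tanh(delta)

# Even an extreme learned delta moves the score by at most 0.05.
print(round(bounded_score(0.80, 100.0), 4))   # 0.85
print(round(bounded_score(0.80, -100.0), 4))  # 0.75
print(round(bounded_score(0.80, 0.0), 4))     # 0.8

# Consequence: two items can swap order only if their base scores differ by less than 0.10.
a = bounded_score(0.90, -100.0)  # worst case for a
b = bounded_score(0.70, 100.0)   # best case for b
print(a > b)  # True
```

Items whose base scores are far apart keep their relative order no matter what the preference model learns, which is why a badly trained delta cannot collapse the ranking.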
Hard negative sampling — random negatives from 127K items don't match the distribution seen during inference. Solution: sample negatives from the same candidate pool (top-100 of retrieved items). Training distribution now matches evaluation distribution. Result: 81% DPO accuracy vs 60% with random negatives.
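A minimal sketch of in-pool negative sampling follows. The pool size of 100 comes from the text; the number of negatives per positive (`k`), the item names, and the helper itself are hypothetical.

```python
import random

def sample_hard_negatives(ranked_candidates, positive, k=4, pool_size=100, rng=None):
    """Draw k negatives from the top `pool_size` retrieved candidates, excluding the positive."""
    rng = rng or random.Random()
    pool = [item for item in ranked_candidates[:pool_size] if item != positive]
    return rng.sample(pool, min(k, len(pool)))

# Hypothetical retrieved list for one user, already sorted by base score.
retrieved = [f"item_{i}" for i in range(500)]
positive = "item_7"

negatives = sample_hard_negatives(retrieved, positive, k=4, rng=random.Random(0))
pairs = [(positive, neg) for neg in negatives]  # (chosen, rejected) pairs for DPO
print(len(pairs))  # 4
```

Because every rejected item comes from the same top-100 pool the reranker sees at inference time, the model is trained on the comparisons it will actually have to make.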
Prefix router — Transformer (4 layers, 256d, 8 heads) predicts top-50 prefixes from user history. Each prefix maps to ~500 items via index lookup. Total: ~10K candidates from prefix channel. Performance: 29.3% Val@50 (target prefix in top-50 predictions).
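The Val@50 figure is a hit rate: how often the true next prefix is among the router's top-k predictions. A small sketch of that metric, with made-up router outputs and a hypothetical function name:

```python
def prefix_hit_rate(predicted_prefixes, target_prefixes, k=50):
    """Share of examples whose true prefix is among the top-k predicted prefixes."""
    hits = sum(
        1 for preds, target in zip(predicted_prefixes, target_prefixes)
        if target in preds[:k]
    )
    return hits / len(target_prefixes)

# Hypothetical router output: per-user prefixes sorted by predicted probability.
predicted = [
    [(5, 42), (5, 17), (9, 3)],
    [(7, 10), (5, 42), (2, 2)],
    [(1, 1), (3, 8), (4, 4)],
]
targets = [(5, 17), (2, 2), (6, 6)]

print(prefix_hit_rate(predicted, targets, k=2))  # 0.3333333333333333
print(prefix_hit_rate(predicted, targets, k=3))  # 0.6666666666666666
```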

Results
What the training achieved
The system was trained end-to-end and evaluated in two modes: baseline (cosine similarity ranking) and DPO (bounded preference learning).
| Metric | Baseline | DPO | Δ (relative) |
|---|---|---|---|
| Recall@10 | 15.54% | 20.36% | +31.0% |
| NDCG@10 | 8.12% | 10.69% | +31.6% |
| Coverage | 96.5% | 96.5% | Maintained |
| Avg candidates | 1,996 | 1,996 | Stable |
DPO improved ranking quality without changing retrieval. Coverage and candidate pool size remained stable — the reranker operated on the same items, just ordered them better.
The improvement comes down to two design choices: bounded corrections prevent rank collapse, and hard negatives align training with the actual inference distribution.
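For reference, here is one common way to compute the two ranking metrics in the table, assuming a single held-out target item per user (an assumption; the evaluation protocol is not spelled out here). The `relative_gain` helper reproduces the relative change reported for Recall@10.

```python
import math

def recall_at_k(ranked, target, k=10):
    """1.0 if the held-out target is in the top-k, else 0.0."""
    return 1.0 if target in ranked[:k] else 0.0

def ndcg_at_k(ranked, target, k=10):
    """With a single relevant item the ideal DCG is 1, so NDCG is 1 / log2(rank + 1)."""
    if target not in ranked[:k]:
        return 0.0
    rank = ranked.index(target) + 1  # 1-based position
    return 1.0 / math.log2(rank + 1)

def relative_gain(before, after):
    """Relative change in percent."""
    return 100.0 * (after - before) / before

ranked = ["a", "b", "c", "d"]
print(recall_at_k(ranked, "c", k=10))         # 1.0
print(ndcg_at_k(ranked, "c", k=10))           # 0.5
print(round(relative_gain(15.54, 20.36), 1))  # 31.0
```

Per-user values are averaged over the evaluation set to get the reported percentages. NDCG rewards placing the target higher within the top 10, while Recall only checks that it is present.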
Why Coverage Alone Is Not Enough
A high coverage number hides the failure modes that matter in production:
| What coverage misses | How to measure it |
|---|---|
| Rank collapse under preference learning | Bounded vs unbounded delta comparison |
| Training/inference distribution mismatch | Hard vs random negative accuracy |
| Codebook collapse in quantization | Utilization % across all code levels |
| Single-method retrieval brittleness | Prefix-only vs ANN-only vs union coverage |
| Preference model overfitting | Train vs val accuracy gap |
This project implements all five. The result isn't just "coverage was high" — it's a diagnostic showing exactly where the architecture succeeds, where it's structurally constrained, and what to fix next.
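Two of these diagnostics reduce to a few lines each. The sketch below shows a per-level codebook utilization check and a train/validation gap, using hypothetical helper names and toy IDs.

```python
def codebook_utilization(all_codes, codebook_size):
    """Per-level fraction of codebook entries assigned to at least one item."""
    num_levels = len(all_codes[0])
    return [
        len({codes[level] for codes in all_codes}) / codebook_size
        for level in range(num_levels)
    ]

def accuracy_gap(train_acc, val_acc):
    """Positive values mean the preference model does better on train than on validation."""
    return train_acc - val_acc

# Hypothetical 2-level IDs over a codebook of size 4.
ids = [[0, 1], [0, 2], [3, 1], [2, 1]]
print(codebook_utilization(ids, codebook_size=4))  # [0.75, 0.5]
```

Reporting utilization per level matters because collapse can hide in deeper levels even when the first level looks healthy.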
Real-World Transfer
The same principles — bounded learning, distribution alignment, hybrid strategies — map directly onto production problems:
- Search ranking: bounded rerankers prevent query-level rank collapse
- Content moderation: hybrid detection (rule-based + ML) improves coverage
- Fraud detection: hard negatives from near-miss cases improve precision
- Ad serving: semantic IDs enable fast retrieval at billions-of-impressions scale
The music recommendation task is a testbed. The architectural patterns are the transferable artifact.

Production Deployment
This system is designed for scale. Components:
Serving (<50ms total):
- OpenSearch: Hosts both the ANN index (FAISS) and the prefix index in a single service
- DynamoDB: USID lookups at <1ms latency
- EKS GPU: DPO ranking inference at ~20ms
- API Gateway: Aggregates and serves results
Training pipeline:
- SageMaker: Orchestrates multi-stage training (embeddings → RQ-VAE → router → DPO)
- Model Registry: Versions all artifacts (codebooks, checkpoints, embeddings)
- S3: Stores raw logs for offline training
MLOps loop:
- CloudWatch: Monitors Recall@10, codebook utilization, latency
- Triggers: Retrain when Recall@10 drops by more than 2% or codebook utilization falls below 80%
- Gates: Models must pass eval threshold before production
- Rollback: One-click via CodePipeline with pinned configs
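The retrain triggers in the MLOps loop amount to a simple threshold check. The sketch below is service-agnostic and makes no CloudWatch calls; it assumes the 2% drop is measured in percentage points against a reference Recall@10, which the text leaves open. The `Thresholds` class and `should_retrain` function are hypothetical names.

```python
from dataclasses import dataclass

@dataclass
class Thresholds:
    max_recall_drop: float = 2.0   # assumed: percentage points below the reference
    min_utilization: float = 80.0  # percent

def should_retrain(reference_recall, current_recall, utilization, t=Thresholds()):
    """Return the list of tripped conditions; an empty list means no retrain is needed."""
    reasons = []
    if reference_recall - current_recall > t.max_recall_drop:
        reasons.append("recall_drop")
    if utilization < t.min_utilization:
        reasons.append("low_utilization")
    return reasons

print(should_retrain(20.36, 19.90, 92.4))  # []
print(should_retrain(20.36, 17.80, 92.4))  # ['recall_drop']
print(should_retrain(20.36, 20.10, 75.0))  # ['low_utilization']
```

Returning the reasons rather than a bare boolean keeps the alert actionable: a recall drop and a utilization drop point to different stages of the training pipeline.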
References
This work builds on:
- RQ-VAE — Lee et al. (CVPR 2022): Residual quantization for hierarchical codes
- TIGER — Rajput et al. (NeurIPS 2023): Prefix-based generative retrieval
- DPO — Rafailov et al. (NeurIPS 2023): Direct preference optimization
- ActionPiece — Hou et al. (ICML 2025): Context-aware tokenization
- LightGCN — He et al. (SIGIR 2020): Simplified graph collaborative filtering
- LIGER — Yang et al. (2024): Hybrid retrieval concept
