
Semantic ID Recommendation System with Bounded Preference Learning

Recommendation systems are everywhere — e-commerce, streaming, social feeds, search. The hard part isn't retrieving similar items. It's doing it at scale while learning what users actually prefer.

This project builds a complete semantic ID recommendation pipeline, trains hybrid retrieval (prefix routing + ANN) with bounded DPO reranking, and measures not just whether accuracy improved — but whether the system maintained coverage and stability under preference learning.

What is Semantic ID Retrieval?

In traditional recommendation, items are arbitrary integers (1, 2, 3...). The system scores every item, sorts by score, returns top-K. This doesn't scale:

Traditional: User → Compute scores for ALL items → Sort → Top-K (billions of dot products)
Semantic ID: User → Predict semantic codes → Index lookup → Candidates → Rank → Top-K (hundreds of predictions)

The critical difference: retrieval becomes a generation problem. Instead of scoring everything, predict which hierarchical codes the user wants next. That single shift creates three compounding benefits:

Scalability — O(vocab) predictions instead of O(catalog) scoring. Serves billions of items without exhaustive search.

Interpretability — codes [5, 42, 187] capture "Genre/Region/Artist" hierarchy. Debugging shows why an item was retrieved, not just that it scored high.

Cold-start resilience — new items share codes with existing items. Prefix [5, 42] retrieves both seen and unseen items in "Pop/Korean".
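The index lookup and the cold-start behavior can be illustrated with a small sketch. This is not the project's code: the function name build_prefix_index, the item ids, and the code values are all made up for illustration, and a real index would live in a search service rather than a Python dict.

```python
from collections import defaultdict

def build_prefix_index(item_codes, prefix_len=2):
    """Map each code prefix to the list of item ids that share it."""
    index = defaultdict(list)
    for item_id, codes in item_codes.items():
        index[tuple(codes[:prefix_len])].append(item_id)
    return index

# Hypothetical catalog: item id -> semantic ID (all values invented).
item_codes = {
    "song_a": (5, 42, 187),
    "song_b": (5, 42, 301),
    "song_c": (7, 10, 12),
    "new_song": (5, 42, 9),  # no interaction history, but shares the prefix
}

index = build_prefix_index(item_codes)
candidates = index[(5, 42)]
print(candidates)  # ['song_a', 'song_b', 'new_song']
```

The new item is retrieved through the shared prefix even though no user has interacted with it, which is the cold-start property described above.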


Key Design Decisions

Semantic IDs via RQ-VAE — 4-level residual quantization converts 832d embeddings (CLIP 512d + Text-T5 320d) into hierarchical codes [c₁,c₂,c₃,c₄]. Each level has 512 codes. Innovation: K-means initialization prevents codebook collapse (92.4% utilization vs 30% random init).
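To make the residual quantization step concrete, here is a minimal sketch of the encoding pass only. It assumes the codebooks are already trained; the encoder, decoder, and K-means initialization are omitted. The helper names and the toy 2-d codebooks are illustrative, whereas the project uses 4 levels of 512 codes on 832-d vectors.

```python
def nearest(codebook, vec):
    """Index of the codebook vector closest to vec (squared Euclidean distance)."""
    dists = [sum((c - v) ** 2 for c, v in zip(code, vec)) for code in codebook]
    return dists.index(min(dists))

def residual_quantize(vec, codebooks):
    """Pick one code per level, each level quantizing what the previous levels left over."""
    residual = list(vec)
    codes = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes, residual

# Toy codebooks (invented values): a coarse level followed by a fine level.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]],
    [[0.0, 0.0], [0.5, 0.0], [0.0, -0.5]],
]
codes, residual = residual_quantize([1.4, 1.1], codebooks)
print(codes)  # [1, 1]
```

Each successive level refines the approximation, which is why earlier codes behave like coarse categories and later codes like fine distinctions.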

Hybrid retrieval — prefix-based alone gets 76% coverage, ANN-based alone gets 88%. Union achieves 96.5%. The trade-off: prefix is fast but coarse, ANN is precise but expensive. Combining them captures both popular patterns (prefix) and long-tail items (ANN).
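The coverage comparison can be sketched as follows, assuming coverage means the fraction of users whose held-out target item appears in the candidate set. The function name and the toy candidate sets are invented for illustration and do not reproduce the reported numbers.

```python
def coverage(candidate_sets, targets):
    """Fraction of users whose held-out target appears among their candidates."""
    hits = sum(1 for cands, target in zip(candidate_sets, targets) if target in cands)
    return hits / len(targets)

# Toy data: one candidate set per user from each channel (values invented).
prefix_cands = [{"a", "b"}, {"c"}, {"d", "e"}, {"f"}]
ann_cands = [{"b", "x"}, {"y", "z"}, {"e"}, {"g"}]
targets = ["a", "z", "e", "q"]

# The hybrid channel is the per-user union of both candidate sets.
union_cands = [p | a for p, a in zip(prefix_cands, ann_cands)]

print(coverage(prefix_cands, targets))  # 0.5
print(coverage(ann_cands, targets))     # 0.5
print(coverage(union_cands, targets))   # 0.75
```

The union beats both channels whenever they miss different users, which is the pattern behind the 76% / 88% / 96.5% figures.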

Bounded DPO corrections: standard DPO with an unbounded delta destroyed rankings (-59% performance). The fix is an architectural constraint: score = base + 0.05 * tanh(delta). Because tanh is bounded in (-1, 1), the learned correction can move any score by at most ±0.05, so the base score stays dominant and only items with close base scores can swap places. Result: +31% improvement with stable training.
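A minimal sketch of the bounded correction, using the formula stated above. The function name and the example scores are assumptions for illustration; the preference model that produces delta is not shown.

```python
import math

def bounded_score(base, delta, scale=0.05):
    """Base score plus a learned correction squashed into (-scale, +scale)."""
    return base + scale * math.tanh(delta)

# Base scores 0.20 apart: even extreme deltas cannot flip the order,
# because the two corrections together move the gap by at most 0.10.
a = bounded_score(0.80, -100.0)  # about 0.75
b = bounded_score(0.60, 100.0)   # about 0.65
print(a > b)  # True

# Base scores 0.04 apart: a moderate delta is enough to reorder them.
c = bounded_score(0.62, -2.0)
d = bounded_score(0.58, 2.0)
print(d > c)  # True
```

The constraint limits the reranker to reordering near-ties, which is one plausible reading of why training stays stable while the unbounded variant does not.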

Hard negative sampling — random negatives from 127K items don't match the distribution seen during inference. Solution: sample negatives from the same candidate pool (top-100 of retrieved items). Training distribution now matches evaluation distribution. Result: 81% DPO accuracy vs 60% with random negatives.
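The sampling scheme might look like the sketch below. The function name, the number of negatives per positive (k=4), and the item ids are assumptions; only the idea of drawing negatives from the top-100 retrieved candidates comes from the description above.

```python
import random

def sample_hard_negatives(ranked_candidates, positive, k=4, pool_size=100, seed=0):
    """Draw k negatives from the top of the retrieved list, excluding the positive."""
    rng = random.Random(seed)
    pool = [item for item in ranked_candidates[:pool_size] if item != positive]
    return rng.sample(pool, min(k, len(pool)))

# Toy ranked candidate list for one user (ids invented).
ranked = [f"item_{i}" for i in range(1996)]
positive = "item_7"
negatives = sample_hard_negatives(ranked, positive)
print(len(negatives))  # 4
```

Every negative is an item the reranker would actually have to compete against at inference time, unlike a uniform draw from the whole catalog.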

Prefix router — Transformer (4 layers, 256d, 8 heads) predicts top-50 prefixes from user history. Each prefix maps to ~500 items via index lookup. Total: ~10K candidates from prefix channel. Performance: 29.3% Val@50 (target prefix in top-50 predictions).



Results

What the training achieved

The full pipeline was trained and then evaluated in two modes: baseline (cosine similarity ranking) and DPO (bounded preference learning).

Metric          Baseline   DPO      Relative Δ
Recall@10       15.54%     20.36%   +31.0%
NDCG@10         8.12%      10.69%   +31.6%
Coverage        96.5%      96.5%    Maintained
Avg candidates  1,996      1,996    Stable

DPO improved ranking quality without changing retrieval. Coverage and candidate pool size remained stable — the reranker operated on the same items, just ordered them better.

The gap comes down to two design choices: bounded corrections prevent rank collapse, and hard negatives align training with the actual inference distribution.
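For reference, the two ranking metrics can be computed as in this sketch. It uses the standard binary-relevance definitions of Recall@K and NDCG@K, which is an assumption about how the project evaluates. The last lines check that the reported Recall@10 change is a relative lift.

```python
import math

def recall_at_k(ranked, relevant, k=10):
    """Share of relevant items that appear in the top k."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)

def ndcg_at_k(ranked, relevant, k=10):
    """Binary-relevance NDCG: discounted gain divided by the ideal ordering's gain."""
    relevant = set(relevant)
    dcg = sum(1.0 / math.log2(i + 2) for i, item in enumerate(ranked[:k]) if item in relevant)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(k, len(relevant))))
    return dcg / ideal if ideal else 0.0

# Toy example: the single relevant item sits at rank 2.
ranked = ["x", "t", "y"]
print(recall_at_k(ranked, ["t"]))          # 1.0
print(round(ndcg_at_k(ranked, ["t"]), 4))  # 0.6309

# Relative lift from the reported Recall@10 values.
relative_lift = (20.36 - 15.54) / 15.54
print(round(relative_lift, 3))  # 0.31
```

NDCG rewards placing the relevant item higher, so it can improve even when Recall@10 is unchanged.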

Why Coverage Alone Is Not Enough

A high coverage number hides the failure modes that matter in production:

What coverage misses                        How to measure it
Rank collapse under preference learning     Bounded vs unbounded delta comparison
Training/inference distribution mismatch    Hard vs random negative accuracy
Codebook collapse in quantization           Utilization % across all code levels
Single-method retrieval brittleness         Prefix-only vs ANN-only vs union coverage
Preference model overfitting                Train vs val accuracy gap

This project implements all five. The result isn't just "coverage was high" — it's a diagnostic showing exactly where the architecture succeeds, where it's structurally constrained, and what to fix next.
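As one example of these diagnostics, codebook utilization can be measured as in the sketch below. It assumes utilization means the fraction of codebook entries assigned to at least one item, computed per level; the function name and toy codes are invented.

```python
def codebook_utilization(all_codes, codebook_size):
    """Per level, the fraction of codebook entries used by at least one item."""
    n_levels = len(all_codes[0])
    return [
        len({codes[level] for codes in all_codes}) / codebook_size
        for level in range(n_levels)
    ]

# Toy semantic IDs for four items, two levels, codebook size 4 (values invented).
all_codes = [(0, 1), (0, 2), (1, 1), (3, 1)]
utilization = codebook_utilization(all_codes, codebook_size=4)
print(utilization)  # [0.75, 0.5]
```

A level whose utilization falls far below the others is a sign of collapse at that level, even if the average across levels looks healthy.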

Real-World Transfer

The same principles — bounded learning, distribution alignment, hybrid strategies — map directly onto production problems:

  • Search ranking: bounded rerankers prevent query-level rank collapse
  • Content moderation: hybrid detection (rule-based + ML) improves coverage
  • Fraud detection: hard negatives from near-miss cases improve precision
  • Ad serving: semantic IDs enable fast retrieval at billions-of-impressions scale

The music recommendation task is a testbed. The architectural patterns are the transferable artifact.


[Figure: system_design.png]

Production Deployment

This system is designed for scale. Components:

Serving (<50ms total):

  • OpenSearch: Hosts both the ANN index (FAISS) and the prefix index in a single service
  • DynamoDB: USID lookups at <1ms latency
  • EKS GPU: DPO ranking inference at ~20ms
  • API Gateway: Aggregates and serves results

Training pipeline:

  • SageMaker: Orchestrates multi-stage training (embeddings → RQ-VAE → router → DPO)
  • Model Registry: Versions all artifacts (codebooks, checkpoints, embeddings)
  • S3: Stores raw logs for offline training

MLOps loop:

  • CloudWatch: Monitors Recall@10, codebook utilization, latency
  • Triggers: Retrain when Recall@10 drops by more than 2% or codebook utilization falls below 80%
  • Gates: Models must pass eval threshold before production
  • Rollback: One-click via CodePipeline with pinned configs
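The retrain trigger could be expressed as a small predicate like the one below. This is a sketch under assumptions: the function and parameter names are hypothetical, and the 2% threshold is interpreted as a relative drop against a stored baseline, which the description above does not specify.

```python
def should_retrain(baseline_recall, current_recall, utilization,
                   max_recall_drop=0.02, min_utilization=0.80):
    """Return True when either monitored metric crosses its threshold."""
    # Assumption: the drop is measured relative to the baseline, not in absolute points.
    recall_drop = (baseline_recall - current_recall) / baseline_recall
    return recall_drop > max_recall_drop or utilization < min_utilization

print(should_retrain(0.2036, 0.2030, 0.924))  # False: both metrics within bounds
print(should_retrain(0.2036, 0.1900, 0.924))  # True: Recall@10 dropped too far
print(should_retrain(0.2036, 0.2036, 0.750))  # True: codebook utilization too low
```

In a deployment like the one described, a check of this kind would run on the monitored metrics and start the training pipeline when it returns True.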

References

This work builds on:

  1. RQ-VAE: Lee et al. (CVPR 2022). Residual quantization for hierarchical codes
  2. TIGER: Rajput et al. (NeurIPS 2023). Prefix-based generative retrieval
  3. DPO: Rafailov et al. (NeurIPS 2023). Direct preference optimization
  4. ActionPiece: Hou et al. (ICML 2025). Context-aware tokenization
  5. LightGCN: He et al. (SIGIR 2020). Simplified graph collaborative filtering
  6. LIGER: Yang et al. (2024). Hybrid retrieval concept