Deep Neural Networks for YouTube Recommendations
Paper
Deep Neural Networks for YouTube Recommendations — Covington et al., KDD 2016
1. Overall Architecture
Two-stage recommendation pipeline:
User Input → Retrieval (Candidate Generation) → Ranking → Final Recommendations
- Retrieval: high recall, find videos the user might watch
- Ranking: high precision, predict which videos the user will engage with
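The two-stage split can be sketched as two functions: a cheap scorer that prunes the corpus, then an expensive scorer applied only to the survivors. The scoring functions here are hypothetical stand-ins for illustration, not the paper's actual models.

```python
def cheap_score(user, video):
    # Toy retrieval score (an assumption): closeness of user taste to video topic.
    return -abs(user["taste"] - video["topic"])

def rich_score(user, video):
    # Toy ranking score (an assumption): predicted engagement.
    return video["expected_watch_time"]

def retrieve(user, corpus, k=3):
    """Candidate generation: cheap score, high recall, prune the corpus to a few items."""
    scored = sorted(corpus, key=lambda v: cheap_score(user, v), reverse=True)
    return scored[:k]

def rank(user, candidates):
    """Ranking: expensive score on the small candidate set, high precision."""
    return sorted(candidates, key=lambda v: rich_score(user, v), reverse=True)

user = {"taste": 5}
corpus = [
    {"id": i, "topic": t, "expected_watch_time": w}
    for i, (t, w) in enumerate([(1, 30), (5, 600), (4, 45), (6, 120), (9, 10)])
]
recommendations = rank(user, retrieve(user, corpus, k=3))
```

Note that retrieval never sees `expected_watch_time`: each stage only uses the features its cost budget allows, which is the point of the split.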
2. Retrieval Stage (Candidate Generation)
Goal
Select a few hundred candidate videos from a corpus of millions (high recall)
Input Features
User-side:
- Watched video IDs (history)
- Search tokens (past queries)
- Demographics (age, location)
- Context (time of day, device, geography)
Processing:
- Categorical features → learned embeddings
- Continuous features → passed as-is
Model Architecture
User embedding:
u = f(watch_history, search_tokens, demographics, context)
Video embedding:
v_i ∈ ℝ^d
Matching score:
score(u, v_i) = u · v_i
This generalizes matrix factorization by making the user embedding dynamic (context-dependent) rather than fixed.
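A minimal sketch of the matching score. The `user_embedding` function is a hypothetical stand-in for the DNN tower f(...): here it just averages watched-video embeddings and adds a context vector.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def user_embedding(watched_embs, context):
    # Toy f(history, context): mean of watched-video embeddings plus context.
    # The real model is a deep network; this combination is an assumption.
    d = len(watched_embs[0])
    avg = [sum(e[i] for e in watched_embs) / len(watched_embs) for i in range(d)]
    return [a + c for a, c in zip(avg, context)]

video_embs = {"a": [1.0, 0.0], "b": [0.0, 1.0]}
u = user_embedding([[0.9, 0.1], [0.7, 0.3]], context=[0.0, 0.0])
scores = {vid: dot(u, v) for vid, v in video_embs.items()}
```

The user mostly watched videos near direction `a`, so `a` scores higher than `b`.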
Training Objective
Softmax classification over all videos:
P(v_i | u) = exp(u · v_i) / Σ_j exp(u · v_j)
- Label: one-hot vector (watched video = 1, others = 0)
- Loss: Cross-entropy
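The objective in miniature: logits are dot products u · v_j, turned into a distribution by softmax, with cross-entropy against the watched video. (At YouTube scale the paper uses candidate sampling rather than a full softmax; this toy version computes the full sum.)

```python
import math

def softmax_probs(u, video_embs):
    logits = [sum(a * b for a, b in zip(u, v)) for v in video_embs]
    m = max(logits)                       # subtract max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(probs, watched_index):
    # One-hot label: only the watched video's log-probability contributes.
    return -math.log(probs[watched_index])

u = [1.0, 0.0]
videos = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]  # toy video embeddings
p = softmax_probs(u, videos)
loss = cross_entropy(p, watched_index=0)
```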
Inference
- Compute the user embedding u
- Search a nearest-neighbor index for the top-k videos maximizing u · v_i
- Return the top candidates
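The serving step, sketched as a brute-force maximum-inner-product search. Production systems replace this exact scan with an approximate nearest-neighbor index; the logic is the same.

```python
def top_k(u, video_index, k):
    # Exact brute-force max of u · v; at scale this scan is replaced by
    # an approximate nearest-neighbor index over the video embeddings.
    scored = sorted(
        video_index.items(),
        key=lambda kv: sum(a * b for a, b in zip(u, kv[1])),
        reverse=True,
    )
    return [vid for vid, _ in scored[:k]]

u = [1.0, 0.5]
index = {"v1": [1.0, 0.0], "v2": [0.0, 1.0], "v3": [0.8, 0.8]}
candidates = top_k(u, index, k=2)
```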
3. Ranking Stage
Goal
Predict expected watch time (precision optimization)
Model
Input: (user, video, context)
Output: Engagement probability; with time-weighted training, its odds approximate expected watch time
z = f(user, video, context)
p = sigmoid(z) = 1 / (1 + exp(-z))
Training Objective
Time-weighted logistic regression:
L = - w+ · y · log(p) - w- · (1-y) · log(1-p)
Where:
- w+ = actual_watch_time (positive: watched videos)
- w- = 1 (negative: skipped/ignored)
- y ∈ {0, 1} (watched or not)
Key insight: Weight positives by watch duration. A 30-second watch ≠ a 10-minute watch. This aligns ranking with engagement.
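The loss above in code, assuming watch time in seconds as the positive weight. With the same predicted probability, a long watch contributes proportionally more loss than a short one, so the model is pushed hardest on high-engagement examples.

```python
import math

def weighted_logistic_loss(p, y, watch_time_seconds):
    # Positives weighted by watch time (w+ = watch_time); negatives get w- = 1.
    w_pos = watch_time_seconds
    w_neg = 1.0
    return -(w_pos * y * math.log(p) + w_neg * (1 - y) * math.log(1 - p))

# Same prediction p = 0.5, very different losses depending on watch duration:
short_watch = weighted_logistic_loss(p=0.5, y=1, watch_time_seconds=30)
long_watch = weighted_logistic_loss(p=0.5, y=1, watch_time_seconds=600)
```

The 10-minute watch incurs 20x the loss of the 30-second watch, mirroring the 20x ratio in watch time.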
4. Retrieval vs Ranking Comparison
| Aspect | Retrieval | Ranking |
|---|---|---|
| Goal | Recall | Precision |
| Input | User features | User + video + context |
| Output | P(video \| user) | Expected watch time |
| Loss | Softmax cross-entropy | Weighted logistic regression |
| Serve | Nearest neighbor search | Sort candidates |
| Scale | Millions of output classes | Hundreds of candidates |
5. Key Technical Insights
DNN vs Matrix Factorization
Matrix Factorization:
score = u_fixed · v_i
The user embedding is constant across contexts.
DNN (YouTube approach):
u = f(history, tokens, demographics, time, device, geo)
The user embedding adapts to context.
Benefits:
- Incorporates categorical features (device type, geography)
- Captures temporal dynamics (example age → recency bias)
- Enables search token generalization
- Better cold-start (demographics + context)
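The contrast above in a toy example. The "tower" here is a hypothetical elementwise sum standing in for the DNN; the point is only that the dynamic u changes with context while the MF u cannot.

```python
def dynamic_user_embedding(history_emb, context_emb):
    # u = f(history, context); a toy elementwise sum stands in for the DNN tower.
    return [h + c for h, c in zip(history_emb, context_emb)]

def score(u, v):
    return sum(a * b for a, b in zip(u, v))

history = [1.0, 0.0]
u_fixed = history                                         # MF: one u everywhere
u_mobile = dynamic_user_embedding(history, [0.0, 1.0])    # e.g. mobile context
u_desktop = dynamic_user_embedding(history, [0.0, -1.0])  # e.g. desktop context

video = [0.0, 1.0]
```

The same video gets three different scores for the same user depending on context, something a fixed embedding cannot express.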
Embedding Dimensionality
embedding_dim ≈ log(vocabulary_size)
Intuition: embedding capacity grows roughly exponentially with dimension, so the dimension only needs to grow logarithmically with the number of items.
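A quick sanity check of the heuristic, assuming a base-2 log (the notes do not specify the base, so treat the constant factor as illustrative):

```python
import math

def suggested_embedding_dim(vocab_size):
    # Heuristic: dimension ≈ log of vocabulary size (base 2 assumed here),
    # since representable capacity grows exponentially in the dimension.
    return max(1, round(math.log2(vocab_size)))

dims = {n: suggested_embedding_dim(n) for n in (1_000, 1_000_000, 1_000_000_000)}
```

A million-fold increase in vocabulary only doubles the suggested dimension.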
Training Data Sampling
Per-user sampling: Sample K examples per user
Why: Prevents power users (who watch many videos) from dominating the training set.
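A sketch of per-user capping (function name and log format are assumptions): each user contributes at most K examples, so a power user's hundred watches carry the same weight cap as anyone else's.

```python
import random

def sample_per_user(watch_log, k, seed=0):
    """Cap each user at k training examples so heavy watchers don't dominate."""
    rng = random.Random(seed)
    by_user = {}
    for user, video in watch_log:
        by_user.setdefault(user, []).append(video)
    sampled = []
    for user, videos in by_user.items():
        chosen = videos if len(videos) <= k else rng.sample(videos, k)
        sampled.extend((user, v) for v in chosen)
    return sampled

# A power user with 100 watches vs. a casual user with 2:
log = [("power", v) for v in range(100)] + [("casual", 0), ("casual", 1)]
balanced = sample_per_user(log, k=5)
```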
Train on ALL Watches
Use complete watch history, not just recommendation impressions.
Why: Enables collaborative filtering — similar users discover similar content, recommendations generalize.
Example Age Feature
example_age = t_current - t_watched
Helps the model:
- Adapt to trends
- Prefer recent training data
- Handle temporal drift
6. Gradient Update for Ranking
Logistic Regression Backprop
Forward pass:
z = w^T x + b
p = sigmoid(z)
Loss:
L = - (y log(p) + (1-y) log(1-p))
Gradients:
dL/dz = p - y
dL/dw = (p - y) · x
dL/db = p - y
Weight update:
w ← w - η(p - y)x
The derivative simplifies to the residual p - y, the same form as in linear regression.
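The whole update in code, checking that one gradient step on a positive example raises the predicted probability (the unweighted case; the watch-time weight would simply scale the `err` term for positives):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def grad_step(w, b, x, y, lr):
    # Forward pass: z = w·x + b, p = sigmoid(z)
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    p = sigmoid(z)
    # dL/dz = p - y; by the chain rule dL/dw = (p - y)·x and dL/db = p - y
    err = p - y
    w = [wi - lr * err * xi for wi, xi in zip(w, x)]
    b = b - lr * err
    return w, b, p

# One update on a positive example should increase the predicted probability.
w, b = [0.0, 0.0], 0.0
x, y = [1.0, 2.0], 1
w, b, p_before = grad_step(w, b, x, y, lr=0.1)
_, _, p_after = grad_step(w, b, x, y, lr=0.1)
```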
7. Summary of Key Contributions
- Two-stage architecture separates recall (retrieval) from precision (ranking)
- Dynamic embeddings via DNN beat fixed embeddings from matrix factorization
- Watch-time weighting optimizes for engagement, not just clicks
- Collaborative signal from all historical watches (not just impressions)
- Scalable inference via approximate nearest neighbor search
- Feature engineering (example age, demographics, context) critical for adapting to shifts
8. Takeaway
Retrieval:
- Learn embeddings u and v using softmax classification
- Serve via fast nearest-neighbor lookup
- Optimize for recall (get relevant candidates)
Ranking:
- Predict watch time using feature-rich DNN
- Weight loss by actual watch duration
- Optimize for precision (rank best candidates first)
The combination achieves both scale (retrieval) and personalization (ranking) for billions of videos.