Deep Neural Networks for YouTube Recommendations

Paper

Deep Neural Networks for YouTube Recommendations. Covington, Adams, and Sargin, RecSys 2016.

1. Overall Architecture

Two-stage recommendation pipeline:

User Input → Retrieval (Candidate Generation) → Ranking → Final Recommendations

Retrieval: High recall, find videos user might watch
Ranking: High precision, predict which videos user will engage with
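The two-stage flow can be sketched in a few lines. This is a minimal illustration, not the paper's system: `recommend`, `toy_retrieve`, `toy_score`, and the candidate counts are hypothetical stand-ins for the two trained models.

```python
from typing import Callable, List


def recommend(user: dict,
              retrieve: Callable[[dict, int], List[str]],
              score: Callable[[dict, str], float],
              n_candidates: int = 200,
              n_final: int = 10) -> List[str]:
    """Run retrieval first, then rank only the retrieved candidates."""
    candidates = retrieve(user, n_candidates)  # cheap model, high recall
    ranked = sorted(candidates, key=lambda v: score(user, v), reverse=True)
    return ranked[:n_final]                    # expensive model, high precision


# Toy stand-ins for the two models (hypothetical, for illustration only).
corpus = [f"v{i}" for i in range(1000)]
toy_retrieve = lambda user, k: corpus[:k]
toy_score = lambda user, v: -abs(int(v[1:]) - user["favorite"])

print(recommend({"favorite": 42}, toy_retrieve, toy_score, n_final=3))
```

Note the consequence of the split: the ranker only ever sees what retrieval returns, so a relevant video that retrieval misses cannot be recovered later.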

2. Retrieval Stage (Candidate Generation)

Goal

Find a few hundred candidate videos from a corpus of millions (high recall)

Input Features

User-side:

Watched video IDs (history)
Search tokens (past queries)
Demographics (age, location)
Context (time of day, device, geography)

Processing:

Categorical features → learned embeddings
Continuous features → fed in directly as real values, normalized to [0, 1]
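A rough sketch of the processing step, assuming the variable-length watch history is averaged into one fixed-size vector (as the paper does) and one continuous feature is scaled to [0, 1]. The table contents, `EMBED_DIM`, and the helper names are illustrative assumptions; real embeddings are learned jointly with the network.

```python
import random

random.seed(0)
EMBED_DIM = 4  # illustrative; real embeddings are much wider

# Hypothetical embedding table: one vector per video ID (randomly initialized here).
video_table = {vid: [random.gauss(0, 0.1) for _ in range(EMBED_DIM)]
               for vid in ["a", "b", "c"]}


def average_embedding(ids, table, dim):
    """Average the embeddings of a variable-length ID list into one fixed vector."""
    if not ids:
        return [0.0] * dim
    vecs = [table[i] for i in ids]
    return [sum(col) / len(vecs) for col in zip(*vecs)]


def min_max(x, lo, hi):
    """Clamp a continuous feature to [lo, hi] and scale it into [0, 1]."""
    return (min(max(x, lo), hi) - lo) / (hi - lo)


watch_vec = average_embedding(["a", "c"], video_table, EMBED_DIM)
age_norm = min_max(30, lo=0, hi=100)
network_input = watch_vec + [age_norm]  # concatenation fed to the first layer
```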

Model Architecture

User embedding:

u = f(watch_history, search_tokens, demographics, context)

Video embedding:

v_i ∈ ℝ^d

Matching score:

score(u, v_i) = u · v_i

This generalizes matrix factorization by making user embeddings dynamic rather than fixed.
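To make "dynamic" concrete, here is a toy user tower with a single ReLU layer. The weights `W`, bias `b`, and the one-dimensional context feature are made-up values for illustration; the real model stacks several layers over many more features.

```python
def relu_layer(x, W, b):
    """One fully connected layer with ReLU activation."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x)) + bi)
            for row, bi in zip(W, b)]


def user_embedding(history_vec, context_vec, W, b):
    """u = f(history, context): concatenate the inputs and apply the layer."""
    return relu_layer(history_vec + context_vec, W, b)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Hypothetical weights: 3 inputs (2 history dims + 1 context dim) -> 2 outputs.
W = [[1.0, 0.0, 0.5],
     [0.0, 1.0, -0.5]]
b = [0.0, 0.0]

history = [0.2, 0.4]
u_mobile = user_embedding(history, [1.0], W, b)   # context flag on
u_desktop = user_embedding(history, [0.0], W, b)  # context flag off

v = [1.0, 1.0]  # a fixed video embedding
print(dot(u_mobile, v), dot(u_desktop, v))  # same history, different scores
```

With the same watch history, changing the context changes u and therefore the score for the same video, which a fixed per-user vector cannot do.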

Training Objective

Softmax classification over all videos:

P(v_i | u) = exp(u · v_i) / Σ_j exp(u · v_j)

Label: One-hot vector (watched video = 1, others = 0)
Loss: Cross-entropy
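The objective above, written out for a tiny corpus. This sketch computes the full softmax; at YouTube scale the paper samples negative classes instead of summing over every video, which is not shown here. The vectors are made-up values.

```python
import math


def softmax_probs(u, videos):
    """P(v_i | u) for every video, using dot-product logits."""
    logits = [sum(a * b for a, b in zip(u, v)) for v in videos]
    m = max(logits)                          # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(u, videos, watched_idx):
    """With a one-hot label, the loss reduces to -log P(watched video | u)."""
    return -math.log(softmax_probs(u, videos)[watched_idx])


u = [1.0, 0.0]
videos = [[2.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]  # logits: 2, 0, -1
probs = softmax_probs(u, videos)
loss = cross_entropy(u, videos, watched_idx=0)
```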

Inference

Compute user embedding u
Search nearest neighbor index for top-k videos maximizing u · v_i
Return top candidates
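The lookup step amounts to a maximum inner product search. The sketch below does it by brute force over a small dictionary; a production system would use an approximate nearest neighbor index instead. The embedding values and video names are invented for the example.

```python
import heapq


def top_k(u, video_embeddings, k):
    """Return the k video IDs with the largest dot product u · v_i (exact search)."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    return heapq.nlargest(k, video_embeddings,
                          key=lambda vid: dot(u, video_embeddings[vid]))


video_embeddings = {
    "cats": [0.9, 0.1],
    "dogs": [0.8, 0.3],
    "news": [-0.5, 0.9],
    "golf": [0.1, -0.7],
}
u = [1.0, 0.2]
print(top_k(u, video_embeddings, k=2))  # ['cats', 'dogs']
```

Only the ordering of scores matters at serving time, so the softmax normalization used in training is unnecessary here.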

3. Ranking Stage

Goal

Predict expected watch time (precision optimization)

Model

Input: (user, video, context)

Output: A logit z. For training it is squashed into a probability p; at serving time exp(z) approximates expected watch time.

z = f(user, video, context)
p = sigmoid(z) = 1 / (1 + exp(-z))

Training Objective

Time-weighted logistic regression:

L = - w+ · y · log(p) - w- · (1-y) · log(1-p)

Where:

w+ = actual_watch_time (positive: impression was clicked and watched)
w- = 1 (negative: impression was shown but not clicked)
y ∈ {0, 1} (clicked or not)

Key insight: Weight positives by watch duration. A 30-second watch ≠ a 10-minute watch. This aligns ranking with engagement.
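A small numerical check of why this weighting works. For a model with only a constant logit, setting the derivative of the weighted loss to zero gives odds equal to total watch time divided by the number of negatives, which is close to expected watch time per impression when clicks are rare. The impression data below is invented, and its click rate is deliberately high so the gap between the two quantities is visible.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def weighted_log_loss(z, y, watch_time):
    """Positives are weighted by watch time, negatives by 1."""
    p = sigmoid(z)
    if y == 1:
        return -watch_time * math.log(p)
    return -math.log(1.0 - p)


# Hypothetical impressions: (clicked, watch_seconds)
impressions = [(1, 30.0), (1, 600.0), (0, 0.0), (0, 0.0), (0, 0.0)]


def total_loss(z):
    return sum(weighted_log_loss(z, y, t) for y, t in impressions)


# Closed-form minimizer for an intercept-only model:
# odds = (sum of watch times over positives) / (number of negatives)
total_watch = sum(t for y, t in impressions if y == 1)
n_negative = sum(1 for y, _ in impressions if y == 0)
best_odds = total_watch / n_negative   # 630 / 3 = 210
z_star = math.log(best_odds)

mean_watch_per_impression = total_watch / len(impressions)  # 630 / 5 = 126
```

Here exp(z_star) is 210 while the mean watch time per impression is 126, because 40% of impressions were clicked. As the click rate shrinks, the number of negatives approaches the number of impressions and the two values converge.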

4. Retrieval vs Ranking Comparison

Aspect	Retrieval	Ranking
Goal	Recall	Precision
Input	User + context features	User + video + context features
Output	P(video | user)	Expected watch time
Loss	Softmax cross-entropy	Weighted logistic regression
Serve	Nearest neighbor search	Score and sort candidates
Scale	Millions of videos → hundreds of candidates	Hundreds of candidates → dozens shown

5. Key Technical Insights

DNN vs Matrix Factorization

Matrix Factorization:

score = u_fixed · v_i

User embedding is constant across contexts.

DNN (YouTube approach):

u = f(history, tokens, demographics, time, device, geo)

User embedding adapts to context.

Benefits:

Incorporates categorical features (device type, geography)
Captures temporal dynamics (example age → recency bias)
Enables search token generalization
Better cold-start (demographics + context)

Embedding Dimensionality

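embedding_dim ∝ log(vocabulary_size)

Intuition: the number of distinguishable points in an embedding space grows exponentially with its dimension, so the dimension only needs to grow logarithmically with the number of unique values.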

Training Data Sampling

Per-user sampling: Sample K examples per user

Why: Prevents power users (who watch many videos) from dominating the training set.
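One way to implement the cap, as a hedged sketch: group watches by user and keep at most K per user. Keeping all examples for users with fewer than K watches, and sampling uniformly at random, are assumptions of this example rather than details from the paper.

```python
import random
from collections import defaultdict


def sample_per_user(watches, k, seed=0):
    """Keep at most k (user, video) examples per user."""
    rng = random.Random(seed)
    by_user = defaultdict(list)
    for user, video in watches:
        by_user[user].append(video)

    sampled = []
    for user, vids in by_user.items():
        chosen = vids if len(vids) <= k else rng.sample(vids, k)
        sampled.extend((user, v) for v in chosen)
    return sampled


watches = ([("power_user", f"v{i}") for i in range(1000)]
           + [("casual", "v1"), ("casual", "v2")])
train = sample_per_user(watches, k=5)
# The power user contributes 5 examples instead of 1000; the casual user keeps both.
```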

Train on ALL Watches

Use complete watch history, not just recommendation impressions.

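Why: Watches that happen outside the recommender still carry collaborative signal, so a video users discover elsewhere can quickly propagate to similar users. Training only on recommended impressions would bias the model toward what it already surfaces and make it hard for new content to appear.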

Example Age Feature

example_age = t_current - t_watched

Helps model:

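Adapt to trends
Learn that popularity depends on time, instead of averaging it over the whole training window
Handle temporal drift (at serving time the feature is set to zero, so predictions reflect the end of the training window)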
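A minimal sketch of computing the feature, assuming timestamps in seconds and taking t_current to be the end of the training log. The constant and function names are illustrative. At serving time the paper fixes this feature at zero.

```python
SECONDS_PER_DAY = 86_400


def example_age_days(t_watched: float, t_current: float) -> float:
    """Age of a training example in days: t_current - t_watched."""
    return (t_current - t_watched) / SECONDS_PER_DAY


# Training: each logged watch gets an age relative to the end of the log.
training_end = 10 * SECONDS_PER_DAY
watch_times = [1 * SECONDS_PER_DAY, 9 * SECONDS_PER_DAY]
ages = [example_age_days(t, training_end) for t in watch_times]  # [9.0, 1.0]

# Serving: the feature is held at zero, i.e. "predict as of right now".
serving_age = 0.0
```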

6. Gradient Update for Ranking

Logistic Regression Backprop

Forward pass:

z = w^T x + b
p = sigmoid(z)

Loss:

L = - (y log(p) + (1-y) log(1-p))

Gradients:

dL/dz = p - y
dL/dw = (p - y) · x
dL/db = p - y

Weight update:

w ← w - η(p - y)x

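The gradient with respect to the logit simplifies to p - y (prediction minus label), which has the same form as the gradient of squared-error loss in linear regression, up to a constant factor.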
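The closed-form gradient can be checked against a finite-difference estimate, followed by one update step. The weights, input, and learning rate are arbitrary values chosen for the example, and the loss is the unweighted version from this section.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(w, b, x):
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def loss(w, b, x, y):
    p = forward(w, b, x)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def grads(w, b, x, y):
    """Analytic gradients: dL/dw = (p - y) * x, dL/db = p - y."""
    p = forward(w, b, x)
    return [(p - y) * xi for xi in x], p - y


w, b, x, y = [0.5, -0.3], 0.1, [1.0, 2.0], 1
gw, gb = grads(w, b, x, y)

# Central finite difference on w[0] as an independent check.
eps = 1e-6
num_gw0 = (loss([w[0] + eps, w[1]], b, x, y)
           - loss([w[0] - eps, w[1]], b, x, y)) / (2 * eps)

# One gradient descent step: w <- w - eta * (p - y) * x
eta = 0.1
w_new = [wi - eta * gi for wi, gi in zip(w, gw)]
b_new = b - eta * gb
```

For a positive example (y = 1) the step raises the logit, so the loss after the update is lower than before.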

7. Summary of Key Contributions

Two-stage architecture separates recall (retrieval) from precision (ranking)
Dynamic embeddings via DNN beat fixed embeddings from matrix factorization
Watch-time weighting optimizes for engagement, not just clicks
Collaborative signal from all historical watches (not just impressions)
Scalable inference via approximate nearest neighbor search
Feature engineering (example age, demographics, context) critical for adapting to shifts

8. Takeaway

Retrieval:

Learn embeddings u and v using softmax classification
Serve via fast nearest-neighbor lookup
Optimize for recall (get relevant candidates)

Ranking:

Predict watch time using feature-rich DNN
Weight loss by actual watch duration
Optimize for precision (rank best candidates first)

The combination achieves both scale (retrieval) and fine-grained personalization (ranking) over a corpus of millions of videos.
