Paper

Deep Neural Networks for YouTube Recommendations — Covington et al., KDD 2016


1. Overall Architecture

Two-stage recommendation pipeline:

User Input → Retrieval (Candidate Generation) → Ranking → Final Recommendations
  • Retrieval: High recall, find videos user might watch
  • Ranking: High precision, predict which videos user will engage with

2. Retrieval Stage (Candidate Generation)

Goal

Select a few hundred candidate videos from a corpus of millions (high recall)

Input Features

User-side:

  • Watched video IDs (history)
  • Search tokens (past queries)
  • Demographics (age, location)
  • Context (time of day, device, geography)

Processing:

  • Categorical features → learned embeddings
  • Continuous features → normalized to [0, 1] and fed to the network directly

Model Architecture

User embedding:

u = f(watch_history, search_tokens, demographics, context)
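
A minimal sketch of what f might look like, assuming (as the paper describes) that the variable-length watch and search histories are averaged into fixed-width vectors and concatenated with the dense features before a ReLU layer. Every name, size, and random weight here (make_table, EMBED_DIM, params, and so on) is hypothetical, and pure Python is used only to keep the sketch dependency-free.

```python
import random

random.seed(0)
EMBED_DIM = 4  # hypothetical; real models use far wider embeddings


def make_table(vocab_size, dim=EMBED_DIM):
    """Randomly initialised embedding table (learned in a real model)."""
    return [[random.gauss(0.0, 0.1) for _ in range(dim)] for _ in range(vocab_size)]


def average_embeddings(table, ids, dim=EMBED_DIM):
    """Average a variable-length bag of IDs into one fixed-width vector."""
    if not ids:
        return [0.0] * dim
    return [sum(table[i][d] for i in ids) / len(ids) for d in range(dim)]


def dense_relu(x, weights, bias):
    """One fully connected layer followed by ReLU."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, bias)]


def user_embedding(watch_ids, search_ids, dense_features, params):
    """u = f(watch_history, search_tokens, demographics, context)."""
    x = (average_embeddings(params["video_table"], watch_ids)
         + average_embeddings(params["token_table"], search_ids)
         + list(dense_features))  # continuous inputs, assumed pre-scaled
    return dense_relu(x, params["W"], params["b"])


N_DENSE = 2  # illustrative: two scalar demographic/context features
in_dim = 2 * EMBED_DIM + N_DENSE
params = {
    "video_table": make_table(100),
    "token_table": make_table(50),
    "W": [[random.gauss(0.0, 0.1) for _ in range(in_dim)] for _ in range(EMBED_DIM)],
    "b": [0.0] * EMBED_DIM,
}

u = user_embedding([3, 17, 42], [5, 9], [0.3, 0.0], params)
```

Because u is recomputed from the current history and context on every request, two requests from the same user can yield different embeddings, which is the sense in which the embedding is dynamic.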

Video embedding:

v_i ∈ ℝ^d

Matching score:

score(u, v_i) = u · v_i

This generalizes matrix factorization by making user embeddings dynamic rather than fixed.

Training Objective

Softmax classification over all videos:

P(v_i | u) = exp(u · v_i) / Σ_j exp(u · v_j)
  • Label: One-hot vector (watched video = 1, others = 0)
  • Loss: Cross-entropy
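
The softmax and cross-entropy above can be sketched on a toy corpus as follows. The three-video corpus and the helper names are invented for illustration; a real corpus is far too large for an exact softmax, so the paper trains with sampled negatives instead.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax_probs(u, video_embeddings):
    """P(v_i | u) for every video, using the max-subtraction trick for stability."""
    logits = [dot(u, v) for v in video_embeddings]
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(u, video_embeddings, watched_index):
    """Loss for one example whose one-hot label is at watched_index."""
    return -math.log(softmax_probs(u, video_embeddings)[watched_index])


u = [1.0, 0.0]
videos = [[2.0, 0.0], [0.0, 2.0], [-1.0, 0.0]]  # toy 3-video corpus
probs = softmax_probs(u, videos)
loss = cross_entropy(u, videos, watched_index=0)
```

With a one-hot label, the cross-entropy reduces to the negative log-probability of the watched video, so only that one entry of probs enters the loss.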

Inference

  1. Compute user embedding u
  2. Search nearest neighbor index for top-k videos maximizing u · v_i
  3. Return top candidates
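
The three steps can be sketched with an exact, brute-force search. This is only a stand-in: a production system would query an approximate nearest-neighbor index rather than scoring every video, and the function name and toy vectors are assumptions.

```python
import heapq


def top_k_by_dot_product(u, video_embeddings, k):
    """Return (video_index, score) pairs for the k largest u · v_i."""
    scored = ((i, sum(a * b for a, b in zip(u, v)))
              for i, v in enumerate(video_embeddings))
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


u = [1.0, 0.5]
videos = [[0.9, 0.1], [-1.0, 0.0], [0.5, 1.0], [0.0, 0.2]]
candidates = top_k_by_dot_product(u, videos, k=2)
top_ids = [i for i, _ in candidates]
```

The softmax normalizer is not needed at serving time because it is the same for every video, so ranking by u · v_i gives the same order as ranking by P(v_i | u).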

3. Ranking Stage

Goal

Predict expected watch time (precision optimization)

Model

Input: (user, video, context)

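Output: Click probability p learned under watch-time weighting; the odds e^z approximate expected watch time and serve as the ranking score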

z = f(user, video, context)
p = sigmoid(z) = 1 / (1 + exp(-z))

Training Objective

Time-weighted logistic regression:

L = - w+ · y · log(p) - w- · (1-y) · log(1-p)

Where:

  • w+ = actual_watch_time (positive: clicked impressions, weighted by time watched)
  • w- = 1 (negative: impressions shown but not clicked)
  • y ∈ {0, 1} (clicked or not)

Key insight: Weight positives by watch duration. A 30-second watch ≠ a 10-minute watch. This aligns ranking with engagement.
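
A small sketch of why this weighting makes the model track watch time. It fits a single shared logit (no features) to invented toy impressions and compares the learned odds e^z with the closed form sum(T_i) / (N - k), where T_i are the watch times of the k clicked impressions among N. The function fit_bias and the data are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def weighted_log_loss(z, y, watch_time):
    """L = -w+ · y · log(p) - w- · (1-y) · log(1-p), with w+ = watch_time, w- = 1."""
    p = sigmoid(z)
    if y == 1:
        return -watch_time * math.log(p)
    return -math.log(1.0 - p)


def fit_bias(impressions, lr=0.05, steps=5000):
    """Gradient descent on a single shared logit b (a model with no features)."""
    b = 0.0
    for _ in range(steps):
        p = sigmoid(b)
        # Per-example gradient: watch_time · (p - 1) for positives, p for negatives.
        grad = sum(t * (p - 1.0) if y == 1 else p for y, t in impressions)
        b -= lr * grad / len(impressions)
    return b


# (clicked, watch_time_seconds): 2 clicks out of 10 impressions (toy data)
impressions = [(1, 30.0), (1, 600.0)] + [(0, 0.0)] * 8
b = fit_bias(impressions)
learned_odds = math.exp(b)          # serving-time score e^z
closed_form = (30.0 + 600.0) / 8    # sum(T_i) / (N - k)
```

When clicks are rare, N - k is close to N, so the odds approach the mean watch time per impression; with the 20% click rate of this toy data the two still differ noticeably.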


4. Retrieval vs Ranking Comparison

Aspect    Retrieval                        Ranking
Goal      Recall                           Precision
Input     User features                    User + video + context
Output    P(video | user)                  Expected watch time
Loss      Softmax cross-entropy            Weighted logistic regression
Serve     Nearest neighbor search          Sort candidates
Scale     Millions of videos → hundreds    Hundreds of candidates → dozens shown

5. Key Technical Insights

DNN vs Matrix Factorization

Matrix Factorization:

score = u_fixed · v_i

User embedding is constant across contexts.

DNN (YouTube approach):

u = f(history, tokens, demographics, time, device, geo)

User embedding adapts to context.

Benefits:

  • Incorporates categorical features (device type, geography)
  • Captures temporal dynamics (example age → recency bias)
  • Enables search token generalization
  • Better cold-start (demographics + context)

Embedding Dimensionality

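embedding_dim ∝ log(vocabulary_size)

Intuition: the number of distinguishable points an embedding can represent grows roughly exponentially with its dimension, so a logarithmic increase in width keeps pace with vocabulary size.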

Training Data Sampling

Per-user sampling: Sample K examples per user

Why: Prevents power users (who watch many videos) from dominating the training set.
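
One way to sketch this, assuming "K examples per user" is implemented as a random cap of at most K rows per user (users with fewer rows keep all of them). The function name and the toy log are hypothetical.

```python
import random
from collections import defaultdict


def cap_examples_per_user(examples, k, seed=0):
    """Keep at most k randomly chosen (user_id, video_id) rows per user."""
    rng = random.Random(seed)
    by_user = defaultdict(list)
    for user_id, video_id in examples:
        by_user[user_id].append((user_id, video_id))
    sampled = []
    for rows in by_user.values():
        sampled.extend(rows if len(rows) <= k else rng.sample(rows, k))
    return sampled


# One heavy user (1000 watches) and one light user (3 watches)
log = [("heavy", v) for v in range(1000)] + [("light", v) for v in range(3)]
train = cap_examples_per_user(log, k=5)
heavy_count = sum(1 for user_id, _ in train if user_id == "heavy")
light_count = sum(1 for user_id, _ in train if user_id == "light")
```

Before capping, the heavy user contributes over 99% of the rows; afterwards the split is 5 to 3, so the loss is no longer dominated by one user's taste.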

Train on ALL Watches

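Train on every watch, wherever it originated (including videos embedded on other sites), not only on watches that came from the recommender's own suggestions.

Why: Training only on recommended watches biases the model toward exploitation and makes it hard for new content to surface. When users discover videos by other means, collaborative filtering can propagate that discovery to similar users.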

Example Age Feature

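example_age = t_max - t_watched

Here t_max is the end of the training window. At serving time the feature is set to zero, so the model predicts as if at the very end of that window.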

Helps model:

  • Adapt to trends
  • Recommend fresh content instead of a video's average popularity over the training window
  • Handle temporal drift
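
A sketch of how the feature might be computed, assuming t_current in the formula means the end of the training window and that the feature is pinned to zero at serving time, as the paper describes. The function names, the unit (days), and the timestamps are illustrative assumptions.

```python
SECONDS_PER_DAY = 86400


def example_age_days(watch_ts, training_window_end_ts):
    """Training-time feature: how old the logged watch is, in days."""
    return (training_window_end_ts - watch_ts) / SECONDS_PER_DAY


def serving_example_age():
    """Serving-time value: zero, meaning 'predict for right now'."""
    return 0.0


window_end = 1_700_000_000                     # hypothetical Unix timestamp
watch_ts = window_end - 3 * SECONDS_PER_DAY    # a watch logged three days earlier
age = example_age_days(watch_ts, window_end)
```

Feeding the age during training lets the network attribute part of each watch to when it happened; fixing it to zero at serving asks for the prediction at the most recent point the model has seen.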

6. Gradient Update for Ranking

Logistic Regression Backprop

Forward pass:

z = w^T x + b
p = sigmoid(z)

Loss:

L = - (y log(p) + (1-y) log(1-p))

Gradients:

dL/dz = p - y
dL/dw = (p - y) · x
dL/db = p - y

Weight update:

w ← w - η(p - y)x

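The gradient with respect to z simplifies to p - y, the same form as the gradient of linear regression under squared-error loss. For the watch-time-weighted loss in Section 3, the gradient is scaled by the example weight: dL/dz = w · (p - y), with w = w+ for positives and w = w- for negatives.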
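
The update rule can be checked numerically. The sketch below takes one gradient step on a toy example and compares the analytic gradient (p - y) · x with a central finite difference. The optional weight argument is an added assumption that mirrors the ranking loss, where it would be the watch time for a positive and 1 for a negative.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def loss(w, b, x, y, weight=1.0):
    """Weighted cross-entropy for one example (weight=1 gives the plain loss)."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return -weight * (y * math.log(p) + (1 - y) * math.log(1 - p))


def sgd_step(w, b, x, y, lr, weight=1.0):
    """One update using dL/dz = weight · (p - y)."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    dz = weight * (p - y)
    new_w = [wi - lr * dz * xi for wi, xi in zip(w, x)]
    return new_w, b - lr * dz


w, b, x, y = [0.2, -0.1], 0.0, [1.0, 2.0], 1
eps = 1e-6
# Central finite difference of the loss with respect to w[0]
numeric = (loss([w[0] + eps, w[1]], b, x, y)
           - loss([w[0] - eps, w[1]], b, x, y)) / (2 * eps)
p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
analytic = (p - y) * x[0]   # dL/dw_0 = (p - y) · x_0
new_w, new_b = sgd_step(w, b, x, y, lr=0.1)
```

Here z = 0, so p = 0.5 and the step moves w toward x, which lowers the loss on this positive example.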


7. Summary of Key Contributions

  1. Two-stage architecture separates recall (retrieval) from precision (ranking)
  2. Dynamic embeddings via DNN beat fixed embeddings from matrix factorization
  3. Watch-time weighting optimizes for engagement, not just clicks
  4. Collaborative signal from all historical watches (not just impressions)
  5. Scalable inference via approximate nearest neighbor search
  6. Feature engineering (example age, demographics, context) critical for adapting to shifts

8. Takeaway

Retrieval:

  • Learn embeddings u and v using softmax classification
  • Serve via fast nearest-neighbor lookup
  • Optimize for recall (get relevant candidates)

Ranking:

  • Predict watch time using feature-rich DNN
  • Weight loss by actual watch duration
  • Optimize for precision (rank best candidates first)

The combination achieves both scale (retrieval narrows a corpus of millions to hundreds) and fine-grained personalization (ranking orders those candidates with richer features).