# Deep Neural Networks for YouTube RecommendationsTwo-stage retrieval + ranking architecture, dynamic user embeddings, and watch-time optimization for recommendation systems

## Paper

[Deep Neural Networks for YouTube Recommendations](https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/45530.pdf) — Covington et al., KDD 2016

---

## 1. Overall Architecture

**Two-stage recommendation pipeline:**

```
User Input → Retrieval (Candidate Generation) → Ranking → Final Recommendations
```

- **Retrieval**: High recall, find videos user *might* watch
- **Ranking**: High precision, predict which videos user *will* engage with

---

## 2. Retrieval Stage (Candidate Generation)

### Goal
Find ~1000 candidate videos from billions (high recall)

### Input Features

**User-side:**
- Watched video IDs (history)
- Search tokens (past queries)
- Demographics (age, location)
- Context (time of day, device, geography)

**Processing:**
- Categorical features → learned embeddings
- Continuous features → passed as-is

### Model Architecture

**User embedding:**
```
u = f(watch_history, search_tokens, demographics, context)
```

**Video embedding:**
```
v_i ∈ ℝ^d
```

**Matching score:**
```
score(u, v_i) = u · v_i
```

This generalizes matrix factorization by making user embeddings **dynamic** rather than fixed.

### Training Objective

**Softmax classification** over all videos:

```
P(v_i | u) = exp(u · v_i) / Σ_j exp(u · v_j)
```

- **Label**: One-hot vector (watched video = 1, others = 0)
- **Loss**: Cross-entropy

### Inference

1. Compute user embedding `u`
2. Search nearest neighbor index for top-k videos maximizing `u · v_i`
3. Return top candidates

---

## 3. Ranking Stage

### Goal
Predict expected watch time (precision optimization)

### Model

**Input:** (user, video, context)

**Output:** Watch-time probability

```
z = f(user, video, context)
p = sigmoid(z) = 1 / (1 + exp(-z))
```

### Training Objective

**Time-weighted logistic regression:**

```
L = - w+ · y · log(p) - w- · (1-y) · log(1-p)
```

Where:
- `w+ = actual_watch_time` (positive: watched videos)
- `w- = 1` (negative: skipped/ignored)
- `y ∈ {0, 1}` (watched or not)

**Key insight:** Weight positives by watch duration. A 30-second watch ≠ a 10-minute watch. This aligns ranking with engagement.

---

## 4. Retrieval vs Ranking Comparison

| Aspect | Retrieval | Ranking |
|--------|-----------|---------|
| **Goal** | Recall | Precision |
| **Input** | User features | User + video + context |
| **Output** | P(video \| user) | Expected watch time |
| **Loss** | Softmax cross-entropy | Weighted logistic regression |
| **Serve** | Nearest neighbor search | Sort candidates |
| **Scale** | 1000s of outputs | Hundreds of candidates |

---

## 5. Key Technical Insights

### DNN vs Matrix Factorization

**Matrix Factorization:**
```
score = u_fixed · v_i
```
User embedding is constant across contexts.

**DNN (YouTube approach):**
```
u = f(history, tokens, demographics, time, device, geo)
```
User embedding adapts to context.

**Benefits:**
- Incorporates categorical features (device type, geography)
- Captures temporal dynamics (example age → recency bias)
- Enables search token generalization
- Better cold-start (demographics + context)

### Embedding Dimensionality

```
embedding_dim ≈ log(|vocabulary_size|)
```

Intuition: embedding capacity grows exponentially with dimension.

### Training Data Sampling

**Per-user sampling:** Sample K examples per user

**Why:** Prevents power users (who watch many videos) from dominating the training set.

### Train on ALL Watches

Use complete watch history, not just recommendation impressions.

**Why:** Enables collaborative filtering — similar users discover similar content, recommendations generalize.

### Example Age Feature

```
example_age = t_current - t_watched
```

Helps model:
- Adapt to trends
- Prefer recent training data
- Handle temporal drift

---

## 6. Gradient Update for Ranking

### Logistic Regression Backprop

**Forward pass:**
```
z = w^T x + b
p = sigmoid(z)
```

**Loss:**
```
L = - (y log(p) + (1-y) log(1-p))
```

**Gradients:**
```
dL/dz = p - y
dL/dw = (p - y) · x
dL/db = p - y
```

**Weight update:**
```
w ← w - η(p - y)x
```

The derivative simplifies to `p - y`, same as linear regression.

---

## 7. Summary of Key Contributions

1. **Two-stage architecture** separates recall (retrieval) from precision (ranking)
2. **Dynamic embeddings via DNN** beat fixed embeddings from matrix factorization
3. **Watch-time weighting** optimizes for engagement, not just clicks
4. **Collaborative signal** from all historical watches (not just impressions)
5. **Scalable inference** via approximate nearest neighbor search
6. **Feature engineering** (example age, demographics, context) critical for adapting to shifts

---

## 8. Takeaway

**Retrieval:**
- Learn embeddings u and v using softmax classification
- Serve via fast nearest-neighbor lookup
- Optimize for recall (get relevant candidates)

**Ranking:**
- Predict watch time using feature-rich DNN
- Weight loss by actual watch duration
- Optimize for precision (rank best candidates first)

The combination achieves both scale (retrieval) and personalization (ranking) for billions of videos.
