Week 9 · Part III · Architectures & Representation Learning
Convolutional Networks II
Instructor lesson plan: lecture (3 h) and practice (2 h).
Learning objectives
- Add normalization and residual connections.
- Understand why these help deeper networks train.
- Measure the effect of each with an ablation.
🎓 Lecture · 3 hours
| 0:00–0:10 | 10 min | Recap & retrieval: Open with two quick questions on last week's material (retrieval practice), then state this week's objectives. |
| 0:10–0:25 | 15 min | Motivation: Why naively deeper networks train worse, and the two ideas that fixed it (normalization and residual connections). |
| 0:25–1:10 | 45 min | Normalization: Batch normalization standardizes each channel using the mean and variance computed over the minibatch, then applies a learned scale and shift; this stabilizes and speeds up training. It allows higher learning rates and reduces sensitivity to initialization. It behaves differently at train and eval (batch statistics versus running statistics). Layer normalization is the batch-independent alternative used in sequence and transformer models: it normalizes across the features of each example instead of across the batch. |
| 1:10–1:20 | 10 min | Break |
| 1:20–2:05 | 45 min | Residual connections: Very deep plain networks train worse, not better (the degradation problem). A residual block adds its input back to the output of its layers: out = F(x) + x. The skip path gives gradients a direct route, so very deep networks train. Match shapes on the skip path with a 1x1 convolution when the channel count changes. |
| 2:05–2:35 | 30 min | Live demo (predict, then run): Before the ablation, ask the class to predict which deep network will train, the one with residuals or the one without, then show both curves. Add batch norm and residual blocks, ablate each, and show that the deep network trains only with residuals. |
| 2:35–2:50 | 15 min | Wrap-up & practice preview: Revisit the misconception and concept checks below, recap the takeaways, and preview the practice lesson. |
| 2:50–3:00 | 10 min | Buffer & questions |
Common misconception to confront. Students often think: Adding more layers to a plain network can only help, or at worst do nothing.
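Set it straight: Very deep plain nets reach higher training error than shallower ones (the degradation problem), so this is an optimization failure rather than overfitting. Gradients struggle to propagate through many stacked layers; residual connections give them a direct path, so depth helps again.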
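One way to make the gradient argument concrete is a toy scalar chain. The sketch below is an illustration under stated assumptions, not a real network: the layer F(x) = w * tanh(x), the weight 0.5, the starting input 0.5, and the name `chain_gradient` are all invented for this example.

```python
import math

def chain_gradient(depth, weight=0.5, residual=False, x=0.5):
    """Return d(output)/d(input) for `depth` stacked toy layers.

    Each layer is F(x) = weight * tanh(x), an assumed stand-in for a real
    layer. With residual=True the layer computes F(x) + x instead.
    """
    grad = 1.0
    for _ in range(depth):
        fx = weight * math.tanh(x)
        local = weight * (1.0 - math.tanh(x) ** 2)  # F'(x), at most `weight`
        if residual:
            x = fx + x
            grad *= local + 1.0  # d(F(x) + x)/dx = F'(x) + 1
        else:
            x = fx
            grad *= local        # chain rule: product of small factors
    return grad

plain_grad = chain_gradient(depth=30)
residual_grad = chain_gradient(depth=30, residual=True)
print(f"plain:    {plain_grad:.3e}")
print(f"residual: {residual_grad:.3e}")
```

In the plain chain every factor is at most 0.5, so the product shrinks geometrically with depth. In the residual chain every factor is greater than 1 because of the +1 from the skip path, so the gradient cannot vanish in this toy setting.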
Check for understanding (pose during the concept blocks; let students answer before revealing).
Why does batch norm behave differently at train and eval?
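At train it uses the mean and variance of the current minibatch; at eval it uses running averages of those statistics accumulated during training, so a single test example is normalized consistently, independent of whatever else is in its batch.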
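The train/eval difference can be sketched in a few lines of plain Python. This is a minimal toy for one channel, assuming a momentum of 0.1 and omitting the learned scale and shift; the class name `ToyBatchNorm` and the activation values are made up for illustration, and real framework layers differ in details.

```python
import math

class ToyBatchNorm:
    """Batch norm for a single channel, without the learned scale and shift."""

    def __init__(self, momentum=0.1, eps=1e-5):
        self.momentum = momentum
        self.eps = eps
        self.running_mean = 0.0
        self.running_var = 1.0
        self.training = True

    def __call__(self, values):
        if self.training:
            # Train: normalize with this minibatch's own statistics.
            mean = sum(values) / len(values)
            var = sum((v - mean) ** 2 for v in values) / len(values)
            # Keep exponential running averages for later use at eval.
            m = self.momentum
            self.running_mean = (1 - m) * self.running_mean + m * mean
            self.running_var = (1 - m) * self.running_var + m * var
        else:
            # Eval: use the stored running statistics, whatever the batch.
            mean, var = self.running_mean, self.running_var
        return [(v - mean) / math.sqrt(var + self.eps) for v in values]

bn = ToyBatchNorm()
batch = [2.0, 4.0, 6.0, 8.0]          # made-up activations: mean 5, variance 5
for _ in range(200):                  # repeated steps let the running stats settle
    train_out = bn(batch)

bn.training = False
eval_single = bn([8.0])               # normalized with the running mean and variance
train_single = ToyBatchNorm()([8.0])  # train mode: a batch of one is its own mean
print(eval_single, train_single)
```

In train mode a batch of one always normalizes to zero, because the example is its own mean. In eval mode the same value is placed relative to the statistics seen during training, which is why forgetting to switch modes before evaluation gives misleading results.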
Write a residual block output in one line and say why the skip helps.
out = F(x) + x; the +x term gives gradients a direct route back, and the block can represent the identity simply by driving F(x) toward zero, so very deep nets still train.
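The one-line answer can be written as a small function. The sketch below works on flat lists of floats rather than feature maps, and `residual_block`, `zero_branch`, and `small_branch` are hypothetical names standing in for a real block and its learned layers.

```python
def residual_block(x, branch):
    """Return branch(x) + x elementwise; x is a flat list of floats."""
    fx = branch(x)
    if len(fx) != len(x):
        raise ValueError("branch output and skip input must have the same shape")
    return [f + xi for f, xi in zip(fx, x)]

def zero_branch(x):
    # A branch whose weights have all been driven to zero.
    return [0.0 for _ in x]

def small_branch(x):
    # Hypothetical learned correction: a small scaled copy of the input.
    return [0.1 * xi for xi in x]

x = [1.0, -2.0, 3.0]
identity_out = residual_block(x, zero_branch)
corrected_out = residual_block(x, small_branch)
print(identity_out)
print(corrected_out)
```

With a zero branch the block passes its input through unchanged, so an extra residual block can always fall back to doing nothing; the branch only has to learn a correction on top of the identity.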
Key takeaways.
- Normalization stabilizes and accelerates training.
- Residual connections let gradients flow through deep nets.
- Compare training curves, not just final accuracy.
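A possible way to organize the ablation is a 2x2 grid over the two components, plus a curve-level metric instead of a single final number. The helper names `ablation_grid` and `first_epoch_below` and the config keys are assumptions for this sketch; the training loop itself is left to the notebook.

```python
from itertools import product

def ablation_grid():
    """Return the four variants of a 2x2 ablation over batch norm and residuals."""
    variants = []
    for use_bn, use_res in product([False, True], repeat=2):
        name = f"bn={'on' if use_bn else 'off'}, residual={'on' if use_res else 'off'}"
        variants.append({"name": name, "use_batchnorm": use_bn, "use_residual": use_res})
    return variants

def first_epoch_below(curve, threshold):
    """Return the first epoch whose training loss is below `threshold`, else None."""
    for epoch, loss in enumerate(curve):
        if loss < threshold:
            return epoch
    return None

for variant in ablation_grid():
    print(variant["name"])
```

Train one model per variant with everything else held fixed, record the per-epoch training loss, and pass each recorded curve to `first_epoch_below` with the same threshold. A variant that never reaches the threshold returns `None`, which is more informative than its final accuracy alone.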
💻 Practice · 2 hours
In the practice lesson the instructor demonstrates implementations, runs code, and works through examples, using the practice notebook linked below. The weekly lab is then set as homework, where students apply this themselves.
| 0:00–0:10 | 10 min | Setup & recap: Recap the lecture's key ideas and open the working notebook. |
| 0:10–1:00 | 50 min | Instructor demonstrations: Add batch normalization and residual blocks to the CNN. Ablate normalization and residuals live and compare the training curves. |
| 1:00–1:05 | 5 min | Break |
| 1:05–1:45 | 40 min | Instructor demonstrations (continued): Show a deeper network failing without residuals and training with them. |
| 1:45–2:00 | 15 min | Wrap-up & lab brief: Summarize the patterns shown and brief the weekly lab (homework), which students complete on their own. |
Common pitfalls to pre-empt.
- Residual paths need matching shapes; use a 1x1 conv to match channels.
- Compare training curves, not just final accuracy.
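The shape-matching pitfall can be sketched without any framework. A 1x1 convolution is a per-pixel linear map across channels, so it can change the channel count of the skip path without touching height or width. The helpers `shape`, `conv1x1`, and `add_maps` and the weight values below are hypothetical; in a real network the projection weights are learned.

```python
def shape(t):
    """Shape of a [channels][height][width] nested list."""
    return (len(t), len(t[0]), len(t[0][0]))

def conv1x1(x, weight):
    """Apply a 1x1 convolution: x is [C_in][H][W], weight is [C_out][C_in]."""
    _, height, width = shape(x)
    return [
        [[sum(w * x[c][i][j] for c, w in enumerate(row)) for j in range(width)]
         for i in range(height)]
        for row in weight
    ]

def add_maps(a, b):
    """Elementwise sum of two feature maps; refuses mismatched shapes."""
    if shape(a) != shape(b):
        raise ValueError(f"shape mismatch: {shape(a)} vs {shape(b)}")
    return [[[p + q for p, q in zip(ra, rb)] for ra, rb in zip(ca, cb)]
            for ca, cb in zip(a, b)]

# Input with 2 channels; the branch F(x) is assumed to output 3 channels.
x = [[[1.0, 2.0], [3.0, 4.0]],
     [[10.0, 20.0], [30.0, 40.0]]]
fx = [[[0.0, 0.0], [0.0, 0.0]] for _ in range(3)]  # placeholder branch output

# Illustrative projection weights mapping 2 channels to 3.
projection = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = add_maps(fx, conv1x1(x, projection))
print(shape(out))
```

Calling `add_maps(fx, x)` directly raises the shape error that students typically hit; projecting the skip path first makes the addition valid.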
Open the practice notebook in Colab · Curated references · Lab (homework)