Introduction to Deep Learning · HIT

Week 9 · Part III · Architectures & Representation Learning

Convolutional Networks II

Instructor lesson plan: lecture (3 h) and practice (2 h).

Learning objectives

🎓Lecture · 3 hours

0:00–0:10 · 10 min · Recap & retrieval. Open with two quick questions on last week's material (retrieval practice), then state this week's objectives.
0:10–0:25 · 15 min · Motivation. Why naively deeper networks train worse, and the two ideas that fixed it.
0:25–1:10 · 45 min · Normalization
  • Batch normalization standardizes each channel over the minibatch (and, in convolutional layers, over the spatial positions), stabilizing and speeding up training.
  • It allows higher learning rates and reduces sensitivity to initialization.
  • It behaves differently at train and eval (batch statistics versus running statistics).
  • Layer normalization is the batch-independent alternative used in sequence and transformer models.
1:10–1:20 · 10 min · Break
1:20–2:05 · 45 min · Residual connections
  • Very deep plain networks train worse, not better (the degradation problem).
  • A residual block adds the input back to the output: out = F(x) + x.
  • The skip path gives gradients a direct route, so very deep networks train.
  • Match shapes on the skip path with a 1x1 convolution when the channel count changes.
2:05–2:35 · 30 min · Live demo (predict, then run). Before the ablation, ask the class to predict which deep network trains, with residuals or without, then show both curves. Add batch norm and residual blocks, ablate each, and show a deep network training only with residuals.
2:35–2:50 · 15 min · Wrap-up & practice preview. Revisit the misconception and concept checks below, recap the takeaways, and preview the practice lesson.
2:50–3:00 · 10 min · Buffer & questions
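
For the normalization block in the schedule, a small sketch can make the train/eval difference and the batch-versus-layer contrast concrete. This is a teaching toy in plain Python, not a framework layer: `ToyBatchNorm`, `standardize`, `layer_norm`, the `EPS` value, and the momentum convention are all assumptions made for illustration. It has no learnable scale or shift, and it treats each feature column as a channel with no spatial dimensions.

```python
import math

EPS = 1e-5  # small constant for numerical stability (assumed value)


def standardize(values):
    """Return the standardized values plus the mean and variance used."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    normed = [(v - mean) / math.sqrt(var + EPS) for v in values]
    return normed, mean, var


class ToyBatchNorm:
    """Per-feature batch norm on a batch shaped [examples][features].

    Hypothetical teaching class: no learnable scale/shift, no framework.
    """

    def __init__(self, num_features, momentum=0.1):
        self.momentum = momentum
        self.running_mean = [0.0] * num_features
        self.running_var = [1.0] * num_features
        self.training = True

    def __call__(self, batch):
        out = [row[:] for row in batch]
        for j in range(len(self.running_mean)):
            column = [row[j] for row in batch]
            if self.training:
                # Training: use this minibatch's statistics and update
                # the running averages for later use at eval time.
                normed, mean, var = standardize(column)
                m = self.momentum
                self.running_mean[j] = (1 - m) * self.running_mean[j] + m * mean
                self.running_var[j] = (1 - m) * self.running_var[j] + m * var
            else:
                # Eval: use the stored running statistics, so the result
                # does not depend on what else is in the batch.
                mean, var = self.running_mean[j], self.running_var[j]
                normed = [(v - mean) / math.sqrt(var + EPS) for v in column]
            for i, v in enumerate(normed):
                out[i][j] = v
        return out


def layer_norm(batch):
    """Standardize each example over its own features (batch-independent)."""
    return [standardize(row)[0] for row in batch]


batch = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]]

bn = ToyBatchNorm(num_features=2)
train_out = bn(batch)          # normalized with this minibatch's statistics

bn.training = False
eval_out = bn([[2.5, 25.0]])   # single example, normalized with running stats

ln_out = layer_norm(batch)     # each row normalized on its own
```

In training mode each column of `train_out` has mean close to zero. In eval mode a one-example batch still works, because the statistics come from the running averages rather than from the batch. `layer_norm` never looks across examples, which is why it suits settings where batch statistics are unreliable.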
Common misconception to confront.

Students often think: Adding more layers to a plain network can only help, or at worst do nothing.
Set it straight: Very deep plain nets train worse (the degradation problem) because gradients struggle to propagate; residual connections give gradients a direct path so depth helps again.
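
One way to make the gradient argument tangible is a scalar caricature of a deep network. The sketch below is an illustration under strong assumptions, not a model of a real CNN: every layer is the hypothetical map F(x) = layer_slope * x, and `end_to_end_gradient` is a helper name invented here. It multiplies the local derivatives along the chain, with and without the skip term.

```python
def end_to_end_gradient(depth, layer_slope, residual):
    """Derivative of the final output with respect to the input.

    The network is a scalar chain of identical layers, each with
    F(x) = layer_slope * x (a stand-in for a real layer).
    Plain layer:    y = F(x)      -> local derivative layer_slope
    Residual layer: y = F(x) + x  -> local derivative layer_slope + 1
    """
    local = layer_slope + 1.0 if residual else layer_slope
    grad = 1.0
    for _ in range(depth):
        grad *= local  # chain rule: multiply the local derivatives
    return grad


for depth in (5, 20, 50):
    plain = end_to_end_gradient(depth, 0.05, residual=False)
    skip = end_to_end_gradient(depth, 0.05, residual=True)
    print(f"depth={depth:2d}  plain={plain:.3e}  residual={skip:.3e}")
```

With a small slope, the plain product shrinks toward zero as depth grows, while the residual product stays on the order of one because each factor includes the +1 from the skip path. Real networks are not this simple, so treat the numbers as intuition only.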

Check for understanding (pose during the concept blocks; let students answer before revealing).
Q: Why does batch norm behave differently at train and eval?
A: During training it uses the minibatch mean and variance; at eval it uses running statistics, so a single test example is normalized consistently.
Q: Write a residual block output in one line and say why the skip helps.
A: out = F(x) + x; the +x term gives gradients a direct route back and lets the block fall back to the identity (F(x) = 0), so very deep nets still train.
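
If students ask where the "direct route" comes from, the chain rule on the block output shows it. The derivation below is a sketch: the loss symbol \mathcal{L} and the identity matrix I are notation introduced here, and gradients are written as row vectors.

```latex
\begin{aligned}
\text{out} &= F(x) + x \\
\frac{\partial\,\text{out}}{\partial x} &= \frac{\partial F(x)}{\partial x} + I \\
\frac{\partial \mathcal{L}}{\partial x} &= \frac{\partial \mathcal{L}}{\partial\,\text{out}}
  \left( \frac{\partial F(x)}{\partial x} + I \right) \\
F(x) = 0 \;&\Rightarrow\; \text{out} = x
\end{aligned}
```

The identity term passes the upstream gradient through unchanged even when the Jacobian of F is small, and setting F to zero recovers the identity mapping.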
Key takeaways.

💻Practice · 2 hours

In the practice lesson the instructor demonstrates implementations, runs code, and works through examples, using the practice notebook linked below. The weekly lab is then set as homework, where students apply this themselves.

0:00–0:10 · 10 min · Setup & recap. Recap the lecture's key ideas and open the working notebook.
0:10–1:00 · 50 min · Instructor demonstrations
  • Add batch normalization and residual blocks to the CNN.
  • Ablate normalization and residuals live and compare the training curves.
1:00–1:05 · 5 min · Break
1:05–1:45 · 40 min · Instructor demonstrations (continued)
  • Show a deeper network failing without residuals and training with them.
1:45–2:00 · 15 min · Wrap-up & lab brief. Summarize the patterns shown and brief the weekly lab (homework), which students complete on their own.
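
Before the framework demonstration, it can help to show the shape logic of a residual block on its own. The sketch below is a plain-Python toy, not the notebook's code: `conv1x1`, `relu`, and `residual_block` are hypothetical helpers, the feature map is flattened to a list of per-pixel channel vectors, and the branch F is a single 1x1 convolution plus ReLU rather than a realistic stack of convolutions and normalization.

```python
def conv1x1(pixels, weight):
    """1x1 convolution: the same linear map over channels at every pixel.

    pixels: list of per-pixel channel vectors, each of length c_in
    weight: c_out rows, each of length c_in
    """
    return [[sum(w * v for w, v in zip(row, px)) for row in weight]
            for px in pixels]


def relu(pixels):
    return [[max(0.0, v) for v in px] for px in pixels]


def residual_block(pixels, branch_weight, skip_weight=None):
    """out = F(x) + skip(x); skip is the identity unless skip_weight is given."""
    fx = relu(conv1x1(pixels, branch_weight))  # F(x), a toy branch
    skip = pixels if skip_weight is None else conv1x1(pixels, skip_weight)
    if len(fx[0]) != len(skip[0]):
        raise ValueError("channel mismatch: pass skip_weight to project the skip path")
    return [[a + b for a, b in zip(f_px, s_px)]
            for f_px, s_px in zip(fx, skip)]


x = [[1.0, 2.0], [3.0, 4.0], [-1.0, 0.5]]  # 3 pixels, 2 channels

# Same channel count: identity skip. A zero branch leaves x unchanged.
zero_branch = [[0.0, 0.0], [0.0, 0.0]]
same = residual_block(x, zero_branch)

# Channel count changes 2 -> 3: the skip path needs a 1x1 projection.
widen_branch = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
widen_skip = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
wider = residual_block(x, widen_branch, widen_skip)
```

Calling `residual_block(x, widen_branch)` without `skip_weight` raises a `ValueError`, which mirrors the shape error students tend to hit when the channel count changes and the skip path is left as a bare identity.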
Common pitfalls to pre-empt.

Open the practice notebook in Colab · Curated references · Lab (homework)
