Week 3 Part I · Foundations
MLPs & Backpropagation
Instructor lesson plan: lecture (3 h) and practice (2 h).
Learning objectives
- Build and train a multilayer perceptron.
- Understand the forward pass and backpropagation.
- Use autograd correctly and verify a gradient by hand.
🎓Lecture · 3 hours
| 0:00–0:10 | 10 min | Recap & retrieval. Open with two quick questions on last week's material (retrieval practice), then state this week's objectives. |
| 0:10–0:25 | 15 min | Motivation. From a single linear layer to a universal function approximator, and how learning actually happens. |
| 0:25–1:10 | 45 min | The multilayer perceptron
- An MLP stacks linear layers with nonlinear activations (ReLU, sigmoid, tanh).
- The nonlinearity is essential: stacked linear layers are still just a linear map.
- nn.Module holds the parameters and defines the forward pass; calling the module runs forward.
- Width and depth set capacity; more is not always better.
|
| 1:10–1:20 | 10 min | Break |
| 1:20–2:05 | 45 min | Backpropagation and autograd
- The forward pass builds a computational graph of the operations applied.
- Backpropagation applies the chain rule backward through that graph to get each parameter's gradient.
- Autograd records the graph automatically; .backward() fills .grad; optimizer.step() updates the weights.
- Gradients accumulate, so call zero_grad() each iteration; verify a gradient against a finite-difference estimate.
|
| 2:05–2:35 | 30 min | Live demo (predict, then run). Ask the class to predict .grad after the second backward (with no zero_grad) before revealing it. Build an MLP, inspect .grad before and after zero_grad, and check a hand-derived gradient against autograd. |
| 2:35–2:50 | 15 min | Wrap-up & practice preview. Revisit the misconception and concept checks below, recap the takeaways, and preview the practice lesson. |
| 2:50–3:00 | 10 min | Buffer & questions |
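The module and training loop covered in the two concept blocks can be sketched as follows. This is a minimal illustration, assuming PyTorch is installed, and not the course notebook: the class name TinyMLP, the layer sizes, the toy target, and the learning rate are all hypothetical choices made for the example.

```python
import torch
from torch import nn

torch.manual_seed(0)  # reproducible initialization for the sketch


class TinyMLP(nn.Module):
    """Two linear layers with a ReLU in between (hypothetical sizes)."""

    def __init__(self, in_dim=2, hidden=16, out_dim=1):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, hidden)
        self.fc2 = nn.Linear(hidden, out_dim)

    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))


# Toy regression data (an assumption): the target is the product of the
# two inputs, which a purely linear model cannot represent.
x = torch.rand(256, 2) * 2 - 1
y = (x[:, 0] * x[:, 1]).unsqueeze(1)

model = TinyMLP()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()

losses = []
for step in range(200):
    optimizer.zero_grad()        # clear gradients left over from the last step
    loss = loss_fn(model(x), y)  # calling the module runs forward()
    loss.backward()              # backpropagation accumulates into .grad
    optimizer.step()             # update the weights from .grad
    losses.append(loss.item())

print(f"first loss {losses[0]:.4f}, last loss {losses[-1]:.4f}")
```

Swapping torch.relu for the identity in forward() is a quick way to show the class what the nonlinearity contributes.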
Common misconception to confront. Students often think: Backpropagation is a separate, mysterious algorithm.
Set it straight: Backprop is exactly the chain rule applied backward over the computational graph; autograd just records the graph and applies it automatically.
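One way to make this concrete is a single-neuron example where every chain-rule factor can be written out by hand. The sketch below assumes PyTorch; the numbers and the choice of tanh with a squared-error loss are illustrative assumptions, not part of the lesson materials.

```python
import torch

# Fixed input and target; w and b are the parameters we differentiate.
x, t = torch.tensor(1.5), torch.tensor(0.2)
w = torch.tensor(0.8, requires_grad=True)
b = torch.tensor(-0.3, requires_grad=True)

# Forward pass: each line adds a node to the computational graph.
z = w * x + b
h = torch.tanh(z)
loss = (h - t) ** 2

# Autograd walks the graph backward and fills w.grad and b.grad.
loss.backward()

# The same gradients by hand, one chain-rule factor per graph edge.
with torch.no_grad():
    dL_dh = 2 * (h - t)      # d/dh of (h - t)^2
    dh_dz = 1 - h ** 2       # d/dz of tanh(z)
    hand_dw = dL_dh * dh_dz * x    # dz/dw = x
    hand_db = dL_dh * dh_dz * 1.0  # dz/db = 1

print(torch.allclose(w.grad, hand_dw), torch.allclose(b.grad, hand_db))
```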
Check for understanding (pose during the concept blocks; let students answer before revealing).
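After two backward() calls on the same loss (recomputed on the same inputs, or with retain_graph=True on the first call) and no zero_grad() in between, what is in .grad?
Twice the single-call gradient: each backward() adds to .grad rather than overwriting it, which is why you zero the gradients each step.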
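A small sketch of this check, assuming PyTorch. The loss is recomputed before each backward() so that a fresh graph exists for every call; the helper name loss_fn and the values are hypothetical.

```python
import torch

w = torch.tensor([1.0, -2.0], requires_grad=True)
x = torch.tensor([3.0, 4.0])


def loss_fn():
    # (w . x)^2, so the gradient is 2 * (w . x) * x = [-30, -40] here.
    return (w * x).sum() ** 2


loss_fn().backward()
first = w.grad.clone()       # [-30, -40]

loss_fn().backward()         # no zeroing: the new gradient is added
second = w.grad.clone()      # [-60, -80]

w.grad.zero_()               # what optimizer.zero_grad() does per parameter
loss_fn().backward()
after_zero = w.grad.clone()  # [-30, -40] again

print(first, second, after_zero)
```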
How would you sanity-check that autograd is correct?
Compare against a finite-difference estimate, or use torch.autograd.gradcheck.
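Both checks can be sketched in a few lines, assuming PyTorch. The function f, the step size, and the tolerances are illustrative assumptions; double precision is used because finite differences are too noisy in float32 for a tight comparison.

```python
import torch

torch.manual_seed(0)


def f(w):
    # A small scalar function of the parameters (hypothetical example).
    return torch.tanh(w).pow(2).sum()


w = torch.randn(5, dtype=torch.float64, requires_grad=True)

# Gradient from autograd.
f(w).backward()
autograd_grad = w.grad.clone()

# Central finite differences: perturb one coordinate at a time.
eps = 1e-6
fd_grad = torch.zeros_like(w)
with torch.no_grad():
    for i in range(w.numel()):
        step = torch.zeros_like(w)
        step[i] = eps
        fd_grad[i] = (f(w + step) - f(w - step)) / (2 * eps)

max_err = (autograd_grad - fd_grad).abs().max().item()
print(f"max abs difference: {max_err:.2e}")

# The built-in check does the same comparison and raises on a mismatch.
ok = torch.autograd.gradcheck(f, (w,), eps=1e-6, atol=1e-4)
print("gradcheck passed:", ok)
```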
Key takeaways.
- Nonlinearity is what makes depth worthwhile.
- Backpropagation is the chain rule on a graph.
- Autograd automates gradients, but they must be zeroed each step.
💻Practice · 2 hours
In the practice lesson the instructor demonstrates implementations, runs code, and works through examples, using the practice notebook linked below. The weekly lab is then set as homework, where students apply this themselves.
| 0:00–0:10 | 10 min | Setup & recap. Recap the lecture's key ideas and open the working notebook. |
| 0:10–1:00 | 50 min | Instructor demonstrations
- Build an MLP with nn.Module live and train it on a small task.
- Inspect .grad after a backward pass and show the effect of zero_grad.
|
| 1:00–1:05 | 5 min | Break |
| 1:05–1:45 | 40 min | Instructor demonstrations (continued)
- Compare a hand-computed gradient with autograd on a tiny example.
|
| 1:45–2:00 | 15 min | Wrap-up & lab brief. Summarize the patterns shown and brief the weekly lab (homework), which students complete on their own. |
Common pitfalls to pre-empt.
- Zero the gradients each step, since they accumulate otherwise; check hand-derived gradients against torch.autograd.gradcheck.
- Without a nonlinearity an MLP collapses to a linear model.
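The collapse in the last pitfall can be shown numerically. The sketch below assumes PyTorch and uses arbitrary layer sizes: it folds two stacked linear layers into one weight matrix and bias, then shows that the same fold no longer matches once a ReLU sits between them.

```python
import torch
from torch import nn

torch.manual_seed(0)

fc1 = nn.Linear(4, 8)
fc2 = nn.Linear(8, 3)
x = torch.randn(10, 4)

with torch.no_grad():
    # Two linear layers applied back to back, no activation.
    stacked = fc2(fc1(x))

    # The equivalent single layer: W = W2 W1, b = W2 b1 + b2.
    W = fc2.weight @ fc1.weight
    b = fc2.weight @ fc1.bias + fc2.bias
    collapsed = x @ W.T + b

    # Same layers with a ReLU in between.
    with_relu = fc2(torch.relu(fc1(x)))

same_without_relu = torch.allclose(stacked, collapsed, atol=1e-5)
same_with_relu = torch.allclose(with_relu, collapsed, atol=1e-5)
print(same_without_relu, same_with_relu)
```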
Open the practice notebook in Colab · Curated references · Lab (homework)