Week 3 Part I · Foundations

MLPs & Backpropagation

Instructor lesson plan: lecture (3 h) and practice (2 h).

Learning objectives

Build and train a multilayer perceptron.
Understand the forward pass and backpropagation.
Use autograd correctly and verify a gradient by hand.

🎓 Lecture · 3 hours

0:00–0:10	10 min	Recap & retrieval: Open with two quick questions on last week's material (retrieval practice), then state this week's objectives.
0:10–0:25	15 min	Motivation: From a single linear layer to a universal function approximator, and how learning actually happens.
0:25–1:10	45 min	The multilayer perceptron: An MLP stacks linear layers with nonlinear activations (ReLU, sigmoid, tanh). The nonlinearity is essential: without it, stacked linear layers compose into a single linear (affine) map. nn.Module holds the parameters and defines the forward pass; calling the module, as in model(x), runs forward. Width and depth set capacity; more is not always better, because extra capacity costs compute and can overfit small datasets.
1:10–1:20	10 min	Break
1:20–2:05	45 min	Backpropagation and autograd: The forward pass builds a computational graph of the operations applied. Backpropagation applies the chain rule backward through that graph to get the gradient of the loss with respect to each parameter. Autograd records the graph automatically; loss.backward() adds each gradient into the parameter's .grad; optimizer.step() updates the weights. Because .grad accumulates across backward passes, call optimizer.zero_grad() every iteration; verify a gradient against a finite-difference estimate.
2:05–2:35	30 min	Live demo (predict, then run): Ask the class to predict .grad after a second backward pass (with no zero_grad) before revealing it. Build an MLP, inspect .grad before and after zero_grad, and check a hand-derived gradient against autograd.
2:35–2:50	15 min	Wrap-up & practice preview: Revisit the misconception and concept checks below, recap the takeaways, and preview the practice lesson.
2:50–3:00	10 min	Buffer & questions
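
The two concept blocks in the schedule above (an nn.Module that defines the forward pass, then the zero_grad / backward / step cycle) can be tied together in one short sketch. It assumes PyTorch is installed; the class name TinyMLP, the layer sizes, the XOR-style toy data, and the learning rate are illustrative choices, not taken from the course notebook.

```python
import torch
from torch import nn

torch.manual_seed(0)  # reproducible initial weights


class TinyMLP(nn.Module):
    """Hypothetical two-layer MLP: Linear -> ReLU -> Linear."""

    def __init__(self, in_features=2, hidden=8, out_features=1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_features, hidden),
            nn.ReLU(),  # the nonlinearity between the two linear layers
            nn.Linear(hidden, out_features),
        )

    def forward(self, x):
        return self.net(x)


# Toy XOR-style data (an assumption for illustration only).
x = torch.tensor([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = torch.tensor([[0.0], [1.0], [1.0], [0.0]])

model = TinyMLP()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()

losses = []
for step in range(200):
    optimizer.zero_grad()      # clear gradients left over from the last step
    pred = model(x)            # calling the module runs forward()
    loss = loss_fn(pred, y)
    loss.backward()            # autograd writes into each parameter's .grad
    optimizer.step()           # update the weights using .grad
    losses.append(loss.item())
```

Plotting or printing losses during the lecture gives the class something to watch while the cycle is explained line by line.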

Common misconception to confront.

Students often think: Backpropagation is a separate, mysterious algorithm.
Set it straight: Backprop is exactly the chain rule applied backward over the computational graph; autograd just records the graph and applies it automatically.
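
One way to make this concrete on the board is a one-neuron example worked without any framework. The sketch below is plain Python; the function names, the sigmoid-plus-squared-error setup, and the numeric values are assumptions chosen for illustration. It applies the chain rule backward through the graph z → a → loss and compares the result with a central finite-difference estimate.

```python
import math


def forward(w, b, x, t):
    """Forward pass through the graph: z = w*x + b, a = sigmoid(z), loss = (a - t)^2."""
    z = w * x + b
    a = 1.0 / (1.0 + math.exp(-z))
    loss = (a - t) ** 2
    return z, a, loss


def backward(w, b, x, t):
    """Chain rule applied from the loss back to w and b."""
    _, a, _ = forward(w, b, x, t)
    dloss_da = 2.0 * (a - t)       # derivative of (a - t)^2 with respect to a
    da_dz = a * (1.0 - a)          # derivative of sigmoid(z) with respect to z
    dloss_dz = dloss_da * da_dz    # chain rule: loss -> a -> z
    return dloss_dz * x, dloss_dz  # dz/dw = x, dz/db = 1


def finite_difference(w, b, x, t, eps=1e-6):
    """Central-difference estimate of the same two derivatives."""
    def loss(w_, b_):
        return forward(w_, b_, x, t)[2]

    d_w = (loss(w + eps, b) - loss(w - eps, b)) / (2 * eps)
    d_b = (loss(w, b + eps) - loss(w, b - eps)) / (2 * eps)
    return d_w, d_b


w, b, x, t = 0.5, -0.3, 2.0, 1.0
grad_w, grad_b = backward(w, b, x, t)
fd_w, fd_b = finite_difference(w, b, x, t)
print(grad_w, fd_w)
print(grad_b, fd_b)
```

The two printed pairs should agree to several decimal places, which is the whole argument: nothing beyond the chain rule was needed.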

Check for understanding (pose during the concept blocks; let students answer before revealing).

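After two backward() calls on the same loss with no zero_grad() in between (recomputing the forward pass each time, or passing retain_graph=True to the first call), what is in .grad?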

Twice the single-step gradient: gradients accumulate, which is why you zero them each step.

How would you sanity-check that autograd is correct?

Compare against a finite-difference estimate, or use torch.autograd.gradcheck.
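
A minimal sketch of the gradcheck route, assuming PyTorch is installed. The function tiny_mlp and the tensor shapes are hypothetical; tanh is used so the function is smooth everywhere, and the inputs are double precision because gradcheck's numerical comparison is unreliable in single precision.

```python
import torch
from torch.autograd import gradcheck

torch.manual_seed(0)


def tiny_mlp(x, w1, b1, w2, b2):
    """Functional two-layer MLP with a tanh hidden layer (illustrative only)."""
    hidden = torch.tanh(x @ w1 + b1)
    return hidden @ w2 + b2


# Every input that should be checked needs requires_grad=True and dtype double.
x = torch.randn(3, 2, dtype=torch.double, requires_grad=True)
w1 = torch.randn(2, 4, dtype=torch.double, requires_grad=True)
b1 = torch.randn(4, dtype=torch.double, requires_grad=True)
w2 = torch.randn(4, 1, dtype=torch.double, requires_grad=True)
b2 = torch.randn(1, dtype=torch.double, requires_grad=True)

# Compares autograd's gradients with finite-difference estimates;
# returns True on success and raises an error on a mismatch by default.
ok = gradcheck(tiny_mlp, (x, w1, b1, w2, b2), eps=1e-6, atol=1e-4)
print(ok)
```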

Key takeaways.

Nonlinearity is what makes depth worthwhile.
Backpropagation is the chain rule on a graph.
Autograd automates gradients, but they must be zeroed each step.
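
The last takeaway can be shown in a few lines, assuming PyTorch is installed. The tensors and the helper loss_fn are illustrative; the loss is chosen so that its gradient with respect to w is simply x, which makes the doubling easy to read off.

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)
x = torch.tensor([3.0, 4.0])


def loss_fn():
    # Rebuild the graph on every call; d(loss)/dw equals x.
    return (w * x).sum()


loss_fn().backward()
first = w.grad.clone()    # tensor([3., 4.])

loss_fn().backward()      # no zeroing in between
second = w.grad.clone()   # tensor([6., 8.]): the new gradient was added

w.grad.zero_()            # reset by hand; optimizer.zero_grad() plays this role in a training loop
loss_fn().backward()
third = w.grad.clone()    # tensor([3., 4.]) again

print(first, second, third)
```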

💻 Practice · 2 hours

In the practice lesson the instructor demonstrates implementations, runs code, and works through examples, using the practice notebook linked below. The weekly lab is then set as homework, where students apply this themselves.

0:00–0:10	10 min	Setup & recap: Recap the lecture's key ideas and open the working notebook.
0:10–1:00	50 min	Instructor demonstrations: Build an MLP with nn.Module live and train it on a small task. Inspect .grad after a backward pass and show the effect of zero_grad.
1:00–1:05	5 min	Break
1:05–1:45	40 min	Instructor demonstrations (continued): Compare a hand-computed gradient with autograd on a tiny example.
1:45–2:00	15 min	Wrap-up & lab brief: Summarize the patterns shown and brief the weekly lab (homework), which students complete on their own.
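
For the hand-computed comparison in the second demonstration block, one possible tiny example is a linear model with mean squared error, whose gradient has a closed form. This is a sketch assuming PyTorch is installed; the data shapes and variable names are illustrative and may differ from what the practice notebook uses.

```python
import torch

torch.manual_seed(0)

X = torch.randn(5, 3)                   # 5 samples, 3 features
y = torch.randn(5)
w = torch.randn(3, requires_grad=True)

# Autograd: loss = mean((Xw - y)^2)
loss = ((X @ w - y) ** 2).mean()
loss.backward()

# By hand: d(loss)/dw = (2 / n) * X^T (Xw - y)
with torch.no_grad():
    residual = X @ w - y
    hand_grad = (2.0 / X.shape[0]) * (X.T @ residual)

match = torch.allclose(w.grad, hand_grad, atol=1e-5)
print(w.grad)
print(hand_grad)
print(match)
```

Deriving the closed form on the board first, then revealing w.grad, follows the same predict-then-run pattern as the lecture demo.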

Common pitfalls to pre-empt.

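Forgetting to zero the gradients each step lets them silently accumulate; when a gradient looks wrong, check it with torch.autograd.gradcheck (on double-precision inputs).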
Without a nonlinearity an MLP collapses to a linear model.
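
The collapse can be checked numerically. The sketch below, assuming PyTorch is installed and using arbitrary layer sizes, folds two stacked nn.Linear layers into one by multiplying their weight matrices, then shows how to compare against the same stack with a ReLU in between.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Two linear layers with no activation between them.
stacked = nn.Sequential(nn.Linear(3, 5), nn.Linear(5, 2))
first, second = stacked[0], stacked[1]

# Fold them into a single layer: W = W2 @ W1, b = W2 @ b1 + b2.
merged = nn.Linear(3, 2)
with torch.no_grad():
    merged.weight.copy_(second.weight @ first.weight)
    merged.bias.copy_(second.weight @ first.bias + second.bias)

x = torch.randn(4, 3)
same_without_activation = torch.allclose(stacked(x), merged(x), atol=1e-5)

# With a ReLU in between, the folded layer is in general no longer equivalent.
with_relu = nn.Sequential(first, nn.ReLU(), second)
same_with_relu = torch.allclose(with_relu(x), merged(x), atol=1e-5)

print(same_without_activation, same_with_relu)
```

The first flag should be True up to floating-point error; the second is expected to be False for typical random weights, since the ReLU zeroes out some hidden units.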

Open the practice notebook in Colab · Curated references · Lab (homework)