Week 6 Part II · Training Infrastructure
Optimization
Gradient descent; SGD, momentum, and Adam; learning rates and optimization dynamics.
Learning goals
- Understand SGD, momentum, and Adam.
- Reason about learning rates and convergence.
- Tune an optimizer to a target.
This is the weekly homework lab, completed independently after the lecture and the practice lesson. It follows the course's Build / Predict & probe / Explain & defend model: use an AI assistant freely for the Build; the graded learning is in Predict and Explain. See the AI-use policy and a fully worked sample submission.
⚙ Exercise
Part A · Build (AI assistant welcome)
- Build an optimizer-comparison harness that trains the same model with SGD, SGD with momentum, and Adam.
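One possible shape for such a harness is sketched below in plain Python. Everything in it is an illustrative assumption rather than the course's reference solution: the toy linear-regression task, the names `make_data`, `loss_and_grad`, `train`, and `OPTIMIZERS`, and the learning rates. The point it shows is the structure: one shared model, data stream, and training loop, with only the update rule swapped.

```python
import random

def make_data(n=200, seed=0):
    """Synthetic 1-D regression (an assumption): y = 3x + 0.5 plus small noise."""
    rng = random.Random(seed)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [3.0 * x + 0.5 + rng.gauss(0.0, 0.1) for x in xs]
    return list(zip(xs, ys))

def loss_and_grad(params, batch):
    """Mean squared error and its gradient for the model y_hat = w * x + b."""
    w, b = params
    n = len(batch)
    loss = gw = gb = 0.0
    for x, y in batch:
        err = w * x + b - y
        loss += err * err / n
        gw += 2.0 * err * x / n
        gb += 2.0 * err / n
    return loss, [gw, gb]

def sgd_step(params, grads, state, lr):
    # Plain SGD keeps no state; the argument exists so all rules share a signature.
    return [p - lr * g for p, g in zip(params, grads)]

def momentum_step(params, grads, state, lr, beta=0.9):
    v = state.setdefault("v", [0.0] * len(params))
    for i, g in enumerate(grads):
        v[i] = beta * v[i] + g  # velocity: decayed sum of past gradients
    return [p - lr * vi for p, vi in zip(params, v)]

def adam_step(params, grads, state, lr, b1=0.9, b2=0.999, eps=1e-8):
    m = state.setdefault("m", [0.0] * len(params))
    s = state.setdefault("s", [0.0] * len(params))
    t = state.get("t", 0) + 1
    state["t"] = t
    out = []
    for i, (p, g) in enumerate(zip(params, grads)):
        m[i] = b1 * m[i] + (1 - b1) * g      # first-moment estimate
        s[i] = b2 * s[i] + (1 - b2) * g * g  # second-moment estimate
        m_hat = m[i] / (1 - b1 ** t)         # bias correction
        s_hat = s[i] / (1 - b2 ** t)
        out.append(p - lr * m_hat / (s_hat ** 0.5 + eps))
    return out

def train(step_fn, lr, steps=300, batch_size=16, seed=0):
    """Train from the same initial point on the same batches; log loss every step."""
    data = make_data(seed=seed)
    rng = random.Random(seed + 1)
    params, state, losses = [0.0, 0.0], {}, []
    for _ in range(steps):
        batch = rng.sample(data, batch_size)
        loss, grads = loss_and_grad(params, batch)
        losses.append(loss)
        params = step_fn(params, grads, state, lr)
    return params, losses

# Learning rates here are placeholders to be tuned, not recommended values.
OPTIMIZERS = {
    "sgd": (sgd_step, 0.1),
    "momentum": (momentum_step, 0.02),
    "adam": (adam_step, 0.05),
}
results = {name: train(fn, lr) for name, (fn, lr) in OPTIMIZERS.items()}
```

Each entry of `results` holds the final parameters and the per-step loss list, which can be passed to any plotting library for the comparison plots. Fixing the seed keeps the batches identical across optimizers, so differences in the curves come from the update rule alone.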
Part B · Predict & probe (student reasoning)
- For three learning rates (too small, good, too large), predict the loss-curve shape before running.
Part C · Explain & defend (in plain language)
- Explain divergence versus slow convergence in terms of step size, then tune to hit a target validation accuracy.
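The step-size argument can be made concrete on a one-dimensional quadratic, f(w) = 0.5 * a * w**2. A gradient step gives w_new = (1 - lr * a) * w, so the iterate shrinks only when |1 - lr * a| < 1, that is, when 0 < lr < 2 / a. A rate near zero shrinks it very slowly; a rate above 2 / a makes it grow every step. The sketch below assumes a = 1 and three illustrative rates; the function name `gd_on_quadratic` is hypothetical, and a real network's loss surface is only locally similar to this picture.

```python
def gd_on_quadratic(lr, a=1.0, w0=1.0, steps=50):
    """Run gradient descent on f(w) = 0.5 * a * w**2; return the loss at each step."""
    w, losses = w0, []
    for _ in range(steps):
        losses.append(0.5 * a * w * w)
        w -= lr * a * w  # the gradient of f at w is a * w
    return losses

too_small = gd_on_quadratic(lr=0.001)  # factor 0.999 per step: barely moves
good = gd_on_quadratic(lr=0.5)         # factor 0.5 per step: fast decay
too_large = gd_on_quadratic(lr=2.5)    # factor -1.5 per step: oscillates and grows
```

With these settings the first curve is nearly flat, the second falls quickly toward zero, and the third rises without bound.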
✓ Deliverables
- Comparison plots across optimizers.
- The learning-rate prediction table and a tuned run hitting the target.
Hints.
- Log the loss at every step; a diverging loss usually means the learning rate is too high.
- Adam is forgiving but still needs its learning rate tuned.
❓ Self-check
Answer each before expanding it. If one is unclear, revisit the lab and the references.
What is the single most important optimization hyperparameter?
The learning rate.
What does momentum do?
Accumulates a velocity over past gradients to smooth and accelerate descent.
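A small sketch of both effects, using one common form of the update, v = beta * v + g (some texts fold the learning rate into the velocity instead). The helper name `velocity_trace` and the gradient sequences are made up for illustration.

```python
def velocity_trace(grads, beta=0.9):
    """Return the velocity after each gradient under v = beta * v + g."""
    v, trace = 0.0, []
    for g in grads:
        v = beta * v + g
        trace.append(v)
    return trace

steady = velocity_trace([1.0] * 100)      # gradients that always agree
noisy = velocity_trace([1.0, -1.0] * 50)  # gradients that flip sign every step
```

When the gradients agree, the velocity builds toward 1 / (1 - beta), about ten times a single gradient for beta = 0.9, which is the acceleration. When they alternate in sign, consecutive terms largely cancel and the velocity stays near 1 / (1 + beta) in magnitude, smaller than any single gradient, which is the smoothing.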
How does Adam differ from plain SGD?
It adapts a per-parameter step size using running estimates of the gradient's first and second moments.
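To see the per-parameter effect, compare the very first update for two parameters whose gradients differ by four orders of magnitude. The sketch follows the standard Adam update with its usual default constants; the function names `adam_first_step` and `sgd_first_step` and the example gradients are assumptions chosen for illustration.

```python
def adam_first_step(grad, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """Size of Adam's first update for one parameter, starting from zero moments."""
    m = (1 - b1) * grad          # first-moment estimate after one step
    v = (1 - b2) * grad * grad   # second-moment estimate after one step
    m_hat = m / (1 - b1)         # bias correction at t = 1
    v_hat = v / (1 - b2)
    return lr * m_hat / (v_hat ** 0.5 + eps)

def sgd_first_step(grad, lr=0.001):
    """Size of the plain SGD update for the same gradient."""
    return lr * grad

large, small = 100.0, 0.01  # two hypothetical gradient magnitudes
adam_steps = (adam_first_step(large), adam_first_step(small))
sgd_steps = (sgd_first_step(large), sgd_first_step(small))
```

Under SGD the two updates differ by the same factor of 10,000 as the gradients. Under Adam both come out close to `lr`, because each gradient is divided by an estimate of its own magnitude.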
A rising (diverging) loss usually means what?
The learning rate is too high.
What is a learning-rate schedule?
A rule that changes the learning rate over training (e.g. decay or warmup).
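As one example of such a rule, the sketch below combines linear warmup with cosine decay. The function name `lr_at` and every default value are illustrative assumptions, not settings prescribed by the lab.

```python
import math

def lr_at(step, base_lr=0.1, warmup_steps=100, total_steps=1000, min_lr=0.0):
    """Learning rate at a given step: linear warmup, then cosine decay to min_lr."""
    if step < warmup_steps:
        # Ramp up linearly so the first updates are small.
        return base_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))

schedule = [lr_at(s) for s in range(1001)]
```

The rate climbs from `base_lr / warmup_steps` to `base_lr` over the warmup steps, then falls smoothly to `min_lr` by `total_steps`. In a training loop the optimizer would be given `lr_at(step)` at each step instead of a fixed constant.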
Instructor lesson plan (with references)