Week 6 Part II · Training Infrastructure

Optimization

Gradient descent; SGD, momentum, and Adam; learning rates and optimization dynamics.

Learning goals

Understand SGD, momentum, and Adam.
Reason about learning rates and convergence.
Tune an optimizer to a target.

This is the weekly homework lab, completed independently after the lecture and the practice lesson. It follows the course's Build / Predict & probe / Explain & defend model: use an AI assistant freely for the Build; the graded learning is in Predict and Explain. See the AI-use policy and a fully worked sample submission.

⚙ Exercise

Part A · Build (AI assistant welcome)

Build an optimizer-comparison harness that trains the same model with SGD, SGD with momentum, and Adam.
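As a starting point, a harness of this kind might look like the sketch below. It is not the required solution: the "model" is a two-parameter quadratic loss standing in for a real network, and every name (`CURVATURES`, `START`, `run`, `CONFIGS`) and every learning rate is an illustrative assumption. The momentum step uses the `v = beta * v + g` convention; other conventions exist.

```python
import math

# Toy stand-in for "the same model": loss(w) = 0.5 * sum(c_i * w_i**2).
# CURVATURES and START are illustrative values, not course-provided settings.
CURVATURES = [1.0, 25.0]
START = [1.0, 1.0]

def loss(w):
    return 0.5 * sum(c * x * x for c, x in zip(CURVATURES, w))

def grad(w):
    return [c * x for c, x in zip(CURVATURES, w)]

def sgd_step(w, g, state, lr):
    # Plain gradient step; `state` is unused but keeps the signatures uniform.
    return [x - lr * gi for x, gi in zip(w, g)]

def momentum_step(w, g, state, lr, beta=0.9):
    # Velocity accumulates past gradients: v = beta * v + g.
    v = state.setdefault("v", [0.0] * len(w))
    for i, gi in enumerate(g):
        v[i] = beta * v[i] + gi
    return [x - lr * vi for x, vi in zip(w, v)]

def adam_step(w, g, state, lr, b1=0.9, b2=0.999, eps=1e-8):
    # Bias-corrected first and second moment estimates, per parameter.
    m = state.setdefault("m", [0.0] * len(w))
    s = state.setdefault("s", [0.0] * len(w))
    t = state["t"] = state.get("t", 0) + 1
    out = []
    for i, gi in enumerate(g):
        m[i] = b1 * m[i] + (1 - b1) * gi
        s[i] = b2 * s[i] + (1 - b2) * gi * gi
        m_hat = m[i] / (1 - b1 ** t)
        s_hat = s[i] / (1 - b2 ** t)
        out.append(w[i] - lr * m_hat / (math.sqrt(s_hat) + eps))
    return out

def run(step_fn, lr, steps=200):
    # Same start point and same loss for every optimizer; log loss every step.
    w, state, history = list(START), {}, []
    for _ in range(steps):
        history.append(loss(w))
        w = step_fn(w, grad(w), state, lr)
    history.append(loss(w))
    return history

# Learning rates here are arbitrary picks for the toy problem.
CONFIGS = {
    "sgd": (sgd_step, 0.05),
    "momentum": (momentum_step, 0.01),
    "adam": (adam_step, 0.05),
}

curves = {name: run(fn, lr) for name, (fn, lr) in CONFIGS.items()}
for name, hist in curves.items():
    print(f"{name:9s} start={hist[0]:.3f} end={hist[-1]:.3e}")
```

Keeping every optimizer behind the same `step_fn(w, g, state, lr)` signature means the training loop is shared, so any difference between curves comes from the update rule and its learning rate rather than from the harness. In a real submission the quadratic would be replaced by the actual model and data, and `curves` would feed the comparison plots.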

Part B · Predict & probe (student reasoning)

For three learning rates (too small, good, too large), predict the loss-curve shape before running.

Part C · Explain & defend (in plain language)

Explain divergence versus slow convergence in terms of step size, then tune to hit a target validation accuracy.
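One way to ground the step-size argument is a one-dimensional quadratic, where the effect of the learning rate can be read off directly. The sketch below assumes `loss(w) = 0.5 * c * w**2` with an illustrative curvature `c = 4.0`; on this loss each gradient-descent step multiplies `w` by `(1 - lr * c)`, so the iterate shrinks only when `0 < lr < 2 / c`. The function name `gd_losses` and the three rates are hypothetical choices, not values from the lab.

```python
def gd_losses(lr, c=4.0, w0=1.0, steps=20):
    """Loss after each gradient-descent step on loss(w) = 0.5 * c * w**2."""
    w, losses = w0, []
    for _ in range(steps + 1):
        losses.append(0.5 * c * w * w)
        w -= lr * c * w  # equivalent to w *= (1 - lr * c)
    return losses

# With c = 4.0 the stability threshold is lr = 2 / c = 0.5.
too_small = gd_losses(0.001)  # factor 0.996: shrinks, but very slowly
good = gd_losses(0.2)         # factor 0.2: shrinks fast
too_large = gd_losses(0.6)    # factor -1.4: overshoots further each step

for name, losses in [("too small", too_small), ("good", good), ("too large", too_large)]:
    print(f"{name:9s} first={losses[0]:.3g} last={losses[-1]:.3g}")
```

Real networks have no single curvature, so the threshold is not this clean, but the same picture applies along the steepest direction of the loss: a step that is too long overshoots and grows, and a step that is too short makes progress that is correct but slow.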

✓ Deliverables

Comparison plots across optimizers.
The learning-rate prediction table and a tuned run hitting the target.

Hints.

Log the loss at every step; a diverging loss usually means the learning rate is too high.
Adam is forgiving but still needs its learning rate tuned.

❓ Self-check

Answer each question before reading the answer beneath it. If one is unclear, revisit the lab and the references.

What is the single most important optimization hyperparameter?

The learning rate.

What does momentum do?

Accumulates a velocity over past gradients to smooth and accelerate descent.
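The smoothing and accelerating effects can be seen in isolation by feeding a velocity accumulator two hand-made gradient sequences. This is a minimal sketch assuming the `v = beta * v + g` convention with `beta = 0.9`; some libraries and textbooks scale the gradient by `(1 - beta)` instead, and the helper name `velocities` is illustrative.

```python
def velocities(grads, beta=0.9):
    """Running velocity v = beta * v + g for a sequence of scalar gradients."""
    v, out = 0.0, []
    for g in grads:
        v = beta * v + g
        out.append(v)
    return out

# Consistent gradients: velocity builds up toward g / (1 - beta), i.e. 10x here.
steady = velocities([1.0] * 100)

# Sign-flipping gradients: successive contributions largely cancel.
zigzag = velocities([1.0, -1.0] * 50)

print(f"steady final velocity: {steady[-1]:.3f}")
print(f"zigzag largest |velocity|: {max(abs(v) for v in zigzag):.3f}")
```

Directions where the gradient keeps the same sign get a larger effective step, while directions where it oscillates are damped, which is why momentum tends to help in narrow valleys of the loss surface.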

How does Adam differ from plain SGD?

It adapts a per-parameter learning rate using gradient moment estimates.
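To make the per-parameter scaling concrete, the sketch below compares the first update that plain SGD and Adam would apply to two parameters whose gradients differ by five orders of magnitude. It assumes the commonly used defaults `b1 = 0.9`, `b2 = 0.999`, `eps = 1e-8`; the helper `adam_update` and the gradient values are illustrative.

```python
import math

def adam_update(g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam step for a single scalar parameter; returns (delta, m, v)."""
    m = b1 * m + (1 - b1) * g        # first moment: running mean of gradients
    v = b2 * v + (1 - b2) * g * g    # second moment: running mean of squares
    m_hat = m / (1 - b1 ** t)        # bias correction for the zero init
    v_hat = v / (1 - b2 ** t)
    return -lr * m_hat / (math.sqrt(v_hat) + eps), m, v

lr = 1e-3
grads = [1e-3, 100.0]  # one tiny gradient, one huge gradient

sgd_deltas = [-lr * g for g in grads]
adam_deltas = [adam_update(g, 0.0, 0.0, t=1, lr=lr)[0] for g in grads]

print("SGD  updates:", sgd_deltas)   # magnitudes track the raw gradients
print("Adam updates:", adam_deltas)  # both magnitudes are close to lr
```

Because each gradient is divided by an estimate of its own typical magnitude, the size of an Adam update is governed mostly by `lr` rather than by the raw gradient scale. That is part of why Adam is forgiving, and also why its learning rate still has to be tuned.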

A rising (diverging) loss usually means what?

The learning rate is too high.

What is a learning-rate schedule?

A rule that changes the learning rate over training (e.g. decay or warmup).
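As one example of such a rule, the sketch below implements linear warmup followed by cosine decay. The function name `lr_at` and the values `base_lr = 0.1`, `warmup_steps = 10`, `total_steps = 100` are assumptions chosen for illustration; step decay or exponential decay would fit the same definition.

```python
import math

def lr_at(step, base_lr=0.1, warmup_steps=10, total_steps=100):
    """Learning rate for a 0-indexed step: linear warmup, then cosine decay."""
    if step < warmup_steps:
        # Ramp linearly from base_lr / warmup_steps up to base_lr.
        return base_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * min(1.0, progress)))

schedule = [lr_at(s) for s in range(100)]
for s in (0, 9, 10, 50, 99):
    print(f"step {s:3d}: lr = {schedule[s]:.5f}")
```

In a training loop the optimizer's learning rate would be set to `lr_at(step)` before each update. Warmup keeps early steps small while gradient statistics are still unreliable, and the decay shrinks the step size as training approaches a minimum.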
