Week 6 Part II · Training Infrastructure

Optimization

Instructor lesson plan: lecture (3 h) and practice (2 h).

Learning objectives

Understand SGD, momentum, and Adam.
Reason about learning rates and convergence.
Tune an optimizer to a target.

🎓Lecture · 3 hours

0:00–0:10	10 min	Recap & retrieval: Open with two quick questions on last week's material (retrieval practice), then state this week's objectives.
0:10–0:25	15 min	Motivation: Same model, different optimizer or learning rate, wildly different results.
0:25–1:10	45 min	Gradient descent and its variants: Batch gradient descent uses all data per step; SGD uses one minibatch, adding useful noise. Momentum accumulates a velocity over past gradients to smooth and accelerate descent. Adam adapts a per-parameter learning rate from gradient moment estimates. Each optimizer has its own sensible default learning rate.
1:10–1:20	10 min	Break
1:20–2:05	45 min	Learning rate and dynamics: The learning rate is the single most important hyperparameter. Too large a rate diverges or oscillates; too small a rate crawls. Schedules vary the rate over training: warmup ramps it up over the first steps, while step and cosine decay lower it later for a cleaner finish. Reading the loss curve helps diagnose the cause: spikes or divergence suggest the rate is too large, and a slow, nearly flat decline suggests it is too small.
2:05–2:35	30 min	Live demo (predict, then run): Compare SGD, momentum, and Adam on one model, then run a three-rate sweep and a step-decay schedule. Before the sweep, ask the class to predict what the three curves (too small, good, too large) will look like, then reveal them.
2:35–2:50	15 min	Wrap-up & practice preview: Revisit the misconception and concept checks below, recap the takeaways, and preview the practice lesson.
2:50–3:00	10 min	Buffer & questions
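
Optional sketch for the gradient-descent block: the three update rules written out in plain Python. The toy quadratic loss, the helper names (`loss`, `grad`, `sgd_step`, `momentum_step`, `adam_step`), and the learning rates are assumptions chosen for illustration, not taken from the course notebook. The momentum form shown (velocity as a decaying sum of gradients) is one common convention among several.

```python
import math

# Toy loss (an assumption for illustration): f(w) = 0.5 * (w[0]**2 + 25 * w[1]**2)
def loss(w):
    return 0.5 * (w[0] ** 2 + 25.0 * w[1] ** 2)

def grad(w):
    return [w[0], 25.0 * w[1]]

def sgd_step(w, g, lr):
    # Plain update: step against the gradient.
    return [wi - lr * gi for wi, gi in zip(w, g)]

def momentum_step(w, g, v, lr, beta=0.9):
    # Velocity is a decaying sum of past gradients; the step follows the velocity.
    v = [beta * vi + gi for vi, gi in zip(v, g)]
    return [wi - lr * vi for wi, vi in zip(w, v)], v

def adam_step(w, g, m, s, t, lr, b1=0.9, b2=0.999, eps=1e-8):
    # First and second moment estimates, bias-corrected for step t (t starts at 1).
    m = [b1 * mi + (1 - b1) * gi for mi, gi in zip(m, g)]
    s = [b2 * si + (1 - b2) * gi * gi for si, gi in zip(s, g)]
    m_hat = [mi / (1 - b1 ** t) for mi in m]
    s_hat = [si / (1 - b2 ** t) for si in s]
    # Each parameter gets its own effective step size lr / (sqrt(s_hat) + eps).
    w = [wi - lr * mh / (math.sqrt(sh) + eps)
         for wi, mh, sh in zip(w, m_hat, s_hat)]
    return w, m, s

# Run each optimizer for 100 steps from the same start (rates are arbitrary picks).
w_sgd = w_mom = w_adam = [1.0, 1.0]
v = m = s = [0.0, 0.0]
for t in range(1, 101):
    w_sgd = sgd_step(w_sgd, grad(w_sgd), lr=0.01)
    w_mom, v = momentum_step(w_mom, grad(w_mom), v, lr=0.01)
    w_adam, m, s = adam_step(w_adam, grad(w_adam), m, s, t, lr=0.1)
print("final losses:", loss(w_sgd), loss(w_mom), loss(w_adam))
```

A point worth drawing out in class: on its first step Adam moves every parameter by roughly `lr`, whatever the size of that parameter's gradient, which is one reason its sensible learning rates differ from SGD's.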

Common misconception to confront.

Students often think: A smaller learning rate is always the safer choice.
Set it straight: A rate that is too small crawls or stalls in a poor region. The rate must sit in the right range, and schedules help keep it there. There is no universally safe tiny value.

Check for understanding (pose during the concept blocks; let students answer before revealing).

The loss spikes and diverges after a few steps. Learning rate too large or too small?

Too large: the updates overshoot. Lower it, or add warmup.
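
To make the overshoot concrete, here is a minimal sketch of a three-rate sweep on a one-dimensional quadratic. The function `run_gd`, the curvature of 10, and the three rates are assumptions picked for illustration. On this loss each update multiplies `w` by `1 - lr * curvature`, so gradient descent diverges once `lr` exceeds `2 / curvature` (0.2 here).

```python
def run_gd(lr, steps=50, w0=1.0, curvature=10.0):
    """Gradient descent on f(w) = 0.5 * curvature * w**2; returns the loss history."""
    w = w0
    losses = []
    for _ in range(steps):
        losses.append(0.5 * curvature * w * w)  # log the loss before updating
        w -= lr * curvature * w                 # gradient of f is curvature * w
    return losses

# Too small, reasonable, and too large for this particular loss.
for lr in (0.001, 0.05, 0.25):
    history = run_gd(lr)
    print(f"lr={lr}: first loss {history[0]:.3g}, last loss {history[-1]:.3g}")
```

The smallest rate barely reduces the loss in 50 steps, the middle one converges quickly, and the largest one grows without bound because each step overshoots the minimum by more than it started with.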

What does momentum add over plain SGD?

A velocity accumulated from past gradients: it damps oscillation across ravines and accelerates along consistent directions.
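
A small experiment can back this answer up. The sketch below counts steps to convergence on a ravine-shaped quadratic, with and without momentum. The loss, the function name `steps_to_converge`, and the settings (`lr=0.015`, `beta=0.9`) are illustrative assumptions; setting `beta=0.0` recovers plain gradient descent.

```python
def steps_to_converge(lr, beta, tol=1e-6, max_steps=5000):
    """Steps until f(x, y) = 0.5 * (x**2 + 100 * y**2) drops below tol."""
    x, y = 1.0, 1.0
    vx, vy = 0.0, 0.0
    for step in range(max_steps):
        if 0.5 * (x * x + 100.0 * y * y) < tol:
            return step
        gx, gy = x, 100.0 * y      # gradient: shallow along x, steep along y
        vx = beta * vx + gx        # velocity: decaying sum of past gradients
        vy = beta * vy + gy
        x -= lr * vx
        y -= lr * vy
    return max_steps               # did not converge within the budget

plain = steps_to_converge(lr=0.015, beta=0.0)
with_momentum = steps_to_converge(lr=0.015, beta=0.9)
print("plain GD steps:", plain)
print("momentum steps:", with_momentum)
```

Plain gradient descent is held back by the shallow `x` direction, where the steep `y` direction caps the usable learning rate. Momentum builds up speed along `x` while still settling in `y`, so it needs noticeably fewer steps on this problem.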

Key takeaways.

The learning rate is the most important hyperparameter.
Momentum and Adam smooth and adapt the updates.
Read the loss curve to diagnose what went wrong.

💻Practice · 2 hours

In the practice lesson the instructor demonstrates implementations, runs code, and works through examples, using the practice notebook linked below. The weekly lab is then set as homework, where students apply this themselves.

0:00–0:10	10 min	Setup & recap: Recap the lecture's key ideas and open the working notebook.
0:10–1:00	50 min	Instructor demonstrations: Train the same model with SGD, momentum, and Adam, and compare the curves. Sweep three learning rates live and read the resulting curves.
1:00–1:05	5 min	Break
1:05–1:45	40 min	Instructor demonstrations (continued): Add a learning-rate schedule and show its effect.
1:45–2:00	15 min	Wrap-up & lab brief: Summarize the patterns shown and brief the weekly lab (homework), which students complete on their own.
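
For the schedule demonstration, the three schedule shapes named in the lecture can be sketched as plain functions of the step index. The function names and default values (`drop_every=30`, `factor=0.1`) are assumptions for illustration; deep-learning frameworks ship their own schedulers with different interfaces.

```python
import math

def step_decay(base_lr, step, drop_every=30, factor=0.1):
    # Multiply the rate by `factor` once every `drop_every` steps.
    return base_lr * factor ** (step // drop_every)

def cosine_decay(base_lr, step, total_steps, min_lr=0.0):
    # Smoothly anneal from base_lr to min_lr over total_steps.
    progress = min(step, total_steps) / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))

def linear_warmup(base_lr, step, warmup_steps):
    # Ramp up linearly to base_lr, then hold it.
    if step >= warmup_steps:
        return base_lr
    return base_lr * (step + 1) / warmup_steps

for step in (0, 5, 30, 60, 90):
    print(step,
          step_decay(0.1, step),
          cosine_decay(0.1, step, total_steps=100),
          linear_warmup(0.1, step, warmup_steps=10))
```

Note that warmup raises the rate at the start while the two decay schedules lower it later. In practice warmup is often combined with a decay schedule rather than used alone.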

Common pitfalls to pre-empt.

Log loss every step; a diverging loss usually means the learning rate is too high.
Adam is forgiving but still needs its learning rate tuned.
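
One way to act on the first pitfall is to check each logged loss as it arrives. The sketch below is a hypothetical helper, not part of any library: `check_loss` and its `spike_factor` threshold are assumptions, and the heuristic (flag a loss far above the best value seen so far) is deliberately crude.

```python
import math

def check_loss(history, new_loss, spike_factor=10.0):
    """Return a short diagnosis for the newest logged loss."""
    if math.isnan(new_loss) or math.isinf(new_loss):
        return "diverged"   # NaN or inf: usually the learning rate is too high
    if history and new_loss > spike_factor * min(history):
        return "spike"      # far above the best loss seen so far
    return "ok"

history = []
for value in (2.0, 1.5, 1.2, 30.0, float("nan")):
    print(value, check_loss(history, value))
    history.append(value)
```

A "spike" or "diverged" result is a prompt to lower the learning rate or add warmup, not proof of the cause; noisy minibatches can also produce occasional spikes.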

Open the practice notebook in Colab · Curated references · Lab (homework)