Introduction to Deep Learning · HIT

Week 6   Part II · Training Infrastructure

Optimization

Instructor lesson plan: lecture (3 h) and practice (2 h).

Learning objectives

🎓 Lecture · 3 hours

0:00–0:10 · 10 min · Recap & retrieval: Open with two quick questions on last week's material (retrieval practice), then state this week's objectives.
0:10–0:25 · 15 min · Motivation: Same model, different optimizer or learning rate, wildly different results.
0:25–1:10 · 45 min · Gradient descent and its variants
  • Batch gradient descent computes the gradient on all the data at every step; minibatch SGD uses one small random batch, which is cheaper per step and adds noise that often helps.
  • Momentum accumulates a velocity over past gradients to smooth and accelerate descent.
  • Adam adapts a per-parameter learning rate from gradient moment estimates.
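  • Sensible default learning rates differ by optimizer (Adam is commonly started near 1e-3, plain SGD often higher), so a rate tuned for one rarely transfers to another.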
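The three update rules above can be sketched in plain Python. This is a minimal illustration, not the notebook's code: the toy loss, the function names, and the learning rates are assumptions chosen for the example, and the gradient is exact, so there is no minibatch noise here.

```python
import math

def loss(w):
    # Hypothetical toy loss: a quadratic bowl, 10x steeper along w[1].
    return 0.5 * (w[0] ** 2 + 10.0 * w[1] ** 2)

def grad(w):
    return [w[0], 10.0 * w[1]]

def sgd_step(w, g, lr):
    """Plain SGD: step against the current gradient."""
    return [wi - lr * gi for wi, gi in zip(w, g)]

def momentum_step(w, g, v, lr, beta=0.9):
    """Momentum: the velocity v accumulates past gradients; the step follows v."""
    v = [beta * vi + gi for vi, gi in zip(v, g)]
    w = [wi - lr * vi for wi, vi in zip(w, v)]
    return w, v

def adam_step(w, g, m, s, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """Adam: per-parameter step from bias-corrected first and second moment estimates."""
    m = [b1 * mi + (1 - b1) * gi for mi, gi in zip(m, g)]
    s = [b2 * si + (1 - b2) * gi * gi for si, gi in zip(s, g)]
    m_hat = [mi / (1 - b1 ** t) for mi in m]
    s_hat = [si / (1 - b2 ** t) for si in s]
    w = [wi - lr * mh / (math.sqrt(sh) + eps)
         for wi, mh, sh in zip(w, m_hat, s_hat)]
    return w, m, s

def train(name, steps=200):
    """Run one optimizer from the same start point and return the final loss."""
    w, v = [1.0, 1.0], [0.0, 0.0]
    m, s = [0.0, 0.0], [0.0, 0.0]
    for t in range(1, steps + 1):  # t starts at 1 for Adam's bias correction
        g = grad(w)
        if name == "sgd":
            w = sgd_step(w, g, lr=0.05)
        elif name == "momentum":
            w, v = momentum_step(w, g, v, lr=0.02)
        elif name == "adam":
            w, m, s = adam_step(w, g, m, s, t, lr=0.05)
        else:
            raise ValueError(name)
    return loss(w)

for name in ("sgd", "momentum", "adam"):
    print(name, train(name))
```

Each optimizer is given its own learning rate here because the rules scale the step differently; swapping the rates between them is a quick way to see why one default does not fit all.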
1:10–1:20 · 10 min · Break
1:20–2:05 · 45 min · Learning rate and dynamics
  • The learning rate is usually the single most important hyperparameter to tune.
  • Too large diverges or oscillates; too small crawls.
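  • Schedules vary the rate over training: warmup ramps it up over the first steps, while step and cosine decay lower it for a cleaner finish.
  • Reading the loss curve helps diagnose the cause: spikes or divergence usually mean the rate is too large; a slow, nearly flat decline usually means it is too small.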
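The three schedule shapes named above can be written as plain functions of the step index. This is a sketch under assumptions: the function names, the drop interval, and the decay factor are illustrative choices, not values from the course notebook.

```python
import math

def step_decay(base_lr, step, drop_every=30, factor=0.1):
    """Multiply the rate by `factor` once every `drop_every` steps."""
    return base_lr * factor ** (step // drop_every)

def cosine_decay(base_lr, step, total_steps, min_lr=0.0):
    """Follow half a cosine wave from base_lr down to min_lr."""
    progress = min(step, total_steps) / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))

def warmup_then_cosine(base_lr, step, warmup_steps, total_steps):
    """Ramp up linearly for warmup_steps, then decay with a cosine."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return cosine_decay(base_lr, step - warmup_steps, total_steps - warmup_steps)

for step in (0, 5, 10, 50, 99):
    print(step,
          step_decay(0.1, step),
          cosine_decay(0.1, step, 100),
          warmup_then_cosine(0.1, step, 10, 100))
```

Note that warmup is the one shape that raises the rate: it starts small and reaches the base rate at the end of the warmup window, after which the decay takes over.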
2:05–2:35 · 30 min · Live demo (predict, then run): Compare SGD, momentum, and Adam on one model, then run a three-rate sweep and a step-decay schedule. Before the sweep, ask the class to rank the three curves (too small, good, too large), then reveal them.
2:35–2:50 · 15 min · Wrap-up & practice preview: Revisit the misconception and concept checks below, recap the takeaways, and preview the practice lesson.
2:50–3:00 · 10 min · Buffer & questions
Common misconception to confront.

Students often think: A smaller learning rate is always the safer choice.
Set it straight: A rate that is too small crawls or stalls in a poor region. The rate must be in the right range, and schedules help. There is no universally safe tiny value.
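A one-parameter toy problem is enough to show both failure modes side by side. The sketch below assumes a quadratic loss with a hypothetical curvature of 10 and three illustrative rates; it is not the model used in the live demo.

```python
def sweep(lr, steps=50, curvature=10.0, w0=1.0):
    """Gradient descent on f(w) = 0.5 * curvature * w**2; returns the loss per step."""
    w = w0
    losses = [0.5 * curvature * w * w]
    for _ in range(steps):
        w -= lr * curvature * w  # the gradient of f is curvature * w
        losses.append(0.5 * curvature * w * w)
    return losses

# Too small, reasonable, too large (all three values are assumptions).
curves = {lr: sweep(lr) for lr in (0.001, 0.05, 0.25)}
for lr, losses in curves.items():
    print(f"lr={lr}: start {losses[0]:.3g}, end {losses[-1]:.3g}")
```

On this loss each step multiplies w by (1 - lr * curvature), so any rate above 2 / curvature = 0.2 grows the error instead of shrinking it, while the smallest rate has removed only part of the loss after 50 steps.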

Check for understanding (pose during the concept blocks; let students answer before revealing).
Q: The loss spikes and diverges after a few steps. Is the learning rate too large or too small?
A: Too large: the updates overshoot. Lower it, or add warmup.
Q: What does momentum add over plain SGD?
A: A velocity accumulated from past gradients: it damps oscillation across ravines and accelerates along consistent directions.
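The acceleration part of that answer can be checked on a small ravine-shaped loss. The following is a sketch under stated assumptions: the loss, the rate 0.019, and the tolerance are hypothetical, and setting beta to 0 reduces the update to plain gradient descent.

```python
def steps_to_converge(lr, beta, tol=1e-6, max_steps=5000):
    """Momentum descent on f(w) = 0.5 * (w[0]**2 + 100 * w[1]**2).

    Returns the number of steps until f drops below tol
    (or max_steps if it never does).
    """
    w = [1.0, 1.0]
    v = [0.0, 0.0]
    for step in range(1, max_steps + 1):
        g = [w[0], 100.0 * w[1]]  # steep along w[1], shallow along w[0]
        v = [beta * vi + gi for vi, gi in zip(v, g)]
        w = [wi - lr * vi for wi, vi in zip(w, v)]
        if 0.5 * (w[0] ** 2 + 100.0 * w[1] ** 2) < tol:
            return step
    return max_steps

plain = steps_to_converge(lr=0.019, beta=0.0)
heavy = steps_to_converge(lr=0.019, beta=0.9)
print("plain:", plain, "momentum:", heavy)
```

Plain descent is held back by the shallow direction, where the steep direction's stability limit keeps each step small; the accumulated velocity lets momentum cover that direction in far fewer steps at the same rate.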
Key takeaways.

💻 Practice · 2 hours

In the practice lesson the instructor demonstrates implementations, runs code, and works through examples, using the practice notebook linked below. The weekly lab is then set as homework, where students apply this themselves.

0:00–0:10 · 10 min · Setup & recap: Recap the lecture's key ideas and open the working notebook.
0:10–1:00 · 50 min · Instructor demonstrations
  • Train the same model with SGD, momentum, and Adam, and compare the curves.
  • Sweep three learning rates live and read the resulting curves.
1:00–1:05 · 5 min · Break
1:05–1:45 · 40 min · Instructor demonstrations (continued)
  • Add a learning-rate schedule and show its effect.
1:45–2:00 · 15 min · Wrap-up & lab brief: Summarize the patterns shown and brief the weekly lab (homework), which students complete on their own.
Common pitfalls to pre-empt.

Open the practice notebook in Colab · Curated references · Lab (homework)
