Week 6 Part II · Training Infrastructure

Optimization

Gradient descent; SGD, momentum, and Adam; learning rates and optimization dynamics.
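
As a rough orientation before the readings, the update rules named above can be sketched in a few lines of plain Python. This is only a minimal illustration on a hypothetical one-dimensional loss f(w) = (w - 3)^2. The function names, learning rates, and step counts are assumptions chosen for the example and are not taken from any of the references. Because the gradient here is exact, the first function is plain gradient descent. SGD uses the same update with a noisy mini-batch estimate of the gradient.

```python
import math


def grad(w):
    # Gradient of the hypothetical loss f(w) = (w - 3)^2, minimized at w = 3.
    return 2.0 * (w - 3.0)


def gradient_descent(w, lr=0.1, steps=100):
    # Plain gradient descent: step against the gradient.
    for _ in range(steps):
        w -= lr * grad(w)
    return w


def momentum(w, lr=0.1, beta=0.9, steps=200):
    # Heavy-ball momentum: accumulate a velocity, then step along it.
    v = 0.0
    for _ in range(steps):
        v = beta * v - lr * grad(w)
        w += v
    return w


def adam(w, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    # Adam: bias-corrected running averages of the gradient and its square.
    m, v = 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w
```

Calling each function with a starting point such as `0.0` should return a value close to 3. Varying `lr` is a quick way to see slow convergence, oscillation, and divergence on the same loss.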

Curated, free, canonical references for this week: a course or lecture, a book chapter, a video, and an authoritative blog post or official tutorial.

Course

Stanford CS231n: Neural Networks Part 3 (Optimization) cs231n.github.io

Covers SGD, momentum/Nesterov, Adagrad/RMSProp/Adam, and learning-rate annealing.

Book

Dive into Deep Learning, 12.4 Stochastic Gradient Descent d2l.ai

SGD, dynamic learning-rate schedules, and convergence behavior.

Video

3Blue1Brown: Gradient descent, how neural networks learn youtube.com

Visual, intuition-first explanation of gradient descent and the negative-gradient update.

Blog / Docs

Sebastian Ruder: An overview of gradient descent optimization algorithms ruder.io

A widely cited survey comparing batch, stochastic, and mini-batch gradient descent, momentum, and Adagrad/RMSProp/Adam, with practical guidance on choosing an optimizer.
