Week 10 · Part III · Architectures & Representation Learning

Recurrent Networks (RNNs)

Instructor lesson plan: lecture (3 h) and practice (2 h).

Learning objectives

Build a plain RNN for sequence data.
Understand recurrence and backpropagation through time.
Observe the vanishing-gradient problem directly.

🎓 Lecture · 3 hours

0:00–0:10	10 min	Recap & retrieval	Open with two quick questions on last week's material (retrieval practice), then state this week's objectives.
0:10–0:25	15 min	Motivation	Sequences need memory: how recurrence shares parameters across time, and why it struggles.
0:25–1:10	45 min	Recurrence and BPTT	An RNN processes a sequence step by step, carrying a hidden state forward. The same weights are reused at every time step (weight sharing across time). Unrolling over time turns the recurrence into a deep feedforward graph. Backpropagation through time (BPTT) applies the chain rule across that unrolled graph.
1:10–1:20	10 min	Break
1:20–2:05	45 min	Vanishing and exploding gradients	Repeated multiplication by the recurrent weights shrinks or grows gradients exponentially. Vanishing gradients make long-range dependencies nearly unlearnable. Exploding gradients are tamed with gradient clipping. These problems motivate the gated units in the next lecture.
2:05–2:35	30 min	Live demo (predict, then run)	Ask the class to predict how the gradient reaching the first step changes as the sequence gets longer, then plot it. Train a plain RNN, plot the gradient reaching the first step versus sequence length, and apply clipping.
2:35–2:50	15 min	Wrap-up & practice preview	Revisit the misconception and concept checks below, recap the takeaways, and preview the practice lesson.
2:50–3:00	10 min	Buffer & questions
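
A minimal sketch of the recurrence and BPTT block from the schedule above, for instructors who want something small to project. It assumes a scalar hidden state, a tanh activation, and a squared-error loss on the final state only, so it needs nothing beyond the Python standard library. The names forward and bptt, the loss, and the numeric values are illustrative choices and are not taken from the course notebook.

```python
import math

def forward(xs, w_h, w_x, b, h0=0.0):
    """Run h_t = tanh(w_h * h_{t-1} + w_x * x_t + b) and return every hidden state."""
    hs = [h0]
    for x in xs:
        hs.append(math.tanh(w_h * hs[-1] + w_x * x + b))
    return hs  # hs[0] is the initial state, hs[t] is the state after step t

def bptt(xs, target, w_h, w_x, b):
    """Gradients of L = 0.5 * (h_T - target)^2 with respect to the shared weights."""
    hs = forward(xs, w_h, w_x, b)
    g_wh = g_wx = g_b = 0.0
    dh = hs[-1] - target               # dL/dh_T
    for t in range(len(xs), 0, -1):    # walk the unrolled graph backwards
        da = dh * (1.0 - hs[t] ** 2)   # back through tanh
        g_wh += da * hs[t - 1]         # same w_h at every step, so contributions add up
        g_wx += da * xs[t - 1]
        g_b += da
        dh = da * w_h                  # hand the gradient to h_{t-1}
    return g_wh, g_wx, g_b

# Illustrative values (assumptions, not course data).
xs = [0.5, -0.3, 0.8, 0.1]
grads = bptt(xs, target=0.2, w_h=0.7, w_x=0.4, b=0.1)
print("dL/dw_h, dL/dw_x, dL/db =", grads)
```

Because every step uses the same three weights, each weight's gradient is a sum over time steps. A quick finite-difference check on w_h is a useful way to convince the class that the backward loop is correct.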

Common misconception to confront.

Students often think: An RNN has a separate set of weights for each time step.
Set it straight: The same weights are reused at every step (weight sharing across time). Unrolling only makes the network look deep; it is still one shared cell applied repeatedly.
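
One way to make the correction above concrete is to show that the parameter count does not change with sequence length. The sketch below assumes a hidden size of 2, a scalar input, and hand-picked weights; the class name PlainRNNCell and the helper run are hypothetical and exist only for this illustration.

```python
import math

class PlainRNNCell:
    """A single tanh cell with hidden size 2, written with lists to avoid dependencies."""

    def __init__(self):
        # Fixed illustrative values; a real model would initialise these randomly.
        self.W_h = [[0.5, -0.2], [0.1, 0.4]]  # hidden-to-hidden weights
        self.W_x = [0.3, -0.6]                # input-to-hidden weights (scalar input)
        self.b = [0.0, 0.1]                   # bias

    def step(self, h, x):
        new_h = []
        for i in range(2):
            a = sum(self.W_h[i][j] * h[j] for j in range(2)) + self.W_x[i] * x + self.b[i]
            new_h.append(math.tanh(a))
        return new_h

    def num_parameters(self):
        return sum(len(row) for row in self.W_h) + len(self.W_x) + len(self.b)

def run(cell, xs):
    h = [0.0, 0.0]
    for x in xs:  # the same cell object, hence the same weights, at every step
        h = cell.step(h, x)
    return h

cell = PlainRNNCell()
short = run(cell, [1.0, 2.0, 3.0])
long_ = run(cell, [0.1] * 50)
print("parameters:", cell.num_parameters())  # unchanged by sequence length
```

Unrolling the three-step run by hand, cell.step(cell.step(cell.step(h0, 1.0), 2.0), 3.0), gives the same result as the loop, which is the point of the "one shared cell" picture.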

Check for understanding (pose during the concept blocks; let students answer before revealing).

Why do gradients vanish or explode in a plain RNN over long sequences?

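BPTT multiplies the gradient by the recurrent Jacobian (the recurrent weight matrix scaled by the activation derivative) once per step. If those factors have norm below 1, the gradient shrinks exponentially with sequence length; if they are above 1, it can grow exponentially.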
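
The answer above can be checked numerically with a scalar recurrence, where the "recurrent matrix" is a single number w. This is a sketch under simplifying assumptions: no inputs, a scalar state, and an optional identity activation so that the growing case is not masked by tanh saturation. The function name grad_wrt_first_state is hypothetical.

```python
import math

def grad_wrt_first_state(w, steps, h0=0.1, use_tanh=True):
    """Return d h_T / d h_0 for h_t = f(w * h_{t-1}), accumulated with the chain rule."""
    h, grad = h0, 1.0
    for _ in range(steps):
        a = w * h
        if use_tanh:
            h = math.tanh(a)
            local = (1.0 - h * h) * w  # d h_t / d h_{t-1} for tanh
        else:
            h = a                      # identity activation (linear recurrence)
            local = w
        grad *= local                  # one multiplication per time step
    return grad

for steps in (5, 20, 50):
    shrinking = grad_wrt_first_state(0.5, steps)
    growing = grad_wrt_first_state(1.5, steps, use_tanh=False)
    print(f"T={steps:2d}  w=0.5 (tanh): {shrinking:.3e}   w=1.5 (linear): {growing:.3e}")
```

With tanh, the per-step factor is at most |w| because the tanh derivative never exceeds 1, so |w| < 1 guarantees shrinkage in this scalar case.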

Does gradient clipping fix vanishing or exploding gradients?

Only exploding gradients: clipping caps the gradient norm. It cannot restore a gradient that has already vanished; that needs an architectural fix (gating), covered next week.
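
A minimal sketch of clipping by global norm, to accompany the answer above. It treats the gradient as a flat list of floats, which is an assumption made for brevity; the function name clip_by_global_norm and the threshold of 5.0 are illustrative.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Rescale grads so their L2 norm is at most max_norm.

    Returns the (possibly rescaled) gradients and the norm before clipping.
    Gradients whose norm is already within the limit are returned unchanged.
    """
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads), norm
    scale = max_norm / norm
    return [g * scale for g in grads], norm

exploding = [30.0, 40.0]   # norm 50, well above the threshold
vanishing = [1e-9, 2e-9]   # tiny gradient, far below the threshold

clipped, before = clip_by_global_norm(exploding, max_norm=5.0)
untouched, _ = clip_by_global_norm(vanishing, max_norm=5.0)
print("exploding:", before, "->", clipped)
print("vanishing:", untouched)
```

Clipping rescales the whole gradient, so its direction is preserved, and the tiny gradient passes through unchanged, which is why clipping does nothing for the vanishing case. In PyTorch the analogous utility is torch.nn.utils.clip_grad_norm_.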

Key takeaways.

RNNs share weights across time steps.
Long-range gradients vanish or explode.
Clip gradients to stabilize training.

💻 Practice · 2 hours

In the practice lesson the instructor demonstrates implementations, runs code, and works through examples, using the practice notebook linked below. The weekly lab is then set as homework, where students apply this themselves.

0:00–0:10	10 min	Setup & recap	Recap the lecture's key ideas and open the working notebook.
0:10–1:00	50 min	Instructor demonstrations	Build a plain RNN on a short sequence task and run it. Plot gradient norms across time steps to expose vanishing gradients.
1:00–1:05	5 min	Break
1:05–1:45	40 min	Instructor demonstrations (continued)	Demonstrate gradient clipping.
1:45–2:00	15 min	Wrap-up & lab brief	Summarize the patterns shown and brief the weekly lab (homework), which students complete on their own.

Common pitfalls to pre-empt.

Clip gradients to avoid explosion; start with short sequences.
Log the gradient norm at the earliest time steps to see vanishing.
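
The second pitfall above can be rehearsed without a framework. The sketch below logs the gradient magnitude at each time step for a scalar tanh RNN whose loss depends only on the final state. The function name gradient_per_step, the weights, the constant input, and the sequence length of 30 are all assumptions chosen to make the decay easy to see.

```python
import math

def gradient_per_step(xs, w_h=0.5, w_x=1.0, target=0.0):
    """Return |dL/dh_t| for t = 1..T, where L = 0.5 * (h_T - target)^2."""
    hs = [0.0]
    for x in xs:
        hs.append(math.tanh(w_h * hs[-1] + w_x * x))
    mags = [0.0] * len(xs)
    dh = hs[-1] - target                    # dL/dh_T
    for t in range(len(xs), 0, -1):
        mags[t - 1] = abs(dh)               # record the gradient reaching step t
        dh = dh * (1.0 - hs[t] ** 2) * w_h  # dL/dh_{t-1}
    return mags

xs = [0.5] * 30
mags = gradient_per_step(xs)
for t in (1, 10, 20, 30):
    print(f"step {t:2d}: |dL/dh_t| = {mags[t - 1]:.3e}")
```

The logged values fall by orders of magnitude toward the earliest steps, which is the signature to look for. In a framework-based notebook the same quantity would typically come from the gradients of the per-step hidden states.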

Open the practice notebook in Colab · Curated references · Lab (homework)