Introduction to Deep Learning · HIT

Week 10 · Part III · Architectures & Representation Learning

Recurrent Networks (RNNs)

Instructor lesson plan: lecture (3 h) and practice (2 h).

Learning objectives

🎓 Lecture · 3 hours

0:00–0:10 · 10 min · Recap & retrieval: Open with two quick questions on last week's material (retrieval practice), then state this week's objectives.
0:10–0:25 · 15 min · Motivation: Sequences need memory; how recurrence shares parameters across time, and why it struggles.
0:25–1:10 · 45 min · Recurrence and BPTT
  • An RNN processes a sequence step by step, carrying a hidden state forward.
  • The same weights are reused at every time step (weight sharing across time).
  • Unrolling over time turns the recurrence into a deep feedforward graph.
  • Backpropagation through time applies the chain rule across that unrolled graph.
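
The four points above can be sketched in NumPy. This is a minimal illustration, not the course's notebook code: the tanh cell, the toy loss L = 0.5 * ||h_T||^2, and all names (rnn_forward, bptt_grad_W_hh, W_xh, W_hh, b) are assumptions made for the example.

```python
import numpy as np

def rnn_forward(xs, h0, W_xh, W_hh, b):
    """Run a tanh RNN over xs (shape [T, input_dim]); return all hidden states."""
    hs = [h0]
    for x_t in xs:                      # one step per sequence element
        # The same W_xh, W_hh, b are used at every step (weight sharing).
        hs.append(np.tanh(W_xh @ x_t + W_hh @ hs[-1] + b))
    return hs                           # hs[t] is the state after t steps

def bptt_grad_W_hh(xs, h0, W_xh, W_hh, b):
    """Gradient of the toy loss L = 0.5 * ||h_T||^2 with respect to W_hh."""
    hs = rnn_forward(xs, h0, W_xh, W_hh, b)
    grad_W_hh = np.zeros_like(W_hh)
    dh = hs[-1].copy()                  # dL/dh_T for the toy loss
    for t in range(len(xs), 0, -1):     # walk the unrolled graph backwards
        da = dh * (1.0 - hs[t] ** 2)    # chain rule through tanh
        grad_W_hh += np.outer(da, hs[t - 1])  # shared weights: contributions add up
        dh = W_hh.T @ da                # gradient reaching the previous state
    return grad_W_hh

rng = np.random.default_rng(0)
T, input_dim, hidden_dim = 6, 3, 4      # illustrative sizes
xs = rng.normal(size=(T, input_dim))
h0 = np.zeros(hidden_dim)
W_xh = rng.normal(scale=0.5, size=(hidden_dim, input_dim))
W_hh = rng.normal(scale=0.5, size=(hidden_dim, hidden_dim))
b = np.zeros(hidden_dim)

grad = bptt_grad_W_hh(xs, h0, W_xh, W_hh, b)
print(grad.shape)  # (4, 4)
```

Because W_hh appears at every step of the unrolled graph, its gradient is the sum of one contribution per step. That is the only place where BPTT differs from backpropagation through an ordinary feedforward stack.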
1:10–1:20 · 10 min · Break
1:20–2:05 · 45 min · Vanishing and exploding gradients
  • Repeated multiplication by the recurrent weights shrinks or grows gradients exponentially.
  • Vanishing gradients make long-range dependencies nearly unlearnable.
  • Exploding gradients are tamed with gradient clipping.
  • These problems motivate the gated units in the next lecture.
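
The exponential shrinkage and growth described above can be seen in a deliberately simplified setting. The sketch below assumes a linear recurrence (the tanh derivative is dropped; since it is at most 1, it could only shrink the gradient further) and a scaled orthogonal recurrent matrix, so the gradient norm after T steps is exactly scale**T. The function name grad_norm_at_first_step and the chosen scales are illustrative.

```python
import numpy as np

def grad_norm_at_first_step(W_hh, seq_len):
    """Norm of a unit gradient after backpropagating through seq_len linear steps."""
    n = W_hh.shape[0]
    g = np.ones(n) / np.sqrt(n)         # unit-norm stand-in for dL/dh_T
    for _ in range(seq_len):
        g = W_hh.T @ g                  # one BPTT step: multiply by the recurrent matrix
    return np.linalg.norm(g)

rng = np.random.default_rng(0)
# Orthogonal matrix: every singular value is 1, so scaling it sets the growth rate.
Q, _ = np.linalg.qr(rng.normal(size=(8, 8)))

for scale in (0.9, 1.0, 1.1):
    norms = [grad_norm_at_first_step(scale * Q, T) for T in (10, 50, 100)]
    print(scale, ["%.2e" % n for n in norms])
```

A scale slightly below 1 drives the gradient toward zero and a scale slightly above 1 blows it up, with nothing changing except the sequence length.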
2:05–2:35 · 30 min · Live demo (predict, then run): Ask the class to predict how the gradient reaching the first step changes as the sequence gets longer. Then train a plain RNN, plot the norm of that gradient against sequence length, and apply clipping.
2:35–2:50 · 15 min · Wrap-up & practice preview: Revisit the misconception and concept checks below, recap the takeaways, and preview the practice lesson.
2:50–3:00 · 10 min · Buffer & questions
Common misconception to confront.

Students often think: An RNN has a separate set of weights for each time step.
Set it straight: The same weights are reused at every step (weight sharing across time). Unrolling only makes the network look deep; it is one shared cell applied repeatedly.

Check for understanding (pose during the concept blocks; let students answer before revealing).
Why do gradients vanish or explode in a plain RNN over long sequences?
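BPTT multiplies the gradient by the recurrent Jacobian (the recurrent weight matrix combined with the activation derivative) once per step. Repeated factors with norm below 1 shrink the gradient, and factors with norm above 1 grow it, exponentially in the sequence length.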
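
For instructors who want the argument on the board, it can be written as a product of Jacobians. This is a sketch assuming the standard tanh cell h_t = tanh(W_hh h_{t-1} + W_xh x_t + b), which the lesson plan does not fix; the norm is the spectral norm.

```latex
\[
\frac{\partial h_T}{\partial h_0}
  = J_T \, J_{T-1} \cdots J_1,
\qquad
J_t = \frac{\partial h_t}{\partial h_{t-1}}
    = \operatorname{diag}\!\bigl(1 - h_t^{2}\bigr)\, W_{hh},
\]
\[
\left\lVert \frac{\partial h_T}{\partial h_0} \right\rVert
  \le \prod_{t=1}^{T} \bigl\lVert \operatorname{diag}(1 - h_t^{2}) \bigr\rVert \,
      \lVert W_{hh} \rVert
  \le \lVert W_{hh} \rVert^{T}.
\]
```

If the norm of W_hh is below 1, the bound forces at least geometric decay. Growth requires a norm above 1, although that alone does not guarantee it, because saturated tanh units contribute factors well below 1.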
Does gradient clipping fix vanishing or exploding gradients?
Only exploding gradients: clipping caps the gradient norm, but it cannot enlarge a gradient that has already shrunk. Vanishing gradients need an architectural fix (gating), covered next week.
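
The asymmetry in this answer can be shown with a small NumPy sketch of clipping by global norm. The helper clip_by_global_norm and the threshold of 5.0 are assumptions for illustration; deep learning frameworks ship their own versions (for example torch.nn.utils.clip_grad_norm_ in PyTorch), which may differ in detail.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient arrays so their joint L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = min(1.0, max_norm / (total_norm + 1e-12))  # never scales gradients up
    return [g * scale for g in grads], total_norm

# Exploding case: the joint norm is far above the threshold, so it is rescaled.
exploding = [np.full((3, 3), 100.0), np.full(3, 100.0)]
clipped, norm_before = clip_by_global_norm(exploding, max_norm=5.0)
norm_after = np.sqrt(sum(np.sum(g ** 2) for g in clipped))
print(norm_before, norm_after)

# Vanishing case: the joint norm is already below the threshold, so nothing changes.
vanishing = [np.full((3, 3), 1e-8), np.full(3, 1e-8)]
unchanged, _ = clip_by_global_norm(vanishing, max_norm=5.0)
print(np.array_equal(unchanged[0], vanishing[0]))  # True
```

Clipping changes only the length of the gradient vector, not its direction, which is why it stabilizes training without addressing the missing long-range signal.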
Key takeaways.

💻 Practice · 2 hours

In the practice lesson the instructor demonstrates implementations, runs code, and works through examples, using the practice notebook linked below. The weekly lab is then set as homework, where students apply this themselves.

0:00–0:10 · 10 min · Setup & recap: Recap the lecture's key ideas and open the working notebook.
0:10–1:00 · 50 min · Instructor demonstrations
  • Build a plain RNN on a short sequence task and run it.
  • Plot gradient norms across time steps to expose vanishing gradients.
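
One possible shape for the second demonstration, written in plain NumPy rather than the notebook's framework (which this plan does not specify). The sketch assumes a tanh RNN whose recurrent matrix is rescaled to spectral norm 0.8 and a toy loss L = sum(h_T); it records the norm of dL/dh_t at every step, ready to be plotted against t.

```python
import numpy as np

rng = np.random.default_rng(0)
T, input_dim, hidden_dim = 30, 2, 16            # illustrative sizes
xs = rng.normal(size=(T, input_dim))
W_xh = rng.normal(scale=0.5, size=(hidden_dim, input_dim))
A = rng.normal(size=(hidden_dim, hidden_dim))
W_hh = 0.8 * A / np.linalg.norm(A, 2)           # spectral norm set to 0.8 (assumed)

# Forward pass, keeping every hidden state for the backward pass.
hs = [np.zeros(hidden_dim)]
for x_t in xs:
    hs.append(np.tanh(W_xh @ x_t + W_hh @ hs[-1]))

# Backward pass for a loss that depends only on the last state.
dh = np.ones(hidden_dim)                        # dL/dh_T for L = sum(h_T)
grad_norms = np.zeros(T + 1)
grad_norms[T] = np.linalg.norm(dh)
for t in range(T, 0, -1):
    dh = W_hh.T @ (dh * (1.0 - hs[t] ** 2))     # dL/dh_{t-1}
    grad_norms[t - 1] = np.linalg.norm(dh)

for t in (T, T - 5, T - 10, 0):
    print(f"step {t:2d}: ||dL/dh_t|| = {grad_norms[t]:.3e}")
```

Plotting grad_norms on a logarithmic axis should give a roughly straight line falling toward the early steps, which is the signature of exponential decay.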
1:00–1:05 · 5 min · Break
1:05–1:45 · 40 min · Instructor demonstrations (continued)
  • Demonstrate gradient clipping.
1:45–2:00 · 15 min · Wrap-up & lab brief: Summarize the patterns shown and brief the weekly lab (homework), which students complete on their own.
Common pitfalls to pre-empt.

Open the practice notebook in Colab · Curated references · Lab (homework)
