Week 11 · Part III · Architectures & Representation Learning

LSTMs, GRUs & Sequence Tasks

Instructor lesson plan: lecture (3 h) and practice (2 h).

Learning objectives

Build an LSTM or GRU and compare it to the plain RNN.
Understand how gates restore gradient flow.
Apply gated networks to a sequence task.

🎓 Lecture · 3 hours

0:00–0:10	10 min	Recap & retrieval: Open with two quick questions on last week's material (retrieval practice), then state this week's objectives.
0:10–0:25	15 min	Motivation: Gates are a learned mechanism to keep or forget information across long sequences.
0:25–1:10	45 min	LSTM and GRU: An LSTM adds a cell state plus input, forget, and output gates that control information flow. The cell state is updated additively, which gives gradients a near-linear path across long sequences. A GRU merges the cell and hidden states into one and uses two gates (update and reset), making it a lighter unit that often performs comparably. Because the gates are learned, the network decides what to keep and what to forget.
1:10–1:20	10 min	Break
1:20–2:05	45 min	Sequence tasks: Sequence classification produces one label for a whole sequence (many-to-one). Sequence-to-sequence uses an encoder and a decoder so that the output length can differ from the input length. Teacher forcing feeds the true previous token, rather than the model's own prediction, to the decoder during training. Match the architecture (many-to-one, many-to-many) to the task.
2:05–2:35	30 min	Live demo (predict, then run): Ask the class to predict whether the LSTM or the plain RNN holds the long-range signal better, then run the comparison. Swap an RNN for an LSTM, inspect the gates and cell state, and compare long- versus short-sequence gradients.
2:35–2:50	15 min	Wrap-up & practice preview: Revisit the misconception and concept checks below, recap the takeaways, and preview the practice lesson.
2:50–3:00	10 min	Buffer & questions
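
Instructor note: the teacher-forcing idea from the sequence-tasks block can be shown with a toy decoder loop. This is a minimal sketch in plain Python, not the practice notebook's code; `decode`, `stuck_decoder`, and the `<s>` start token are hypothetical names chosen for illustration. The decoder here is deliberately bad (it always predicts the same token) so the difference in what it is fed becomes visible.

```python
def decode(step_fn, state, targets, start_token, teacher_forcing):
    """Run a decoder for len(targets) steps and record what it was fed.

    step_fn(prev_token, state) -> (predicted_token, new_state) is a
    hypothetical single decoder step (an assumption for this sketch).
    """
    prev = start_token
    inputs, predictions = [], []
    for target in targets:
        inputs.append(prev)
        pred, state = step_fn(prev, state)
        predictions.append(pred)
        # Teacher forcing: the next input is the true token, not the prediction.
        prev = target if teacher_forcing else pred
    return inputs, predictions


def stuck_decoder(prev_token, state):
    """Toy decoder that ignores its input and always predicts 'x'."""
    return "x", state


targets = ["a", "b", "c"]
tf_inputs, _ = decode(stuck_decoder, None, targets, "<s>", teacher_forcing=True)
free_inputs, _ = decode(stuck_decoder, None, targets, "<s>", teacher_forcing=False)
print(tf_inputs)    # ['<s>', 'a', 'b']
print(free_inputs)  # ['<s>', 'x', 'x']
```

With teacher forcing the decoder always sees the correct history, so an early mistake does not contaminate later inputs during training. Without it, the decoder is fed its own (here wrong) predictions.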

Common misconception to confront.

Students often think: LSTMs beat plain RNNs because they are bigger and have more parameters.
Set it straight: It is the cell state’s near-linear, gated path, not the parameter count, that preserves gradients across long sequences; the gates learn what to keep and forget.
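
One way to make this concrete on the board: compare the gradient along the recurrent path of a scalar tanh RNN with the gradient along the direct cell-state path of an LSTM. The sketch below is a simplification under stated assumptions: both models are scalar, the RNN receives zero input, the forget gate is held constant, and the indirect gradient paths through the gates are ignored. The function names and the values 0.9 and 0.99 are illustrative, not taken from the course materials.

```python
import math


def rnn_path_gradient(w_h, steps, h0=0.5):
    """d h_T / d h_0 for a scalar RNN h_t = tanh(w_h * h_{t-1}) with zero input."""
    grad, h = 1.0, h0
    for _ in range(steps):
        h = math.tanh(w_h * h)
        grad *= w_h * (1.0 - h * h)  # chain rule through tanh and the weight
    return grad


def lstm_cell_path_gradient(forget_gate, steps):
    """d c_T / d c_0 along the direct path c_t = f * c_{t-1} + ..., constant f."""
    return forget_gate ** steps


for steps in (5, 50, 100):
    print(steps, rnn_path_gradient(0.9, steps), lstm_cell_path_gradient(0.99, steps))
```

Each function has a single parameter, so the difference cannot come from model size. The RNN gradient is a product of factors that are each below 0.9 and shrinks quickly, while a forget gate near 1 keeps the cell-path gradient at a usable magnitude over the same number of steps.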

Check for understanding (pose during the concept blocks; let students answer before revealing).

What is the role of the cell state versus the hidden state in an LSTM?

The cell state is a protected, near-linear memory carried across steps (a gradient highway); the hidden state is the gated output exposed to the next step or layer.
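
If students want to see the two states side by side, a single LSTM step with scalar input and scalar state is enough. This is a minimal sketch using the standard LSTM update equations; the parameter layout (`p` mapping each of `i`, `f`, `o`, `g` to a `(w_x, w_h, b)` triple) and the example values are assumptions made for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, p):
    """One scalar LSTM step. Returns (h, c, gate values)."""
    def pre(name):
        w_x, w_h, b = p[name]
        return w_x * x + w_h * h_prev + b

    i = sigmoid(pre("i"))    # input gate: how much new content to write
    f = sigmoid(pre("f"))    # forget gate: how much old memory to keep
    o = sigmoid(pre("o"))    # output gate: how much memory to expose
    g = math.tanh(pre("g"))  # candidate content

    c = f * c_prev + i * g   # cell state: additive, gated memory update
    h = o * math.tanh(c)     # hidden state: gated view of the cell state
    return h, c, {"i": i, "f": f, "o": o}


# Illustrative parameters: forget gate almost open, input gate almost closed.
params = {
    "i": (0.0, 0.0, -10.0),
    "f": (0.0, 0.0, 10.0),
    "o": (0.0, 0.0, 0.0),
    "g": (1.0, 0.0, 0.0),
}
h, c, gates = lstm_step(x=1.0, h_prev=0.0, c_prev=2.0, p=params)
print(gates, c, h)
```

With these values the cell state stays very close to its previous value of 2.0 even though a new input arrives, while the hidden state is only the part of that memory the output gate lets through.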

Sentiment classification of a whole sentence: many-to-one or many-to-many?

Many-to-one: a whole sequence in, a single label out.
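
The distinction can also be shown in a few lines: the same recurrent pass yields either its final state (many-to-one) or one state per input step (aligned many-to-many). This is a toy sketch with a scalar tanh RNN and hypothetical weights; it does not cover the encoder-decoder case, where the output length differs from the input length.

```python
import math


def run_rnn(xs, w_x=0.5, w_h=0.8, h0=0.0):
    """Scalar tanh RNN; returns the hidden state after every input step."""
    states, h = [], h0
    for x in xs:
        h = math.tanh(w_x * x + w_h * h)
        states.append(h)
    return states


def many_to_one(xs):
    """Whole sequence in, single value out (e.g. the input to a classifier)."""
    return run_rnn(xs)[-1]


def many_to_many(xs):
    """One output per input step (e.g. per-token tagging)."""
    return run_rnn(xs)


xs = [1.0, -0.5, 2.0]
print(many_to_one(xs))
print(many_to_many(xs))
```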

Key takeaways.

Gates preserve the gradient signal across long sequences.
A GRU is lighter than an LSTM: two gates and a single state instead of three gates and a separate cell state.
Match the architecture to the sequence task.
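
To put a number on "lighter", a rough parameter count for one recurrent layer helps. The sketch below assumes each weight block has input weights, recurrent weights, and one bias vector; a plain RNN has one such block, a GRU three (update gate, reset gate, candidate), and an LSTM four (input, forget, and output gates, plus candidate). The function name and the sizes 32 and 64 are illustrative, and some frameworks keep two bias vectors per block, so their reported counts can be slightly higher.

```python
def recurrent_params(input_size, hidden_size, n_blocks):
    """Parameter count for one recurrent layer with n_blocks weight blocks."""
    per_block = hidden_size * (input_size + hidden_size) + hidden_size
    return n_blocks * per_block


BLOCKS = {"RNN": 1, "GRU": 3, "LSTM": 4}
counts = {name: recurrent_params(32, 64, n) for name, n in BLOCKS.items()}
print(counts)  # {'RNN': 6208, 'GRU': 18624, 'LSTM': 24832}
```

At equal hidden size the GRU has three quarters of the LSTM's parameters. Note that both gated units are larger than the plain RNN, which is why the misconception above is tempting; the gradient argument, not this count, explains the long-sequence behavior.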

💻 Practice · 2 hours

In the practice lesson the instructor demonstrates implementations, runs code, and works through examples, using the practice notebook linked below. The weekly lab is then set as homework, where students apply this themselves.

0:00–0:10	10 min	Setup & recap: Recap the lecture's key ideas and open the working notebook.
0:10–1:00	50 min	Instructor demonstrations: Swap the RNN for an LSTM or GRU on the same task and compare. Walk through the gates and the cell state on the board and in code.
1:00–1:05	5 min	Break
1:05–1:45	40 min	Instructor demonstrations (continued): Show behavior on long versus short sequences.
1:45–2:00	15 min	Wrap-up & lab brief: Summarize the patterns shown and brief the weekly lab (homework), which students complete on their own.

Common pitfalls to pre-empt.

Keep the task identical to Week 10 for a fair comparison.
A GRU is lighter than an LSTM, but do not assume it matches the LSTM everywhere; check its accuracy on long sequences.

Open the practice notebook in Colab · Curated references · Lab (homework)