Week 4 Part II · Training Infrastructure

Data Pipelines

Instructor lesson plan: lecture (3 h) and practice (2 h).

Learning objectives

Build a custom Dataset and DataLoader.
Reason about batching, shuffling, and clean splits.
Recognize and avoid data leakage.

🎓 Lecture · 3 hours

0:00–0:10	10 min	Recap & retrieval: Open with two quick questions on last week's material (retrieval practice), then state this week's objectives.
0:10–0:25	15 min	Motivation: Models are only as good as the data pipeline; leakage silently inflates results.
0:25–1:10	45 min	Dataset and DataLoader: A Dataset implements __len__ and __getitem__ and returns one (input, label) pair per index. A DataLoader batches, shuffles, and can parallelize loading with worker processes. Transforms preprocess each sample (convert to tensor, normalize, augment) as it is read. Iterating the DataLoader yields batches shaped (batch, ...).
1:10–1:20	10 min	Break
1:20–2:05	45 min	Splits and leakage: Split the data into train, validation, and test sets; tune on validation and report once on test. Fit normalization statistics on the training split only. Leakage is any test or future information reaching training; it silently inflates results. Batch size is a trade-off: larger batches give less noisy gradient estimates and usually higher throughput, but need more memory.
2:05–2:35	30 min	Live demo (predict, then run): Write a custom Dataset and DataLoader and iterate over batches. Then, before running the leaked-normalization version, ask the class to predict whether its accuracy will be higher or lower than the honest one. Show the inflated number, then fix the leak.
2:35–2:50	15 min	Wrap-up & practice preview: Revisit the misconception and concept checks below, recap the takeaways, and preview the practice lesson.
2:50–3:00	10 min	Buffer & questions
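
The Dataset and DataLoader block in the schedule can be previewed with a minimal sketch in plain Python. It is an illustration of the protocol, not a framework implementation: `PairDataset` and `iterate_batches` are hypothetical names, the data is made up, and the loader stand-in has no worker processes. In PyTorch, for example, the corresponding classes are `torch.utils.data.Dataset` and `torch.utils.data.DataLoader`.

```python
import random


class PairDataset:
    """Hypothetical Dataset: returns one (input, label) pair per index."""

    def __init__(self, inputs, labels, transform=None):
        assert len(inputs) == len(labels)
        self.inputs = inputs
        self.labels = labels
        self.transform = transform  # applied per sample, as it is read

    def __len__(self):
        return len(self.inputs)

    def __getitem__(self, idx):
        x = self.inputs[idx]
        if self.transform is not None:
            x = self.transform(x)
        return x, self.labels[idx]


def iterate_batches(dataset, batch_size, shuffle, seed=0):
    """Simplified stand-in for a DataLoader (single process, list batches)."""
    indices = list(range(len(dataset)))
    if shuffle:
        random.Random(seed).shuffle(indices)
    for start in range(0, len(indices), batch_size):
        samples = [dataset[i] for i in indices[start:start + batch_size]]
        xs = [x for x, _ in samples]
        ys = [y for _, y in samples]
        yield xs, ys


# Toy data: 10 samples with 2 features each, alternating labels.
dataset = PairDataset(
    inputs=[[float(i), float(i) + 0.5] for i in range(10)],
    labels=[i % 2 for i in range(10)],
    transform=lambda x: [v / 10.0 for v in x],  # toy per-sample scaling
)

batches = list(iterate_batches(dataset, batch_size=4, shuffle=True, seed=0))
batch_sizes = [len(xs) for xs, _ in batches]
print(batch_sizes)  # 10 samples with batch_size=4 -> [4, 4, 2]
```

The last batch is smaller because 10 is not a multiple of 4; whether to keep or drop such a partial batch is a design choice worth raising in class.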

Common misconception to confront.

Students often think: Normalizing the whole dataset before splitting is harmless; it is just scaling.
Set it straight: Fitting normalization statistics on all data leaks test information into training and silently inflates results; fit on the training split only, then apply the same statistics to validation and test.
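
A minimal sketch of the difference, using only Python's `statistics` module. The feature values and the helper names `fit_scaler` and `apply_scaler` are made up for illustration. The sketch shows only how test values enter the leaky statistics; it does not train a model or measure the inflated metric.

```python
import statistics

# Toy 1-D feature; values are invented for illustration.
train = [1.0, 2.0, 3.0, 4.0]
val = [5.0, 6.0]
test = [20.0, 22.0]  # the test values happen to sit far from the training ones


def fit_scaler(values):
    """Return (mean, std) estimated from the given values only."""
    return statistics.fmean(values), statistics.pstdev(values)


def apply_scaler(values, mean, std):
    """Standardize values with statistics that were fitted elsewhere."""
    return [(v - mean) / std for v in values]


# Leaky: the statistics are computed over every split, test included.
leaky_mean, leaky_std = fit_scaler(train + val + test)

# Honest: the statistics come from the training split only...
mean, std = fit_scaler(train)
# ...and are then applied unchanged to every split.
train_n = apply_scaler(train, mean, std)
val_n = apply_scaler(val, mean, std)
test_n = apply_scaler(test, mean, std)

print(f"leaky mean={leaky_mean:.3f}, honest mean={mean:.3f}")
```

The leaky mean is pulled toward the test values, which is exactly the information the model should not have had access to before evaluation.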

Check for understanding (pose during the concept blocks; let students answer before revealing).

You scale features with the mean and std of the entire dataset, then split. What is wrong?

Test-set statistics leak into the scaler, so the model has indirectly seen test information; compute the mean and std on the training split only.

Should you shuffle the validation set each epoch? The training set?

Training: yes, to decorrelate minibatches. Validation and test: no need; the order does not change the metric.
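
The claim that order does not change the validation metric can be checked with a small sketch. The labels, predictions, and the `accuracy` helper are made up for illustration; the point is only that a count accumulated over all samples is the same for any ordering.

```python
import random

# Invented labels and predictions for a small validation set.
labels = [0, 1, 1, 0, 1, 0, 1, 1]
preds = [0, 1, 0, 0, 1, 1, 1, 1]


def accuracy(order, batch_size=3):
    """Accumulate correct counts batch by batch, then divide once."""
    correct = 0
    for start in range(0, len(order), batch_size):
        batch = order[start:start + batch_size]
        correct += sum(preds[i] == labels[i] for i in batch)
    return correct / len(order)


fixed_order = list(range(len(labels)))
shuffled_order = fixed_order[:]
random.Random(0).shuffle(shuffled_order)

acc_fixed = accuracy(fixed_order)
acc_shuffled = accuracy(shuffled_order)
print(acc_fixed == acc_shuffled)  # True: the total count ignores order
```

One caveat worth mentioning: averaging per-batch accuracies instead of accumulating counts would give a smaller final batch the same weight as a full one, and that average can depend on the order.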

Key takeaways.

Fit preprocessing on the training split only.
Shuffle training data, not validation or test.
Batch size trades gradient noise against speed and memory.
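
The gradient-noise side of the batch-size takeaway can be illustrated with a small simulation. This is a sketch under strong assumptions: the "per-sample gradients" are just random scalars drawn around an arbitrary true value, `minibatch_noise` is a hypothetical helper, and no model is involved. Under these assumptions the spread of the minibatch mean shrinks roughly as one over the square root of the batch size.

```python
import random
import statistics

rng = random.Random(0)
# Stand-in for per-sample gradients: noisy scalars around a true value of 1.0.
per_sample_grads = [rng.gauss(1.0, 2.0) for _ in range(4096)]


def minibatch_noise(batch_size, trials=500):
    """Std of the minibatch-mean estimate across random minibatches."""
    estimates = [
        statistics.fmean(rng.sample(per_sample_grads, batch_size))
        for _ in range(trials)
    ]
    return statistics.stdev(estimates)


noise_small = minibatch_noise(batch_size=4)
noise_large = minibatch_noise(batch_size=64)
print(noise_small > noise_large)  # larger batches give a steadier estimate
```

The simulation says nothing about speed or memory, which depend on the hardware and the model.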

💻 Practice · 2 hours

In the practice lesson the instructor demonstrates implementations, runs code, and works through examples, using the practice notebook linked below. The weekly lab is then set as homework, where students apply this themselves.

0:00–0:10	10 min	Setup & recap: Recap the lecture's key ideas and open the working notebook.
0:10–1:00	50 min	Instructor demonstrations: Write a custom Dataset and DataLoader live and iterate over batches. Show how batch size changes the number of batches per epoch and how shuffling changes their composition from one epoch to the next.
1:00–1:05	5 min	Break
1:05–1:45	40 min	Instructor demonstrations (continued): Introduce a data leak (normalizing on the full dataset), show the inflated metric, then fix it.
1:45–2:00	15 min	Wrap-up & lab brief: Summarize the patterns shown and brief the weekly lab (homework), which students complete on their own.

Common pitfalls to pre-empt.

Fit normalization on the training split only.
Use shuffle=True for training and shuffle=False for validation and test.
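
Both reminders presuppose splits that do not overlap. A minimal sketch of a seeded three-way split follows; `split_indices`, its fractions, and the seed are hypothetical choices for illustration. A purely random split like this one is not appropriate for time-ordered or grouped data, where it can itself leak future or related samples into training.

```python
import random


def split_indices(n, val_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle indices once, then cut them into disjoint train/val/test parts."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    test = indices[:n_test]
    val = indices[n_test:n_test + n_val]
    train = indices[n_test + n_val:]
    return train, val, test


train_idx, val_idx, test_idx = split_indices(100)

# Disjoint: no sample index appears in two splits.
overlap = (
    set(train_idx) & set(val_idx)
    | set(train_idx) & set(test_idx)
    | set(val_idx) & set(test_idx)
)
print(len(train_idx), len(val_idx), len(test_idx), len(overlap))  # 80 10 10 0
```

Fixing the seed keeps the split identical across runs, so the test set stays the same set of samples every time it is reported on.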

Open the practice notebook in Colab · Curated references · Lab (homework)