Introduction to Deep Learning · HIT

Week 4 · Part II · Training Infrastructure

Data Pipelines

Instructor lesson plan: lecture (3 h) and practice (2 h).

Learning objectives

🎓 Lecture · 3 hours

0:00–0:10 · 10 min · Recap & retrieval: Open with two quick questions on last week's material (retrieval practice), then state this week's objectives.
0:10–0:25 · 15 min · Motivation: A model is only as good as its data pipeline; leakage silently inflates results.
0:25–1:10 · 45 min · Dataset and DataLoader
  • A Dataset implements __len__ (the number of samples) and __getitem__ (one (input, label) pair for a given index).
  • A DataLoader batches, shuffles, and can parallelize loading with worker processes.
  • Transforms preprocess each sample (to tensor, normalize, augment) as it is read.
  • Iterating the DataLoader yields batches shaped (batch, ...).
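
The four points above can be sketched in a few lines for the live demo. This is a minimal illustration that assumes PyTorch is installed; ToyDataset, its 10×3 toy tensor, and the scaling lambda are hypothetical names and data, not taken from the course notebook.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class ToyDataset(Dataset):
    """Hypothetical in-memory dataset: 10 samples with 3 features each."""

    def __init__(self, transform=None):
        self.inputs = torch.arange(30, dtype=torch.float32).reshape(10, 3)
        self.labels = torch.arange(10) % 2  # alternating 0/1 labels
        self.transform = transform

    def __len__(self):
        return len(self.inputs)

    def __getitem__(self, idx):
        x = self.inputs[idx]
        if self.transform is not None:
            x = self.transform(x)  # applied per sample, as it is read
        return x, self.labels[idx]


# Illustrative transform: scale the raw values (0..29) into [0, 1].
dataset = ToyDataset(transform=lambda x: x / 29.0)

# num_workers=0 loads in the main process; a positive value uses worker processes.
loader = DataLoader(dataset, batch_size=4, shuffle=True, num_workers=0)

batch_shapes = []
for inputs, labels in loader:
    batch_shapes.append((tuple(inputs.shape), tuple(labels.shape)))
    print(inputs.shape, labels.shape)
```

With 10 samples and a batch size of 4, the last batch holds only 2 samples; passing drop_last=True to the DataLoader would discard it instead.
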
1:10–1:20 · 10 min · Break
1:20–2:05 · 45 min · Splits and leakage
  • Split into train, validation, and test; tune on validation, report once on test.
  • Fit normalization statistics on the training split only.
  • Leakage is any test or future information reaching training; it silently inflates results.
  • Batch size is a trade-off: larger batches give less noisy gradient estimates and usually better throughput, but use more memory.
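
The split-then-fit order can be shown with a framework-free sketch. It assumes a single synthetic feature and a 70/15/15 split; both are illustrative choices, not values from the course materials.

```python
import random
import statistics

random.seed(0)
# Hypothetical 1-D feature with 100 samples.
values = [random.gauss(50.0, 10.0) for _ in range(100)]

# Shuffle the indices once, then split 70 / 15 / 15.
indices = list(range(len(values)))
random.shuffle(indices)
train_idx, val_idx, test_idx = indices[:70], indices[70:85], indices[85:]

train = [values[i] for i in train_idx]
val = [values[i] for i in val_idx]
test = [values[i] for i in test_idx]

# Fit the normalization statistics on the training split only.
mean = statistics.fmean(train)
std = statistics.pstdev(train)


def normalize(xs):
    """Apply the training-split statistics to any split."""
    return [(x - mean) / std for x in xs]


train_n, val_n, test_n = normalize(train), normalize(val), normalize(test)

# The leaky alternative would compute the statistics from `values` (all splits).
leaky_mean = statistics.fmean(values)
print(f"train-only mean {mean:.3f} vs full-data mean {leaky_mean:.3f}")
```

Only the training split ends up with mean 0 and standard deviation 1 exactly; validation and test will be slightly off, which is expected and honest.
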
2:05–2:35 · 30 min · Live demo (predict, then run): Write a custom Dataset and DataLoader and iterate over batches. Before running the leaked-normalization experiment, ask the class to predict whether its accuracy will be higher or lower than the honest one; then run it, show the inflated number, and fix the leak.
2:35–2:50 · 15 min · Wrap-up & practice preview: Revisit the misconception and concept checks below, recap the takeaways, and preview the practice lesson.
2:50–3:00 · 10 min · Buffer & questions

Common misconception to confront.

Students often think: Normalizing the whole dataset before splitting is harmless; it is just scaling.
Set it straight: Fitting normalization statistics on all the data leaks validation and test information into training and silently inflates results. Fit on the training split only, then apply the same statistics to validation and test.

Check for understanding (pose during the concept blocks; let students answer before revealing).
Q: You scale features with the mean and std of the entire dataset, then split. What is wrong?
A: Validation and test statistics leaked into the scaler, so the model has indirectly seen held-out information. Compute the mean and std on the training split only.
Q: Should you shuffle the training set each epoch? The validation set?
A: Training: yes, to decorrelate minibatches. Validation and test: no need, because order does not change the metric.
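
If students doubt the second answer, a quick check helps. The sketch below uses made-up labels and predictions (roughly 80% correct by construction) and compares accuracy before and after shuffling the evaluation order.

```python
import random

random.seed(0)
# Hypothetical labels and predictions for a 200-sample validation set.
labels = [random.randint(0, 1) for _ in range(200)]
preds = [y if random.random() < 0.8 else 1 - y for y in labels]


def accuracy(pairs):
    """Fraction of (prediction, label) pairs that agree."""
    pairs = list(pairs)
    return sum(p == y for p, y in pairs) / len(pairs)


ordered = list(zip(preds, labels))
shuffled = ordered[:]
random.shuffle(shuffled)  # same pairs, different order

print(accuracy(ordered), accuracy(shuffled))
```

Both calls count the same correct pairs over the same total, so the two printed values are identical.
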
Key takeaways.

💻 Practice · 2 hours

In the practice lesson the instructor demonstrates implementations, runs code, and works through examples, using the practice notebook linked below. The weekly lab is then set as homework, where students apply this themselves.

0:00–0:10 · 10 min · Setup & recap: Recap the lecture's key ideas and open the working notebook.
0:10–1:00 · 50 min · Instructor demonstrations
  • Write a custom Dataset and DataLoader live and iterate over batches.
  • Show how batch size changes the number of batches per epoch and how shuffling changes the sample order from one epoch to the next.
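
The effect of batch size and shuffling on an epoch can be previewed without any framework. The helper epoch_batches below is a hypothetical stand-in for what a DataLoader does with its batch_size and shuffle arguments; the dataset size of 10 is arbitrary.

```python
import math
import random

num_samples = 10  # hypothetical dataset size


def epoch_batches(batch_size, shuffle, rng):
    """Return the list of index batches for one epoch."""
    order = list(range(num_samples))
    if shuffle:
        rng.shuffle(order)
    return [order[i:i + batch_size] for i in range(0, num_samples, batch_size)]


rng = random.Random(0)

# Batches per epoch is ceil(num_samples / batch_size).
for batch_size in (2, 4):
    batches = epoch_batches(batch_size, shuffle=False, rng=rng)
    print(batch_size, len(batches), math.ceil(num_samples / batch_size))

# With shuffling, each epoch draws a fresh permutation of the same indices.
epoch_1 = epoch_batches(4, shuffle=True, rng=rng)
epoch_2 = epoch_batches(4, shuffle=True, rng=rng)
print(epoch_1)
print(epoch_2)
```

Every index still appears exactly once per epoch; only the grouping into batches changes.
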
1:00–1:05 · 5 min · Break
1:05–1:45 · 40 min · Instructor demonstrations (continued)
  • Introduce a data leak (normalizing on the full dataset), show the inflated metric, then fix it.
1:45–2:00 · 15 min · Wrap-up & lab brief: Summarize the patterns shown and brief the weekly lab (homework), which students complete on their own.

Common pitfalls to pre-empt.

Open the practice notebook in Colab · Curated references · Lab (homework)
