Introduction to Deep Learning · HIT

Week 4 · Part II · Training Infrastructure

Data Pipelines

The Dataset and DataLoader abstractions; batching, shuffling, transforms, and splits.

Learning goals

This is the weekly homework lab, to be completed independently after the lecture and the practice lesson. It follows the course's Build / Predict & probe / Explain & defend model: use an AI assistant freely for Build; the graded learning is in Predict & probe and Explain & defend. See the AI-use policy and a fully worked sample submission.

Exercise

Part A · Build (AI assistant welcome)

  1. Implement a custom Dataset and wrap it in a DataLoader with transforms, batching, and shuffling.
  2. Create a clean train, validation, and test split with a fixed seed.
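The two Build steps above can be sketched as follows. This is a minimal illustration, not a reference solution: `ToyDataset`, its random data, the 800/100/100 split sizes, the seeds, and the batch size are all hypothetical choices made for the example. It assumes PyTorch is installed and uses `torch.utils.data.Dataset`, `DataLoader`, and `random_split`.

```python
import torch
from torch.utils.data import Dataset, DataLoader, random_split


class ToyDataset(Dataset):
    """Hypothetical in-memory dataset: random feature rows with binary labels."""

    def __init__(self, n_samples=1000, n_features=8, transform=None, seed=0):
        gen = torch.Generator().manual_seed(seed)
        self.x = torch.randn(n_samples, n_features, generator=gen)
        self.y = (self.x.sum(dim=1) > 0).long()
        self.transform = transform  # optional callable applied to each sample

    def __len__(self):
        return len(self.x)

    def __getitem__(self, idx):
        x = self.x[idx]
        if self.transform is not None:
            x = self.transform(x)
        return x, self.y[idx]


dataset = ToyDataset()

# Fixed-seed split so the same rows land in the same split on every run.
split_gen = torch.Generator().manual_seed(42)
train_set, val_set, test_set = random_split(
    dataset, [800, 100, 100], generator=split_gen
)

# Fit normalization statistics on the training rows only, then apply
# the same transform to every split through the shared base dataset.
train_x = dataset.x[train_set.indices]
mean, std = train_x.mean(dim=0), train_x.std(dim=0)


def normalize(x):
    return (x - mean) / std


dataset.transform = normalize

# Shuffle only the training loader; keep evaluation order fixed.
train_loader = DataLoader(
    train_set, batch_size=32, shuffle=True,
    generator=torch.Generator().manual_seed(0),
)
val_loader = DataLoader(val_set, batch_size=32, shuffle=False)
test_loader = DataLoader(test_set, batch_size=32, shuffle=False)

xb, yb = next(iter(train_loader))
print(xb.shape, yb.shape, len(train_loader), len(val_loader))
```

Because the three subsets returned by `random_split` share one underlying dataset, setting `dataset.transform` once applies the training-split statistics to all of them. A real project may prefer separate dataset objects per split, for example when only the training split should be augmented.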

Part B · Predict & probe (student reasoning)

  1. Predict the effect of batch size on loss-curve smoothness and steps per epoch.
  2. Predict what happens to validation accuracy if normalization statistics are computed on the full dataset instead of the training split only.
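For the batch-size prediction, a small arithmetic sketch may help check intuition before running anything. The dataset size of 50,000 and the batch sizes are hypothetical. The "relative noise" column assumes samples are drawn independently, in which case the standard deviation of a batch mean scales as 1/sqrt(batch size); actual loss curves only follow this roughly.

```python
import math


def steps_per_epoch(n_samples: int, batch_size: int, drop_last: bool = False) -> int:
    """Number of optimizer steps needed to pass over the data once."""
    if drop_last:
        return n_samples // batch_size        # incomplete final batch is skipped
    return math.ceil(n_samples / batch_size)  # incomplete final batch is kept


N_SAMPLES = 50_000  # hypothetical training-set size

for batch_size in (16, 64, 256):
    steps = steps_per_epoch(N_SAMPLES, batch_size)
    # Noise of a batch mean relative to a single sample, assuming i.i.d. draws.
    rel_noise = 1 / math.sqrt(batch_size)
    print(f"batch={batch_size:>3}  steps/epoch={steps:>4}  relative noise={rel_noise:.3f}")
```

Larger batches give fewer, smoother steps per epoch; smaller batches give more, noisier steps. Whether that helps or hurts final accuracy is something to probe empirically rather than read off this table.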

Part C · Explain & defend (in plain language)

  1. Introduce a deliberate data leak, observe the inflated metric, and explain the mechanism and the fix.

Deliverables

Hints.

Self-check

Answer each before expanding it. If one is unclear, revisit the lab and the references.

Which two methods must a custom Dataset implement?
__len__ and __getitem__.
Why fit normalization statistics on the training split only?
Statistics that include validation or test samples leak information about held-out data into training and inflate metrics.
Should the validation set be shuffled?
No. Shuffle the training set only.
What does batch size trade off?
Gradient noise/stability against compute, memory, and steps per epoch.
Give one concrete example of data leakage.
Computing scaling or feature selection over the whole dataset before splitting.
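The last answer can be made concrete with a small experiment. The sketch below is one possible illustration, assuming NumPy is available. The data is pure noise with random labels, so no honest pipeline should beat chance by much. All sizes, the seed, the difference-of-class-means selection rule, and the nearest-centroid classifier are hypothetical choices for the example, not the course's prescribed method.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features, k = 100, 5000, 20

X = rng.standard_normal((n_samples, n_features))  # features carry no signal
y = rng.integers(0, 2, size=n_samples)            # labels are random

train_idx = np.arange(0, 50)
test_idx = np.arange(50, 100)


def top_k_features(X, y, k):
    """Indices of the k features with the largest class-mean difference."""
    diff = X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0)
    return np.argsort(np.abs(diff))[-k:]


def centroid_accuracy(X_train, y_train, X_test, y_test):
    """Nearest-centroid classifier: fit on train rows, score on test rows."""
    c0 = X_train[y_train == 0].mean(axis=0)
    c1 = X_train[y_train == 1].mean(axis=0)
    d0 = np.linalg.norm(X_test - c0, axis=1)
    d1 = np.linalg.norm(X_test - c1, axis=1)
    pred = (d1 < d0).astype(int)
    return float((pred == y_test).mean())


def evaluate(cols):
    return centroid_accuracy(
        X[train_idx][:, cols], y[train_idx],
        X[test_idx][:, cols], y[test_idx],
    )


# Leak: feature selection sees the test rows and their labels.
leaky_cols = top_k_features(X, y, k)
# Fix: feature selection sees the training rows only.
clean_cols = top_k_features(X[train_idx], y[train_idx], k)

leaky_acc = evaluate(leaky_cols)
clean_acc = evaluate(clean_cols)
print(f"selection on all rows:   test accuracy = {leaky_acc:.2f}")
print(f"selection on train rows: test accuracy = {clean_acc:.2f}")
```

The leaky run tends to score well above chance even though the labels are random, because the selected features were chosen partly for how well they separate the test rows. Selecting on the training rows alone removes that advantage.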

Instructor lesson plan (with references)
