Week 4 Part II · Training Infrastructure

Data Pipelines

The Dataset and DataLoader abstractions; batching, shuffling, transforms, and splits.

Learning goals

Build a custom Dataset and DataLoader.
Reason about batching, shuffling, and clean splits.
Recognize and avoid data leakage.

This is the weekly homework lab, completed independently after the lecture and the practice lesson. It follows the course's Build / Predict & probe / Explain & defend model: use an AI assistant freely for the Build; the graded learning is in Predict and Explain. See the AI-use policy and a fully worked sample submission.

⚙ Exercise

Part A · Build (AI assistant welcome)

Implement a custom Dataset and wrap it in a DataLoader with transforms, batching, and shuffling.
Create a clean train, validation, and test split with a fixed seed.
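One possible shape for the Build step is sketched here, assuming the Dataset and DataLoader in question are PyTorch's torch.utils.data classes (the lab does not name the framework). ToyDataset, make_loaders, the 70/15/15 ratio, the doubling transform, and the synthetic tensors are hypothetical choices for illustration, not part of the assignment.

```python
import torch
from torch.utils.data import Dataset, DataLoader, random_split


class ToyDataset(Dataset):
    """Map-style dataset over in-memory tensors (hypothetical example data)."""

    def __init__(self, features, labels, transform=None):
        assert len(features) == len(labels)
        self.features = features
        self.labels = labels
        self.transform = transform  # optional callable applied per sample

    def __len__(self):
        return len(self.features)

    def __getitem__(self, idx):
        x = self.features[idx]
        if self.transform is not None:
            x = self.transform(x)
        return x, self.labels[idx]


def make_loaders(dataset, batch_size=32, seed=0):
    """Split 70/15/15 with a fixed seed and wrap each split in a DataLoader."""
    n = len(dataset)
    n_train = n * 70 // 100
    n_val = n * 15 // 100
    n_test = n - n_train - n_val  # remainder goes to the test split

    # A seeded generator makes the split identical on every run.
    split_gen = torch.Generator().manual_seed(seed)
    train_set, val_set, test_set = random_split(
        dataset, [n_train, n_val, n_test], generator=split_gen
    )

    # Shuffle only the training loader; evaluation order stays fixed.
    shuffle_gen = torch.Generator().manual_seed(seed)
    train_loader = DataLoader(
        train_set, batch_size=batch_size, shuffle=True, generator=shuffle_gen
    )
    val_loader = DataLoader(val_set, batch_size=batch_size, shuffle=False)
    test_loader = DataLoader(test_set, batch_size=batch_size, shuffle=False)
    return train_loader, val_loader, test_loader


# Synthetic data: 200 samples with 8 features each (an assumption for the demo).
features = torch.randn(200, 8, generator=torch.Generator().manual_seed(0))
labels = (features.sum(dim=1) > 0).long()

# Placeholder transform; a real pipeline would normalize or augment here.
dataset = ToyDataset(features, labels, transform=lambda x: x * 2.0)
train_loader, val_loader, test_loader = make_loaders(dataset)

xb, yb = next(iter(train_loader))
print(xb.shape, yb.shape, len(train_loader))
```

Passing a seeded torch.Generator to random_split is what makes the split reproducible. The three subsets never share an index, which is the minimum requirement for a clean split.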

Part B · Predict & probe (student reasoning)

Predict the effect of batch size on loss-curve smoothness and steps per epoch.
Predict what happens to validation accuracy if normalization stats are computed on the full dataset.
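After writing down a prediction for the batch-size question, a small simulation is one way to check it. The sketch below uses only the standard library. The per-sample values are a hypothetical stand-in for per-sample gradients, and the spread of batch means stands in for step-to-step noise in the loss curve; neither comes from a real model.

```python
import math
import random
import statistics


def steps_per_epoch(num_samples, batch_size, drop_last=False):
    """Number of optimizer steps in one pass over the data."""
    if drop_last:
        return num_samples // batch_size
    return math.ceil(num_samples / batch_size)


def batch_mean_spread(values, batch_size, num_batches, rng):
    """Std. dev. of batch means: a proxy for noise between training steps."""
    means = [
        statistics.fmean(rng.sample(values, batch_size))
        for _ in range(num_batches)
    ]
    return statistics.stdev(means)


rng = random.Random(0)
# Hypothetical per-sample "gradient" values drawn with unit variance.
per_sample = [rng.gauss(0.0, 1.0) for _ in range(10_000)]

for batch_size in (8, 32, 128, 512):
    steps = steps_per_epoch(len(per_sample), batch_size)
    spread = batch_mean_spread(per_sample, batch_size, 200, rng)
    print(f"batch={batch_size:4d}  steps/epoch={steps:5d}  spread={spread:.3f}")
```

With 10,000 samples, a batch size of 32 gives 313 steps per epoch, or 312 if the final partial batch is dropped. The spread column should shrink as the batch grows, roughly in proportion to one over the square root of the batch size.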

Part C · Explain & defend (in plain language)

Introduce a deliberate data leak, observe the inflated metric, and explain the mechanism and the fix.
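One way to stage such a leak is feature selection on pure noise, sketched here with the standard library only. Every name, the 60-row by 500-feature size, and the one-feature threshold classifier are hypothetical choices for the demonstration. Because no feature carries real signal, honest test accuracy should sit near chance, so any gain in the leaky variant comes from the leak itself.

```python
import random


def class_mean_gap(rows, labels, j):
    """Difference between class means of feature j (label 1 minus label 0)."""
    pos = [r[j] for r, y in zip(rows, labels) if y == 1]
    neg = [r[j] for r, y in zip(rows, labels) if y == 0]
    return sum(pos) / len(pos) - sum(neg) / len(neg)


def select_feature(rows, labels):
    """Pick the feature whose class means differ most on the given rows."""
    n_features = len(rows[0])
    return max(range(n_features), key=lambda j: abs(class_mean_gap(rows, labels, j)))


def holdout_accuracy(j, train_x, train_y, test_x, test_y):
    """Fit a one-feature threshold rule on train rows, score it on test rows."""
    pos = [r[j] for r, y in zip(train_x, train_y) if y == 1]
    neg = [r[j] for r, y in zip(train_x, train_y) if y == 0]
    mean_pos, mean_neg = sum(pos) / len(pos), sum(neg) / len(neg)
    threshold = (mean_pos + mean_neg) / 2
    sign = 1 if mean_pos >= mean_neg else -1
    preds = [1 if sign * (r[j] - threshold) > 0 else 0 for r in test_x]
    return sum(p == y for p, y in zip(preds, test_y)) / len(test_y)


def run_trial(seed, n_rows=60, n_features=500):
    rng = random.Random(seed)
    # Pure noise: no feature has any real relationship to the label.
    rows = [[rng.gauss(0, 1) for _ in range(n_features)] for _ in range(n_rows)]
    labels = [i % 2 for i in range(n_rows)]
    half = n_rows // 2
    train_x, test_x = rows[:half], rows[half:]
    train_y, test_y = labels[:half], labels[half:]

    leaky_j = select_feature(rows, labels)      # LEAK: selection sees test rows
    clean_j = select_feature(train_x, train_y)  # FIX: selection sees train only
    return (
        holdout_accuracy(leaky_j, train_x, train_y, test_x, test_y),
        holdout_accuracy(clean_j, train_x, train_y, test_x, test_y),
    )


# Average over several seeds so one lucky split does not decide the outcome.
results = [run_trial(seed) for seed in range(30)]
leaky_mean = sum(r[0] for r in results) / len(results)
clean_mean = sum(r[1] for r in results) / len(results)
print(f"leaky selection: {leaky_mean:.3f}   clean selection: {clean_mean:.3f}")
```

The mechanism is that the leaky selection step has already seen the test labels: among hundreds of noise features, it keeps the one that happens to line up with them. The fix is to run selection, like any other fitted preprocessing step, on the training rows only.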

✓ Deliverables

A pipeline notebook with a working DataLoader.
A short batch-size experiment and the leak demonstration with explanation.

Hints.

Fit normalization on the training split only.
Set shuffle=True for the training DataLoader and shuffle=False for validation and test.
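The first hint can be illustrated with a framework-agnostic sketch. Standardizer and the three small lists are hypothetical; in a real pipeline the fitted mean and standard deviation would be passed to whatever normalization transform the Dataset applies.

```python
import statistics


class Standardizer:
    """Stores a mean and standard deviation learned from one split only."""

    def fit(self, values):
        self.mean = statistics.fmean(values)
        # Fall back to 1.0 if the split has zero spread, to avoid dividing by 0.
        self.std = statistics.pstdev(values) or 1.0
        return self

    def transform(self, values):
        return [(v - self.mean) / self.std for v in values]


# Hypothetical single-feature splits; the test values are shifted on purpose.
train = [1.0, 2.0, 3.0, 4.0, 5.0]
val = [2.0, 4.0]
test = [10.0, 12.0]

scaler = Standardizer().fit(train)   # statistics come from the training split
train_n = scaler.transform(train)
val_n = scaler.transform(val)        # held-out splits reuse the train statistics
test_n = scaler.transform(test)

# The leaky alternative: fitting on every split lets held-out values
# shift the mean and scale that the training inputs are normalized with.
leaky_scaler = Standardizer().fit(train + val + test)
print(scaler.mean, leaky_scaler.mean)
```

Here the training mean is 3.0, while the mean over all three splits is pulled upward by the shifted test values. That difference is exactly the information that should not reach training.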

❓ Self-check

Answer each before expanding it. If one is unclear, revisit the lab and the references.

Which two methods must a custom Dataset implement?

__len__ and __getitem__.

Why fit normalization statistics on the training split only?

Statistics computed over validation or test data carry information about those held-out sets into training, so the reported metrics come out inflated.

Should the validation set be shuffled?

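No. Shuffle only the training set. Metrics computed over the whole validation set do not depend on sample order, and a fixed order keeps runs comparable.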

What does batch size trade off?

Larger batches give less noisy gradient estimates and fewer steps per epoch, at the cost of more memory and compute per step. Smaller batches give the opposite.

Give one concrete example of data leakage.

Fitting a scaler or running feature selection over the whole dataset before splitting it.

Instructor lesson plan (with references)