Sample submission Week 7 lab · Regularization and generalization
An illustrative, complete lab submission following the course's Build / Predict & probe / Explain & defend model. It shows the expected structure, depth, and tone of a strong submission, including a prediction that turned out wrong and what was learned from it.
The training pipeline below was drafted with an AI assistant and then read line by line. The dataset is deliberately small (60 training points) with 34 uninformative noise features, so an unregularized network memorizes the training set and generalizes poorly.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Synthetic task: the label depends on only the first 6 of 40 features;
# the other 34 are pure noise the model can latch onto and overfit.
def make(n, d=40, k=6):
    X = torch.randn(n, d)
    w = torch.zeros(d)
    w[:k] = torch.tensor([2.0, -1.8, 1.5, -1.4, 1.2, -1.0])
    y = (X @ w + 0.2 * torch.randn(n) > 0).long()
    return X, y

Xtr, ytr = make(60)      # small training set -> easy to overfit
Xval, yval = make(2000)  # large validation set -> stable estimate

def mlp(p_drop=0.0):
    return nn.Sequential(
        nn.Linear(40, 256), nn.ReLU(), nn.Dropout(p_drop),
        nn.Linear(256, 256), nn.ReLU(), nn.Dropout(p_drop),
        nn.Linear(256, 2))

def train(model, weight_decay=0.0, epochs=600):
    opt = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=weight_decay)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        model.train()
        opt.zero_grad()
        loss_fn(model(Xtr), ytr).backward()
        opt.step()
    model.eval()  # switch dropout off before measuring accuracy
    with torch.no_grad():
        tr = (model(Xtr).argmax(1) == ytr).float().mean().item()
        va = (model(Xval).argmax(1) == yval).float().mean().item()
    return tr, va

torch.manual_seed(0)
tr, va = train(mlp())  # baseline: no regularization
print(f"baseline train {tr:.2f} val {va:.3f}")
# -> baseline train 1.00 val 0.717

The baseline reaches 100% training accuracy but only 71.7% validation accuracy: a 28-point gap, the signature of overfitting. The network has fit the noise features.
Four hypotheses were written down before running anything, then tested with a controlled ablation that changes one regularizer at a time.
| # | Hypothesis (written before running) | Predicted outcome |
|---|---|---|
| H1 | Dropout 0.5 regularizes the network. | Validation accuracy rises; training accuracy may dip. |
| H2 | Weight decay 5e-2 suppresses the 34 noise features. | Validation rises more than dropout alone. |
| H3 | Dropout and weight decay together. | Best validation accuracy of all configurations. |
| H4 | Very strong regularization (dropout 0.8 + wd 5e-1). | The model underfits; training accuracy collapses. |
configs = [
    ("dropout 0.5", dict(p_drop=0.5)),
    ("weight decay 5e-2", dict(weight_decay=5e-2)),
    ("dropout 0.5 + wd 5e-2", dict(p_drop=0.5, weight_decay=5e-2)),
    ("over-regularized", dict(p_drop=0.8, weight_decay=5e-1)),
]
for name, kw in configs:
    torch.manual_seed(0)  # same initialization for every configuration
    p = kw.get("p_drop", 0.0)
    wd = kw.get("weight_decay", 0.0)
    tr, va = train(mlp(p), weight_decay=wd)
    print(f"{name:22s} train {tr:.2f} val {va:.3f}")

| Configuration | Train acc | Val acc | vs baseline |
|---|---|---|---|
| baseline (no regularization) | 1.00 | 0.717 | – |
| dropout 0.5 | 1.00 | 0.740 | +0.023 |
| weight decay 5e-2 | 1.00 | 0.759 | +0.042 |
| dropout 0.5 + weight decay 5e-2 | 1.00 | 0.757 | +0.040 |
| over-regularized (drop 0.8 + wd 5e-1) | 0.52 | 0.502 | −0.215 |
| # | Predicted | Observed | Verdict |
|---|---|---|---|
| H1 | Val rises, train may dip. | Val +0.023; train stayed 1.00 (60 points still memorized). | confirmed |
| H2 | Weight decay beats dropout. | Weight decay best single regularizer (+0.042). | confirmed |
| H3 | Combining is best. | 0.757, just below weight decay alone (0.759). Dropout added nothing on top. | refuted |
| H4 | Underfits, train collapses. | Train fell to 0.52, val to chance (0.50). | confirmed |
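The 0.002 difference between the two best configurations is small compared with the sampling noise of a 2,000-point validation set. As a rough check, the sketch below recomputes the "vs baseline" column from the reported accuracies and estimates a binomial standard error for each. The helper name `binomial_se` and the independence assumption behind it are illustrative and not part of the lab code.

```python
import math

N_VAL = 2000  # validation set size used in the lab

# Validation accuracies copied from the results table above.
val_acc = {
    "baseline": 0.717,
    "dropout 0.5": 0.740,
    "weight decay 5e-2": 0.759,
    "dropout 0.5 + wd 5e-2": 0.757,
    "over-regularized": 0.502,
}

def binomial_se(acc, n):
    """Standard error of an accuracy estimate, assuming n independent examples."""
    return math.sqrt(acc * (1.0 - acc) / n)

for name, acc in val_acc.items():
    delta = acc - val_acc["baseline"]
    se = binomial_se(acc, N_VAL)
    print(f"{name:22s} val {acc:.3f}  delta {delta:+.3f}  se {se:.4f}")
```

Under this assumption each standard error is roughly 0.01, so a 0.002 gap between weight decay alone and the combined configuration is well within noise. Because every configuration is scored on the same validation points, a paired comparison would be a sharper test; this is only a first-order check.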
Why regularization helped. A 60-point training set with 34 noise features is a high-variance regime: the unregularized network fits the noise (training accuracy 1.00), and that fitted noise degrades its validation predictions. Weight decay penalizes large weights, shrinking the reliance on the noise features; dropout forces the network to spread its prediction across redundant units. Both lower variance: validation accuracy rises from 0.717 to 0.740 with dropout and to 0.759 with weight decay.
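To make the weight-decay mechanism concrete, the sketch below isolates the penalty from the data term. It swaps in plain SGD for the Adam optimizer used in the lab, because with SGD the effect is easy to read off: when the loss gradient is zero, one step multiplies every weight by (1 - lr * weight_decay). The values of `lr` and `wd` are chosen for illustration and are not the lab's settings.

```python
import torch

w = torch.nn.Parameter(torch.tensor([2.0, -1.0, 0.5]))
lr, wd = 0.1, 0.5  # illustrative values, not the lab's settings
opt = torch.optim.SGD([w], lr=lr, weight_decay=wd)

before = w.detach().clone()
opt.zero_grad()
loss = (w * 0.0).sum()  # a loss whose gradient with respect to w is exactly zero
loss.backward()
opt.step()              # the only change comes from the weight-decay term

after = w.detach().clone()
print("before:", before)
print("after: ", after)  # each entry shrunk by the factor 1 - lr * wd = 0.95
```

In the lab's Adam setup the same penalty gradient (`weight_decay` times the weight) is added before Adam's per-parameter rescaling, so the shrinkage is not a fixed factor per step, but it still pulls weights toward zero.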
Where it would break. With abundant training data the gap would be small and regularization would matter little. If regularization is too strong, the model underfits, as in the over-regularized run, where validation accuracy fell to 0.50, no better than guessing. Dropout must be off at evaluation (model.eval()); leaving it on would randomly zero activations at prediction time and distort the reported validation accuracy.
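The evaluation-mode point can be seen in isolation. This small sketch, independent of the lab pipeline, applies `nn.Dropout` to a vector of ones in both modes; the vector length and the seed are arbitrary choices.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)
x = torch.ones(8)

drop.train()
y_train = drop(x)  # each entry is either 0 or 1 / (1 - p) = 2

drop.eval()
y_eval = drop(x)   # identity: the input passes through unchanged

print("train mode:", y_train)
print("eval mode: ", y_eval)
```

Calling `model.eval()` on an `nn.Sequential` sets this mode on every submodule, which is why the lab's `train` function calls it once before computing accuracy.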
What changed and the final choice. Validation accuracy improved by about 4 points (0.717 to 0.759), which is consistent with reduced reliance on the 34 noise features. The chosen configuration is weight decay 5e-2 alone: it gives the best validation accuracy with the simplest setup, and the ablation shows dropout adds nothing on top of it for this dataset.
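The accuracy table does not directly measure whether weight decay reduces reliance on the noise features. One possible probe, sketched here under assumptions, compares the average norm of the first-layer weight columns for the 6 signal inputs against the 34 noise inputs. The helper `column_norms` is hypothetical and assumes the model is an `nn.Sequential` whose first module is the input `nn.Linear`, as in `mlp()`. No output is shown because this probe was not part of the reported runs.

```python
import torch
import torch.nn as nn

def column_norms(model, k=6):
    """Return (mean signal-column norm, mean noise-column norm) of layer 0.

    Hypothetical helper: assumes model[0] is the input nn.Linear and that
    the first k input features are the informative ones.
    """
    W = model[0].weight.detach()       # shape: (hidden_units, n_features)
    norms = W.pow(2).sum(dim=0).sqrt() # one L2 norm per input feature
    return norms[:k].mean().item(), norms[k:].mean().item()

# Usage on an untrained network with the lab's layer sizes (dropout omitted).
torch.manual_seed(0)
model = nn.Sequential(
    nn.Linear(40, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, 2))
signal, noise = column_norms(model)
print(f"signal {signal:.3f}  noise {noise:.3f}  ratio {noise / signal:.2f}")
```

Keeping a reference to each model passed to `train` and calling `column_norms` on it afterwards would show whether the noise-to-signal ratio is lower for the weight-decay run than for the baseline.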