Introduction to Deep Learning · HIT

Sample submission · Week 7 lab · Regularization and generalization

Closing the generalization gap on a small classifier

An illustrative, complete lab submission following the course's Build / Predict & probe / Explain & defend model. It shows the expected structure, depth, and tone of a strong submission, including a prediction that turned out wrong and what was learned from it.

Lab task. Train a classifier on a small dataset that overfits, then reduce the overfitting with regularization. Report validation accuracy before and after, design an ablation that isolates each regularizer, and justify the final choice.

Part A · Build (AI assistant welcome)

The training pipeline below was drafted with an AI assistant and then read line by line. The dataset is deliberately small (60 training points) with 34 uninformative noise features, so an unregularized network memorizes the training set and generalizes poorly.

import torch, torch.nn as nn
torch.manual_seed(0)

# Synthetic task: the label depends on only the first 6 of 40 features;
# the other 34 are pure noise the model can latch onto and overfit.
def make(n, d=40, k=6):
    X = torch.randn(n, d)
    w = torch.zeros(d)
    w[:k] = torch.tensor([2.0, -1.8, 1.5, -1.4, 1.2, -1.0])
    y = (X @ w + 0.2 * torch.randn(n) > 0).long()
    return X, y

Xtr, ytr   = make(60)      # small training set  -> easy to overfit
Xval, yval = make(2000)    # large validation set -> stable estimate

def mlp(p_drop=0.0):
    return nn.Sequential(
        nn.Linear(40, 256), nn.ReLU(), nn.Dropout(p_drop),
        nn.Linear(256, 256), nn.ReLU(), nn.Dropout(p_drop),
        nn.Linear(256, 2))

def train(model, weight_decay=0.0, epochs=600):
    # Adam's weight_decay adds weight_decay * w to each gradient (an L2 penalty).
    opt = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=weight_decay)
    loss_fn = nn.CrossEntropyLoss()
    model.train()                    # dropout active while fitting
    for _ in range(epochs):          # full-batch updates on all 60 points
        opt.zero_grad()
        loss = loss_fn(model(Xtr), ytr)
        loss.backward()
        opt.step()
    model.eval()                     # dropout off for evaluation
    with torch.no_grad():
        tr = (model(Xtr).argmax(1) == ytr).float().mean().item()
        va = (model(Xval).argmax(1) == yval).float().mean().item()
    return tr, va

torch.manual_seed(0)
tr, va = train(mlp())          # baseline: no regularization
print(f"baseline  train {tr:.2f}  val {va:.3f}")
# -> baseline  train 1.00  val 0.717

The baseline reaches 100% training accuracy but only 71.7% validation accuracy: a gap of about 28 points, the signature of overfitting. The network has memorized the 60 training points, most plausibly by fitting the 34 noise features.
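
One hedged way to probe the claim that the network leans on the noise features is to compare first-layer weight norms for the 6 informative input columns against the 34 noise columns. The helper below is a sketch and not part of the original submission: `noise_to_signal_ratio` is a hypothetical name, and it assumes the first module of the model is a linear layer like the `nn.Linear(40, 256)` in `mlp()`. A ratio near 1 means the layer weights noise and signal inputs about equally; a ratio well below 1 means the noise columns have been suppressed.

```python
import torch
import torch.nn as nn

def noise_to_signal_ratio(model: nn.Sequential, k: int = 6) -> float:
    """Mean L2 norm of noise-feature columns divided by that of signal columns.

    Assumes model[0] is an nn.Linear whose first k inputs are informative.
    """
    W = model[0].weight.detach()          # shape: (hidden, n_features)
    col_norms = W.norm(dim=0)             # one norm per input feature
    return (col_norms[k:].mean() / col_norms[:k].mean()).item()

# Tiny stand-in model with hand-set weights so the sketch runs on its own.
demo = nn.Sequential(nn.Linear(40, 8), nn.ReLU(), nn.Linear(8, 2))
with torch.no_grad():
    demo[0].weight.fill_(1.0)             # every column starts with the same norm
    demo[0].weight[:, 6:] = 0.5           # halve the 34 noise columns
ratio = noise_to_signal_ratio(demo)
print(f"noise/signal weight-norm ratio: {ratio:.2f}")
```

To apply this to the lab's models, `train` would need to be called on a model object that is kept in a variable, so the ratio can be computed after the baseline run and again after the weight-decay run.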

Part B · Predict & probe (reasoning)

Four hypotheses were written down before running anything, then tested with a controlled ablation that changes one regularizer at a time.

Predictions (before running)

# | Hypothesis (written before running) | Predicted outcome
H1 | Dropout 0.5 regularizes the network. | Validation accuracy rises; training accuracy may dip.
H2 | Weight decay 5e-2 suppresses the 34 noise features. | Validation rises more than dropout alone.
H3 | Dropout and weight decay together. | Best validation accuracy of all configurations.
H4 | Very strong regularization (dropout 0.8 + wd 5e-1). | The model underfits; training accuracy collapses.

The experiment

configs = [
    ("dropout 0.5",           dict(p_drop=0.5)),
    ("weight decay 5e-2",     dict(weight_decay=5e-2)),
    ("dropout 0.5 + wd 5e-2", dict(p_drop=0.5, weight_decay=5e-2)),
    ("over-regularized",      dict(p_drop=0.8, weight_decay=5e-1)),
]
for name, kw in configs:
    torch.manual_seed(0)
    p = kw.get("p_drop", 0.0); wd = kw.get("weight_decay", 0.0)
    tr, va = train(mlp(p), weight_decay=wd)
    print(f"{name:22s} train {tr:.2f}  val {va:.3f}")
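
Each configuration above is run once with seed 0, so small differences between rows may reflect seed noise. A hedged extension, not part of the original submission, is to repeat each run over several seeds and report the mean and spread. In the sketch below `repeat_over_seeds` and `fake_run` are hypothetical names. `fake_run` is a stand-in that returns made-up numbers purely so the block executes; in the lab it would be replaced by a function that seeds torch, calls `train(mlp(p), weight_decay=wd)`, and returns the validation accuracy.

```python
from statistics import mean, stdev
from typing import Callable, Iterable, Tuple

def repeat_over_seeds(run: Callable[[int], float],
                      seeds: Iterable[int]) -> Tuple[float, float]:
    """Call run(seed) for each seed; return (mean, sample std) of the results."""
    scores = [run(s) for s in seeds]
    return mean(scores), stdev(scores)

def fake_run(seed: int) -> float:
    # Stand-in only: these are NOT lab results. Replace with a function that
    # seeds torch, trains one configuration, and returns validation accuracy.
    return 0.70 + 0.01 * (seed % 3)

avg, spread = repeat_over_seeds(fake_run, seeds=range(6))
print(f"val acc {avg:.3f} +/- {spread:.3f} over 6 seeds")
```

Reporting a spread alongside each mean makes it easier to tell a real improvement from a difference a different seed could reverse.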

Results

Configuration | Train acc | Val acc | vs baseline
baseline (no regularization) | 1.00 | 0.717 | (reference)
dropout 0.5 | 1.00 | 0.740 | +0.023
weight decay 5e-2 | 1.00 | 0.759 | +0.042
dropout 0.5 + weight decay 5e-2 | 1.00 | 0.757 | +0.040
over-regularized (drop 0.8 + wd 5e-1) | 0.52 | 0.502 | −0.215

Predicted vs observed

# | Predicted | Observed | Verdict
H1 | Val rises, train may dip. | Val +0.023; train stayed 1.00 (60 points still memorized). | confirmed
H2 | Weight decay beats dropout. | Weight decay best single regularizer (+0.042). | confirmed
H3 | Combining is best. | 0.757, just below weight decay alone (0.759). Dropout added nothing on top. | refuted
H4 | Underfits, train collapses. | Train fell to 0.52, val to chance (0.50). | confirmed

The instructive miss (H3). Combining dropout with weight decay was predicted to be best, but it did not beat weight decay alone (0.757 vs 0.759, effectively a tie on a single seed). A plausible reading: with weight decay already constraining the weights, dropout had little left to contribute and may have cost capacity that was useful on 60 points; the two regularizers overlap here rather than stack. Catching this is the point of writing the prediction down first.
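
To judge whether a gap such as 0.757 versus 0.759 is meaningful, it helps to estimate the sampling noise in an accuracy measured on 2000 validation points. The sketch below uses the binomial standard error sqrt(p(1 - p) / n). `accuracy_std_error` is a hypothetical helper name, and the calculation assumes independent validation examples and ignores seed-to-seed training variance.

```python
import math

def accuracy_std_error(acc: float, n: int) -> float:
    """Binomial standard error of an accuracy estimated from n examples."""
    return math.sqrt(acc * (1.0 - acc) / n)

n_val = 2000
wd_only, combined = 0.759, 0.757          # values from the results table
se = accuracy_std_error(wd_only, n_val)
gap = wd_only - combined
print(f"standard error ~ {se:.4f}, observed gap {gap:.3f}")
# -> standard error ~ 0.0096, observed gap 0.002
```

Under these assumptions the 0.002 gap is well inside one standard error, so the two configurations are best read as tied on this validation set. Because both models are scored on the same 2000 points, a paired comparison would give a sharper estimate than this rough check.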

Part C · Explain & defend (in plain language)

Why regularization helped. A 60-point training set with 34 noise features is a high-variance regime: the unregularized network fits the noise (training accuracy 1.00) and carries it to validation. Weight decay penalizes large weights, shrinking the reliance on the noise features; dropout forces the network to spread its prediction across redundant units. Both lower variance, so validation accuracy rises from 0.72 to 0.76.
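
The shrinking effect of weight decay can be seen in isolation. The sketch below uses plain `torch.optim.SGD` rather than the Adam optimizer from Part A, because the SGD update is easy to verify by hand: with `weight_decay=wd` the optimizer adds `wd * w` to the gradient, so when the loss gradient is zero a single step multiplies every weight by `1 - lr * wd`. The values of `lr` and `wd` are illustrative and are not the lab's settings.

```python
import torch

w = torch.nn.Parameter(torch.tensor([2.0, -1.0, 0.5]))
lr, wd = 0.1, 0.5                    # illustrative values only
opt = torch.optim.SGD([w], lr=lr, weight_decay=wd)

before = w.detach().clone()
loss = (w * 0.0).sum()               # loss whose gradient w.r.t. w is exactly zero
loss.backward()
opt.step()                           # update direction is grad + wd * w = wd * w

after = w.detach().clone()
print(after / before)                # every weight scaled by 1 - lr * wd = 0.95
```

With a real loss the data gradient competes with this pull toward zero, so weights that do little to reduce the loss, such as those on noise features, are the ones that tend to shrink.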

Where it would break. With abundant training data the gap would be small and regularization would matter little. If the regularization is too strong, the model underfits: the over-regularized run reached only 0.502 validation accuracy, no better than guessing. Dropout must also be off at evaluation (model.eval()); leaving it on would corrupt the reported validation accuracy.
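
The train/eval distinction for dropout can be checked directly. The sketch below is illustrative only: it passes a vector of ones through `nn.Dropout(0.5)` in both modes. In training mode the surviving entries are scaled by 1 / (1 - p) = 2 and the rest are zeroed; in eval mode the layer is the identity.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)
x = torch.ones(1000)

drop.train()
y_train = drop(x)             # random mask; survivors scaled by 1 / (1 - p)
drop.eval()
y_eval = drop(x)              # identity: no masking, no scaling

print("train-mode values:", sorted(y_train.unique().tolist()))
print("eval-mode unchanged:", torch.equal(y_eval, x))
```

An accuracy computed while the layer is still in training mode would be measured on randomly masked activations, which is why `train` in Part A calls `model.eval()` before scoring.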

What changed and the final choice. Validation accuracy improved by 4 points (0.717 to 0.759) by suppressing reliance on the 34 noise features. The chosen configuration is weight decay 5e-2 alone: it gives the best validation accuracy with the simplest setup, and the ablation shows dropout adds nothing on top of it for this dataset.

AI-use disclosure. An AI assistant drafted the training loop and the metric code (Part A). The ablation design, the four predictions, the interpretation of the H3 surprise, and the final choice are the author's own work, and every line can be explained and defended on request.
How this maps to the rubric. Part A shows working, understood code. Part B shows reasoning made falsifiable: predictions first, a clean one-variable-at-a-time ablation, and an honest account of the prediction that failed. Part C shows mechanism, failure modes, and a defended decision. The graded weight is on Parts B and C, the parts an assistant cannot do for the student.
