Prerequisite Review & Refresh

🧮 Basic Machine Learning Concepts

This course assumes you have completed an introductory machine-learning course. Deep learning reuses the same vocabulary, models, losses, data splits, and overfitting story, so the neural-network material builds on familiar ground.

Supervised learning

Regression predicts a continuous number; classification predicts a category.
Features (inputs), labels (targets), and the train/test split that estimates how well a model generalizes.
Linear and logistic regression as the simplest baseline models.

Error, cost, and loss

A loss measures the error on one example; the cost (objective) aggregates it over the whole dataset.
Mean squared error for regression; cross-entropy for classification.
The loss is what training minimizes; the metric (such as accuracy) is the number reported to people.
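The distinction between a per-example loss and a dataset-level cost can be sketched in a few lines of plain Python. This is a minimal illustration, not a library implementation: the function names (`squared_error`, `binary_cross_entropy`, `cost`), the clipping constant `eps`, and the toy numbers are all assumptions made for the example.

```python
import math

def squared_error(y_true, y_pred):
    """Loss on a single regression example."""
    return (y_true - y_pred) ** 2

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Loss on a single binary example; p_pred is the predicted P(y = 1)."""
    p = min(max(p_pred, eps), 1.0 - eps)  # clip to avoid log(0)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1.0 - p))

def cost(loss_fn, targets, predictions):
    """Cost: the per-example loss averaged over the whole dataset."""
    losses = [loss_fn(t, p) for t, p in zip(targets, predictions)]
    return sum(losses) / len(losses)

# Toy data (assumed for illustration).
mse = cost(squared_error, [3.0, 5.0], [2.0, 7.0])        # (1 + 4) / 2
bce = cost(binary_cross_entropy, [1, 0], [0.9, 0.2])     # mean log loss
```

Averaging the squared error gives the mean squared error; averaging the log loss gives the cross-entropy cost used for classification.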

Overfitting and generalization

Training error (on data already seen) versus test error (on new data), and why the gap matters.
Train, validation, and test splits, and cross-validation for a more stable estimate.
Learning curves and the gap between training and validation performance.
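The fold rotation behind cross-validation can be sketched as an index generator. This is a simplified illustration: the helper name `k_fold_indices` is hypothetical, the data is not shuffled (in practice one would usually shuffle first), and a held-out test set is assumed to have been set aside before these folds are built.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs; each sample is validated exactly once."""
    indices = list(range(n_samples))
    # Spread the remainder so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size

folds = list(k_fold_indices(n_samples=10, k=3))
for train_idx, val_idx in folds:
    print(len(train_idx), "train /", len(val_idx), "validation")
```

A model would be trained once per pair and the k validation scores averaged, which is what makes the estimate more stable than a single split.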

Regularization

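L2 (weight decay) penalizes the sum of squared weights, shrinking them toward zero; L1 penalizes the sum of absolute weights and drives some of them to exactly zero (sparsity).
Early stopping (halting training once validation error stops improving) and data augmentation (enlarging the training set with label-preserving transformations).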
Regularization trades a little training fit for better generalization.
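One common way to write the effect of these penalties on the objective is shown below. The notation is an assumption for illustration: $J$ is the unregularized cost, $\ell$ the per-example loss, $f_w$ the model with weights $w$, $N$ the number of training examples, and $\lambda \ge 0$ the regularization strength (a hyperparameter).

```latex
J(w) = \frac{1}{N} \sum_{i=1}^{N} \ell\big(f_w(x_i),\, y_i\big)
\qquad
J_{\mathrm{L2}}(w) = J(w) + \lambda \sum_{j} w_j^{2}
\qquad
J_{\mathrm{L1}}(w) = J(w) + \lambda \sum_{j} \lvert w_j \rvert
```

The gradient of the L2 term with respect to $w_j$ is $2\lambda w_j$, so each gradient step pulls every weight toward zero in proportion to its size, which is where the name weight decay comes from.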

The bias-variance dilemma

High bias means underfitting (too simple); high variance means overfitting (too complex).
Model capacity and dataset size move a model between these two failure modes.
The aim is the balance point that minimizes error on new data.
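For squared-error loss the trade-off has a standard decomposition, sketched here under stated assumptions: the data follow $y = f(x) + \varepsilon$ with zero-mean noise of variance $\sigma^2$, $\hat{f}$ is the model fitted on a random training set, and the expectation is taken over training sets and noise. The symbols are notation chosen for this illustration.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
= \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
+ \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
+ \underbrace{\sigma^2}_{\text{irreducible noise}}
```

Increasing capacity typically lowers the first term and raises the second; the third term cannot be reduced by any choice of model.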

Other essentials

Gradient descent and the learning rate as the core idea of the training loop.
Feature scaling and normalization so inputs sit on comparable ranges.
Evaluation metrics: accuracy, precision, recall, F1, ROC/AUC, and the confusion matrix.
Parameters (learned from data) versus hyperparameters (set before training), baselines, data leakage, and class imbalance.
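The gradient-descent idea in the first item can be sketched on the smallest possible model, a line fitted by mean squared error. This is an illustrative sketch only: the function name `gradient_descent`, the learning rate, the step count, and the toy data are all assumptions, and real training loops use mini-batches and automatic differentiation.

```python
def gradient_descent(xs, ys, learning_rate=0.05, steps=1000):
    """Fit y = w * x + b by minimizing mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        # Gradients of the mean squared error with respect to w and b.
        grad_w = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2.0 / n * sum(errors)
        # Step against the gradient, scaled by the learning rate.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]  # generated from y = 2x + 1
w, b = gradient_descent(xs, ys)
```

Here `w` and `b` are parameters (learned), while `learning_rate` and `steps` are hyperparameters (chosen beforehand). A learning rate that is too large makes the updates diverge; one that is too small makes convergence slow.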

Readiness check

State whether a task is regression or classification and pick a loss.
Explain overfitting using the train/validation gap.
Describe how L2 regularization changes the objective.
Read a confusion matrix and compute precision and recall.
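The last item can be practised with a small sketch that builds the four confusion-matrix counts for a binary problem and derives precision, recall, and F1 from them. The helper names and the toy labels are assumptions for illustration, and returning 0.0 when a denominator is zero is one convention among several.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FP, FN, TN) for a binary problem."""
    tp = fp = fn = tn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive and t == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif t == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def precision_recall_f1(tp, fp, fn):
    # Assumed convention: a metric with a zero denominator is reported as 0.0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
tp, fp, fn, tn = confusion_counts(y_true, y_pred)      # 3, 1, 1, 3
precision, recall, f1 = precision_recall_f1(tp, fp, fn)  # 0.75 each
```

Precision asks how many predicted positives were right; recall asks how many actual positives were found.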

❓ Self-check questions

Multiple-choice questions on the topics reviewed above. Pick an answer before reading the explanation under each question. If several are unclear, work through the review above first.

1. Predicting a house price (a continuous number) is a:

A. classification task
B. regression task
C. clustering task
D. ranking task

Correct: B. Regression predicts continuous values; classification predicts discrete classes.

2. A typical loss for binary classification is:

A. mean squared error
B. binary cross-entropy
C. R-squared
D. accuracy

Correct: B. Cross-entropy (log loss) matches probabilistic classification; accuracy is a metric, not a loss.

3. The difference between a loss and a metric is:

A. they are the same
B. the loss is optimized in training; the metric is the reported measure
C. the metric is always the loss
D. the loss is only for testing

Correct: B. Training minimizes the differentiable loss; the metric is the human-facing measure, and the two can differ.

4. Overfitting is indicated by:

A. high training error and high test error
B. low training error and high test error
C. low error on both
D. high error on training only

Correct: B. The model memorizes the training data (low train error) but generalizes poorly (high test error).

5. The validation set is used to:

A. train the weights
B. tune hyperparameters and select models
C. report the final result
D. increase the data size

Correct: B. The validation set guides model and hyperparameter choices; the test set is the untouched final measure.

6. k-fold cross-validation:

A. trains once on all data
B. splits data into k folds and rotates the validation fold
C. only works for images
D. removes the need for a test set

Correct: B. Each fold serves as validation once; the results are averaged for a more robust estimate.

7. L2 regularization (weight decay) penalizes:

A. the sum of absolute weights
B. the sum of squared weights
C. the number of layers
D. the learning rate

Correct: B. L2 penalizes squared weight magnitude; L1 penalizes absolute values.

8. Compared with L2, L1 regularization tends to produce:

A. denser weights
B. sparser weights (more exact zeros)
C. larger weights
D. no effect

Correct: B. The L1 penalty drives some weights exactly to zero, giving sparse solutions.

9. The bias-variance trade-off says that:

A. bias and variance are unrelated
B. reducing one often increases the other
C. both always decrease together
D. variance is always zero

Correct: B. Simple models tend to have high bias and low variance; complex models tend to have the reverse. The goal is the balance that minimizes error on new data.

10. A model that is too simple for the data has:

A. high variance
B. high bias (underfitting)
C. perfect accuracy
D. too many parameters

Correct: B. Underfitting is high bias: the model cannot capture the underlying pattern.

11. Why can accuracy be misleading on imbalanced data?

A. it cannot be computed
B. a trivial majority-class predictor can score high
C. it requires probabilities
D. it only works for regression

Correct: B. With 95% of one class, always predicting that class scores 95% while missing the minority entirely.

12. Precision is defined as:

A. TP / (TP + FN)
B. TP / (TP + FP)
C. (TP + TN) / all
D. FP / all

Correct: B. Precision is true positives over predicted positives; recall is TP / (TP + FN).

13. The F1 score is:

A. the sum of precision and recall
B. the harmonic mean of precision and recall
C. accuracy minus error
D. the area under the ROC curve

Correct: B. F1 = 2 P R / (P + R), balancing precision and recall in one number.

14. A parameter and a hyperparameter differ in that:

A. they are identical
B. parameters are learned from data; hyperparameters are set before training
C. hyperparameters are learned by backprop
D. parameters are chosen by hand

Correct: B. Weights are parameters learned in training; learning rate, depth, etc. are hyperparameters set beforehand.

15. Data leakage is:

A. losing data files
B. information from the test set or future leaking into training
C. a memory error
D. having too little data

Correct: B. For example, scaling with statistics computed over the whole dataset before splitting inflates results.
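The scaling example in question 15 can be made concrete with a short sketch of the leakage-free order of operations: split first, compute the statistics on the training split only, then reuse them on the test split. The helper names and toy values are assumptions for illustration, and the code assumes the training values are not all identical (so the standard deviation is non-zero).

```python
import statistics

def fit_standardizer(train_values):
    """Compute scaling statistics from the training split only."""
    mean = statistics.fmean(train_values)
    std = statistics.pstdev(train_values)  # assumed non-zero here
    return mean, std

def standardize(values, mean, std):
    """Apply previously fitted statistics to any split."""
    return [(v - mean) / std for v in values]

train = [1.0, 2.0, 3.0, 4.0]
test = [10.0]

mean, std = fit_standardizer(train)          # the test data plays no part here
train_scaled = standardize(train, mean, std)
test_scaled = standardize(test, mean, std)   # reuse the statistics, never refit on test
```

Computing `mean` and `std` over `train + test` instead would let the test value shift the training inputs, which is exactly the leak the question describes.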

📚 Refresher resources

Google Machine Learning Crash Course (developers.google.com)
An Introduction to Statistical Learning, free book (statlearning.com)
StatQuest: Machine Learning playlist (youtube.com)
scikit-learn: Metrics and scoring (scikit-learn.org)