Prerequisite review & refresh

∑ Mathematics

Deep learning is applied linear algebra and calculus with a probabilistic flavor. A mathematician's depth is not required, but these ideas should feel familiar so the course can move quickly from notation to networks.

Linear algebra

Vectors and matrices: ordered lists and grids of numbers, and how to add and scale them.
Matrix multiplication and the dot product, and which shapes are compatible (the inner dimensions must match).
Vector norms (L1 sums absolute values, L2 is Euclidean length) and the dot product as a measure of angle and projection.
Rank as the number of linearly independent rows or columns of a matrix, with a light acquaintance with eigenvalues and the SVD.
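
As a rough illustration of the shape rule, the dot product, and the two norms listed above, the sketch below multiplies a 2x3 matrix by a 3x2 matrix in plain Python. The helper names (dot, matmul, l1_norm, l2_norm) and the example matrices are chosen here for illustration only; in practice a library such as NumPy would normally do this work.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(a * b for a, b in zip(u, v))

def matmul(A, B):
    """Multiply an (m x n) matrix by an (n x p) matrix, giving (m x p)."""
    if len(A[0]) != len(B):
        raise ValueError("inner dimensions must match")
    columns_of_B = list(zip(*B))
    # Entry (i, j) is the dot product of row i of A with column j of B.
    return [[dot(row, col) for col in columns_of_B] for row in A]

def l1_norm(v):
    """Sum of absolute values."""
    return sum(abs(x) for x in v)

def l2_norm(v):
    """Euclidean length."""
    return math.sqrt(dot(v, v))

A = [[1, 2, 3],
     [4, 5, 6]]            # 2 x 3
B = [[1, 0],
     [0, 1],
     [1, 1]]               # 3 x 2
C = matmul(A, B)           # [[4, 5], [10, 11]]
shape = (len(C), len(C[0]))  # outer dimensions: (2, 2)

# Perpendicular vectors have a zero dot product.
perpendicular = dot([1, 0], [0, 1])  # 0
```

Calling matmul(A, A) raises a ValueError, because the inner dimensions (3 and 2) do not match.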

Probability and statistics

Random variables and distributions: how outcomes are spread over possible values, discrete or continuous.
Expectation (the long-run average), variance, and standard deviation (how far values spread from the mean).
Conditional probability (the chance of one event given another), independence, and Bayes' rule.
Likelihood: how probable observed data is under a model, and why logarithms make it convenient to work with.
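
The following sketch makes expectation, variance, and log-likelihood concrete for a fair six-sided die and a short sequence of coin flips. The function names and the flip data are made up for illustration. It also hints at why logarithms are convenient: the likelihood of independent observations is a product of many small numbers, and taking logs turns that product into a sum that is easier to compute and compare.

```python
import math

def expectation(values, probs):
    """Probability-weighted average of the values."""
    return sum(v * p for v, p in zip(values, probs))

def variance(values, probs):
    """Expected squared deviation from the mean."""
    mu = expectation(values, probs)
    return sum(p * (v - mu) ** 2 for v, p in zip(values, probs))

faces = [1, 2, 3, 4, 5, 6]
probs = [1 / 6] * 6
mean = expectation(faces, probs)   # about 3.5
var = variance(faces, probs)       # about 35/12
std = math.sqrt(var)               # standard deviation

def bernoulli_log_likelihood(flips, p):
    """Log-likelihood of 0/1 outcomes under a coin with P(1) = p."""
    return sum(math.log(p if x == 1 else 1 - p) for x in flips)

flips = [1, 1, 0, 1]               # hypothetical data: three heads, one tail
ll_fair = bernoulli_log_likelihood(flips, 0.5)
ll_biased = bernoulli_log_likelihood(flips, 0.75)
```

For this data the coin with p = 0.75 has the higher log-likelihood, which matches the observed fraction of heads.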

Multivariable functions

Functions of several variables and the surfaces they trace out.
Partial derivatives (the slope in one variable while the others are held fixed) and the gradient (the vector of all of them).
The chain rule for differentiating a function of a function (a composition).
A light feel for the Jacobian (first derivatives of several outputs) and the Hessian (second derivatives).
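
One way to build intuition for partial derivatives and the gradient is to compare a hand-derived gradient with a finite-difference estimate. The sketch below does this for f(x, y) = x^2 y. The helper names and the step size h are illustrative choices, not a recommended numerical method.

```python
def f(x, y):
    return x ** 2 * y

def analytic_gradient(x, y):
    """Partial derivatives derived by hand: (df/dx, df/dy) = (2xy, x^2)."""
    return (2 * x * y, x ** 2)

def numerical_gradient(func, x, y, h=1e-6):
    """Central differences: nudge one variable while the other stays fixed."""
    df_dx = (func(x + h, y) - func(x - h, y)) / (2 * h)
    df_dy = (func(x, y + h) - func(x, y - h)) / (2 * h)
    return (df_dx, df_dy)

exact = analytic_gradient(3.0, 2.0)        # (12.0, 9.0)
approx = numerical_gradient(f, 3.0, 2.0)   # close to (12.0, 9.0)
```

The two results should agree to several decimal places; a large gap usually signals an error in the hand derivation.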

Gradients and optimization

The gradient points in the direction in which a function increases fastest.
Gradient descent: repeatedly step in the opposite (downhill) direction to approach a minimum, which may be only a local one.
Step-size (learning-rate) intuition: too large a step overshoots and can diverge; too small a step converges very slowly.
Local versus global minima, and why convex functions are easy to minimize (every local minimum is global).
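
The learning-rate intuition above can be seen on the simplest convex function, f(x) = x^2, whose derivative is 2x. This is a minimal sketch; the starting point, step counts, and the three learning rates are arbitrary values picked for illustration.

```python
def grad(x):
    """Derivative of f(x) = x**2."""
    return 2 * x

def gradient_descent(x0, learning_rate, steps):
    """Repeatedly step against the gradient, scaled by the learning rate."""
    x = x0
    for _ in range(steps):
        x = x - learning_rate * grad(x)
    return x

x_small = gradient_descent(5.0, 0.001, 50)  # barely moves from 5.0
x_good = gradient_descent(5.0, 0.1, 50)     # ends very close to the minimum at 0
x_large = gradient_descent(5.0, 1.1, 50)    # each step overshoots; |x| grows
```

Each update multiplies x by (1 - 2 * learning_rate), so for this function any learning rate above 1 makes the iterates grow in magnitude instead of shrinking.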

Readiness check

Multiply two matrices by hand and state the result shape.
Compute a gradient of a simple multivariable function.
Explain what expectation and variance measure.
Describe how gradient descent uses the gradient.

❓Self-check questions

Multiple-choice questions on the material reviewed above. Pick an answer, then reveal it. If several are unclear, work through the review first.

1. If A is 3x4 and B is 4x2, the product AB has shape:

4x4
3x2
2x3
undefined

Show answer

Correct: B. The inner dimensions (4) match, and the result takes the outer dimensions: 3x2.

2. Matrix multiplication is:

commutative (AB = BA)
generally not commutative
only defined for square matrices
the same as the elementwise product

Show answer

Correct: B. In general AB does not equal BA; order matters, and the elementwise (Hadamard) product is a different operation.

3. The transpose of a product, (AB)^T, equals:

A-transpose times B-transpose
B-transpose times A-transpose
AB
BA

Show answer

Correct: B. Transposing a product reverses the order: (AB)^T = B^T A^T.
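
The identity can be checked on a small example. The sketch below uses plain Python lists with illustrative helper names and made-up matrices; it is a spot check on one pair of matrices, not a proof.

```python
def transpose(M):
    """Swap rows and columns."""
    return [list(row) for row in zip(*M)]

def matmul(A, B):
    """Multiply an (m x n) matrix by an (n x p) matrix."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

A = [[1, 2],
     [3, 4],
     [5, 6]]          # 3 x 2
B = [[1, 0, 2],
     [0, 1, 3]]       # 2 x 3

lhs = transpose(matmul(A, B))             # (AB)^T, a 3 x 3 matrix
rhs = matmul(transpose(B), transpose(A))  # B^T A^T, also 3 x 3
same = lhs == rhs                         # True

# Without the reversal, the shapes do not even agree:
wrong = matmul(transpose(A), transpose(B))  # A^T B^T is 2 x 2
```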

4. The dot product of two non-zero vectors is zero when they are:

parallel
of unit length
orthogonal (perpendicular)
equal

Show answer

Correct: C. u . v = |u||v|cos(theta); it is zero when the angle is 90 degrees.

5. The L2 (Euclidean) norm of the vector (3, 4) is:

Show answer

Correct: 5, since sqrt(3^2 + 4^2) = sqrt(25) = 5.

6. For a fair six-sided die, the expected value of one roll is:

Show answer

Correct: 3.5, since (1+2+3+4+5+6)/6 = 21/6 = 3.5.

7. Variance measures:

the average value
the most frequent value
the spread around the mean
the maximum value

Show answer

Correct: C. Variance is the expected squared deviation from the mean.

8. Two events A and B are independent when:

P(A and B) = P(A) + P(B)
P(A and B) = P(A) P(B)
they are mutually exclusive
P(A given B) = 0

Show answer

Correct: B. Independence means the joint probability factorizes into the product of the marginals.

9. Bayes' rule writes P(A given B) in terms of:

P(B given A), P(A), P(B)
P(A) + P(B)
P(A) P(B) only
P(A - B)

Show answer

Correct: A. P(A|B) = P(B|A) P(A) / P(B).
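
A small numeric example may help here. The sketch below applies Bayes' rule to a diagnostic-test scenario; all three input probabilities are hypothetical numbers invented for illustration, and P(B) is expanded with the law of total probability.

```python
def bayes_posterior(prior_a, p_b_given_a, p_b_given_not_a):
    """Return P(A|B) = P(B|A) P(A) / P(B)."""
    # P(B) = P(B|A) P(A) + P(B|not A) P(not A)
    p_b = p_b_given_a * prior_a + p_b_given_not_a * (1 - prior_a)
    return p_b_given_a * prior_a / p_b

# Hypothetical inputs: A = "has the condition", B = "test is positive".
posterior = bayes_posterior(
    prior_a=0.01,           # assumed P(A)
    p_b_given_a=0.95,       # assumed P(B|A)
    p_b_given_not_a=0.05,   # assumed P(B|not A)
)
# posterior is roughly 0.16
```

Even with a fairly accurate test, the small prior keeps P(A|B) low, which is why the prior P(A) cannot be dropped from the formula.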

10. The partial derivative of f(x, y) = x^2 y with respect to x treats:

both x and y as variables
y as a constant
x as a constant
f as constant

Show answer

Correct: B. A partial derivative w.r.t. x holds the other variables (y) constant, giving 2xy.

11. The gradient of a scalar function points in the direction of:

steepest descent
zero change
steepest ascent
the x-axis

Show answer

Correct: C. The gradient points toward the greatest rate of increase; its negative is the steepest-descent direction.

12. The chain rule gives the derivative of f(g(x)) as:

f'(x) g'(x)
f'(g(x)) times g'(x)
f'(g(x))
g'(f(x))

Show answer

Correct: B. Differentiate the outer function at the inner value, times the derivative of the inner function.
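
To see the rule at work, the sketch below differentiates h(x) = sin(x^2) with the chain rule and compares the result against a finite-difference estimate. The choice of f, g, the evaluation point, and the step size eps are all illustrative assumptions.

```python
import math

def g(x):
    """Inner function."""
    return x ** 2

def f(u):
    """Outer function."""
    return math.sin(u)

def h(x):
    """Composition f(g(x)) = sin(x^2)."""
    return f(g(x))

def chain_rule_derivative(x):
    # f'(g(x)) * g'(x) = cos(x^2) * 2x
    return math.cos(g(x)) * 2 * x

def numerical_derivative(func, x, eps=1e-6):
    """Central-difference estimate of the derivative."""
    return (func(x + eps) - func(x - eps)) / (2 * eps)

x = 1.3
by_chain_rule = chain_rule_derivative(x)
by_differences = numerical_derivative(h, x)

# Option A from the question, f'(x) g'(x), gives a different value:
option_a = math.cos(x) * 2 * x
```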

13. To minimize a function, gradient descent moves a parameter:

along the gradient
opposite the gradient
perpendicular to the gradient
randomly

Show answer

Correct: B. It steps in the negative-gradient direction, scaled by the learning rate.

14. If the learning rate is far too large, gradient descent typically:

converges faster with no downside
diverges or oscillates
stops immediately
ignores the gradient

Show answer

Correct: B. Overshooting the minimum makes the loss oscillate or blow up.

15. Which property holds for every convex function?

it has many local minima
any local minimum is also global
its gradient is always zero
it is always increasing

Show answer

Correct: B. For convex functions every local minimum is global, which makes optimization reliable.

📚Refresher resources

3Blue1Brown: Essence of Linear Algebra (3blue1brown.com)
3Blue1Brown: Essence of Calculus (3blue1brown.com)
Khan Academy: Multivariable Calculus (khanacademy.org)
Mathematics for Machine Learning, free book (mml-book.github.io)
