
PHASE 1: FOUNDATIONAL NEURONS & BACKPROP

import sys

sys.version
'3.14.0 free-threading build (main, Oct 28 2025, 12:10:48) [Clang 20.1.4 ]'

Environment preparation

# bash
venv_name="slm_from_scratch"
venv_path="${HOME}/venv/${venv_name}"

create_jupyter_venv -p 3.14t -n "${venv_name}"

uv pip install -p "${venv_path}" \
    matplotlib \
    numpy \
    seaborn

# remove_jupyter_venv "${venv_name}"

The Core Idea: A neural network is just a mathematical function that can be represented as a computational graph. The “learning” happens by adjusting the parameters of this function to minimize some error.

1. Forward Pass

We begin with the absolute building block of all deep learning: the single neuron.

1.1 The Mathematical Neuron

A neuron computes:

$$z = \mathbf{w}^\top \mathbf{x} + b$$
$$a = \sigma(z)$$

Where:

  • $\mathbf{x} \in \mathbb{R}^d$ is the input vector,

  • $\mathbf{w} \in \mathbb{R}^d$ is the weight vector,

  • $b \in \mathbb{R}$ is the bias,

  • $\sigma$ is a nonlinear activation function (e.g., ReLU, tanh),

  • $a$ is the output (activation).

This is not a biological metaphor—it is a differentiable function that enables composition and gradient flow.

Think about processing one input vector vs many inputs:

Single input: $[x_1, x_2, x_3]$ with weights $[\theta_1, \theta_2, \theta_3]$
$$\text{output} = x_1\theta_1 + x_2\theta_2 + x_3\theta_3 + b$$

Multiple inputs (as matrix):

Inputs: [ [x₁₁, x₁₂, x₁₃],   Weights: [𝜃₁, 𝜃₂, 𝜃₃]ᵀ
          [x₂₁, x₂₂, x₂₃],
          [x₃₁, x₃₂, x₃₃] ]

What linear algebra operation would efficiently compute all outputs at once? Matrix multiplication.
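For example, a single matrix product evaluates every sample's weighted sum at once (a minimal sketch with made-up numbers, not taken from the course materials):

import numpy as np

# A batch of 3 samples, 3 features each (one sample per row), with arbitrary values
X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0],
              [7.0, 8.0, 9.0]])
theta = np.array([0.1, -0.2, 0.3])   # one weight per feature
b = 0.5                              # scalar bias, broadcast over the batch

outputs = X @ theta + b              # one matmul = all three dot products at once
print(outputs)                       # shape (3,), one output per sample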

1.3 Activation Function

Now, consider this:
Why do we require $\sigma$ in (2) to be nonlinear? What happens to the representational capacity of a multi-layer network if $\sigma$ is linear?

Correct. If all activations are linear, the composition of layers collapses to a single linear transformation:

$$f(\mathbf{x}) = \mathbf{W}_L \cdots \mathbf{W}_2 \mathbf{W}_1 \mathbf{x} + \mathbf{b}_{\text{total}} = \mathbf{W}_{\text{eff}} \mathbf{x} + \mathbf{b}_{\text{eff}}$$

This cannot model nonlinear decision boundaries—hence the necessity of a nonlinear $\sigma$.
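A quick numeric check of this collapse (a sketch with arbitrary random matrices and no activation between the layers):

import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)   # layer 1: 3 -> 4
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)   # layer 2: 4 -> 2
x = rng.normal(size=3)

two_layers = W2 @ (W1 @ x + b1) + b2        # composition of two purely linear layers
W_eff = W2 @ W1                             # collapsed weight matrix
b_eff = W2 @ b1 + b2                        # collapsed bias
print(np.allclose(two_layers, W_eff @ x + b_eff))   # True: same function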

Activation Function Options

| Function   | Formula           | Range   | Key Properties                                          |
|------------|-------------------|---------|---------------------------------------------------------|
| Sigmoid    | 1/(1+e⁻ˣ)         | (0, 1)  | Smooth, bounded, but can saturate (vanishing gradients) |
| Tanh       | (eˣ-e⁻ˣ)/(eˣ+e⁻ˣ) | (-1, 1) | Zero-centered, but can still saturate                   |
| ReLU       | max(0, x)         | [0, ∞)  | Simple, avoids saturation, but “dying ReLU” problem     |
| Leaky ReLU | max(0.01x, x)     | (-∞, ∞) | Fixes dying ReLU, small gradient for negatives          |
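For reference, the four activations in the table can be written as small NumPy functions (a sketch; the Leaky ReLU slope of 0.01 is the value used in the table):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    return np.maximum(alpha * x, x)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print("sigmoid   ", sigmoid(z))
print("tanh      ", np.tanh(z))
print("relu      ", relu(z))
print("leaky_relu", leaky_relu(z))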

Historical Context & Modern Practice

  • 1980s-2000s: Sigmoid/tanh were dominant (biological plausibility)

  • 2010s: ReLU became standard for hidden layers (training speed)

  • Today: Variants like Leaky ReLU, GELU are common

tanh()

The hyperbolic tangent function is defined as:

$$\tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$$

This is the complete closed-form formula. It maps $z \in \mathbb{R}$ to $a \in (-1, 1)$.

For computation by hand, you can evaluate it numerically using known values or a calculator.

This is what the function looks like:

import numpy as np
import matplotlib.pyplot as plt

def plot_tanh():
    # Generate x values from -10 to 10
    x = np.linspace(-10, 10, 400)
    
    # Compute tanh for each x value
    y = np.tanh(x)
    
    # Create the plot
    plt.figure(figsize=(8, 6))
    plt.plot(x, y, label='tanh(x)')
    
    # Add title and labels
    plt.title('Hyperbolic Tangent Function')
    plt.xlabel('x')
    plt.ylabel('tanh(x)')
    
    # Add a legend
    plt.legend()
    
    # Show the plot
    plt.grid(True)
    plt.axhline(0, color='black',linewidth=0.5)
    plt.axvline(0, color='black',linewidth=0.5)
    plt.show()
    
# Call the function to display the plot
plot_tanh()

1.4 What the Single Neuron really is

You already have prior experience (CNNs, NumPy backprop), so we are not teaching you deep learning from absolute zero. Instead, we are recalibrating your foundation at the level required for an AI Architect, where every operation must be understood in three layers simultaneously:

  1. Mathematical identity (e.g., chain rule, matrix derivatives),

  2. Computational implementation (e.g., NumPy/PyTorch code),

  3. Hardware implication (e.g., how this maps to GEMM in VRAM).

Starting with a single neuron—not a full matmul—is intentional. Why?

Because matmul is just a batched collection of dot products, and a dot product is just a sum of scaled inputs. If you cannot derive the gradient of one scalar output with respect to one weight, you will misapply vectorized gradients later—even if your code “runs.”

This is a mastery gate: prove you can do the atomic unit correctly, and we immediately scale to matmul.

A linear layer computes:

$$\mathbf{Z} = \mathbf{X} \mathbf{W}^\top + \mathbf{b}$$

Each row of $\mathbf{Z}$ is

$$\mathbf{z}^{(i)} = \mathbf{x}^{(i)} \mathbf{W}^\top + \mathbf{b}$$

and each element

$$z^{(i)}_j = \sum_{k=1}^K x^{(i)}_k w_{j,k} + b_j$$

or, in Einstein summation form,

$$z^{(i)}_j = x^{(i)}_k w_{j,k} + b_j$$

  • exactly a single neuron.

Thus, the gradient of a full layer is just the aggregate of single-neuron gradients across batch and output dimensions.
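A short NumPy check of this claim (a sketch with arbitrary shapes and random values, separate from the exercises): the vectorized layer and an explicit double loop over single-neuron dot products produce the same $\mathbf{Z}$.

import numpy as np

rng = np.random.default_rng(1)
B, K, J = 4, 3, 2                     # batch size, input features, output neurons
X = rng.normal(size=(B, K))
W = rng.normal(size=(J, K))           # one row of W per output neuron
b = rng.normal(size=J)

Z = X @ W.T + b                       # vectorized linear layer, shape (B, J)

Z_loops = np.empty((B, J))
for i in range(B):                    # each (i, j) entry is one scalar neuron:
    for j in range(J):                # z_ij = sum_k x_ik * w_jk + b_j
        Z_loops[i, j] = np.dot(X[i], W[j]) + b[j]

print(np.allclose(Z, Z_loops))        # True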

2. Backward Pass — Scalar Case

2.1 No micrograd in the course

Regarding your question about the Value object:

You are likely referring to micrograd-style implementations (e.g., Andrej Karpathy’s micrograd), where a Value class tracks:

  • A scalar data value,

  • Its computational graph (parents),

  • And implements backward() for scalar autodiff.

We will not use Value.

Reason: Our goal is not to build a toy autodiff engine, but to understand how backpropagation maps to efficient, vectorized operations on real hardware (NumPy → PyTorch → CUDA). The Value abstraction obscures memory layout, batched computation, and the link to matrix calculus—precisely what an AI Architect must master.

Instead, we will:

  1. Derive gradients analytically using matrix calculus,

  2. Implement them explicitly in NumPy (no autograd),

  3. Then transition to PyTorch with manual gradient checks,

  4. Finally, analyze how these map to CUDA kernels (e.g., GEMM for linear layers).

This path ensures you understand what PyTorch’s .backward() actually does under the hood—something Value hides.

2.2 Backpropagation

Idea

Backpropagation is the algorithm for efficiently computing the gradients through the entire computational graph.

Think of it this way: If your neuron’s computation is:

input → linear → activation → output

Backpropagation answers:

How much did each weight contribute to the final error?

The entire forward pass is this:

$$x \xrightarrow{\theta,\ b} z = \theta x + b \xrightarrow{\tanh} a = \tanh(z) \xrightarrow{\mathcal{L}} L = \mathcal{L}(a, y_{\text{true}})$$

Let’s break this down into manageable steps. If you were to implement gradient calculation for your single neuron, what would be the minimal information you’d need?

Consider:

  • The error at the output

  • How the activation function affects the gradient

  • How the linear combination affects the gradient

The Error Signal Flow

The error signal begins with the difference between predicted and true values.

You’re correct: error = true_y - predicted_y (or some function of this difference)

Now here’s the key insight for backpropagation: We need to trace how this error flows backward through each operation.

Think about your neuron’s forward pass:

x → linear (w·x + b) → tanh → y_pred

For backpropagation, we go backward:

error ← d_tanh ← d_linear ← gradients for w and b

If you were to implement this gradient calculation manually for a single input, what operations would you need to compute?

Consider:

  1. Gradient of the error with respect to the output

  2. How the activation function transforms this gradient

  3. How the linear operation distributes gradients to weights and bias

What’s your thinking about the mathematical operations involved?

Chain Rule Intuition

Think about your neuron’s computation:

output = leaky_relu(w·x + b)
error = some_cost_function()

If we want to know how much to change $\omega_1$, we need to answer: “How does changing $\omega_1$ affect the final error?”

This is where the chain rule (the derivative of a composite function) from calculus comes in. We break the problem into steps:

  1. How does error change with output?

  2. How does output change with activation input?

  3. How does activation input change with $\omega_1$?

We use the chain rule to compute gradients through the computational graph.

Think about your neuron:

x → z = 𝜃·x + b → a = tanh(z) → J = loss_function(a, y_true)

where a is y_pred.

To find $\displaystyle \frac{\partial}{\partial \theta}J(\theta)$, we can compute:

$$\frac{\partial}{\partial \theta}J(\theta) = \frac{\partial J(\theta)}{\partial a} \times \frac{\partial a}{\partial z} \times \frac{\partial z}{\partial \theta}$$

Your implementation challenge: If you were to compute these partial derivatives numerically for a single example, what would be your step-by-step approach?

2.3 Why Tanh?

Now we can explain why we use tanh function in this step, not ReLU.

To understand the trade-offs, we must look at the forward pass and the derivative (gradient) for each function.

| Activation | Forward ($f(z)$) | Derivative ($f'(z)$) |
|---|---|---|
| tanh | $\tanh(z) = \dfrac{e^z - e^{-z}}{e^z + e^{-z}}$ | $1 - \tanh^2(z)$ |
| Leaky ReLU | $\begin{cases} z & \text{if } z \geq 0 \\ \alpha z & \text{if } z < 0 \end{cases}$ | $\begin{cases} 1 & z \geq 0 \\ \alpha & z < 0 \end{cases}$ |

We are currently in Phase 1: Foundational Neurons & Backprop. The priority is mathematical clarity and gradient validation, not building a production-ready LLM yet.

  • Smoothness & Differentiability: Tanh is “smooth” everywhere. Leaky ReLU has a “kink” at $z = 0$ where the derivative is discontinuous. In scalar manual backprop, these kinks can cause numerical instability and confusing results during gradient checks.

  • Bounded Output: Tanh keeps outputs in $(-1, 1)$. This makes gradient magnitudes predictable and prevents values from “exploding” while you are still debugging your weight initializations.

  • Historical Validation: Most foundational backprop literature uses tanh. Using it here allows you to replicate classic experiments and ensure your chain rule implementation is 100% correct.

Why not Leaky ReLU yet?

Leaky ReLU’s main advantage—avoiding “dead neurons”—is only truly relevant in deep neural networks. In a single scalar neuron, it adds an extra hyperparameter ($\alpha$) with almost no benefit. Furthermore, modern Transformers (like GPT) have largely moved past Leaky ReLU in favor of GELU, which we will implement in Phase 2.

“Vanishing” vs. “Dead” Gradients

It is important to distinguish between these two phenomena:

  1. Vanishing Gradients (The Tanh Problem): This happens when $|z|$ is very large. The function becomes very flat, so the gradient becomes tiny (e.g., 0.00001). Training slows down, but the neuron is still “alive.”

  2. Dead Gradients (The ReLU Problem): In standard ReLU, if $z < 0$, the gradient is exactly zero. The neuron stops learning entirely because no signal passes back through it.

Leaky ReLU solves “Dead Gradients”: By using $\alpha = 0.01$, it ensures the gradient is never zero, even for negative inputs.

The Impact on Your Implementation:

We need non-zero, smooth gradients to validate your manual backprop code. If you used standard ReLU, any test input where $z \leq 0$ would result in a gradient of exactly 0.

A Finite-Difference Gradient

It’s a numerical method to approximate the derivative of a loss function with respect to a parameter—using only function evaluations, no calculus required.

Formula (forward difference):

$$\frac{\partial L}{\partial \theta} \approx \frac{L(\theta + \epsilon) - L(\theta)}{\epsilon}$$

where $\epsilon$ is a tiny number (e.g., $10^{-5}$).

  • It’s a ground-truth check for your analytic (manual) gradient

  • If your analytic gradient is correct, it should match the finite-difference approximation within $\sim 10^{-7}$

But what is $L(\theta)$?

It’s not a simple algebraic function of $\theta$ alone. It’s the output of a full forward pass through a computational graph.

$L(\theta)$ is defined as:

$$L(\theta) = \mathcal{L}\big( f(\theta; x, b),\ y_{\text{true}} \big)$$

In our scalar neuron example with $\tanh$ activation and MSE loss:

$$L(\theta) = \underbrace{\frac{1}{2} \left( \tanh(\theta \cdot x + b) - y_{\text{true}} \right)^2}_{\text{Full computational graph}}$$

We denote:

  • $L \equiv L(\theta)$ = loss with original $\theta$

  • $L_+ \equiv L(\theta + \epsilon)$ = loss with perturbed $\theta$

Concrete scalar neuron example:

Given fixed values:

  • $x = 2.0$

  • $b = 0.1$

  • $y_{\text{true}} = 1.0$

  • Original $\theta = 0.5$

  • $\epsilon = 10^{-5}$

x = 2.0
b = 0.1
y_true = 1.0
theta = 0.5
epsilon = 10**(-5)
  1. Forward pass:

    • $z = 0.5 \cdot 2.0 + 0.1 = 1.1$

    • $a = \tanh(1.1) \approx 0.80049902$

    • $L = \frac{1}{2}(a - y_{\text{true}})^2 \approx 0.01990032$

    $L = \frac{1}{2}(\tanh(1.1) - 1.0)^2 \approx \frac{1}{2}(0.800499 - 1.0)^2 \approx \frac{1}{2}(0.039801) \approx 0.019900$
import numpy as np

def compute_l(x, theta, b, y_true):
    z = x*theta + b
    a = np.tanh(z)
    L = (a - y_true)**2 / 2

    return z, a, L

z, a, L = compute_l(x, theta, b, y_true)
print(L)
  2. $L_+ = L(\theta + \epsilon)$:

    • $\theta_{\text{new}} = \theta + \epsilon = 0.5 + 0.00001 = 0.50001$

    • $z_+ = \theta_{\text{new}} \cdot x + b = (0.50001)(2.0) + 0.1 = 1.00002 + 0.1 = 1.10002$

    • $a_+ = \tanh(z_+) = \tanh(1.10002) \approx 0.80050621$

    • $L_+ = \frac{1}{2}(a_+ - y_{\text{true}})^2 = \frac{1}{2}(0.800506 - 1.0)^2 = \frac{1}{2}(-0.199494)^2 \approx \frac{1}{2}(0.039798) \approx 0.01989889$

def compute_l_plus(x, theta, b, epsilon, y_true):
    theta_new = theta + epsilon
    z_plus = x*theta_new + b
    a_plus = np.tanh(z_plus)
    L_plus = (a_plus - y_true)**2 / 2

    return theta_new, z_plus, a_plus, L_plus

theta_new, z_plus, a_plus, L_plus = compute_l_plus(x, theta, b, epsilon, y_true)
print(L_plus)
  3. Finite-difference gradient w.r.t. $\theta$:

    $\frac{\partial L}{\partial \theta} \approx \frac{L_+ - L}{\epsilon} = \frac{0.01989889 - 0.01990032}{10^{-5}} = \frac{-0.00000143}{0.00001} = \mathbf{-0.143}$
fin_diff_grad = (L_plus - L) / epsilon
print(fin_diff_grad)

This -0.143 is the numerical approximation of the gradient.

Compare it to the analytic gradient from backpropagation:

  • $\frac{\partial L}{\partial a} = a - y_{\text{true}} = 0.800499 - 1.0 = -0.199501$

  • $\frac{\partial a}{\partial z} = 1 - \tanh^2(z) = 1 - (0.800499)^2 \approx 1 - 0.6408 = 0.3592$

  • $\frac{\partial z}{\partial \theta} = x = 2.0$

  • Analytic gradient:

    $(-0.199501) \cdot (0.3592) \cdot (2.0) \approx -0.1433$

Key takeaway: The finite-difference method gives a ground-truth reference to validate your manual or automatic differentiation.

def compute_gradient(x, z, a, y_true):
    # Chain rule: dL/dtheta = dL/da * da/dz * dz/dtheta
    dz_dtheta = x
    da_dz = 1 - np.tanh(z)**2
    dL_da = a - y_true
    dL_dtheta = dL_da * da_dz * dz_dtheta

    return dz_dtheta, da_dz, dL_da, dL_dtheta

dz_dtheta, da_dz, dL_da, dL_dtheta = compute_gradient(x, z, a, y_true)

print('dL_da:', dL_da)
print('da_dz:', da_dz)
print('dz_dtheta:', dz_dtheta)
print("Gradient:", dL_dtheta)
print(fin_diff_grad - dL_dtheta)

Back to Your ReLU Scenario

Now consider a ReLU neuron: $a = \max(0, z)$, with $z = \theta x + b$.

Suppose for a given input, you get $z = -2$ (i.e., in the flat region).

  • Analytic gradient (subgradient of ReLU):
    $\frac{\partial a}{\partial z} = 0$ → so $\frac{\partial L}{\partial \theta} = 0$

  • Finite-difference gradient:
    Perturbing $\theta$ slightly may push $z$ closer to zero.
    If $z + \delta z > 0$, then $a$ changes → loss changes → non-zero numerical gradient (e.g., $\sim -10^{-3}$)

→ This mismatch seems alarming…
→ But it’s expected at non-differentiable points!

However, in practice:

  • If your analytic code correctly returns 0 for $z < 0$, it’s right.

  • But your gradient checker will fail near $z = 0$ because the function isn’t smooth there.

And in a scalar neuron with ReLU:

  • For any input where $z < 0$, gradient = 0 → no weight update

  • Gradient checks fail randomly depending on input sign

That’s why we avoid ReLU in simple gradient-checking demos—use smooth activations like $\tanh$ or sigmoid instead.
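The mismatch described above is easy to reproduce (a sketch with illustrative values, separate from the exercises): deep in the flat region both the analytic subgradient and the numerical estimate are zero, but when $z$ sits just below the kink the perturbation pushes it past $z = 0$ and the numerical gradient becomes non-zero while the analytic one stays 0.

import numpy as np

def relu_loss(theta, x=1.0, b=0.0, y_true=0.5):
    z = theta * x + b
    a = np.maximum(0.0, z)                 # standard ReLU neuron
    return 0.5 * (a - y_true) ** 2

eps = 1e-5

theta = -2.0                               # z = -2: deep in the flat region
num = (relu_loss(theta + eps) - relu_loss(theta)) / eps
print("z = -2   analytic: 0.0  numerical:", num)      # both ~0, check passes

theta = -0.5e-5                            # z barely negative: perturbation crosses the kink
num = (relu_loss(theta + eps) - relu_loss(theta)) / eps
print("z ~ -0   analytic: 0.0  numerical:", num)      # non-zero, check "fails"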

Final Check: Your Turn

Given:

  • $x = 1.0$

  • $\theta = 0.0$

  • $b = 0.0$

  • $y_{\text{true}} = 0.5$

  • $\epsilon = 10^{-5}$

  • Loss = $\frac{1}{2}(a - y)^2$

Compute $L_+ = L(\theta + \epsilon)$ step by step.

Show $\theta_{\text{new}}$, $z_+$, $a_+$, $L_+$:

x = 1.0
theta = 0.0
b = 0.0
epsilon = 10**(-5)
y_true = 0.5

theta_new, z_plus, a_plus, L_plus = compute_l_plus(x, theta, b, epsilon, y_true)
print(theta_new, z_plus, a_plus, L_plus)

Your code is correct, and your output reveals a critical insight—let’s interpret it step by step.

Output Breakdown

  1. $\theta_{\text{new}} = 0 + 10^{-5} = 10^{-5}$

  2. $z_+ = 1.0 \cdot 10^{-5} + 0 = 10^{-5}$

  3. $a_+ = \tanh(10^{-5}) \approx 9.999999999666667 \times 10^{-6}$
    (since $\tanh(u) = u - u^3/3 + \cdots$, so slightly less than $u$)

  4. $L_+ = \frac{1}{2}(a_+ - 0.5)^2 = \frac{1}{2}(-0.49999000000000033)^2$

This matches your printed result: 0.12499500005000017.

Why This Matters: Gradient Behavior at $\theta = 0$

Now compute the original loss $L = L(\theta)$ (with $\theta = 0$):

  • $z = 0$,

  • $a = \tanh(0) = 0$

  • $L = \frac{1}{2}(0 - 0.5)^2 = 0.125$

z, a, L = compute_l(x, theta, b, y_true)
print(z, a, L)

Finite-difference gradient:

$$\frac{L_+ - L}{\epsilon} = \frac{0.12499500005 - 0.125}{10^{-5}} = \frac{-4.99995 \times 10^{-6}}{10^{-5}} \approx -0.499995$$
fin_diff_grad = (L_plus - L) / epsilon
print(fin_diff_grad)

Analytic gradient (via backprop):

$$\frac{\partial L}{\partial \theta} = \underbrace{(a - y_{\text{true}})}_{-0.5} \cdot \underbrace{(1 - \tanh^2(z))}_{1.0} \cdot \underbrace{x}_{1.0} = -0.5$$
dz_dtheta, da_dz, dL_da, dL_dtheta = compute_gradient(x, z, a, y_true)

print('dL_da:', dL_da)
print('da_dz:', da_dz)
print('dz_dtheta:', dz_dtheta)
print("Gradient:", dL_dtheta)
print(fin_diff_grad - dL_dtheta)

Numerical (-0.499995) ≈ Analytic (-0.5): validation passes.
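A common way to formalize this comparison (a sketch; the tolerance is a rule of thumb, not a value from the course materials) is the relative error between the two estimates:

def relative_error(numerical, analytic):
    # |num - ana| / max(|num|, |ana|); values around 1e-4 or below are
    # typically acceptable for a forward difference with eps = 1e-5
    denom = max(abs(numerical), abs(analytic), 1e-12)   # guard against 0/0
    return abs(numerical - analytic) / denom

print(relative_error(-0.499995, -0.5))   # ~1e-5: the check passes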

Key Takeaway

This example demonstrates:

  1. Finite differences work even when $\theta = 0$ (a common initialization point)

  2. tanh’s derivative is 1 at $z = 0$ → gradients are strong here (no vanishing!)

  3. Your implementation correctly isolates the effect of perturbing $\theta$

The Gradient Check Dilemma:

Suppose you used ReLU and your test input gave $z = -2$. Your analytic gradient would be 0. However, your finite-difference (numerical) gradient check might show a tiny non-zero change. Would you be able to tell if your backprop code was actually broken, or if the neuron was just “dead”?

By using tanh, we ensure that for almost any input, you get a meaningful gradient to verify your math.

That’s why we avoid ReLU here.

Why Compute Analytic Gradients When Finite Differences Exist?

Short answer: Finite differences are computationally infeasible for real models.

Computational Cost Comparison

Let $p$ = number of parameters.

| Method | Gradient Cost | Example: 100M-parameter LLM |
|---|---|---|
| Finite differences | $O(p)$ forward passes | 100,000,000 forward passes per gradient update |
| Analytic (backprop) | $O(1)$ forward + $O(1)$ backward | 1 forward + 1 backward pass |
  • Finite differences: For each parameter $\theta_i$, you must:

    1. Perturb $\theta_i$ by $\epsilon$

    2. Run full forward pass → get $L_+$

    3. Compute $(L_+ - L)/\epsilon$ → Total: $p$ forward passes per gradient estimate

Suppose your model has two parameters: $\theta_1$ and $\theta_2$.

You want the full gradient: $\left[ \frac{\partial L}{\partial \theta_1},\ \frac{\partial L}{\partial \theta_2} \right]$.

To estimate $\frac{\partial L}{\partial \theta_1}$:

  1. Start with original parameters: $(\theta_1, \theta_2)$

  2. Compute baseline loss: $L = L(\theta_1, \theta_2)$ → 1 forward pass

  3. Perturb only $\theta_1$: $(\theta_1 + \epsilon, \theta_2)$

  4. Compute $L_+^{(1)} = L(\theta_1 + \epsilon, \theta_2)$ → 1 more forward pass

  5. Approximate:

    $$\frac{\partial L}{\partial \theta_1} \approx \frac{L_+^{(1)} - L}{\epsilon}$$

To estimate $\frac{\partial L}{\partial \theta_2}$:

  1. Perturb only $\theta_2$: $(\theta_1, \theta_2 + \epsilon)$

  2. Compute $L_+^{(2)} = L(\theta_1, \theta_2 + \epsilon)$ → another forward pass

  3. Approximate:

    $$\frac{\partial L}{\partial \theta_2} \approx \frac{L_+^{(2)} - L}{\epsilon}$$

✅ Total: 1 baseline + 2 perturbed = 3 forward passes
But note: you can reuse the baseline $L$ for all parameters!

So for $p$ parameters:

  • 1 forward pass to compute baseline $L(\theta)$

  • $p$ forward passes to compute $L(\theta + \epsilon e_i)$ for each $i = 1, \dots, p$

Total: $p + 1 \approx O(p)$ forward passes

(We drop the “+1” in big-O notation because it’s negligible when $p$ is large.)
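The same bookkeeping, written as code (a sketch, separate from the course exercises): one baseline forward pass plus one perturbed forward pass per parameter, i.e. $p + 1$ loss evaluations in total.

import numpy as np

def finite_difference_gradient(loss_fn, theta, eps=1e-5):
    theta = np.asarray(theta, dtype=float)
    baseline = loss_fn(theta)                # 1 forward pass, reused for every parameter
    grad = np.zeros_like(theta)
    for i in range(theta.size):              # p more forward passes
        perturbed = theta.copy()
        perturbed[i] += eps                  # perturb only parameter i
        grad[i] = (loss_fn(perturbed) - baseline) / eps
    return grad

# Example with a known answer: L(theta) = sum(theta^2), so dL/dtheta = 2 * theta
theta = np.array([0.5, -1.0, 2.0])
print(finite_difference_gradient(lambda t: np.sum(t ** 2), theta))   # ~[1, -2, 4]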

Why Can’t We Do It in One Pass?

Because each perturbation changes a different parameter.

The loss function is:

$$L(\theta_1, \theta_2, \dots, \theta_p)$$

To see how changing $\theta_5$ affects the loss, you must run the model with $\theta_5$ altered and all others unchanged.

You cannot perturb all parameters at once and recover individual gradients—that would mix all effects together (like trying to hear one instrument in an orchestra by playing everyone at once).

So each partial derivative requires its own controlled experiment → its own forward pass.

Concrete Numbers

  • Your 100M-parameter LLM:

    • Forward pass time: ~0.5 sec (on 4090 Ti)

    • Finite-diff gradient time: $10^8 \times 0.5\ \text{sec} \approx 1.6\ \text{years}$

    • Backprop time: ~1 sec

Finite differences are only viable for debugging tiny models (e.g., scalar neuron).

Why Is Finite-Difference a Valid Gradient Check?

Core idea: The derivative is defined as the limit of finite differences.

Mathematical Definition

The true derivative is:

$$\frac{\partial L}{\partial \theta} = \lim_{\epsilon \to 0} \frac{L(\theta + \epsilon) - L(\theta)}{\epsilon}$$

  • Finite difference uses a small but finite $\epsilon$ (e.g., $10^{-5}$) to approximate this limit.

  • Error analysis shows the approximation error is $O(\epsilon)$ (for forward difference), as the sketch below illustrates.
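This behavior can be checked directly on the scalar neuron from the worked example above (a sketch reusing $x = 2.0$, $b = 0.1$, $\theta = 0.5$, $y_{\text{true}} = 1.0$): the gap between the numerical and analytic gradients shrinks roughly in proportion to $\epsilon$.

import numpy as np

x, b, theta, y_true = 2.0, 0.1, 0.5, 1.0

def loss(th):
    return 0.5 * (np.tanh(th * x + b) - y_true) ** 2

a = np.tanh(theta * x + b)
analytic = (a - y_true) * (1 - a ** 2) * x        # exact gradient from the chain rule

for eps in (1e-2, 1e-3, 1e-4, 1e-5):
    numerical = (loss(theta + eps) - loss(theta)) / eps
    print(f"eps={eps:.0e}  |error|={abs(numerical - analytic):.2e}")  # shrinks ~linearly with eps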

Why It’s Trustworthy for Validation

  1. No assumptions about your code

    • Finite difference uses only forward passes—no chain rule, no manual derivatives.

    • If your analytic gradient matches it, your entire backprop derivation is likely correct.

  2. Controlled error bounds

    • With $\epsilon = 10^{-5}$, typical error in the gradient estimate is $\sim 10^{-10}$ to $10^{-7}$.

    • If your analytic gradient differs by more than $10^{-6}$, you have a bug.

  3. Failure modes are obvious

    • Common errors caught:

      • Forgot a term in chain rule (e.g., missed $\partial a/\partial z$)

      • Sign error in loss derivative

      • Shape mismatch in vectorized code

Critical Caveat

  • Finite difference is not exact—but it’s exact enough for debugging.

  • It assumes your forward pass is correct. If forward pass is broken, both gradients will be wrong in the same way (false positive).
    → Always validate forward pass first (e.g., with known inputs/outputs).

Summary Table

| Aspect | Finite-Difference Gradient | Analytic Gradient (Backprop) |
|---|---|---|
| Purpose | Debug/validation | Training/inference |
| Computational cost | $O(p)$ forward passes | $O(1)$ forward + backward |
| Scalability | Only for $p < 10^3$ | Scales to $p = 10^{11}$ |
| Accuracy | Approximate (error $\sim \epsilon$) | Exact (up to floating point) |
| Dependency | Only on forward-pass correctness | On entire backprop derivation |

Check for Understanding

Suppose you have a vectorized neuron with $W \in \mathbb{R}^{10 \times 768}$ (7,680 parameters).

How many forward passes would finite differences require to check the full gradient?
(Answer with a number—and state whether this is practical.)

Answer:

  • 7,680 forward passes, and it is only for ONE example from the set, so this is not acceptable for training LLMs.

Correct.

7,680 forward passes per example is computationally prohibitive—even for a single training step on a small batch. For a 100M-parameter LLM, it becomes physically impossible.

This is precisely why backpropagation is non-negotiable in deep learning: it gives exact gradients (up to floating-point precision) in time proportional to one forward pass, not the number of parameters.

We have confirmed:

  • The purpose and limitation of finite-difference gradients

  • The necessity of analytic gradients for scalable learning

  • Your ability to implement and reason about scalar gradient validation

Exercises

Exercise 2.1: Scalar Neuron Backpropagation

So: mastering the scalar case is mastering the core of matmul-based backprop.

Consider a single neuron with:

  • Input $x = 2.0$,

  • Weight $w = -0.5$,

  • Bias $b = 1.0$,

  • Activation: $\sigma(z) = \tanh(z)$,

  • Loss: $L = \frac{1}{2}(a - y)^2$, with target $y = 0.8$.

Compute by hand:

  1. $z = wx + b$

  2. $a = \tanh(z)$

  3. $L = \frac{1}{2}(a - y)^2$

Compute all values numerically, step by step. Provide the results, then compute the gradients:

  • $\displaystyle \frac{\partial L}{\partial a}$, $\displaystyle \frac{\partial L}{\partial z}$, $\displaystyle \frac{\partial L}{\partial w}$, $\displaystyle \frac{\partial L}{\partial b}$

Show each step with numerical values. Do not use code—use math. This tests your chain rule mechanics.

What are the results?

Recall your values:

  • $x = 2.0$

  • $w = -0.5$

  • $b = 1.0$

So first compute:

$$z = w \cdot x + b = (-0.5)(2.0) + 1.0 = -1.0 + 1.0 = 0.0$$

Now, what is $a = \tanh(0)$?

Correct:

$$a = \tanh(0) = 0$$

  • $z = 0$

  • $a = 0$

  • $L = \frac{1}{2}(0 - 0.8)^2 = 0.32$

dL_da

If $L = \frac{1}{2}(a - y)^2$, then

the derivative of $L$ with respect to $a$ is:

$$\frac{\partial L}{\partial a} = (a - y)$$

This follows from the chain rule:

$$\frac{d}{da} \left[ \frac{1}{2}(a - y)^2 \right] = \frac{1}{2} \cdot 2(a - y) \cdot 1 = (a - y)$$

Given $a = 0$, $y = 0.8$:

$\displaystyle \frac{\partial L}{\partial a} = 0 - 0.8 = -0.8$

dL_dz

If $a = \tanh(z)$, then the derivative of $a$ w.r.t. $z$ is:

$$\frac{\partial a}{\partial z} = 1 - \tanh^2(z) = 1 - a^2$$

$$\frac{\partial L}{\partial z} = \frac{\partial L}{\partial a} \cdot \frac{\partial a}{\partial z} = (a - y)(1 - a^2)$$

$\displaystyle \frac{\partial L}{\partial z} = -0.8 \cdot 1 = -0.8$

dL_dw and dL_db

If $z = wx + b$, then the derivative of $z$ w.r.t. $w$ is:

$$\frac{\partial z}{\partial w} = x$$

and the derivative of $z$ w.r.t. $b$ is:

$$\frac{\partial z}{\partial b} = 1$$

Then we have:

$$\frac{\partial L}{\partial w} = \frac{\partial L}{\partial a} \cdot \frac{\partial a}{\partial z} \cdot \frac{\partial z}{\partial w} = (a - y)(1 - a^2) \cdot x$$

$$\frac{\partial L}{\partial b} = \frac{\partial L}{\partial a} \cdot \frac{\partial a}{\partial z} \cdot \frac{\partial z}{\partial b} = (a - y)(1 - a^2) \cdot 1$$

  • $\displaystyle \frac{\partial L}{\partial w} = -0.8 \cdot 1 \cdot 2 = -1.6$

  • $\displaystyle \frac{\partial L}{\partial b} = -0.8 \cdot 1 \cdot 1 = -0.8$

Summary of Results

| Quantity | Value |
|---|---|
| $z$ | 0.0 |
| $a$ | 0.0 |
| $L$ | 0.32 |
| $\partial L / \partial a$ | -0.8 |
| $\partial L / \partial z$ | -0.8 |
| $\partial L / \partial w$ | -1.6 |
| $\partial L / \partial b$ | -0.8 |

This is the complete forward and backward pass for one neuron.

Final verified gradients:

  • $\frac{\partial L}{\partial w} = -1.6$

  • $\frac{\partial L}{\partial b} = -0.8$

This completes the scalar neuron backpropagation exercise with full chain-rule derivation.

Now, we scale to vectorized operations.

Exercise 2.2: Implement the scalar neuron (Steps 1–2) in NumPy

You are absolutely right—and this is a critical pedagogical checkpoint.

Before advancing to layers, you must implement the scalar neuron (Steps 1–2) in NumPy, without autograd, to verify your understanding maps to executable code.

This is non-optional. Reasoning on paper is necessary but insufficient; code exposes hidden gaps in indexing, broadcasting, and memory layout.

Do not use PyTorch, JAX, or autodiff. Pure NumPy. Phase 1 follows a strict progression:

  1. Scalar math + hand computation → verified (done)

  2. Scalar NumPy implementation (procedural, one-time)required now

  3. Vectorized batched NumPy (procedural) → next

  4. Class-based modular design → only after vectorized gradients are fully understood

Why delay classes?
Because abstraction hides mechanics. If you wrap the neuron in a class before you’ve manually computed and coded its gradients in raw NumPy, you risk treating it as a black box—exactly what this course forbids.

So for now:
Write a procedural script (no class) that computes the forward and backward pass for the given scalar neuron.

This is a verification step, not a reusable module.

Once you provide this code and confirm its output matches:

  • L = 0.32

  • dL_dw = -1.6

  • dL_db = -0.8

— we will:

  • Extend to batched vectorized form (still procedural),

  • Then refactor into classes only when the underlying math and memory layout are unquestionably solid.

Paste your NumPy implementation.

Required Implementation (NumPy, No Autograd)

Write a self-contained script that:

  1. Defines scalars: x = 2.0, w = -0.5, b = 1.0, y = 0.8

  2. Computes forward pass:

    • z = w * x + b

    • a = np.tanh(z)

    • L = 0.5 * (a - y)**2

  3. Computes manual gradients (no .backward(), no grad):

    • dL_da = a - y

    • da_dz = 1 - a**2

    • dL_dz = dL_da * da_dz

    • dL_dw = dL_dz * x

    • dL_db = dL_dz * 1

  4. Prints all values

Code

import numpy as np

x = 2.0
w = -0.5
b = 1.0
y = 0.8

z = x*w + b
a = (np.exp(z) - np.exp(-z)) / (np.exp(z) + np.exp(-z))
L = 0.5 * (a -y)**2

dL_da = a - y
da_dz = 1 - a**2

dL_dz = dL_da * da_dz

dL_dw = dL_dz * x
dL_db = dL_dz * 1

print("x =", x)
print("w =", w)
print("b =", b)
print("y =", y)

print("z =", z)
print("a =", a)
print("L =", L)

print("dL_dw =", dL_dw)
print("dL_db =", dL_db)

Peer Review

Your implementation is mostly correct, but contains one critical inefficiency and a minor numerical risk.

Issue: Manual tanh via exp You wrote:

a = (np.exp(z) - np.exp(-z)) / (np.exp(z) + np.exp(-z))

This is mathematically correct but:

  1. Redundant: NumPy provides np.tanh(z)—use it.

  2. Numerically unstable: For large $|z|$, np.exp(z) overflows (e.g., $z = 1000$ → inf), while np.tanh(z) is stable.

In production systems, always use optimized, stable primitives (tanh, softmax, etc.), not manual expansions—unless you are implementing the primitive itself (which we are not).

Corrected Script

import numpy as np

x = 2.0
w = -0.5
b = 1.0
y = 0.8

# Forward pass
z = w * x + b
a = np.tanh(z)            # Use stable built-in
L = 0.5 * (a - y)**2

# Backward pass (manual gradients)
dL_da = a - y
da_dz = 1 - a**2
dL_dz = dL_da * da_dz
dL_dw = dL_dz * x
dL_db = dL_dz

# Output
print(f"z = {z}")
print(f"a = {a}")
print(f"L = {L}")
print(f"dL_dw = {dL_dw}")
print(f"dL_db = {dL_db}")

Expected Output

z = 0.0
a = 0.0
L = 0.32
dL_dw = -1.6
dL_db = -0.8

If your script produces these values, Phase 1, Step 2 (Scalar Neuron Implementation) is complete.

3. From Scalar to Vector — The Linear Layer

Why scalar first

Your implementation of “Exercise 2.2: Implement the scalar neuron (Steps 1–2) in NumPy” is correct for a scalar neuron—one input, one weight, one bias, processing one sample. This is exactly the right starting point. The absence of matrix operations is not a flaw; it is intentional in Phase 1.

However, we must now ask:

Consider this:
In a vectorized (batched) setting, the forward pass for $B$ samples (rows) and $D$ inputs (features) is:
$Z = XW + b$, where

  • $X \in \mathbb{R}^{B \times D}$

  • $W \in \mathbb{R}^{D \times 1}$

  • $b \in \mathbb{R}$ (broadcasted)

  • $Z \in \mathbb{R}^{B \times 1}$

The scalar case is the atomic unit of this operation. Each element $z_i = x_i w + b$. If you cannot compute $\frac{\partial L}{\partial w}$ for a single $z$, you cannot correctly derive the batch gradient.

Now, examine your code:

  • You computed $\frac{\partial L}{\partial w} = \frac{\partial L}{\partial z} \cdot x$.
    In the batch case, the gradient becomes:
    $\frac{\partial L}{\partial W} = X^\top \sigma$, where $\sigma = \frac{\partial L}{\partial Z} \in \mathbb{R}^{B \times 1}$.

Your scalar $\frac{\partial L}{\partial w} = \frac{\partial L}{\partial z} \cdot x$ is precisely the $i$-th term in that vectorized sum. So your code is not just fine — it is foundational.

Moving to vectors

But here’s the critical next step:
Can you extend this to a batch of 3 samples using only NumPy (no autograd), and verify that your batch gradients match the sum of individual scalar gradients?

Do not write the code yet. First, answer:

Correct. The batch loss is typically the mean of per-sample losses:

$$L_{\text{batch}} = \frac{1}{B} \sum_{i=1}^{B} \frac{1}{2} (a_i - y_i)^2$$

Consequently, the gradient w.r.t. any parameter becomes the mean of the per-sample gradients:

$$\frac{\partial L_{\text{batch}}}{\partial w} = \frac{1}{B} \sum_{i=1}^{B} \frac{\partial L_i}{\partial w}$$

This is not merely a convention—it directly impacts learning dynamics. Using the mean (instead of sum) makes the gradient magnitude independent of batch size, which stabilizes training when B varies.
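A small illustration of that claim (a sketch using a plain linear model with no activation, so it does not give away Exercise 3.1 below): with a sum-reduced loss the gradient grows roughly with $B$, while with a mean-reduced loss it stays on the same scale.

import numpy as np

rng = np.random.default_rng(0)
w, b = 0.3, -0.1                         # arbitrary fixed parameters

for B in (4, 32, 256):
    x = rng.normal(size=B)
    y = rng.normal(size=B)
    a = w * x + b                        # forward pass (no activation)
    dL_da = a - y                        # per-sample dL/da for 0.5 * (a - y)^2
    grad_sum = np.sum(dL_da * x)         # gradient of the sum-reduced loss: scales with B
    grad_mean = np.mean(dL_da * x)       # gradient of the mean-reduced loss: stays O(1)
    print(f"B={B:3d}  sum-reduced grad={grad_sum:8.2f}  mean-reduced grad={grad_mean:6.3f}")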

Now, let’s test your understanding with a concrete micro-step.

Exercise 3.1: Vector Forward and Backward Pass

Given the following batch of 3 samples:

X = np.array([[2.0],
              [1.5],
              [-1.0]])   # Shape: (3, 1)
W = np.array([[-0.5]])    # Shape: (1, 1)
b = 1.0
Y = np.array([[0.8],
              [0.5],
              [-0.3]])  # Shape: (3, 1)

Write the forward pass (compute Z, A, L) and backward pass (compute dL_dW, dL_db) using only NumPy, without autograd or loops. Use the same tanh activation and MSE loss (mean-reduced over batch).

Do not encapsulate in functions—keep it procedural, like your scalar version.

Before you write code:

DRAFT

Now, back to the exercise:
Given the shape correction, write the full NumPy code for forward and backward passes with the provided X, W, b, Y.

Keep it procedural. Compute:

  • Z, A, L

  • dL_dA, dL_dZ

  • dL_dW, dL_db

Use matrix operations only (no loops).

Consider a single-layer network with:

  • Input batch $\mathbf{X} \in \mathbb{R}^{B \times d}$ ($B$ = batch size, $d$ = input dim),

  • Weight matrix $\mathbf{W} \in \mathbb{R}^{d_{\text{out}} \times d}$,

  • Bias $\mathbf{b} \in \mathbb{R}^{d_{\text{out}}}$,

  • Output pre-activations $\mathbf{Z} = \mathbf{X} \mathbf{W}^\top + \mathbf{1}_B \mathbf{b}^\top$

Assume a single output neuron first ($d_{\text{out}} = 1$), so $\mathbf{W} \in \mathbb{R}^{1 \times d}$, $\mathbf{b} \in \mathbb{R}$.

Given a loss $L = \frac{1}{B} \sum_{i=1}^B \frac{1}{2}(a^{(i)} - y^{(i)})^2$, what is the gradient of $L$ with respect to $\mathbf{W}$ in matrix form?

Hint:

  • From the scalar case, $\frac{\partial L^{(i)}}{\partial \mathbf{w}} = \frac{\partial L^{(i)}}{\partial z^{(i)}} \cdot \mathbf{x}^{(i)}$

  • The full gradient is the average over the batch.

Express $\frac{\partial L}{\partial \mathbf{W}}$ using matrix operations (e.g., outer product, matrix multiplication).

What is the formula?

Now, answer this:
If we had a batch of 32 inputs, how would the gradient computation for $\mathbf{W}$ and $\mathbf{b}$ change in structure?

Be specific:

  • Would you compute gradients per sample then average?

  • How does this relate to matrix multiplication in the backward pass?

  • What is the shape of $\frac{\partial L}{\partial \mathbf{W}}$ if $\mathbf{X} \in \mathbb{R}^{32 \times d}$ and $\mathbf{W} \in \mathbb{R}^{1 \times d}$?

Explain your reasoning.