import sys
sys.version
'3.14.0 free-threading build (main, Oct 28 2025, 12:10:48) [Clang 20.1.4 ]'
Environment preparation¶
# bash
venv_name="slm_from_scratch"
venv_path="${HOME}/venv/${venv_name}"
create_jupyter_venv -p 3.14t -n "${venv_name}"
uv pip install -p "${venv_path}" \
matplotlib \
numpy \
seaborn
# remove_jupyter_venv "${venv_name}"
The Core Idea: A neural network is just a mathematical function that can be represented as a computational graph. The “learning” happens by adjusting the parameters of this function to minimize some error.
1. Forward Pass¶
We begin with the absolute building block of all deep learning: the single neuron.
1.1 The Mathematical Neuron¶
A neuron computes:

y = f(θ·x + b)

Where:
x is the input vector,
θ is the weight vector,
b is the bias,
f is a nonlinear activation function (e.g., ReLU, tanh),
y is the output (activation).
This is not a biological metaphor—it is a differentiable function that enables composition and gradient flow.
Think about processing one input vector vs many inputs:
Single input: x = [x₁, x₂, x₃] with weights θ = [θ₁, θ₂, θ₃]
Multiple inputs (as matrix):
Inputs: [ [x₁₁, x₁₂, x₁₃],    Weights: [θ₁, θ₂, θ₃]ᵀ
          [x₂₁, x₂₂, x₂₃],
          [x₃₁, x₃₂, x₃₃] ]
What linear algebra operation would efficiently compute all outputs at once? Matrix multiplication.
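As a quick sanity check, here is a minimal NumPy sketch (with made-up numbers) showing that one matrix product computes every sample's pre-activation at once, and that it agrees with per-sample dot products:

import numpy as np

# Hypothetical 3x3 input batch and a weight vector for one neuron
X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0],
              [7.0, 8.0, 9.0]])    # shape (3, 3): 3 samples, 3 features
theta = np.array([0.2, -0.1, 0.5])  # shape (3,)
b = 0.1

# One matrix-vector product computes every sample's pre-activation at once
z_batched = X @ theta + b           # shape (3,)

# Equivalent per-sample dot products (what the matmul batches together)
z_loop = np.array([x_i @ theta + b for x_i in X])
assert np.allclose(z_batched, z_loop)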
Read about GEMM (general matrix multiply) and the BLAS libraries that implement it.
1.3 Activation Function¶
Now, consider this:
Why do we require the activation f to be nonlinear? What happens to the representational capacity of a multi-layer network if f is linear?
Correct. If all activations are linear, the composition of layers collapses to a single linear transformation:

y = W_L ⋯ W₂ W₁ x = W_eff x

This cannot model nonlinear decision boundaries—hence the necessity of a nonlinear f.
What is “eff” in W_eff?
“eff” stands for effective. It denotes that the product of multiple weight matrices collapses to a single equivalent linear transformation when all activations are linear. So W_eff is the effective weight matrix of the entire network.
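A quick NumPy check of this collapse, using two arbitrary weight matrices (shapes chosen only for illustration):

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4,))        # one input vector
W1 = rng.normal(size=(5, 4))     # first "layer"
W2 = rng.normal(size=(3, 5))     # second "layer"

# Two linear layers with no nonlinearity between them...
two_layers = W2 @ (W1 @ x)

# ...equal a single linear layer with the effective weight matrix
W_eff = W2 @ W1
one_layer = W_eff @ x

assert np.allclose(two_layers, one_layer)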
Activation Function Options¶
| Function | Formula | Range | Key Properties |
|---|---|---|---|
| Sigmoid | 1/(1+e⁻ˣ) | (0,1) | Smooth, bounded, but can saturate (vanishing gradients) |
| Tanh | (eˣ-e⁻ˣ)/(eˣ+e⁻ˣ) | (-1,1) | Zero-centered, but still can saturate |
| ReLU | max(0,x) | [0,∞) | Simple, avoids saturation, but “dying ReLU” problem |
| Leaky ReLU | max(0.01x,x) | (-∞,∞) | Fixes dying ReLU, small gradient for negatives |
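For reference, the four functions from the table written directly in NumPy (a sketch; alpha is the Leaky ReLU slope):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return np.tanh(x)

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)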
Historical Context & Modern Practice
1980s-2000s: Sigmoid/tanh were dominant (biological plausibility)
2010s: ReLU became standard for hidden layers (training speed)
Today: Variants like Leaky ReLU, GELU are common
tanh()¶
The hyperbolic tangent function is defined as:

tanh(x) = (eˣ - e⁻ˣ) / (eˣ + e⁻ˣ)

This is the complete closed-form formula. It maps ℝ to (-1, 1).
For computation by hand, you can evaluate it numerically using known values or a calculator.
This is what the function looks like:
import numpy as np
import matplotlib.pyplot as plt
def plot_tanh():
# Generate x values from -10 to 10
x = np.linspace(-10, 10, 400)
# Compute tanh for each x value
y = np.tanh(x)
# Create the plot
plt.figure(figsize=(8, 6))
plt.plot(x, y, label='tanh(x)')
# Add title and labels
plt.title('Hyperbolic Tangent Function')
plt.xlabel('x')
plt.ylabel('tanh(x)')
# Add a legend
plt.legend()
# Show the plot
plt.grid(True)
plt.axhline(0, color='black',linewidth=0.5)
plt.axvline(0, color='black',linewidth=0.5)
plt.show()
# Call the function to display the plot
plot_tanh()
1.4 What the Single Neuron really is¶
You already have prior experience (CNNs, NumPy backprop), so we are not teaching you deep learning from absolute zero. Instead, we are recalibrating your foundation at the level required for an AI Architect, where every operation must be understood in three layers simultaneously:
Mathematical identity (e.g., chain rule, matrix derivatives),
Computational implementation (e.g., NumPy/PyTorch code),
Hardware implication (e.g., how this maps to GEMM in VRAM).
Starting with a single neuron—not a full matmul—is intentional. Why?
Because matmul is just a batched collection of dot products, and a dot product is just a sum of scaled inputs. If you cannot derive the gradient of one scalar output with respect to one weight, you will misapply vectorized gradients later—even if your code “runs.”
This is a mastery gate: prove you can do the atomic unit correctly, and we immediately scale to matmul.
A linear layer computes:

Z = XW + b

Each row of Z is

zᵢ = xᵢW + b

and each element

z_ij = Σ_k x_ik W_kj + b_j

or in Einstein summation form

z_ij = x_ik W_kj + b_j

— exactly a single neuron.
Thus, the gradient of a full layer is just the aggregate of single-neuron gradients across batch and output dimensions.
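A small sketch with arbitrary shapes confirming that each entry of the matmul output is exactly one single-neuron computation (a dot product plus a bias):

import numpy as np

rng = np.random.default_rng(1)
B, d, h = 4, 3, 2                 # batch size, input dim, output dim (illustrative)
X = rng.normal(size=(B, d))
W = rng.normal(size=(d, h))
b = rng.normal(size=(h,))

Z = X @ W + b                     # full linear layer

# Every entry is one "single neuron": a dot product plus a bias
i, j = 2, 1
z_ij = X[i] @ W[:, j] + b[j]
assert np.isclose(Z[i, j], z_ij)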
2. Backward Pass — Scalar Case¶
2.1 No micrograd in the course¶
Regarding your question about the Value object:
You are likely referring to micrograd-style implementations (e.g., Andrej Karpathy’s micrograd), where a Value class tracks:
A scalar data value,
Its computational graph (parents),
And implements backward() for scalar autodiff.
We will not use Value.
Reason: Our goal is not to build a toy autodiff engine, but to understand how backpropagation maps to efficient, vectorized operations on real hardware (NumPy → PyTorch → CUDA). The Value abstraction obscures memory layout, batched computation, and the link to matrix calculus—precisely what an AI Architect must master.
Instead, we will:
Derive gradients analytically using matrix calculus,
Implement them explicitly in NumPy (no autograd),
Then transition to PyTorch with manual gradient checks,
Finally, analyze how these map to CUDA kernels (e.g., GEMM for linear layers).
This path ensures you understand what PyTorch’s .backward() actually does under the hood—something Value hides.
What is micrograd?
micrograd is an educational autodifferentiation engine written in pure Python. It implements:
A scalar Value node that stores data and pointers to children in a compute graph,
A recursive .backward() that traverses the graph and accumulates gradients via the chain rule.
It is not a production system. It exists solely to visually and mechanically demonstrate how reverse-mode autodiff works at the scalar level.
Is micrograd used in real-world training systems?
No. Production deep learning frameworks (PyTorch, TensorFlow, JAX) use vectorized, batched, GPU-accelerated autodiff based on:
Operator-level differentiation: Each primitive (e.g., matmul, softmax) has a pre-defined backward kernel.
Static or dynamic computational graphs (with fusion, memory planning, and CUDA kernels).
Memory-efficient gradient accumulation (in-place ops, gradient checkpointing).
These systems never build per-scalar graph nodes—doing so would be catastrophically slow and memory-inefficient. For a 100M-parameter model, micrograd-style graphs would require >100 million interconnected Python objects—impossible to train at scale.
Where might you encounter micrograd-like ideas in practice?
Only in two narrow contexts:
Research prototyping of novel differentiable operators (e.g., custom physics simulators), where symbolic or manual gradient derivation is needed before vectorization.
Debugging gradient flow in small subgraphs by manually computing derivatives—not by running micrograd, but by replicating its logic on paper or in NumPy.
Even then, you do not deploy such code. You derive the math, then implement a fused, vectorized CUDA kernel or PyTorch custom autograd function.
Why we avoid micrograd in this course
You are preparing for an AI Architect role, where your job is to:
Design models that fit in 16 GB VRAM,
Understand how torch.nn.Linear maps to cuBLAS GEMM calls,
Optimize memory bandwidth during backprop.
micrograd teaches none of this. It teaches graph traversal in Python—a skill irrelevant to high-performance LLM systems.
2.2 Backpropagation¶
Idea¶
Backpropagation is the algorithm for efficiently computing the gradients through the entire computational graph.
Think of it this way: If your neuron’s computation is:
input → linear → activation → output
Backpropagation answers:
“How much did each weight contribute to the final error?”
The entire forward pass is this:

x → z = w·x + b → a = f(z) → J = L(a, y_true)
“Loss” vs. “Cost”
Loss function L: defined per sample (e.g., Lᵢ = ½(aᵢ - yᵢ)²).
Cost function J: the aggregate over a batch or dataset (e.g., the mean of Lᵢ over B samples).
In modern ML literature (including PyTorch, TensorFlow, and most LLM papers), the term “loss” is used even for the batch-aggregated quantity. For example, loss.backward() in PyTorch operates on the scalar batch loss.
So while “cost” = aggregate vs. “loss” = per-sample is a valid distinction (common in Andrew Ng’s early courses), contemporary usage favors “loss” for both, with context implying scope.
In this course, we will use “loss” for the scalar batch objective, consistent with PyTorch and LLM training codebases.
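A tiny NumPy illustration of the distinction, with arbitrary values: per-sample losses versus the single scalar the optimizer actually uses:

import numpy as np

a      = np.array([0.2, 0.9, -0.4])   # predictions for a batch of 3
y_true = np.array([0.0, 1.0, -0.5])

per_sample_loss = 0.5 * (a - y_true) ** 2   # "loss" per sample, shape (3,)
batch_loss = per_sample_loss.mean()         # the scalar "cost" / batch loss
print(per_sample_loss, batch_loss)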
Let’s break this down into manageable steps. If you were to implement gradient calculation for your single neuron, what would be the minimal information you’d need?
Consider:
The error at the output
How the activation function affects the gradient
How the linear combination affects the gradient
The Error Signal Flow¶
The error signal begins with the difference between predicted and true values.
You’re correct: error = true_y - predicted_y (or some function of this difference)
Now here’s the key insight for backpropagation: We need to trace how this error flows backward through each operation.
Think about your neuron’s forward pass:
x → linear (w·x + b) → tanh → y_pred
For backpropagation, we go backward:
error ← d_tanh ← d_linear ← gradients for w and b
If you were to implement this gradient calculation manually for a single input, what operations would you need to compute?
Consider:
Gradient of the error with respect to the output
How the activation function transforms this gradient
How the linear operation distributes gradients to weights and bias
What’s your thinking about the mathematical operations involved?
Chain Rule Intuition¶
Think about your neuron’s computation:
output = leaky_relu(w·x + b)
error = some_cost_function()
If we want to know how much to change w, we need to answer: “How does changing w affect the final error?”
This is where the chain rule (the derivative of a composite function) from calculus comes in. We break the problem into steps:
How does error change with output?
How does output change with activation input?
How does activation input change with w?
We use the chain rule to compute gradients through the computational graph.
Think about your neuron:
x → z = θ·x + b → a = tanh(z) → J = loss_function(a, y_true)
where a is y_pred.
To find ∂J/∂θ, we can compute:

∂J/∂θ = (∂J/∂a) · (∂a/∂z) · (∂z/∂θ)
Your implementation challenge: If you were to compute these partial derivatives numerically for a single example, what would be your step-by-step approach?
2.3 Why Tanh?¶
Now we can explain why we use the tanh function in this step, not ReLU.
To understand the trade-offs, we must look at the forward pass and the derivative (gradient) for each function.
| Activation | Forward f(z) | Derivative f′(z) |
|---|---|---|
| tanh | tanh(z) | 1 - tanh²(z) |
| Leaky ReLU | max(αz, z), α = 0.01 | 1 if z > 0, else α |
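The same derivatives written as NumPy helpers (a sketch; alpha defaults to 0.01 as above):

import numpy as np

def tanh_grad(z):
    return 1.0 - np.tanh(z) ** 2            # d/dz tanh(z)

def leaky_relu_grad(z, alpha=0.01):
    return np.where(z > 0, 1.0, alpha)      # 1 for z > 0, alpha otherwise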
We are currently in Phase 1: Foundational Neurons & Backprop. The priority is mathematical clarity and gradient validation, not building a production-ready LLM yet.
Smoothness & Differentiability: Tanh is “smooth” everywhere. Leaky ReLU has a “kink” at z = 0 where the derivative is discontinuous. In scalar manual backprop, these kinks can cause numerical instability and confusing results during gradient checks.
Bounded Output: Tanh keeps outputs in (-1, 1). This makes gradient magnitudes predictable and prevents values from “exploding” while you are still debugging your weight initializations.
Historical Validation: Most foundational backprop literature uses tanh. Using it here allows you to replicate classic experiments and ensure your chain rule implementation is 100% correct.
Why not Leaky ReLU yet?
Leaky ReLU’s main advantage—avoiding “dead neurons”—is only truly relevant in deep neural networks. In a single scalar neuron, it adds an extra hyperparameter (the negative slope α) with almost no benefit. Furthermore, modern Transformers (like GPT) have largely moved past Leaky ReLU in favor of GELU, which we will implement in Phase 2.
“Vanishing” vs. “Dead” Gradients¶
It is important to distinguish between these two phenomena:
Vanishing Gradients (The Tanh Problem): This happens when |z| is very large. The function becomes very flat, so the gradient becomes tiny (e.g., 0.00001). Training slows down, but the neuron is still “alive.”
Dead Gradients (The ReLU Problem): In standard ReLU, if z < 0, the gradient is exactly zero. The neuron stops learning entirely because no signal passes back through it.
Leaky ReLU solves “Dead Gradients”: By using a small slope α (e.g., 0.01) for negative inputs, it ensures the gradient is never zero.
The Impact on Your Implementation:
We need non-zero, smooth gradients to validate your manual backprop code. If you used standard ReLU, any test input where z < 0 would result in a gradient of exactly 0.
A Finite-Difference Gradient¶
It’s a numerical method to approximate the derivative of a loss function with respect to a parameter—using only function evaluations, no calculus required.
Formula (forward difference):

∂L/∂θ ≈ (L(θ + ε) - L(θ)) / ε

where ε is a tiny number (e.g., 10⁻⁵).
It’s a ground-truth check for your analytic (manual) gradient.
If your analytic gradient is correct, it should match the finite-difference approximation within ~10⁻⁷.
But what is L(θ)?
It’s not a simple algebraic function of θ alone. It’s the output of a full forward pass through a computational graph.
L(θ) is defined as the loss produced by that forward pass with parameter θ.
In our scalar neuron example with tanh activation and MSE loss:

z = θ·x + b,  a = tanh(z),  L = ½(a - y_true)²

We denote:
L = loss with original θ
L_plus = loss with perturbed θ + ε
Concrete scalar neuron example:
Given fixed values:
Original values:
x = 2.0
b = 0.1
y_true = 1.0
theta = 0.5
epsilon = 10**(-5)
Forward pass:
import numpy as np
def compute_l(x, theta, b, y_true):
z = x*theta + b
a = np.tanh(z)
L = (a - y_true)**2 / 2
return z, a, L
z, a, L = compute_l(x, theta, b, y_true)
print(L)
def compute_l_plus(x, theta, b, epsilon, y_true):
theta_new = theta + epsilon
z_plus = x*theta_new + b
a_plus = np.tanh(z_plus)
L_plus = (a_plus - y_true)**2 / 2
return theta_new, z_plus, a_plus, L_plus
theta_new, z_plus, a_plus, L_plus = compute_l_plus(x, theta, b, epsilon, y_true)
print(L_plus)
Finite-difference gradient w.r.t. θ:
fin_diff_grad = (L_plus - L) / epsilon
print(fin_diff_grad)
This -0.143 is the numerical approximation of the gradient.
Compare it to the analytic gradient from backpropagation:
Analytic gradient:

∂L/∂θ = (a - y_true) · (1 - tanh²(z)) · x ≈ -0.1433
✅ Key takeaway: The finite-difference method gives a ground-truth reference to validate your manual or automatic differentiation.
def compute_gradient(x, z, a, y_true):
    dz_dtheta = x
    da_dz = 1 - np.tanh(z)**2
    dL_da = a - y_true
    dL_dtheta = dL_da * da_dz * dz_dtheta
    return dz_dtheta, da_dz, dL_da, dL_dtheta

dz_dtheta, da_dz, dL_da, dL_dtheta = compute_gradient(x, z, a, y_true)
print('dL_da:', dL_da)
print('da_dz:', da_dz)
print('dz_dtheta:', dz_dtheta)
print("Gradient:", dL_dtheta)
print(fin_diff_grad - dL_dtheta)
Critical Clarification: What is y_pred?
“Is y_pred just a?”
Yes—but only because:
The loss function is MSE: L = ½(y_pred - y_true)²
y_pred is the network output; when θ is perturbed to θ + ε, it is a_plus = tanh((θ + ε)·x + b)
Network output = activation value = a
So by definition: y_pred = a
That’s literally the definition — not an assumption.
Back to Your ReLU Scenario
Now consider a ReLU neuron: a = max(0, z), with z = θ·x + b.
Suppose for a given input, you get z < 0 (i.e., in the flat region).
Analytic gradient (subgradient of ReLU):
da/dz = 0 for z < 0 → so ∂L/∂θ = 0
Finite-difference gradient:
Perturbing θ slightly may push z closer to zero.
If the perturbed z crosses 0, then a changes → loss changes → non-zero numerical gradient
→ This mismatch seems alarming…
→ But it’s expected at non-differentiable points!
However, in practice:
If your analytic code correctly returns 0 for z < 0, it’s right.
But your gradient checker will fail near z = 0 because the function isn’t smooth there.
And in a scalar neuron with ReLU:
For any input where z < 0, gradient = 0 → no weight update
Gradient checks fail randomly depending on input sign
→ That’s why we avoid ReLU in simple gradient-checking demos—use smooth activations like tanh or sigmoid instead.
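Here is a minimal sketch of that failure mode. The starting value of theta is chosen (hypothetically) so that z sits just below zero and the forward-difference perturbation pushes it across the kink:

import numpy as np

x, b, y_true = 1.0, 0.0, 1.0
theta = -1e-6            # puts z = theta*x + b just below zero
epsilon = 1e-5

def relu_loss(theta):
    z = theta * x + b
    a = np.maximum(0.0, z)           # ReLU activation
    return 0.5 * (a - y_true) ** 2

# Analytic (sub)gradient: z < 0, so da/dz = 0 -> dL/dtheta = 0
analytic = 0.0

# Forward finite difference crosses the kink at z = 0
numeric = (relu_loss(theta + epsilon) - relu_loss(theta)) / epsilon
print(analytic, numeric)   # 0.0 vs roughly -0.9: a large, expected mismatch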
Final Check: Your Turn¶
Given:
x = 1.0, θ = 0.0, b = 0.0, ε = 10⁻⁵, y_true = 0.5
Loss = ½(a - y_true)²
Compute L(θ + ε) step by step.
Show theta_new, z_plus, a_plus, L_plus:
x = 1.0
theta = 0.0
b = 0.0
epsilon = 10**(-5)
y_true = 0.5
theta_new, z_plus, a_plus, L_plus = compute_l_plus(x, theta, b, epsilon, y_true)
print(theta_new, z_plus, a_plus, L_plus)
Your code is correct, and your output reveals a critical insight—let’s interpret it step by step.
Output Breakdown
theta_new = 10⁻⁵, z_plus = 10⁻⁵, a_plus = tanh(10⁻⁵) ≈ 10⁻⁵, so L_plus = ½(a_plus - 0.5)² is slightly less than 0.125 (since a_plus is slightly above zero, the error is slightly smaller than 0.5).
This matches your printed result: 0.12499500005000017.
Why This Matters: Gradient Behavior at θ = 0
Now compute the original loss (with θ = 0.0):
z = 0, a = tanh(0) = 0, L = ½(0 - 0.5)² = 0.125
z, a, L = compute_l(x, theta, b, y_true)
print(z, a, L)
Finite-difference gradient:
fin_diff_grad = (L_plus - L) / epsilon
print(fin_diff_grad)
Analytic gradient (via backprop):
dz_dtheta, da_dz, dL_da, dL_dtheta = compute_gradient(x, z, a, y_true)
print('dL_da:', dL_da)
print('da_dz:', da_dz)
print('dz_dtheta:', dz_dtheta)
print("Gradient:", dL_dtheta)print(fin_diff_grad - dL_dtheta)→ Numerical (-0.499995) ≈ Analytic (-0.5) → validation passes.
Key Takeaway¶
This example demonstrates:
Finite differences work even when θ = 0 (a common initialization point)
tanh’s derivative is 1 at z = 0 → gradients are strong here (no vanishing!)
Your implementation correctly isolates the effect of perturbing θ
The Gradient Check Dilemma:
Suppose you used ReLU and your test input gave z < 0. Your analytic gradient would be 0. However, your finite-difference (numerical) gradient check might show a tiny non-zero change. Would you be able to tell if your backprop code was actually broken, or if the neuron was just “dead”?
By using tanh, we ensure that for almost any input, you get a meaningful gradient to verify your math.
That’s why we avoid ReLU here.
Why Compute Analytic Gradients When Finite Differences Exist?¶
Short answer: Finite differences are computationally infeasible for real models.
Computational Cost Comparison
Let P = number of parameters.
| Method | Gradient Cost | Example: 100M-parameter LLM |
|---|---|---|
| Finite differences | O(P) forward passes | ~10⁸ forward passes per gradient update |
| Analytic (backprop) | O(1) forward + backward | 1 forward + 1 backward pass |
Finite differences: For each parameter θᵢ, you must:
Perturb θᵢ by ε
Run full forward pass → get L(θ + ε·eᵢ)
Compute (L(θ + ε·eᵢ) - L(θ)) / ε → Total: ~P forward passes per gradient estimate
Suppose your model has two parameters: θ₁ and θ₂.
You want the full gradient: (∂L/∂θ₁, ∂L/∂θ₂).
To estimate ∂L/∂θ₁:
Start with original parameters: (θ₁, θ₂)
Compute baseline loss: L(θ₁, θ₂) → 1 forward pass
Perturb only θ₁: (θ₁ + ε, θ₂)
Compute L(θ₁ + ε, θ₂) → 1 more forward pass
Approximate: ∂L/∂θ₁ ≈ (L(θ₁ + ε, θ₂) - L(θ₁, θ₂)) / ε
To estimate ∂L/∂θ₂:
Perturb only θ₂: (θ₁, θ₂ + ε)
Compute L(θ₁, θ₂ + ε) → another forward pass
Approximate: ∂L/∂θ₂ ≈ (L(θ₁, θ₂ + ε) - L(θ₁, θ₂)) / ε
✅ Total: 1 baseline + 2 perturbed = 3 forward passes
But note: you can reuse the baseline for all parameters!
So for P parameters (a sketch of the loop appears below):
1 forward pass to compute the baseline
P forward passes to compute L(θ + ε·eᵢ) for each θᵢ
→ Total: P + 1 forward passes
(We drop the “+1” in big-O notation because it’s negligible when P is large.)
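Below is a sketch of that procedure for an arbitrary loss function over a parameter vector; the function name and the toy quadratic loss are illustrative, not part of the course code. The baseline is computed once and reused:

import numpy as np

def finite_diff_gradient(loss_fn, theta, epsilon=1e-5):
    """Approximate dL/dtheta_i for every parameter: P + 1 forward passes."""
    baseline = loss_fn(theta)                 # 1 forward pass, reused below
    grad = np.zeros_like(theta)
    for i in range(theta.size):               # P more forward passes
        theta_pert = theta.copy()
        theta_pert[i] += epsilon              # perturb one parameter only
        grad[i] = (loss_fn(theta_pert) - baseline) / epsilon
    return grad

# Example with a toy quadratic loss (the gradient should be close to 2*theta)
theta = np.array([0.3, -1.2, 2.0])
print(finite_diff_gradient(lambda t: np.sum(t ** 2), theta))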
Why Can’t We Do It in One Pass?
Because each perturbation changes a different parameter.
The loss function is:

L = L(θ₁, θ₂, …, θ_P)

To see how changing θᵢ affects the loss, you must run the model with θᵢ altered and all other parameters unchanged.
You cannot perturb all parameters at once and recover individual gradients—that would mix all effects together (like trying to hear one instrument in an orchestra by playing everyone at once).
So each partial derivative requires its own controlled experiment → its own forward pass.
Concrete Numbers
Your 100M-parameter LLM:
Forward pass time: ~0.5 sec (on 4090 Ti)
Finite-diff gradient time: ~10⁸ × 0.5 s ≈ 5 × 10⁷ s ≈ 1.6 years per gradient update
Backprop time: ~1 sec
→ Finite differences are only viable for debugging tiny models (e.g., scalar neuron).
Why Is Finite-Difference a Valid Gradient Check?¶
Core idea: The derivative is defined as the limit of finite differences.
Mathematical Definition
The true derivative is:

∂L/∂θ = lim_{ε→0} (L(θ + ε) - L(θ)) / ε

Finite difference uses a small but finite ε (e.g., 10⁻⁵) to approximate this limit.
Error analysis shows the approximation error is O(ε) (for forward difference).
Why It’s Trustworthy for Validation
No assumptions about your code
Finite difference uses only forward passes—no chain rule, no manual derivatives.
If your analytic gradient matches it, your entire backprop derivation is likely correct.
Controlled error bounds
With ε = 10⁻⁵, the typical error in the gradient estimate is around 10⁻⁷ or smaller.
If your analytic gradient differs by more than 10⁻⁶, you have a bug.
Failure modes are obvious
Common errors caught:
Forgot a term in the chain rule (e.g., a missing ∂a/∂z factor)
Sign error in loss derivative
Shape mismatch in vectorized code
Critical Caveat
Finite difference is not exact—but it’s exact enough for debugging.
It assumes your forward pass is correct. If forward pass is broken, both gradients will be wrong in the same way (false positive).
→ Always validate forward pass first (e.g., with known inputs/outputs).
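One common way to compare the two gradients in practice is a relative error; this helper is a sketch (the 10⁻⁶ threshold follows the tolerance mentioned above):

import numpy as np

def gradient_check(analytic, numeric, tol=1e-6):
    """Relative error between analytic and finite-difference gradients."""
    analytic = np.asarray(analytic, dtype=float)
    numeric = np.asarray(numeric, dtype=float)
    denom = np.maximum(np.abs(analytic) + np.abs(numeric), 1e-12)
    rel_err = np.max(np.abs(analytic - numeric) / denom)
    return rel_err, rel_err < tol

print(gradient_check([-1.6, -0.8], [-1.600004, -0.800002]))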
Summary Table
| Aspect | Finite-Difference Gradient | Analytic Gradient (Backprop) |
|---|---|---|
| Purpose | Debug/validation | Training/inference |
| Computational Cost | O(P) forward passes | O(1) forward + backward |
| Scalability | Only for small P (toy models) | Scales to billions of parameters |
| Accuracy | Approximate (error ~O(ε)) | Exact (up to floating-point) |
| Dependency | Only on forward pass correctness | On entire backprop derivation |
Check for Understanding¶
Suppose you have a vectorized neuron with 7,680 parameters.
How many forward passes would finite differences require to check the full gradient?
(Answer with a number—and state whether this is practical.)
Answer:
7,680 forward passes, and it is only for ONE example from the set, so this is not acceptable for training LLMs.
Correct.
7,680 forward passes per example is computationally prohibitive—even for a single training step on a small batch. For a 100M-parameter LLM, it becomes physically impossible.
This is precisely why backpropagation is non-negotiable in deep learning: it gives exact gradients (up to floating-point precision) in time proportional to one forward pass, not the number of parameters.
We have confirmed:
The purpose and limitation of finite-difference gradients
The necessity of analytic gradients for scalable learning
Your ability to implement and reason about scalar gradient validation
Exercises¶
Exercise 2.1: Scalar Neuron Backpropagation¶
So: mastering the scalar case is mastering the core of matmul-based backprop.
Consider a single neuron with:
Input x = 2.0,
Weight w = -0.5,
Bias b = 1.0,
Activation: a = tanh(z),
Loss: L = ½(a - y)², with target y = 0.8.
Compute by hand: z, a, L.
Compute all values numerically, step by step. Provide the results, then compute the gradients:
∂L/∂a, ∂L/∂z, ∂L/∂w, ∂L/∂b
Show each step with numerical values. Do not use code—use math. This tests your chain rule mechanics.
What are the results?
Recall your values: x = 2.0, w = -0.5, b = 1.0, y = 0.8.
So first compute: z = w·x + b = (-0.5)(2.0) + 1.0 = 0.0
Now, what is tanh(0)?
Correct: a = tanh(0) = 0.
dL_da¶
∂L/∂a = a - y = 0.0 - 0.8 = -0.8
dL_dz¶
∂L/∂z = ∂L/∂a · (1 - a²) = -0.8 · (1 - 0²) = -0.8
dL_dw and dL_db¶
∂L/∂w = ∂L/∂z · x = -0.8 · 2.0 = -1.6
∂L/∂b = ∂L/∂z · 1 = -0.8
Summary of Results¶
| Quantity | Value |
|---|---|
| z | 0.0 |
| a | 0.0 |
| L | 0.32 |
| ∂L/∂a | -0.8 |
| ∂L/∂z | -0.8 |
| ∂L/∂w | -1.6 |
| ∂L/∂b | -0.8 |
This is the complete forward and backward pass for one neuron.
Final verified gradients: ∂L/∂w = -1.6, ∂L/∂b = -0.8.
This completes the scalar neuron backpropagation exercise with full chain-rule derivation.
Now, we scale to vectorized operations.
Exercise 2.2: Implement the scalar neuron (Steps 1–2) in NumPy¶
You are absolutely right—and this is a critical pedagogical checkpoint.
Before advancing to layers, you must implement the scalar neuron (Steps 1–2) in NumPy, without autograd, to verify your understanding maps to executable code.
This is non-optional. Reasoning on paper is necessary but insufficient; code exposes hidden gaps in indexing, broadcasting, and memory layout.
Do not use PyTorch, JAX, or autodiff. Pure NumPy. Phase 1 follows a strict progression:
Scalar math + hand computation → verified (done)
Scalar NumPy implementation (procedural, one-time) → required now
Vectorized batched NumPy (procedural) → next
Class-based modular design → only after vectorized gradients are fully understood
Why delay classes?
Because abstraction hides mechanics. If you wrap the neuron in a class before you’ve manually computed and coded its gradients in raw NumPy, you risk treating it as a black box—exactly what this course forbids.
So for now:
Write a procedural script (no class) that computes the forward and backward pass for the given scalar neuron.
This is a verification step, not a reusable module.
Once you provide this code and confirm its output matches:
L = 0.32
dL_dw = -1.6
dL_db = -0.8
— we will:
Extend to batched vectorized form (still procedural),
Then refactor into classes only when the underlying math and memory layout are unquestionably solid.
Paste your NumPy implementation.
Required Implementation (NumPy, No Autograd)
Write a self-contained script that:
Defines scalars:
x = 2.0, w = -0.5, b = 1.0, y = 0.8
Computes forward pass:
z = w * x + b
a = np.tanh(z)
L = 0.5 * (a - y)**2
Computes manual gradients (no .backward(), no grad):
dL_da = a - y
da_dz = 1 - a**2
dL_dz = dL_da * da_dz
dL_dw = dL_dz * x
dL_db = dL_dz * 1
Prints all values
Code¶
import numpy as np
x = 2.0
w = -0.5
b = 1.0
y = 0.8
z = x*w + b
a = (np.exp(z) - np.exp(-z)) / (np.exp(z) + np.exp(-z))
L = 0.5 * (a -y)**2
dL_da = a - y
da_dz = 1 - a**2
dL_dz = dL_da * da_dz
dL_dw = dL_dz * x
dL_db = dL_dz * 1
print("x =", x)
print("w =", w)
print("b =", b)
print("y =", y)
print("z =", z)
print("a =", a)
print("L =", L)
print("dL_dw =", dL_dw)
print("dL_db =", dL_db)Peer Review¶
Your implementation is mostly correct, but contains one critical inefficiency and a minor numerical risk.
Issue: Manual tanh via exp
You wrote:
a = (np.exp(z) - np.exp(-z)) / (np.exp(z) + np.exp(-z))
This is mathematically correct but:
Redundant: NumPy provides np.tanh(z)—use it.
Numerically unstable: For large |z|, np.exp(z) overflows (e.g., z=1000 → inf), while np.tanh(z) is stable.
In production systems, always use optimized, stable primitives (tanh, softmax, etc.), not manual expansions—unless you are implementing the primitive itself (which we are not).
Corrected Script
import numpy as np
x = 2.0
w = -0.5
b = 1.0
y = 0.8
# Forward pass
z = w * x + b
a = np.tanh(z) # Use stable built-in
L = 0.5 * (a - y)**2
# Backward pass (manual gradients)
dL_da = a - y
da_dz = 1 - a**2
dL_dz = dL_da * da_dz
dL_dw = dL_dz * x
dL_db = dL_dz
# Output
print(f"z = {z}")
print(f"a = {a}")
print(f"L = {L}")
print(f"dL_dw = {dL_dw}")
print(f"dL_db = {dL_db}")Expected Output¶
z = 0.0
a = 0.0
L = 0.32
dL_dw = -1.6
dL_db = -0.8
If your script produces these values, Phase 1, Step 2 (Scalar Neuron Implementation) is complete.
3. From Scalar to Vector — The Linear Layer¶
Why scalar first¶
Your implementation of “Exercise 2.2: Implement the scalar neuron (Steps 1–2) in NumPy” is correct for a scalar neuron—one input, one weight, one bias, processing one sample. This is exactly the right starting point. The absence of matrix operations is not a flaw; it is intentional in Phase 1.
However, we must now ask:
Consider this:
In a vectorized (batched) setting, the forward pass for B samples (rows) and d inputs (features) is:

Z = XW + b,  A = tanh(Z)

where X ∈ ℝ^(B×d), W ∈ ℝ^(d×1), and b is a scalar (broadcasted).
The scalar case is the atomic unit of this operation. Each element zᵢ = xᵢ·W + b. If you cannot compute the gradient for a single zᵢ, you cannot correctly derive the batch gradient.
Now, examine your code:
You computed dL_dw = dL_dz * x.
In the batch case, the gradient becomes:

∂L/∂W = (1/B) Xᵀ δ,  where δ = ∂L/∂Z.

Your scalar dL_dz * x is precisely the i-th term in that vectorized sum. So your code is not just fine — it is foundational.
Moving to vectors¶
But here’s the critical next step:
Can you extend this to a batch of 3 samples using only NumPy (no autograd), and verify that your batch gradients match the sum of individual scalar gradients?
Do not write the code yet. First, answer:
Correct. The batch loss is typically the mean of per-sample losses:

L = (1/B) Σᵢ Lᵢ

Consequently, the gradient w.r.t. any parameter θ becomes the mean of the per-sample gradients:

∂L/∂θ = (1/B) Σᵢ ∂Lᵢ/∂θ

This is not merely a convention—it directly impacts learning dynamics. Using the mean (instead of sum) makes the gradient magnitude independent of batch size, which stabilizes training when B varies (see the sketch below).
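A quick check of that claim with toy numbers (reusing the scalar neuron's w and b): with mean reduction the gradient stays on the same scale when the batch is doubled, while with sum reduction it doubles:

import numpy as np

w, b = -0.5, 1.0
X_small = np.array([2.0, 1.5, -1.0])          # batch of 3
Y_small = np.array([0.8, 0.5, -0.3])
X_big = np.tile(X_small, 2)                   # same data, batch of 6
Y_big = np.tile(Y_small, 2)

def grad_w(X, Y, reduction):
    z = w * X + b
    a = np.tanh(z)
    per_sample = (a - Y) * (1 - a ** 2) * X   # dL_i/dw for each sample
    return per_sample.mean() if reduction == "mean" else per_sample.sum()

print(grad_w(X_small, Y_small, "mean"), grad_w(X_big, Y_big, "mean"))  # identical
print(grad_w(X_small, Y_small, "sum"),  grad_w(X_big, Y_big, "sum"))   # doubles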
Now, let’s test your understanding with a concrete micro-step.
Exercise 3.1: Vector Forward and Backward Pass¶
Given the following batch of 3 samples:
X = np.array([[2.0],
[1.5],
[-1.0]]) # Shape: (3, 1)
W = np.array([[-0.5]]) # Shape: (1, 1)
b = 1.0
Y = np.array([[0.8],
[0.5],
[-0.3]]) # Shape: (3, 1)
Write the forward pass (compute Z, A, L) and backward pass (compute dL_dW, dL_db) using only NumPy, without autograd or loops. Use the same tanh activation and MSE loss (mean-reduced over batch).
Do not encapsulate in functions—keep it procedural, like your scalar version.
Before you write code:
Answer
Let’s recompute dimensionally, step by step.
Given:
X: (3, 1) — 3 samples, 1 feature
W: (1, 1) — weight matrix (input dim → output dim)
b: scalar (broadcasted)
Forward:
Z = XW + b: (3, 1), A = tanh(Z): (3, 1)
Loss:
Elementwise error: A - Y: (3, 1)
Squared error: (A - Y)²: (3, 1)
Mean over batch: L: scalar
Backward:
dL_dA = (A - Y) / B: (3, 1)
dL_dZ = dL_dA ⊙ (1 - A²): (3, 1) (Because A = tanh(Z), so ∂A/∂Z = 1 - A²)
Where ⊙ represents element-wise multiplication.
Now, dL_dW:
By chain rule: dL_dW = Xᵀ · dL_dZ
Xᵀ: (1, 3), dL_dZ: (3, 1), result: (1, 1)
Similarly, dL_db = sum of dL_dZ over the batch → scalar, but typically kept as (1,) or scalar.
So correct shapes:
A: (3, 1)
dL_dA: (3, 1)
dL_dZ: (3, 1)
dL_dW: (1, 1)
dL_db: scalar (or (1,) if using an array)
Your claim that dL_dA is (1, 1) is incorrect. The gradient w.r.t. A must have the same shape as A — it’s an elementwise derivative.
This matters for CUDA: A of shape (3, 1) implies 3 independent activations, each contributing to the gradient. Mis-shaping breaks vectorization and leads to incorrect memory access patterns on GPU.
DRAFT¶
Now, back to the exercise:
Given the shape correction, write the full NumPy code for forward and backward passes with the provided X, W, b, Y.
Keep it procedural. Compute:
Z, A, L
dL_dA, dL_dZ
dL_dW, dL_db
Use matrix operations only (no loops).
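For reference, one possible procedural sketch consistent with the shapes above; write your own version first, then compare:

import numpy as np

X = np.array([[2.0], [1.5], [-1.0]])   # (3, 1)
W = np.array([[-0.5]])                  # (1, 1)
b = 1.0
Y = np.array([[0.8], [0.5], [-0.3]])    # (3, 1)
B = X.shape[0]

# Forward pass
Z = X @ W + b                       # (3, 1)
A = np.tanh(Z)                      # (3, 1)
L = np.mean(0.5 * (A - Y) ** 2)     # scalar, mean-reduced over the batch

# Backward pass (manual gradients)
dL_dA = (A - Y) / B                 # (3, 1): the 1/B comes from the mean reduction
dL_dZ = dL_dA * (1 - A ** 2)        # (3, 1): elementwise tanh derivative
dL_dW = X.T @ dL_dZ                 # (1, 1)
dL_db = dL_dZ.sum()                 # scalar

print(L, dL_dW, dL_db)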
Consider a single-layer network with:
Input batch X ∈ ℝ^(B×d) (B = batch size, d = input dim),
Weight matrix W ∈ ℝ^(d×h),
Bias b ∈ ℝ^h,
Output pre-activations Z = XW + b
Assume a single output neuron first (h = 1), so W ∈ ℝ^(d×1), b ∈ ℝ.
Given a loss L, what is the gradient of L with respect to W in matrix form?
Hint:
From the scalar case, ∂L/∂w = ∂L/∂z · x.
The full gradient is the average over the batch.
Express ∂L/∂W using matrix operations (e.g., outer product, matrix multiplication).
What is the formula?
Now, answer this:
If we had a batch of 32 inputs, how would the gradient computation for W and b change in structure?
Be specific:
Would you compute gradients per sample then average?
How does this relate to matrix multiplication in the backward pass?
What is the shape of ∂L/∂W if X ∈ ℝ^(32×d) and W ∈ ℝ^(d×1)?
Explain your reasoning.