Chapter 3: Automatic Differentiation
Yay! We will do some magic here. Automatic Differentiation (or Autograd) is the heart of every deep learning library. It allows us to calculate how much every single parameter in our model contributed to the final error, without us having to write the math by hand for every new model.
How does PyTorch do this?
PyTorch's torch.autograd engine works the same way — it records operations on tensors into a computation graph and walks it backward to compute gradients. See PyTorch Autograd documentation.
The folder structure after the previous chapter looks like this:
project/
├─ .venv/          # virtual environment folder
├─ babygrad/
│  ├─ __init__.py
│  ├─ tensor.py    # Tensor code
│
├─ examples/       # example scripts using babygrad
└─ tests/          # unit tests
3.1 Operations
Operations are mathematical functions (like addition, multiplication, or matrix multiplication) that transform one or more tensors into a new tensor.
Each Function class will have:
- forward : How to calculate the output given some inputs.
- backward : How to calculate the gradients using the Chain Rule.
Let's define a base class.
File: babygrad/ops.py
from .tensor import Tensor, NDArray
import numpy as np

class Function:
    def __call__(self, *args):
        raise NotImplementedError()

    def forward(self, *args):
        """Computes the forward pass of the operation.

        Args:
            *args: One or more NumPy arrays.
        """
        raise NotImplementedError()

    def backward(self, out_grad, node):
        """Computes the backward pass (gradients).

        Args:
            out_grad: Upstream gradient flowing from output to input.
            node: The output Tensor, which holds the inputs from the forward pass.
        """
        raise NotImplementedError()
NDArray is the type alias for np.ndarray we defined in tensor.py in Chapter 2. It's used in type hints to show that forward methods work on raw NumPy arrays, not Tensors.
This is a simple base class. Every operation we implement will be a subclass of Function.
What does the forward method return? NumPy arrays! Should we wrap the output in a Tensor?
Yes, because this output will be used for further calculations and is part of the computation graph.
So the forward step will:
- Compute the new values.
- Wrap those values inside a Tensor.
Each mathematical operation (addition, multiplication, etc.) will need its own forward method, but the wrapping logic is the same for all of them.
Since the wrapping logic is identical for every operation, we can put it in one shared method instead of duplicating it in every forward(). We'll use __call__ for this.
__call__ is a Python special method that makes an object callable like a function. When you write Add()(a, b), Python runs Add().__call__(a, b).
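A quick standalone illustration of __call__ (the Greeter class below is just for demonstration, not part of babygrad):

```python
class Greeter:
    def __call__(self, name):
        # Runs whenever the instance is called like a function.
        return f"hello, {name}"

g = Greeter()
print(g("tensor"))  # same as g.__call__("tensor")
```

Calling `g("tensor")` and `g.__call__("tensor")` are equivalent; our Function base class will rely on exactly this mechanism.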
Here's what __call__ will do:
- Accept Tensor inputs
- Extract .data (the NumPy arrays) from each input
- Call forward() with those NumPy arrays
- Wrap the result in a new Tensor
- Record _op and _inputs for the computation graph
- Return the new Tensor
File: babygrad/ops.py

class Function:
    def __call__(self, *args):  # takes Tensor inputs
        requires_grad = any(t.requires_grad for t in args)
        inputs_data = [t.data for t in args]      # get .data
        output_data = self.forward(*inputs_data)  # call forward
        # Wrap in Tensor
        output_tensor = Tensor(output_data, requires_grad=requires_grad)
        if requires_grad:
            output_tensor._op = self      # save the operation
            output_tensor._inputs = args  # save the parents
        return output_tensor
Notice the requires_grad check: if any input requires a gradient, the output must too. This ensures gradient tracking propagates through the entire computation.
Why store _op and _inputs?
The forward method computes the result, but that's only half the story. During the backward pass, we'll need to know:
- What operation created this tensor (_op)
- What tensors were its inputs (_inputs)
Each tensor stores its "parents" and how it was created—like a family tree. This chain of history is what makes automatic differentiation possible: when we compute gradients, we walk backward through the graph, and each tensor knows exactly where to pass its gradients.
Now let's define some concrete operations that subclass Function.
3.2 Forward Pass
The forward pass is straightforward: two inputs go through some operation and produce a single output value.
The operation here could be:
- add
- mul
- sub
- or any other differentiable operation (with a corresponding backward pass).
File: babygrad/ops.py

import numpy as np
from .tensor import Tensor, NDArray

class Add(Function):
    def forward(self, a: NDArray, b: NDArray):
        return a + b

    def backward(self, out_grad, node):
        return out_grad, out_grad

def add(a, b):
    return Add()(a, b)  # goes through `__call__`

class Mul(Function):
    def forward(self, a: NDArray, b: NDArray):
        return a * b

    def backward(self, out_grad, node):
        a, b = node._inputs
        return out_grad * b, out_grad * a

def mul(a, b):
    return Mul()(a, b)
Don't worry about backward yet — we're only focusing on forward for now. You'll see backward defined alongside it because the two always live together, but we'll explain what out_grad is and where it comes from in section 3.5.
Not too hard, right? We just completed the forward passes of Add and Mul.
Why so many operations?
Arithmetic ops like Sub, Div, and Pow are the basic math behind loss functions and weighted sums.
Activation functions like ReLU, Sigmoid, and Tanh go between layers to introduce nonlinearity. Without them, stacking 100 layers would be no better than one.
MatMul is the core of every neural network layer. Every Linear layer you'll build in Chapter 4 is just a matrix multiply plus a bias.
Shape operations like Reshape, Transpose, and BroadcastTo are plumbing. Real data comes in awkward shapes, and these let you reshape, flip, and stretch tensors so the math works out.
Utility functions like Log, Exp, and Sqrt appear in specific places: Log in cross-entropy loss, Exp in softmax, Sqrt in normalization.
You won't use them all at once, but by the end of the tutorial you'll have used every single one.
Write the forward method for these operations.
Each one follows the same pattern as Add and Mul above: take NumPy arrays in, return a NumPy array out. Use the corresponding NumPy function.
Arithmetic (two inputs, like Add and Mul):
- Sub — subtraction: a - b
- Div — division: a / b
- Pow — exponentiation: np.power(a, b)
Unary math (one input):
- Negate — flip the sign: -a
- Log — natural logarithm: np.log(a)
- Exp — exponential: np.exp(a)
- Sqrt — square root: np.sqrt(a)
- Abs — absolute value: np.abs(a)
Activation functions (one input — these are what make neural networks nonlinear; without them, stacking layers would be no better than a single layer):
- ReLU — "Rectified Linear Unit": returns x if positive, 0 otherwise. Formula: max(0, x). Hint: a * (a > 0)
- Sigmoid — squashes any value into the range (0, 1). Formula: 1 / (1 + e^(-x))
- Tanh — squashes any value into the range (-1, 1). Use np.tanh(a)
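To see the hints in action with plain NumPy (a quick sanity check, independent of babygrad):

```python
import numpy as np

a = np.array([-2.0, 0.0, 3.0])

relu = a * (a > 0)              # the ReLU hint: keeps positives, zeroes the rest
sigmoid = 1 / (1 + np.exp(-a))  # squashes every value into (0, 1)
tanh = np.tanh(a)               # squashes every value into (-1, 1)
```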
Don't worry about Reshape, Transpose, BroadcastTo, Summation, or MatMul yet — those are more involved and are covered step-by-step in sections 3.4.1–3.4.5 below.
3.3 Integrating Functions inside the Tensor Class
We've written forward methods for a few functions, but how do we actually use them? How do we link them to the Tensor class?
We need to overload the operators using Python's special (dunder) methods. This means that whenever we perform operations like x + y between two Tensor objects, Python will use our Function class (e.g., Add) under the hood instead of its default addition.
The same approach applies to other operations (*, -, /, etc.).
File: babygrad/tensor.py

class Tensor:
    def __add__(self, other):
        """Addition: a + b"""
        from .ops import Add
        if not isinstance(other, Tensor):
            other = Tensor(other)
        return Add()(self, other)

    def __radd__(self, other):
        """Right addition: 5 + tensor"""
        return self.__add__(other)
Now we can add two Tensors without any problem. tensor_a + tensor_b will just work.
from babygrad import Tensor

a = Tensor([1, 2, 3])
b = Tensor([2, 3, 4])
c = a + b

print(c.data)  # output data
>>> [3. 5. 7.]
print(c._op)  # which operation? Add
>>> <babygrad.ops.Add object at 0x00000248BAB0A600>
print(c._inputs)  # who are the parents? [a, b]
>>> [Tensor([1. 2. 3.], requires_grad=True), Tensor([2. 3. 4.], requires_grad=True)]
File: babygrad/tensor.py
Right now, using an operation means writing Add()(a, b) every time. Python's dunder methods let you wire these ops directly to operators like +, *, and -, so you can write natural expressions like a + b instead.
Using the __add__ example above as a template, write the following methods inside the Tensor class:
- __mul__ — uses Mul
- __pow__ — uses Pow
- __sub__ — uses Sub
- __truediv__ — uses Div
- __neg__ — uses Negate
Each follows the same pattern: wrap the other operand in a Tensor if needed, then call the op.
We'll add __matmul__, reshape, transpose, sum, and broadcast_to later, after we implement those ops in sections 3.4.1–3.4.5.
3.4 Backward Pass
In the Forward Pass, data flows from inputs to outputs to get a result. In the Backward Pass, we go in reverse. We start with the final error and walk backward through our family tree to figure out: "How much is each parent responsible for this error?"
That responsibility has a name: the gradient. It is a measure of how much the final error changes when we change a specific input or weight.
I'm rusty on derivatives — what does "how much it changes" mean?
If you're rusty on derivatives, 3Blue1Brown's The paradox of the derivative is an excellent 17-minute refresher. It builds the intuition for what "rate of change" means without assuming you remember the formulas.
Let's walk through a concrete example.
a = babygrad.Tensor([2, 3], dtype="float32")
b = babygrad.Tensor([4, 5], dtype="float32")
w = babygrad.Tensor([3, 2], dtype="float32")
c = a * b # step 1: multiply → c = [8, 15]
d = c * w # step 2: multiply again → d = [24, 30]
We want to know: if we nudge a slightly, how much does d change? That's the gradient of d with respect to a.
Since these are tensors, "nudge" means one element at a time. The gradient is a tensor of the same shape as a, where each element answers "if I change this element, how much does the output change?"
Let's check by hand. Nudge a[0] from 2 to 2.01:
- c[0] = 2.01 × 4 = 8.04 (was 8)
- d[0] = 8.04 × 3 = 24.12 (was 24) — changed by 0.12
We nudged by 0.01 and the output moved by 0.12. The gradient for a[0] is 0.12 / 0.01 = 12. That's b[0] × w[0] = 4 × 3! The change in a got amplified by b in the first multiply, then amplified again by w in the second multiply.
Similarly, a[1]'s gradient is b[1] × w[1] = 5 × 2 = 10. So the full gradient is [12, 10].
What does "nudge a tensor" actually mean? What if they're different shapes?
Each element of the gradient tracks one element of a. We nudge one element, hold everything else constant, and measure how the output changes.
What if the tensors have different shapes? For example, summing [1, 2, 3] to a single number 6 gives a gradient of [1, 1, 1] — each element contributed equally. The rule: the gradient always has the same shape as the thing you're differentiating with respect to. That's why we need shape-changing ops like Summation, BroadcastTo, and Reshape in sections 3.4.1–3.4.5.
How do we compute [12, 10] systematically (without nudging by hand)? The chain rule says: break the chain of operations into small steps, compute each step's derivative, and multiply them together.
a doesn't affect d directly — it goes through c first. So we ask two smaller questions:
- If we nudge c, how much does d change? Since d = c * w, the answer is w = [3, 2] — each element of c gets scaled by the corresponding element of w.
- If we nudge a, how much does c change? Since c = a * b, the answer is b = [4, 5] — each element of a gets scaled by the corresponding element of b.
Multiply them together: [3, 2] × [4, 5] = [12, 10]. Same answer we got by hand!
$$\frac{\partial d}{\partial a} = \frac{\partial d}{\partial c} \cdot \frac{\partial c}{\partial a} = w \cdot b = [3,2] \cdot [4,5] = [12, 10]$$
What about b? Same idea, separate formula. When we ask "how does changing a affect d?", we hold b constant — it's irrelevant to that question. b gets its own chain:
$$\frac{\partial d}{\partial b} = \frac{\partial d}{\partial c} \cdot \frac{\partial c}{\partial b} = w \cdot a = [3,2] \cdot [2,3] = [6, 6]$$
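You can verify both chain-rule results numerically with plain NumPy (a finite-difference sketch, separate from babygrad):

```python
import numpy as np

a = np.array([2.0, 3.0])
b = np.array([4.0, 5.0])
w = np.array([3.0, 2.0])

def d(a, b):
    return (a * b) * w              # the example: c = a * b, then d = c * w

grad_a = w * b                      # chain rule: [12., 10.]
grad_b = w * a                      # chain rule: [ 6.,  6.]

# Nudge each element of a and compare against the analytic gradient
eps = 1e-6
for i in range(len(a)):
    a2 = a.copy()
    a2[i] += eps
    numeric = (d(a2, b)[i] - d(a, b)[i]) / eps
    assert abs(numeric - grad_a[i]) < 1e-3
```

The finite-difference check is exactly the "nudge by 0.01" experiment from above, automated with a smaller nudge.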
Why does this work?
Each small derivative (like "how does c change when I change a?") is called a local derivative — it only involves one operation. The "magic" of autograd is that each Function class only needs to know its own local derivative. When we multiply them together via the chain rule, we get the global gradient — how the final output depends on any input, no matter how many operations are in between.
This is how a 100-layer neural network becomes tractable: we never solve the whole thing at once, just one operation at a time.
3Blue1Brown has two excellent videos on this: Visualizing the chain rule builds the intuition with animations, and Backpropagation calculus shows how it applies to neural networks.
Walking backward step by step
I'm lost — is there a video that explains this?
3Blue1Brown's What is backpropagation really doing? walks through the same idea — starting at the output and working backward — with animations. It's 13 minutes and covers exactly what we're about to do in code.
In practice, we compute gradients by starting at the output and working backward one node at a time. Here's the computation graph for our example:
Step 1: c = a × b
Step 2: d = c × w — now chain two operations together:
That incoming gradient has a name: the upstream gradient. In our code, it's the out_grad parameter. Every backward method does the same thing: take out_grad, multiply by the local derivative, pass the result to each input.
Let's look at how this translates to code for Add and Mul:
class Add(Function):
    def forward(self, a: NDArray, b: NDArray):
        return a + b

    def backward(self, out_grad, node):
        # For a + b:
        #   if a increases by 1, output increases by 1 → local derivative is 1
        #   if b increases by 1, output increases by 1 → local derivative is 1
        # Multiply out_grad by each local derivative:
        return out_grad * 1, out_grad * 1  # simplifies to: out_grad, out_grad

class Mul(Function):
    def forward(self, a: NDArray, b: NDArray):
        return a * b

    def backward(self, out_grad, node):
        a, b = node._inputs
        # For a * b:
        #   if a increases by 1, output increases by b → local derivative is b
        #   if b increases by 1, output increases by a → local derivative is a
        # Multiply out_grad by each local derivative:
        return out_grad * b, out_grad * a
Let's plug in real numbers to see why Mul works that way. Say a = 4 and b = 3, so a * b = 12.
- Nudge a to 4.01 → output becomes 4.01 × 3 = 12.03. It moved by 0.03, which is 0.01 × b. So the local derivative for a is b.
- Nudge b to 3.01 → output becomes 4 × 3.01 = 12.04. It moved by 0.04, which is 0.01 × a. So the local derivative for b is a.
Then multiply each by out_grad (the gradient arriving from upstream) and return both. That's all backward does.
For Add it's even simpler: nudge either input by 0.01 and the output moves by exactly 0.01. Local derivative is always 1, so you just pass out_grad straight through.
The pattern is always the same: out_grad × local_derivative. Every backward method you write will follow this pattern.
File: babygrad/ops.py
The forward pass computes results, but without gradients the network cannot learn. The backward pass is what lets the optimizer adjust every weight in the right direction during training.
Write the backward method for all these classes:
- Mul
- Sub
- Div
- Pow
- Negate
- Log
- Exp
- Relu
- Sigmoid
- Tanh
- Sqrt
- Abs
PyTorch's autograd.Function
In PyTorch, custom operations extend torch.autograd.Function with static forward and backward methods — the same pattern we're building here. See torch.autograd.Function.
Guidelines:
- Do not use NumPy functions directly in the backward pass.
- You can (and should) reuse the functions you implemented in the forward pass, e.g., add(), mul(), etc. Example:

def divide(a, b):
    return Div()(a, b)

You can use divide() freely in the backward pass of another operation.
- Make sure the gradients have the same shape as the corresponding input Tensor, especially for operations involving broadcasting or reshaping.
The above functions are easy to implement. Now let's tackle the more interesting ones.
3.4.1 Reshape
Reshape is the simplest shape-changing operation. In the forward pass, we take the data and arrange it into a new shape (e.g., turning a \(1 \times 6\) vector into a \(2 \times 3\) matrix). Under the hood, it's not creating new memory—it's just changing how it views that data.
$$\begin{aligned} \text{Original matrix } A &= \begin{pmatrix} 1 & 2 & 3 & 4 \\ 5 & 6 & 7 & 8 \\ 9 & 10 & 11 & 12 \end{pmatrix}_{3 \times 4} \\[1em] \text{Reshape to } B &= \begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \\ 10 & 11 & 12 \end{pmatrix}_{4 \times 3} \\[1em] \text{or } C &= \begin{pmatrix} 1 & 2 \\ 3 & 4 \\ 5 & 6 \\ 7 & 8 \\ 9 & 10 \\ 11 & 12 \end{pmatrix}_{6 \times 2} \end{aligned}$$
- Shape of A is (3,4)
- In the forward pass we reshape it to (4,3).
- Now
out_gradis of shape (4,3).
We need to convert the shape of out_grad back to the shape of A. To pass this gradient back to the input, we must ensure the shapes match. Since reshaping doesn't change the values—only their positions—the backward pass is just a reshape in the opposite direction.
Neural network layers often output data in a different shape than the next layer expects -- for example, a convolutional layer produces a 4D tensor that must be flattened to 2D before a fully connected layer can use it. Reshape handles these shape conversions.
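Here is the gradient flow in plain NumPy (a sketch with raw arrays, not babygrad Tensors):

```python
import numpy as np

a = np.arange(12.0).reshape(3, 4)    # original input, shape (3, 4)
out = a.reshape(4, 3)                # forward: reshape to (4, 3)

out_grad = np.ones(out.shape)        # pretend upstream gradient, shape (4, 3)
grad_a = out_grad.reshape(a.shape)   # backward: reshape the other way

assert grad_a.shape == a.shape       # gradient matches the input's shape
```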
Let's write the Reshape class.
class Reshape(Function):
    def __init__(self, shape):
        self.shape = shape

    def forward(self, a):
        # use np.reshape or a.reshape

    def backward(self, out_grad, node):
        a = node._inputs[0]
        # your solution

def reshape(a, shape):
    return Reshape(shape)(a)
We can use np.reshape or a.reshape, since a is a NumPy array. Can we use reshape in backward too?
Now go back to tensor.py and add a reshape method to the Tensor class, following the same pattern as __add__.
3.4.2 Transpose
Transpose flips the axes of a tensor. In a 2D matrix, rows become columns and columns become rows.
$$\begin{aligned} A &= \begin{pmatrix} a & b & c \\ d & e & f \end{pmatrix}_{2 \times 3} \\[1em] A^T &= \begin{pmatrix} a & d \\ b & e \\ c & f \end{pmatrix}_{3 \times 2} \end{aligned}$$
The input axes can:
- Have no axes specified—in that case, we transpose the last two axes.
- Transpose whatever axes are provided.
Backward Pass
What about the backward pass? For a 2D matrix, the logic is simple: \((A^T)^T = A\). To get the gradient back to the original shape, we just transpose it again.
$$(A^T)^T = A$$
For the backward pass, a simple transpose will work for 2D matrices. For ND matrices, we need a slightly different approach.
Let's say we had a matrix:
- Original Axes order: (0, 1, 2) and Shape: (3, 2, 5)
- Forward Pass: Transpose(1, 2, 0) and Shape: (2, 5, 3)
- The Goal: Move the gradient from (2, 5, 3) back to (3, 2, 5)
For any ND matrix that has been transposed, we just need to know where the original axes ended up.
Current axis order: (1, 2, 0) with shape (2, 5, 3). Where did each original axis go?
- Where is 0 in the current order? At position 2.
- Where is 1 in the current order? At position 0.
- Where is 2 in the current order? At position 1.
The result: (2, 0, 1). If we apply transpose to out_grad using this result, we get the original shape back.
How do we compute this? np.argsort gives us exactly what we need. If we apply np.argsort((1, 2, 0)), we get (2, 0, 1).
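A quick NumPy check of this inverse-permutation trick:

```python
import numpy as np

x = np.arange(30).reshape(3, 2, 5)   # shape (3, 2, 5)
y = np.transpose(x, (1, 2, 0))       # forward: shape (2, 5, 3)

inverse = np.argsort((1, 2, 0))      # where each original axis ended up
restored = np.transpose(y, inverse)  # backward: undo the permutation

assert tuple(inverse) == (2, 0, 1)
assert restored.shape == (3, 2, 5)
assert (restored == x).all()         # round trip recovers the original
```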
Transpose is needed for the backward pass of matrix multiplication -- computing gradients for MatMul requires transposing matrices to align dimensions correctly.
Let's write the Transpose class.
from typing import Optional

class Transpose(Function):
    def __init__(self, axes: Optional[tuple] = None):
        self.axes = axes

    def forward(self, a):
        # use np.transpose or np.swapaxes
        # Case 1: self.axes is None (swap the last two axes)
        # Case 2: self.axes is not None (transpose the specified axes)

    def backward(self, out_grad, node):
        # your solution
        # If axes is None, just do transpose(out_grad, axes)
        # Hint: for ND arrays, look into np.argsort to find the inverse.

def transpose(a, axes=None):
    return Transpose(axes)(a)
Use np.argsort to find the axes order.
Now add a transpose method to Tensor.
3.4.3 Summation
Sum the elements along a given axis.
$$\begin{aligned} A &= \begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{pmatrix}_{2 \times 3} \\[1em] \text{Sum all elements: } & \sum_{i,j} A_{ij} = 21 \\[1em] \text{Sum over rows (axis=0): } & \begin{pmatrix} 1+4 & 2+5 & 3+6 \end{pmatrix} = \begin{pmatrix} 5 & 7 & 9 \end{pmatrix} \\[1em] \text{Sum over columns (axis=1): } & \begin{pmatrix} 1+2+3 \\ 4+5+6 \end{pmatrix} = \begin{pmatrix} 6 \\ 15 \end{pmatrix} \end{aligned}$$
Notice that whatever axis we sum over, that axis vanishes (or becomes 1).
Backward pass
In the backward pass, out_grad is smaller than the original input. To pass gradients back to the parents, we have to stretch the gradient back to the original shape.
Think of it like this:
- Reshape: Put "1" back into the axis that vanished so the dimensions align.
- Multiply: Multiply the reshaped array by an array of ones with the parent's shape.
Example with shape (3, 3):
- axis=None: Result has shape (1,). Reshape to (1, 1), then multiply (1, 1) * (3, 3).
- axis=0: Result has shape (3,). Reshape to (1, 3), then multiply (1, 3) * (3, 3).
- axis=1: Result has shape (3,). Reshape to (3, 1), then multiply (3, 1) * (3, 3).
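The reshape-and-multiply trick in plain NumPy, for the axis=0 case (a sketch with raw arrays, not Tensors):

```python
import numpy as np

a = np.arange(9.0).reshape(3, 3)   # original input, shape (3, 3)
out = a.sum(axis=0)                # forward: shape (3,)

out_grad = np.ones(3)              # pretend upstream gradient, shape (3,)

g = out_grad.reshape(1, 3)         # step 1: put a 1 back into the vanished axis
grad_a = g * np.ones_like(a)       # step 2: broadcasts up to shape (3, 3)

assert grad_a.shape == a.shape     # every input element receives the gradient
```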
Summation collapses a tensor along an axis -- you'll need it for computing loss (reducing a batch of errors to a single number) and for the backward pass of BroadcastTo.
Let's write the Summation class.
class Summation(Function):
    def __init__(self, axes: Optional[tuple] = None):
        self.axes = axes

    def forward(self, a):
        # can we use np.sum?

    def backward(self, out_grad, node):
        # your solution

def summation(a, axes=None):
    return Summation(axes)(a)
You can implement backward using the reshape-and-multiply approach described above. (Later, after completing Exercise 3.8, you could also use broadcast_to.)
Now add a sum method to Tensor.
3.4.4 BroadcastTo
In an ideal world, all tensors would have the same shape. But in practice, we often need to broadcast a matrix of shape (3, 1) to (3, 3).
$$\begin{bmatrix} a \\ b \\ c \end{bmatrix} \rightarrow \begin{bmatrix} a & a & a \\ b & b & b \\ c & c & c \end{bmatrix}$$
We "stretch" the smaller tensor to match the larger one. During the backward pass, we do the exact opposite.
If one value was used three times to produce the output, it is responsible for the error at all three of those locations. Therefore, we sum the out_grad along the axes that were stretched.
When broadcasting, NumPy does two things:
- Prepending: (3,) broadcast to (2, 3) adds axes to the front.
- Stretching: If any dimension is 1, like (3, 1), it repeats to match, e.g., (3, 10).
Handling Prepending: If out_grad has more dimensions than the input, the forward pass added dimensions to the front. Sum out_grad along axis 0 until the number of dimensions matches.
Handling Stretching: Once dimensions match (e.g., both are 2D), you might still have a shape mismatch. If the input was (3, 1) but the gradient is (3, 10):
- If the original dimension was 1 but out_grad's dimension is greater than 1, stretching happened on that axis.
- Sum over that axis.
- Be careful: summing sometimes gives (3,) instead of (3, 1). You may need to insert a 1 to restore the shape.
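Both cases in plain NumPy (a sketch of the un-broadcasting logic with raw arrays):

```python
import numpy as np

a = np.ones((3, 1))                    # original input
out = np.broadcast_to(a, (2, 3, 10))   # forward: prepend an axis, stretch the last one

out_grad = np.ones((2, 3, 10))         # pretend upstream gradient

# Handle prepending: sum over axis 0 until the dimensions match
g = out_grad
while g.ndim > a.ndim:
    g = g.sum(axis=0)                  # (2, 3, 10) -> (3, 10)

# Handle stretching: sum axes where the input had size 1
for axis, size in enumerate(a.shape):
    if size == 1 and g.shape[axis] > 1:
        g = g.sum(axis=axis, keepdims=True)  # (3, 10) -> (3, 1)

assert g.shape == a.shape
assert (g == 20).all()                 # each input element was used 2 * 10 times
```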
Broadcasting lets you add a bias vector to every row in a batch, or multiply a whole matrix by a scalar -- operations that happen constantly in neural networks.
Let's write the BroadcastTo class.
class BroadcastTo(Function):
    def __init__(self, shape):
        self.shape = shape

    def forward(self, a):
        # can we use np.broadcast_to?

    def backward(self, out_grad, node):
        # your solution

def broadcast_to(a, shape):
    return BroadcastTo(shape)(a)
You can implement backward using the summation and reshape functions.
Now add a broadcast_to method to Tensor.
3.4.5 Matmul
Matrix multiplication is one of the most important operations in deep learning.
In the forward pass, we compute \(C = A \cdot B\). How do we distribute the gradient (\(out\_grad\)) back to \(A\) and \(B\)?
- Input \(A\) has shape \((M, N)\).
- Input \(B\) has shape \((N, P)\).
- Output \(C\) (and therefore \(out\_grad\)) has shape \((M, P)\).
Finding the Gradient for \(A\)
We know the result, grad_a, must match the shape of \(A\), which is \((M, N)\).
- We have out_grad: \((M, P)\)
- What can we multiply by to get \((M, N)\)?
- We need something with shape \((P, N)\).
- Looking at our inputs, \(B\) is \((N, P)\). Its transpose \(B^T\) is \((P, N)\)!
$$\frac{\partial L}{\partial A} = out\_grad \cdot B^T$$
Finding the Gradient for \(B\)
We know grad_b must match the shape of \(B\), which is \((N, P)\).
- We have out_grad: \((M, P)\)
- What can we multiply by to get \((N, P)\)?
- We need something with shape \((N, M)\).
- Input \(A\) is \((M, N)\). Its transpose \(A^T\) is \((N, M)\)!
$$\frac{\partial L}{\partial B} = A^T \cdot out\_grad$$
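Both formulas can be checked with plain NumPy before you write the op (shape checks plus a finite-difference spot check, assuming the loss is simply the sum of all output elements):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))   # (M, N)
B = rng.standard_normal((3, 5))   # (N, P)
out_grad = np.ones((4, 5))        # (M, P): gradient of (A @ B).sum()

grad_A = out_grad @ B.T           # (M, P) @ (P, N) -> (M, N)
grad_B = A.T @ out_grad           # (N, M) @ (M, P) -> (N, P)

assert grad_A.shape == A.shape
assert grad_B.shape == B.shape

# Finite-difference spot check on A[0, 0]
eps = 1e-6
A2 = A.copy()
A2[0, 0] += eps
numeric = ((A2 @ B).sum() - (A @ B).sum()) / eps
assert abs(numeric - grad_A[0, 0]) < 1e-3
```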
Matrix multiplication is how neural network layers transform their inputs -- every Linear layer you'll build in Chapter 4 is just a MatMul plus a bias.
Let's write the MatMul class.
class MatMul(Function):
    def forward(self, a, b):
        # can we use np.matmul?

    def backward(self, out_grad, node):
        # your solution

def matmul(a, b):
    return MatMul()(a, b)
You can implement backward using the matmul and transpose functions. (If you later support broadcast batched inputs, you may also need summation to reduce the extra dimensions.)
Now add __matmul__ (the @ operator) and a matmul method to Tensor.
3.4.6 Scalar Operations
Suppose we have
x = Tensor([1, 2, 3])
y = x + 1
Here x is a Tensor and 1 is a scalar. We need a way to handle these mixed-type operations.
Scalar operations are straightforward. The gradient only flows to the tensor input; the scalar is treated as a constant.
class AddScalar(Function):
    def __init__(self, scalar):
        self.scalar = scalar

    def forward(self, a: NDArray):
        return a + self.scalar

    def backward(self, out_grad: Tensor, node: Tensor):
        return out_grad

def add_scalar(a, scalar):
    return AddScalar(scalar)(a)
That's it!
Scalar operations handle cases like x + 1 or x * 0.5 -- when one operand is a plain number, not a Tensor. You'll use these for scaling learning rates and normalizing values.
Write the forward and backward method for all these classes:
- MulScalar
- DivScalar
- PowerScalar
3.5 Implementing Backward Pass in the Tensor Class
We've built a Tensor class that supports forward and backward operations. We've already created the bridge between our forward operations and the Tensor class. Now we need to create the bridge for backward operations. Each forward function in ops had its own method in the Tensor class—do we do the same for backward?
You might be tempted to write a backward_add() or backward_mul() for every operation, but that would be a maintenance nightmare.
Do we really need multiple backward() methods? No. A single backward() method in the Tensor class is sufficient.
- The Tensor: Tracks the "Family Tree" (who created this tensor?).
- The Function: Stores the "Local Rule" (how do I derive this specific operation?).
When you call out.backward(), the Tensor simply walks the graph in reverse order. At each step, it asks the parent's Function: "Here is the gradient from above; what is my share of the responsibility?"
By convention, the gradient of the final output starts as 1:
# Start of backpropagation (pseudocode)
all_nodes.grad = None  # every node's grad starts empty
output.grad = 1        # seed the final output's gradient with 1
output.backward()
Let's see how to implement the backward() method in Tensor.
File: babygrad/tensor.py

class Tensor:
    def backward(self, grad=None):
        grads = {}
        if grad is None:
            # This is the upstream gradient.
            # The output's grad must be ones.
            grads[id(self)] = Tensor(np.ones_like(self.data))

        topo_order = []
        visited = set()

        def build_topo(node):
            # your code

        build_topo(self)

        for node in reversed(topo_order):
            # your code
To implement this, we need to:
- Find all nodes (Tensors) in the computation graph that contributed to the result.
- Order them correctly so we don't calculate a parent's gradient until we've finished calculating all of its children's gradients.
Let's start simple. First, we'll find all the nodes in the computation graph using depth-first search.
Depth-first search is an algorithm that explores a graph by going as deep as possible before backtracking.
def dfs(node, order, visited):
    if node in visited:
        return  # already processed
    for parent in node._inputs:
        dfs(parent, order, visited)
    visited.add(node)
    order.append(node)
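To see the ordering DFS produces, here it is run on a tiny hand-built graph. The Node class below is a stand-in for Tensor, purely for illustration:

```python
class Node:
    """Stand-in for Tensor, just to demonstrate the traversal."""
    def __init__(self, name, inputs=()):
        self.name = name
        self._inputs = inputs

def dfs(node, order, visited):
    if node in visited:
        return
    for parent in node._inputs:
        dfs(parent, order, visited)
    visited.add(node)
    order.append(node)

# The graph from the earlier example: c = a * b, then d = c * w
a, b, w = Node("a"), Node("b"), Node("w")
c = Node("c", inputs=(a, b))
d = Node("d", inputs=(c, w))

order = []
dfs(d, order, set())
print([n.name for n in order])   # ['a', 'b', 'c', 'w', 'd']
# reversed(order) starts at d, so every child is processed before its parents
```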
When backpropagating, a node in the computation graph might receive gradient contributions from multiple paths. That means a node's .grad might temporarily be a list of tensors—one from each path.
Example:
a.grad = [Tensor([1, 2, 3]), Tensor([3, 4, 5]), Tensor([4, 5, 6])]
This is fine temporarily, but eventually we need a single tensor representing the total gradient. We need to accumulate the gradients.
You might have noticed grads = {} in the backward code. Why do we use a dictionary?
In our backward loop, when we're at a child node, we calculate gradients for its parents. But we don't store them in parent.grad immediately—those parents might receive more gradients from other children. The parents will be processed later in the loop, so we need somewhere to hold their gradients until then.
The grads dictionary is our temporary storage:
- Key: The Tensor's unique ID (
id(node)) - Value: The accumulated gradient so far
The grads dictionary is our ledger for the entire backward pass.
In the loop, when we're at a child node:
- Grab the child's gradient from the grads dictionary.
- Call _op.backward(child_grad, child_node) to compute the parents' gradients.
- Store those results back in the dictionary under each parent's id().
When the loop eventually reaches those parents, their gradients will already be waiting in the dictionary.
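The "store or add" step can be sketched as a small helper (an illustrative sketch; you may just as well inline this logic in the loop):

```python
def accumulate(grads, node_id, g):
    # Add g into the running total for this node's id.
    if node_id in grads:
        grads[node_id] = grads[node_id] + g   # second (or later) contribution
    else:
        grads[node_id] = g                    # first contribution

# Toy usage, with plain numbers standing in for gradient tensors:
grads = {}
accumulate(grads, 1, 2.0)
accumulate(grads, 1, 3.0)
print(grads[1])   # 5.0 -- contributions from two paths, summed
```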
backward method
This is the culmination of the chapter -- you're building the engine that walks the computation graph and calls each operation's backward method in the right order. After this, your tensors can actually learn.
Write the backward method in the Tensor class.
# imports

class Tensor:
    # ...

    def backward(self, grad=None):
        grads = {}
        if grad is None:
            # This is the upstream gradient.
            # The output's grad must be ones.
            grads[id(self)] = Tensor(np.ones_like(self.data))

        topo_order = []
        visited = set()

        def build_topo(node):
            # your code

        build_topo(self)

        for node in reversed(topo_order):
            # get the out_grad of the node
            out_grad = grads.get(id(node))
            # parent_grads = node._op.backward(out_grad, node)
            # for each parent in node._inputs:
            #     store grads[id(parent)] = parent's gradient
PyTorch's backward pass
PyTorch's tensor.backward() does essentially the same thing: it topologically sorts the graph and walks it in reverse, calling each node's backward function. See Autograd Mechanics for a deeper look at how PyTorch handles this under the hood.