Chapter 3: Automatic Differentiation
Yay! We will do some magic here. Automatic Differentiation (or Autograd) is the heart of every deep learning library. It allows us to calculate how much every single parameter in our model contributed to the final error, without us having to write the math by hand for every new model.
The folder structure after the previous chapter looks like this:
project/
├─ .venv/        # virtual environment folder
├─ babygrad/
│  ├─ tensor.py  # Tensor code
│
├─ examples/     # example scripts using babygrad
└─ tests/        # unit tests
3.1 Operations
Operations are mathematical functions (like addition, multiplication, or matrix multiplication) that transform one or more tensors into a new tensor.
Each Function class will have:
- forward : how to calculate the output given some inputs.
- backward : how to calculate the gradients using the Chain Rule.
Let's define a base class.
File: babygrad/ops.py
from .tensor import NDArray, Tensor
from typing import Optional, Tuple
import numpy as np


class Function:
    def __call__(self, *args):
        raise NotImplementedError()

    def forward(self, *args):
        """Computes the forward pass of the operation.

        Args:
            *args: One or more NumPy arrays.
        """
        raise NotImplementedError()

    def backward(self, out_grad, node):
        """Computes the backward pass (gradients).

        Args:
            out_grad: upstream gradient flowing from the output back to the inputs.
            node: the output Tensor, which holds the inputs from the forward pass.
        """
        pass
This is a simple base class; every operation we write will be a subclass of Function.
What does the forward method return? NumPy arrays! Should we wrap that output in the Tensor class? Yes, because this output will be used for further calculations and is part of the computation graph.
So every operation needs to:
- Compute the new values.
- Wrap those values inside a Tensor.
The first step is unique to each operation, but the wrapping is common to all of them. So why not put it in one shared place? The __call__ method will do exactly that.
Let's ask: what should __call__ do?
- Take the inputs, which we expect to be Tensor objects.
- Get the .data of those inputs.
- Call forward() using that data.
- Wrap the output in a Tensor.
- Return it.
File: babygrad/ops.py
class Function:
    def __call__(self, *inputs):                        # takes Tensor inputs
        requires_grad = any(t.requires_grad for t in inputs)
        inputs_data = [t.data for t in inputs]           # gets .data
        output_data = self.forward(*inputs_data)         # calls forward
        # wrap the result in a Tensor
        output_tensor = Tensor(output_data, requires_grad=requires_grad)
        if requires_grad:
            output_tensor._op = self                     # save the operation
            output_tensor._inputs = inputs               # save the parents
        return output_tensor
Notice the requires_grad logic in __call__. If any of the input tensors require a gradient, the output tensor must also require a gradient. This ensures the "tracking" persists through the whole network.
Why did we do output_tensor._op = self and output_tensor._inputs = inputs?
Let's see what actually happened. We called the forward method, which computed a result for us. To compute that result we needed:
- Parents
- The operation
So we are just storing them. Each child stores the information about its parents. Think of it like a family tree. By storing its parents and the story of its creation, the child tensor has all the information it will need later to pass gradients back to those parents. This chain of history is what will make automatic differentiation possible.
from babygrad.tensor import Tensor

a = Tensor([1, 2, 3])
b = Tensor([2, 3, 4])
c = a + b

print(c.data)      # output data
>>> [3. 5. 7.]
print(c._op)       # which operation? '+'
>>> <babygrad.ops.Add object at 0x00000248BAB0A600>
print(c._inputs)   # who are the parents? [a, b]
>>> [Tensor([1. 2. 3.], requires_grad=True), Tensor([2. 3. 4.], requires_grad=True)]
Now that we are done with the Function class, we can add some operations.
3.2 Forward Pass
Forward pass is pretty straightforward: Two inputs go through some operation and produce a single output value.
The operation here could be add, mul, sub, or any other mathematically valid operation.
File: babygrad/ops.py
import numpy as np
from .tensor import Tensor, NDArray


class Add(Function):
    def forward(self, a: NDArray, b: NDArray):
        return a + b

    def backward(self, out_grad, node):
        return out_grad, out_grad


def add(a, b):
    return Add()(a, b)   # invokes Function.__call__


class Mul(Function):
    def forward(self, a, b):
        return a * b

    def backward(self, out_grad, node):
        a, b = node._inputs
        return out_grad * b, out_grad * a


def mul(a, b):
    return Mul()(a, b)
Seriously, how hard was that? We just completed the forward pass of Add() and Mul().
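As a quick sanity check (assuming the Tensor class from Chapter 2 wraps a Python list into a NumPy array and exposes .data, as in the earlier example):
from babygrad.tensor import Tensor
from babygrad.ops import add, mul

a = Tensor([1., 2., 3.])
b = Tensor([2., 3., 4.])

print(add(a, b).data)   # [3. 5. 7.]
print(mul(a, b).data)   # [ 2.  6. 12.]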
File: babygrad/ops.py
Write the forward method for all these classes:
- Sub
- Div
- Pow
- Transpose
- Reshape
- BroadcastTo
- Summation
- MatMul
- Negate
- Log
- Exp
- Relu
- Sigmoid
- Tanh
- Sqrt
- Abs
You can use NumPy functions in these forward methods.
3.3 Integrating Functions inside the Tensor Class
So we have written forward methods for a few functions, but how do we actually use them? How do we link them to the Tensor class?
We need to overload Python's operator methods. This means that whenever we perform an operation like x + y between two Tensor objects, it will use our Function class (e.g., Add) under the hood instead of Python's default addition.
The same approach applies to the other operators (*, -, /, etc.).
File: babygrad/tensor.py
class Tensor:
    def __add__(self, other):
        """Addition: a + b"""
        from .ops import Add
        if not isinstance(other, Tensor):
            other = Tensor(other)   # wrap plain numbers/arrays in a Tensor
        return Add()(self, other)

    def __radd__(self, other):
        """Right addition: 5 + tensor"""
        return self.__add__(other)
Now we can add two Tensors without any problem: tensor_a + tensor_b will just work.
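For example (a quick check, assuming the Tensor constructor can wrap a plain Python number, as __add__ above expects):
a = Tensor([1., 2., 3.])
b = Tensor([2., 3., 4.])

print((a + b).data)   # Tensor + Tensor -> [3. 5. 7.]
print((1 + a).data)   # scalar + Tensor goes through __radd__ -> [2. 3. 4.]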
File: babygrad/tensor.py
Write the following methods inside the Tensor class:
- __mul__
- __pow__
- __sub__
- __truediv__
- __matmul__
- matmul
- sum
- broadcast_to
- reshape
- __neg__
- transpose
Each operator should use the corresponding Function class (e.g., Add, Mul, MatMul).
3.4 Backward Pass
In the Forward Pass, data flows from inputs to outputs to get a result. In the Backward Pass, we go in reverse. We start with the final error and walk backward through our family tree to figure out: "How much is each parent responsible for this error?"
The gradient is a measure of how much the final error changes when we change a specific input or weight.
Local derivatives are the heart of the Chain Rule. The "magic" of Autograd is that it breaks a massive problem into small, manageable pieces using local derivatives.
Instead of trying to solve for a 100-layer neural network all at once, we solve it one operation at a time. Each Function class only needs to know its own local derivative. When we chain them together during the backward pass, these local pieces multiply together via the Chain Rule to give us the global gradient for every parameter in the system.
Example
a = babygrad.Tensor([1, 2, 3], dtype="float32")
b = babygrad.Tensor([2, 3, 4], dtype="float32")
c = a + b
d = c + babygrad.Tensor([3, 5, 3], dtype="float32")
$$\frac{\partial d}{\partial a} = \frac{\partial d}{\partial c} \cdot \frac{\partial c}{\partial a}$$
$$\frac{\partial d}{\partial b} = \frac{\partial d}{\partial c} \cdot \frac{\partial c}{\partial b}$$
As we said, the gradient flows from the output back toward the inputs. The question is: what is the gradient of the output itself?
$$\frac{\partial d}{\partial d} = 1$$
It is simply just 1. This is the upstream derivative. This upstream gradient of 1 flows backward into c. Then c applies its local derivative (\(\frac{\partial c}{\partial a} = 1\), \(\frac{\partial c}{\partial b} = 1\)) and passes gradients further to a and b.
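As a quick sanity check (plain NumPy, independent of babygrad), we can approximate \(\frac{\partial d}{\partial a}\) numerically and confirm it is 1 elementwise:
import numpy as np

a = np.array([1., 2., 3.])
b = np.array([2., 3., 4.])
k = np.array([3., 5., 3.])

def f(a):
    return (a + b) + k               # same computation as d above

eps = 1e-6
approx = (f(a + eps) - f(a)) / eps   # finite-difference estimate of ∂d/∂a
print(approx)                        # ≈ [1. 1. 1.]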
Let's look at the backward passes of Add and Mul.
class Add(Function):
    def forward(self, a: NDArray, b: NDArray):
        return a + b

    def backward(self, out_grad: Tensor, node: Tensor):
        # the derivative of (a + b) wrt a is 1
        # the derivative of (a + b) wrt b is 1
        # these are the local derivatives
        return out_grad, out_grad


class Mul(Function):
    def forward(self, a: NDArray, b: NDArray):
        return a * b

    def backward(self, out_grad: Tensor, node: Tensor):
        a, b = node._inputs
        # the derivative of (a * b) wrt a is b
        # the derivative of (a * b) wrt b is a
        return out_grad * b, out_grad * a
The out_grad would be the upstream gradient coming from the output.
File: babygrad/ops.py
Write the backward method for all these classes:
- Mul
- Sub
- Div
- Pow
- Negate
- Log
- Exp
- Relu
- Sigmoid
- Tanh
- Sqrt
- Abs
Guidelines:
- Do not use NumPy functions directly in the backward pass.
- You can (and should) reuse the helper functions you implemented alongside the forward pass, e.g., add(), mul(), divide(), etc. (see the sketch after this list). For example:
def divide(a, b):
    return Div()(a, b)
You can then use divide() freely in the backward pass of another operation.
- Make sure the gradients have the same shape as the corresponding input Tensor, especially for operations involving broadcasting or reshaping.
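For example, here is one possible sketch of a backward that follows these guidelines. It assumes the divide() helper above exists; whether you return a bare gradient or a one-element tuple for single-input ops is up to how you write Tensor.backward later.
class Log(Function):
    def forward(self, a: NDArray):
        return np.log(a)

    def backward(self, out_grad: Tensor, node: Tensor):
        a = node._inputs[0]
        # d/da log(a) = 1/a, so the gradient is out_grad / a.
        # We reuse our own divide() instead of calling NumPy directly,
        # so the result stays a Tensor inside the graph machinery.
        return (divide(out_grad, a),)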
The functions above are pretty easy to implement. Now we will implement the most rewarding ones: the shape-changing operations.
3.4.1 Reshape
The Reshape function is the simplest shape-changing operation. In the forward pass, we take the data and arrange it into a new shape (e.g., turning a \(1 \times 6\) vector into a \(2 \times 3\) matrix). Under the hood it does not allocate new memory; it just changes how the same data is viewed.
$$\begin{aligned} \text{Original matrix } A &= \begin{pmatrix} 1 & 2 & 3 & 4 \\ 5 & 6 & 7 & 8 \\ 9 & 10 & 11 & 12 \end{pmatrix}_{3 \times 4} \\[1em] \text{Reshape to } B &= \begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \\ 10 & 11 & 12 \end{pmatrix}_{4 \times 3} \\[1em] \text{or } C &= \begin{pmatrix} 1 & 2 \\ 3 & 4 \\ 5 & 6 \\ 7 & 8 \\ 9 & 10 \\ 11 & 12 \end{pmatrix}_{6 \times 2} \end{aligned}$$
- The shape of A is (3, 4).
- In the forward pass we reshape it to (4, 3).
- Now out_grad has shape (4, 3).
We need to convert out_grad back to the shape of A. To pass this gradient to the input, the shapes must match. Since reshaping doesn't change the values but only their positions, the backward pass is just a reshape in the opposite direction.
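A quick NumPy check of the idea (plain NumPy, just to see the shapes):
import numpy as np

A = np.arange(12).reshape(3, 4)       # original input, shape (3, 4)
B = A.reshape(4, 3)                   # forward pass
out_grad = np.ones_like(B)            # pretend upstream gradient, shape (4, 3)
grad_A = out_grad.reshape(A.shape)    # backward: reshape back
print(grad_A.shape)                   # (3, 4) — matches the input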
File: babygrad/ops.py
Let's write the Reshape class.
class Reshape(Function):
    def __init__(self, shape):
        self.shape = shape

    def forward(self, a):
        # use np.reshape or a.reshape
        ...

    def backward(self, out_grad, node):
        a = node._inputs[0]
        # your solution
        ...


def reshape(a, shape):
    return Reshape(shape)(a)
We can use np.reshape or a.reshape, since a is a NumPy array. Can we use a reshape in the backward pass too?
3.4.2 Transpose
Transpose flips the axes of a tensor. In a 2D matrix, rows become columns and columns become rows.
$$\begin{aligned} A &= \begin{pmatrix} a & b & c \\ d & e & f \end{pmatrix}_{2 \times 3} \\[1em] A^T &= \begin{pmatrix} a & d \\ b & e \\ c & f \end{pmatrix}_{3 \times 2} \end{aligned}$$
The axes argument can:
- Be None; in that case we transpose the last two axes.
- Specify which axes to transpose; then we transpose exactly those axes.
Backward Pass
What about the backward pass? For a 2D matrix, the logic is simple: \((A^T)^T = A\). To get the gradient back to the original shape, we just transpose it again.
$$(A^T)^T = A$$
For the backward pass, simply transposing again works for 2D matrices. For ND tensors we need to do something slightly different, but still simple.
Let's say we had a tensor with:
- Original Axes order: (0, 1, 2) and Shape: (3, 2, 5)
- Forward Pass: Transpose(1, 2, 0) and Shape: (2, 5, 3)
- The Goal: Move the gradient from (2, 5, 3) back to (3, 2, 5)
For any ND tensor that has been transposed, we just need to know where each of the initial axes currently sits.
Current axes order: (1, 2, 0), shape: (2, 5, 3).
- Where is 0 in the current axes order? At position 2.
- Where is 1 in the current axes order? At position 0.
- Where is 2 in the current axes order? At position 1.
The result we get is (2, 0, 1).
If we transpose out_grad using this result, do you think we get the original shape back? Yes.
The question is: how do we compute this result after the forward pass is done?
np.argsort gives us exactly this ordering: np.argsort((1, 2, 0)) returns (2, 0, 1).
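A small NumPy check of this trick:
import numpy as np

x = np.random.rand(3, 2, 5)
y = np.transpose(x, (1, 2, 0))      # forward pass, shape (2, 5, 3)
inv = np.argsort((1, 2, 0))         # -> array([2, 0, 1])
z = np.transpose(y, inv)            # undo the permutation
print(z.shape)                      # (3, 2, 5)
print(np.array_equal(x, z))         # True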
File: babygrad/ops.py
Let's write the Transpose class.
class Transpose(Function):
    def __init__(self, axes: Optional[tuple] = None):
        self.axes = axes

    def forward(self, a):
        # use np.transpose or np.swapaxes
        # Case 1: self.axes is None (swap the last two axes)
        # Case 2: self.axes is not None (transpose the specified axes)
        ...

    def backward(self, out_grad, node):
        # your solution
        # If self.axes is None, just call transpose(out_grad) to swap the last two axes again.
        # Hint: for ND arrays, look into np.argsort to find the inverse permutation.
        ...


def transpose(a, axes=None):
    return Transpose(axes)(a)
Use np.argsort to find the axes order.
3.4.3 Summation
Sum the elements along a given axis.
$$\begin{aligned} A &= \begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{pmatrix}_{2 \times 3} \\[1em] \text{Sum all elements: } & \sum_{i,j} A_{ij} = 21 \\[1em] \text{Sum over rows (axis=0): } & \begin{pmatrix} 1+4 & 2+5 & 3+6 \end{pmatrix} = \begin{pmatrix} 5 & 7 & 9 \end{pmatrix} \\[1em] \text{Sum over columns (axis=1): } & \begin{pmatrix} 1+2+3 \\ 4+5+6 \end{pmatrix} = \begin{pmatrix} 6 \\ 15 \end{pmatrix} \end{aligned}$$
We can observe that whichever axis we sum over vanishes from the shape (or becomes 1).
Backward pass
In the backward pass, our out_grad is smaller than the original input. To get the "gradients" back to the parents, we have to stretch the gradient back to the original shape.
Think of it like this:
- Reshape: put a "1" back into the axis that vanished, so the dimensions align.
- Multiply: multiply the reshaped gradient by an array of ones with the parent's shape, which broadcasts it back.
Example: input of shape (3, 3)
- axes=None: the result is a scalar (shape ()). First reshape out_grad to (1, 1), then multiply: (1, 1) * (3, 3).
- axes=0: the result has shape (3,). First reshape to (1, 3), then multiply: (1, 3) * (3, 3).
- axes=1: the result has shape (3,). First reshape to (3, 1), then multiply: (3, 1) * (3, 3).
(See the NumPy check below.)
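Here is the axes=0 case in plain NumPy, just to see the shapes line up:
import numpy as np

a = np.arange(9.0).reshape(3, 3)      # parent, shape (3, 3)
s = a.sum(axis=0)                     # forward, shape (3,)
out_grad = np.ones_like(s)            # upstream gradient, shape (3,)
grad_a = out_grad.reshape(1, 3) * np.ones(a.shape)   # (1, 3) * (3, 3) -> (3, 3)
print(grad_a.shape)                   # (3, 3) — matches the parent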
File: babygrad/ops.py
Let's write the Summation class.
class Summation(Function):
    def __init__(self, axes: Optional[tuple] = None):
        self.axes = axes

    def forward(self, a):
        # can we use np.sum?
        ...

    def backward(self, out_grad, node):
        # your solution
        ...


def summation(a, axes=None):
    return Summation(axes)(a)
You can implement backward using broadcast_to (if you have already completed it) or using the simple reshape-and-multiply approach above.
3.4.4 BroadcastTo
In an ideal world all tensors would have the same shape. But in practice we often want to broadcast a matrix of shape (3, 1) to (3, 3).
$$\begin{bmatrix} a \\ b \\ c \end{bmatrix} \rightarrow \begin{bmatrix} a & a & a \\ b & b & b \\ c & c & c \end{bmatrix}$$
We "stretch" the smaller tensor to match the larger one. During the backward pass, we do the exact opposite.
If one value was used three times to produce the output, it is responsible for the error at all three of those locations. Therefore, we sum the out_grad along the axes that were stretched.
Whenever we broadcast, NumPy is:
- Prepending: (3,) is broadcast to (2, 3) by adding new axes to the front.
- Stretching: any dimension of size 1 is repeated, e.g. (3, 1) is stretched to (3, 10).
Handling Prepending: if out_grad has more dimensions than the input, the forward pass added new dimensions to the front. The number of dimensions of out_grad must end up equal to the number of dimensions of the input, so we repeatedly sum out_grad over axis=0 until the number of dimensions match.
Handling Stretching:
Now that the number of dimensions matches (e.g., both are 2D), you might still have a shape mismatch: your input was (3, 1) but your gradient is (3, 10).
- If the original input dimension was 1 but the out_grad dimension is greater than 1, stretching happened along that axis.
- Simply sum over that axis.
- Be careful: summing can give (3,) instead of (3, 1), so you may need to keep (or re-insert) the size-1 dimension. (See the NumPy check below.)
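A plain NumPy walkthrough of both cases, for an input of shape (3, 1) that was broadcast to (2, 3, 10):
import numpy as np

out_grad = np.ones((2, 3, 10))        # gradient of the broadcast output
# Prepending: keep summing axis 0 until the number of dimensions matches (3, 1)
g = out_grad.sum(axis=0)              # shape (3, 10)
# Stretching: the input had size 1 where the gradient has size 10 -> sum that axis
g = g.sum(axis=1, keepdims=True)      # shape (3, 1)
print(g.shape)                        # (3, 1) — matches the input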
File: babygrad/ops.py
Let's write the BroadcastTo class.
class BroadcastTo(Function):
    def __init__(self, shape):
        self.shape = shape

    def forward(self, a):
        # can we use np.broadcast_to?
        ...

    def backward(self, out_grad, node):
        # your solution
        ...


def broadcast_to(a, shape):
    return BroadcastTo(shape)(a)
You can implement backward using the summation and reshape methods.
3.4.5 Matmul
Matmul is important.
In the forward pass, we compute \(C = A \cdot B\). How do we distribute the gradient (\(out\_grad\)) back to \(A\) and \(B\)?
- Input \(A\) has shape \((M, N)\).
- Input \(B\) has shape \((N, P)\).
- Output \(C\) (and therefore \(out\_grad\)) has shape \((M, P)\).
Finding the Gradient for \(A\)
We know the result, grad_a, must match the shape of \(A\), which is \((M, N)\).
- We have out_grad: \((M, P)\)
- What can we multiply by to get \((M, N)\)?
- We need something with shape \((P, N)\).
- Looking at our inputs, \(B\) is \((N, P)\). Its transpose \(B^T\) is \((P, N)\)!
$$\frac{\partial L}{\partial A} = out\_grad \cdot B^T$$
Finding the Gradient for \(B\)
We know grad_b must match the shape of \(B\), which is \((N, P)\).
- We have out_grad: \((M, P)\)
- What can we multiply by to get \((N, P)\)?
- We need something with shape \((N, M)\).
- Input \(A\) is \((M, N)\). Its transpose \(A^T\) is \((N, M)\)!
$$\frac{\partial L}{\partial B} = A^T \cdot out\_grad$$
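A quick NumPy shape check of these two formulas:
import numpy as np

A = np.random.rand(4, 3)          # (M, N)
B = np.random.rand(3, 5)          # (N, P)
out_grad = np.ones((4, 5))        # same shape as A @ B, i.e. (M, P)

grad_A = out_grad @ B.T           # (M, P) @ (P, N) -> (4, 3), matches A
grad_B = A.T @ out_grad           # (N, M) @ (M, P) -> (3, 5), matches B
print(grad_A.shape, grad_B.shape)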
File: babygrad/ops.py
Let's write the MatMul class.
class MatMul(Function):
    def forward(self, a, b):
        # can you use np.matmul (or the @ operator)?
        ...

    def backward(self, out_grad, node):
        # your solution
        ...


def matmul(a, b):
    return MatMul()(a, b)
You can implement backward using the matmul, transpose, summation, reshape, and broadcast_to methods.
3.4.6 Scalar Operations
Suppose we have
x = babygrad.Tensor([1, 2, 3])
y = x + 1
Here x is a Tensor and 1 is a plain Python scalar. We want dedicated operations to handle the forward and backward pass for these situations.
The scalar operations are pretty simple and straightforward too. The gradient only flows to the tensor input; the scalar is treated as a constant.
class AddScalar(Function):
    def __init__(self, scalar):
        self.scalar = scalar

    def forward(self, a: NDArray):
        return a + self.scalar

    def backward(self, out_grad: Tensor, node: Tensor):
        # d/da (a + c) = 1, so the gradient passes straight through
        return out_grad


def add_scalar(a, scalar):
    return AddScalar(scalar)(a)
Yes! That's it.
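Here is one possible way to route plain Python numbers to AddScalar inside Tensor.__add__ (a sketch, not the required design; it assumes AddScalar lives in ops.py as above):
class Tensor:
    def __add__(self, other):
        from .ops import Add, AddScalar
        if isinstance(other, (int, float)):
            # scalar path: the number is treated as a constant
            return AddScalar(other)(self)
        if not isinstance(other, Tensor):
            other = Tensor(other)
        return Add()(self, other)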
File: babygrad/ops.py
Write the forward and backward method for all these classes:
- MulScalar
- DivScalar
- PowerScalar
3.5 Implementing Backward Pass in the Tensor Class
We've built Function classes that support forward and backward operations, and we have already created the bridge between the forward operations and the Tensor class. Now we need the bridge for the backward pass. Each forward function inside ops got its own method inside the Tensor class; do we do the same thing for backward?
You might be tempted to write a backward_add() or backward_mul() for every operation, but that is a maintenance nightmare.
Do we really need multiple backward() methods? No. A single backward() method inside the Tensor class is sufficient.
- The Tensor: Tracks the "Family Tree" (who created this tensor?).
- The Function: Stores the "Local Rule" (how do I derive this specific operation?).
When you call out.backward(), the Tensor simply walks the graph in reverse order. At each step, it asks the parent's Function: "Here is the gradient from above; what is my share of the responsibility?"
By convention, the gradient of the final output starts as 1:
# Start of backpropagation (conceptually):
# every node's grad starts as None
# the output's grad starts as 1
output.backward()
Let's see how we will implement the backward() method inside Tensor.
File: babygrad/tensor.py
class Tensor:
    def backward(self, grad=None):
        grads = {}
        if grad is None:
            # This is the upstream gradient of the output.
            # The output's gradient must be ones.
            grads[id(self)] = Tensor(np.ones_like(self.data))
        # (if grad is not None, seed grads[id(self)] with it instead)

        topo_order = []
        visited = set()

        def build_topo(node):
            # your code
            ...

        build_topo(self)

        for node in reversed(topo_order):
            # your code
            ...
To implement this we need to:
- Find all Nodes (Tensors) in the computation graph that contributed to the result.
- Order them correctly so we don't calculate a parent's gradient until we've finished calculating all of its children's gradients.
Let's start simple. First we will find all the nodes in the computation graph, using depth-first search (DFS).
Depth-first search is an algorithm that explores a graph by going as deep as possible before backtracking.
def dfs(node, order, visited):
    if node in visited:
        return                          # already processed, backtrack
    add node to visited
    for every parent in node._inputs:   # leaf tensors have no parents
        call dfs(parent, order, visited)
    add node to order                   # post-order: parents land before the node
When backpropagating, a node in the computation graph might receive gradient contributions from multiple paths.
That means a node's gradient might temporarily be represented as a list of tensors, one from each path.
Example:
a.grad = [Tensor([1, 2, 3]), Tensor([3, 4, 5]), Tensor([4, 5, 6])]
This is perfectly fine temporarily, but eventually we need a single tensor representing the total gradient for that node. We need to accumulate the gradients.
If you have read the backward code you might see grads = {}.
Why do we use grads = {} ?
In our backward loop, we are at a child node and we calculate the gradients for its parents. We don't store them in parent.grad right away, because a parent might still receive more gradient contributions from other children and will only be processed later in the loop. We need somewhere to hold the (partial) gradient for every tensor until it is that tensor's turn in the loop.
So the grads dictionary is our temporary storage.
- Key: The Tensor's unique ID (id(node)).
- Value : The accumulated gradient (so far).
So grads is just our ledger for the entire backward pass.
In the loop, when we are at a child_node, we use its gradient.
- We grab the child's gradient from our grads dictionary.
- We run node._op.backward(child_grad, child_node) to find the parents' gradients.
- We store those results back in the dictionary under each parent's id().
When the loop eventually reaches those parents, their gradients will already be waiting for them in the dictionary.
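To make this concrete, here is one possible shape of that loop (a sketch, not the full solution; it assumes leaf tensors simply have no _op/_inputs set, so we skip them with getattr):
for node in reversed(topo_order):
    out_grad = grads.get(id(node))                  # this node's accumulated gradient
    op = getattr(node, "_op", None)
    if out_grad is None or op is None:
        continue                                    # leaf tensor or node outside the graph
    parent_grads = op.backward(out_grad, node)
    if not isinstance(parent_grads, (list, tuple)):
        parent_grads = (parent_grads,)
    for parent, pgrad in zip(node._inputs, parent_grads):
        if id(parent) in grads:
            grads[id(parent)] = grads[id(parent)] + pgrad   # accumulate from multiple children
        else:
            grads[id(parent)] = pgrad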
File: babygrad/tensor.py
Write the backward method inside the Tensor class.
# imports
class Tensor:
    # ...existing code...

    def backward(self, grad=None):
        grads = {}
        if grad is None:
            # This is the upstream gradient of the output.
            # The output's gradient must be ones.
            grads[id(self)] = Tensor(np.ones_like(self.data))

        topo_order = []
        visited = set()

        def build_topo(node):
            # your code
            ...

        build_topo(self)

        for node in reversed(topo_order):
            # get the out_grad of the node
            out_grad = grads.get(id(node))
            # parent_grads = node._op.backward(out_grad, node)
            # for each parent inside node._inputs:
            #     accumulate grads[id(parent)] (add the new gradient to any existing one)