Chapter 3: Automatic Differentiation


Yay! We will do some magic here. Automatic Differentiation (or Autograd) is the heart of every deep learning library. It allows us to calculate how much every single parameter in our model contributed to the final error, without us having to write the math by hand for every new model.

How does PyTorch do this?

PyTorch's torch.autograd engine works the same way — it records operations on tensors into a computation graph and walks it backward to compute gradients. See PyTorch Autograd documentation.

The folder structure after the previous chapter looks like this:

project/
├─ .venv/          # virtual environment folder
├─ babygrad/
│   ├─ __init__.py
│   ├─ tensor.py   #Tensor code
│
├─ examples/        # example scripts using babygrad
└─ tests/           # unit tests

3.1 Operations

What are Operations?

Operations are mathematical functions (like addition, multiplication, or matrix multiplication) that transform one or more tensors into a new tensor.

Each Function class will have a forward method that computes the result and a backward method that computes the gradients.

Let's define a base class.

File: babygrad/ops.py

from .tensor import Tensor, NDArray
import numpy as np

class Function:
    def __call__(self, *args):
        raise NotImplementedError()
    def forward(self, *args):
        """Computes the forward pass of the operation.
        Args:
            *args: One or more NumPy arrays
        """
        raise NotImplementedError()
    def backward(self, out_grad, node):
        """Calculates backward pass (gradients)
        Args:
            out_grad: upstream gradient flowing from output to input
            node: Value object holding inputs from forward pass
        """
        raise NotImplementedError()

NDArray is the type alias for np.ndarray that we defined in tensor.py in Chapter 2. It's used in type hints to show that forward methods work on raw NumPy arrays, not Tensors.

This is a simple base class for all our Functions. Every operation we write will be a subclass of Function.

What does the forward method return? NumPy arrays! Should we wrap the output in a Tensor?

Yes, because this output will be used for further calculations and is part of the computation graph.

So the forward method will compute the raw result, and we'll wrap that result in a Tensor before handing it back.

Each mathematical operation (addition, multiplication, etc.) will need its own forward method, but the wrapping logic is the same for all of them.

Since the wrapping logic is identical for every operation, we can put it in one shared method instead of duplicating it in every forward(). We'll use __call__ for this.

__call__ is a Python special method that makes an object callable like a function. When you write Add()(a, b), Python runs Add().__call__(a, b).

Here's what __call__ will do:

  1. Accept Tensor inputs
  2. Extract .data (the numpy arrays) from each input
  3. Call forward() with those numpy arrays
  4. Wrap the result in a new Tensor
  5. Record _op and _inputs for the computation graph
  6. Return the new Tensor

File : babygrad/ops.py

class Function:
    def __call__(self, *args):                          # Takes inputs
        requires_grad = any(t.requires_grad for t in args)
        inputs_data = [t.data for t in args]            # Gets .data
        output_data = self.forward(*inputs_data)        # Calls forward
        # Wrap in Tensor
        output_tensor = Tensor(output_data, requires_grad=requires_grad)
        if requires_grad:
            output_tensor._op = self                    # Save operation
            output_tensor._inputs = args                # Save parents
        return output_tensor

Notice the requires_grad check: if any input requires a gradient, the output must too. This ensures gradient tracking propagates through the entire computation.

Why store _op and _inputs?

The forward method computes the result, but that's only half the story. During the backward pass, we'll need to know which operation created each tensor (that's _op) and which tensors were its inputs (that's _inputs).

Each tensor stores its "parents" and how it was created—like a family tree. This chain of history is what makes automatic differentiation possible: when we compute gradients, we walk backward through the graph, and each tensor knows exactly where to pass its gradients.

Now let's define some concrete operations that subclass Function.

3.2 Forward Pass

The forward pass is straightforward: two inputs go through some operation and produce a single output value.

The operation here could be addition, multiplication, or any of the other functions we define in this chapter.

File : babygrad/ops.py

import numpy as np
from .tensor import Tensor, NDArray

class Add(Function):
    def forward(self, a: NDArray, b: NDArray):
        return a + b

    def backward(self, out_grad, node):
        return out_grad, out_grad

def add(a, b):
    return Add()(a, b)  # uses Function.__call__

class Mul(Function):
    def forward(self, a: NDArray, b: NDArray):
        return a * b

    def backward(self, out_grad, node):
        a, b = node._inputs
        return out_grad * b, out_grad * a

def mul(a, b):
    return Mul()(a, b)

Don't worry about backward yet — we're only focusing on forward for now. You'll see backward defined alongside it because the two always live together, but we'll explain what out_grad is and where it comes from in section 3.5.

Not too hard, right? We just completed the forward passes of Add and Mul.

Why so many operations?

Arithmetic ops like Sub, Div, and Pow are the basic math behind loss functions and weighted sums.

Activation functions like ReLU, Sigmoid, and Tanh go between layers to introduce nonlinearity. Without them, stacking 100 layers would be no better than one.

MatMul is the core of every neural network layer. Every Linear layer you'll build in Chapter 4 is just a matrix multiply plus a bias.

Shape operations like Reshape, Transpose, and BroadcastTo are plumbing. Real data comes in awkward shapes, and these let you reshape, flip, and stretch tensors so the math works out.

Utility functions like Log, Exp, and Sqrt appear in specific places: Log in cross-entropy loss, Exp in softmax, Sqrt in normalization.

You won't use them all at once, but by the end of the tutorial you'll have used every single one.

Exercise 3.2: Write the forward method for these operations.

Each one follows the same pattern as Add and Mul above: take numpy arrays in, return a numpy array out. Use the corresponding numpy function.

Arithmetic (two inputs, like Add and Mul):

  • Sub — subtraction: a - b
  • Div — division: a / b
  • Pow — exponentiation: np.power(a, b)

Unary math (one input):

  • Negate — flip the sign: -a
  • Log — natural logarithm: np.log(a)
  • Exp — exponential: np.exp(a)
  • Sqrt — square root: np.sqrt(a)
  • Abs — absolute value: np.abs(a)

Activation functions (one input — these are what make neural networks nonlinear; without them, stacking layers would be no better than a single layer):

  • ReLU — "Rectified Linear Unit": returns x if positive, 0 otherwise. Formula: max(0, x). Hint: a * (a > 0)
  • Sigmoid — squashes any value into the range (0, 1). Formula: 1 / (1 + e^(-x))
  • Tanh — squashes any value into the range (-1, 1). Use np.tanh(a)
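If you want to check your understanding of the pattern before wiring things into Function subclasses, here is a sketch of two of these forwards on raw NumPy arrays (the standalone function names are just for illustration, not babygrad API):

```python
import numpy as np

# Forward logic only, on raw NumPy arrays. Wrapping each of these in a
# Function subclass follows exactly the same shape as Add and Mul above.
def relu_forward(a):
    # max(0, x): keep positive entries, zero out the rest
    return a * (a > 0)

def sigmoid_forward(a):
    # squash any value into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-a))

x = np.array([-2.0, 0.0, 3.0])
print(relu_forward(x))
print(sigmoid_forward(np.array([0.0])))   # sigmoid(0) = 0.5
```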

Don't worry about Reshape, Transpose, BroadcastTo, Summation, or MatMul yet — those are more involved and are covered step-by-step in sections 3.4.1–3.4.5 below.

3.3 Integrating Functions inside the Tensor Class

We've written forward methods for a few functions, but how do we actually use them? How do we link them to the Tensor class?

We need to overload Python's operator methods. This means that whenever we perform an operation like x + y between two Tensor objects, Python will use our Function class (e.g., Add) under the hood instead of its default addition.

The same approach applies to other operations (*, -, /, etc.).

File : babygrad/tensor.py

class Tensor:
    def __add__(self, other):
        """Addition: a + b"""
        from .ops import Add
        if not isinstance(other, Tensor):
            other = Tensor(other)
        return Add()(self, other)

    def __radd__(self, other):
        """Right addition: 5 + tensor"""
        return self.__add__(other)
    

Now we can add two Tensors without any problem. tensor_a + tensor_b will just work.

from babygrad import Tensor
a = Tensor([1,2,3])
b = Tensor([2,3,4])
c = a+b

print(c.data)       # output data
>>> [3. 5. 7.]
print(c._op)        # which operation ? '+'
>>> <babygrad.ops.Add object at 0x00000248BAB0A600>
print(c._inputs)    #who are the parents? [a,b]
>>> [Tensor([1. 2. 3.], requires_grad=True), Tensor([2. 3. 4.],
     requires_grad=True)]

File: babygrad/tensor.py

Exercise 3.3: Implement operator methods.

Right now, using an operation means writing Add()(a, b) every time. Python's dunder methods let you wire these ops directly to operators like +, *, and -, so you can write natural expressions like a + b instead.

Using the __add__ example above as a template, write the following methods inside the Tensor class:

  • __mul__ — uses Mul
  • __pow__ — uses Pow
  • __sub__ — uses Sub
  • __truediv__ — uses Div
  • __neg__ — uses Negate

Each follows the same pattern: wrap the other operand in a Tensor if needed, then call the op.

We'll add __matmul__, reshape, transpose, sum, and broadcast_to later, after we implement those ops in sections 3.4.1–3.4.5.

3.4 Backward Pass

In the Forward Pass, data flows from inputs to outputs to get a result. In the Backward Pass, we go in reverse. We start with the final error and walk backward through our family tree to figure out: "How much is each parent responsible for this error?"

What is a gradient?

A measure of how much the final error changes when we change a specific input or weight.

I'm rusty on derivatives — what does "how much it changes" mean?

If you're rusty on derivatives, 3Blue1Brown's The paradox of the derivative is an excellent 17-minute refresher. It builds the intuition for what "rate of change" means without assuming you remember the formulas.

Let's walk through a concrete example.

a = babygrad.Tensor([2, 3], dtype="float32")
b = babygrad.Tensor([4, 5], dtype="float32")
w = babygrad.Tensor([3, 2], dtype="float32")

c = a * b    # step 1: multiply → c = [8, 15]
d = c * w    # step 2: multiply again → d = [24, 30]

We want to know: if we nudge a slightly, how much does d change? That's the gradient of d with respect to a.

Since these are tensors, "nudge" means one element at a time. The gradient is a tensor of the same shape as a, where each element answers "if I change this element, how much does the output change?"

Let's check by hand. Nudge a[0] from 2 to 2.01. Then c[0] = 2.01 × 4 = 8.04, and d[0] = 8.04 × 3 = 24.12, so the output moved from 24 to 24.12.

We nudged by 0.01 and the output moved by 0.12. The gradient for a[0] is 0.12 / 0.01 = 12. That's b[0] × w[0] = 4 × 3 = 12! The change in a got amplified by b in the first multiply, then amplified again by w in the second multiply.

Similarly, a[1]'s gradient is b[1] × w[1] = 5 × 2 = 10. So the full gradient is [12, 10].

What does "nudge a tensor" actually mean? What if they're different shapes?

Each element of the gradient tracks one element of a. We nudge one element, hold everything else constant, and measure how the output changes.

What if the tensors have different shapes? For example, summing [1, 2, 3] to a single number 6 gives a gradient of [1, 1, 1] — each element contributed equally. The rule: the gradient always has the same shape as the thing you're differentiating with respect to. That's why we need shape-changing ops like Summation, BroadcastTo, and Reshape in sections 3.4.1–3.4.5.

How do we compute [12, 10] systematically (without nudging by hand)? The chain rule says: break the chain of operations into small steps, compute each step's derivative, and multiply them together.

a doesn't affect d directly — it goes through c first. So we ask two smaller questions:

  1. If we nudge c, how much does d change? Since d = c * w, the answer is w = [3, 2] — each element of c gets scaled by the corresponding element of w.
  2. If we nudge a, how much does c change? Since c = a * b, the answer is b = [4, 5] — each element of a gets scaled by the corresponding element of b.

Multiply them together: [3, 2] × [4, 5] = [12, 10]. Same answer we got by hand!

$$\frac{\partial d}{\partial a} = \frac{\partial d}{\partial c} \cdot \frac{\partial c}{\partial a} = w \cdot b = [3,2] \cdot [4,5] = [12, 10]$$
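You can verify this result numerically with a quick finite-difference check in plain NumPy (a standalone sketch, not part of babygrad):

```python
import numpy as np

a = np.array([2.0, 3.0])
b = np.array([4.0, 5.0])
w = np.array([3.0, 2.0])

def f(a):
    # the forward chain from the example: d = (a * b) * w
    return (a * b) * w

# Nudge each element of a and measure how the matching output element moves
eps = 1e-6
grad = np.zeros_like(a)
for i in range(a.size):
    nudged = a.copy()
    nudged[i] += eps
    grad[i] = (f(nudged)[i] - f(a)[i]) / eps

print(np.round(grad))   # [12. 10.]
```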

What about b? Same idea, separate formula. When we ask "how does changing a affect d?", we hold b constant — it's irrelevant to that question. b gets its own chain:

$$\frac{\partial d}{\partial b} = \frac{\partial d}{\partial c} \cdot \frac{\partial c}{\partial b} = w \cdot a = [3,2] \cdot [2,3] = [6, 6]$$

Why does this work?

Each small derivative (like "how does c change when I change a?") is called a local derivative — it only involves one operation. The "magic" of autograd is that each Function class only needs to know its own local derivative. When we multiply them together via the chain rule, we get the global gradient — how the final output depends on any input, no matter how many operations are in between.

This is how a 100-layer neural network becomes tractable: we never solve the whole thing at once, just one operation at a time.

3Blue1Brown has two excellent videos on this: Visualizing the chain rule builds the intuition with animations, and Backpropagation calculus shows how it applies to neural networks.

Walking backward step by step

I'm lost — is there a video that explains this?

3Blue1Brown's What is backpropagation really doing? walks through the same idea — starting at the output and working backward — with animations. It's 13 minutes and covers exactly what we're about to do in code.

In practice, we compute gradients by starting at the output and working backward one node at a time. Here's the computation graph for our example:

Step 1: c = a × b

a = [2, 3] and b = [4, 5] flow into × to produce c = [8, 15]. Looking at this single × op in isolation: a.grad = b = [4, 5] and b.grad = a = [2, 3].

Step 2: d = c × w — now chain two operations together:

c = [8, 15] and w = [3, 2] flow into × to produce d = [24, 30]. Locally, c.grad = w = [3, 2]; then the chain rule gives a.grad = [3, 2] × [4, 5] = [12, 10].

That incoming gradient has a name: the upstream gradient. In our code, it's the out_grad parameter. Every backward method does the same thing: take out_grad, multiply by the local derivative, pass the result to each input.
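Here is the same backward walk written out numerically in plain NumPy, with out_grad starting as ones at the output:

```python
import numpy as np

a = np.array([2.0, 3.0]); b = np.array([4.0, 5.0]); w = np.array([3.0, 2.0])
c = a * b                     # forward step 1
d = c * w                     # forward step 2

out_grad = np.ones_like(d)    # gradient of d with respect to itself
c_grad = out_grad * w         # Mul's backward: out_grad times the other input
a_grad = c_grad * b           # chain into the first multiply
b_grad = c_grad * a

print(a_grad, b_grad)         # [12. 10.] [6. 6.]
```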

Let's look at how this translates to code for Add and Mul:

class Add(Function):
    def forward(self, a: NDArray, b: NDArray):
        return a + b

    def backward(self, out_grad, node):
        # For a + b:
        #   if a increases by 1, output increases by 1 → local derivative is 1
        #   if b increases by 1, output increases by 1 → local derivative is 1
        # Multiply out_grad by each local derivative:
        return out_grad * 1, out_grad * 1  # simplifies to: out_grad, out_grad

class Mul(Function):
    def forward(self, a: NDArray, b: NDArray):
        return a * b

    def backward(self, out_grad, node):
        a, b = node._inputs
        # For a * b:
        #   if a increases by 1, output increases by b → local derivative is b
        #   if b increases by 1, output increases by a → local derivative is a
        # Multiply out_grad by each local derivative:
        return out_grad * b, out_grad * a

Let's plug in real numbers to see why Mul works that way. Say a = 4 and b = 3, so a * b = 12. Nudge a from 4 to 4.01 and the output becomes 4.01 × 3 = 12.03, so the output moved by b × 0.01 and a's local derivative is b = 3. Nudge b from 3 to 3.01 instead and the output becomes 4 × 3.01 = 12.04, so b's local derivative is a = 4.

Then multiply each by out_grad (the gradient arriving from upstream) and return both. That's all backward does.

For Add it's even simpler: nudge either input by 0.01 and the output moves by exactly 0.01. Local derivative is always 1, so you just pass out_grad straight through.

The pattern is always the same: out_grad × local_derivative. Every backward method you write will follow this pattern.

File: babygrad/ops.py

Exercise 3.4: Implement Backward Pass

The forward pass computes results, but without gradients the network cannot learn. The backward pass is what lets the optimizer adjust every weight in the right direction during training.

Write the backward method for all these classes:

  • Mul
  • Sub
  • Div
  • Pow
  • Negate
  • Log
  • Exp
  • Relu
  • Sigmoid
  • Tanh
  • Sqrt
  • Abs

PyTorch's autograd.Function

In PyTorch, custom operations extend torch.autograd.Function with static forward and backward methods — the same pattern we're building here. See torch.autograd.Function.

Guidelines:

  • Do not use NumPy functions directly in the backward pass.
  • You can (and should) reuse the functions you implemented in the forward pass, e.g., add(), mul(), etc.
  • Example:
def divide(a, b):
    return Div()(a, b)

You can use divide() freely in the backward pass of another operation.

  • Make sure the gradients have the same shape as the corresponding input Tensor, especially for operations involving broadcasting or reshaping.
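A finite-difference checker is a handy way to sanity-check your backward implementations against the raw forward math (a standalone NumPy sketch; numeric_grad is a hypothetical helper, not part of babygrad):

```python
import numpy as np

def numeric_grad(f, x, eps=1e-6):
    """Finite-difference gradient of sum(f(x)) with respect to x."""
    g = np.zeros_like(x)
    for i in range(x.size):
        xp = x.copy(); xp.flat[i] += eps
        xm = x.copy(); xm.flat[i] -= eps
        g.flat[i] = (f(xp).sum() - f(xm).sum()) / (2 * eps)
    return g

x = np.array([0.5, 1.5, 2.5])
analytic = np.exp(x)                        # d/dx exp(x) = exp(x)
numeric = numeric_grad(np.exp, x)
print(np.allclose(analytic, numeric, atol=1e-4))   # True
```

Compare your op's backward output against numeric_grad on the same input; if they disagree, the backward formula (or its shape handling) has a bug.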

The above functions are easy to implement. Now let's tackle the more interesting ones.

3.4.1 Reshape

Reshape is the simplest shape-changing operation. In the forward pass, we take the data and arrange it into a new shape (e.g., turning a \(1 \times 6\) vector into a \(2 \times 3\) matrix). Under the hood, it's not creating new memory—it's just changing how it views that data.

$$\begin{aligned} \text{Original matrix } A &= \begin{pmatrix} 1 & 2 & 3 & 4 \\ 5 & 6 & 7 & 8 \\ 9 & 10 & 11 & 12 \end{pmatrix}_{3 \times 4} \\[1em] \text{Reshape to } B &= \begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \\ 10 & 11 & 12 \end{pmatrix}_{4 \times 3} \\[1em] \text{or } C &= \begin{pmatrix} 1 & 2 \\ 3 & 4 \\ 5 & 6 \\ 7 & 8 \\ 9 & 10 \\ 11 & 12 \end{pmatrix}_{6 \times 2} \end{aligned}$$

We need to convert the shape of out_grad back to the shape of A. To pass this gradient back to the input, we must ensure the shapes match. Since reshaping doesn't change the values—only their positions—the backward pass is just a reshape in the opposite direction.

Exercise 3.5

Neural network layers often output data in a different shape than the next layer expects -- for example, a convolutional layer produces a 4D tensor that must be flattened to 2D before a fully connected layer can use it. Reshape handles these shape conversions.

Let's write the Reshape class.


class Reshape(Function):
    def __init__(self, shape):
        self.shape = shape

    def forward(self, a):
        # use np.reshape or a.reshape

    def backward(self, out_grad, node):
        a = node._inputs[0]
        #Your solution

def reshape(a, shape):
    return Reshape(shape)(a)

We can use np.reshape or a.reshape, since a is a NumPy array. Can we use reshape in backward too?
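A quick NumPy sketch of the round trip shows why the backward is just another reshape:

```python
import numpy as np

a = np.arange(12).reshape(3, 4)        # pretend this is the input's .data
out = a.reshape(4, 3)                  # forward: same 12 values, new view
out_grad = np.ones((4, 3))             # upstream gradient, shaped like out
grad_a = out_grad.reshape(a.shape)     # backward: reshape back to the input's shape
print(grad_a.shape)                    # (3, 4)
```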

Now go back to tensor.py and add a reshape method to the Tensor class, following the same pattern as __add__.


3.4.2 Transpose

Transpose flips the axes of a tensor. In a 2D matrix, rows become columns and columns become rows.

$$\begin{aligned} A &= \begin{pmatrix} a & b & c \\ d & e & f \end{pmatrix}_{2 \times 3} \\[1em] A^T &= \begin{pmatrix} a & d \\ b & e \\ c & f \end{pmatrix}_{3 \times 2} \end{aligned}$$

The input axes can either be None, in which case we swap the last two axes by default, or a tuple giving the new ordering of the axes.

Backward Pass

What about the backward pass? For a 2D matrix, the logic is simple: \((A^T)^T = A\). To get the gradient back to the original shape, we just transpose it again.

$$(A^T)^T = A$$

For the backward pass, a simple transpose will work for 2D matrices. For ND matrices, we need a slightly different approach.

Let's say we had a matrix of shape (3, 2, 5) and transposed it with axes=(1, 2, 0), producing a result of shape (2, 5, 3).

For any ND matrix that has been transposed, we just need to know where the original axes ended up.

Current axis order: (1, 2, 0) with shape (2, 5, 3). Where did each original axis go?

The result: (2, 0, 1). If we apply transpose to out_grad using this result, we get the original shape back.

How do we compute this? np.argsort gives us exactly what we need. If we apply np.argsort((1, 2, 0)), we get (2, 0, 1).
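A small NumPy experiment confirms that np.argsort gives the inverse permutation:

```python
import numpy as np

a = np.zeros((3, 2, 5))
axes = (1, 2, 0)
out = np.transpose(a, axes)                        # shape becomes (2, 5, 3)
inverse = tuple(int(i) for i in np.argsort(axes))  # (2, 0, 1): undoes axes
back = np.transpose(out, inverse)
print(out.shape, inverse, back.shape)              # (2, 5, 3) (2, 0, 1) (3, 2, 5)
```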

Exercise 3.6

Transpose is needed for the backward pass of matrix multiplication -- computing gradients for MatMul requires transposing matrices to align dimensions correctly.

Let's write the Transpose class.

class Transpose(Function):
    def __init__(self, axes: Optional[tuple] = None):
        self.axes = axes

    def forward(self, a):
        # use np.transpose or np.swapaxes
        # Case 1: self.axes is None (swap the last two axes)
        # Case 2: self.axes is not None (transpose the specified axes)

    def backward(self, out_grad, node):
        # your solution
        # If axes is None, just transpose out_grad again
        # Hint: for ND arrays, use np.argsort to find the inverse permutation


def transpose(a, axes=None):
    return Transpose(axes)(a)

Use np.argsort to find the axes order.

Now add a transpose method to Tensor.

3.4.3 Summation

Sum the elements along a given axis.

$$\begin{aligned} A &= \begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{pmatrix}_{2 \times 3} \\[1em] \text{Sum all elements: } & \sum_{i,j} A_{ij} = 21 \\[1em] \text{Sum over rows (axis=0): } & \begin{pmatrix} 1+4 & 2+5 & 3+6 \end{pmatrix} = \begin{pmatrix} 5 & 7 & 9 \end{pmatrix} \\[1em] \text{Sum over columns (axis=1): } & \begin{pmatrix} 1+2+3 \\ 4+5+6 \end{pmatrix} = \begin{pmatrix} 6 \\ 15 \end{pmatrix} \end{aligned}$$

Notice that whatever axis we sum over, that axis vanishes (or becomes 1).

Backward pass

In the backward pass, out_grad is smaller than the original input. To pass gradients back to the parents, we have to stretch the gradient back to the original shape.

Think of it like this: every input element contributed to exactly one output element, so each input element's gradient is the gradient of the output element it fed into, copied back.

Example with shape (3, 3): summing over axis=1 turns a (3, 3) input into a (3,) output, so out_grad has shape (3,). To send it back, reshape out_grad to (3, 1), restoring the summed axis with size 1, and multiply by ones of shape (3, 3) to stretch it.
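Here's that reshape-and-stretch logic as a standalone NumPy sketch:

```python
import numpy as np

a = np.ones((3, 3))
out = a.sum(axis=1)                        # forward: shape (3, 3) -> (3,)
out_grad = np.array([1.0, 2.0, 3.0])       # upstream gradient, shape (3,)
# backward: restore the summed axis as size 1, then stretch to the input shape
grad_a = out_grad.reshape(3, 1) * np.ones_like(a)
print(grad_a)
# [[1. 1. 1.]
#  [2. 2. 2.]
#  [3. 3. 3.]]
```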

Exercise 3.7

Summation collapses a tensor along an axis -- you'll need it for computing loss (reducing a batch of errors to a single number) and for the backward pass of BroadcastTo.

Let's write the Summation class.

class Summation(Function):
    def __init__(self, axes: Optional[tuple] = None):
        self.axes = axes
    def forward(self, a):
        # can we use np.sum?
    def backward(self, out_grad, node):
        #your solution


def summation(a, axes=None):
    return Summation(axes)(a)

You can implement backward using the reshape-and-multiply approach described above. (Later, after completing Exercise 3.8, you could also use broadcast_to.)

Now add a sum method to Tensor.

3.4.4 BroadcastTo

In an ideal world, all tensors would have the same shape. But in practice, we often need to broadcast a matrix of shape (3, 1) to (3, 3).

$$\begin{bmatrix} a \\ b \\ c \end{bmatrix} \rightarrow \begin{bmatrix} a & a & a \\ b & b & b \\ c & c & c \end{bmatrix}$$

We "stretch" the smaller tensor to match the larger one. During the backward pass, we do the exact opposite.

If one value was used three times to produce the output, it is responsible for the error at all three of those locations. Therefore, we sum the out_grad along the axes that were stretched.

When broadcasting, NumPy does two things: it prepends dimensions of size 1 to the front of the smaller array until the dimension counts match, and then it stretches every size-1 axis to the target size.

Handling Prepending: If out_grad has more dimensions than the input, the forward pass added dimensions to the front. Sum out_grad along axis 0 until the number of dimensions matches.

Handling Stretching: Once dimensions match (e.g., both are 2D), you might still have a shape mismatch. If the input was (3, 1) but the gradient is (3, 10), sum along axis 1 with keepdims=True to collapse the stretched axis back to size 1.
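The stretching case can be checked in plain NumPy with a small sketch:

```python
import numpy as np

a = np.array([[1.0], [2.0], [3.0]])            # shape (3, 1)
out = np.broadcast_to(a, (3, 10))              # forward: stretch to (3, 10)
out_grad = np.ones((3, 10))                    # upstream gradient
grad_a = out_grad.sum(axis=1, keepdims=True)   # backward: collapse the stretched axis
print(grad_a.shape)                            # (3, 1)
print(grad_a.ravel())                          # [10. 10. 10.]
```

Each of a's three values was copied into ten output positions, so each collects ten gradient contributions.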

Exercise 3.8

Broadcasting lets you add a bias vector to every row in a batch, or multiply a whole matrix by a scalar -- operations that happen constantly in neural networks.

Let's write the BroadcastTo class.


class BroadcastTo(Function):
    def __init__(self, shape):
        self.shape = shape

    def forward(self, a):
        # Can we use np.broadcast_to ?
    def backward(self, out_grad, node):
        # your solution

def broadcast_to(a, shape):
    return BroadcastTo(shape)(a)

You can implement backward using the summation and reshape methods.

Now add a broadcast_to method to Tensor.

3.4.5 Matmul

Matrix multiplication is one of the most important operations in deep learning.

In the forward pass, we compute \(C = A \cdot B\). How do we distribute the gradient (\(out\_grad\)) back to \(A\) and \(B\)?

Finding the Gradient for \(A\)

We know the result, grad_a, must match the shape of \(A\), which is \((M, N)\).

$$\frac{\partial L}{\partial A} = out\_grad \cdot B^T$$

Finding the Gradient for \(B\)

We know grad_b must match the shape of \(B\), which is \((N, P)\).

$$\frac{\partial L}{\partial B} = A^T \cdot out\_grad$$
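A quick shape check in NumPy confirms both formulas produce gradients with the right shapes:

```python
import numpy as np

M, N, P = 2, 3, 4
A = np.random.randn(M, N)
B = np.random.randn(N, P)
out_grad = np.ones((M, P))     # same shape as C = A @ B

grad_A = out_grad @ B.T        # (M, P) @ (P, N) -> (M, N), matches A
grad_B = A.T @ out_grad        # (N, M) @ (M, P) -> (N, P), matches B
print(grad_A.shape, grad_B.shape)   # (2, 3) (3, 4)
```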

Exercise 3.9

Matrix multiplication is how neural network layers transform their inputs -- every Linear layer you'll build in Chapter 4 is just a MatMul plus a bias.

Let's write the MatMul class.


class MatMul(Function):
    def forward(self, a, b):
        # can you use np.matmul?
    def backward(self, out_grad, node):
        # your solution


def matmul(a, b):
    return MatMul()(a, b)

You can implement backward using the matmul and transpose methods (plus summation if you support batched inputs that broadcast).

Now add __matmul__ (the @ operator) and a matmul method to Tensor.

3.4.6 Scalar Operations

Suppose we have

x = Tensor([1, 2, 3])
y = x + 1

Here x is a Tensor and 1 is a scalar. We need a way to handle these mixed-type operations.

Scalar operations are straightforward. The gradient only flows to the tensor input; the scalar is treated as a constant.


class AddScalar(Function):
    def __init__(self, scalar):
        self.scalar = scalar

    def forward(self, a: NDArray):
        return a + self.scalar

    def backward(self, out_grad: Tensor, node: Tensor):
        return out_grad


def add_scalar(a, scalar):
    return AddScalar(scalar)(a)

That's it!

Exercise 3.10: Implement Forward and Backward Pass

Scalar operations handle cases like x + 1 or x * 0.5 -- when one operand is a plain number, not a Tensor. You'll use these for scaling learning rates and normalizing values.

Write the forward and backward method for all these classes:

  • MulScalar
  • DivScalar
  • PowerScalar

3.5 Implementing Backward Pass in the Tensor Class

We've built a Tensor class that supports forward and backward operations. We've already created the bridge between our forward operations and the Tensor class. Now we need to create the bridge for backward operations. Each forward function in ops had its own method in the Tensor class—do we do the same for backward?

You might be tempted to write a backward_add() or backward_mul() for every operation, but that would be a maintenance nightmare.

Do we really need multiple backward() methods? No. A single backward() method in the Tensor class is sufficient.

When you call out.backward(), the Tensor simply walks the graph in reverse order. At each step, it asks the parent's Function: "Here is the gradient from above; what is my share of the responsibility?"

By convention, the gradient of the final output starts as 1:

# Start of backpropagation: every node's grad starts as None
output.grad = 1     # the output's gradient with respect to itself is 1
output.backward()

Let's see how to implement the backward() method in Tensor.

File : babygrad/tensor.py


class Tensor:
    def backward(self, grad=None):
        grads = {}
        if grad is None:
            #This is the upstream gradient.
            grads[id(self)] = Tensor(np.ones_like(self.data))
            # The output grad must be ones.


        topo_order = []
        visited = set()
        def build_topo(node):
            #your code

        build_topo(self)

        for node in reversed(topo_order):
            #your code 

To implement this, we need to: (1) find every node in the computation graph and order them so that each node comes after its inputs (a topological sort), and (2) walk that order in reverse, computing each node's parents' gradients and accumulating them.

Let's start simple. First, we'll find all the nodes in the computation graph using depth-first search.

Depth-first search is an algorithm that explores a graph by going as deep as possible before backtracking.

def dfs(node, order, visited):
    if node in visited:
        return  # already processed

    for parent in node._inputs:
        dfs(parent, order, visited)

    visited.add(node)
    order.append(node)
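Here's the DFS in action on a tiny hand-built graph (Node is a hypothetical stand-in for Tensor, with just a name and _inputs):

```python
# A minimal stand-in node class, just to show the ordering DFS produces.
class Node:
    def __init__(self, name, inputs=()):
        self.name, self._inputs = name, inputs

a = Node("a"); b = Node("b")
c = Node("c", (a, b))        # c was produced from a and b
d = Node("d", (c,))          # d was produced from c

order, visited = [], set()
def dfs(node, order, visited):
    if node in visited:
        return  # already processed
    for parent in node._inputs:
        dfs(parent, order, visited)
    visited.add(node)
    order.append(node)

dfs(d, order, visited)
print([n.name for n in order])   # ['a', 'b', 'c', 'd'] -- parents before children
```

Reversing this list gives exactly the order we need for backpropagation: children before parents.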

When backpropagating, a node in the computation graph might receive gradient contributions from multiple paths. That means a node's .grad might temporarily be a list of tensors—one from each path.

Example:

a.grad = [Tensor([1, 2, 3]), Tensor([3, 4, 5]), Tensor([4, 5, 6])]

This is fine temporarily, but eventually we need a single tensor representing the total gradient. We need to accumulate the gradients.

You might have noticed grads = {} in the backward code. Why do we use a dictionary?

In our backward loop, when we're at a child node, we calculate gradients for its parents. But we don't store them in parent.grad immediately—those parents might receive more gradients from other children. The parents will be processed later in the loop, so we need somewhere to hold their gradients until then.

The grads dictionary is our temporary storage, a ledger for the entire backward pass, keyed by each tensor's id().

In the loop, when we're at a child node, we look up its out_grad in the ledger, call its _op.backward(out_grad, node) to get each parent's share, and add each share into the ledger under that parent's id.

When the loop eventually reaches those parents, their gradients will already be waiting in the dictionary.
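The accumulation step can be sketched like this (accumulate is a hypothetical helper; you can also inline the if/else directly in your loop):

```python
import numpy as np

grads = {}
def accumulate(grads, key, g):
    # record a new gradient contribution, summing with anything already stored
    grads[key] = g if key not in grads else grads[key] + g

# two paths deliver gradients to the same node
accumulate(grads, "a", np.array([1.0, 2.0, 3.0]))
accumulate(grads, "a", np.array([3.0, 4.0, 5.0]))
print(grads["a"])   # [4. 6. 8.]
```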


Exercise 3.11: Implement the backward method

This is the culmination of the chapter -- you're building the engine that walks the computation graph and calls each operation's backward method in the right order. After this, your tensors can actually learn.

Write the backward method in the Tensor class.

# imports
class Tensor:
    #code
    def backward(self, grad=None):
        grads = {}
        if grad is None:
            #This is the upstream gradient.
            #The output grad must be ones.
            grads[id(self)] = Tensor(np.ones_like(self.data))


        topo_order = []
        visited = set()
        def build_topo(node):
            #your code

        build_topo(self)

        for node in reversed(topo_order):
            #get the out_grad of the node
            out_grad = grads.get(id(node))
            # parent_grads = call node._op.backward(out_grad,node)

            # for each parent inside node._inputs:
            # store grad[parent_id] = parent_grads

PyTorch's backward pass

PyTorch's tensor.backward() does essentially the same thing: it topologically sorts the graph and walks it in reverse, calling each node's backward function. See Autograd Mechanics for a deeper look at how PyTorch handles this under the hood.
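To tie the whole chapter together, here is a minimal, self-contained sketch of the engine with just Mul wired up, verifying the [12, 10] gradient from section 3.4 (a condensed version of what you've built; details may differ from your own solution):

```python
import numpy as np

class Tensor:
    def __init__(self, data, requires_grad=True):
        self.data = np.asarray(data, dtype=np.float32)
        self.requires_grad = requires_grad
        self.grad = None
        self._op = None        # which Function produced this tensor
        self._inputs = []      # the parent tensors

    def __mul__(self, other):
        return Mul()(self, other)

    def backward(self):
        # the output's gradient with respect to itself is ones
        grads = {id(self): Tensor(np.ones_like(self.data))}

        topo_order, visited = [], set()
        def build_topo(node):
            if id(node) in visited:
                return
            visited.add(id(node))
            for parent in node._inputs:
                build_topo(parent)
            topo_order.append(node)
        build_topo(self)

        for node in reversed(topo_order):
            out_grad = grads.get(id(node))
            node.grad = out_grad
            if node._op is None or out_grad is None:
                continue  # leaf node or unreachable: nothing to propagate
            for parent, pg in zip(node._inputs, node._op.backward(out_grad, node)):
                if id(parent) in grads:   # accumulate if gradients arrive twice
                    grads[id(parent)] = Tensor(grads[id(parent)].data + pg.data)
                else:
                    grads[id(parent)] = pg

class Function:
    def __call__(self, *args):
        requires_grad = any(t.requires_grad for t in args)
        out = Tensor(self.forward(*[t.data for t in args]), requires_grad)
        if requires_grad:
            out._op, out._inputs = self, args
        return out

class Mul(Function):
    def forward(self, a, b):
        return a * b
    def backward(self, out_grad, node):
        a, b = node._inputs
        return out_grad * b, out_grad * a

a = Tensor([2.0, 3.0]); b = Tensor([4.0, 5.0]); w = Tensor([3.0, 2.0])
d = (a * b) * w
d.backward()
print(a.grad.data)   # [12. 10.]
print(b.grad.data)   # [6. 6.]
print(w.grad.data)   # [8. 15.]
```

Running it reproduces the chain-rule gradients we computed by hand, which is a good smoke test for your full implementation as well.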
