Chapter 4: nn

If we want to build a neural network, we don't want to define each weight manually. In this chapter we will create the nn.Module class and build various layers on top of it.

Note

This is how our folder structure currently looks. In this chapter we will work inside babygrad/nn.py.

project/
|- .venv/                   # virtual environment
|- babygrad/                # source code
|   |- __init__.py
|   |- ops.py
|   |- tensor.py
|- examples/                # examples
|   |- simple_mnist.py
|- tests/                   # tests

4.1 Parameter

What is a Parameter?

A Parameter is a Tensor containing a model's learnable weights.

In our Tensor class, we are now adopting the standard convention: requires_grad will default to False. This is a sensible choice because most tensors are intermediate results, and not tracking their gradients saves memory.

This means we need an explicit way to mark a tensor as learnable. We do this by wrapping it in the Parameter class, which automatically sets requires_grad=True.

Why a separate class just for something this simple?

FILE: babygrad/nn.py

from babygrad import Tensor

class Parameter(Tensor):
    """
    A special Tensor that tells a Module it is a learnable parameter.
    Example:
        >>> # A regular tensor - will NOT be trained by the optimizer.
        >>> self.some_data = Tensor([1, 2, 3])
        >>>
        >>> # A parameter - will be found and trained by the optimizer!
        >>> self.weights = Parameter(Tensor.randn(10, 5))
    """
    def __init__(self, data, *args, **kwargs):
        # Parameters always require gradients.
        kwargs['requires_grad'] = True
        super().__init__(data, *args, **kwargs)

a = Tensor([1, 2, 3])
print(a.requires_grad)  # False

b = Parameter(a)
print(b.requires_grad)  # True

As you can see, the Parameter class does only one thing: it marks a tensor as a learnable parameter.

Now that we have a way to mark parameters, we need a way to find them. A model can store parameters in several structures: as plain attributes, nested inside lists, tuples, and dictionaries, or inside other layers (Module objects, which we define in the next section).

Let's write a small recursive helper to find all the Parameters of a model.

class Parameter(Tensor):
    #code

def _get_parameters(data):
    """Recursively collect every Parameter found inside `data`."""
    params = []
    if isinstance(data, Parameter):
        return [data]
    if isinstance(data, Module):
        # Module is defined later in this file; recursing into its attributes
        # lets us find the parameters of submodules as well.
        return _get_parameters(data.__dict__)
    if isinstance(data, dict):
        for value in data.values():  # calling _get_parameters recursively
            params.extend(_get_parameters(value))
    if isinstance(data, (list, tuple)):
        for item in data:
            params.extend(_get_parameters(item))
    return params
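
As a quick sanity check (using the same Tensor constructor we used above), the helper digs parameters out of nested containers:

# Parameters hidden inside lists and dicts are still found.
w1 = Parameter(Tensor([1.0, 2.0]))
w2 = Parameter(Tensor([3.0, 4.0]))
state = {"layers": [{"weight": w1}, {"weight": w2}], "name": "mlp"}

found = _get_parameters(state)
print(len(found))  # 2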

4.2 Module

After the first two chapters you should have worked through examples/simple_mnist.py, where we used a SimpleNN class for our model.

class SimpleNN:
    """A simple two-layer neural network."""
    def __init__(self, input_size, hidden_size, num_classes):
        self.W1 = Parameter(np.random.randn(input_size, hidden_size)
                    .astype(np.float32) / np.sqrt(hidden_size))
        self.W2 = Parameter(np.random.randn(hidden_size, num_classes)
                    .astype(np.float32) / np.sqrt(num_classes))
    def forward(self, x: Tensor) -> Tensor:
        """Performs the forward pass of the network."""
        z1 = x @ self.W1 # (8,784) @ (784, 100) -> (8,100)
        a1 = ops.relu(z1)
        logits = a1 @ self.W2 #  (8,100) @ (100,10) -> (8,10)
        return logits
    def parameters(self):
        """Returns a list of all model parameters."""
        return [self.W1, self.W2]

It worked, but we had to manually define the parameters() method to return a list of weights. If you had a 50-layer network, that list would be impossible to maintain.

Every layer (Linear, BatchNorm, Conv) shares the same core needs: it has to store its parameters, expose them (and those of any submodules) to the optimizer, run a forward pass when called, and know whether it is currently training or evaluating.

We define the Module base class to handle all of this automatically.

FILE : babygrad/nn.py

from typing import Any, List

from babygrad import Tensor, ops
class Module:
    """
    Base class for all neural network layers.
    Example:
        class Linear(Module):
            def __init__(self, in_features, out_features):
                super().__init__()
                self.weight = Parameter(np.random.randn(in_features,
                 out_features))

            def forward(self, x):
                return x @ self.weight
    """
    def __init__(self):
        self.training = True

    def parameters(self) -> List[Parameter]:
        """
        Returns a list of all parameters in the module and its submodules.
        """
        # self.__dict__ is a dictionary containing all of the instance's
        # attributes. We pass it to our helper to recursively find all
        # Parameters.
        params = _get_parameters(self.__dict__)
        unique_params = []
        seen_ids = set()
        for p in params:
            if id(p) not in seen_ids:
                unique_params.append(p)
                seen_ids.add(id(p))
        return unique_params
    def forward(self, *args, **kwargs):
        """The forward pass logic that must be defined by subclasses."""
        raise NotImplementedError
    def __call__(self, *args, **kwargs):
        """
        Makes the module callable like a function.
        Example:
            >>> model = Linear(10, 2)
            >>> input_tensor = Tensor.randn(64, 10)
            >>> output = model(input_tensor)  # This calls model.forward(...)
        """
        return self.forward(*args, **kwargs)
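
Here is a minimal sketch of how the base class is meant to be used; it assumes Tensor.randn and the @ operator behave as in the earlier chapters:

class TwoLayer(Module):
    """A toy model: two weight matrices, no manual bookkeeping."""
    def __init__(self):
        super().__init__()
        self.w1 = Parameter(Tensor.randn(4, 8))
        self.w2 = Parameter(Tensor.randn(8, 2))

    def forward(self, x):
        return (x @ self.w1) @ self.w2  # activation omitted to keep the sketch short

model = TwoLayer()
print(len(model.parameters()))   # 2 -- found automatically, no hand-written list
out = model(Tensor.randn(3, 4))  # __call__ dispatches to forward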

Training vs evaluation

We have created the base Module class and could start building layers with it right away, but there is one more behavior the base class needs to handle first.

After training our model, we want to use it for making predictions. This brings up an important distinction between a model's behavior during training and during evaluation (or "inference").

What happens during training? We feed data, get predictions, calculate the loss, and update the weights. During evaluation (inference), we feed data and get predictions, but we do not update the weights.

But is that the only difference? If we just turn off gradient calculations, is that enough?

Not quite. Certain layers behave fundamentally differently depending on whether you are training or evaluating.

Why an explicit eval() mode is critical

Two of the most common layers that require a separate eval mode are:

Dropout: During training, this layer randomly sets some of its inputs to zero. This is a regularization technique to prevent overfitting. During evaluation, you want your model to use its full learned capacity, so Dropout must be turned off.

Batch Normalization: During training, this layer calculates the mean and variance of the current batch of data to normalize it. It also keeps a running average of these statistics. During evaluation, it stops calculating from the current batch and instead uses its saved running averages to normalize the data.

To handle this, our Module class will need to keep track of its current state. We will add a self.training attribute and two methods, train() and eval(), to switch between these states.

First, let's write a helper to find all the Module objects present, mirroring _get_parameters.

def _get_modules(obj) -> list['Module']:
    """
    A simple recursive function that finds all Module objects within any given
    object by searching through its attributes, lists, tuples, and dicts.
    """
    modules = []
    if isinstance(obj, Module):
        return [obj]

    if isinstance(obj, dict):
        for value in obj.values():
            modules.extend(_get_modules(value))

    if isinstance(obj, (list, tuple)):
        for item in obj:
            modules.extend(_get_modules(item))

    return modules

class Module:
    def __init__(self):
        self.training = True
    #code

    def train(self):
        """Puts this module and all of its submodules into training mode."""
        self.training = True
        for m in _get_modules(self.__dict__):
            m.train()   # recurse so nested submodules switch too

    def eval(self):
        """Puts this module and all of its submodules into evaluation mode."""
        self.training = False
        for m in _get_modules(self.__dict__):
            m.eval()

We have also defined _get_modules, which follows the same pattern as _get_parameters. We can now simply call model.train() or model.eval() to switch the whole model, including its submodules, between states.
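
Reusing the TwoLayer sketch from above, switching modes is now a one-liner:

model = TwoLayer()
print(model.training)  # True  -- modules start in training mode

model.eval()           # put the model (and any submodules) into evaluation mode
print(model.training)  # False

model.train()          # and back to training mode
print(model.training)  # True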

Let's start building layers using Module.

4.2.1 ReLU

We will start with simple layers that don't contain any learnable parameters. These layers just apply a mathematical function to the input.

FILE : babygrad/nn.py

Exercise 4.1

Let's write the ReLU class (along with Tanh and Sigmoid).

class ReLU(Module):
    """
    Applies the Rectified Linear Unit (ReLU) function element-wise.
    Example:
        model = nn.Sequential(
            nn.Linear(128, 64),
            nn.ReLU()  # Apply activation after the linear layer
        )
    """
    def forward(self, x: Tensor):
        #your solution


class Tanh(Module):
    def forward(self, x: Tensor):
        #your code


class Sigmoid(Module):
    def forward(self, x: Tensor):
        #your code
Note

Use methods from ops.py

4.2.2 Flatten

What does Flatten do?

Suppose a tensor has shape (2, 3, 4, 5). Flatten reshapes it to (2, 60): it keeps the batch dimension and collapses everything else. Flatten is extremely useful in CNNs, typically right before the fully connected layers.
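
Here is the shape logic in plain NumPy, just for intuition (the exercise itself should go through ops):

import numpy as np

x = np.random.randn(2, 3, 4, 5)
batch_size = x.shape[0]
flat = x.reshape(batch_size, -1)  # -1 lets NumPy infer 3 * 4 * 5 = 60
print(flat.shape)                 # (2, 60)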

FILE : babygrad/nn.py

Exercise 4.2

Let's write the Flatten class.

class Flatten(Module):
    """
    Flattens a tensor by reshaping it to `(batch_size, -1)`.
    Example:
        # A CNN might produce a feature map of shape (32, 64, 7, 7)
        # (batch_size, channels, height, width)
        model = nn.Sequential(
            nn.Conv2d(...),
            nn.ReLU(),
            nn.Flatten(), # Reshapes output to (32, 64 * 7 * 7) = (32, 3136)
            nn.Linear(3136, 10)
        )
    """
    def forward(self, x: Tensor) -> Tensor:
        #your code 
Note

Can we use the reshape method from ops?

So far, none of the exercises have needed a Parameter, because these layers have nothing to learn.

4.2.3 Linear

So far, the layers we've built (ReLU, Flatten, Sigmoid) have all been stateless: they perform a fixed mathematical operation but have no memory or learnable parts.

The simplest stateful layer is the Linear layer (also called a "Dense" or "Fully Connected" layer). It multiplies the input by a weight matrix and adds a bias. The weight and bias are the learnable parameters that the model adjusts during training to get the right answer.

\[ output = x @ Weight + bias \]
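
For example, with a batch of 8 flattened MNIST images (hypothetical sizes), the shapes line up as shown below; the bias of shape (128,) is broadcast across the batch dimension before the addition:

\[ \underbrace{x}_{(8,\;784)} \;@\; \underbrace{\text{weight}}_{(784,\;128)} \;+\; \underbrace{\text{bias}}_{(128,)} \;\rightarrow\; (8,\;128) \]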

FILE : babygrad/nn.py

Exercise 4.3

Let's write the Linear class.

class Linear(Module):
    """
    Applies a linear transformation to the incoming data: y = x @ weight + bias.

    Args:
        in_features (int): Size of each input sample.
        out_features (int): Size of each output sample.
        bias (bool, optional): If set to False, the layer will not
            learn an additive bias. Default: True.
    Shape:
        - Input: `(batch_size, *, in_features)` where `*`
          means any number of additional dimensions.
        - Output: `(batch_size, *, out_features)` where
          all but the last dimension are the same shape as the input.

    Attributes:
        weight (Parameter): The learnable weights of the module of shape
                            `(in_features, out_features)`.
        bias (Parameter):   The learnable bias of the module
                             of shape `(out_features,)`.
    """
    def __init__(self, in_features: int, out_features: int, bias: bool = True,
                 device: Any | None = None, dtype: str = "float32"):
        super().__init__()
        self.in_features = in_features
        self.out_features = out_features

        #your code

    def forward(self, x: Tensor) -> Tensor:
        # Note: x.shape is (batch_size, in_features)
        # self.weight.shape is (in_features, out_features)
        # The result should have shape (batch_size, out_features)
        #your code
Note

Make sure self.bias is broadcast to the output shape before adding.

4.2.4 Sequential

If you have a model with many different layers, like:

class MygoodNetwork(Module):
    def __init__(self):
        super().__init__()
        self.layer1 = Linear(784, 128)
        self.activation1 = ReLU()
        self.layer2 = Linear(128, 10)

    def forward(self, x):
        x = self.layer1(x)
        x = self.activation1(x)
        x = self.layer2(x)
        return x

This works, but the forward method is just a repetitive chain of calls; with 10 layers it would become very verbose. Can we simplify this? Can we just "stack" the layers and have them run in order automatically? The Sequential class solves this by acting as a container: it takes a list of modules and handles the "passing" for you, so the output of layer 1 automatically becomes the input of layer 2.

We will now write the Sequential class, which takes a sequence of modules and chains them together for you.
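
In other words, a Sequential is just function composition applied left to right:

\[ \text{Sequential}(f_1, f_2, f_3)(x) = f_3(f_2(f_1(x))) \]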

Exercise 4.4

Let's write the Sequential class.

class Sequential(Module):
    """
    A container that chains a sequence of modules together.
    Example:
        # A simple 2-layer MLP for MNIST
        model = nn.Sequential(
            nn.Linear(784, 128),
            nn.ReLU(),
            nn.Linear(128, 10)
        )
        logits = model(input_tensor)
    """
    def __init__(self, *modules):
        super().__init__()
        self.modules = modules

    def forward(self, x: Tensor) -> Tensor:
        #your code
        # Pass x through each module in self.modules, in order.
        

4.2.5 Residual

We've seen that Sequential is great for stacking layers. But as we stack more and more layers, the network can start to "forget" the original input. The input has to pass through so many layers that it can get lost.

What if the input didn't always have to pass through the whole network? What if it could skip some layers? Instead of forcing the input \(x\) to be transformed by every single layer, we can let it bypass the layers and add itself back to the result at the end:

\[output = f(x) + x\]

Yes! This is called a Residual Connection. It allows gradients to flow through the "shortcut" path during backpropagation, making it much easier to train very deep networks. A nice property of Residual is that it adds no extra parameters or complexity to the network.

Important
class Residual(Module):
    """
    Creates a residual connection block, which implements `f(x) + x`.
    Example:
        # A simple residual block
        main_path = nn.Sequential(
            nn.Linear(64, 64),
            nn.ReLU()
        )
        res_block = nn.Residual(main_path)
        output = res_block(input_tensor_of_shape_64)
    """
    def __init__(self, fn: Module):
        super().__init__()
        self.fn = fn
    def forward(self, x: Tensor) -> Tensor:
        return self.fn(x) + x
    

4.2.6 Dropout

What is overfitting?

A model becomes too good at memorizing the training data. It learns the specific details and noise of the training set so well that it fails to generalize to new, unseen data.

One way this shows up is that the network becomes dependent on a few specific neurons for its output, while the other neurons barely contribute.

Think of a quiz team: if all team members depend on Alice for the correct answers, the other members have no role to play at all.

So what should we do?

Remove Alice, or remove everyone else?

Neither: sometimes Alice sits out, sometimes someone else sits out, so that everyone starts performing instead of depending on the others.

This is exactly what Dropout does: we force Alice (and everyone else) to "sit out" at random during training, which pushes every neuron to learn useful features on its own.

During training, some neurons are temporarily turned off at random. During testing, everyone participates.
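
The division by \(1 - p\) (often called "inverted dropout") keeps the expected value of each activation unchanged, which is why nothing needs to be rescaled at evaluation time. Each element survives with probability \(1 - p\):

\[ \mathbb{E}[\text{output}_i] = (1 - p) \cdot \frac{x_i}{1 - p} + p \cdot 0 = x_i \]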

Exercise 4.5

Let's write the Dropout class.

class Dropout(Module):
    """
    A regularization layer to help prevent overfitting.

    During training (`.train()` mode), it randomly sets some input
    elements to zero with a probability of `p`. The remaining
    elements are scaled up by `1 / (1 - p)`.

    During evaluation (`.eval()` mode), this layer does nothing.

    Example:
        model = nn.Sequential(
            nn.Linear(128, 64),
            nn.ReLU(),
            nn.Dropout(p=0.2) # Drop 20% of activations during training
        )
    """
    def __init__(self, p: float = 0.5):
        super().__init__()
        self.p = p

    def forward(self, x: Tensor) -> Tensor:
        #your code
        # Create a random binary mask with the same shape as x (use Tensor.randb).
        # Multiply x by this mask and divide the result by (1 - self.p).
Note

If self.training is False, return x unchanged.

4.2.7 LayerNorm

What is an Activation?

An activation is the numerical output of a neuron.

What is Mean?

The Mean is the average value of a set of numbers.

What is Variance?

Variance measures how spread out a set of numbers is around its mean.

What is a Distribution?

A distribution describes how the values in a set of data are spread out.

What is Internal Covariate Shift?

The change in the distribution of a layer's inputs during training.

In deep networks, which consist of many layers, it is important for each layer's input distribution to stay stable.

Why should the input distribution stay stable?

If it is not stable, the learning process for each layer becomes harder.

Why does it become harder? Because each layer is trying to learn from inputs whose distribution keeps shifting underneath it.

So what can we do? We need to stabilize these inputs. That is exactly what LayerNorm does.

\[ \text{normalized activation}_i = \frac{\text{activation}_i - \text{mean of activations}}{\text{standard deviation of activations} + \epsilon} \]

\[ \text{output}_i = \text{scale parameter} \times \text{normalized activation}_i + \text{shift parameter} \]

\[ x_{\text{norm}_i} = \frac{x_i - \text{mean}(x)}{\text{std}(x) + \epsilon} \]

\[ \text{output}_i = \text{weight} \times x_{\text{norm}_i} + \text{bias} \]
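
A tiny worked example for a single row of activations, ignoring \(\epsilon\) and taking weight = 1, bias = 0:

\[ x = [1, 2, 3], \quad \text{mean}(x) = 2, \quad \text{std}(x) = \sqrt{\tfrac{(1-2)^2 + (2-2)^2 + (3-2)^2}{3}} \approx 0.816 \]

\[ x_{\text{norm}} = \frac{[1, 2, 3] - 2}{0.816} \approx [-1.22,\ 0.00,\ 1.22] \]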

Reference: The Original LayerNorm Paper

Title: Layer Normalization Link: https://arxiv.org/abs/1607.06450

Exercise 4.6

Let's write the LayerNorm1d class.

class LayerNorm1d(Module):
    def __init__(self, dim: int, eps: float = 1e-5, device=None,
                 dtype="float32"):
        super().__init__()
        self.dim = dim
        self.eps = eps
        self.weight = Parameter(Tensor.ones(dim, dtype=dtype))
        self.bias = Parameter(Tensor.zeros(dim, dtype=dtype))

    def forward(self, x):
        # x.shape is (batch_size, dim)
        #your code
Note

You can use summation, reshape, broadcast_to from ops.

Remember to normalize over the features dimension (the last dimension of the input x). Keep track of the tensor shapes at each step!

4.2.8 BatchNorm

LayerNorm normalizes over the feature axis, whereas BatchNorm normalizes over the batch axis (axis=0). It transforms the data so that each feature has a mean of 0 and a standard deviation of 1 across the batch dimension.

Tip

What is a Batch? A batch is a small group of data samples (like 32 images) that the network looks at all at once before updating its weights.

\[ \text{mean} = \frac{1}{N} \sum_{i=1}^{N} x_i \]

\[ \text{var} = \frac{1}{N} \sum_{i=1}^{N} (x_i - \text{mean})^2 \]

\[ x_{\text{norm}_i} = \frac{x_i - \text{mean}}{\sqrt{\text{var} + \epsilon}} \]

\[ \text{output}_i = \text{weight} \times x_{\text{norm}_i} + \text{bias} \]

Training vs Evaluation

During the training phase, we maintain two buffers, running_mean and running_var. These are not Parameters; they are buffers that store a running estimate of the mean and variance.

During training we use the mean and variance of the current batch to normalize the data. We update our buffers using momentum.

\[\text{running\_mean} = (1 - \text{momentum}) \times \text{running\_mean} + \text{momentum} \times \text{batch\_mean}\]

\[\text{running\_var} = (1 - \text{momentum}) \times \text{running\_var} + \text{momentum} \times \text{batch\_var}\]
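
For example, with momentum = 0.1, a freshly initialized running_mean of 0, and a batch mean of 2.0:

\[ \text{running\_mean} = (1 - 0.1) \times 0.0 + 0.1 \times 2.0 = 0.2 \]

Over many batches, the running estimates drift toward the statistics of the whole training set.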

But when training is False, we do not compute new batch statistics; we instead normalize with the running_mean and running_var buffers accumulated during the training phase.

FILE : babygrad/nn.py

Exercise 4.7

Let's write the BatchNorm1d class.

class BatchNorm1d(Module):
    def __init__(self, dim: int, eps: float = 1e-5, momentum: float = 0.1,
             device: Any | None = None, dtype: str = "float32") -> None:
        super().__init__()
        self.dim = dim
        self.eps = eps
        self.momentum = momentum
        self.weight = Parameter(Tensor.ones(dim, dtype=dtype))
        self.bias = Parameter(Tensor.zeros(dim, dtype=dtype))
        # Buffers, not Parameters: we update them manually during training
        # and the optimizer never touches them.
        self.running_mean = Tensor.zeros(dim, dtype=dtype)
        self.running_var = Tensor.ones(dim, dtype=dtype)

    def forward(self, x: Tensor) -> Tensor:
        # x.shape is (batch_size, dim)
        if self.training:
            #your code: compute `mean` and `var` over the batch dimension first
            self.running_mean.data = ((1 - self.momentum) *
                    self.running_mean.data + self.momentum * mean.data)
            self.running_var.data = ((1 - self.momentum) *
                    self.running_var.data + self.momentum * var.data)
        else:
            mean_to_use = self.running_mean
            var_to_use = self.running_var
        #your code: normalize x, then scale by self.weight and shift by self.bias
Note

You can use summation, reshape, broadcast_to from ops.

Remember to normalize over the batch dimension (the first dimension of the input x). Keep track of the tensor shapes at each step!

4.2.9 MSE Loss

MSE is the standard loss function for regression tasks. It calculates the average of the squared differences between the predicted values and the actual targets.

\[\text{MSE} = \frac{1}{N} \sum_{i=1}^{N} (y_{\text{pred}_i} - y_{\text{target}_i})^2\]

class MSELoss(Module):
    def forward(self, pred: Tensor, target: Tensor) -> Tensor:
        """
        Calculates the Mean Squared Error.
        """
        diff = pred - target
        sq_diff = diff * diff
        return sq_diff.sum() / Tensor(target.data.size)

4.2.10 SoftmaxLoss

Let's take a single image whose possible classes are [Cat, Dog, Elephant]. Our image is a Dog. We give this image to our model, and the model gives us these output scores:

\[logits = [2.0, 5.0, 1.0]\]

The model is fairly confident it is a Dog (the second score is the largest).

Let's turn our true label (Dog) into a one-hot vector.

\[y\_one\_hot = [0, 1, 0]\]

Let's multiply the logits by the one-hot vector and sum the result, which picks out the score of the true class:

\[[2.0, 5.0, 1.0] \times [0, 1, 0] = [0, 5.0, 0]\]

\[\text{sum}([0, 5.0, 0]) = \mathbf{5.0}\]

Let's compute the logsumexp of the logits.

\[\text{logsumexp}([2.0, 5.0, 1.0]) = \ln(e^{2.0} + e^{5.0} + e^{1.0})\]

We get approximately 5.066.

Now let's find the loss, where \(h_y = 5.0\) is the true-class score we computed above:

\[\text{Loss} = \text{logsumexp} - h_y\] \[\text{Loss} = 5.066 - 5.0 = \mathbf{0.066}\]

The loss is small, which means the model's prediction agrees with the true label.

In the implementation we will use the max trick, so that the exponentials don't overflow.

\[\text{LogSumExp}(x) = \max(x) + \ln \left( \sum_{i} e^{x_i - \max(x)} \right)\]
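
Checking the trick on our example logits, where \(\max(x) = 5\):

\[ \text{LogSumExp}([2.0, 5.0, 1.0]) = 5 + \ln(e^{-3} + e^{0} + e^{-4}) \approx 5 + \ln(1.068) \approx 5.066 \]

Same result as before, but the largest exponent is now \(e^{0} = 1\), so nothing can overflow.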

Let's figure out how to do the backward pass for this.

Using the chain rule on \(f(x) = \ln(\sum e^{x_i})\), the derivative with respect to \(x_i\) is:

\[\frac{\partial \text{LSE}}{\partial x_i} = \frac{e^{x_i}}{\sum_j e^{x_j}}\]

Does it look like a Softmax function?

FILE : babygrad/nn.py

Exercise 4.8

Let's write the SoftmaxLoss class.

class SoftmaxLoss(Module):
    def forward(self, logits, y):
        """
        Calculates the softmax cross-entropy loss.
        Args:
            logits: A tensor of shape (batch_size, num_classes)
                containing the model's raw output.
            y: A list or numpy array of integers (batch_size,)
                 containing the true class labels.
        """
        n, k = logits.shape
        y_one_hot = Tensor.one_hot(y, k, requires_grad=False)
        logsumexp_val = ops.logsumexp(logits, axes=(1,))
        h_y = (logits * y_one_hot).sum(axes=(1,))

        return (logsumexp_val - h_y).sum() / n
Note

You should implement logsumexp as a proper op in ops.py (with its own backward pass) instead of computing it inline here.
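
As a sanity check while you implement the op, here is a plain NumPy reference for the forward computation only (it is not the babygrad op itself, which also needs the backward pass derived above):

import numpy as np

def logsumexp_reference(x: np.ndarray, axis=None) -> np.ndarray:
    """Numerically stable log-sum-exp using the max trick."""
    m = np.max(x, axis=axis, keepdims=True)
    # Subtracting the max keeps every exponent <= 0, so exp() cannot overflow.
    out = m + np.log(np.sum(np.exp(x - m), axis=axis, keepdims=True))
    return np.squeeze(out, axis=axis)

print(logsumexp_reference(np.array([2.0, 5.0, 1.0])))            # ~5.066
print(logsumexp_reference(np.array([[2.0, 5.0, 1.0]]), axis=1))  # [~5.066]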

Original: zekcrates/nn