Chapter 4: nn
When we build a neural network, we don't want to define every weight by hand. In this chapter we will create the nn.Module class and use it to build various layers.
This is what our folder structure currently looks like. In this chapter we will work inside babygrad/nn.py.
project/
|- .venv/ # virtual environment
|- babygrad/ # source code
| |- __init__.py
| |- ops.py
| |- tensor.py
|- examples/ # Examples
| |- simple_mnist.py
|- tests/ # tests
4.1 Parameter
A Parameter is a Tensor containing a model's learnable weights.
In our Tensor class, we are now adopting the standard convention: requires_grad will default to False. This is a sensible choice because most tensors are intermediate results, and not tracking their gradients saves memory.
This means we need an explicit way to mark a tensor as learnable. We do this by wrapping it in the Parameter class, which automatically sets requires_grad=True.
Why a separate class for something so simple?
- It makes the code self-documenting. When you see self.weight = Parameter(...) inside a layer, you know immediately that it is a learnable weight, not just a temporary tensor.
- It makes it easy to find all learnable parameters recursively.
FILE: babygrad/nn.py
from babygrad import Tensor
class Parameter(Tensor):
    """
    A special Tensor that tells a Module it is a learnable parameter.

    Example:
        >>> # A regular tensor - will NOT be treated as learnable.
        >>> self.some_data = Tensor([1, 2, 3])
        >>>
        >>> # A parameter - will be found and trained by the optimizer!
        >>> self.weights = Parameter(Tensor.randn(10, 5))
    """
    def __init__(self, data, *args, **kwargs):
        # Parameters always require gradients.
        kwargs['requires_grad'] = True
        super().__init__(data, *args, **kwargs)
a = Tensor([1, 2, 3])
print(a.requires_grad)
# False

b = Parameter(a)
print(b.requires_grad)
# True
As you can see, the Parameter class does only this one thing: it marks a tensor as a learnable parameter.
Now that we have a way to mark parameters, we need a way to find them. Let's write a simple helper that gathers all the parameters. A model can store parameters in several structures:
- Directly as attributes (self.weight)
- Inside a list (self.layers): [Layer1, Layer2]
- Inside a dictionary.
The helper below recursively searches all of these structures and collects every Parameter it finds.
class Parameter(Tensor):
    # ... code from above ...

def _get_parameters(data):
    params = []
    if isinstance(data, Parameter):
        return [data]
    if isinstance(data, dict):
        for value in data.values():
            # call _get_parameters recursively on each value
            params.extend(_get_parameters(value))
    if isinstance(data, (list, tuple)):
        for item in data:
            params.extend(_get_parameters(item))
    return params
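To see how this helper behaves, here is a small illustrative check (the variable names below are made up for the example):

w = Parameter(Tensor([1.0, 2.0]))
plain = Tensor([3.0, 4.0])

# Parameters can hide inside nested containers; plain Tensors are ignored.
container = {"block": [w, plain], "name": "demo"}
print(len(_get_parameters(container)))   # 1 -- only the Parameter is collected
print(len(_get_parameters(plain)))       # 0 -- a plain Tensor is not a Parameter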
4.2 Module
After the first two chapters you should have completed examples/simple_mnist.py. There we used a SimpleNN class for our model.
class SimpleNN:
    """A simple two-layer neural network."""
    def __init__(self, input_size, hidden_size, num_classes):
        self.W1 = Parameter(np.random.randn(input_size, hidden_size)
                            .astype(np.float32) / np.sqrt(hidden_size))
        self.W2 = Parameter(np.random.randn(hidden_size, num_classes)
                            .astype(np.float32) / np.sqrt(num_classes))

    def forward(self, x: Tensor) -> Tensor:
        """Performs the forward pass of the network."""
        z1 = x @ self.W1          # (8, 784) @ (784, 100) -> (8, 100)
        a1 = ops.relu(z1)
        logits = a1 @ self.W2     # (8, 100) @ (100, 10) -> (8, 10)
        return logits

    def parameters(self):
        """Returns a list of all model parameters."""
        return [self.W1, self.W2]
It worked, but we had to manually define the parameters() method to return a list of weights. If you had a 50-layer network, that list would be impossible to maintain.
Every layer (Linear, BatchNorm, Conv) shares the same core needs:
- Manage Parameters: It needs to find every weight inside itself.
- Forward Pass: It needs a way to process input.
- Training State: It needs to know if it is currently training or evaluating.
We define the Module base class to handle all of this automatically.
FILE: babygrad/nn.py
from typing import Any, List

from babygrad import Tensor

class Module:
    """
    Base class for all neural network layers.

    Example:
        class Linear(Module):
            def __init__(self, in_features, out_features):
                super().__init__()
                self.weight = Parameter(np.random.randn(in_features,
                                                        out_features))

            def forward(self, x):
                # your layer's computation goes here
                ...
    """
    def __init__(self):
        self.training = True

    def parameters(self) -> List[Parameter]:
        """
        Returns a list of all parameters in the module and its submodules.
        """
        # self.__dict__ is a dictionary containing all the instance's attributes.
        # We pass it to our helper to recursively find all Parameters.
        params = _get_parameters(self.__dict__)

        # Deduplicate while preserving order, in case the same Parameter
        # is referenced from more than one attribute.
        unique_params = []
        seen_ids = set()
        for p in params:
            if id(p) not in seen_ids:
                unique_params.append(p)
                seen_ids.add(id(p))
        return unique_params

    def forward(self, *args, **kwargs):
        """The forward pass logic that must be defined by subclasses."""
        raise NotImplementedError

    def __call__(self, *args, **kwargs):
        """
        Makes the module callable like a function.

        Example:
            >>> model = Linear(10, 2)
            >>> input_tensor = Tensor.randn(64, 10)
            >>> output = model(input_tensor)  # This calls model.forward(...)
        """
        return self.forward(*args, **kwargs)
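As a quick sanity check, here is a tiny hypothetical module that stores parameters as a direct attribute, inside a list, and twice under two names; parameters() should find each of them exactly once:

class TinyModel(Module):
    def __init__(self):
        super().__init__()
        self.weight = Parameter(Tensor([1.0, 2.0]))
        # A list mixing a Parameter with a plain Tensor.
        self.extras = [Parameter(Tensor([3.0])), Tensor([4.0])]
        # The same Parameter referenced under a second name.
        self.alias = self.weight

model = TinyModel()
print(len(model.parameters()))   # 2 -- the plain Tensor is skipped, the alias deduplicated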
Training vs evaluation
We have created the base Module class and could start building layers with it right away. But first, let's handle the third responsibility from the list above: the training state.
After training our model, we want to use it for making predictions. This brings up an important distinction between a model's behavior during training and during evaluation (or "inference").
What happens during training? We feed data, get predictions, calculate the loss, and update the weights. During evaluation (inference), we feed data and get predictions, but we do not update the weights.
But is that the only difference? If we just turn off gradient calculations, is that enough?
Not quite. Certain layers behave fundamentally differently depending on whether you are training or evaluating.
Two of the most common layers that require a separate eval mode are:
Dropout: During training, this layer randomly sets some of its inputs to zero. This is a regularization technique to prevent overfitting. During evaluation, you want your model to use its full learned capacity, so Dropout must be turned off.
Batch Normalization: During training, this layer calculates the mean and variance of the current batch of data to normalize it. It also keeps a running average of these statistics. During evaluation, it stops calculating from the current batch and instead uses its saved running averages to normalize the data.
To handle this, our Module class will need to keep track of its current state. We will add a self.training attribute and two methods, train() and eval(), to switch between these states.
First, let's write a helper that finds all the Modules present, just like _get_parameters does for parameters.
def _get_modules(obj) -> list['Module']:
    """
    A simple recursive function that finds all Module objects within any given
    object by searching through its attributes, lists, tuples, and dicts.
    """
    modules = []
    if isinstance(obj, Module):
        return [obj]
    if isinstance(obj, dict):
        for value in obj.values():
            modules.extend(_get_modules(value))
    if isinstance(obj, (list, tuple)):
        for item in obj:
            modules.extend(_get_modules(item))
    return modules
class Module:
    def __init__(self):
        self.training = True

    # ... parameters(), forward() and __call__() from above ...

    def train(self):
        self.training = True
        # Recurse so that submodules (and their submodules) switch state too.
        for m in _get_modules(self.__dict__):
            m.train()

    def eval(self):
        self.training = False
        for m in _get_modules(self.__dict__):
            m.eval()
The _get_modules helper mirrors _get_parameters, but collects Module objects instead. We can now call model.train() or model.eval() once, and every submodule switches state with it.
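Here is a tiny illustration (the Block class is made up just for this demo):

class Block(Module):
    def __init__(self):
        super().__init__()
        self.inner = Module()   # stands in for any sub-layer

model = Block()
model.eval()
print(model.training)         # False
print(model.inner.training)   # False -- the nested module switched too

model.train()
print(model.inner.training)   # True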
Let's start building layers using Module.
4.2.1 ReLU
We will start with simple layers that don't contain any learnable parameters. These layers just apply a mathematical function to the input.
- ReLU
- Sigmoid
- Tanh
FILE: babygrad/nn.py
Let's write the ReLU class, along with Tanh and Sigmoid.
class ReLU(Module):
    def forward(self, x: Tensor):
        """
        Applies the Rectified Linear Unit (ReLU) function element-wise.

        Example:
            model = nn.Sequential(
                nn.Linear(128, 64),
                nn.ReLU()  # Apply activation after the linear layer
            )
        """
        # your solution

class Tanh(Module):
    def forward(self, x: Tensor):
        # your code

class Sigmoid(Module):
    def forward(self, x: Tensor):
        # your code
Use methods from ops.py
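If you get stuck, here is one possible sketch. It assumes ops.py exposes relu, tanh, and sigmoid functions with the same calling style as the ops.relu we used in simple_mnist; the import path is also an assumption, so adapt it to how your package exposes ops:

from babygrad import ops   # assumed import; adjust to your package layout

class ReLU(Module):
    def forward(self, x: Tensor):
        return ops.relu(x)

class Tanh(Module):
    def forward(self, x: Tensor):
        return ops.tanh(x)       # assumes ops.tanh exists

class Sigmoid(Module):
    def forward(self, x: Tensor):
        return ops.sigmoid(x)    # assumes ops.sigmoid exists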
4.2.2 Flatten
What does Flatten do?
Suppose we have a tensor of shape (2, 3, 4, 5). Flatten reshapes it into (2, 60), keeping the batch dimension and collapsing everything else. Flatten is extremely useful in CNNs.
FILE: babygrad/nn.py
Let's write the Flatten class.
class Flatten(Module):
    """
    Flattens a tensor by reshaping it to `(batch_size, -1)`.

    Example:
        # A CNN might produce a feature map of shape (32, 64, 7, 7)
        # (batch_size, channels, height, width)
        model = nn.Sequential(
            nn.Conv2d(...),
            nn.ReLU(),
            nn.Flatten(),  # Reshapes output to (32, 64 * 7 * 7) = (32, 3136)
            nn.Linear(3136, 10)
        )
    """
    def forward(self, x: Tensor) -> Tensor:
        # your code
Can we use reshape method from ops?
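Yes. Here is one possible sketch, assuming the Tensor exposes .shape and a reshape method (otherwise use the reshape op from ops.py):

class Flatten(Module):
    def forward(self, x: Tensor) -> Tensor:
        batch_size = x.shape[0]
        # Multiply the remaining dimensions together: (2, 3, 4, 5) -> (2, 60).
        flat_size = 1
        for d in x.shape[1:]:
            flat_size *= d
        return x.reshape((batch_size, flat_size))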
So far in these exercises we haven't needed Parameter to initialize anything.
4.2.3 Linear
So far, the layers we've built (ReLU, Flatten, Sigmoid, Tanh) have all been stateless. They perform a fixed mathematical operation but don't have any memory or learnable parts.
The simplest stateful layer is the Linear layer (also called a "Dense" or "Fully Connected" layer). It multiplies the input by a weight matrix and adds a bias. The weight and bias are the learnable parameters that the model adjusts during training to get the right answer.
\[ output = x @ Weight + bias \]
FILE: babygrad/nn.py
- Wrap self.weight and self.bias with Parameter.
Let's write the Linear class.
class Linear(Module):
    """
    Applies a linear transformation to the incoming data: y = x @ W + b.

    Args:
        in_features (int): Size of each input sample.
        out_features (int): Size of each output sample.
        bias (bool, optional): If set to False, the layer will not learn a bias.

    Shape:
        - Input: `(batch_size, *, in_features)` where `*` means any number
          of additional dimensions.
        - Output: `(batch_size, *, out_features)` where all but the last
          dimension are the same shape as the input.

    Attributes:
        weight (Parameter): The learnable weights of the module of shape
            `(in_features, out_features)`.
        bias (Parameter): The learnable bias of the module of shape
            `(out_features,)`.
    """
    def __init__(self, in_features: int, out_features: int, bias: bool = True,
                 device: Any | None = None, dtype: str = "float32"):
        super().__init__()
        self.in_features = in_features
        self.out_features = out_features
        # your code

    def forward(self, x: Tensor) -> Tensor:
        # Note: x.shape is (batch_size, in_features)
        #       self.weight.shape is (in_features, out_features)
        #       The result should have shape (batch_size, out_features)
        # your code
Make sure self.bias is broadcast to the right shape before adding it.
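If you want to check your work, here is one possible sketch. The initialization simply mirrors the scaling used in SimpleNN, and it assumes Tensor.randn, scalar arithmetic, and reshape/broadcast_to as Tensor methods; adapt these to your own Tensor API if they differ:

class Linear(Module):
    def __init__(self, in_features: int, out_features: int, bias: bool = True,
                 device=None, dtype: str = "float32"):
        super().__init__()
        self.in_features = in_features
        self.out_features = out_features
        # Weight of shape (in_features, out_features), scaled like in SimpleNN.
        self.weight = Parameter(Tensor.randn(in_features, out_features) / (out_features ** 0.5))
        self.bias = Parameter(Tensor.zeros(out_features, dtype=dtype)) if bias else None

    def forward(self, x: Tensor) -> Tensor:
        out = x @ self.weight                      # (batch_size, out_features)
        if self.bias is not None:
            # Broadcast the (out_features,) bias across the batch dimension.
            out = out + self.bias.reshape((1, self.out_features)).broadcast_to(out.shape)
        return out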
4.2.4 Sequential
Suppose you have a model with many different layers, like this:
class MygoodNetwork(Module):
    def __init__(self):
        super().__init__()
        self.layer1 = Linear(784, 128)
        self.activation1 = ReLU()
        self.layer2 = Linear(128, 10)

    def forward(self, x):
        x = self.layer1(x)
        x = self.activation1(x)
        x = self.layer2(x)
        return x
This works, but the forward method is just a repetitive chain of calls. If we had 10 layers, it would become very verbose. Can we simplify this? Can we just "stack" the layers and have them run in order automatically? The Sequential class solves this by acting as a container: it takes a sequence of modules and handles the "passing" for you, so the output of layer 1 automatically becomes the input of layer 2.
Let's write the Sequential class.
class Sequential(Module):
    """
    A container that chains a sequence of modules together.

    Example:
        # A simple 2-layer MLP for MNIST
        model = nn.Sequential(
            nn.Linear(784, 128),
            nn.ReLU(),
            nn.Linear(128, 10)
        )
        logits = model(input_tensor)
    """
    def __init__(self, *modules):
        super().__init__()
        self.modules = modules

    def forward(self, x: Tensor) -> Tensor:
        # your code
        # pass 'x' through each module in order
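The forward pass is just a loop; a minimal sketch:

    def forward(self, x: Tensor) -> Tensor:
        # Feed the output of each module into the next one.
        for module in self.modules:
            x = module(x)
        return x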
4.2.5 Residual
We've seen that Sequential is great for stacking layers. But as we stack more and more layers, the network can start to "forget" the original input. The input has to pass through so many layers that it can get lost.
What if we don't always have to pass our input through the whole network? What if we could skip some layers? Instead of forcing the input \(x\) to be transformed by every single layer, can we let it bypass the layers and add itself back to the result at the end?
\[output = f(x) + x\]
Yes! This is called a Residual Connection. It allows gradients to flow through the "shortcut" path during backpropagation, making it much easier to train very deep networks. A nice property of residual connections is that they add no extra parameters or complexity to the network.
class Residual(Module):
    """
    Creates a residual connection block, which implements `f(x) + x`.

    Example:
        # A simple residual block
        main_path = nn.Sequential(
            nn.Linear(64, 64),
            nn.ReLU()
        )
        res_block = nn.Residual(main_path)
        output = res_block(input_tensor_of_shape_64)
    """
    def __init__(self, fn: Module):
        super().__init__()
        self.fn = fn

    def forward(self, x: Tensor) -> Tensor:
        return self.fn(x) + x
4.2.6 Dropout
Overfitting happens when a model becomes too good at memorizing the training data. It learns the specific details and noise of the training set so well that it fails to generalize to new, unseen data.
Often this means the network depends on a few specific neurons for its output, while the other neurons contribute very little.
Think of a quiz competition: if the whole team depends on Alice for the correct answers, the other team members have no role to play at all.
So what should we do? Remove Alice, or remove everyone else?
Neither. Sometimes Alice sits out, sometimes someone else does, so that everyone learns to contribute instead of relying on others.
This is what Dropout does: we force Alice (and everyone else) to "sit out" at random during training, which forces every neuron to learn useful features on its own.
During training, some neurons are temporarily turned off at random. During evaluation, everyone participates.
Let's write the Dropout class.
class Dropout(Module):
    """
    A regularization layer to help prevent overfitting.

    During training (`.train()` mode), it randomly sets some input
    elements to zero with a probability of `p`. The remaining
    elements are scaled up by `1 / (1 - p)`.

    During evaluation (`.eval()` mode), this layer does nothing.

    Example:
        model = nn.Sequential(
            nn.Linear(128, 64),
            nn.ReLU(),
            nn.Dropout(p=0.2)  # Drop 20% of activations during training
        )
    """
    def __init__(self, p: float = 0.5):
        super().__init__()
        self.p = p

    def forward(self, x: Tensor) -> Tensor:
        # your code
        # create a binary random mask Tensor (use Tensor.randb)
        # multiply this mask with x and divide by (1 - self.p)
If self.training is False, just return x unchanged.
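One possible sketch. The exact signature of Tensor.randb is an assumption (here it is treated as returning a binary mask of the given shape whose elements are 1 with probability p); check how you defined it in your own Tensor class:

    def forward(self, x: Tensor) -> Tensor:
        if not self.training or self.p == 0.0:
            return x
        # Each element survives with probability (1 - p).
        mask = Tensor.randb(x.shape, p=1 - self.p)   # assumed signature
        # Scale the survivors so the expected activation stays the same.
        return x * mask / (1 - self.p)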
4.2.7 LayerNorm
An activation is the numerical output of a neuron.
The Mean is the average value of a set of numbers.
Variance measures how spread out a set of numbers are from its mean.
A distribution describes how the values in a set of data are spread out.
During training, the distribution of a layer's inputs can change from one update to the next.
In deep networks with many layers, it is important for each layer's input distribution to stay stable.
Why should the input distribution stay stable? If it is not stable, the learning process for each layer becomes harder.
Why does it become harder? Because each layer is trying to learn from inputs whose distribution keeps shifting underneath it.
So what can we do? We need to stabilize these inputs. That is exactly what we will do with LayerNorm.
In words:
\[ \text{normalized activation}_i = \frac{\text{activation}_i - \text{mean of activations}}{\text{standard deviation of activations} + \epsilon} \]
\[ \text{output}_i = \text{scale parameter} \times \text{normalized activation}_i + \text{shift parameter} \]
The same thing in symbols:
\[ x_{\text{norm}_i} = \frac{x_i - \text{mean}(x)}{\text{std}(x) + \epsilon} \]
\[ \text{output}_i = \text{weight} \times x_{\text{norm}_i} + \text{bias} \]
Title: Layer Normalization Link: https://arxiv.org/abs/1607.06450
Let's write the LayerNorm1d class.
class LayerNorm1d(Module):
    def __init__(self, dim: int, eps: float = 1e-5, device=None,
                 dtype="float32"):
        super().__init__()
        self.dim = dim
        self.eps = eps
        self.weight = Parameter(Tensor.ones(dim, dtype=dtype))
        self.bias = Parameter(Tensor.zeros(dim, dtype=dtype))

    def forward(self, x):
        # x: (batch_size, dim)
        # your code
You can use summation, reshape, broadcast_to from ops.
Remember to normalize over the features dimension (the last dimension of the input x). Keep track of the tensor shapes at each step!
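If you get stuck, here is one possible solution sketch. It assumes the Tensor supports element-wise arithmetic with scalars, a sum(axes=...) reduction, reshape, broadcast_to, and ** 0.5 for square roots; if these live in ops.py instead (summation, reshape, broadcast_to), swap in the equivalents:

    def forward(self, x: Tensor) -> Tensor:
        batch_size, dim = x.shape
        # Mean over the feature dimension, broadcast back to (batch_size, dim).
        mean = (x.sum(axes=(1,)) / dim).reshape((batch_size, 1)).broadcast_to(x.shape)
        # Variance over the feature dimension.
        var = (((x - mean) * (x - mean)).sum(axes=(1,)) / dim)
        var = var.reshape((batch_size, 1)).broadcast_to(x.shape)
        std = var ** 0.5
        x_norm = (x - mean) / (std + self.eps)       # matches std(x) + eps in the formula above
        # Learnable scale and shift, broadcast over the batch dimension.
        weight = self.weight.reshape((1, dim)).broadcast_to(x.shape)
        bias = self.bias.reshape((1, dim)).broadcast_to(x.shape)
        return x_norm * weight + bias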
4.2.8 BatchNorm
LayerNorm normalizes over the feature axis, whereas BatchNorm normalizes over the batch axis (axis=0). It transforms the data so that each feature has a mean of 0 and a standard deviation of 1 across the batch dimension.
What is a Batch? A batch is a small group of data samples (like 32 images) that the network looks at all at once before updating its weights.
\[ \text{mean} = \frac{1}{N} \sum_{i=1}^{N} x_i \]
\[ \text{var} = \frac{1}{N} \sum_{i=1}^{N} (x_i - \text{mean})^2 \]
\[ x_{\text{norm}_i} = \frac{x_i - \text{mean}}{\sqrt{\text{var} + \epsilon}} \]
\[ \text{output}_i = \text{weight} \times x_{\text{norm}_i} + \text{bias} \]
Training vs Evaluation
We maintain two buffers, running_mean and running_var. These are not Parameters (they receive no gradient updates); they are buffers that store running estimates of the mean and variance.
During training we use the mean and variance of the current batch to normalize the data. We update our buffers using momentum.
\[\text{running\_mean} = (1 - \text{momentum}) \times \text{running\_mean} + \text{momentum} \times \text{batch\_mean}\]
\[\text{running\_var} = (1 - \text{momentum}) \times \text{running\_var} + \text{momentum} \times \text{batch\_var}\]
When training=False, we do not compute new batch statistics; instead we normalize using the running_mean and running_var buffers accumulated during training.
FILE: babygrad/nn.py
Let's write the BatchNorm1d class.
class BatchNorm1d(Module):
    def __init__(self, dim: int, eps: float = 1e-5, momentum: float = 0.1,
                 device: Any | None = None, dtype: str = "float32") -> None:
        super().__init__()
        self.dim = dim
        self.eps = eps
        self.momentum = momentum
        self.weight = Parameter(Tensor.ones(dim, dtype=dtype))
        self.bias = Parameter(Tensor.zeros(dim, dtype=dtype))
        # Buffers (not Parameters): running statistics used at evaluation time.
        self.running_mean = Tensor.zeros(dim, dtype=dtype)
        self.running_var = Tensor.ones(dim, dtype=dtype)

    def forward(self, x: Tensor) -> Tensor:
        # x.shape is (batch_size, dim)
        if self.training:
            # your code: compute mean and var over the batch dimension,
            # then update the running buffers with momentum:
            self.running_mean.data = ((1 - self.momentum) * self.running_mean.data
                                      + self.momentum * mean.data)
            self.running_var.data = ((1 - self.momentum) * self.running_var.data
                                     + self.momentum * var.data)
        else:
            mean_to_use = self.running_mean
            var_to_use = self.running_var
        # your code: normalize, then scale and shift with weight and bias
You can use summation, reshape, broadcast_to from ops.
Remember to normalize over the batch dimension (the first dimension of the input x). Keep track of the tensor shapes at each step!
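One possible way to complete the forward pass, under the same API assumptions as the LayerNorm sketch above:

    def forward(self, x: Tensor) -> Tensor:
        batch_size, dim = x.shape
        if self.training:
            # Statistics of the current batch, each of shape (dim,).
            mean = x.sum(axes=(0,)) / batch_size
            mean_b = mean.reshape((1, dim)).broadcast_to(x.shape)
            var = ((x - mean_b) * (x - mean_b)).sum(axes=(0,)) / batch_size
            # Update the running buffers with momentum (no gradients flow here).
            self.running_mean.data = ((1 - self.momentum) * self.running_mean.data
                                      + self.momentum * mean.data)
            self.running_var.data = ((1 - self.momentum) * self.running_var.data
                                     + self.momentum * var.data)
            mean_to_use, var_to_use = mean, var
        else:
            mean_to_use, var_to_use = self.running_mean, self.running_var

        mean_b = mean_to_use.reshape((1, dim)).broadcast_to(x.shape)
        var_b = var_to_use.reshape((1, dim)).broadcast_to(x.shape)
        x_norm = (x - mean_b) / ((var_b + self.eps) ** 0.5)
        weight = self.weight.reshape((1, dim)).broadcast_to(x.shape)
        bias = self.bias.reshape((1, dim)).broadcast_to(x.shape)
        return x_norm * weight + bias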
4.2.9 MSE Loss
Mean Squared Error is the standard loss function for regression tasks. It is the average of the squared differences between the predicted values and the actual targets.
\[\text{MSE} = \frac{1}{N} \sum_{i=1}^{N} (y_{\text{pred}_i} - y_{\text{target}_i})^2\]
class MSELoss(Module):
    def forward(self, pred: Tensor, target: Tensor) -> Tensor:
        """
        Calculates the Mean Squared Error.
        """
        diff = pred - target
        sq_diff = diff * diff
        return sq_diff.sum() / Tensor(target.data.size)
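A quick numeric check of the implementation:

pred = Tensor([2.0, 3.0])
target = Tensor([1.0, 5.0])
loss = MSELoss()(pred, target)
# ((2 - 1)^2 + (3 - 5)^2) / 2 = (1 + 4) / 2 = 2.5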
4.2.10 SoftmaxLoss
Let's take a single image whose label is one of [Cat, Dog, Elephant]. Our image is a Dog. We give this image to our model, and the model returns the output scores:
\[logits = [2.0, 5.0, 1.0]\]
The model is fairly confident it is a Dog, since the second score is the largest.
Let's turn our true label (Dog) into a one-hot vector:
\[y\_one\_hot = [0, 1, 0]\]
Let's multiply the logits element-wise with the one-hot vector and sum the result; this picks out the logit of the true class, which we will call \(h_y\):
\[[2.0, 5.0, 1.0] \times [0, 1, 0] = [0, 5.0, 0]\]
\[\text{sum}([0, 5.0, 0]) = \mathbf{5.0}\]
Let's compute logsumexp of the logits:
\[\text{logsumexp}([2.0, 5.0, 1.0]) = \ln(e^{2.0} + e^{5.0} + e^{1.0}) \approx 5.066\]
Now let's compute the loss:
\[\text{Loss} = \text{logsumexp} - h_y = 5.066 - 5.0 = \mathbf{0.066}\]
The loss is small, which means the model's prediction matches the true label.
In the implementation we will use the max trick, so that the exponentials don't overflow:
\[\text{LogSumExp}(x) = \max(x) + \ln \left( \sum_{i} e^{x_i - \max(x)} \right)\]
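Independent of how your op classes are structured, the numerics of the stable version look like this (a NumPy reference for checking your ops.logsumexp, not the babygrad op itself):

import numpy as np

def logsumexp_reference(x: np.ndarray, axes=None) -> np.ndarray:
    # Subtract the max before exponentiating so e^(x - max) never overflows.
    m = np.max(x, axis=axes, keepdims=True)
    return np.max(x, axis=axes) + np.log(np.sum(np.exp(x - m), axis=axes))

print(logsumexp_reference(np.array([2.0, 5.0, 1.0])))   # ~5.066, as computed above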
Let's figure out how to do the backward pass for logsumexp.
Using the chain rule on \(f(x) = \ln(\sum e^{x_i})\), the derivative with respect to \(x_i\) is:
\[\frac{\partial \text{LSE}}{\partial x_i} = \frac{e^{x_i}}{\sum_j e^{x_j}}\]
Does it look like a Softmax function?
FILE: babygrad/nn.py
Let's write the SoftmaxLoss class.
class SoftmaxLoss(Module):
    def forward(self, logits, y):
        """
        Calculates the softmax cross-entropy loss.

        Args:
            logits: A tensor of shape (batch_size, num_classes)
                containing the model's raw output.
            y: A list or numpy array of integers (batch_size,)
                containing the true class labels.
        """
        n, k = logits.shape
        y_one_hot = Tensor.one_hot(y, k, requires_grad=False)
        logsumexp_val = ops.logsumexp(logits, axes=(1,))
        h_y = (logits * y_one_hot).sum(axes=(1,))
        return (logsumexp_val - h_y).sum() / n
You should implement logsumexp as ops.logsumexp in ops.py (with its own backward pass) instead of computing it inline here.
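To check the implementation against the worked example above (assuming Tensor accepts a nested list and Tensor.one_hot accepts a NumPy array of labels):

import numpy as np

logits = Tensor([[2.0, 5.0, 1.0]])   # one image, three classes
y = np.array([1])                     # index 1 = Dog
loss = SoftmaxLoss()(logits, y)
# Expected value: logsumexp - h_y = 5.066 - 5.0, roughly 0.066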