Introduction
The Nerva-Torch Python Library is a Python implementation of neural networks. It is part of the Nerva library collection (https://github.com/wiegerw/nerva), which includes native C++ and Python implementations. The library was originally intended for experimenting with truly sparse neural networks; nowadays it also aims to provide a transparent and accessible implementation of neural networks.
This document describes the implementation of the Nerva-Torch Python Library. For initial versions of the library I took inspiration from lecture notes of machine learning courses by Roger Grosse, which I highly recommend. This influence may still be evident in the naming of symbols.
Installation
The nerva_torch library can be installed in two ways: from the source repository or from the Python Package Index (PyPI).
# Install from the local repository
pip install .
# Install directly from PyPI
pip install nerva-torch
Overview of the code
This section provides an overview of the code in the Nerva-Torch Python Library. All core functionality is contained in the nerva_torch module.
Module contents
The most important files in the nerva_torch module are listed below. Each file implements a distinct part of the neural network library.
File | Description |
---|---|
 | Defines the MultilayerPerceptron class that combines the layers into a model. |
 | Implements various neural network layers, such as fully connected or custom layers. |
 | Provides commonly used activation functions (e.g., ReLU, sigmoid, tanh) to introduce non-linearity. |
 | Implements loss functions used to quantify the difference between predictions and targets (e.g., cross-entropy, MSE). |
 | Provides functions for initializing neural network weights, supporting different strategies for stability and performance. |
 | Defines optimizer functions that update neural network parameters based on computed gradients (e.g., SGD, momentum). |
 | Implements learning rate schedulers to adjust the learning rate dynamically during training. |
 | Contains the stochastic gradient descent (SGD) algorithms for training multilayer perceptrons. |
Number type
The Nerva-Torch Python Library uses 32-bit floating point numbers (float32) as its default number type. This choice balances memory usage and computational efficiency on both CPUs and GPUs. All computations, including feedforward, backpropagation, and gradient updates, are performed in this precision.
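As an illustration, the following sketch shows how this precision can be made explicit in PyTorch (the library manages its own defaults; this snippet is not part of its API):

import torch

# float32 is PyTorch's default floating point type; setting it explicitly
# documents the precision used for parameters and activations.
torch.set_default_dtype(torch.float32)

X = torch.rand(100, 784)           # a batch of 100 inputs in row layout
assert X.dtype == torch.float32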
API / User guide
Classes
Class Layer
The class Layer is the base class of all neural network layers. There are several different types of layers:
Layer | Description |
---|---|
LinearLayer | A linear layer. |
ActivationLayer | A linear layer followed by a pointwise activation function. |
 | A linear layer followed by an SReLU activation function. |
 | A linear layer followed by a softmax activation function. |
 | A linear layer followed by a log-softmax activation function. |
 | A batch normalization layer. |
Class MultilayerPerceptron
A multilayer perceptron (MLP) is modeled using the class MultilayerPerceptron. It contains a list of layers and has member functions feedforward, backpropagate and optimize that can be used for training the neural network. Constructing an MLP can be done as follows:
M = MultilayerPerceptron()
# configure layer 1
layer1 = ActivationLayer(784, 1024, ReLUActivation())
set_weights_xavier_normal(layer1.W)
set_bias_zero(layer1.b)
optimizer_W = MomentumOptimizer(layer1.W, layer1.DW, 0.9)
optimizer_b = NesterovOptimizer(layer1.b, layer1.Db, 0.75)
layer1.optimizer = CompositeOptimizer([optimizer_W, optimizer_b])
# configure layer 2
layer2 = ActivationLayer(1024, 512, LeakyReLUActivation(0.5))
layer2.set_weights("XavierNormal")
layer2.set_optimizer("Momentum(0.8)")
# configure layer 3
layer3 = LinearLayer(512, 10)
layer3.set_weights("HeNormal")
layer3.set_optimizer("GradientDescent")
M.layers = [layer1, layer2, layer3]
This creates an MLP with three linear layers, and various activation functions, weight initializers and optimizers.
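A minimal usage sketch of the constructed model (the random batch below is for illustration only; inputs are given in row layout, one sample per row):

import torch

X = torch.rand(100, 784)   # 100 samples, matching the 784 inputs of layer 1
Y = M.feedforward(X)       # forward pass through all three layers
print(Y.shape)             # torch.Size([100, 10])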
Another way to construct MLPs is provided by the function parse_multilayer_perceptron, which parses an MLP from a textual specification:
layer_specifications = ["ReLU", "LeakyReLU(0.5)", "Linear"]
linear_layer_sizes = [784, 1024, 512, 10]
linear_layer_optimizers = ["Nesterov(0.9)", "Momentum(0.8)", "GradientDescent"]
linear_layer_weight_initializers = ["XavierNormal", "XavierUniform", "HeNormal"]
M = parse_multilayer_perceptron(layer_specifications,
                                linear_layer_sizes,
                                linear_layer_optimizers,
                                linear_layer_weight_initializers)
Note that optimizers must be specified not only for linear layers, but also for batch normalization layers.
Class LossFunction
The class LossFunction is the base class of all loss functions. There are five loss functions available:

- SquaredErrorLoss
- CrossEntropyLoss
- LogisticCrossEntropyLoss
- NegativeLogLikelihoodLoss
- SoftmaxCrossEntropyLoss
See the Nerva library specifications document for precise definitions of these loss functions.
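For example, the gradient of a loss function with respect to the network outputs is obtained through its gradient member function, as used in the training loops below. A small sketch, assuming a no-argument constructor and one-hot targets:

import torch

Y = torch.rand(8, 10)                           # outputs for a batch of 8 samples
T = torch.eye(10)[torch.randint(0, 10, (8,))]   # one-hot targets (illustration only)

loss = SoftmaxCrossEntropyLoss()
DY = loss.gradient(Y, T) / Y.shape[0]           # batch-averaged gradient, as in the SGD loops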
Class ActivationFunction
The class ActivationFunction is the base class of all activation functions. The following activation functions are available:

- ReLU
- Sigmoid
- Softmax
- LogSoftmax
- LeakyReLU
- AllReLU
- SReLU
- HyperbolicTangent
See the Nerva library specifications document for precise definitions of these activation functions.
Training a neural network
The library provides two variants of stochastic gradient descent (SGD) training for multilayer perceptrons.
The preferred interface is stochastic_gradient_descent, which accepts PyTorch-style DataLoader instances for the training and test sets. This approach is the easiest in practice, since the DataLoader abstraction automatically handles batching, shuffling, and iteration over the dataset.
def stochastic_gradient_descent(M: MultilayerPerceptron,
                                epochs: int,
                                loss: LossFunction,
                                learning_rate: LearningRateScheduler,
                                train_loader: DataLoader,
                                test_loader: DataLoader
                               ):
    print_epoch_header()
    lr = learning_rate(0)
    compute_statistics(M, lr, loss, train_loader, test_loader, epoch=0)
    training_time = 0.0
    for epoch in range(epochs):
        timer = StopWatch()
        lr = learning_rate(epoch)  # update the learning rate at the start of each epoch
        for k, (X, T) in enumerate(train_loader):
            Y = M.feedforward(X)
            DY = loss.gradient(Y, T) / X.shape[0]
            if TrainOptions.debug:
                print_batch_debug_info(epoch, k, M, X, Y, DY)
            M.backpropagate(Y, DY)
            M.optimize(lr)
        seconds = timer.seconds()
        training_time += seconds
        compute_statistics(M, lr, loss, train_loader, test_loader, epoch=epoch + 1, elapsed_seconds=seconds)
    print_epoch_footer(training_time)
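A usage sketch of this interface, assuming the model M and the loss class from the earlier examples, and a flattened MNIST dataset as produced by prepare_data.py (the file path is an assumption). Since the training loop only calls learning_rate(epoch), a plain callable is used here as a constant scheduler:

import numpy as np
import torch
from torch.utils.data import TensorDataset, DataLoader

data = np.load("data/mnist-flattened.npz")   # path is an assumption
Xtrain = torch.as_tensor(data["Xtrain"], dtype=torch.float32)
Ttrain = torch.as_tensor(data["Ttrain"], dtype=torch.long)
Xtest = torch.as_tensor(data["Xtest"], dtype=torch.float32)
Ttest = torch.as_tensor(data["Ttest"], dtype=torch.long)

train_loader = DataLoader(TensorDataset(Xtrain, Ttrain), batch_size=100, shuffle=True)
test_loader = DataLoader(TensorDataset(Xtest, Ttest), batch_size=100)

stochastic_gradient_descent(M,
                            epochs=5,
                            loss=SoftmaxCrossEntropyLoss(),
                            learning_rate=lambda epoch: 0.01,  # constant learning rate
                            train_loader=train_loader,
                            test_loader=test_loader)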
For educational purposes, a lower-level variant stochastic_gradient_descent_plain is also available. It operates directly on raw tensors in row layout (samples as rows), giving full control over batching and shuffling, but at the cost of additional boilerplate.
def stochastic_gradient_descent_plain(M: MultilayerPerceptron,
                                      Xtrain: Matrix,
                                      Ttrain: Matrix,
                                      loss: LossFunction,
                                      learning_rate: LearningRateScheduler,
                                      epochs: int,
                                      batch_size: int,
                                      shuffle: bool
                                     ):
    N = Xtrain.shape[0]  # number of examples (row layout)
    I = list(range(N))
    K = N // batch_size  # number of full batches
    num_classes = M.layers[-1].output_size()
    for epoch in range(epochs):
        if shuffle:
            random.shuffle(I)
        lr = learning_rate(epoch)  # update learning rate each epoch
        for k in range(K):
            batch = I[k * batch_size: (k + 1) * batch_size]
            X = Xtrain[batch, :]  # shape (batch_size, input_dim)
            # Convert labels to one-hot if needed
            if Ttrain.ndim == 2 and Ttrain.shape[1] > 1:
                T = Ttrain[batch, :]  # already one-hot encoded
            else:
                T = to_one_hot(Ttrain[batch], num_classes)
            Y = M.feedforward(X)
            DY = loss.gradient(Y, T) / X.shape[0]
            if TrainOptions.debug:
                print_batch_debug_info(epoch, k, M, X, Y, DY)
            M.backpropagate(Y, DY)
            M.optimize(lr)
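Called on raw tensors, for example the Xtrain and Ttrain tensors loaded in the previous sketch, this becomes:

stochastic_gradient_descent_plain(M, Xtrain, Ttrain,
                                  loss=SoftmaxCrossEntropyLoss(),
                                  learning_rate=lambda epoch: 0.01,  # constant learning rate
                                  epochs=5,
                                  batch_size=100,
                                  shuffle=True)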
Both functions support targets provided either as a one-dimensional tensor of class indices (the default convention used in PyTorch’s classification losses) or as a one-hot encoded matrix with as many columns as the output Y. If class indices are provided, they are internally converted to one-hot encoding using to_one_hot.
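A minimal sketch of such a conversion with PyTorch's built-in one_hot (the library's own to_one_hot helper may be implemented differently):

import torch
import torch.nn.functional as F

T = torch.tensor([6, 9, 1])                       # class indices
T_onehot = F.one_hot(T, num_classes=10).float()   # shape (3, 10), one row per sample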
Batching of the training data depends on the chosen interface. With stochastic_gradient_descent, batching and shuffling are handled automatically by the DataLoader. With stochastic_gradient_descent_plain, batching is implemented manually inside the training loop.
In each epoch, every batch (X, T) goes through the three standard steps of stochastic gradient descent:

- Feedforward: Given an input batch X and the current neural network parameters Θ, compute the outputs Y. In the code, this corresponds to Y = M.feedforward(X).
- Backpropagation: Given the outputs Y and the targets T, compute the gradient DY of the loss function with respect to Y. Then, using Y and DY, compute the gradients DΘ of the model parameters. These parameter gradients are stored internally in the model rather than returned. In the code, this step is performed by M.backpropagate(Y, DY).
- Optimization: Use the internally stored parameter gradients to update the parameters Θ. In the code, this corresponds to M.optimize(lr).
Command line tools
The following command line tools are available. They can be found in the tools directory.
Tool | Description |
---|---|
mlp.py | A tool for training multilayer perceptrons. |
inspect_npz.py | A tool for inspecting the contents of a file in NumPy NPZ format. |
The tool mlp.py
The tool mlp.py can be used for training multilayer perceptrons. An example invocation of the mlp.py tool is:
python3 -u ../tools/mlp.py \
--layers="ReLU;ReLU;Linear" \
--layer-sizes="3072;1024;512;10" \
--layer-weights="XavierNormal;XavierNormal;XavierNormal" \
--optimizers="Momentum(0.9);Momentum(0.9);Momentum(0.9)" \
--batch-size=100 \
--epochs=5 \
--loss=SoftmaxCrossEntropy \
--learning-rate="Constant(0.01)" \
--load-dataset=$dataset
This will train a CIFAR-10 model using an MLP consisting of three linear layers with activation functions ReLU, ReLU and no activation. A script prepare_data.py is available in the data directory that can be used to download the dataset, flatten it and store it in .npz format. See the section Preparing data for details.
The output may look like this:
Loading dataset from file ../data/cifar10-flattened.npz
--------------------------------------------------------------------------------
epoch | lr | loss | train_acc | test_acc | time (s)
--------------------------------------------------------------------------------
0 | 0.010000 | 2.408590 | 0.097760 | 0.096000 | 0.000000
1 | 0.010000 | 1.645980 | 0.412700 | 0.410500 | 3.572163
2 | 0.010000 | 1.548570 | 0.448900 | 0.440100 | 3.586480
3 | 0.010000 | 1.475506 | 0.477560 | 0.465100 | 4.450712
4 | 0.010000 | 1.431293 | 0.491620 | 0.474800 | 4.429117
5 | 0.010000 | 1.369700 | 0.513900 | 0.494300 | 4.489507
--------------------------------------------------------------------------------
Total training time: 20.527979 s
mlp.py Command Line Options
This section gives an overview of the command line interface of the mlp.py tool.
Parameter Lists
Some options accept a list of items. Lists must be semicolon-separated. For example: --layers="ReLU;AllReLU(0.3);Linear".
Named Parameters
Some items accept parameters using function-call syntax with commas to separate arguments. Use named parameters when needed, e.g. AllReLU(alpha=0.3). If a parameter has a default value, it may be omitted: SReLU() is equivalent to SReLU(al=0,tl=0,ar=0,tr=1).
General Options
- --help: Display help information.
- --debug: Enable debug output. Prints batches, weight matrices, bias vectors, and gradients.
Random Generator Options
- --seed <value>: Set the seed value for the random number generator.
Layer Configuration Options
- --layers <value>: Specify a semicolon-separated list of layers. Example: --layers=ReLU;AllReLU(0.3);Linear.

Specification | Description |
---|---|
Linear | Linear layer without activation |
ReLU | Linear layer with ReLU activation |
Sigmoid | Linear layer with sigmoid activation |
Softmax | Linear layer with softmax activation |
LogSoftmax | Linear layer with log-softmax activation |
HyperbolicTangent | Linear layer with hyperbolic tangent activation |
AllReLU(alpha) | Linear layer with AllReLU activation |
SReLU(al, tl, ar, tr) | Linear layer with SReLU activation. Defaults: al=0, tl=0, ar=0, tr=1 |
 | Batch normalization layer |
- --layer-sizes <value>: Specify the sizes of the linear layers (semicolon-separated). Example: --layer-sizes=3072;1024;512;10.
- --layer-weights <value>: Specify the weight initialization method for linear layers. Supported values:

Specification | Description |
---|---|
XavierNormal | Xavier Glorot weights (normal distribution) |
XavierUniform | Xavier Glorot weights (uniform distribution) |
HeNormal | Kaiming He weights (normal distribution) |
HeUniform | Kaiming He weights (uniform distribution) |
 | Normal distribution |
 | Uniform distribution |
 | All weights are zero (N.B. this is not recommended for training) |
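For reference, the Xavier and He schemes correspond to PyTorch's standard initializers. A sketch using torch.nn.init directly (the library's own set_weights_* functions, shown earlier, are the supported interface):

import torch

W = torch.empty(1024, 784)         # weight matrix of a 784 -> 1024 linear layer
torch.nn.init.xavier_normal_(W)    # Xavier Glorot, normal distribution
# alternatively: torch.nn.init.kaiming_normal_(W) for Kaiming He initialization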
Training Configuration Options
- --epochs <value>: Set the number of training epochs. Default: 100.
- --batch-size <value>: Set the training batch size.
- --optimizers <value>: Specify a semicolon-separated list of optimizers for the linear and batch normalization layers.

Specification | Description |
---|---|
GradientDescent | Standard gradient descent |
Momentum(mu) | Momentum optimization with momentum parameter mu |
Nesterov(mu) | Nesterov momentum optimization |
- --learning-rate <value>: Specify a semicolon-separated list of learning rate schedulers. If only one is given, it applies to all layers.

Specification | Description |
---|---|
Constant(lr) | Constant learning rate |
 | Adaptive learning rate with decay |
 | Step-based learning rate with scheduled drops |
 | Drops the learning rate at specified epoch milestones |
 | Exponentially decreasing learning rate |
- --loss <value>: Specify the loss function. Supported values:

Specification | Description |
---|---|
SquaredError | Squared error loss |
CrossEntropy | Cross entropy loss |
LogisticCrossEntropy | Logistic cross entropy loss |
SoftmaxCrossEntropy | Softmax cross entropy loss (matches PyTorch) |
NegativeLogLikelihood | Negative log likelihood loss |
- --load-weights <value>: Load weights and biases from a NumPy .npz file. The weight matrices use the keys W1, W2, …; the bias vectors use the keys b1, b2, …. See numpy.lib.format. A compatible file can be created with NumPy, as sketched below.
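A sketch of writing such a file for a hypothetical 784 → 1024 → 10 network (the shapes and the row/column orientation of the weight matrices are assumptions and must match what the library expects):

import numpy as np

# Keys follow the documented convention: W1, W2, ... and b1, b2, ...
np.savez("weights.npz",
         W1=np.random.randn(1024, 784).astype(np.float32),
         b1=np.zeros(1024, dtype=np.float32),
         W2=np.random.randn(10, 1024).astype(np.float32),
         b2=np.zeros(10, dtype=np.float32))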
Dataset Options
- --load-dataset <file>: Load a dataset from a NumPy .npz file. The file must contain the following arrays:
  - Xtrain: training inputs
  - Ttrain: training labels
  - Xtest: test inputs
  - Ttest: test labels

  Other arrays will be ignored. The shapes should match the expected input and output dimensions of the network.
The tool inspect_npz.py
The tool inspect_npz.py can be used to inspect the contents of a dataset stored in .npz format. An example invocation of the inspect_npz.py tool is:
python inspect_npz.py data/cifar10-flattened.npz
The output may look like this:
Xtrain (50000x3072 ) inf-norm = 1.00000000
[[0.23137255 0.16862745 0.19607843 ... 0.54901961 0.32941176 0.28235294]
 [0.60392157 0.49411765 0.41176471 ... 0.54509804 0.55686275 0.56470588]
 [1. 0.99215686 0.99215686 ... 0.3254902 0.3254902 0.32941176]
 ...
 [0.1372549 0.15686275 0.16470588 ... 0.30196078 0.25882353 0.19607843]
 [0.74117647 0.72941176 0.7254902 ... 0.6627451 0.67058824 0.67058824]
 [0.89803922 0.9254902 0.91764706 ... 0.67843137 0.63529412 0.63137255]]
Ttrain (50000 ) inf-norm = 9.00000000
[6 9 9 ... 9 1 1]
Xtest (10000x3072 ) inf-norm = 1.00000000
[[0.61960784 0.62352941 0.64705882 ... 0.48627451 0.50588235 0.43137255]
 [0.92156863 0.90588235 0.90980392 ... 0.69803922 0.74901961 0.78039216]
 [0.61960784 0.61960784 0.54509804 ... 0.03137255 0.01176471 0.02745098]
 ...
 [0.07843137 0.0745098 0.05882353 ... 0.19607843 0.20784314 0.18431373]
 [0.09803922 0.05882353 0.09019608 ... 0.31372549 0.31764706 0.31372549]
 [0.28627451 0.38431373 0.38823529 ... 0.36862745 0.22745098 0.10196078]]
Ttest (10000 ) inf-norm = 9.00000000
[3 8 8 ... 5 1 7]
With the command line option --shapes-only a summary can be obtained:
Xtrain (50000x3072 ) inf-norm = 1.00000000
Ttrain (50000 ) inf-norm = 9.00000000
Xtest (10000x3072 ) inf-norm = 1.00000000
Ttest (10000 ) inf-norm = 9.00000000
Data Handling
The Nerva-Torch Python Library provides utilities for reading and writing datasets and the weights and biases of MLP models in NumPy .npz format. This format ensures portability between Python and C++ implementations. Currently, storing the complete model, including its architecture, is not supported.
NPZ format
The default storage format used in the Nerva libraries is the NumPy NPZ format (see numpy.lib.format). A .npz file can store a dictionary of arrays in compressed form, which allows both datasets and model parameters to be saved efficiently.
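For example, a dictionary of arrays can be written and read back with standard NumPy calls (the file name and array contents are illustrative):

import numpy as np

np.savez_compressed("example.npz",
                    Xtrain=np.zeros((5, 4), dtype=np.float32),
                    Ttrain=np.array([0, 1, 2, 1, 0]))

data = np.load("example.npz")
print(data.files)              # ['Xtrain', 'Ttrain']
print(data["Xtrain"].shape)    # (5, 4)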
Preparing data
The mlp.py utility expects training and testing data in .npz format. A helper script is provided to download and preprocess commonly used datasets, including MNIST and CIFAR-10. The script is located at data/prepare_data.py and can be run from the command line.
MNIST
To download and prepare MNIST:
python prepare_data.py --dataset=mnist --download
This will:

- Download mnist.npz if it does not exist.
- Create a flattened and normalized version of the dataset as mnist-flattened.npz.
The output file contains:

- Xtrain, Xtest: flattened and normalized image data
- Ttrain, Ttest: corresponding label vectors
CIFAR-10
To download and prepare CIFAR-10:
python prepare_data.py --dataset=cifar10 --download
This will:

- Download the CIFAR-10 binary dataset from https://www.cs.toronto.edu/~kriz/cifar.html
- Extract the archive
- Flatten and normalize the RGB images into shape [N, 3072]
- Save the result as cifar10-flattened.npz
The output file contains:

- Xtrain, Xtest: flattened image arrays with pixel values normalized to [0, 1]
- Ttrain, Ttest: integer class labels
Reusing Existing Files
If the required .npz files already exist, the script will detect this and skip reprocessing. It is safe to rerun the script; existing files will not be overwritten.
Help
To see all script options:
python prepare_data.py --help
Inspecting .npz files
To inspect the contents of an .npz file, use the inspect_npz.py tool described above.
Storing datasets and weights
The mlp.py utility supports saving and loading datasets and model parameters in .npz format.

- Use --save-dataset and --load-dataset to write or read datasets.
- Use --save-weights and --load-weights to store or restore the weights and biases of an MLP.
The .npz file for datasets contains:

- Xtrain, Ttrain: training inputs and labels
- Xtest, Ttest: test inputs and labels
The .npz file for model parameters contains:

- W1, W2, …: weight matrices for each linear layer
- b1, b2, …: corresponding bias vectors
All arrays use standard NumPy formats and can be inspected or manipulated in Python using numpy.load() and numpy.savez().
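For instance, a stored parameter file can be inspected as follows (the file name is illustrative):

import numpy as np

data = np.load("mlp-weights.npz")
print(data.files)                  # e.g. ['W1', 'b1', 'W2', 'b2', 'W3', 'b3']
for name in data.files:
    print(name, data[name].shape, data[name].dtype)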
The architecture of the model (number of layers, activation functions, etc.) is not stored in the .npz file.
Advanced Topics
Matrix operations
The most important part of the implementation of neural networks consists of matrix operations: the activation functions, loss functions and neural network layers all require many different matrix operations. Nerva follows a structured approach to implement these components. All equations are expressed in terms of the matrix operations in the table below; a PyTorch sketch of several of these primitives is given after the table.
Operation | Code | Definition |
---|---|---|
\(0_{m}\) | | \(m \times 1\) column vector with elements equal to 0 |
\(0_{mn}\) | | \(m \times n\) matrix with elements equal to 0 |
\(1_{m}\) | | \(m \times 1\) column vector with elements equal to 1 |
\(1_{mn}\) | | \(m \times n\) matrix with elements equal to 1 |
\(\mathbb{I}_n\) | | \(n \times n\) identity matrix |
\(X^\top\) | | transposition |
\(cX\) | | scalar multiplication, \(c \in \mathbb{R}\) |
\(X + Y\) | | addition |
\(X - Y\) | | subtraction |
\(X \cdot Z\) | | matrix multiplication, also denoted as \(XZ\) |
\(x^\top y~\) or \(~x y^\top\) | | dot product, \(x,y \in \mathbb{R}^{m \times 1}\) or \(x,y \in \mathbb{R}^{1 \times n}\) |
\(X \odot Y\) | | element-wise product of \(X\) and \(Y\) |
\(\mathsf{diag}(X)\) | | column vector that contains the diagonal of \(X\) |
\(\mathsf{Diag}(x)\) | | diagonal matrix with \(x\) as diagonal, \(x \in \mathbb{R}^{1 \times n}\) or \(x \in \mathbb{R}^{m \times 1}\) |
\(1_m^\top \cdot X \cdot 1_n\) | | sum of the elements of \(X\) |
\(x \cdot 1_n^\top\) | | \(n\) copies of column vector \(x \in \mathbb{R}^{m \times 1}\) |
\(1_m \cdot x\) | | \(m\) copies of row vector \(x \in \mathbb{R}^{1 \times n}\) |
\(1_m^\top \cdot X\) | | \(1 \times n\) row vector with sums of the columns of \(X\) |
\(X \cdot 1_n\) | | \(m \times 1\) column vector with sums of the rows of \(X\) |
\(\max(X)_{col}\) | | \(1 \times n\) row vector with maximum values of the columns of \(X\) |
\(\max(X)_{row}\) | | \(m \times 1\) column vector with maximum values of the rows of \(X\) |
\((1_m^\top \cdot X) / m\) | | \(1 \times n\) row vector with mean values of the columns of \(X\) |
\((X \cdot 1_n) / n\) | | \(m \times 1\) column vector with mean values of the rows of \(X\) |
\(f(X)\) | | element-wise application of \(f: \mathbb{R} \rightarrow \mathbb{R}\) to \(X\) |
\(e^X\) | | element-wise application of \(f: x \rightarrow e^x\) to \(X\) |
\(\log(X)\) | | element-wise application of the natural logarithm \(f: x \rightarrow \ln(x)\) to \(X\) |
\(1 / X\) | | element-wise application of \(f: x \rightarrow 1/x\) to \(X\) |
\(\sqrt{X}\) | | element-wise application of \(f: x \rightarrow \sqrt{x}\) to \(X\) |
\(X^{-1/2}\) | | element-wise application of \(f: x \rightarrow x^{-1/2}\) to \(X\) |
\(\log(\sigma(X))\) | | element-wise application of \(f: x \rightarrow \log(\sigma(x))\) to \(X\) |
element-wise application of \(f: x \rightarrow \log(\sigma(x))\) to \(X\) |