The Nerva-Rowwise Python manual

1. Introduction

The Nerva-Rowwise Python Library is a library for neural networks. It is part of the Nerva library collection (https://github.com/wiegerw/nerva), which includes native C++ and Python implementations of neural networks. Originally, the library was intended for experimenting with truly sparse neural networks. Nowadays, it also aims to provide a transparent and accessible implementation of neural networks.

This document describes the implementation of the Nerva-Rowwise Python Library. For initial versions of the library I took inspiration from lecture notes of machine learning courses by Roger Grosse, which I highly recommend. This influence may still be evident in the naming of symbols.

This library features a Python module named nerva, which is built using Python bindings to the Nerva-Rowwise C++ Library. Note that the matrix type used internally in the nerva module is torch.Tensor, which ensures easy integration with PyTorch.

2. Installation

The Python bindings of the Nerva-Rowwise Python Library provide access to the high-performance C++ backend of the Nerva neural network library. These bindings are built using pybind11 and compiled with C++17 or higher. Installation is supported via pip, but requires several system-level dependencies to be available.

The installation process uses a setup.py script located in the python directory of the repository. This script requires certain environment variables to be set in order to locate the necessary libraries and headers.

2.1. Requirements

Before installing the Python bindings, ensure the following C++ libraries are available on your system:

Intel oneMKL (for efficient linear algebra routines): oneMKL
FMT (header-only formatting library): https://github.com/fmtlib/fmt
Eigen (header-only C++ linear algebra library): https://eigen.tuxfamily.org/
pybind11 (for binding C++ to Python): https://github.com/pybind/pybind11

These libraries must be available as header files (and, for MKL, also as linkable libraries) at build time.

2.2. Environment Variables

The build system locates dependencies using the following environment variables:

# Required for MKL
export MKL_ROOT=/opt/intel/oneapi/mkl/latest
# or use:
export ONEAPI_ROOT=/opt/intel/oneapi

# Required if not using CMake dependency resolution
export EIGEN_INCLUDE_DIR=/path/to/eigen
export FMT_INCLUDE_DIR=/path/to/fmt/include
export PYBIND11_INCLUDE_DIR=/path/to/pybind11/include

If the ONEAPI_ROOT variable is set, the MKL path will be inferred as $ONEAPI_ROOT/mkl/latest. If you are on Windows, setting ONEAPI_ROOT is required to find the Intel OpenMP runtime (libiomp5).
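
The lookup rule for the MKL path can be sketched in Python as follows. This is only an illustration and not the actual logic of setup.py: the function name resolve_mkl_root is hypothetical, and giving MKL_ROOT precedence over ONEAPI_ROOT is an assumption.

```python
import posixpath
from typing import Mapping, Optional


def resolve_mkl_root(env: Mapping[str, str]) -> Optional[str]:
    """Return the MKL directory implied by the environment, or None."""
    # Assumption: an explicitly set MKL_ROOT takes precedence.
    if env.get("MKL_ROOT"):
        return env["MKL_ROOT"]
    # Documented rule: the MKL path is inferred as $ONEAPI_ROOT/mkl/latest.
    if env.get("ONEAPI_ROOT"):
        return posixpath.join(env["ONEAPI_ROOT"], "mkl", "latest")
    return None


mkl_root = resolve_mkl_root({"ONEAPI_ROOT": "/opt/intel/oneapi"})
```

In a real script the argument would typically be os.environ.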

2.3. Alternative: CMake Dependency Resolution

If you built the C++ library using CMake, you can optionally have FMT, Eigen, and pybind11 downloaded automatically via CMake’s FetchContent mechanism. These will be placed in the _deps subdirectory of the build directory.

To use the CMake-resolved dependencies, set the following environment variable:

export CMAKE_DEPS_DIR=/path/to/cmake-build/_deps

This tells setup.py to look for the headers in:

CMAKE_DEPS_DIR/eigen-src
CMAKE_DEPS_DIR/fmt-src/include
CMAKE_DEPS_DIR/pybind11-src/include

Refer to the CMake install section in the C++ documentation for details.
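
Before running pip, it can be useful to check that the three header directories listed above exist. The sketch below does this in Python; the helper names cmake_include_dirs and missing_include_dirs are hypothetical and are not part of setup.py.

```python
import os
from typing import Dict, List

# Sub-paths below CMAKE_DEPS_DIR, as listed in this section.
EXPECTED_SUBDIRS = {
    "eigen": ("eigen-src",),
    "fmt": ("fmt-src", "include"),
    "pybind11": ("pybind11-src", "include"),
}


def cmake_include_dirs(cmake_deps_dir: str) -> Dict[str, str]:
    """Map each header-only dependency to its expected include directory."""
    return {name: os.path.join(cmake_deps_dir, *parts)
            for name, parts in EXPECTED_SUBDIRS.items()}


def missing_include_dirs(cmake_deps_dir: str) -> List[str]:
    """Return the names of dependencies whose include directory does not exist."""
    return sorted(name for name, path in cmake_include_dirs(cmake_deps_dir).items()
                  if not os.path.isdir(path))
```

An empty result from missing_include_dirs suggests that CMAKE_DEPS_DIR points at a usable _deps directory.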

2.4. Compiler Requirements

A C++17-compatible compiler is required. The installation has been tested with:

Linux/macOS: GCC 11+ or Clang 14+
Windows: Visual Studio 2022 or later

The setup script uses OpenMP and links against MKL; make sure these are available on your platform.

2.5. Python Compatibility

The bindings are compatible with Python 3.12 and newer. Make sure your pip corresponds to a supported version of Python.

2.6. Installing the Python Module

To install the Python bindings, navigate to the python subdirectory of the repository and run:

pip install .

This will build the native extension and install the nerva Python module. If any required environment variables are missing or incorrectly set, setup.py will raise an informative error.

3. Command line tools

The following command line tools are available. They can be found in the tools directory.

Tool	Description
`mlp.py`	A tool for training multilayer perceptrons.
`inspect_npz.py`	A tool for inspecting the contents of a file in NumPy NPZ format.

3.1. The tool mlp.py

The tool mlp.py can be used for training multilayer perceptrons. An example invocation of the mlp.py tool is

python ../tools/mlp.py \
    --layers="ReLU;ReLU;Linear" \
    --layer-sizes="3072;1024;1024;10" \
    --layer-weights=XavierNormal \
    --optimizers="Nesterov(0.9)" \
    --loss=SoftmaxCrossEntropy \
    --learning-rate=0.01 \
    --epochs=100 \
    --batch-size=100 \
    --threads=12 \
    --overall-density=0.05 \
    --dataset=$dataset \
    --seed=123

This will train a CIFAR-10 model using an MLP consisting of three linear layers with activation functions ReLU, ReLU and no activation. With an overall density of 0.05, the two large layers are sparse, while the small output layer is dense (see the sample output below). A script prepare_data.py is available in the data directory that can be used to download the dataset, flatten it and store it in .npz format. See the section Preparing data for details.

The output may look like this:

Loading dataset from file ../../data/cifar10-flattened.npz
=== Nerva python model ===
MultilayerPerceptron(
  Sparse(output_size=1024, density=0.042382812500000006, activation=ReLU(), optimizer=Nesterov(0.9), weight_initializer=Xavier),
  Sparse(output_size=1024, density=0.06357421875000001, activation=ReLU(), optimizer=Nesterov(0.9), weight_initializer=Xavier),
  Dense(output_size=10, activation=NoActivation(), optimizer=Nesterov(0.9), weight_initializer=Xavier, dropout=0.0)
)
loss = SoftmaxCrossEntropyLoss()
scheduler = 0.01
layer densities: 133325/3145728 (4.238%), 66662/1048576 (6.357%), 10240/10240 (100%)


=== Training Nerva model ===
epoch   0  lr: 0.01000000  loss: 2.30248605  train accuracy: 0.10576000  test accuracy: 0.10570000  time: 0.00000000s
epoch   1  lr: 0.01000000  loss: 2.24581714  train accuracy: 0.17630000  test accuracy: 0.18050000  time: 3.80288282s
epoch   2  lr: 0.01000000  loss: 2.03262505  train accuracy: 0.24958000  test accuracy: 0.24750000  time: 3.53668000s
epoch   3  lr: 0.01000000  loss: 1.92843132  train accuracy: 0.29416000  test accuracy: 0.29450000  time: 4.51532699s
epoch   4  lr: 0.01000000  loss: 1.86818965  train accuracy: 0.32428000  test accuracy: 0.32560000  time: 4.43234620s
epoch   5  lr: 0.01000000  loss: 1.81106027  train accuracy: 0.35034000  test accuracy: 0.34890000  time: 4.40411531s
epoch   6  lr: 0.01000000  loss: 1.75482929  train accuracy: 0.37082000  test accuracy: 0.36880000  time: 4.30115332s
epoch   7  lr: 0.01000000  loss: 1.70783696  train accuracy: 0.38672000  test accuracy: 0.38330000  time: 4.36580419s
epoch   8  lr: 0.01000000  loss: 1.66517540  train accuracy: 0.40234000  test accuracy: 0.40010000  time: 4.72230031s
epoch   9  lr: 0.01000000  loss: 1.62592003  train accuracy: 0.41828000  test accuracy: 0.41820000  time: 4.38559597s
epoch  10  lr: 0.01000000  loss: 1.58918872  train accuracy: 0.43308000  test accuracy: 0.43240000  time: 4.33621435s

3.1.1. Command line options of mlp.py

This section gives an overview of the command line interface of the mlp.py tool.

Parameter lists

Some command line options take a list of items as input, for example a list of layers. These items must be separated by semicolons, e.g. --layers="ReLU;ReLU;Linear".

Named parameters

Some of the items take parameters. For this we use a function call syntax with named parameters, e.g. AllReLU(alpha=0.3). If there is only one parameter, the name may be omitted: AllReLU(0.3). Parameters that have default values may be omitted. For example, SReLU or SReLU() is equivalent to SReLU(al=0,tl=0,ar=0,tr=1).
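
The list and function call syntax can be illustrated with a small parser. This is a minimal sketch that only mirrors the syntax described above; the functions parse_spec and parse_list are hypothetical and are not the parser used by mlp.py. Argument values are kept as strings.

```python
import re
from typing import Dict, List, Tuple

Spec = Tuple[str, List[str], Dict[str, str]]

# A name, optionally followed by a parenthesized argument list.
SPEC_PATTERN = re.compile(r"^\s*(\w+)\s*(?:\((.*)\))?\s*$")


def parse_spec(text: str) -> Spec:
    """Split 'Name(a=1,2)' into the name, positional and named arguments."""
    match = SPEC_PATTERN.match(text)
    if match is None:
        raise ValueError(f"cannot parse specification: {text!r}")
    name, body = match.group(1), match.group(2) or ""
    positional: List[str] = []
    named: Dict[str, str] = {}
    for item in (part.strip() for part in body.split(",")):
        if not item:
            continue  # empty argument list, e.g. 'SReLU()'
        if "=" in item:
            key, value = item.split("=", 1)
            named[key.strip()] = value.strip()
        else:
            positional.append(item)
    return name, positional, named


def parse_list(text: str) -> List[Spec]:
    """Parse a semicolon-separated list such as 'ReLU;AllReLU(0.3);Linear'."""
    return [parse_spec(item) for item in text.split(";") if item.strip()]
```

With this sketch, SReLU and SReLU() both yield an empty argument list, so default values would be applied in the same way for both spellings.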

General options

-?, -h, --help Display help information.
--debug, -d Show debug output. This prints batches, weight matrices, bias vectors, gradients etc.

Random generator options

--seed <value> A seed value for the random generator.

Layer configuration options

--layers <value> A semicolon-separated list of layers. For example, --layers=ReLU;AllReLU(0.3);Linear specifies a neural network with three layers that have a ReLU, an AllReLU and no activation function, respectively. The following layers are supported:

Specification	Description
`Linear`	Linear layer without activation
`ReLU`	Linear layer with ReLU activation
`Sigmoid`	Linear layer with sigmoid activation
`Softmax`	Linear layer with softmax activation
`LogSoftmax`	Linear layer with log-softmax activation
`HyperbolicTangent`	Linear layer with hyperbolic tangent activation
`AllReLU(<alpha>)`	Linear layer with All ReLU activation
`SReLU(<al>,<tl>,<ar>,<tr>)`	Linear layer with SReLU activation. The default values of the parameters are `al=0, tl=0, ar=0, tr=1`. For these values `SReLU` coincides with `ReLU`.
`BatchNormalization`	Batch normalization layer

--layer-sizes <value> A semicolon-separated list of the sizes of linear layers of the multilayer perceptron. For example, --layer-sizes=3072;1024;512;10 specifies the sizes of three linear layers. The first one has 3072 inputs and 1024 outputs, the second one 1024 inputs and 512 outputs, and the third one has 512 inputs and 10 outputs.
--densities <value> A comma-separated list of linear layer densities. By default, all linear layers are dense (i.e. have density 1.0). If only one value is specified, it will be used for all linear layers.
--dropouts <value> A comma-separated list of dropout rates of linear layers. By default, all linear layers have no dropout (i.e. dropout rate 0.0).
--overall-density <value> The overall density of the linear layers. This value should be in the interval $[0,1]$, and it specifies the fraction of the total number of weights that is non-zero. The overall density is not distributed evenly over the layers. Instead, small layers will be assigned a higher density than large layers.
--layer-weights <value> The generator that is used for initializing the weights of the linear layers. The following weight generators are supported:

Specification	Description
`XavierNormal`	Xavier Glorot weights (normal distribution)
`XavierUniform`	Xavier Glorot weights (uniform distribution)
`HeNormal`	Kaiming He weights (normal distribution)
`HeUniform`	Kaiming He weights (uniform distribution)
`Normal`	Normal distribution
`Uniform`	Uniform distribution
`Zero`	All weights are zero (N.B. This is not recommended for training)
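
The options --layer-sizes and --overall-density described above can be illustrated with a short sketch. The function layer_shapes follows the documented meaning of the size list. The allocation rule in distribute_density is an assumption: it gives each layer a number of non-zero weights proportional to inputs + outputs, and caps layers at fully dense. This rule happens to reproduce the densities shown in the sample output of mlp.py in this manual, but it is not taken from the library source.

```python
from typing import List, Tuple


def layer_shapes(layer_sizes: str) -> List[Tuple[int, int]]:
    """Turn '3072;1024;1024;10' into (inputs, outputs) pairs, one per linear layer."""
    sizes = [int(s) for s in layer_sizes.split(";")]
    return list(zip(sizes, sizes[1:]))


def distribute_density(shapes: List[Tuple[int, int]], overall_density: float) -> List[float]:
    """Assumed rule: non-zero count per layer proportional to inputs + outputs."""
    sizes = [n * m for n, m in shapes]
    budget = overall_density * sum(sizes)  # total number of non-zero weights
    counts = [0.0] * len(shapes)
    free = list(range(len(shapes)))        # layers that are not yet fully dense
    while free:
        scale = budget / sum(sum(shapes[i]) for i in free)
        full = [i for i in free if scale * sum(shapes[i]) >= sizes[i]]
        if not full:
            for i in free:
                counts[i] = scale * sum(shapes[i])
            break
        # Layers that would exceed 100% become dense; redistribute the rest.
        for i in full:
            counts[i] = float(sizes[i])
            budget -= sizes[i]
            free.remove(i)
    return [c / s for c, s in zip(counts, sizes)]


densities = distribute_density(layer_shapes("3072;1024;1024;10"), 0.05)
```

Under this assumed rule the small output layer ends up fully dense, which illustrates why small layers receive a higher density than large ones.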

Training configuration options

--epochs <value> The number of epochs of the training (default: 100).
--batch-size <value> The batch size of the training.
--no-shuffle Do not shuffle the dataset during training.
--no-statistics Do not display intermediate statistics during training.
--optimizers <value> A semicolon-separated list of optimizers used for linear and batch normalization layers. The following optimizers are supported:

Specification	Description
`GradientDescent`	Gradient descent optimization
`Momentum(mu)`	Momentum optimization with momentum parameter `mu`
`Nesterov(mu)`	Nesterov optimization with momentum parameter `mu`

--learning-rate <value> A semicolon-separated list of learning rate schedulers of linear and batch normalization layers. If only one learning rate scheduler is specified, it is applied to all layers. The following learning rate schedulers are supported:

Specification	Description
`Constant(lr)`	Constant learning rate `lr`
`TimeBased(lr, decay)`	Adaptive learning rate with decay
`StepBased(lr, drop_rate, change_rate)`	Step based learning rate where the learning rate is regularly dropped to a lower value
`MultistepLR(lr, milestones, gamma)`	Step based learning rate, where `milestones` contains the epoch numbers in which the learning rate is dropped.
`Exponential(lr, change_rate)`	Exponentially decreasing learning rate
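
To make the idea of a learning rate scheduler concrete, the sketch below models three of the schedulers as functions from the epoch number to a learning rate. The formulas are assumptions based on common definitions (the multistep rule follows the description of `milestones` above), and the function names are hypothetical. The library's exact formulas may differ.

```python
import bisect
import math
from typing import Callable, Sequence

Scheduler = Callable[[int], float]


def constant(lr: float) -> Scheduler:
    """The same learning rate in every epoch."""
    return lambda epoch: lr


def multistep_lr(lr: float, milestones: Sequence[int], gamma: float) -> Scheduler:
    """Multiply lr by gamma once for every milestone epoch that has been reached."""
    ordered = sorted(milestones)
    return lambda epoch: lr * gamma ** bisect.bisect_right(ordered, epoch)


def exponential(lr: float, change_rate: float) -> Scheduler:
    """Assumed formula: lr * exp(-change_rate * epoch)."""
    return lambda epoch: lr * math.exp(-change_rate * epoch)


scheduler = multistep_lr(0.1, [10, 20], 0.5)
rates = [scheduler(epoch) for epoch in (0, 9, 10, 25)]
```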

--loss <value> The loss function used for training the multilayer perceptron. The following loss functions are supported:

Specification	Description
`SquaredError`	Squared error loss.
`CrossEntropy`	Cross entropy loss (N.B. prone to numerical problems!)
`LogisticCrossEntropy`	Logistic cross entropy loss.
`SoftmaxCrossEntropy`	Softmax cross entropy loss. Matches `CrossEntropy` of PyTorch. Suitable for classification experiments.
`NegativeLogLikelihood`	Negative log likelihood loss.
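
A common reason why a plain cross entropy loss is numerically fragile is that it takes the logarithm of probabilities that may underflow to zero. A softmax cross entropy loss can avoid this by working with the raw outputs and the log-sum-exp trick. The sketch below shows this for a single example in pure Python. It is an illustration of the technique only and not the library's implementation; the function names are hypothetical.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Compute log(softmax(logits)) without overflow, using the log-sum-exp trick."""
    shift = max(logits)  # subtracting the maximum keeps every exponent <= 0
    log_sum = math.log(sum(math.exp(z - shift) for z in logits))
    return [z - shift - log_sum for z in logits]


def softmax_cross_entropy(logits: Sequence[float], target: int) -> float:
    """Cross entropy between softmax(logits) and the class index `target`."""
    return -log_softmax(logits)[target]


# math.exp(1000.0) would overflow, but the shifted version stays finite.
loss_right = softmax_cross_entropy([1000.0, 0.0], target=0)
loss_wrong = softmax_cross_entropy([1000.0, 0.0], target=1)
```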

--load-weights <value> Load weights and biases from a dictionary in NumPy .npz format. The weight matrices should be stored with keys W1,W2,… and the bias vectors with keys b1,b2,…. See also numpy.lib.format.
--save-weights <value> Save weights and biases to a dictionary in NumPy .npz format. The weight matrices are stored with keys W1,W2,… and the bias vectors with keys b1,b2,…. See also numpy.lib.format.

Dataset options

--load-data <value> Load the dataset from a file in NumPy .npz format. See I/O for information about the .npz format.
--save-data <value> Save the dataset to a file in NumPy .npz format. See I/O for information about the .npz format.
--preprocessed <directory> A directory containing datasets named epoch0.npz, epoch1.npz, … See I/O for information about the .npz format. This can for example be used to precompute augmented datasets. A script generate_cifar10_augmented_datasets.py is available for creating augmented CIFAR-10 datasets.

Pruning and growing options

--prune <strategy> The strategy used for pruning sparse weight matrices. The following strategies are supported:

Specification	Description
`Magnitude(<drop_fraction>)`	Magnitude based pruning. A fraction of the weights with the smallest absolute value is pruned.
`SET(<drop_fraction>)`	SET pruning. Positive and negative weights are treated separately. Both a fraction of the positive and a fraction of the negative weights is pruned.
`Threshold(<threshold>)`	Weights with absolute value below the given threshold are pruned.

--grow <strategy> The strategy used for growing in sparse weight matrices. The following strategies are supported:

Specification	Description
`Random`	Weights are added at random positions (outside the support of the sparse matrix).

--grow-weights <value> The weight generation function used for growing weights. See --layer-weights for supported values. The default value is Xavier.
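
The pruning and growing strategies above can be sketched on a toy representation of a sparse matrix: a dictionary that maps flat positions in the support to non-zero weights. This is a minimal pure-Python illustration, not the library's C++ implementation. All function names are hypothetical, and for SET it is an assumption that the weights closest to zero are the ones removed within each sign class.

```python
import random
from typing import Callable, Dict, Iterable

Weights = Dict[int, float]  # flat position in the matrix -> non-zero weight


def _drop_closest_to_zero(weights: Weights, positions: Iterable[int],
                          drop_fraction: float) -> Weights:
    """Remove the given fraction of `positions`, smallest absolute value first."""
    positions = list(positions)
    count = int(drop_fraction * len(positions))
    dropped = set(sorted(positions, key=lambda k: abs(weights[k]))[:count])
    return {k: w for k, w in weights.items() if k not in dropped}


def magnitude_prune(weights: Weights, drop_fraction: float) -> Weights:
    return _drop_closest_to_zero(weights, weights.keys(), drop_fraction)


def set_prune(weights: Weights, drop_fraction: float) -> Weights:
    # Positive and negative weights are treated separately.
    positive = [k for k, w in weights.items() if w > 0]
    negative = [k for k, w in weights.items() if w < 0]
    pruned = _drop_closest_to_zero(weights, positive, drop_fraction)
    return _drop_closest_to_zero(pruned, negative, drop_fraction)


def threshold_prune(weights: Weights, threshold: float) -> Weights:
    return {k: w for k, w in weights.items() if abs(w) >= threshold}


def grow_random(weights: Weights, size: int, count: int, rng: random.Random,
                init: Callable[[], float] = lambda: 0.01) -> Weights:
    """Add `count` new weights at random positions outside the support."""
    candidates = [k for k in range(size) if k not in weights]
    grown = dict(weights)
    for k in rng.sample(candidates, count):
        grown[k] = init()
    return grown


w = {0: 0.1, 1: 0.2, 2: 0.9, 3: -0.5, 4: -0.7}
regrown = grow_random(magnitude_prune(w, 0.4), size=8, count=2, rng=random.Random(0))
```

Pruning followed by growing the same number of weights keeps the size of the support constant, which is the usual setup in dynamic sparse training.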

Computation options

--computation <value> The computation mode that is used for backpropagation. This is used for performance measurements. The following computation modes are available:

Specification	Description
`eigen`	All computations are done using the Eigen library. Note that by setting the flag `EIGEN_USE_MKL_ALL` Eigen will attempt to use MKL library calls.
`mkl`	Some computations are implemented using MKL functions.
`blas`	Some computations are implemented using BLAS functions.
`sycl`	Some computations are implemented using SYCL functions.

--clip <value> A threshold value used to set small elements of weight matrices to zero.
--threads <value> The number of threads used by the MKL and OMP libraries.
--gradient-step <value> If this value is set, gradient checks are performed with the given step size. This is very slow, and should only be used for debugging.

Miscellaneous options

--info Print detailed information about the multilayer perceptron.
--timer Print timer messages. The following values are supported:

Value	Description
`disabled`	No timing information is displayed
`brief`	At the end, a report with accumulated timing measurements will be displayed
`full`	In addition, individual timing measurements will be displayed

--precision <value> The precision used for printing matrix elements.
--edgeitems <value> The edgeitems used for printing matrices. This sets the number of border rows and columns that are printed.

3.2. The tool inspect_npz.py

The tool inspect_npz.py can be used to inspect the contents of a dataset stored in .npz format.

An example invocation of the inspect_npz.py tool is

python inspect_npz.py data/cifar10-flattened.npz

The output may look like this:

Xtrain   (50000x3072  )  inf-norm = 1.00000000
[[0.23137255 0.16862745 0.19607843 ... 0.54901961 0.32941176 0.28235294]
 [0.60392157 0.49411765 0.41176471 ... 0.54509804 0.55686275 0.56470588]
 [1.         0.99215686 0.99215686 ... 0.3254902  0.3254902  0.32941176]
 ...
 [0.1372549  0.15686275 0.16470588 ... 0.30196078 0.25882353 0.19607843]
 [0.74117647 0.72941176 0.7254902  ... 0.6627451  0.67058824 0.67058824]
 [0.89803922 0.9254902  0.91764706 ... 0.67843137 0.63529412 0.63137255]]

Ttrain   (50000       )  inf-norm = 9.00000000
[6 9 9 ... 9 1 1]

Xtest    (10000x3072  )  inf-norm = 1.00000000
[[0.61960784 0.62352941 0.64705882 ... 0.48627451 0.50588235 0.43137255]
 [0.92156863 0.90588235 0.90980392 ... 0.69803922 0.74901961 0.78039216]
 [0.61960784 0.61960784 0.54509804 ... 0.03137255 0.01176471 0.02745098]
 ...
 [0.07843137 0.0745098  0.05882353 ... 0.19607843 0.20784314 0.18431373]
 [0.09803922 0.05882353 0.09019608 ... 0.31372549 0.31764706 0.31372549]
 [0.28627451 0.38431373 0.38823529 ... 0.36862745 0.22745098 0.10196078]]

Ttest    (10000       )  inf-norm = 9.00000000
[3 8 8 ... 5 1 7]

With the command line option --shapes-only a summary can be obtained:

Xtrain   (50000x3072  )  inf-norm = 1.00000000
Ttrain   (50000       )  inf-norm = 9.00000000
Xtest    (10000x3072  )  inf-norm = 1.00000000
Ttest    (10000       )  inf-norm = 9.00000000

4. Overview of the code

This section gives an overview of the Python code in the Nerva-Rowwise Python Library, and some explanations about the code.

4.1. Number type

The Nerva-Rowwise Python Library uses 32-bit floats as its number type. The C++ library also supports 64-bit floats.

4.2. Module contents

The most important files in the nerva module are given in the table below.

File	Description
`multilayer_perceptron.py`	A multilayer perceptron class.
`layers.py`	Neural network layers.
`activation_functions.py`	Activation functions.
`loss_functions.py`	Loss functions.
`weights.py`	Weight initialization functions.
`optimizers.py`	Optimizer functions, for updating neural network parameters using their gradients.
`learning_rate_schedulers.py`	Learning rate schedulers, for updating the learning rate during training.
`training.py`	A stochastic gradient descent algorithm.
`prune.py`	Algorithms for pruning sparse weight matrices. This is used for dynamic sparse training.
`grow.py`	Algorithms for (re-)growing sparse weights. This is used for dynamic sparse training.

4.3. Classes

4.3.1. Class MultilayerPerceptron

A multilayer perceptron (MLP) is modeled using the class MultilayerPerceptron. It contains a list of layers, and has member functions feedforward, backpropagate and optimize that can be used for training the neural network. Constructing an MLP can be done as follows:

def construct_mlp1(sizes: List[int], batch_size: int):

    layer1 = Dense(input_size=sizes[0],
                   output_size=sizes[1],
                   activation=ReLU(),
                   optimizer=GradientDescent(),
                   weight_initializer=XavierNormal())

    layer2 = Dense(input_size=sizes[1],
                   output_size=sizes[2],
                   activation=ReLU(),
                   optimizer=GradientDescent(),
                   weight_initializer=XavierNormal())

    layer3 = Dense(input_size=sizes[2],
                   output_size=sizes[3],
                   activation=NoActivation(),
                   optimizer=GradientDescent(),
                   weight_initializer=XavierNormal())

    M = MultilayerPerceptron()
    M.layers = [layer1, layer2, layer3]
    M.compile(batch_size)  # Initialize the C++ data structures

    return M

This creates an MLP with three linear layers. The parameter sizes contains the input and output sizes of the three layers. The weights are initialized using Xavier initialization.

Another way to construct MLPs is provided by the function make_layers, which offers a string-based interface. An example is given in the code below:

def construct_mlp2(linear_layer_sizes: List[int], batch_size: int):

    layer_specifications = ["ReLU", "ReLU", "Linear"]
    linear_layer_densities = [1.0, 1.0, 1.0]
    linear_layer_dropouts = [0.0, 0.0, 0.0]
    linear_layer_weights = ["XavierNormal", "XavierNormal", "XavierNormal"]
    layer_optimizers = ["GradientDescent", "GradientDescent", "GradientDescent"]
    layers = make_layers(layer_specifications,
                         linear_layer_sizes,
                         linear_layer_densities,
                         linear_layer_dropouts,
                         linear_layer_weights,
                         layer_optimizers)
    M = MultilayerPerceptron()
    M.layers = layers
    M.compile(batch_size)  # Initialize the C++ data structures

    return M

Note that optimizers should be specified for linear layers, but also for batch normalization layers.

A MultilayerPerceptron needs to be compiled before it can be used. This is done by calling M.compile(batch_size). As a result of this call, a C++ object is created that contains the actual model. A reference to this object is stored in the attribute _model.

4.3.2. Class Layer

The class Layer is the base class of all neural network layers. There are three different types of layers:

Layer	Description
`Dense`	A dense linear layer.
`Sparse`	A sparse linear layer.
`BatchNormalization`	A batch normalization layer.

A Dense layer has a constructor with the following parameters:

    def __init__(self,
                 input_size: int,
                 output_size: int,
                 activation: Activation=NoActivation(),
                 optimizer: Optimizer=GradientDescent(),
                 weight_initializer: WeightInitializer=XavierNormal(),
                 dropout_rate: float=0
                ):

This only sets a number of attributes of the layer. Before using the layer the compile function must be called:

    def compile(self, batch_size: int):
        """
        Creates a C++ object for the layer.

        :param batch_size: the batch size
        :return:
        """
        activation = print_activation(self.activation)
        if self.dropout_rate == 0.0:
            layer = nervalibrowwise.make_dense_linear_layer(self.input_size, self.output_size, batch_size, activation, str(self.weight_initializer), str(self.optimizer))
        else:
            layer = nervalibrowwise.make_dense_linear_dropout_layer(self.input_size, self.output_size, batch_size, self.dropout_rate, activation, str(self.weight_initializer), str(self.optimizer))
        self._layer = layer
        return layer

As a result of this call, a C++ object is created that contains the actual layer. It is stored in the attribute _layer. The normal workflow is to call the compile method of the multilayer perceptron, which will also compile the layers, as illustrated in the functions construct_mlp1 and construct_mlp2 in the section Class MultilayerPerceptron.

A Sparse layer has an additional parameter density in the interval $[0,1]$, that determines the fraction of weights that are in the support. Sparse layers do not support dropout.

A BatchNormalization layer has the following constructor:

    def __init__(self,
                 input_size: int,
                 output_size: Optional[int] = None,
                 optimizer: Optimizer = GradientDescent()
                ):
        self.input_size = input_size
        self.output_size = output_size
        self.optimizer = optimizer
        if self.output_size is None:
            self.output_size = self.input_size
        assert self.output_size == self.input_size

The output size may be omitted, since by definition it is the same as the input size.

4.3.3. Class LossFunction

The class LossFunction is the base class of all loss functions. There are five loss functions available:

SquaredErrorLoss
CrossEntropyLoss
LogisticCrossEntropyLoss
NegativeLogLikelihoodLoss
SoftmaxCrossEntropyLoss

See the paper Batch Matrix-form Equations and Implementation of Multilayer Perceptrons for precise definitions of these loss functions.

4.3.4. Activation functions

The class ActivationFunction is the base class of all activation functions. The following activation functions are available:

ReLU
Sigmoid
Softmax
LogSoftmax
TReLU
LeakyReLU
AllReLU
SReLU
HyperbolicTangent

See the paper Batch Matrix-form Equations and Implementation of Multilayer Perceptrons for precise definitions of these activation functions.

4.4. Accessing C++ data structures

To a limited extent, the C++ data structures can be accessed in Python. In the file loss_test.py it is demonstrated how to modify the weight matrices and bias vectors of dense layers via the _layer attribute:

        M.layers[0]._layer.W = W1
        M.layers[0]._layer.b = b1
        M.layers[1]._layer.W = W2
        M.layers[1]._layer.b = b2
        M.layers[2]._layer.W = W3
        M.layers[2]._layer.b = b3

The weight matrices of sparse layers are not yet fully exposed to Python.

4.5. Training a neural network

The class StochasticGradientDescentAlgorithm can be used to train a neural network. It takes as input a multilayer perceptron, a dataset, a loss function, a learning rate scheduler, and a struct containing options like the number of epochs. The main loop looks like this:

for epoch in range(self.options.epochs):
    self.on_start_epoch(epoch)

    for batch_index, (X, T) in enumerate(self.train_loader):
        self.on_start_batch(batch_index)
        T = to_one_hot(T, num_classes)
        Y = M.feedforward(X)
        DY = self.loss.gradient(Y, T) / options.batch_size
        M.backpropagate(Y, DY)
        M.optimize(learning_rate)
        self.on_end_batch(batch_index)

    self.on_end_epoch(epoch)

self.on_end_training()

We follow the PyTorch convention that the targets used for classification are provided as a one-dimensional vector of integers. Using a call to to_one_hot, this vector is transformed into a one-hot encoded boolean matrix of the same dimensions as the output Y.
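
The effect of the one-hot transformation can be shown with a pure-Python stand-in. The library's to_one_hot operates on tensors; the version below uses nested lists and is only meant to illustrate the encoding.

```python
from typing import List, Sequence


def to_one_hot(targets: Sequence[int], num_classes: int) -> List[List[int]]:
    """Return one row per target, with a 1 in the column of the target class."""
    rows = []
    for t in targets:
        row = [0] * num_classes
        row[t] = 1
        rows.append(row)
    return rows


encoded = to_one_hot([0, 2], num_classes=3)
```

Each row of the result has the same length as a row of the output Y, so the loss can compare the two element by element.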

In every epoch, the dataset is divided into a number of batches. This is handled by the DataLoader, which creates batches X of a given batch size, with corresponding targets T (i.e. the expected outputs). Each batch goes through the three steps of stochastic gradient descent:

feedforward: Given an input batch X and the neural network parameters Θ, compute the output Y.
backpropagation: Given output Y corresponding to input X and targets T, compute the gradient DY of the loss function with respect to Y. Then from Y and DY, compute the gradient DΘ of the loss function with respect to the parameters Θ.
optimization: Given the gradient DΘ, update the parameters Θ.
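
The three steps can be made concrete with a tiny, self-contained example. The sketch below replaces the multilayer perceptron by a single linear model y = Θ·x with a squared error loss, and uses plain Python lists instead of tensors. All function names are chosen for illustration. Unlike M.backpropagate(Y, DY), the stand-in backpropagate takes the input batch explicitly, since there is no layer object that remembers it.

```python
from typing import Iterator, List, Tuple

Vector = List[float]
Matrix = List[Vector]


def batches(X: Matrix, T: Vector, batch_size: int) -> Iterator[Tuple[Matrix, Vector]]:
    """Divide the dataset into consecutive batches (no shuffling)."""
    for start in range(0, len(X), batch_size):
        yield X[start:start + batch_size], T[start:start + batch_size]


def feedforward(theta: Vector, X: Matrix) -> Vector:
    """Compute the output Y for an input batch X."""
    return [sum(w * x for w, x in zip(theta, row)) for row in X]


def loss_gradient(Y: Vector, T: Vector) -> Vector:
    """Gradient of the squared error sum((y - t)^2) with respect to Y."""
    return [2.0 * (y - t) for y, t in zip(Y, T)]


def backpropagate(X: Matrix, DY: Vector) -> Vector:
    """Gradient of the loss with respect to theta, computed from X and DY."""
    return [sum(dy * row[j] for dy, row in zip(DY, X)) for j in range(len(X[0]))]


def optimize(theta: Vector, Dtheta: Vector, learning_rate: float) -> Vector:
    """Plain gradient descent update."""
    return [w - learning_rate * dw for w, dw in zip(theta, Dtheta)]


def train(X: Matrix, T: Vector, theta: Vector, epochs: int,
          batch_size: int, learning_rate: float) -> Vector:
    for _ in range(epochs):
        for Xb, Tb in batches(X, T, batch_size):
            Y = feedforward(theta, Xb)
            DY = [dy / batch_size for dy in loss_gradient(Y, Tb)]
            Dtheta = backpropagate(Xb, DY)
            theta = optimize(theta, Dtheta, learning_rate)
    return theta


# The data follows t = 2 * x, so theta should approach [2.0].
theta = train([[1.0], [2.0], [3.0], [4.0]], [2.0, 4.0, 6.0, 8.0], [0.0],
              epochs=50, batch_size=2, learning_rate=0.05)
```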

4.5.1. Event functions

The algorithm uses a number of event functions:

Event	Description
`on_start_training`	Is called at the start of the training
`on_end_training`	Is called at the end of the training
`on_start_epoch`	Is called at the start of each epoch
`on_end_epoch`	Is called at the end of each epoch
`on_start_batch`	Is called at the start of each batch
`on_end_batch`	Is called at the end of each batch

The user can respond to these events by deriving from the class StochasticGradientDescentAlgorithm. Typical use cases for these event functions are the following:

Update the learning rate.
Renew dropout masks.
Prune and grow sparse weights.

Such operations are typically done after each epoch or after a given number of batches.
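
The pattern of responding to events by overriding methods can be sketched without the library. In the code below, TrainingLoop is a hypothetical stand-in for StochasticGradientDescentAlgorithm that only calls the event functions in the documented order. The derived class updates a learning rate at the start of each epoch. The halving rule is an arbitrary choice for illustration.

```python
from typing import List


class TrainingLoop:
    """Hypothetical stand-in that only dispatches the six event functions."""

    def __init__(self, epochs: int, batches_per_epoch: int):
        self.epochs = epochs
        self.batches_per_epoch = batches_per_epoch

    def on_start_training(self): pass
    def on_end_training(self): pass
    def on_start_epoch(self, epoch: int): pass
    def on_end_epoch(self, epoch: int): pass
    def on_start_batch(self, batch_index: int): pass
    def on_end_batch(self, batch_index: int): pass

    def run(self) -> None:
        self.on_start_training()
        for epoch in range(self.epochs):
            self.on_start_epoch(epoch)
            for batch_index in range(self.batches_per_epoch):
                self.on_start_batch(batch_index)
                # feedforward, backpropagation and optimization would go here
                self.on_end_batch(batch_index)
            self.on_end_epoch(epoch)
        self.on_end_training()


class HalvingLearningRate(TrainingLoop):
    """Halve the learning rate at the start of every epoch except the first."""

    def __init__(self, epochs: int, batches_per_epoch: int, learning_rate: float):
        super().__init__(epochs, batches_per_epoch)
        self.learning_rate = learning_rate
        self.history: List[float] = []

    def on_start_epoch(self, epoch: int) -> None:
        if epoch > 0:
            self.learning_rate *= 0.5
        self.history.append(self.learning_rate)


algorithm = HalvingLearningRate(epochs=3, batches_per_epoch=2, learning_rate=0.1)
algorithm.run()
```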

The following actions take place at the start of every epoch:

A preprocessed dataset is loaded from disk, which is done to avoid the expensive computation of augmented data at every epoch.
The learning rate is updated if a learning rate scheduler is set.
Dropout masks are renewed.
Sparse weight matrices are pruned and regrown if a regrow function is specified.
Small weights in the subnormal range are clipped to zero if the clip option is set.

An example can be found in the tool mlp.py:

    def on_start_epoch(self, epoch):
        if epoch > 0 and self.reload_data_directory:
            self.reload_data(epoch)

        if self.lr_scheduler:
            self.learning_rate = self.lr_scheduler(epoch)

        if epoch > 0:
            self.M.renew_dropout_masks()

        if epoch > 0 and self.regrow:
            self.regrow(self.M)

        if epoch > 0 and self.clip > 0:
            self.M._model.clip(self.clip)

5. I/O

The Nerva-Rowwise Python Library has support for reading and writing datasets and weights + biases of a model in NumPy NPZ format. This format is used for portability between C++ and Python implementations. There is no support yet for storing a complete model, including its architecture.

5.1. NPZ format

The default storage format used in the Nerva libraries is the NumPy NPZ format, see numpy.lib.format. The reason for choosing this format is portability between C++ and Python implementations. A file in .npz format can be used to store a dictionary of arrays in a compressed format.

5.2. Preparing data

The mlp.py tool requires training and testing data to be stored in .npz format. To help with this, a script is provided to download and preprocess datasets commonly used in experiments, including MNIST and CIFAR-10.

The script is located at python/data/prepare_data.py in the repository and can be run from the command line.

5.2.1. MNIST

To download and prepare the MNIST dataset, run:

python prepare_data.py --dataset=mnist --download

This will:

Download mnist.npz from the official source if not already present.
Create a flattened and normalized version of the dataset as mnist-flattened.npz.

The output file contains:

Xtrain, Xtest: flattened and normalized image data
Ttrain, Ttest: corresponding label vectors

5.2.2. CIFAR-10

To download and prepare the CIFAR-10 dataset, run:

python prepare_data.py --dataset=cifar10 --download

This will:

Download the CIFAR-10 binary dataset from https://www.cs.toronto.edu/~kriz/cifar.html
Extract the archive
Flatten and normalize the RGB images into shape [N, 3072]
Save the result as cifar10-flattened.npz

As with MNIST, the .npz file will contain:

Xtrain, Xtest: flattened image arrays with pixel values normalized to [0, 1]
Ttrain, Ttest: integer class labels
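
The flatten-and-normalize step can be sketched as follows. This is an illustration on random data, not the code of prepare_data.py: the function name flatten_and_normalize is hypothetical, and the actual script may order the colour channels differently when flattening.

```python
import numpy as np


def flatten_and_normalize(images: np.ndarray) -> np.ndarray:
    """Map uint8 images of shape [N, ...] to float32 rows with values in [0, 1]."""
    n = images.shape[0]
    return images.reshape(n, -1).astype(np.float32) / 255.0


# Five random stand-ins for 32x32 RGB images (not real CIFAR-10 data).
rng = np.random.default_rng(0)
fake_images = rng.integers(0, 256, size=(5, 32, 32, 3), dtype=np.uint8)

X = flatten_and_normalize(fake_images)
print(X.shape, X.dtype)
```

The same function maps 28x28 grayscale images, as in MNIST, to rows of length 784.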

5.2.3. Reusing Existing Files

If the required .npz files already exist, the script will detect this and skip reprocessing. You can safely rerun the script without overwriting files.

5.2.4. Help

For help with usage, run:

python data/prepare_data.py --help

This displays all options, including how to customize the output directory.

5.2.5. Inspecting `.npz` files

To inspect the contents of a .npz file (such as mnist-flattened.npz or cifar10-flattened.npz), you can use the inspect_npz.py utility included in the distribution:

python tools/inspect_npz.py data/mnist-flattened.npz

This prints the shape and values of each array stored in the file. To print only the names, shapes, and norms without dumping the full contents, use:

python tools/inspect_npz.py data/mnist-flattened.npz --shapes-only

5.3. Storing datasets and weights

The mlp.py tool has options --load-weights and --save-weights for loading and saving the weights and bias vectors of an MLP, and options --load-data and --save-data for loading and saving a dataset in NPZ format. The keys in the dictionary for the weight matrices and bias vectors of linear layers are W1, W2, … and b1, b2, …. The keys for the training data plus targets are Xtrain and Ttrain, while for the test data plus targets we use Xtest and Ttest.

5.4. Storing datasets and weights

The mlp.py tool supports saving and loading both datasets and model parameters using the NumPy .npz format. This ensures compatibility between Python and C++ implementations by storing everything in a standard dictionary of arrays.

Use --save-data and --load-data to write or read datasets.
Use --save-weights and --load-weights to store or restore the weights and biases of a trained model.

The .npz file for datasets contains the following keys:

Xtrain, Ttrain: input features and target labels for the training set
Xtest, Ttest: input features and target labels for the test set

The .npz file for model parameters stores each layer’s weights and biases under the keys:

W1, W2, …: weight matrices for the first, second, etc. linear layer
b1, b2, …: corresponding bias vectors

These arrays use standard NumPy formats and can be inspected or manipulated easily in Python using numpy.load() and numpy.savez().
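
As a minimal sketch of this round trip, the following example writes a dictionary with the keys W1, b1, W2, b2 and reads it back. The array shapes are illustrative assumptions, and an in-memory buffer is used in place of a file on disk.

```python
import io

import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters of a network with two linear layers.
weights = {
    "W1": rng.normal(size=(4, 3)).astype(np.float32),
    "b1": np.zeros(4, dtype=np.float32),
    "W2": rng.normal(size=(2, 4)).astype(np.float32),
    "b2": np.zeros(2, dtype=np.float32),
}

buffer = io.BytesIO()            # stand-in for a file such as weights.npz
np.savez(buffer, **weights)      # one array per keyword argument
buffer.seek(0)

with np.load(buffer) as data:
    loaded = {key: data[key] for key in data.files}

print(sorted(loaded))
```

Passing a file name instead of the buffer writes a regular .npz file.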

Note: The architecture of the model (e.g., number of layers or activation functions) is not stored in the .npz file. This must be specified separately when reloading weights.

6. Performance

This section discusses various aspects that play a role for the performance of a neural network library.

6.1. Mini-batches

In textbooks and tutorials, the training of a neural network is usually explained in terms of individual examples. But in order to achieve high performance, it is absolutely necessary to use mini-batches. On Wikipedia this is explained as follows:

A compromise between computing the true gradient and the gradient at a single sample is to compute the gradient against more than one training sample (called a "mini-batch") at each step. This can perform significantly better than "true" stochastic gradient descent described, because the code can make use of vectorization libraries rather than computing each step separately.

To support mini-batches, the Nerva-Rowwise Python Library defines all equations that play a role in the execution of a neural network in matrix form, including the backpropagation equations, see the paper Batch Matrix-form Equations and Implementation of Multilayer Perceptrons. For the latter, many neural network frameworks rely on Automatic differentiation, see also [1]. We use explicit backpropagation equations to implement truly sparse layers and to provide an instructive resource for those studying neural network execution.
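
The difference between processing individual examples and processing a mini-batch can be illustrated for the feedforward step of a linear layer. The sketch below uses NumPy and assumed variable names. Both variants compute the same output, but the batched variant is a single matrix product that a vectorized library can execute efficiently.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, K = 100, 20, 5
X = rng.normal(size=(N, D))   # mini-batch: one example per row
W = rng.normal(size=(K, D))   # weight matrix of a linear layer
b = rng.normal(size=K)        # bias vector

# Per-example formulation: one matrix-vector product per example.
Y_loop = np.stack([W @ x + b for x in X])

# Matrix formulation: one matrix-matrix product for the whole batch.
Y_batch = X @ W.T + b

print(np.allclose(Y_loop, Y_batch))
```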

6.2. Matrix products

The performance of training a neural network largely depends on the calculation of matrix products during the backpropagation step of linear layers. In order to do this efficiently, the Intel Math Kernel Library (MKL) is used. Currently, this dependency is hard-coded, but there are plans to make this optional. To experiment with other implementations, like SYCL or BLAS, a global setting is used that is discussed in the next section.

6.3. Subnormal numbers

Experiments with sparse neural networks have shown that the performance can be negatively influenced by subnormal numbers. The example program subnormal_numbers.cpp illustrates the problem. The table below is the result of the following experiment. The dot product of two large vectors of floating-point numbers is computed. One vector is filled with random values between 0 and 1, and the other with powers of 10, ranging from 1 to 1e−45. For values larger than 1e−35, the time needed for this calculation is about 0.044 seconds. For smaller values we end up in the range of subnormal numbers. This causes the runtime to increase more than eight-fold to 0.37 seconds. In our experiments we observed that when layers with high sparsity are used, it may happen that subnormal values appear in weight matrices, and their amount increases every epoch.

--- multiplication1 ---
time =   0.044372 | value = 1.0e+00    | sum = -5.49552e+03
time =   0.044567 | value = 1.0e-01    | sum = -5.49572e+02
time =   0.044243 | value = 1.0e-02    | sum = -5.49304e+01
time =   0.044434 | value = 1.0e-03    | sum = -5.49612e+00
time =   0.044253 | value = 1.0e-04    | sum = -5.49862e-01
time =   0.044765 | value = 1.0e-05    | sum = -5.49653e-02
time =   0.044698 | value = 1.0e-06    | sum = -5.49624e-03
time =   0.044683 | value = 1.0e-07    | sum = -5.49642e-04
time =   0.044703 | value = 1.0e-08    | sum = -5.49491e-05
time =   0.044821 | value = 1.0e-09    | sum = -5.49454e-06
time =   0.044705 | value = 1.0e-10    | sum = -5.49557e-07
time =   0.044657 | value = 1.0e-11    | sum = -5.49730e-08
time =   0.045235 | value = 1.0e-12    | sum = -5.49563e-09
time =   0.045120 | value = 1.0e-13    | sum = -5.49706e-10
time =   0.045010 | value = 1.0e-14    | sum = -5.49719e-11
time =   0.044988 | value = 1.0e-15    | sum = -5.49464e-12
time =   0.044943 | value = 1.0e-16    | sum = -5.49629e-13
time =   0.044795 | value = 1.0e-17    | sum = -5.49573e-14
time =   0.044147 | value = 1.0e-18    | sum = -5.49449e-15
time =   0.044166 | value = 1.0e-19    | sum = -5.49589e-16
time =   0.044380 | value = 1.0e-20    | sum = -5.49722e-17
time =   0.044036 | value = 1.0e-21    | sum = -5.49430e-18
time =   0.043405 | value = 1.0e-22    | sum = -5.49577e-19
time =   0.043615 | value = 1.0e-23    | sum = -5.49548e-20
time =   0.043544 | value = 1.0e-24    | sum = -5.49570e-21
time =   0.043547 | value = 1.0e-25    | sum = -5.49694e-22
time =   0.043536 | value = 1.0e-26    | sum = -5.49365e-23
time =   0.043560 | value = 1.0e-27    | sum = -5.49488e-24
time =   0.043500 | value = 1.0e-28    | sum = -5.49657e-25
time =   0.043524 | value = 1.0e-29    | sum = -5.49783e-26
time =   0.044128 | value = 1.0e-30    | sum = -5.49559e-27
time =   0.043585 | value = 1.0e-31    | sum = -5.49745e-28
time =   0.043530 | value = 1.0e-32    | sum = -5.49488e-29
time =   0.043609 | value = 1.0e-33    | sum = -5.49569e-30
time =   0.043805 | value = 1.0e-34    | sum = -5.49446e-31
time =   0.046169 | value = 1.0e-35    | sum = -5.49661e-32
time =   0.070594 | value = 1.0e-36    | sum = -5.49664e-33
time =   0.247938 | value = 1.0e-37    | sum = -5.49684e-34
time =   0.368848 | value = 1.0e-38    | sum = -5.49553e-35
time =   0.369819 | value = 1.0e-39    | sum = -5.49426e-36
time =   0.368434 | value = 1.0e-40    | sum = -5.49607e-37
time =   0.368747 | value = 1.0e-41    | sum = -5.49801e-38
time =   0.369033 | value = 1.0e-42    | sum = -5.50173e-39
time =   0.370241 | value = 9.9e-44    | sum = -5.47762e-40
time =   0.370065 | value = 9.8e-45    | sum = -4.97559e-41
time =   0.370310 | value = 1.4e-45    | sum = -1.44152e-41

This problem has been discussed on Google Groups. A possible solution is to instruct the compiler to flush subnormal values to zero, but there doesn’t seem to be a portable way to achieve this. In the Nerva-Rowwise Python Library different solutions have been tried. One of them is to periodically flush weights in the subnormal range to zero using the --clip command line option of the mlp tool. The problem of subnormal numbers is also discussed in [2].
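
The idea of flushing subnormal weights to zero can be sketched in NumPy. The function clip_subnormals is a hypothetical helper. It does not claim to reproduce the exact semantics of the --clip option, which may use a user-supplied threshold instead of the smallest normal number.

```python
import numpy as np


def clip_subnormals(W: np.ndarray) -> np.ndarray:
    """Return a copy of W in which subnormal values are replaced by zero."""
    tiny = np.finfo(W.dtype).tiny      # smallest positive normal number
    result = W.copy()
    result[np.abs(result) < tiny] = 0  # covers subnormals (and exact zeros)
    return result


# For float32 the smallest normal number is about 1.18e-38,
# so 1e-40 and -1e-42 are subnormal while 1e-30 is not.
W = np.array([1.0, 1e-30, 1e-40, -1e-42, 0.0], dtype=np.float32)
clipped = clip_subnormals(W)
print(clipped)
```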

7. Sparse neural networks

Sparse neural network layers are often simulated using binary masks, see [3]. This is caused by the lack of support for sparse tensors in popular neural network frameworks. Note that PyTorch is currently developing sparse tensors. The Nerva-Rowwise Python Library supports truly sparse layers, meaning that the weight matrices of sparse layers are stored in a sparse matrix format. Another example of truly sparse layers is given by [4].

7.1. Sparse matrices

Since we are dealing with a programming context, we say that the support of a sparse matrix refers to the set of positions (or indices) in the matrix that are explicitly stored. Elements inside the support can have a non-zero value. Elements outside the support have the value zero by definition.

Sparse matrices in the Nerva-Rowwise Python Library are stored in CSR (compressed sparse row) format. This matrix representation defines the support using an array of column indices and an array of row offsets, and stores the corresponding values in a third array. CSR matrices are unstructured sparse matrices, meaning they have non-zero elements located at arbitrary positions. Alternatively, there are structured sparse matrices, for example butterfly matrices [5].
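
The CSR layout can be made concrete with a small pure Python sketch. The function names are hypothetical and the code is meant for illustration only. For simplicity the support is taken to be the set of non-zero entries, although in general a support may also contain explicitly stored zeros.

```python
def dense_to_csr(A):
    """Convert a dense matrix (list of rows) to CSR arrays."""
    values, col_indices, row_offsets = [], [], [0]
    for row in A:
        for j, a in enumerate(row):
            if a != 0:
                values.append(a)
                col_indices.append(j)
        # row_offsets[i + 1] marks where the entries of row i end
        row_offsets.append(len(values))
    return values, col_indices, row_offsets


def csr_matvec(values, col_indices, row_offsets, x):
    """Compute y = A x, visiting only the elements in the support."""
    y = []
    for i in range(len(row_offsets) - 1):
        total = 0.0
        for k in range(row_offsets[i], row_offsets[i + 1]):
            total += values[k] * x[col_indices[k]]
        y.append(total)
    return y


A = [[0, 2, 0, 0],
     [0, 0, 0, 0],
     [1, 0, 0, 3]]

values, col_indices, row_offsets = dense_to_csr(A)
print(values, col_indices, row_offsets)
y = csr_matvec(values, col_indices, row_offsets, [1, 1, 1, 1])
```

Row i occupies the positions row_offsets[i] up to, but not including, row_offsets[i + 1] of the other two arrays, so an empty row yields two equal consecutive offsets.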

7.2. Sparse evolutionary training

Sparse evolutionary training (SET) is a method for efficiently training sparse neural networks, see e.g. [6]. The idea behind this method is to start the training with a random sparse topology, and to periodically prune and regrow some of the weights.

7.3. Sparse initialization

In SET, the sparsity is not divided evenly over the sparse layers. Instead, small layers are assigned a higher density than larger ones. In [6] formula (3), Erdős–Rényi graph topology is suggested to calculate the densities of the sparse layers given a desired overall density of the sparse layers combined. In the Nerva-Rowwise Python Library this is implemented in the function compute_sparse_layer_densities, see layer_algorithms.h. The original Python implementation can be found here, along with several other sparse initialization strategies. In the tool mlp.py the option --overall-density is used for assigning Erdős–Rényi densities to the sparse layers. See [mlp_output] for an example of this. The overall density of 0.05 is converted into densities [0.042382877, 0.06357384, 1.0] for the individual layers.
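
The computation can be sketched in Python as follows. This is an assumption-laden illustration, not the code of compute_sparse_layer_densities. Each layer with an m × n weight matrix gets a density proportional to (m + n) / (m · n), layers whose density would exceed 1 are made dense, and the scale factor is chosen such that the total number of weights matches the overall density taken over all layers. Under these assumptions the sketch reproduces the densities quoted above.

```python
def erdos_renyi_densities(shapes, overall_density):
    """Densities per layer for weight matrices with the given (rows, cols) shapes."""
    sizes = [m * n for m, n in shapes]
    target = overall_density * sum(sizes)   # desired total number of weights
    dense = set()                           # layers that are forced to density 1
    raw = {}
    while True:
        sparse = [i for i in range(len(shapes)) if i not in dense]
        if not sparse:
            break
        # Weights left for the sparse layers after filling the dense ones.
        budget = target - sum(sizes[i] for i in dense)
        epsilon = budget / sum(sum(shapes[i]) for i in sparse)
        raw = {i: epsilon * sum(shapes[i]) / sizes[i] for i in sparse}
        saturated = [i for i in sparse if raw[i] > 1.0]
        if not saturated:
            break
        dense.update(saturated)
    return [1.0 if i in dense else raw[i] for i in range(len(shapes))]


# Layer sizes 3072;1024;1024;10 give weight matrices of these shapes.
shapes = [(1024, 3072), (1024, 1024), (10, 1024)]
densities = erdos_renyi_densities(shapes, 0.05)
print(densities)
```

In this example the small output layer would receive a density above 1 in the first pass, so it is made dense and the remaining budget is redistributed over the two larger layers.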

7.4. Pruning weights

Pruning weights is about removing parameters from a neural network, see also the Wikipedia article Pruning (artificial neural network). In our context removing parameters is about removing elements from the support of a sparse weight matrix. The effect of this is that the values corresponding to these elements are zeroed.

7.4.1. Threshold pruning

In threshold pruning, all weights $w_{ij}$ with $|w_{ij}| \leq t$ for a given threshold $t$ are pruned from a weight matrix $W$.

7.4.2. Magnitude based pruning

Magnitude based pruning is a special case of threshold pruning. In magnitude based pruning, the threshold $t$ is computed such that for a given fraction $\zeta$ of the weights we have $|w_{ij}| \leq t$. To ensure that the desired fraction of weights is removed, our implementation takes into account that there can be multiple weights with $|w_{ij}| = t$.
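
The difference between the two strategies, and the role of ties, can be illustrated with a NumPy sketch that operates on the values stored in the support. The function names are hypothetical, and the stable sort is just one possible way to break ties, not necessarily the one used in the library.

```python
import numpy as np


def threshold_prune(values, t):
    """Boolean mask of the entries to prune: all weights with |w| <= t."""
    return np.abs(values) <= t


def magnitude_prune(values, zeta):
    """Mask that prunes exactly int(zeta * n) entries of smallest magnitude."""
    count = int(zeta * len(values))
    # A stable sort breaks ties between equal magnitudes by position.
    order = np.argsort(np.abs(values), kind="stable")
    mask = np.zeros(len(values), dtype=bool)
    mask[order[:count]] = True
    return mask


# Four of the eight weights have magnitude 0.1.
values = np.array([0.5, -0.1, 0.1, 0.3, -0.1, 0.9, -0.7, 0.1])

threshold_mask = threshold_prune(values, 0.1)
magnitude_mask = magnitude_prune(values, 0.25)
print(threshold_mask.sum(), magnitude_mask.sum())
```

Threshold pruning with $t = 0.1$ removes all four tied weights, whereas magnitude based pruning with $\zeta = 0.25$ removes only two of them.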

7.4.3. SET based pruning

In SET based pruning, magnitude pruning is applied to positive weights and negative weights separately. So a fraction $\zeta$ of the positive weights and a fraction $\zeta$ of the negative weights are pruned.
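
A minimal NumPy sketch of this strategy is given below. The function set_prune is hypothetical. It selects, for each sign separately, the weights closest to zero. Entries in the support with value exactly zero are neither positive nor negative and are left untouched in this sketch.

```python
import numpy as np


def set_prune(values, zeta):
    """Mask that prunes a fraction zeta of the positive and of the negative weights."""
    mask = np.zeros(len(values), dtype=bool)
    for sign in (1.0, -1.0):
        idx = np.flatnonzero(sign * values > 0)     # weights with this sign
        count = int(zeta * len(idx))
        order = np.argsort(np.abs(values[idx]), kind="stable")
        mask[idx[order[:count]]] = True             # smallest magnitudes first
    return mask


values = np.array([0.5, 0.1, 0.3, 0.9, -0.2, -0.8, -0.05, -0.6])
mask = set_prune(values, 0.5)
print(values[mask])
```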

7.5. Growing weights

Growing weights is about adding parameters to a neural network. In our context adding parameters is about adding elements to the support of a sparse weight matrix.

7.5.1. Random growing

In random growing, a given number of elements is chosen randomly from the positions outside the support of a weight matrix. These new elements are then added to the support. Since the new elements need to be initialized, a weight initializer needs to be chosen to generate values for them.

A specific implementation of random growing for matrices in CSR format has been developed, that uses reservoir sampling to determine the new elements that are added to the support.
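
Reservoir sampling selects a fixed number of items uniformly at random from a stream in a single pass, without knowing the length of the stream in advance. The sketch below applies it to the positions outside the support. It is a simplified illustration with hypothetical names. It scans all positions of the matrix and does not reproduce the CSR-specific implementation of the library.

```python
import random


def sample_new_positions(rows, cols, support, count, rng):
    """Choose `count` positions outside `support` uniformly at random."""
    reservoir = []
    seen = 0  # number of candidate positions visited so far
    for i in range(rows):
        for j in range(cols):
            if (i, j) in support:
                continue
            seen += 1
            if len(reservoir) < count:
                reservoir.append((i, j))
            else:
                # Keep the new candidate with probability count / seen.
                k = rng.randrange(seen)
                if k < count:
                    reservoir[k] = (i, j)
    return reservoir


support = {(0, 0), (1, 2), (2, 1)}
new_positions = sample_new_positions(3, 4, support, 4, random.Random(123))
print(new_positions)
```

The selected positions would then be added to the support and initialized with the chosen weight initializer.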

7.6. Classes for pruning and growing

In the Nerva-Rowwise Python Library, the classes PruneFunction and GrowFunction are used to represent generic pruning and growing strategies:

class PruneFunction(object):
    """
    Interface for pruning the weights of a sparse layer
    """
    def __call__(self, layer: Sparse):
        raise NotImplementedError

class GrowFunction(object):
    """
    Interface for growing weights of a sparse layer
    """
    def __call__(self, layer: Sparse, count: int):
        raise NotImplementedError

In the command line tool mlp.py the user can select specific implementations of these prune and grow functions. They are called at the start of each epoch of training via an attribute regrow_function that applies pruning and growing to the sparse layers of an MLP. See also the [on_start_epoch] event.
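
How a prune function and a grow function can be combined into a regrow step is sketched below. ToyLayer, PruneSmallest, GrowRandom and regrow are hypothetical stand-ins, not classes of the nerva module. The sketch assumes that the prune function returns the number of removed weights, and it initializes new weights with zero instead of using a weight initializer.

```python
import random


class ToyLayer:
    """Hypothetical sparse layer: the support is stored as {(i, j): value}."""

    def __init__(self, rows, cols, weights):
        self.rows, self.cols = rows, cols
        self.weights = dict(weights)


class PruneSmallest:
    """Removes a fraction zeta of the weights with the smallest magnitude."""

    def __init__(self, zeta):
        self.zeta = zeta

    def __call__(self, layer):
        count = int(self.zeta * len(layer.weights))
        by_magnitude = sorted(layer.weights, key=lambda p: abs(layer.weights[p]))
        for position in by_magnitude[:count]:
            del layer.weights[position]
        return count  # assumption: report how many weights were removed


class GrowRandom:
    """Adds `count` randomly chosen positions to the support."""

    def __init__(self, seed):
        self.rng = random.Random(seed)

    def __call__(self, layer, count):
        free = [(i, j) for i in range(layer.rows) for j in range(layer.cols)
                if (i, j) not in layer.weights]
        for position in self.rng.sample(free, count):
            layer.weights[position] = 0.0  # simplification: no weight initializer


def regrow(layer, prune, grow):
    """Prune weights and grow the same number back."""
    grow(layer, prune(layer))


layer = ToyLayer(3, 3, {(0, 0): 0.9, (0, 1): -0.05, (1, 1): 0.4, (2, 2): 0.01})
regrow(layer, PruneSmallest(0.5), GrowRandom(seed=1))
print(sorted(layer.weights))
```

The number of weights in the support is the same before and after the regrow step, while the two weights with the largest magnitude are kept.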

7.7. Experiments with sparse training

In [7] we report on some of our experiments with sparse neural networks.

An example of a dynamic sparse training experiment is

python ../tools/mlp.py \
    --layers="ReLU;ReLU;Linear" \
    --layer-sizes="3072;1024;1024;10" \
    --layer-weights=XavierNormal \
    --optimizers="Nesterov(0.9)" \
    --loss=SoftmaxCrossEntropy \
    --learning-rate=0.01 \
    --epochs=100 \
    --batch-size=100 \
    --threads=12 \
    --overall-density=0.05 \
    --prune="Magnitude(0.2)" \
    --grow=Random \
    --dataset=$dataset \
    --seed=123

At the start of every epoch 20% of the weights are pruned, and the same number of weights is added back at different locations. The output may look like this:

Loading dataset from file ../../data/cifar10-flattened.npz
=== Nerva python model ===
MultilayerPerceptron(
  Sparse(output_size=1024, density=0.042382812500000006, activation=ReLU(), optimizer=Nesterov(0.9), weight_initializer=Xavier),
  Sparse(output_size=1024, density=0.06357421875000001, activation=ReLU(), optimizer=Nesterov(0.9), weight_initializer=Xavier),
  Dense(output_size=10, activation=NoActivation(), optimizer=Nesterov(0.9), weight_initializer=Xavier, dropout=0.0)
)
loss = SoftmaxCrossEntropyLoss()
scheduler = 0.01
layer densities: 133325/3145728 (4.238%), 66662/1048576 (6.357%), 10240/10240 (100%)


=== Training Nerva model ===
epoch   0  lr: 0.01000000  loss: 2.30248605  train accuracy: 0.10576000  test accuracy: 0.10570000  time: 0.00000000s
epoch   1  lr: 0.01000000  loss: 2.24581714  train accuracy: 0.17630000  test accuracy: 0.18050000  time: 3.86765830s
regrowing 26665/133325 weights
regrowing 13332/66662 weights
epoch   2  lr: 0.01000000  loss: 2.02945376  train accuracy: 0.25362000  test accuracy: 0.25470000  time: 3.72007074s
regrowing 26665/133325 weights
regrowing 13332/66662 weights
epoch   3  lr: 0.01000000  loss: 1.92740701  train accuracy: 0.29394000  test accuracy: 0.29380000  time: 4.29503079s
regrowing 26665/133325 weights
regrowing 13332/66661 weights
epoch   4  lr: 0.01000000  loss: 1.86549039  train accuracy: 0.32460000  test accuracy: 0.32540000  time: 4.57273872s
regrowing 26665/133324 weights
regrowing 13332/66661 weights
epoch   5  lr: 0.01000000  loss: 1.80027306  train accuracy: 0.35640000  test accuracy: 0.35430000  time: 4.47374441s
regrowing 26665/133323 weights
regrowing 13332/66660 weights

8. Extending the library

The Nerva-Rowwise Python Library can be extended in several obvious ways, such as adding new layers, activation functions, loss functions, learning rate schedulers, and pruning or growing functions. However, the implementation of those extensions must be done in C++, as documented in the section nerva-cpp.html#extending of the C++ manual. After adding these components to C++, they can be integrated in the nerva Python module.

8.1. Adding a loss function

As an example, we will explain how the loss function SoftmaxCrossEntropyLoss is added to the nerva Python module.

The first step is to define a C++ class softmax_cross_entropy_loss in the header file loss_functions.h.
The next step is to add the class softmax_cross_entropy_loss to the Python bindings in the file python-bindings.cpp:

  py::class_<softmax_cross_entropy_loss, loss_function, std::shared_ptr<softmax_cross_entropy_loss>>(m, "softmax_cross_entropy_loss")
    .def(py::init<>(), py::return_value_policy::copy)
    ;

The third step is to define a Python class SoftmaxCrossEntropyLoss in the file loss-functions.py:

class SoftmaxCrossEntropyLoss(nervalibrowwise.softmax_cross_entropy_loss):
    def __str__(self):
        return 'SoftmaxCrossEntropyLoss()'

Note that the Python class derives from the C++ class. In the same file, an entry to the function parse_loss_function should be added.

The last step is to reinstall the nerva Python module via pip, see Installing the Python Module.
