Introduction

This document describes the implementation of the Nerva-Rowwise Python Library. The library provides a Python module named nerva, which is built on Python bindings to the Nerva-Rowwise C++ Library. The matrix type used internally in the nerva module is torch.Tensor, to ensure easy integration with PyTorch.

Installation

The Python bindings of the Nerva-Rowwise Python Library can be installed via pip, which runs a setup.py script. The script has several dependencies that must be resolved using environment variables.

Dependencies

The MKL dependency can be resolved by setting the MKL_ROOT environment variable, or by setting the ONEAPI_ROOT environment variable.

To resolve the Eigen, FMT and pybind11 dependencies, the environment variables EIGEN_INCLUDE_DIR, FMT_INCLUDE_DIR and PYBIND11_INCLUDE_DIR can be set.

An alternative solution is to use CMake to resolve these three dependencies; see also the CMake install section in the C++ documentation. The cmake command causes the three libraries to be downloaded automatically into the _deps subdirectory of the CMake build directory. After that it is sufficient to set the environment variable CMAKE_DEPS_DIR.
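Before running pip, the rules above can be checked with a small script. The following is only a sketch: the function name unresolved_dependencies is hypothetical and not part of the library, and it checks only whether the variables are set, not whether they point to valid directories.

```python
import os

def unresolved_dependencies(env):
    """Return the dependencies that the given environment does not resolve."""
    missing = []
    # MKL: either MKL_ROOT or ONEAPI_ROOT is sufficient.
    if not (env.get("MKL_ROOT") or env.get("ONEAPI_ROOT")):
        missing.append("MKL")
    # Eigen, FMT and pybind11: CMAKE_DEPS_DIR covers all three,
    # otherwise each one needs its own include directory.
    if not env.get("CMAKE_DEPS_DIR"):
        for name, variable in [("Eigen", "EIGEN_INCLUDE_DIR"),
                               ("FMT", "FMT_INCLUDE_DIR"),
                               ("pybind11", "PYBIND11_INCLUDE_DIR")]:
            if not env.get(variable):
                missing.append(name)
    return missing

# Report what is still missing in the current shell environment.
print(unresolved_dependencies(os.environ))
```

An empty list means that every dependency has at least one of its variables set.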

The nerva Python module can then be installed using

cd python
pip install .

Command line tools

The tool mlp.py can be used to do training experiments with multilayer perceptrons.

The tool mlp.py

An example invocation of the mlp.py tool is

python ../python/tools/mlp.py \
    --layers="ReLU;ReLU;Linear" \
    --layer-sizes="3072;1024;1024;10" \
    --layer-weights=Xavier \
    --optimizers="Nesterov(0.9)" \
    --loss=SoftmaxCrossEntropy \
    --learning-rate=0.01 \
    --epochs=100 \
    --batch-size=100 \
    --threads=12 \
    --overall-density=0.05 \
    --cifar10=../data \
    --seed=123

This will train a CIFAR-10 model using an MLP consisting of three layers with activation functions ReLU, ReLU and no activation. Note that the tool automatically downloads the CIFAR-10 dataset into the folder ../data if it does not yet exist there.

The output may look like this:

=== Nerva python model ===
Sequential(
  Sparse(output_size=1024, density=0.042382812500000006, activation=ReLU(), optimizer=Nesterov(0.9), weight_initializer=Xavier),
  Sparse(output_size=1024, density=0.06357421875000001, activation=ReLU(), optimizer=Nesterov(0.9), weight_initializer=Xavier),
  Dense(output_size=10, activation=NoActivation(), optimizer=Nesterov(0.9), weight_initializer=Xavier, dropout=0.0)
)
loss = SoftmaxCrossEntropyLoss()
scheduler = ConstantScheduler(lr=0.009999999776482582)
layer densities: 133325/3145728 (4.238%), 66662/1048576 (6.357%), 10240/10240 (100%)


=== Training Nerva model ===
epoch   0  lr: 0.01000000  loss: 2.30246344  train accuracy: 0.10724000  test accuracy: 0.11390000  time: 0.00000000s
epoch   1  lr: 0.01000000  loss: 1.89570341  train accuracy: 0.32142000  test accuracy: 0.32030000  time: 4.15395873s
epoch   2  lr: 0.01000000  loss: 1.66956488  train accuracy: 0.40332000  test accuracy: 0.40220000  time: 3.60670412s
epoch   3  lr: 0.01000000  loss: 1.53549386  train accuracy: 0.45616000  test accuracy: 0.44940000  time: 3.24853144s
epoch   4  lr: 0.01000000  loss: 1.43913857  train accuracy: 0.49054000  test accuracy: 0.47920000  time: 3.29059404s
epoch   5  lr: 0.01000000  loss: 1.36875251  train accuracy: 0.51380000  test accuracy: 0.49070000  time: 3.83244992s
epoch   6  lr: 0.01000000  loss: 1.29761993  train accuracy: 0.54106000  test accuracy: 0.50710000  time: 3.59350869s
epoch   7  lr: 0.01000000  loss: 1.23931273  train accuracy: 0.56170000  test accuracy: 0.51690000  time: 3.96624650s
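The "layer densities" line in this output can be checked against the --overall-density=0.05 option of the invocation. The sketch below only redoes the arithmetic on the numbers printed above; it does not show how Nerva distributes the density over the layers.

```python
# (non-zero weights, total weights) per layer, copied from the "layer densities" line
layers = [(133325, 3145728), (66662, 1048576), (10240, 10240)]

nonzero = sum(n for n, _ in layers)
total = sum(t for _, t in layers)
overall_density = nonzero / total

for n, t in layers:
    print(f"{n}/{t} = {n / t:.5f}")
print(f"overall: {nonzero}/{total} = {overall_density:.5f}")
```

The two large layers are sparse and the small output layer is dense, while the fraction of non-zero weights over all layers together is the requested 5% up to rounding.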

mlp.py command line options

This section gives an overview of the command line interface of the mlp.py tool.

Parameter lists

Some command line options take a list of items as input, for example a list of layers. These items must be separated by semicolons, e.g. --layers="ReLU;ReLU;Linear".

Named parameters

Some of the items take parameters. For this we use a function call syntax with named parameters, e.g. AllReLU(alpha=0.3). If there is only one parameter, the name may be omitted: AllReLU(0.3). If the parameters have default values, they may be omitted. For example, SReLU or SReLU() is equivalent to SReLU(al=0,tl=0,ar=0,tr=1).
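The list and parameter syntax described above can be parsed along the following lines. This is a minimal sketch, not the parser used by mlp.py: the functions split_list and parse_item are hypothetical, all parameter values are assumed to be numbers, and default values are not filled in.

```python
import re

def split_list(text):
    """Split a semicolon-separated option value into its items."""
    return [item.strip() for item in text.split(";") if item.strip()]

def parse_item(text, parameter_names=()):
    """Parse 'Name', 'Name()' or 'Name(a=1,b=2)' into (name, {parameter: value})."""
    match = re.fullmatch(r"\s*(\w+)\s*(?:\((.*)\))?\s*", text)
    if match is None:
        raise ValueError(f"could not parse '{text}'")
    name, arguments = match.group(1), match.group(2)
    parameters = {}
    if arguments and arguments.strip():
        for position, argument in enumerate(arguments.split(",")):
            key, separator, value = argument.partition("=")
            if separator:   # named parameter, e.g. alpha=0.3
                parameters[key.strip()] = float(value)
            else:           # unnamed parameter, e.g. 0.3
                parameters[parameter_names[position]] = float(key)
    return name, parameters

for item in split_list("ReLU;AllReLU(0.3);Linear"):
    print(parse_item(item, parameter_names=["alpha"]))
```

The caller supplies parameter_names so that an unnamed value such as 0.3 can be matched to its parameter.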

General options
  • -?, -h, --help Display help information.

  • --debug, -d Show debug output. This prints batches, weight matrices, bias vectors, gradients etc.

Random generator options
  • --seed <value> A seed value for the random generator.

Layer configuration options
  • --layers <value> A semicolon-separated list of layers. For example, --layers="ReLU;AllReLU(0.3);Linear" specifies a neural network with three layers that have a ReLU, an AllReLU and no activation function, respectively. The following layers are supported:

Specification                 Description
Linear                        Linear layer without activation
ReLU                          Linear layer with ReLU activation
Sigmoid                       Linear layer with sigmoid activation
Softmax                       Linear layer with softmax activation
LogSoftmax                    Linear layer with log-softmax activation
HyperbolicTangent             Linear layer with hyperbolic tangent activation
AllReLU(<alpha>)              Linear layer with All ReLU activation
SReLU(<al>,<tl>,<ar>,<tr>)    Linear layer with SReLU activation. The default values of the parameters are al=0, tl=0, ar=0, tr=1. For these values SReLU coincides with ReLU.
TReLU(<epsilon>)              Linear layer with trimmed ReLU activation
BatchNormalization            Batch normalization layer

  • --layer-sizes <value> A semicolon-separated list of the sizes of linear layers of the multilayer perceptron. For example, --layer-sizes="3072;1024;512;10" specifies the sizes of three linear layers. The first one has 3072 inputs and 1024 outputs, the second one has 1024 inputs and 512 outputs, and the third one has 512 inputs and 10 outputs.

  • --densities <value> A comma-separated list of linear layer densities. By default, all linear layers are dense (i.e. have density 1.0). If only one value is specified, it will be used for all linear layers.

  • --dropouts <value> A comma-separated list of dropout rates of linear layers. By default, all linear layers have no dropout (i.e. dropout rate 0.0).

  • --overall-density <value> The overall density of the linear layers. This value should be in the interval \([0,1]\), and it specifies the fraction of the total number of weights that is non-zero. The overall density is not distributed evenly over the layers. Instead, small layers will be assigned a higher density than large layers.

  • --layer-weights <value> The generator that is used for initializing the weights of the linear layers. The following weight generators are supported:

Specification       Description
Xavier              Xavier weights
XavierNormalized    Normalized Xavier weights
He                  Kaiming He weights
Uniform             Uniform weights
Zero                All weights are zero (N.B. This usually doesn’t work)
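The relation between a --layer-sizes value and the shapes of the linear layers, as described under that option above, can be sketched as follows. The helper linear_layer_shapes is hypothetical and is not part of the nerva module.

```python
def linear_layer_shapes(layer_sizes):
    """Turn '3072;1024;512;10' into a list of (inputs, outputs) pairs."""
    sizes = [int(s) for s in layer_sizes.split(";")]
    # N sizes describe N - 1 linear layers: consecutive sizes form one layer.
    return list(zip(sizes[:-1], sizes[1:]))

shapes = linear_layer_shapes("3072;1024;512;10")
print(shapes)                            # [(3072, 1024), (1024, 512), (512, 10)]
print([i * o for i, o in shapes])        # number of weights per layer
```

The weight counts are the denominators that appear in the "layer densities" line printed by the tool, for instance 3072 * 1024 = 3145728 for the first layer.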

Training configuration options
  • --epochs <value> The number of epochs of the training (default: 100).

  • --batch-size <value> The batch size of the training.

  • --no-shuffle Do not shuffle the dataset during training.

  • --no-statistics Do not display intermediate statistics during training.

  • --optimizers <value> A semicolon-separated list of optimizers used for linear and batch normalization layers. The following optimizers are supported:

Specification      Description
GradientDescent    Gradient descent optimization
Momentum(mu)       Momentum optimization with momentum parameter mu
Nesterov(mu)       Nesterov optimization with momentum parameter mu

  • --learning-rate <value> A semicolon-separated list of learning rate schedulers of linear and batch normalization layers. If only one learning rate scheduler is specified, it is applied to all layers. The following learning rate schedulers are supported:

Specification                            Description
Constant(lr)                             Constant learning rate lr
TimeBased(lr, decay)                     Adaptive learning rate with decay
StepBased(lr, drop_rate, change_rate)    Step based learning rate where the learning rate is regularly dropped to a lower value
MultistepLR(lr, milestones, gamma)       Step based learning rate, where milestones contains the epoch numbers in which the learning rate is dropped.
Exponential(lr, change_rate)             Exponentially decreasing learning rate

  • --loss <value> The loss function used for training the multilayer perceptron. The following loss functions are supported:

Specification            Description
SquaredError             Squared error loss.
CrossEntropy             Cross entropy loss (N.B. prone to numerical problems!)
LogisticCrossEntropy     Logistic cross entropy loss.
SoftmaxCrossEntropy      Softmax cross entropy loss. Matches CrossEntropy of PyTorch. Suitable for classification experiments.
NegativeLogLikelihood    Negative log likelihood loss.

  • --load-weights <value> Load weights and biases from a dictionary in NumPy .npz format. The weight matrices should be stored with keys W1,W2,…​ and the bias vectors with keys b1,b2,…​. See also numpy.lib.format.

  • --save-weights <value> Save weights and biases to a dictionary in NumPy .npz format. The weight matrices are stored with keys W1,W2,…​ and the bias vectors with keys b1,b2,…​. See also numpy.lib.format.
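The remark under --loss that CrossEntropy is prone to numerical problems, while SoftmaxCrossEntropy is the recommended choice for classification, can be illustrated for a single example. The functions below are textbook definitions written in plain Python under that assumption; they are not the Nerva implementations, whose precise definitions are in the library specifications document.

```python
import math

def softmax(y):
    m = max(y)                                # shift for stability
    e = [math.exp(v - m) for v in y]
    s = sum(e)
    return [v / s for v in e]

def cross_entropy(p, t):
    """-sum_i t_i * log(p_i), applied to probabilities p."""
    return -sum(ti * math.log(pi) for pi, ti in zip(p, t) if ti != 0)

def softmax_cross_entropy(y, t):
    """Cross entropy of softmax(y), computed directly from y via log-sum-exp."""
    m = max(y)
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in y))
    return -sum(ti * (v - log_sum_exp) for v, ti in zip(y, t))

y, t = [2.0, 1.0, 0.1], [1, 0, 0]
print(cross_entropy(softmax(y), t), softmax_cross_entropy(y, t))  # same value

y, t = [1000.0, 0.0], [0, 1]
print(softmax_cross_entropy(y, t))           # 1000.0
try:
    cross_entropy(softmax(y), t)             # softmax underflows to probability 0.0
except ValueError as error:
    print("cross_entropy failed:", error)
```

In the second example the probability of the target class underflows to zero, so taking its logarithm fails, whereas the fused computation never forms that probability.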

Pruning and growing options
  • --prune <strategy> The strategy used for pruning sparse weight matrices. The following strategies are supported:

Specification                 Description
Magnitude(<drop_fraction>)    Magnitude based pruning. A fraction of the weights with the smallest absolute value is pruned.
SET(<drop_fraction>)          SET pruning. Positive and negative weights are treated separately. Both a fraction of the positive and a fraction of the negative weights is pruned.
Threshold(<threshold>)        Weights with absolute value below the given threshold are pruned.

  • --grow <strategy> The strategy used for growing in sparse weight matrices. The following strategies are supported:

Specification    Description
Random           Weights are added at random positions (outside the support of the sparse matrix).

  • --grow-weights <value> The weight generation function used for growing weights. See --layer-weights for supported values. The default value is Xavier.
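The three pruning strategies listed under --prune can be sketched on a plain list of stored weights. This is an illustration only, not the C++ implementation: the function names are hypothetical, the number of pruned weights is assumed to be rounded down, and for SET it is assumed that the weights closest to zero are pruned within each sign group.

```python
def smallest(indices, weights, fraction):
    """Indices of the given fraction of the selected weights closest to zero."""
    count = int(fraction * len(indices))
    return set(sorted(indices, key=lambda i: abs(weights[i]))[:count])

def prune_magnitude(weights, drop_fraction):
    drop = smallest(range(len(weights)), weights, drop_fraction)
    return [w for i, w in enumerate(weights) if i not in drop]

def prune_set(weights, drop_fraction):
    # Positive and negative weights are handled separately.
    positive = [i for i, w in enumerate(weights) if w > 0]
    negative = [i for i, w in enumerate(weights) if w < 0]
    drop = smallest(positive, weights, drop_fraction) | smallest(negative, weights, drop_fraction)
    return [w for i, w in enumerate(weights) if i not in drop]

def prune_threshold(weights, threshold):
    return [w for w in weights if abs(w) >= threshold]

weights = [0.5, -0.1, 0.2, -0.9, 0.05, -0.3]
print(prune_magnitude(weights, 0.5))   # [0.5, -0.9, -0.3]
print(prune_set(weights, 0.5))         # [0.5, 0.2, -0.9, -0.3]
print(prune_threshold(weights, 0.25))  # [0.5, -0.9, -0.3]
```

With three positive and three negative weights and a drop fraction of 0.5, SET removes one weight of each sign here, whereas magnitude pruning removes the three weights closest to zero regardless of sign.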

Computation options
  • --computation <value> The computation mode that is used for backpropagation. This is used for performance measurements. The following computation modes are available:

Specification    Description
eigen            All computations are done using the Eigen library. Note that by setting the flag EIGEN_USE_MKL_ALL Eigen will attempt to use MKL library calls.
mkl              Some computations are implemented using MKL functions.
blas             Some computations are implemented using BLAS functions.
sycl             Some computations are implemented using SYCL functions.

  • --clip <value> A threshold value used to set small elements of weight matrices to zero.

  • --threads <value> The number of threads used by the MKL and OMP libraries.

  • --gradient-step <value> If this value is set, gradient checks are performed with the given step size. This is very slow, and should only be used for debugging.

Dataset options
  • --cifar10 <directory> Specify the directory where the binary version of the CIFAR-10 dataset is stored. This is a directory with subdirectory cifar-10-batches-bin for the C++ version or cifar-10-batches-py for the Python version of the dataset.

  • --mnist <directory> Specify the directory where the MNIST dataset is stored. It should be stored in a file named mnist.npz, that can be downloaded here.

  • --load-data <value> Load the dataset from a file in NumPy .npz format. See I/O for information about the .npz format.

  • --save-data <value> Save the dataset to a file in NumPy .npz format. See I/O for information about the .npz format.

  • --normalize Normalize the dataset.

  • --preprocessed <directory> A directory containing datasets named epoch0.npz, epoch1.npz, …​ See I/O for information about the .npz format. This can for example be used to precompute augmented datasets. A script generate_cifar10_augmented_datasets.py is available for creating augmented CIFAR-10 datasets.

Miscellaneous options
  • --info Print detailed information about the multilayer perceptron.

  • --timer Print timer messages. The following values are supported:

Value       Description
disabled    No timing information is displayed
brief       At the end, a report with accumulated timing measurements will be displayed
full        In addition, individual timing measurements will be displayed

  • --precision <value> The precision used for printing matrix elements.

  • --edgeitems <value> The edgeitems used for printing matrices. This sets the number of border rows and columns that are printed.

Overview of the code

This section gives an overview of the Python code in the Nerva-Rowwise Python Library, and some explanations about the code.

Number type

The Nerva-Rowwise Python Library uses 32-bit floats as its number type. The C++ library also supports 64-bit floats.
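The use of 32-bit floats explains why the example output earlier in this document shows ConstantScheduler(lr=0.009999999776482582) for a requested learning rate of 0.01. The sketch below reproduces this value with the Python standard library; the helper to_float32 is hypothetical.

```python
import struct

def to_float32(x):
    """Round a Python float (64-bit) to the nearest 32-bit float."""
    return struct.unpack("f", struct.pack("f", x))[0]

print(to_float32(0.01))  # 0.009999999776482582
```

The value 0.01 has no exact binary representation, and the nearest 32-bit float differs from the nearest 64-bit float in the ninth significant digit.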

Module contents

The most important files in the nerva module are given in the table below.

File                           Description
multilayer_perceptron.py       A multilayer perceptron class.
layers.py                      Neural network layers.
activation_functions.py        Activation functions.
loss_functions.py              Loss functions.
weights.py                     Weight initialization functions.
optimizers.py                  Optimizer functions, for updating neural network parameters using their gradients.
learning_rate_schedulers.py    Learning rate schedulers, for updating the learning rate during training.
training.py                    A stochastic gradient descent algorithm.
prune.py                       Algorithms for pruning sparse weight matrices. This is used for dynamic sparse training.
grow.py                        Algorithms for (re-)growing sparse weights. This is used for dynamic sparse training.

Classes

Class MultilayerPerceptron

A multilayer perceptron (MLP) is modeled using the class MultilayerPerceptron. It contains a list of layers, and has member functions feedforward, backpropagate and optimize that can be used for training the neural network. Constructing an MLP can be done as follows:

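from typing import List

# Dense, ReLU, NoActivation, GradientDescent, Xavier and MultilayerPerceptron
# are assumed to have been imported from the nerva module.

def construct_mlp1(sizes: List[int], batch_size: int):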

    layer1 = Dense(input_size=sizes[0],
                   output_size=sizes[1],
                   activation=ReLU(),
                   optimizer=GradientDescent(),
                   weight_initializer=Xavier())

    layer2 = Dense(input_size=sizes[1],
                   output_size=sizes[2],
                   activation=ReLU(),
                   optimizer=GradientDescent(),
                   weight_initializer=Xavier())

    layer3 = Dense(input_size=sizes[2],
                   output_size=sizes[3],
                   activation=NoActivation(),
                   optimizer=GradientDescent(),
                   weight_initializer=Xavier())

    M = MultilayerPerceptron()
    M.layers = [layer1, layer2, layer3]
    M.compile(batch_size)  # Initialize the C++ data structures

    return M

This creates an MLP with three linear layers. The parameter sizes contains the input and output sizes of the three layers. The weights are initialized using Xavier.

Another way to construct MLPs is provided by the function make_layers, which offers a string-based interface. An example is given in the code below:

def construct_mlp2(linear_layer_sizes: List[int], batch_size: int):

    layer_specifications = ["ReLU", "ReLU", "Linear"]
    linear_layer_densities = [1.0, 1.0, 1.0]
    linear_layer_dropouts = [0.0, 0.0, 0.0]
    linear_layer_weights = ["Xavier", "Xavier", "Xavier"]
    layer_optimizers = ["GradientDescent", "GradientDescent", "GradientDescent"]
    layers = make_layers(layer_specifications,
                         linear_layer_sizes,
                         linear_layer_densities,
                         linear_layer_dropouts,
                         linear_layer_weights,
                         layer_optimizers)
    M = MultilayerPerceptron()
    M.layers = layers
    M.compile(batch_size)  # Initialize the C++ data structures

    return M

Note that optimizers should be specified for linear layers, but also for batch normalization layers.

A MultilayerPerceptron needs to be compiled before it can be used. This is done by calling M.compile(batch_size). As a result of this call, a C++ object is created that contains the actual model. A reference to this object is stored in the attribute _model.

Class Layer

The class Layer is the base class of all neural network layers. There are three different types of layers:

Layer                 Description
Dense                 A dense linear layer.
Sparse                A sparse linear layer.
BatchNormalization    A batch normalization layer.

The constructor of a Dense layer takes the input size, output size, activation function, optimizer and weight initializer of the layer, as in construct_mlp1 above. Its definition is the dense_constructor snippet in python/nerva/layers.py.

This only sets a number of attributes of the layer. Before the layer is used, its compile function must be called (see the dense_compile snippet in python/nerva/layers.py).

As a result of this call a C++ object is created that contains the actual layer. It is stored in the attribute _layer. The normal workflow is to call the compile method of the multilayer perceptron, which will also compile the layers, as illustrated in the functions construct_mlp1 and construct_mlp2 above.

A Sparse layer has an additional parameter density in the interval \([0,1]\), that determines the fraction of weights that are in the support. Sparse layers do not support dropout.

The constructor of a BatchNormalization layer is given by the batchnormalization_constructor snippet in python/nerva/layers.py.

The output size may be omitted, since by definition it is the same as the input size.

Class LossFunction

The class LossFunction is the base class of all loss functions. There are five loss functions available:

  • SquaredErrorLoss

  • CrossEntropyLoss

  • LogisticCrossEntropyLoss

  • NegativeLogLikelihoodLoss

  • SoftmaxCrossEntropyLoss

See the Nerva library specifications document for precise definitions of these loss functions.

Activation functions

The class ActivationFunction is the base class of all activation functions. The following activation functions are available:

  • ReLU

  • Sigmoid

  • Softmax

  • LogSoftmax

  • TReLU

  • LeakyReLU

  • AllReLU

  • SReLU

  • HyperbolicTangent

See the Nerva library specifications document for precise definitions of these activation functions.

Accessing C++ data structures

To a limited extent, the C++ data structures can be accessed in Python. In the file loss_test.py it is demonstrated how to modify the weight matrices and bias vectors of dense layers via the _layer attribute:

        M.layers[0]._layer.W = W1
        M.layers[0]._layer.b = b1
        M.layers[1]._layer.W = W2
        M.layers[1]._layer.b = b2
        M.layers[2]._layer.W = W3
        M.layers[2]._layer.b = b3

The weight matrices of sparse layers are not yet fully exposed to Python.

Training a neural network

The class StochasticGradientDescentAlgorithm can be used to train a neural network. It takes as input a multilayer perceptron, a dataset, a loss function, a learning rate scheduler, and a struct containing options like the number of epochs. The main loop looks like this:

for epoch in range(self.options.epochs):
    self.on_start_epoch(epoch)

    for batch_index, (X, T) in enumerate(self.train_loader):
        self.on_start_batch(batch_index)
        T = to_one_hot(T, num_classes)
        Y = M.feedforward(X)
        DY = self.loss.gradient(Y, T) / options.batch_size
        M.backpropagate(Y, DY)
        M.optimize(learning_rate)
        self.on_end_batch(batch_index)

    self.on_end_epoch(epoch)

self.on_end_training()

We follow the PyTorch convention that the targets used for classification are provided as a one-dimensional vector of integers. A call to to_one_hot transforms this vector into a one-hot encoded boolean matrix with the same dimensions as the output Y.
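The effect of the one-hot transformation can be sketched in plain Python. The function to_one_hot_rows below is a hypothetical stand-in that works on lists; the to_one_hot function used by the library operates on tensors.

```python
def to_one_hot_rows(targets, num_classes):
    """Turn a list of class indices into a boolean matrix with one row per example."""
    return [[k == t for k in range(num_classes)] for t in targets]

print(to_one_hot_rows([2, 0], 3))
# [[False, False, True], [True, False, False]]
```

Each row has exactly one True entry, in the column of the target class, so the matrix has one row per example and one column per class, like the output Y.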

In every epoch, the dataset is divided into a number of batches. This is handled by the DataLoader, which creates batches X of a given batch size, with corresponding targets T (i.e. the expected outputs). Each batch goes through the three steps of stochastic gradient descent:

  1. feedforward: Given an input batch X and the neural network parameters Θ, compute the output Y.

  2. backpropagation: Given output Y corresponding to input X and targets T, compute the gradient DY of the loss function with respect to Y. Then from Y and DY, compute the gradient DΘ of the loss function with respect to the parameters Θ.

  3. optimization: Given the gradient DΘ, update the parameters Θ.
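The three steps can be traced on a toy problem that needs no library at all. The sketch below fits the model y = w * x + b with a squared error loss, so the parameters Θ are the pair (w, b). The strided batches are a stand-in for a DataLoader, and all names are chosen for this illustration only.

```python
# Noise-free data generated with w = 2 and b = 1.
xs = [-1 + 2 * i / 19 for i in range(20)]
data = [(x, 2 * x + 1) for x in xs]
num_batches = 4
batches = [data[k::num_batches] for k in range(num_batches)]  # 4 batches of 5 examples

w, b = 0.0, 0.0
learning_rate = 0.1

for epoch in range(200):
    for batch in batches:
        X = [x for x, _ in batch]
        T = [t for _, t in batch]
        # 1. feedforward: compute the outputs for this batch
        Y = [w * x + b for x in X]
        # 2. backpropagation: DY is the loss gradient with respect to Y,
        #    divided by the batch size; Dw and Db follow from X and DY
        DY = [2 * (y - t) / len(batch) for y, t in zip(Y, T)]
        Dw = sum(dy * x for dy, x in zip(DY, X))
        Db = sum(DY)
        # 3. optimization: plain gradient descent step
        w -= learning_rate * Dw
        b -= learning_rate * Db

print(round(w, 6), round(b, 6))
```

After training, w and b are close to the values 2 and 1 that were used to generate the data.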

Note that the algorithm uses a number of event functions:

Event                Description
on_start_training    Is called at the start of the training
on_end_training      Is called at the end of the training
on_start_epoch       Is called at the start of each epoch
on_end_epoch         Is called at the end of each epoch
on_start_batch       Is called at the start of each batch
on_end_batch         Is called at the end of each batch

The user can respond to these events by deriving from the class StochasticGradientDescentAlgorithm. Typical use cases for these event functions are the following:

  • Update the learning rate.

  • Renew dropout masks.

  • Prune and grow sparse weights.

Such operations are typically done after each epoch or after a given number of batches.
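The pattern of overriding an event function can be sketched without the library. The class TrainingLoop below is a minimal stand-in that only mimics the event structure of StochasticGradientDescentAlgorithm; both classes and the halving rule are hypothetical.

```python
class TrainingLoop:
    """Stand-in base class: calls the event functions, does no actual training."""

    def __init__(self, epochs, batches_per_epoch):
        self.epochs = epochs
        self.batches_per_epoch = batches_per_epoch

    # Event functions do nothing by default.
    def on_start_training(self): pass
    def on_end_training(self): pass
    def on_start_epoch(self, epoch): pass
    def on_end_epoch(self, epoch): pass
    def on_start_batch(self, batch_index): pass
    def on_end_batch(self, batch_index): pass

    def run(self):
        self.on_start_training()
        for epoch in range(self.epochs):
            self.on_start_epoch(epoch)
            for batch_index in range(self.batches_per_epoch):
                self.on_start_batch(batch_index)
                # feedforward, backpropagate and optimize would go here
                self.on_end_batch(batch_index)
            self.on_end_epoch(epoch)
        self.on_end_training()


class HalvingLearningRate(TrainingLoop):
    """Halve the learning rate at the start of every epoch after the first."""

    def __init__(self, epochs, batches_per_epoch, learning_rate):
        super().__init__(epochs, batches_per_epoch)
        self.learning_rate = learning_rate
        self.history = []

    def on_start_epoch(self, epoch):
        if epoch > 0:
            self.learning_rate /= 2
        self.history.append(self.learning_rate)


loop = HalvingLearningRate(epochs=3, batches_per_epoch=2, learning_rate=0.1)
loop.run()
print(loop.history)  # [0.1, 0.05, 0.025]
```

Only the event of interest is overridden; the loop itself stays in the base class.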

An example can be found in the tool mlp:

    def on_start_epoch(self, epoch):
        if epoch > 0 and self.reload_data_directory:
            self.reload_data(epoch)

        if self.lr_scheduler:
            self.learning_rate = self.lr_scheduler(epoch)

        if epoch > 0:
            self.M.renew_dropout_masks()

        if epoch > 0 and self.regrow:
            self.regrow(self.M)

        if epoch > 0 and self.clip > 0:
            self.M._model.clip(self.clip)

Up to five actions take place at the start of an epoch. All of them except the learning rate update are skipped in the first epoch (epoch 0):

  • A preprocessed dataset is loaded from disk, which is done to avoid the expensive computation of augmented data at every epoch.

  • The learning rate is updated if a learning rate scheduler is set.

  • Dropout masks are renewed.

  • Sparse weight matrices are pruned and regrown if a regrow function is specified.

  • Small weights in the subnormal range are clipped to zero if the clip option is set.

I/O

The default storage format used in the Nerva libraries is the NumPy NPZ format, see numpy.lib.format. The reason for choosing this format is portability between C++ and Python implementations. A file in .npz format can be used to store a dictionary of arrays.

The mlp.py tool has options --load-weights and --save-weights for loading and saving the weights and bias vectors of an MLP, and options --load-data and --save-data for loading and saving a dataset in NPZ format. The keys in the dictionary for the weight matrices and bias vectors of linear layers are W1, W2, …​ and b1, b2, …​. The keys for the training data plus targets are Xtrain and Ttrain, while for the test data plus targets we use Xtest and Ttest.
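A file with the key layout described above can be written and read back with NumPy. This is a sketch for a toy MLP with layer sizes 4, 3 and 2: the file name is arbitrary, and the matrix shapes (outputs by inputs) are an assumption made for this illustration, so the layout expected by Nerva should be checked before exchanging real weights.

```python
import os
import tempfile
import numpy as np

# Keys follow the convention W1, W2, ... and b1, b2, ...
parameters = {
    "W1": np.zeros((3, 4), dtype=np.float32), "b1": np.zeros(3, dtype=np.float32),
    "W2": np.ones((2, 3), dtype=np.float32), "b2": np.ones(2, dtype=np.float32),
}

path = os.path.join(tempfile.mkdtemp(), "weights.npz")
np.savez(path, **parameters)          # one array per key

with np.load(path) as data:           # behaves like a read-only dictionary
    keys = sorted(data.files)
    W2 = data["W2"]

print(keys)                           # ['W1', 'W2', 'b1', 'b2']
print(W2.shape, W2.dtype)
```

A dataset file would be built the same way, with the keys Xtrain, Ttrain, Xtest and Ttest.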

Extending the library

The Nerva-Rowwise Python Library can be extended in several obvious ways, such as adding new layers, activation functions, loss functions, learning rate schedulers, and pruning or growing functions. However, the implementation of those extensions must be done in C++, as documented in the section Extending the library of the C++ manual. After adding these components to C++, they can be integrated in the nerva Python module.

Adding a loss function

As an example, we will explain how the loss function SoftmaxCrossEntropyLoss is added to the nerva Python module.

  • The first step is to define a C++ class softmax_cross_entropy_loss in the header file loss_functions.h.

  • The next step is to add the class softmax_cross_entropy_loss to the Python bindings in the file python-bindings.cpp:

  py::class_<softmax_cross_entropy_loss, loss_function, std::shared_ptr<softmax_cross_entropy_loss>>(m, "softmax_cross_entropy_loss")
    .def(py::init<>(), py::return_value_policy::copy)
    ;

  • The third step is to define a Python class SoftmaxCrossEntropyLoss in the file loss_functions.py (see the softmax_cross_entropy_loss snippet in python/nerva/loss_functions.py).

Note that the Python class derives from the C++ class. In the same file, an entry should be added to the function parse_loss_function.

  • The last step is to reinstall the nerva Python module via pip, as described in the Installation section.
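The entry in parse_loss_function mentioned in the third step connects a --loss specification to a loss class. A possible shape of such a function is sketched below. The body is an assumption, not the code in loss_functions.py, and the empty classes are stand-ins for the real ones, which derive from the C++ bindings.

```python
class LossFunction:
    """Stand-in for the LossFunction base class."""

class SquaredErrorLoss(LossFunction): pass
class CrossEntropyLoss(LossFunction): pass
class LogisticCrossEntropyLoss(LossFunction): pass
class NegativeLogLikelihoodLoss(LossFunction): pass
class SoftmaxCrossEntropyLoss(LossFunction): pass

def parse_loss_function(text: str) -> LossFunction:
    """Map a --loss specification to a loss function object."""
    constructors = {
        "SquaredError": SquaredErrorLoss,
        "CrossEntropy": CrossEntropyLoss,
        "LogisticCrossEntropy": LogisticCrossEntropyLoss,
        "NegativeLogLikelihood": NegativeLogLikelihoodLoss,
        "SoftmaxCrossEntropy": SoftmaxCrossEntropyLoss,  # the new entry
    }
    if text not in constructors:
        raise RuntimeError(f"unsupported loss function '{text}'")
    return constructors[text]()

print(type(parse_loss_function("SoftmaxCrossEntropy")).__name__)  # SoftmaxCrossEntropyLoss
```

With a table like this, supporting a new loss function on the command line takes one additional line.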