🤖

6-Hidden-Layer Artificial Neural Network from scratch with NumPy only (no for-loops for back-propagation)

Hello folks! This Notion page showcases how to build a 6-hidden-layer artificial neural network for a classification task using a fully vectorised approach.

Table of contents

1) Dataset description

2) Network architecture and derivation of the forward pass and back-propagation, using matrix algebra and term-by-term differentiation.

Network architecture, with the matrices for each layer.

Forward pass

Back-propagation derivation

Instead of the usual matrix differentiation that Andrew Ng and several others use, I will showcase the derivation of the back-propagation formulas using term-by-term partial differentiation, in accordance with the multivariate version of the chain rule.

💡
IMO, using the multivariable chain rule is more beginner-friendly and, albeit more verbose, is more easily understood by the average undergrad student.
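For reference, here is the loss being minimised, written out explicitly (the mean categorical cross-entropy, as discussed below):

$L = -\frac{1}{M}\sum_{m=1}^{M}\sum_{n=1}^{N} y_{nm}\,\log(\hat{y}_{nm})$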

where M is the number of samples and N is the number of outputs. $y_{nm}$ in this case is the true value, in contrast to the prediction, which is $\hat{y}_{nm}$. In other words, the loss for this ANN is the mean categorical cross-entropy.

Derivative with respect to the predicted output before the softmax activation, i.e. $Z_7^{o,m}$, where o denotes the output neuron and m denotes the index of the input sample.

where m denotes the m-th training sample.
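Working the differentiation through the softmax and the cross-entropy terms gives the familiar result (this is exactly what the dz7 line in the implementation below computes):

$\frac{\partial L}{\partial z_7^{o,m}} = \frac{1}{M}\left(\hat{y}_{om} - y_{om}\right)$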

Derivation of dW and dB, i.e. the gradients of the weights and biases for the 7th layer

🌠
The derivative of $z_7$ w.r.t. the bias term is simply 1, hence the gradient for the bias from the 7th layer to the output node is just the sum of the errors over all samples. The o and the m denote the o-th output node and the m-th sample, respectively.
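In symbols (this matches the dBias_H6_to_output = np.sum(dz7, axis=1, keepdims=True) line in the code below):

$\frac{\partial L}{\partial b_7^{o}} = \sum_{m=1}^{M} \frac{\partial L}{\partial z_7^{o,m}}$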

For dW, however, it is a bit more convoluted. Below I will try to clearly explain the derivation through term-by-term partial differentiation.
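The end result, written compactly in matrix form (and matching the dWeights_H6_to_output = np.dot(dz7, a6.T) line in the code below):

$\frac{\partial L}{\partial W_7} = \frac{\partial L}{\partial Z_7}\,A_6^{T}$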

Continuing backwards through the network, this leaves us with $dZ_6$, and so on.

💡
Do note that $A_0 = X$, i.e. the input data with dimensions (N × M), as this is the input to the first layer.

Derivation of $dZ_6$ (and, analogously, of every $dZ_i$ with $1 \le i < 6$)

In summary:
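Here is a compact summary of all the gradients, in the same notation as above (⊙ denotes element-wise multiplication and ReLU′ the ReLU derivative); these are exactly the quantities computed by the vectorised backprop method below:

$dZ_7 = \frac{1}{M}\,(\hat{Y} - Y)$

$dZ_i = \left(W_{i+1}^{T}\, dZ_{i+1}\right) \odot \mathrm{ReLU}'(Z_i), \qquad i = 6, \dots, 1$

$dW_i = dZ_i\, A_{i-1}^{T}, \qquad db_i = \sum_{m=1}^{M} dZ_i^{\,:,\,m}, \qquad \text{with } A_0 = X$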

3) Implementation of the ANN in Python, and training on the scikit-learn digits dataset (a small, MNIST-like set of 8×8 images).

Neural network class written in Python.

import numpy as np
from scipy.special import expit as sigmoid
from scipy.special import softmax as sm
import pandas as pd
import math
from sklearn.metrics import mean_squared_error
from sklearn.metrics import accuracy_score
from sklearn.metrics import log_loss
from math import sqrt
from math import log

class NeuralNet:
    def __init__(self, num_features, num_hidden1,num_hidden2,num_hidden3,num_hidden4,num_hidden5,num_hidden6 ,alpha, max_epochs, num_output, _EPSILON):
        super().__init__()
        self.num_features=num_features  # number of input nodes (features)
        self.num_hidden1=num_hidden1  # number of hidden nodes for 1st hidden layer
        self.num_hidden2=num_hidden2  # number of hidden nodes for 2nd hidden layer
        self.num_hidden3=num_hidden3  # number of hidden nodes for 3rd hidden layer
        self.num_hidden4=num_hidden4  # number of hidden nodes for 4th hidden layer
        self.num_hidden5=num_hidden5  # number of hidden nodes for 5th hidden layer
        self.num_hidden6=num_hidden6  # number of hidden nodes for 6th hidden layer
        self.alpha=alpha  # learning rate
        self.max_epochs=max_epochs # maximum number of epochs
        self.num_output=num_output # number of output nodes
        self._EPSILON=_EPSILON # epsilon passed to log_loss to clip predicted probabilities
        self.loss = [] #list to store losses per 100 epochs 
        self.trainingaccur=[] # list to store training accuracy per 100 epochs 
        self.devaccur=[] # list to store dev accuracy per 100 epochs
        self.Weights_Input_to_H1=np.random.randn(self.num_hidden1, self.num_features)*(0.1) # weights: small random init (biases below start at zero)
        self.Bias_Input_to_H1=np.zeros([self.num_hidden1,1])
        self.Weights_H1_to_H2=np.random.randn(self.num_hidden2, self.num_hidden1)*(0.1)
        self.Bias_H1_to_H2=np.zeros([self.num_hidden2,1])
        self.Weights_H2_to_H3=np.random.randn(self.num_hidden3, self.num_hidden2)*(0.1)
        self.Bias_H2_to_H3=np.zeros([self.num_hidden3,1])
        self.Weights_H3_to_H4=np.random.randn(self.num_hidden4, self.num_hidden3)*(0.1)
        self.Bias_H3_to_H4=np.zeros([self.num_hidden4,1])
        self.Weights_H4_to_H5=np.random.randn(self.num_hidden5, self.num_hidden4)*(0.1)
        self.Bias_H4_to_H5=np.zeros([self.num_hidden5,1])
        self.Weights_H5_to_H6=np.random.randn(self.num_hidden6, self.num_hidden5)*(0.1)
        self.Bias_H5_to_H6=np.zeros([self.num_hidden6,1])
        self.Weights_H6_to_output=np.random.randn(self.num_output, self.num_hidden6)*(0.1)
        self.Bias_H6_to_output=np.zeros([self.num_output,1])
        self.dWeights_Input_to_H1=np.zeros([self.num_hidden1, self.num_features]) # gradient buffers, filled in by backprop
        self.dBias_Input_to_H1=np.zeros([self.num_hidden1,1])
        self.dWeights_H1_to_H2=np.zeros([self.num_hidden2, self.num_hidden1])
        self.dBias_H1_to_H2=np.zeros([self.num_hidden2,1])
        self.dWeights_H2_to_H3=np.zeros([self.num_hidden3, self.num_hidden2])
        self.dBias_H2_to_H3=np.zeros([self.num_hidden3,1])
        self.dWeights_H3_to_H4=np.zeros([self.num_hidden4, self.num_hidden3])
        self.dBias_H3_to_H4=np.zeros([self.num_hidden4,1])
        self.dWeights_H4_to_H5=np.zeros([self.num_hidden5, self.num_hidden4])
        self.dBias_H4_to_H5=np.zeros([self.num_hidden5,1])
        self.dWeights_H5_to_H6=np.zeros([self.num_hidden6, self.num_hidden5])
        self.dBias_H5_to_H6=np.zeros([self.num_hidden6,1])
        self.dWeights_H6_to_output=np.zeros([self.num_output, self.num_hidden6])
        self.dBias_H6_to_output=np.zeros([self.num_output,1])
        
        

        
    
    def relU(self,X):
        return np.maximum(X, 0) # element-wise ReLU

    def deriv(self,X):
        return np.where(X<=0,0,1) # ReLU derivative: 0 where X<=0, 1 otherwise
        
        


    
    def softmax(self,x):
        # column-wise softmax; subtract the per-column max first for numerical stability
        e=np.exp(x-np.max(x,axis=0,keepdims=True))
        return e/np.sum(e,axis=0,keepdims=True)

    
    

    
        
    # Forward pass: propagate X through the six hidden ReLU layers and the softmax output layer
    def forward(self, X):
        self.z1=np.dot((self.Weights_Input_to_H1),(X))+self.Bias_Input_to_H1
        self.a1=self.relU(self.z1)
        self.z2=np.dot((self.Weights_H1_to_H2),(self.a1))+self.Bias_H1_to_H2
        self.a2=self.relU(self.z2)
        self.z3=np.dot((self.Weights_H2_to_H3),(self.a2))+self.Bias_H2_to_H3
        self.a3=self.relU(self.z3)
        self.z4=np.dot((self.Weights_H3_to_H4),(self.a3))+self.Bias_H3_to_H4
        self.a4=self.relU(self.z4)
        self.z5=np.dot((self.Weights_H4_to_H5),(self.a4))+self.Bias_H4_to_H5
        self.a5=self.relU(self.z5)
        self.z6=np.dot((self.Weights_H5_to_H6),(self.a5))+self.Bias_H5_to_H6
        self.a6=self.relU(self.z6)
        self.z7=np.dot((self.Weights_H6_to_output),(self.a6))+self.Bias_H6_to_output
        self.a7=self.softmax((self.z7))
        return self.a7
        
        
        
    
    # Back-propagation: compute all gradients in fully vectorised form (no for-loops), using np.dot and np.sum
    def backprop(self, X, t):
            
        self.dWeights_Input_to_H1=np.zeros([self.num_hidden1, self.num_features])
        self.dBias_Input_to_H1=np.zeros([self.num_hidden1,1])
        self.dWeights_H1_to_H2=np.zeros([self.num_hidden2, self.num_hidden1])
        self.dBias_H1_to_H2=np.zeros([self.num_hidden2,1])
        self.dWeights_H2_to_H3=np.zeros([self.num_hidden3, self.num_hidden2])
        self.dBias_H2_to_H3=np.zeros([self.num_hidden3,1])
        self.dWeights_H3_to_H4=np.zeros([self.num_hidden4, self.num_hidden3])
        self.dBias_H3_to_H4=np.zeros([self.num_hidden4,1])
        self.dWeights_H4_to_H5=np.zeros([self.num_hidden5, self.num_hidden4])
        self.dBias_H4_to_H5=np.zeros([self.num_hidden5,1])
        self.dWeights_H5_to_H6=np.zeros([self.num_hidden6, self.num_hidden5])
        self.dBias_H5_to_H6=np.zeros([self.num_hidden6,1])
        self.dWeights_H6_to_output=np.zeros([self.num_output, self.num_hidden6])
        self.dBias_H6_to_output=np.zeros([self.num_output,1])
        self.dz7=(self.a7.reshape(self.num_output,-1)-t.reshape(self.num_output,-1))/((X.shape[1])) # dZ7 = (Yhat - Y)/M
        self.dBias_H6_to_output=np.sum(self.dz7,axis=1,keepdims=True)
        self.dWeights_H6_to_output=np.dot((self.dz7),self.a6.T)
        self.dz6=(np.dot(self.Weights_H6_to_output.T,self.dz7)) * (self.deriv(self.z6))
        self.dBias_H5_to_H6=np.sum(self.dz6,axis=1,keepdims=True)
        self.dWeights_H5_to_H6=np.dot((self.dz6),(self.a5.T))
        self.dz5=(np.dot(self.Weights_H5_to_H6.T,self.dz6)) * (self.deriv(self.z5))
        self.dBias_H4_to_H5=np.sum(self.dz5,axis=1,keepdims=True)
        self.dWeights_H4_to_H5=np.dot((self.dz5),(self.a4.T))
        self.dz4=(np.dot(self.Weights_H4_to_H5.T,self.dz5)) * (self.deriv(self.z4))
        self.dBias_H3_to_H4=np.sum(self.dz4,axis=1,keepdims=True)
        self.dWeights_H3_to_H4=np.dot((self.dz4),(self.a3.T))
        self.dz3=(np.dot(self.Weights_H3_to_H4.T,self.dz4)) * (self.deriv(self.z3))
        self.dBias_H2_to_H3=np.sum(self.dz3,axis=1,keepdims=True)
        self.dWeights_H2_to_H3=np.dot((self.dz3),(self.a2.T))
        self.dz2=(np.dot(self.Weights_H2_to_H3.T,self.dz3)) * (self.deriv(self.z2))
        self.dBias_H1_to_H2=np.sum(self.dz2,axis=1,keepdims=True)
        self.dWeights_H1_to_H2=np.dot((self.dz2),(self.a1.T))
        self.dz1=(np.dot(self.Weights_H1_to_H2.T,self.dz2)) * (self.deriv(self.z1))
        self.dBias_Input_to_H1=np.sum(self.dz1,axis=1,keepdims=True)
        self.dWeights_Input_to_H1=np.dot((self.dz1),X.T)
    
    # Fit: run forward and backprop on each epoch, then update every weight and bias via gradient descent
    def fit(self, x_train_data, y_train_data,x_dev_data,y_dev_data):
       
        
        
        for step in range(self.max_epochs):
            self.forward(x_train_data)
            self.backprop(x_train_data, y_train_data)
            self.CCloss=log_loss(np.transpose(y_train_data),np.transpose(self.a7),eps=self._EPSILON,normalize=True)
            self.trainingaccuracy=accuracy_score(np.argmax(y_train_data,axis=0),np.argmax(self.forward(x_train_data),axis=0))
            self.devaccuracy=accuracy_score(np.argmax(y_dev_data,axis=0),np.argmax(self.forward(x_dev_data),axis=0))
            self.Bias_H6_to_output=self.Bias_H6_to_output-((self.alpha)*(self.dBias_H6_to_output))
            self.Weights_H6_to_output=self.Weights_H6_to_output-((self.alpha)*(self.dWeights_H6_to_output))
            self.Bias_H5_to_H6=self.Bias_H5_to_H6-((self.alpha)*(self.dBias_H5_to_H6))
            self.Weights_H5_to_H6=self.Weights_H5_to_H6-((self.alpha)*(self.dWeights_H5_to_H6))
            self.Bias_H4_to_H5=self.Bias_H4_to_H5-((self.alpha)*(self.dBias_H4_to_H5))
            self.Weights_H4_to_H5=self.Weights_H4_to_H5-((self.alpha)*(self.dWeights_H4_to_H5))
            self.Bias_H3_to_H4=self.Bias_H3_to_H4-((self.alpha)*(self.dBias_H3_to_H4))
            self.Weights_H3_to_H4=self.Weights_H3_to_H4-((self.alpha)*(self.dWeights_H3_to_H4))
            self.Bias_H2_to_H3=self.Bias_H2_to_H3-((self.alpha)*(self.dBias_H2_to_H3))
            self.Weights_H2_to_H3=self.Weights_H2_to_H3-((self.alpha)*(self.dWeights_H2_to_H3))
            self.Bias_H1_to_H2=self.Bias_H1_to_H2-((self.alpha)*(self.dBias_H1_to_H2))
            self.Weights_H1_to_H2=self.Weights_H1_to_H2-((self.alpha)*(self.dWeights_H1_to_H2))
            self.Bias_Input_to_H1=self.Bias_Input_to_H1-((self.alpha)*(self.dBias_Input_to_H1))
            self.Weights_Input_to_H1=self.Weights_Input_to_H1-((self.alpha)*(self.dWeights_Input_to_H1))


            if step % 100 == 0:
                print(f'step: {step},  loss: {self.CCloss:.6f}')
                print(accuracy_score(np.argmax(y_train_data,axis=0),np.argmax(self.forward(x_train_data),axis=0)))
                print(accuracy_score(np.argmax(y_dev_data,axis=0),np.argmax(self.forward(x_dev_data),axis=0)))
                self.loss.append(self.CCloss)
                self.trainingaccur.append(self.trainingaccuracy)
                self.devaccur.append(self.devaccuracy)
                
              
            
            
    def predict(self,X,y=None):
        # Returns (accuracy, predicted labels); accuracy is None when no true labels are supplied
        self.forward(X)
        if(self.num_output>1):
            y_hat=np.argmax(self.a7, axis=0)
        else:
            y_hat=np.where(self.a7>0.5,1,0)
        temp=accuracy_score(y,y_hat) if y is not None else None
        return temp,y_hat

First, we load the dataset via the sklearn library.

import sklearn
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_digits


# load the scikit-learn digits dataset (8x8 handwritten-digit images, a small MNIST-like dataset)
mnist = load_digits()
X=mnist.data
Y=mnist.target
X.shape,Y.shape
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.1)
X_train, X_dev, Y_train, Y_dev = train_test_split(X_train, Y_train, test_size=0.11)

After splitting, each training sample is scaled to the [0, 1] range using the MinMaxScaler from the sklearn library, and the labels are one-hot encoded:

from sklearn.preprocessing import MinMaxScaler
scaler=MinMaxScaler()
for a in range(X_train.shape[0]):
  X_train[a,:]=scaler.fit_transform(X_train[a,:].reshape(-1, 1)).flatten()
Y_train=np.array(pd.get_dummies(np.array(Y_train)))
Y_dev=np.array(pd.get_dummies(np.array(Y_dev)))

The Y_train vector then becomes a one-hot matrix, with a 1 marking the digit that each image corresponds to:
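A tiny illustration of this encoding (the labels here are made up, purely for demonstration):

labels = np.array([3, 0, 1])                            # hypothetical example labels
one_hot = np.array(pd.get_dummies(labels)).astype(int)  # columns ordered by unique label value
print(one_hot)
# [[0 0 1]
#  [1 0 0]
#  [0 1 0]]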

The data matrices are then transposed, so that each column corresponds to one sample:

X_train=np.transpose(X_train)
X_dev=np.transpose(X_dev)
Y_train=np.transpose(Y_train)
Y_dev=np.transpose(Y_dev)
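After this transposition each column holds one sample; a quick sanity check of the shapes (assuming the digits data and the splits above, X has 64 pixel features and Y has 10 classes):

print(X_train.shape, Y_train.shape)  # (64, n_train_samples) and (10, n_train_samples)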

The class is then initialised:

numHidden1 = 500 # number of hidden nodes in hidden layer 1
numHidden2 = 500 # number of hidden nodes in hidden layer 2
numHidden3 = 500 # number of hidden nodes in hidden layer 3
numHidden4 = 500 # number of hidden nodes in hidden layer 4
numHidden5 = 500 # number of hidden nodes in hidden layer 5
numHidden6 = 500 # number of hidden nodes in hidden layer 6
num_features = X_train.shape[0] # number of input features (64 pixels per image)
numOutput = Y_train.shape[0] # number of output classes (10 digits)
max_epoches = 1000000 # maximum number of training epochs
alpha = 0.01 # learning rate
epsilon=0.00000000001 # epsilon passed to log_loss
NN = NeuralNet(num_features, numHidden1,numHidden2,numHidden3,numHidden4,numHidden5,numHidden6, alpha, max_epoches, numOutput,epsilon)

The dataset is then fitted:

NN.fit(X_train,Y_train,X_dev,Y_dev)

Weights and biases are updated using the gradient descent update rule, i.e.:

$W_{updated} = W_{old} - \alpha \, \frac{\partial L}{\partial W_{old}}$

$Bias_{updated} = Bias_{old} - \alpha \, \frac{\partial L}{\partial Bias_{old}}$
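As a side note, the fourteen explicit update lines inside fit could also be written generically. Below is a minimal sketch of the same update rule applied to an arbitrary list of (parameter, gradient) pairs; the helper is illustrative only and not part of the NeuralNet class above:

def gradient_descent_step(params, grads, alpha):
    # params and grads are parallel lists of numpy arrays; returns the updated parameters
    return [p - alpha * g for p, g in zip(params, grads)]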

The resulting test accuracy is then:

accuracy_score((Y_test),np.argmax(NN.forward(X_test.T),axis=0))

4) Investigation of the training loss, and of the training and dev accuracies.

Let us first investigate the training loss:

import matplotlib.pyplot as plt
x_loss=range(0,len(NN.loss)*100,100)


line1=plt.plot(x_loss,NN.loss,linestyle='-',label='training loss')  

plt.title('Training loss')
plt.xlabel('Epochs')
plt.ylabel('Training loss')
legend = plt.legend(loc='best', shadow=True)

Next, let us investigate the training and dev accuracies:

x_training_accur=range(0,len(NN.trainingaccur)*100,100)
x_devaccur=range(0,len(NN.devaccur)*100,100)

line1=plt.plot(x_training_accur,NN.trainingaccur,linestyle='-',label='training accuracy') 
line2=plt.plot(x_devaccur,NN.devaccur,linestyle='-',label='dev accuracy')
                              
plt.title('Training and Dev Accuracies')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
legend = plt.legend(loc='best', shadow=True)

5) Conclusion

🌠
We managed to build an 8-layer NN that classifies images with rather high accuracy. It might be better to increase the number of hidden nodes to 1000, or to add more layers, to make the NN more capable of learning patterns from the given training data. Overfitting can be remedied by early stopping, i.e. halting training once the dev loss is no longer decreasing (or the dev accuracy is no longer increasing).
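As a rough illustration of that early-stopping idea, here is a minimal sketch of a check that could be run on the dev-accuracy history that fit already records every 100 epochs; the patience parameter and the function itself are illustrative assumptions, not part of the class:

def should_stop(dev_accuracy_history, patience=5):
    # Stop once dev accuracy has not improved over the last `patience` recorded checkpoints
    if len(dev_accuracy_history) <= patience:
        return False
    best_before = max(dev_accuracy_history[:-patience])
    return max(dev_accuracy_history[-patience:]) <= best_before

# e.g. checked every 100 epochs inside the training loop:
# if should_stop(NN.devaccur): stop training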