Backpropagation Implementation: Neural Networks Learning From Theory To Action

We’ve previously focused on the math behind neural network learning and the proof of the backpropagation algorithm. Let’s face it: the mathematical background of the algorithm is complex. Implementing it might make the subject easier to grasp.


Now, it’s implementation time. We’ll transform the derived formulas into code. I prefer to implement the core algorithm in Java. This post also serves as a tutorial for the neural network project that I’ve already shared on my GitHub profile. You might want to play around with the code before reading this post.

A non-linear sine wave is chosen as the dataset. The same dataset was used in the time series post, so we’ll be able to compare the forecasts of the neural network and time series approaches. Basically, a point on the wave is predicted based on the previous known points.

A three-layered network consisting of input, hidden, and output layers is modeled as illustrated below. Firstly, the input nodes correspond to the previous 5 points plus an additional bias unit. Secondly, the hidden layer consists of 4 nodes plus an additional bias unit. The hidden layer size is based on the following rule of thumb: the number of hidden nodes should be 2/3 of the sum of the input layer and output layer sizes (Heaton, 2000, p. 159). I strongly recommend this rule when modeling non-deep neural networks. Also, the learning rate is set to 0.01 and the epoch (or training time) is set to 1M as learning parameters.

nn-model-for-sinus-wave
Neural Network Model for Sine Wave Forecasting
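
For reference, these modeling decisions can be captured as plain configuration values. This is just an illustrative sketch; the variable names below are not taken from the shared project.

int numberOfInputs = 5;     // previous 5 points feed the input layer
int numberOfOutputs = 1;    // a single forecasted point
//Heaton's rule of thumb: hidden nodes = 2/3 of (input size + output size)
int hiddenNodes = (2 * (numberOfInputs + numberOfOutputs)) / 3; // = 4
double learningRate = 0.01;
int epoch = 1000000;        // 1M gradient descent iterations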

Creating Network

Neural networks consist of two main elements: nodes and weights. Both elements are defined as classes. The Weight class stores the following information: a unique index, the weight value, and the indices of the nodes it connects from and to. The Node class stores a unique index, a boolean bias unit flag, the net output value, and the error reflected on this node as smallDelta. Network creation is implemented by the following block.

//node creation
List<Node> nodes = createNodes(numberOfInputs, hiddenNodes);

//weight creation
List<Weight> weights = createWeights(nodes, numberOfInputs, hiddenNodes);
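
Under these assumptions, the Node and Weight classes can be sketched roughly as below; the field names are illustrative and may differ from the ones in the GitHub project.

class Node {
   private int index;         //unique index of the node
   private boolean isBias;    //bias unit flag
   private double value;      //net output of the node
   private double smallDelta; //error reflected on this node
   //getters and setters such as getIndex(), getValue(), setSmallDelta() omitted
}

class Weight {
   private int index;         //unique index of the weight
   private double value;      //weight value (theta)
   private int fromIndex;     //index of the node this weight connects from
   private int toIndex;       //index of the node this weight connects to
   //getters and setters such as getFromIndex(), getToIndex(), setValue() omitted
}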

Input and Output Normalization

The sigmoid function will be used as the activation function in the network. This function produces outputs on the scale of [0, 1], and its input is only meaningful roughly between [-4, +4], because inputs outside this range produce almost the same outputs. That’s why inputs should be normalized to the scale of [-5, +5], whereas actual outputs should be normalized to the scale of [0, 1]. To normalize a set to any scale of [normalized_min, normalized_max], the following formula can be used.

double normalized_value = (normalized_max - normalized_min)
* ( (value - min_item_in_set) / (max_item_in_set - min_item_in_set) )
+ normalized_min;

The neural network must be fed normalized values. Afterwards, forecasts should be denormalized once the calculation is complete. Denormalized values can be calculated with the following formula.

double denormalized_value = ( (normalized_value - normalized_min)
/ (normalized_max - normalized_min) )
* (max_item_in_set - min_item_in_set)
+ min_item_in_set;
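
These two formulas can be wrapped in small helper methods. The following is a minimal sketch; the method names are illustrative and not part of the shared project.

//normalize a raw value into [normalizedMin, normalizedMax]
static double normalize(double value, double min, double max,
      double normalizedMin, double normalizedMax) {
   return (normalizedMax - normalizedMin) * ((value - min) / (max - min)) + normalizedMin;
}

//map a normalized value back to the original scale
static double denormalize(double normalizedValue, double min, double max,
      double normalizedMin, double normalizedMax) {
   return ((normalizedValue - normalizedMin) / (normalizedMax - normalizedMin))
      * (max - min) + min;
}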

Weight Initialization

Initial values for the weights are required to implement forward propagation. Importantly, they need to be initialized randomly; otherwise, backpropagation will not work because all nodes would be updated with the same value. Research shows that weights should be initialized to random values in [-epsilon, +epsilon]. The epsilon calculation and initialization can be performed with the following logic.

Random r = new Random();
double rangeMin = 0, rangeMax = 1;
double epsilon = Math.sqrt(6) / Math.sqrt(numberOfInputs + 1);
double rand = rangeMin + (rangeMax - rangeMin) * r.nextDouble();
double randomValue = rand * (2 * epsilon) - epsilon;
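
Assuming the r, epsilon, and range variables above, plus the Weight class sketched earlier, applying this logic to every weight in the network might look like the following.

//assign a random value in [-epsilon, +epsilon] to every weight
for (Weight w : weights) {
   double rand = rangeMin + (rangeMax - rangeMin) * r.nextDouble(); //uniform in [0, 1)
   w.setValue(rand * (2 * epsilon) - epsilon);                      //shift into [-epsilon, +epsilon]
}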

Backpropagation

We need to perform forward propagation first to calculate the network output (or forecast) and compare it with the actual value. This gives us the error on the output. Then, backpropagation is applied to decide how much of the calculated error should be reflected onto each weight. After that, stochastic gradient descent is run to update the weights. We put these procedures together as outlined below.
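
As a rough sketch (assuming the variables and helper methods referenced elsewhere in this post), one run of the training procedure might look like the following outline.

//outline of the training procedure (a sketch, not the exact project code)
for (int epochIndex = 0; epochIndex < epoch; epochIndex++) {

   for (int i = 0; i < historicalData.size(); i++) {

      //1. forward propagation: compute the network output for instance i
      nodes = applyForwardPropagation(historicalData.get(i), nodes, weights, bias, dump);

      //2. backpropagation: compute the error (small delta) on every node
      //   (see the node delta calculation block below)

      //3. stochastic gradient descent: update each weight with its gradient
      //   (see the weight update block below)
   }

   //4. calculate the average cost over all instances for this iteration
   //   (see the cost function block below)
}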

Firstly, we’ll calculate how much of the calculated error should be reflected onto each node. This is called node delta calculation. Then, we’ll reflect these deltas onto the weights to update them.

Node Delta Calculation

nodes = applyForwardPropagation(historicalData.get(i), nodes, weights, bias, dump);

Node outputNode = nodes.get(nodes.size()-1);

double actualValue = historicalData.get(i).getAttributes()
.get(numberOfInputsAttributes).getValue();
double predictValue = outputNode.getValue();

double smallDelta = actualValue - predictValue;

outputNode.setSmallDelta(smallDelta);

//output node delta is already calculated in the step above; that is why we start from nodes.size()-2
for(int j=nodes.size()-2;j>numberOfInputsAttributes;j--){

   //look for connections including from nodes.get(j)
   int targetIndex = nodes.get(j).getIndex();
   double sumOfSmallDelta = 0;

   for(int k=0;k<weights.size();k++){

      if(weights.get(k).getFromIndex() == targetIndex){

         double affectingTheta = weights.get(k).getValue();
         double affectingSmallDelta = 1;
         int targetSmallDeltaIndex = weights.get(k).getToIndex();

         for(int m=0;m<nodes.size();m++){
            if(nodes.get(m).getIndex() == targetSmallDeltaIndex){
               affectingSmallDelta = nodes.get(m).getSmallDelta();
               break;
            }
         }

         double newlySmallDelta = affectingTheta * affectingSmallDelta;
         sumOfSmallDelta = sumOfSmallDelta + newlySmallDelta;

      }

   }

   nodes.get(j).setSmallDelta(sumOfSmallDelta);

}

Now, we can use the calculated delta values to update the weights.

Weight Updates

//apply stochastic gradient descent to update weights
for(int j=0;j<weights.size();j++){

   double weightFromNodeValue = 0, weightToNodeDelta = 0, weightToNodeValue = 0;

   for(int k=0;k<nodes.size();k++){

      if(nodes.get(k).getIndex() == weights.get(j).getFromIndex()){
         weightFromNodeValue = nodes.get(k).getValue();
      }

      if(nodes.get(k).getIndex() == weights.get(j).getToIndex()){
         weightToNodeDelta = nodes.get(k).getSmallDelta();
         weightToNodeValue = nodes.get(k).getValue();
      }

   }

   double derivative = weightToNodeDelta
      * weightToNodeValue * (1 - weightToNodeValue)
      * weightFromNodeValue;

   weights.get(j).setValue(weights.get(j).getValue() + learningRate * derivative);

}

Cost Function

Each gradient descent iteration updates the weights, and the cost should decrease over the iterations. However, since we apply stochastic gradient descent, the cost might not decrease in every single iteration. The cost is calculated after each gradient descent iteration: the cost of every historical instance is computed with the new weights, and the average cost for the current iteration is taken, as illustrated below. We also dump the cost of every 10,000th iteration (see the snippet after the cost block).

double J = 0;

for(int i=0;i<historicalData.size();i++){

   nodes = applyForwardPropagation(historicalData.get(i), nodes, weights, bias, dump);

   double predict = nodes.get(nodes.size() - 1).getValue();
   double actual = historicalData.get(i).getAttributes()
      .get(historicalData.get(i).getAttributes().size()-1).getValue();

   double cost = (predict - actual)*(predict - actual);
   cost = cost / 2;

   J = J + cost;

}

J = J / historicalData.size();
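
For the cost dump mentioned above, a simple check inside the training loop is enough. Here, epochIndex is the illustrative loop counter from the earlier training outline.

//dump the average cost of every 10,000th gradient descent iteration
if (epochIndex % 10000 == 0) {
   System.out.println("iteration " + epochIndex + " -> average cost = " + J);
}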

The cost value decreases over the gradient descent iterations as expected. We would like to minimize the cost. The epoch value is defined as 1M; in other words, gradient descent is terminated at the 1Mth iteration. As the number of iterations grows towards infinity, the cost approaches zero and the forecasts approach the actual values.

cost-for-sinus-wave-forecast
Cost Change over Gradient Descent Iterations

Forecasts

Forecasts and actual values are almost the same, as illustrated below. This means non-linear time series can be modeled by neural networks. Moreover, beyond time series problems, many non-linear operations such as logic functions can be implemented by neural networks. Furthermore, neural networks seem to produce much more successful forecasts than exponential smoothing methods.

nn-forecasts-for-sinus-wave
Neural Network Forecasts and Actual Values

You might want to build and run the backpropagation algorithm in your local environment. The project, including the dataset, is already shared on my GitHub profile. However, the algorithm includes random weight initialization, so it will produce different outputs at every run. Still, the cost per iteration should decrease. You can check the stability of the algorithm by monitoring the cost values; the cost should progressively converge to zero. Also, you might change the dataset and run the algorithm on different problems. This way, you can observe how the approach adapts to different fields.

I hope this post helps make sense of neural network learning.
