Random Initialization in Neural Networks

Neural networks require several state-of-the-art techniques, such as the choice of activation function or the network design, to push their limits. The way the weights are initialized is another such technique.


Random initialization did not exist in the legacy perceptron, and adding hidden layers alone is not enough to generalize non-linear problems. Let's observe how initializing all weight values to zero fails for a multi-layer perceptron: it cannot learn even an XOR gate problem, even though it has a hidden layer with 4 nodes.



import numpy as np

def initialize_weights(rows, columns):
 weights = np.zeros((rows+1, columns)) #+1 refers to bias unit
 return weights
Skipping random weight initialization causes backpropagation to fail

As seen, the final weight values are the same within each layer. This is the reason for the failure: when every weight starts at zero, every hidden node computes the same output and receives the same gradient, so the nodes can never become different from each other.
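To see this symmetry problem concretely, here is a minimal sketch that trains a zero-initialized multi-layer perceptron on the XOR problem. The sigmoid activation, learning rate and training loop below are illustrative assumptions, not the exact code of this post.

import numpy as np

#toy xor data set: 4 samples, 2 features, 1 output
x = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([[0], [1], [1], [0]])

def sigmoid(z):
 return 1 / (1 + np.exp(-z))

def add_bias(a):
 return np.hstack([np.ones((a.shape[0], 1)), a])

#zero initialization: 2 inputs (+ bias) -> 4 hidden nodes -> 1 output (+ bias)
w1 = np.zeros((2 + 1, 4))
w2 = np.zeros((4 + 1, 1))

learning_rate = 0.5
for epoch in range(1000):
 #forward pass
 h = sigmoid(add_bias(x) @ w1)
 out = sigmoid(add_bias(h) @ w2)

 #backward pass for squared error
 delta_out = (out - y) * out * (1 - out)
 delta_h = (delta_out @ w2[1:, :].T) * h * (1 - h)

 w2 = w2 - learning_rate * add_bias(h).T @ delta_out
 w1 = w1 - learning_rate * add_bias(x).T @ delta_h

print(w1) #every column is identical: the hidden nodes never became different

No matter how long you train, the 4 columns of w1 stay equal to each other, so the hidden layer behaves like a single node and cannot represent XOR.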

Pure Initialization

On the other hand, initializing the weights randomly enables backpropagation to work. You can populate the weights with random samples from a uniform distribution over [0, 1).

def initialize_weights(rows, columns):
 weights = np.random.random((rows+1, columns)) #uniform over [0, 1), +1 refers to bias unit
 return weights
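For instance, a layer connecting 2 input features to 4 hidden nodes would be initialized as below; the layer sizes are just illustrative.

w = initialize_weights(2, 4)
print(w.shape) #(3, 4) - 2 features plus a bias row, 4 hidden nodes
print(w.min() >= 0, w.max() < 1) #True True - samples come from [0, 1)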

 

Initializing weights in the scale of 0 to 1

Xavier Initialization

You can improve convergence by applying some additional techniques. The idea is to scale the initial weights based on the size of the layer they are connected from. This is called Xavier initialization, and it works well with the tanh activation function.

def initialize_weights(rows, columns):
 weights = np.random.randn(rows+1, columns) #normal distribution, +1 refers to bias unit
 weights = weights * np.sqrt(1/rows)
 return weights

Xavier initialization
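As a quick sanity check (a sketch, not part of the original post), the standard deviation of the generated weights should be close to sqrt(1/rows); the layer sizes and seed below are arbitrary.

np.random.seed(17) #arbitrary seed for reproducibility
w = initialize_weights(100, 50) #100 incoming nodes, 50 outgoing nodes
print(w.std(), np.sqrt(1/100)) #both values are roughly 0.1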

Xavier Initialization for ReLU

Modifying the dividend inside the square root from 1 to 2 works better for ReLU activation.

weights = weights * np.sqrt(2/(rows+1)) #+1 refers to bias unit
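Plugged into the same function, the ReLU-friendly variant becomes the sketch below; this kind of sqrt(2/fan-in) scaling is commonly referred to as He initialization.

def initialize_weights(rows, columns):
 weights = np.random.randn(rows+1, columns) #normal distribution, +1 refers to bias unit
 weights = weights * np.sqrt(2/(rows+1)) #scaled for ReLU
 return weights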

Normalized Initialization

The same research proposes another technique, called normalized initialization, which is based on the sizes of both the previous layer and the following layer.

weights = weights * np.sqrt(6/((rows+1) + columns)) #+1 refers to bias unit
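For completeness, here is the same function with the normalized scaling applied; note that the original paper draws the samples from a uniform distribution over [-limit, limit], whereas this sketch keeps the scaled normal samples used throughout this post.

def initialize_weights(rows, columns):
 weights = np.random.randn(rows+1, columns) #+1 refers to bias unit
 weights = weights * np.sqrt(6/((rows+1) + columns)) #previous layer size plus following layer size
 return weights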

Weight initialization

You can create the initial weight values in Python as coded below:

num_of_layers = len(hidden_layers) + 2 #plus input layer and output layer
w = [0 for i in range(num_of_layers-1)]

#weights from input layer to first hidden layer
w[0] = initialize_weights(num_of_features, hidden_layers[0])

#weights connecting a hidden layer to the next hidden layer
for i in range(len(hidden_layers) - 1):
 w[i+1] = initialize_weights(hidden_layers[i], hidden_layers[i+1])

#weights from final hidden layer to output layer
w[num_of_layers-2] = initialize_weights(hidden_layers[-1], num_of_classes)
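As a concrete example, the hypothetical configuration below mirrors the XOR setup mentioned earlier: 2 input features, a single hidden layer with 4 nodes, and the output one-hot encoded into 2 classes.

num_of_features = 2 #the two inputs of the xor gate
hidden_layers = [4] #a single hidden layer with 4 nodes
num_of_classes = 2 #xor output one-hot encoded into two classes

w = [0 for i in range(len(hidden_layers) + 1)]
w[0] = initialize_weights(num_of_features, hidden_layers[0]) #input to hidden
w[1] = initialize_weights(hidden_layers[0], num_of_classes) #hidden to output

print([item.shape for item in w]) #[(3, 4), (5, 2)] - the extra rows are the bias units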

So, we have focused on why random initialization is important for neural networks, and we have mentioned several initialization techniques. However, applying one of these scaled approaches is not a must: a network can usually start learning as long as its weights are initialized randomly, although the scaled variants tend to converge faster. Finally, I've pushed the weight initialization logic to GitHub.





