Weight Initialization

#programming #ml
When a neuron has a large number of input connections, weight initialization becomes a challenge. We can no longer stick to initializing weights with mean 0 and standard deviation 1.

With a very large number of inputs, if the majority of them are on (say, close to 1), the weighted input to a neuron, $$z = \sum_j w_j x_j + b$$ ends up with a very broad Gaussian distribution. The hidden neuron is then likely to saturate: when $z \gg 1$ or $z \ll -1$, the activation's derivative is close to 0, which is generally bad for learning.

Thus the strategy becomes to initialize with mean 0 and a standard deviation of the form $$\sigma = \frac{1}{\sqrt{ n_{in} }}$$ where $n_{in}$ is the number of input neurons. The weighted sum z is then sharply peaked around 0 instead of being spread out.

Bias terms are not as susceptible to causing saturation, so they are left initialized as before.
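
To see the effect numerically, here is a small sketch (the layer size of 1000 and the "all inputs on" scenario are illustrative assumptions) comparing the spread of z under the two schemes:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in = 1000                 # hypothetical number of input neurons
x = np.ones(n_in)           # worst case: all inputs are "on" (bias omitted)

# Spread of z = sum_j w_j x_j over many random initializations
w_naive  = rng.normal(0, 1,                 (10_000, n_in))  # N(0, 1) weights
w_scaled = rng.normal(0, 1 / np.sqrt(n_in), (10_000, n_in))  # N(0, 1/sqrt(n_in)) weights

print((w_naive  @ x).std())   # ~ sqrt(1000) ≈ 31.6 -> broad, saturates the neuron
print((w_scaled @ x).std())   # ~ 1                 -> sharply peaked around 0
```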

Xavier/Glorot initialization

Used when the activation function is tanh or sigmoid, this initialization tries to deal with vanishing gradients and exploding activations by drawing weights from a normal distribution with mean 0 and a standard deviation of $$\sigma = \sqrt{ \frac{2}{n_{in} + n_{out}} }$$
Formally, $$W \sim \mathcal{N}\left( 0, \frac{2}{n_{in} + n_{out} } \right)$$
Equivalently, the weights can be initialized in a uniform distribution of the form $$W \sim \mathcal{U}\left( -\sqrt{ \frac{6}{n_{in} + n_{out}} }\ ,\ \sqrt{ \frac{6}{n_{in} + n_{out}} } \right)$$
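As a rough sketch (the layer sizes here are illustrative), both forms can be sampled directly with numpy:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out = 256, 128   # example fan-in / fan-out of the layer

# Xavier/Glorot normal: std = sqrt(2 / (n_in + n_out))
w_xavier_normal = rng.normal(0, np.sqrt(2 / (n_in + n_out)), (n_in, n_out))

# Xavier/Glorot uniform: bounds ± sqrt(6 / (n_in + n_out))
bound = np.sqrt(6 / (n_in + n_out))
w_xavier_uniform = rng.uniform(-bound, bound, (n_in, n_out))
```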

He/Kaiming initialization

Used when the activation function is ReLU or leaky ReLU; the weights are initialized from a normal distribution with mean 0 and standard deviation $$ \sigma = \sqrt{ \frac{2}{n}}$$ where $n = n_{in}$, the number of inputs to the layer.

It has two forms, normal and uniform.

Formally, $$W \sim \mathcal{N}\left( 0, \frac{2}{n} \right)$$
Or a uniform distribution of the form $$W \sim \mathcal{U}\left( -\sqrt{ \frac{6}{n}}\ ,\ \sqrt{\frac{6}{n}}\right)$$

Code example

Numpy

```python
import numpy as np

n_inputs, n_outputs = 784, 30  # example fan-in / fan-out of the layer
# He/Kaiming normal: mean 0, std = sqrt(2 / n_inputs)
weights_he = np.random.normal(0, np.sqrt(2 / n_inputs), (n_inputs, n_outputs))
```

PyTorch

Follows a similar pattern to numpy; PyTorch also ships these initializers in torch.nn.init.
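
A minimal sketch using the built-in initializers (the layer sizes are illustrative):

```python
import torch
import torch.nn as nn

layer = nn.Linear(784, 30)  # example layer

# Built-in initializers from torch.nn.init
nn.init.kaiming_normal_(layer.weight, mode='fan_in', nonlinearity='relu')  # He/Kaiming
# nn.init.xavier_normal_(layer.weight)                                     # Xavier/Glorot

# Or the manual equivalent of the numpy example above
with torch.no_grad():
    layer.weight.copy_(torch.randn(30, 784) * (2 / 784) ** 0.5)
```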

Note

Just because dedicated functions for the normal and uniform initializations are provided doesn't mean they must be used.
Initializing with torch.randn(size) (mean 0, standard deviation 1 by default) also works; just remember to multiply it by the required standard deviation.
For the uniform variants, start from torch.rand(size) (uniform on [0, 1)), multiply by 2×bound, then subtract bound to land in (−bound, bound).
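
A sketch of this manual scaling (layer sizes are illustrative):

```python
import torch

n_in, n_out = 784, 30   # illustrative layer sizes

# He/Kaiming normal by hand: scale a standard normal by the desired std
w_he = torch.randn(n_out, n_in) * (2 / n_in) ** 0.5

# Xavier/Glorot uniform by hand: shift/scale a [0, 1) uniform into (-bound, bound)
bound = (6 / (n_in + n_out)) ** 0.5
w_xavier = torch.rand(n_out, n_in) * 2 * bound - bound
```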


Further reading:
Optimizing Neural Networks

