Weight Initialization

#programming #ml
When a neuron has a large number of input connections, weight initialization becomes a challenge. We can no longer stick to initializing weights with mean 0 and standard deviation 1.

With a very large number of inputs, if the majority of them are on (say, close to 1), the weighted input to a neuron, $$z = \sum_j w_j x_j + b$$ ends up with a very broad Gaussian distribution. The hidden neuron is then likely to saturate: when $z \gg 1$ or $z \ll -1$, the activation's derivative is close to 0, which is generally bad for learning.

Thus the strategy becomes to initialize with mean 0 and a standard deviation of the form $$\sigma = \frac{1}{\sqrt{ n_{in} }}$$ where $n_{in}$ is the number of input neurons. The weighted sum z is then sharply peaked around 0 instead of being spread out.

Bias terms are not as susceptible to causing saturation, so they are left initialized as before.
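
To see the effect numerically, here is a small sketch (the layer size of 1000 and the "all inputs on" scenario are illustrative assumptions) comparing the spread of z under the two schemes:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in = 1000                 # hypothetical number of input neurons
x = np.ones(n_in)           # worst case: all inputs are "on" (bias omitted)

# Spread of z = sum_j w_j x_j over many random initializations
w_naive  = rng.normal(0, 1,                 (10_000, n_in))  # N(0, 1) weights
w_scaled = rng.normal(0, 1 / np.sqrt(n_in), (10_000, n_in))  # N(0, 1/sqrt(n_in)) weights

print((w_naive  @ x).std())   # ~ sqrt(1000) ≈ 31.6 -> broad, saturates the neuron
print((w_scaled @ x).std())   # ~ 1                 -> sharply peaked around 0
```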

Xavier/Glorot initialization

Used when the activation function is tanh or sigmoid, this initialization tries to deal with vanishing gradients and exploding activations by drawing weights from a normal distribution with mean 0 and a standard deviation of $$\sigma = \sqrt{ \frac{2}{n_{in} + n_{out}} }$$
Formally, $$W \sim \mathcal{N}\left( 0, \frac{2}{n_{in} + n_{out} } \right)$$
Equivalently, the weights can be initialized in a uniform distribution of the form $$W \sim \mathcal{U}\left( -\sqrt{ \frac{6}{n_{in} + n_{out}} }\ ,\ \sqrt{ \frac{6}{n_{in} + n_{out}} } \right)$$
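As a rough sketch (the layer sizes here are illustrative), both forms can be sampled directly with numpy:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out = 256, 128   # example fan-in / fan-out of the layer

# Xavier/Glorot normal: std = sqrt(2 / (n_in + n_out))
w_xavier_normal = rng.normal(0, np.sqrt(2 / (n_in + n_out)), (n_in, n_out))

# Xavier/Glorot uniform: bounds ± sqrt(6 / (n_in + n_out))
bound = np.sqrt(6 / (n_in + n_out))
w_xavier_uniform = rng.uniform(-bound, bound, (n_in, n_out))
```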

He/Kaiming initialization

Used when the activation function is ReLU or leaky ReLU; the weights are initialized from a normal distribution with mean 0 and standard deviation $$ \sigma = \sqrt{ \frac{2}{n}}$$ where $n = n_{in}$, the number of inputs to the layer.

It has two forms, normal and uniform.

Formally, $$W \sim \mathcal{N}\left( 0, \frac{2}{n} \right)$$
Or a uniform distribution of the form $$W \sim \mathcal{U}\left( -\sqrt{ \frac{6}{n}}\ ,\ \sqrt{\frac{6}{n}}\right)$$

Code example

Numpy

```python
import numpy as np

n_inputs, n_outputs = 784, 30  # example fan-in / fan-out of the layer
# He/Kaiming normal: mean 0, std = sqrt(2 / n_inputs)
weights_he = np.random.normal(0, np.sqrt(2 / n_inputs), (n_inputs, n_outputs))
```

PyTorch

Follows a similar pattern to numpy; PyTorch also ships these initializers in torch.nn.init.
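
A minimal sketch using the built-in initializers (the layer sizes are illustrative):

```python
import torch
import torch.nn as nn

layer = nn.Linear(784, 30)  # example layer

# Built-in initializers from torch.nn.init
nn.init.kaiming_normal_(layer.weight, mode='fan_in', nonlinearity='relu')  # He/Kaiming
# nn.init.xavier_normal_(layer.weight)                                     # Xavier/Glorot

# Or the manual equivalent of the numpy example above
with torch.no_grad():
    layer.weight.copy_(torch.randn(30, 784) * (2 / 784) ** 0.5)
```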

Note

Just because dedicated functions for the normal and uniform initializations are provided doesn't mean they must be used.
Initializing with torch.randn(size) (mean 0, standard deviation 1 by default) also works; just remember to multiply it by the required standard deviation.
For the uniform variants, start from torch.rand(size) (uniform on [0, 1)), multiply by 2×bound, then subtract bound to land in (−bound, bound).
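
A sketch of this manual scaling (layer sizes are illustrative):

```python
import torch

n_in, n_out = 784, 30   # illustrative layer sizes

# He/Kaiming normal by hand: scale a standard normal by the desired std
w_he = torch.randn(n_out, n_in) * (2 / n_in) ** 0.5

# Xavier/Glorot uniform by hand: shift/scale a [0, 1) uniform into (-bound, bound)
bound = (6 / (n_in + n_out)) ** 0.5
w_xavier = torch.rand(n_out, n_in) * 2 * bound - bound
```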


Further reading:
Optimizing Neural Networks

