Weight Initialization
#programming #ml
When a layer has a large number of input neurons, weight initialization becomes a challenge. We can no longer simply draw every weight from a distribution with mean 0 and standard deviation 1.
With a very large number of inputs, if the majority of them are on (say, equal to 1), the weighted sum $z = \sum_j w_j x_j + b$ has a standard deviation of roughly $\sqrt{n}$, so $|z|$ is very likely to be large and the neuron saturates, which slows learning.
Thus the strategy becomes to initialize with mean 0 and a standard deviation of the form $\frac{1}{\sqrt{n_{in}}}$, where $n_{in}$ is the number of input connections.
Bias terms do not drive this saturation in the same way, so their initialization is left unchanged.
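A quick numerical check of this (a sketch, assuming a layer with 1000 inputs that are all equal to 1):
```python
import numpy as np

n = 1000
x = np.ones(n)  # all inputs "on"

# Naive N(0, 1) weights: the pre-activation z = w.x has std ~ sqrt(n) ≈ 31.6,
# so |z| is almost always large and a sigmoid/tanh neuron would saturate.
w_naive = np.random.normal(0, 1, n)
print(abs(w_naive @ x))    # typically in the tens

# Scaling the standard deviation by 1/sqrt(n) keeps z at order 1.
w_scaled = np.random.normal(0, 1 / np.sqrt(n), n)
print(abs(w_scaled @ x))   # typically around 1
```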
Xavier/Glorot initialization
Used when activation functions are of the sigmoid or tanh type.
Formally, $$W \sim \mathcal{N}\left( 0, \frac{2}{n_{in} + n_{out} } \right)$$
Equivalently, the weights can be initialized in a uniform distribution of the form $$W \sim \mathcal{U}\left( -\sqrt{ \frac{6}{n_{in} + n_{out}} }\ ,\ \sqrt{ \frac{6}{n_{in} + n_{out}} } \right)$$
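A minimal NumPy sketch of both forms (the layer sizes 784 and 256 are only placeholders):
```python
import numpy as np

n_in, n_out = 784, 256  # placeholder layer sizes

# Normal form: variance 2 / (n_in + n_out)
w_xavier_normal = np.random.normal(0, np.sqrt(2 / (n_in + n_out)), (n_in, n_out))

# Uniform form: limits +/- sqrt(6 / (n_in + n_out))
limit = np.sqrt(6 / (n_in + n_out))
w_xavier_uniform = np.random.uniform(-limit, limit, (n_in, n_out))
```
PyTorch exposes the same recipes as torch.nn.init.xavier_normal_ and torch.nn.init.xavier_uniform_.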
He/Kaiming initialization
Used when activation functions are of the ReLU or leaky-ReLU type; the weights are initialized from a normal distribution with mean 0 and standard deviation $$ \sigma = \sqrt{ \frac{2}{n}}$$ where $n$ is the fan (number of connections, see modes below).
It has two sub-classifications:
- Fan-in mode : $n = n_{in}$, the number of incoming connections; preserves the variance of activations in the forward pass.
- Fan-out mode : $n = n_{out}$, the number of outgoing connections; preserves the variance of gradients in the backward pass.
Formally, $$W \sim \mathcal{N}\left( 0, \frac{2}{n} \right)$$
Or a uniform distribution of the form $$W \sim \mathcal{U}\left( -\sqrt{ \frac{6}{n}}\ ,\ \sqrt{\frac{6}{n}}\right)$$
Code example
Numpy
```python
import numpy as np

n_inputs, n_outputs = 784, 256  # example layer sizes
# He/Kaiming (fan-in) normal initialization: std = sqrt(2 / n_inputs)
weights_he = np.random.normal(0, np.sqrt(2 / n_inputs), (n_inputs, n_outputs))
```
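The uniform form can be drawn the same way (a sketch, reusing n_inputs and n_outputs from above):
```python
# He/Kaiming uniform form: limits +/- sqrt(6 / n_inputs)
limit = np.sqrt(6 / n_inputs)
weights_he_uniform = np.random.uniform(-limit, limit, (n_inputs, n_outputs))
```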
PyTorch
Follows a similar pattern to NumPy.
Just because dedicated initialization functions for the normal and uniform forms are provided (e.g. in torch.nn.init), they don't have to be used.
A tensor created with torch.randn(size) (mean 0, standard deviation 1 by default) also works; just remember to multiply it by the required standard deviation.
For the uniform form, start from torch.rand(size) (uniform on $[0, 1)$), then scale and shift it into the range $\left[ -\sqrt{ \frac{6}{n}}\ ,\ \sqrt{\frac{6}{n}} \right]$.
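A short sketch of both routes (the layer sizes are placeholders; the built-in initializers live in torch.nn.init):
```python
import torch

n_inputs, n_outputs = 784, 256  # placeholder layer sizes

# He/Kaiming normal by hand: scale N(0, 1) samples by the required std.
w_he = torch.randn(n_inputs, n_outputs) * (2 / n_inputs) ** 0.5

# Uniform form by hand: torch.rand gives U[0, 1); scale and shift into [-limit, limit].
limit = (6 / n_inputs) ** 0.5
w_he_uniform = torch.rand(n_inputs, n_outputs) * 2 * limit - limit

# Built-in initializer, with the fan mode selectable
# (nn.Linear stores weights as (out_features, in_features), hence the shape).
w = torch.empty(n_outputs, n_inputs)
torch.nn.init.kaiming_normal_(w, mode='fan_in', nonlinearity='relu')   # fan-in mode
# torch.nn.init.kaiming_uniform_(w, mode='fan_out', nonlinearity='relu')  # fan-out, uniform variant
```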
Further reading:
Optimizing Neural Networks
Continue reading
Optimizing Hyperparameters