Regularization

#programming #ml
This note elaborates on a few commonly used regularization techniques in Deep Learning.
It answers the general question: "How do we reduce overfitting?"

Regularization: A technique used to reduce overfitting in large models.

Weight decay or L2 regularization

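For comparison with the L1 rule below, here is the standard L2 form: the regularized cost adds a term proportional to the sum of squared weights, and the resulting gradient-descent update rescales each weight before the usual gradient step.

$$\begin{align}C &= C_{0} + \frac{\lambda}{2n}\sum_{w}w^{2} \\ w &\to \left(1 - \eta\frac{\lambda}{n}\right)w - \eta\frac{\partial C_{0}}{\partial w}\end{align}$$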
Why are small weights better?

That is, why do small weights ensure the model doesn't overfit? Roughly: with small weights, the network's output changes only slightly when individual inputs change, so it learns patterns that recur across the training set rather than the noise in individual examples. For the fuller argument, see Why small weights prevent overfitting.

L1 regularization

$$\begin{align}C &= C_{0} + \frac{\lambda}{n}\sum_{w}|w| \\ w &\to w - \eta\frac{\lambda}{n}\,\mathrm{sgn}(w) - \eta\frac{\partial C_{0}}{\partial w}\end{align}$$

Note

  • This means that when $|w|$ is large, L1 shrinks it by a much smaller amount than L2, and when $|w|$ is small, by a much larger amount: the L1 step is a constant $\eta\frac{\lambda}{n}$ regardless of the size of $w$, while the L2 step is proportional to $w$ (see the sketch below).
  • Thus L1 tends to concentrate the weights into a small number of high-importance connections, while driving the other weights to 0.
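A minimal NumPy sketch of just the regularization part of each update (the $\eta\frac{\partial C_{0}}{\partial w}$ gradient term is omitted, and the values of $\eta$, $\lambda$, $n$ are hypothetical, chosen for illustration only):

```python
import numpy as np

# Hypothetical hyperparameters: learning rate, regularization strength, training-set size.
eta, lam, n = 0.5, 0.1, 1000

def l1_shrink(w):
    # L1 subtracts a constant amount eta * lam / n, regardless of the size of w.
    return w - eta * (lam / n) * np.sign(w)

def l2_shrink(w):
    # L2 shrinks w by an amount proportional to its current value.
    return w * (1 - eta * lam / n)

for w in [5.0, 0.001]:
    print(f"w={w}: L1 step={w - l1_shrink(w):.2e}, L2 step={w - l2_shrink(w):.2e}")
# For w=5.0 the L2 step (2.5e-04) is larger than the L1 step (5e-05);
# for w=0.001 the L1 step (5e-05) is 1000x the L2 step (5e-08).
```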

Dropout

A quote from the original dropout paper:

  • This technique reduces complex co-adaptations of neurons, since a neuron cannot rely on the presence of particular other neurons. It is, therefore, forced to learn more robust features that are useful in conjunction with many different random subsets of the other neurons.

The most common way to do this is to randomly switch off half the neurons in each hidden layer during training, and let the rest do the work.
When the full network is finally switched on and used, each outgoing weight from these neurons is halved, to account for twice as many neurons now being active.
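A minimal NumPy sketch of this scheme (the function names and the `p_drop` parameter are my own; modern frameworks typically implement "inverted" dropout, which rescales at training time instead):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_forward(activations, p_drop=0.5, training=True):
    # While training, randomly switch off a fraction p_drop of the neurons
    # by zeroing their activations; the surviving neurons do the work.
    if not training:
        return activations
    mask = rng.random(activations.shape) >= p_drop  # keep with probability 1 - p_drop
    return activations * mask

def scale_for_inference(weights, p_drop=0.5):
    # At test time the full network is active, so halve each outgoing weight
    # (for p_drop=0.5) to compensate for twice as many active neurons.
    return weights * (1 - p_drop)
```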

An obvious technique

More training data

Empirically, more training data leads to better model performance, irrespective of the algorithm or architecture used.
This raises the question:

How does an algorithm's performance scale asymptotically as the size of its training data set tends to infinity?

