Regularization
#programming #ml
These notes elaborate on a few commonly used regularization techniques in Deep Learning.
They answer the general question of "How do we reduce overfitting?"
Regularization: A technique used to reduce overfitting in large models.
Weight decay or L2 regularization
- We edit the cost function to include a function of weights, trying to capture the magnitude of the weights, in the manner $$C = C_{0} + \frac{\lambda}{2n} \sum_{i}w^2_{i}$$
- Here $\sum_{i}w^2_{i}$ is the sum of the squares of all the weights in the network, and $\lambda$ is a carefully chosen constant. Note the absence of the biases. $n$ is the size of the training set, not the number of weights.
- The new cost function is basically a compromise between having small weights and making the original cost function smaller, the relative importance being in the hands of $\lambda$.
- This doesn't greatly complicate the backpropagation equations, since we already know $\frac{\partial C_{0}}{\partial w}$; the extra term is just $\frac{\lambda}{n}w$.
- The learning rule for the weights therefore becomes $$w \to \left(1 - \frac{\eta\lambda}{n}\right)w - \eta\frac{\partial C_{0}}{\partial w}$$
- The re-scaling of the original $w$ term in this equation by the factor $\left(1 - \frac{\eta\lambda}{n}\right)$ is called "weight decay".
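As a sanity check, here is a minimal NumPy sketch of that update rule; the names `l2_update`, `grad_C0`, `eta`, `lam` and `n` are placeholders of my own, not from any particular framework.

```python
import numpy as np

def l2_update(w, grad_C0, eta, lam, n):
    """One gradient-descent step on the L2-regularized cost.

    w       : array of weights
    grad_C0 : dC0/dw, gradient of the original (unregularized) cost
    eta     : learning rate
    lam     : regularization constant lambda
    n       : size of the training set (not the number of weights)
    """
    # The penalty contributes (lam / n) * w to the gradient, which
    # rescales w by (1 - eta * lam / n): the "weight decay".
    return (1 - eta * lam / n) * w - eta * grad_C0
```

Most deep learning libraries expose this directly, e.g. the `weight_decay` argument of PyTorch's `torch.optim.SGD`.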
I mean, why do we need small weights to ensure the model doesn't overfit?
Why small weights prevent overfitting
L1 regularization $$\begin{align}C &= C_{0} + \frac{\lambda}{n}\sum_{i}|w_{i}| \\ w &\to w - \eta\frac{\lambda}{n}\operatorname{sgn}(w) - \eta\frac{\partial C_{0}}{\partial w} \end{align}$$
- By using the modulus, the penalty's contribution to the update no longer depends on the magnitude of the weight, only on its sign.
- This means that when $|w|$ is large, L1 shrinks it by a much smaller amount than L2 does, and when $|w|$ is small, L1 shrinks it by a much larger amount.
- Thus L1 tends to concentrate the weights into a small number of high-importance connections, while driving the other weights to 0.
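A matching NumPy sketch of the L1 update, again with placeholder names of my own; `np.sign` plays the role of $\operatorname{sgn}(w)$.

```python
import numpy as np

def l1_update(w, grad_C0, eta, lam, n):
    """One gradient-descent step on the L1-regularized cost.

    The penalty pulls every weight towards 0 by the same fixed amount
    eta * lam / n, regardless of the weight's magnitude.
    """
    return w - eta * (lam / n) * np.sign(w) - eta * grad_C0
```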
Dropout
- Quite literally as the name suggests, we implement it by "dropping out" certain neurons in the hidden layers.
- We randomly and temporarily delete half the neurons in the hidden layer and train with the remaining ones as normal: pushing a forward pass through, backpropagating the error, and updating the required parameters.
- Then we restore the dropped neurons and repeat with a different random subset.
- It can be thought of as similar to training multiple networks in parallel and then combining their results and parameters into one monster network.
- This technique reduces complex co-adaptations of neurons, since a neuron cannot rely on the presence of particular other neurons. It is, therefore, forced to learn more robust features that are useful in conjunction with many different random subsets of the other neurons.
- Dropout has been experimentally successful for reducing overfitting, especially in huge deep learning networks.
The most common way to do this is to randomly switch off half the neurons in each hidden layer and let the rest do the work, as sketched below.
When the full network is finally switched back on and used, each outgoing weight from the hidden layers is halved, to compensate for twice as many neurons being active.
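A rough NumPy sketch of one hidden layer under this scheme. The single ReLU layer is an assumption for illustration only; scaling the activations by $(1 - p)$ at test time is equivalent to halving the outgoing weights when $p = 0.5$.

```python
import numpy as np

def hidden_layer(x, W, b, p_drop=0.5, training=True):
    """Hidden-layer activations with dropout.

    During training, each neuron is kept with probability 1 - p_drop
    and switched off otherwise. At test time every neuron is active,
    so the activations are scaled by (1 - p_drop), which has the same
    effect as halving the outgoing weights for p_drop = 0.5.
    """
    a = np.maximum(0.0, x @ W + b)  # ReLU activations, shape (batch, hidden)
    if training:
        mask = np.random.rand(*a.shape) >= p_drop  # True for the neurons we keep
        return a * mask
    return a * (1.0 - p_drop)  # test time: scale instead of masking
```

In practice one would typically reach for a built-in such as `torch.nn.Dropout`, which uses the equivalent "inverted" variant: the kept activations are scaled up by $1/(1 - p)$ during training, so nothing needs to be rescaled at test time.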
More training data
Empirical results show that more training data leads to better model performance, largely irrespective of the algorithm or architecture used.
This begs the question,
What is the asymptotic behaviour of how well an algorithm can perform as the size of its training data set tends to infinity?