This note elaborates on Multiclass SVM loss, also known as Hinge Loss.
Multiclass SVM Loss (Hinge Loss)
A loss function commonly used in the context of "Support Vector Machines".
Since SVMs are used for classification problems, the loss function only tries to ensure that the correct class attains the highest score, by at least a margin of $\Delta$.
Therefore, the loss for a single training example is given by $$L_{i} = \sum_{j \neq y_{i}} \max(0, s_{j} - s_{y_{i}} + \Delta)$$
where $s$ is the vector of predicted scores, $s_j$ is the score of class $j$, and $s_{y_i}$ is the score assigned to the correct class $y_i$.
The total loss over all $N$ training examples is just the average. $$L = \frac{1}{N}\sum_{i=1}^{N} L_{i}$$
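The per-example sum and the batch average can be written in a few lines of NumPy. This is a minimal sketch; the function name `svm_loss` is illustrative, not from the note.

```python
import numpy as np

def svm_loss(scores, y, delta=1.0):
    """Average multiclass SVM (hinge) loss over a batch.

    scores: (N, C) array of class scores, one row per example.
    y:      (N,) array of correct-class indices.
    """
    N = scores.shape[0]
    correct = scores[np.arange(N), y][:, None]           # (N, 1) correct-class scores
    margins = np.maximum(0.0, scores - correct + delta)  # hinge on every class
    margins[np.arange(N), y] = 0.0                       # enforce j != y_i in the sum
    return margins.sum() / N
```

For example, with scores `[3.2, 5.1, -1.7]` and correct class `0`, only the second class violates the margin, contributing `5.1 - 3.2 + 1 = 2.9`.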
Explanation:
Let $s$ be the vector of scores assigned, and say the index $y_i$ is the correct class, so $s_{y_i}$ is its score. We effectively sum up, for each incorrect class, the amount by which its score comes within the margin $\Delta$ of the correct class's score; classes scoring below $s_{y_i} - \Delta$ contribute nothing.
$\Delta$ is an arbitrarily chosen constant (commonly $\Delta = 1$), since with a linear classifier, Hinge Loss does not lead to a unique minimum in loss space; different choices of $\Delta$ will just end up re-scaling our weight matrix/matrices to accommodate the margin, but its role will not change.
A problem with this loss is that the $W$ found from gradient descent is not unique: the loss is invariant to certain rescalings of $W$. If $L(W) = 0$, the magnitude of $W$ is not pinned down and can be basically anything.
If a $W$ satisfies $L(W) = 0$, then so will $\lambda W$ for any $\lambda \geq 1$, since scaling the weights scales every score difference $s_j - s_{y_i}$ by $\lambda$, pushing already-satisfied margins even further into the satisfied region. The set of zero-loss weights is therefore closed under such scaling.
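The scaling argument can be checked numerically. In this sketch (toy one-feature linear classifier, names chosen for illustration) the scores land exactly on the margin boundary, so scaling $W$ up preserves zero loss while shrinking it does not:

```python
import numpy as np

x = np.array([1.0])
# Scores s = W @ x sit exactly on the margin boundary: s = [1, 0, 0].
W = np.array([[1.0], [0.0], [0.0]])
y, delta = 0, 1.0

def example_loss(W):
    s = W @ x
    margins = np.maximum(0.0, s - s[y] + delta)
    margins[y] = 0.0
    return float(margins.sum())

print(example_loss(W))        # 0.0 -- margins exactly satisfied
print(example_loss(2.0 * W))  # 0.0 -- scaling up preserves zero loss
print(example_loss(0.5 * W))  # 1.0 -- shrinking below lambda = 1 violates both margins
```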
To combat this issue, and since we don't want very large $W$, we introduce the notion of regularization to ensure that the magnitude of each weight does not grow out of bounds.
Thus the loss function will look like $$L = \frac{1}{N}\sum_{i=1}^N\sum_{j \neq y_{i}}\max(0, s_{j} - s_{y_{i}} + 1) + \lambda R(W)$$where $R(W)$ is some function of the weights and $\lambda$ controls its strength; refer Regularization.
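A minimal sketch of the regularized loss, assuming the common choice $R(W) = \sum W^2$ (L2 regularization); the function name and signature are illustrative:

```python
import numpy as np

def svm_loss_regularized(W, X, y, delta=1.0, lam=0.1):
    """Hinge loss plus L2 regularization, one common choice of R(W)."""
    scores = X @ W.T                                     # (N, C) class scores
    N = X.shape[0]
    correct = scores[np.arange(N), y][:, None]
    margins = np.maximum(0.0, scores - correct + delta)
    margins[np.arange(N), y] = 0.0
    data_loss = margins.sum() / N
    reg_loss = lam * np.sum(W * W)                       # R(W) = sum of squared weights
    return data_loss + reg_loss
```

The regularization term breaks the scaling ambiguity: among weight matrices with zero data loss, the smallest one is now preferred.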
Initial loss
When the weight matrix is initialized with all weights $\approx 0$, we expect our initial score vectors to be almost equal, so each of the incorrect classes contributes a margin of $\approx \Delta$. With $\Delta = 1$, the total loss is $$L \approx \text{No. of classes} - 1$$
This is a good debugging check to keep in mind when implementing SVM loss.
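The check is easy to run: initialize near-zero weights, compute the loss, and confirm it is close to $C - 1$. A minimal sketch (shapes and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, C = 500, 10, 5
X = rng.normal(size=(N, D))
y = rng.integers(0, C, size=N)
W = 1e-4 * rng.normal(size=(C, D))   # near-zero init -> near-equal scores

scores = X @ W.T
correct = scores[np.arange(N), y][:, None]
margins = np.maximum(0.0, scores - correct + 1.0)   # Delta = 1
margins[np.arange(N), y] = 0.0
loss = margins.sum() / N
print(loss)  # approximately C - 1 = 4
```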
Weighted Hinge Loss
In many real-world datasets, some misclassifications are worse than others, so we introduce a per-class cost term $C_j$ that encapsulates how bad each misclassification is $$L_i = \sum_{j \neq y_i} C_{j} \cdot \max(0, s_j - s_{y_i} + \Delta)$$
You introduce a cost weight $C_j$ for each class. If class $j$ is particularly "dangerous" to confuse with the correct class, you increase $C_j$ so the loss "screams" louder when that specific margin is violated.
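The weighted variant is a one-line change to the vectorized loss: multiply each class's margin by its cost before summing. A minimal sketch; `weighted_svm_loss` and `class_cost` are illustrative names.

```python
import numpy as np

def weighted_svm_loss(scores, y, class_cost, delta=1.0):
    """Hinge loss where class j's margin violation is scaled by class_cost[j].

    scores:     (N, C) array of class scores.
    y:          (N,) correct-class indices.
    class_cost: (C,) per-class cost weights C_j.
    """
    N = scores.shape[0]
    correct = scores[np.arange(N), y][:, None]
    margins = np.maximum(0.0, scores - correct + delta)
    margins[np.arange(N), y] = 0.0
    return (margins * class_cost[None, :]).sum() / N
```

With uniform costs of 1 this reduces to the plain hinge loss; doubling the cost of a class doubles its contribution whenever its margin is violated.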