A loss function is a method of evaluating how well a specific algorithm models the given data. If predictions deviate too much from the actual results, the loss function will produce a large positive number. Gradually, with the help of an optimization function, the model learns to make better predictions and reduce the overall loss.
The cost function is the average of the losses. You first compute one loss per data point, based on your prediction and your ground-truth label. Then you average these losses, which gives you the cost.
Due to the squaring, predictions that are far from the actual values are penalized much more heavily than less deviated ones. MSE also has nice mathematical properties that make it easy to compute gradients. L2 loss is sensitive to outliers, but it yields a stable, closed-form solution (obtained by setting its derivative to 0).
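As a minimal sketch (the function name is my own, using NumPy), MSE and its sensitivity to outliers look like this:

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: average of squared residuals."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return np.mean((y_true - y_pred) ** 2)

# A single outlier (error of 3) dominates the loss because of the squaring:
# squared errors are [0, 0, 9], so the mean is 3.0.
print(mse([1.0, 2.0, 3.0], [1.0, 2.0, 6.0]))  # 3.0
```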
Like MSE, MAE measures the magnitude of the error without considering its direction. Unlike MSE, MAE is not differentiable at zero, and minimizing it may require more complicated tools such as linear programming. On the other hand, MAE is more robust to outliers since it does not square the errors.
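A matching sketch for MAE (same hypothetical helper style as above) shows how the same outlier now contributes only linearly:

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error: average of absolute residuals."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return np.mean(np.abs(y_true - y_pred))

# Same data as the MSE example: absolute errors are [0, 0, 3],
# so the outlier no longer dominates the average.
print(mae([1.0, 2.0, 3.0], [1.0, 2.0, 6.0]))  # 1.0
```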
Huber loss is less sensitive to outliers in the data than squared error loss, and unlike MAE it is differentiable at 0. It is basically absolute error that becomes quadratic when the error is small.
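A sketch of the standard Huber definition (quadratic inside a threshold delta, linear outside; the function name is mine):

```python
import numpy as np

def huber(y_true, y_pred, delta=1.0):
    """Quadratic for |error| <= delta, linear beyond it."""
    err = np.asarray(y_true, float) - np.asarray(y_pred, float)
    small = np.abs(err) <= delta
    quadratic = 0.5 * err ** 2
    linear = delta * (np.abs(err) - 0.5 * delta)
    return np.mean(np.where(small, quadratic, linear))

print(huber([0.0], [0.5]))  # 0.125 (quadratic region: 0.5 * 0.5**2)
print(huber([0.0], [3.0]))  # 2.5   (linear region: 1 * (3 - 0.5))
```

The two pieces are chosen so the loss and its derivative match at |error| = delta, which is what makes it differentiable everywhere.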
log(cosh(x)) is approximately equal to (x ** 2) / 2 for small x and to abs(x) - log(2) for large x.
This means that 'logcosh' works mostly like the mean squared error, but will not be so strongly affected by the occasional wildly incorrect prediction.
It has all the advantages of Huber loss and, unlike Huber loss, it is twice differentiable everywhere.
But log-cosh loss isn't perfect.
It still suffers from the problem that the gradient and Hessian for very large off-target predictions are constant, resulting in the absence of useful splits for XGBoost.
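A numerically safe sketch of log-cosh (my own helper; `np.logaddexp` avoids overflowing `cosh` for large errors), verifying the two approximations quoted above:

```python
import numpy as np

def log_cosh(y_true, y_pred):
    """log(cosh(error)), averaged over the batch."""
    err = np.asarray(y_pred, float) - np.asarray(y_true, float)
    # log(cosh(x)) = log((e^x + e^-x) / 2) = logaddexp(x, -x) - log(2)
    return np.mean(np.logaddexp(err, -err) - np.log(2.0))

# ~ x**2 / 2 for a small error of 0.1 ...
print(log_cosh([0.0], [0.1]), 0.1 ** 2 / 2)       # both ~ 0.005
# ... and ~ |x| - log(2) for a large error of 10.
print(log_cosh([0.0], [10.0]), 10 - np.log(2.0))  # both ~ 9.307
```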
It can also reveal whether the model has a positive or a negative bias.
Cross-entropy loss increases as the predicted probability diverges from the actual label. An important aspect of this is that cross-entropy heavily penalizes predictions that are confident but wrong.
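A sketch of the binary case (helper name and the clipping epsilon are my own choices) makes the "confident but wrong" penalty concrete:

```python
import numpy as np

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """-[y log p + (1 - y) log(1 - p)], averaged over the batch."""
    y = np.asarray(y_true, float)
    p = np.clip(np.asarray(p_pred, float), eps, 1 - eps)  # avoid log(0)
    return np.mean(-(y * np.log(p) + (1 - y) * np.log(1 - p)))

# True label is 1. An unsure prediction costs little;
# a confident wrong one costs almost ten times more.
print(binary_cross_entropy([1], [0.6]))   # ~0.51
print(binary_cross_entropy([1], [0.01]))  # ~4.61
```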
In simple terms, the score of the correct category should be greater than the score of every incorrect category by some safety margin (usually one). Hence hinge loss is used for maximum-margin classification, most notably for support vector machines. Although not differentiable everywhere, it is a convex function, which makes it easy to work with the usual convex optimizers used in the machine learning domain.
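A sketch of the multiclass hinge loss for a single example (the Weston-Watkins form; function name is mine), summing margin violations over the incorrect classes:

```python
import numpy as np

def multiclass_hinge(scores, correct, margin=1.0):
    """Sum over j != correct of max(0, s_j - s_correct + margin)."""
    scores = np.asarray(scores, float)
    margins = np.maximum(0.0, scores - scores[correct] + margin)
    margins[correct] = 0.0  # the correct class never penalizes itself
    return margins.sum()

# Correct class (index 0) beats both others by at least 1: zero loss.
print(multiclass_hinge([5.0, 3.0, 2.0], correct=0))  # 0.0
# Class 1 is inside the margin: loss = max(0, 3.5 - 4.0 + 1) = 0.5
print(multiclass_hinge([4.0, 3.5, 1.0], correct=0))  # 0.5
```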
A measure of how one probability distribution differs from a second, reference probability distribution.
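A sketch of KL divergence over discrete distributions (helper name is mine); note it is asymmetric, so D_KL(P || Q) generally differs from D_KL(Q || P):

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """D_KL(P || Q) = sum p * log(p / q) over the support of P."""
    p = np.asarray(p, float)
    q = np.clip(np.asarray(q, float), eps, None)  # avoid division by zero
    mask = p > 0  # by convention, 0 * log(0) contributes 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

uniform = [0.5, 0.5]
skewed = [0.9, 0.1]
print(kl_divergence(uniform, uniform))  # 0.0 (identical distributions)
print(kl_divergence(uniform, skewed))   # ~0.51
```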
Embedding loss function
A more efficient loss function for Siamese NN
So as long as the negative is further from the anchor than the positive plus the margin alpha, the loss is zero and the algorithm gains nothing from pulling the positive and the anchor closer together.
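A sketch of the triplet loss for one (anchor, positive, negative) triple, using squared Euclidean distance (one common choice; the function name is mine):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, alpha=0.2):
    """max(0, d(a, p) - d(a, n) + alpha), d = squared Euclidean distance."""
    a, p, n = (np.asarray(x, float) for x in (anchor, positive, negative))
    d_pos = np.sum((a - p) ** 2)
    d_neg = np.sum((a - n) ** 2)
    return max(0.0, d_pos - d_neg + alpha)

a = [0.0, 0.0]
# Negative already further than positive + alpha: zero loss, no gradient.
print(triplet_loss(a, [0.1, 0.0], [2.0, 0.0]))  # 0.0
# Negative closer than the positive: loss = 1.0 - 0.25 + 0.2 = 0.95
print(triplet_loss(a, [1.0, 0.0], [0.5, 0.0]))  # 0.95
```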
It can also be used for Siamese NNs.
where m > 0 is a margin. The margin defines a radius around GW(X): dissimilar pairs contribute to the loss only if their distance falls within this radius.
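A sketch of the contrastive loss for a single pair of embeddings (function name and the `same` flag are my own; this follows the usual formulation with Euclidean distance D between the two embeddings):

```python
import numpy as np

def contrastive_loss(x1, x2, same, m=1.0):
    """Similar pairs (same=1) are pulled together; dissimilar pairs
    are pushed out past the margin radius m."""
    d = np.linalg.norm(np.asarray(x1, float) - np.asarray(x2, float))
    if same:
        return 0.5 * d ** 2
    return 0.5 * max(0.0, m - d) ** 2

# Dissimilar pair already outside the margin radius: zero loss.
print(contrastive_loss([0.0, 0.0], [2.0, 0.0], same=0))  # 0.0
# Similar pair at distance 0.5: loss = 0.5 * 0.5**2 = 0.125
print(contrastive_loss([0.0, 0.0], [0.5, 0.0], same=1))  # 0.125
```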
Minimax refers to an optimization strategy in two-player turn-based games for minimizing the loss or cost for the worst case of the other player.
discriminator: maximize log D(x) + log(1 – D(G(z)))
generator: minimize log(1 – D(G(z)))
In other words, D and G play the following two-player minimax game with value function V(G, D): min over G, max over D of V(D, G) = E_x[log D(x)] + E_z[log(1 – D(G(z)))].
In practice, this loss function for the generator saturates. This means that if the generator cannot learn as quickly as the discriminator, the discriminator wins, the game ends, and the model cannot be trained effectively.
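A sketch of both objectives for a single sample (function names are mine), including the common non-saturating alternative, where the generator maximizes log D(G(z)) instead of minimizing log(1 – D(G(z))):

```python
import numpy as np

def d_loss(d_real, d_fake):
    """Discriminator maximizes log D(x) + log(1 - D(G(z)));
    written here as a loss to minimize."""
    return -(np.log(d_real) + np.log(1.0 - d_fake))

def g_loss_saturating(d_fake):
    """Original generator objective: minimize log(1 - D(G(z)))."""
    return np.log(1.0 - d_fake)

def g_loss_nonsaturating(d_fake):
    """Non-saturating variant: maximize log D(G(z))."""
    return -np.log(d_fake)

# Early in training D easily rejects fakes (D(G(z)) near 0):
# the saturating loss is nearly flat there, the alternative is not.
print(g_loss_saturating(0.01))     # ~ -0.01 (tiny gradient)
print(g_loss_nonsaturating(0.01))  # ~ 4.61  (strong signal)
```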