
Machine Learning Loss Functions

A loss function is a method of evaluating how well a specific algorithm models the given data. If the predictions deviate too much from the actual results, the loss function produces a large positive number. Gradually, with the help of an optimization algorithm, the model makes better predictions and reduces the overall loss.

The cost function is the average of the losses. You first calculate one loss per data point, based on your prediction and the ground-truth label. You then average these losses, which gives you the cost.
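For instance, a quick NumPy sketch of this distinction, using squared error as the per-point loss (the values here are just illustrative):

```python
import numpy as np

y_true = np.array([3.0, -0.5, 2.0, 7.0])   # ground-truth labels
y_pred = np.array([2.5,  0.0, 2.0, 8.0])   # model predictions

per_point_loss = (y_true - y_pred) ** 2     # one loss value per data point
cost = per_point_loss.mean()                # the cost is the average of the losses
print(per_point_loss, cost)
```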

Regression Losses

Root Mean Squared Error (L2):

Due to squaring, predictions that are far from the actual values are penalized heavily in comparison with less deviated predictions. MSE also has nice mathematical properties that make it easy to compute gradients. The L2 loss is sensitive to outliers, but gives a more stable, closed-form solution (obtained by setting its derivative to 0).

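A minimal NumPy sketch of MSE and RMSE (the function names and sample values are illustrative):

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: average of the squared residuals."""
    return np.mean((y_true - y_pred) ** 2)

def rmse(y_true, y_pred):
    """Root mean squared error: square root of the MSE."""
    return np.sqrt(mse(y_true, y_pred))

y_true = np.array([1.0, 2.0, 3.0, 10.0])
y_pred = np.array([1.1, 1.9, 3.2,  4.0])   # the last point is a large miss
print(mse(y_true, y_pred), rmse(y_true, y_pred))  # the large miss dominates
```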

Mean Absolute Error (L1):

Like MSE, MAE measures the magnitude of the error without considering its direction. Unlike MSE, MAE requires more complicated tools, such as linear programming, to compute gradients, since the absolute value is not differentiable at 0. On the other hand, MAE is more robust to outliers because it does not square the errors.

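A matching sketch for MAE, on the same illustrative data:

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error: average of the absolute residuals."""
    return np.mean(np.abs(y_true - y_pred))

y_true = np.array([1.0, 2.0, 3.0, 10.0])
y_pred = np.array([1.1, 1.9, 3.2,  4.0])
print(mae(y_true, y_pred))  # the outlier contributes linearly, not quadratically
```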

Huber Loss:

Huber loss is less sensitive to outliers than the squared error loss, and it is differentiable at 0. It is basically the absolute error, which becomes quadratic when the error is small (below a threshold, usually called delta).

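A sketch of Huber loss with the usual threshold parameter delta (delta=1.0 here is just an illustrative default):

```python
import numpy as np

def huber(y_true, y_pred, delta=1.0):
    """Quadratic for |error| <= delta, linear (absolute) beyond it."""
    err = y_true - y_pred
    small = np.abs(err) <= delta
    squared = 0.5 * err ** 2
    linear = delta * (np.abs(err) - 0.5 * delta)
    return np.mean(np.where(small, squared, linear))

y_true = np.array([1.0, 2.0, 3.0, 10.0])
y_pred = np.array([1.1, 1.9, 3.2,  4.0])
print(huber(y_true, y_pred, delta=1.0))
```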

Log-Cosh Loss:

log(cosh(x)) is approximately equal to (x ** 2) / 2 for small x and to abs(x) - log(2) for large x. This means that 'logcosh' works mostly like the mean squared error, but is not as strongly affected by the occasional wildly incorrect prediction. It has all the advantages of Huber loss and, unlike Huber loss, it is twice differentiable everywhere. Log-cosh is not perfect, though: for very large off-target predictions the gradient and Hessian become constant, which can result in the absence of splits in XGBoost.

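A sketch of log-cosh on the same illustrative data; note that np.cosh can overflow for very large errors, which a production version would guard against:

```python
import numpy as np

def log_cosh(y_true, y_pred):
    """Log-cosh loss: ~ x**2 / 2 for small errors, ~ |x| - log(2) for large ones."""
    err = y_pred - y_true
    return np.mean(np.log(np.cosh(err)))

y_true = np.array([1.0, 2.0, 3.0, 10.0])
y_pred = np.array([1.1, 1.9, 3.2,  4.0])
print(log_cosh(y_true, y_pred))
```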

Mean Bias Error:

It is the average of the signed errors (no absolute value or square is taken), so positive and negative errors can cancel out; it is mainly useful for determining whether the model has an overall positive or negative bias.

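A sketch of mean bias error; the sign convention used here (prediction minus target) is one common choice:

```python
import numpy as np

def mean_bias_error(y_true, y_pred):
    """Average of the signed errors; positive and negative errors can cancel."""
    return np.mean(y_pred - y_true)

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.5, 2.5, 3.5, 4.5])   # consistently over-predicting
print(mean_bias_error(y_true, y_pred))     # > 0  => positive bias
```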

Classification Losses

Cross Entropy Loss:

Cross-entropy loss increases as the predicted probability diverges from the actual label. An important aspect is that cross-entropy heavily penalizes predictions that are confident but wrong.

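A sketch of binary cross-entropy; the clipping constant eps is only there to avoid log(0):

```python
import numpy as np

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Binary cross-entropy between labels in {0, 1} and predicted probabilities."""
    p = np.clip(p_pred, eps, 1.0 - eps)
    return -np.mean(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p))

y_true = np.array([1.0, 0.0, 1.0])
p_pred = np.array([0.9, 0.1, 0.01])   # the last prediction is confident but wrong
print(binary_cross_entropy(y_true, p_pred))  # dominated by the confident mistake
```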

Hinge Loss / Maximum Margin Classification:

In simple terms, the score of the correct category should be greater than the score of each incorrect category by some safety margin (usually one). Hence hinge loss is used for maximum-margin classification, most notably for support vector machines. Although not differentiable everywhere, it is a convex function, which makes it easy to work with the usual convex optimizers used in machine learning.

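A sketch of the multiclass hinge loss for a single example, with the usual margin of one (the function name and scores are illustrative):

```python
import numpy as np

def multiclass_hinge(scores, correct_idx, margin=1.0):
    """Sum of max(0, s_j - s_correct + margin) over the incorrect classes j."""
    correct = scores[correct_idx]
    losses = np.maximum(0.0, scores - correct + margin)
    losses[correct_idx] = 0.0                # the correct class does not count against itself
    return losses.sum()

scores = np.array([2.0, 5.0, -1.0])          # raw class scores for one example
print(multiclass_hinge(scores, correct_idx=1))  # 0: the correct class already wins by the margin
```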

Kullback Leibler Divergence Loss:

A measure of how one probability distribution differs from a second, reference probability distribution.

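A sketch of the (asymmetric) KL divergence between two discrete distributions:

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(P || Q) = sum_i p_i * log(p_i / q_i); not symmetric in p and q."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return np.sum(p * np.log(p / q))

p = np.array([0.1, 0.4, 0.5])   # reference (true) distribution
q = np.array([0.3, 0.3, 0.4])   # approximating distribution
print(kl_divergence(p, q), kl_divergence(q, p))  # the two directions differ
```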

Cosine Similarity Loss:

A loss function for learning embeddings: it penalizes the angle between the predicted and target vectors rather than their magnitudes.

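One common formulation is 1 minus the cosine similarity (some libraries instead return the negative cosine similarity); a sketch:

```python
import numpy as np

def cosine_similarity_loss(y_true, y_pred, eps=1e-12):
    """1 - cosine similarity: 0 when the vectors point the same way, 2 when opposite."""
    num = np.dot(y_true, y_pred)
    den = np.linalg.norm(y_true) * np.linalg.norm(y_pred) + eps
    return 1.0 - num / den

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])   # same direction, different magnitude
print(cosine_similarity_loss(a, b))   # ~0: only the angle matters
```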

Triplet Loss

A loss function often used as a more efficient way to train Siamese networks: each training example is a triplet consisting of an anchor, a positive example (same class as the anchor) and a negative example (different class).


The loss is max(0, d(anchor, positive) - d(anchor, negative) + alpha), so as long as the negative is farther from the anchor than the positive by at least the margin alpha, the loss is zero and there is no gain for the algorithm in condensing the positive and the anchor further.

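A sketch of the triplet loss with Euclidean distances and an illustrative margin alpha:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, alpha=0.2):
    """max(0, d(a, p) - d(a, n) + alpha) with Euclidean distances between embeddings."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + alpha)

anchor   = np.array([0.0, 0.0])
positive = np.array([0.1, 0.0])   # close to the anchor
negative = np.array([1.0, 1.0])   # already farther than d(a, p) + alpha
print(triplet_loss(anchor, positive, negative))  # 0.0: no gain from pulling a and p closer
```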

Contrastive Loss Function

It can also be used for Siamese networks; it operates on pairs of examples labeled as similar or dissimilar.

With D_W = ||G_W(X1) - G_W(X2)|| the distance between the two embeddings, the loss for a pair labeled Y is

L(W, Y, X1, X2) = (1 - Y) * (1/2) * D_W^2 + Y * (1/2) * max(0, m - D_W)^2

where m > 0 is a margin. The margin defines a radius around G_W(X): dissimilar pairs contribute to the loss only when their distance falls within this radius.
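A sketch of this pairwise loss, following the convention Y = 0 for similar pairs and Y = 1 for dissimilar pairs (some implementations flip the labels):

```python
import numpy as np

def contrastive_loss(x1, x2, y, m=1.0):
    """y = 0 for a similar pair, y = 1 for a dissimilar pair; m is the margin."""
    d = np.linalg.norm(x1 - x2)   # distance between the two embeddings
    return (1 - y) * 0.5 * d ** 2 + y * 0.5 * max(0.0, m - d) ** 2

a = np.array([0.0, 0.0])
b = np.array([0.3, 0.4])   # distance 0.5
print(contrastive_loss(a, b, y=0))  # similar pair: penalized for being apart
print(contrastive_loss(a, b, y=1))  # dissimilar pair: penalized for being inside the margin
```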

Minimax GAN Loss

Minimax refers to an optimization strategy in two-player turn-based games for minimizing the loss or cost for the worst case of the other player.

discriminator: maximize log D(x) + log(1 - D(G(z)))

generator: minimize log(1 - D(G(z)))

In other words, D and G play the following two-player minimax game with value function V(G, D):

min_G max_D V(D, G) = E_{x ~ p_data(x)}[log D(x)] + E_{z ~ p_z(z)}[log(1 - D(G(z)))]

In practice, this loss function for the generator saturates. This means that if it cannot learn as quickly as the discriminator, the discriminator wins, the game ends, and the model cannot be trained effectively.
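A sketch of the two objectives given the discriminator's outputs on real and generated samples; it also shows the common non-saturating generator loss (maximize log D(G(z))) that is typically used in practice to avoid this saturation problem:

```python
import numpy as np

def gan_losses(d_real, d_fake, eps=1e-12):
    """Minimax GAN objectives given discriminator outputs in (0, 1)."""
    d_real = np.clip(d_real, eps, 1.0 - eps)
    d_fake = np.clip(d_fake, eps, 1.0 - eps)
    disc_objective = np.mean(np.log(d_real) + np.log(1.0 - d_fake))  # D maximizes this
    gen_loss_minimax = np.mean(np.log(1.0 - d_fake))                 # G minimizes this (saturates)
    gen_loss_nonsat = -np.mean(np.log(d_fake))                       # non-saturating alternative
    return disc_objective, gen_loss_minimax, gen_loss_nonsat

d_real = np.array([0.9, 0.8])    # D's outputs on real samples x
d_fake = np.array([0.05, 0.1])   # D's outputs on generated samples G(z)
print(gan_losses(d_real, d_fake))
```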
