ROC Curves

Receiver Operating Characteristic (ROC) curves are useful for measuring the effectiveness of a binary classifier.

To calculate a ROC curve, you first need to prepare a dataset and pass it through the classifier.

You will have these 2 vectors:

  1. a truth vector with values of 0 (negative-class) and 1 (positive-class).
  2. a score vector, where smaller or more negative values point to the negative-class and larger or more positive values point to the positive-class.

Note that the score vector contains values that have not yet been passed through a decision threshold.
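Where do such scores come from? A minimal sketch, assuming a scikit-learn LinearSVC on hypothetical synthetic data; its decision_function returns signed scores that have not been passed through a decision threshold:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# hypothetical synthetic dataset: labels are 0 (negative-class) and 1 (positive-class)
X, y = make_classification(n_samples=75, random_state=0)

clf = LinearSVC().fit(X, y)

truths = y                         # the truth vector
scores = clf.decision_function(X)  # signed scores, no threshold applied yet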

Now we can use sklearn.metrics:

import numpy as np
from sklearn.metrics import roc_curve, auc

truths = np.array([
0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0
])

scores = np.array([
-0.76301132, -0.20224493, 0.11801481, -0.90780855, -0.01116192, -0.6048727 , 0.02283491, -0.61076876, -0.37572754, -0.47017411, -0.42224234, -0.3355867 , -0.22723929, -0.07856729, -0.53383361, 0.12163662, -0.71356947, -0.55111511, 0.37991331, -0.11107635, -0.70713712, -0.02392675, -0.25045747, 0.12675547, -0.68210402, -0.08001795, -0.03259341, -0.04953425, -0.12974835, -0.19299299, -0.3619768 , -0.22818639, -0.06196433, -0.52455061, -0.40026409, -0.35056585, -0.05770139, -1.11907501, 0.19599366, -0.04299172, -0.48108269, 0.1741885 , -0.41416456, -0.01053513, 0.01645355, -0.11932181, -0.70817199, -0.77303401, -0.61489613, -0.96334774, -0.31037723, -0.31952657, -0.35306417, 0.12127427, -0.6643231 , -0.55149778, -0.55695146, -0.41111447, -0.49463336, 0.06910059, -0.23036784, 0.30342285, 0.17642852, -0.1906155 , -0.42910413, -0.67759563, -0.32958811, -0.97119543, 0.02088168, -0.08177305, -0.41466962, -0.30436228, 0.18869727, 0.24966175, -0.39980476
])

(fprs, tprs, thresholds) = roc_curve(truths, scores)

print(truths.shape)      # (75,)
print(scores.shape)      # (75,)
print(fprs.shape)        # (22,)
print(tprs.shape)        # (22,)
print(thresholds.shape)  # (22,)
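The auc function imported above summarizes the whole curve as a single number, the area under the ROC curve; 0.5 is chance level and 1.0 is a perfect classifier:

roc_auc = auc(fprs, tprs)

print(roc_auc)  # a scalar between 0.0 and 1.0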

The fprs is a vector of false positive rates, one per threshold.

A false positive rate (FPR) is the percentage of negative-class elements being predicted as positive-class.

A false positive rate of 0.6 means 60% of negative-class elements are being predicted as positive-class.

You want a small false positive rate: ideally 0.0, meaning 0% of negative-class elements are predicted as positive-class.

The tprs is a vector of true positive rates, one per threshold.

A true positive rate (TPR) is the percentage of positive-class elements being predicted as positive-class.

A true positive rate of 0.6 means 60% of positive-class elements are being predicted as positive-class.

You want a large true positive rate: ideally 1.0, meaning 100% of positive-class elements are predicted as positive-class.
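To make these definitions concrete, here is a sketch that recomputes the TPR and FPR by hand at one of the returned thresholds (the index 5 is arbitrary) and checks them against the output of roc_curve:

t = thresholds[5]    # an arbitrary threshold from the output
preds = scores >= t  # positive-class predictions at this threshold

tp = np.sum(preds & (truths == 1))  # true positives
fp = np.sum(preds & (truths == 0))  # false positives
p = np.sum(truths == 1)             # all positive-class elements
n = np.sum(truths == 0)             # all negative-class elements

print(tp / p, tprs[5])  # both TPR values match
print(fp / n, fprs[5])  # both FPR values match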

However, practically speaking, there is a tradeoff between the false positive rate and the true positive rate.

As you make the false positive rate smaller, you end up making the true positive rate smaller.

And as you make the true positive rate larger, you end up making the false positive rate larger.

Therefore there is an optimal point where the true positive rate is as large as possible while the false positive rate remains as small as possible.

These points are determined by the thresholds.

The roc_curve function will give back a vector of thresholds.

The thresholds contain values taken from scores; each one determines a point trading off TPR against FPR.

This is done by considering each score as a potential threshold.

Any score greater than or equal to that threshold is then counted as a positive-class prediction.
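One common heuristic for picking the optimal tradeoff point mentioned above is Youden's J statistic, which selects the threshold that maximizes TPR minus FPR. Note this heuristic is an addition here, not something roc_curve computes for you:

j = tprs - fprs      # Youden's J statistic at every threshold
best = np.argmax(j)  # index of the best TPR/FPR tradeoff

print(thresholds[best])        # the decision threshold to use
print(tprs[best], fprs[best])  # the TPR and FPR at that threshold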

There are 2 extra parameters of roc_curve that are useful.

(fprs, tprs, thresholds) = roc_curve(truths, scores, pos_label=1, drop_intermediate=True)

The pos_label tells roc_curve which label is considered the positive-class. By default it will be set to 1.

The drop_intermediate tells roc_curve to remove thresholds that don't change the FPR or TPR.

If you set drop_intermediate to False, you will notice that the length of thresholds equals the number of distinct scores plus 1.

The extra threshold is the first one, thresholds[0], and its value comes from max(scores) + 1 (newer versions of scikit-learn use np.inf instead).

This means no score is greater than or equal to thresholds[0]; it is the "nullary" threshold, set so that no elements are predicted to be the positive-class.
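A quick check of this, continuing the example above. The exact value of thresholds[0] depends on your scikit-learn version, as noted:

(fprs_all, tprs_all, thresholds_all) = roc_curve(truths, scores, drop_intermediate=False)

print(len(np.unique(scores)))  # the number of distinct scores
print(len(thresholds_all))     # distinct scores plus 1
print(thresholds_all[0])       # max(scores) + 1, or np.inf on newer scikit-learn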


Just a note: if the scores are cosine distances, scores approaching 0 represent the positive-class, while larger scores (approaching 1, or the maximum of 2) represent the negative-class.

To use these scores, you have to flip them on the number line by multiplying by -1, so that smaller distances become larger scores.

Then change the pos_label to be 0.

That way the thresholds make sense.

To be able to use the thresholds on the original cosine distances, you have to multiply them by -1 again.

You can then read each threshold as "any number below or equal to the threshold" is considered positive-class.

The TPR and FPR calculated would then also make sense.
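A minimal sketch of this round trip, with hypothetical labels (0 as the positive-class) and hypothetical cosine distances:

# hypothetical: 0 is the positive-class
truths_cos = np.array([1, 1, 1, 1, 0, 0, 0, 0])

# hypothetical cosine distances: a small distance means positive-class
dists = np.array([0.9, 0.6, 0.2, 0.5, 0.1, 0.4, 0.2, 0.3])

(fprs_c, tprs_c, thresholds_c) = roc_curve(truths_cos, -1 * dists, pos_label=0)

t = thresholds_c[2] * -1  # flip a threshold back into distance space
preds = dists <= t        # True means predicted positive-class (label 0)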

Example using cosine distances:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc
import contextlib
from typing import Iterator

# here the "positive" class is 0
truths = np.array([1, 1, 1, 1, 0, 0, 0, 0])

print('Truths', truths)

# approaching 0s -> positive class
# approaching 1s -> negative class
scores = np.array([0.9, 0.6, 0.2, 0.5, 0.1, 0.4, 0.2, 0.3])

print('Scores', scores)

scores = -1 * scores

print('Flipped Scores', scores)

@contextlib.contextmanager
def plt_figure(*args, **kwargs) -> Iterator:
    # create a figure, and make sure it gets closed even if an exception is raised
    fig = plt.figure(*args, **kwargs)
    try:
        yield fig
    finally:
        plt.close(fig)

(fprs, tprs, thresholds) = roc_curve(truths, scores, pos_label=0)
roc_auc = auc(fprs, tprs)

print('Thresholds', thresholds)

print('Flipped Thresholds', thresholds * -1)

with plt_figure() as fig:

    ax = fig.add_subplot(1, 1, 1)

    ax.plot(fprs, tprs, 'ro-', lw=2, label=f'ROC curve (area = {roc_auc:.2f})')


    ax.plot([0, 1], [0, 1], color='navy', linestyle='--')

    for (fpr, tpr, threshold) in zip(fprs[1:-1], tprs[1:-1], thresholds[1:-1]):

        # by multiplying the threshold by -1
        # we get back a threshold for "positive-class" cosine distances
        # meaning any distance that is less than or equal to this threshold
        # gives this tradeoff of FPR and TPR
        ax.annotate(
            threshold * -1,
            (fpr, tpr),
            textcoords="offset points",
            xytext=(-15,5),
            ha='center'
        )

    ax.set_xlim([-0.05, 1.05])
    ax.set_ylim([-0.05, 1.05])

    ax.set_xlabel('False Positive Rate')
    ax.set_ylabel('True Positive Rate')
    ax.set_title('ROC')
    ax.legend(loc='lower right')

    plt.show()
