Receiver Operating Characteristic (ROC) curves are useful for measuring the effectiveness of a classifier.
To calculate a ROC curve, you first need to prepare a dataset and pass it through the classifier. You will end up with these 2 vectors:
- a truth vector with values of `0` (negative-class) and `1` (positive-class)
- a score vector where "smaller" or "negative" values point to the negative-class and "larger" or "positive" values point to the positive-class
Note that the score vector contains values that have not yet been passed through a decision threshold.
Now we can use `sklearn.metrics`:
```python
import numpy as np
from sklearn.metrics import roc_curve, auc

truths = np.array([
    0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0,
    1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0
])
scores = np.array([
    -0.76301132, -0.20224493,  0.11801481, -0.90780855, -0.01116192,
    -0.6048727 ,  0.02283491, -0.61076876, -0.37572754, -0.47017411,
    -0.42224234, -0.3355867 , -0.22723929, -0.07856729, -0.53383361,
     0.12163662, -0.71356947, -0.55111511,  0.37991331, -0.11107635,
    -0.70713712, -0.02392675, -0.25045747,  0.12675547, -0.68210402,
    -0.08001795, -0.03259341, -0.04953425, -0.12974835, -0.19299299,
    -0.3619768 , -0.22818639, -0.06196433, -0.52455061, -0.40026409,
    -0.35056585, -0.05770139, -1.11907501,  0.19599366, -0.04299172,
    -0.48108269,  0.1741885 , -0.41416456, -0.01053513,  0.01645355,
    -0.11932181, -0.70817199, -0.77303401, -0.61489613, -0.96334774,
    -0.31037723, -0.31952657, -0.35306417,  0.12127427, -0.6643231 ,
    -0.55149778, -0.55695146, -0.41111447, -0.49463336,  0.06910059,
    -0.23036784,  0.30342285,  0.17642852, -0.1906155 , -0.42910413,
    -0.67759563, -0.32958811, -0.97119543,  0.02088168, -0.08177305,
    -0.41466962, -0.30436228,  0.18869727,  0.24966175, -0.39980476
])

(fprs, tprs, thresholds) = roc_curve(truths, scores)

print(truths.shape)      # (75,)
print(scores.shape)      # (75,)
print(fprs.shape)        # (22,)
print(tprs.shape)        # (22,)
print(thresholds.shape)  # (22,)
```
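The `auc` import above can summarize the whole curve as a single number: the area under the ROC curve. As a minimal sketch with a small made-up dataset (the labels and scores below are hypothetical):

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

# Hypothetical toy data: two negatives, two positives, perfectly separated.
truths = np.array([0, 0, 1, 1])
scores = np.array([-0.5, -0.2, 0.1, 0.4])

fprs, tprs, _ = roc_curve(truths, scores)
area = auc(fprs, tprs)  # area under the ROC curve, between 0.0 and 1.0
print(area)  # 1.0 here, because every positive outscores every negative
```

An area of 1.0 means a perfect classifier; 0.5 corresponds to random guessing.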
The `fprs` is a vector of false positive rates.
A false positive rate (FPR) is the percentage of negative-class elements being predicted as positive-class.
A false positive rate of 0.6 means 60% of negative-class elements are being predicted as positive-class.
You want a small false positive rate, ideally 0.0 (0%), where no negative-class elements are predicted as positive-class.
The `tprs` is a vector of true positive rates.
A true positive rate (TPR) is the percentage of positive-class elements being predicted as positive-class.
A true positive rate of 0.6 means 60% of positive-class elements are being predicted as positive-class.
You want a large true positive rate, ideally 1.0 (100%), where all positive-class elements are predicted as positive-class.
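To make the two rates concrete, here is a small sketch computing them directly from hypothetical thresholded predictions (the labels and predictions below are made up):

```python
import numpy as np

# Hypothetical labels and thresholded predictions.
truths = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
preds  = np.array([0, 0, 0, 1, 1, 1, 1, 1, 0, 0])

# FPR: fraction of negative-class elements predicted as positive-class.
fpr = np.sum((truths == 0) & (preds == 1)) / np.sum(truths == 0)
# TPR: fraction of positive-class elements predicted as positive-class.
tpr = np.sum((truths == 1) & (preds == 1)) / np.sum(truths == 1)

print(fpr)  # 0.4 -> 2 of the 5 negatives were predicted positive
print(tpr)  # 0.6 -> 3 of the 5 positives were predicted positive
```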
Practically speaking, however, there is a tradeoff between the false positive rate and the true positive rate: as you push the false positive rate down, the true positive rate tends to drop with it, and as you push the true positive rate up, the false positive rate rises as well. There is therefore an optimal point where the true positive rate is large while the false positive rate stays small.
These points are determined by the `thresholds`. The `roc_curve` function gives back a vector of thresholds, containing values from `scores` that determine the points trading off TPR against FPR. It works by considering each score as a potential threshold: every score greater than or equal to that threshold counts as a positive-class prediction.
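This rule can be checked by hand: for each threshold returned by `roc_curve`, predicting positive whenever a score is at or above the threshold should reproduce the reported FPR and TPR. A sketch with small made-up data:

```python
import numpy as np
from sklearn.metrics import roc_curve

# Hypothetical labels and raw classifier scores.
truths = np.array([0, 0, 1, 1, 0, 1])
scores = np.array([-0.8, -0.3, 0.2, 0.5, -0.1, 0.05])

fprs, tprs, thresholds = roc_curve(truths, scores)

for fpr, tpr, t in zip(fprs, tprs, thresholds):
    preds = scores >= t  # each score at or above the threshold is a positive prediction
    manual_fpr = np.sum(preds & (truths == 0)) / np.sum(truths == 0)
    manual_tpr = np.sum(preds & (truths == 1)) / np.sum(truths == 1)
    assert np.isclose(manual_fpr, fpr) and np.isclose(manual_tpr, tpr)
```

The assertions pass for every threshold, including the first one, which no score can reach, so it yields the (0, 0) point of the curve.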
There are 2 extra parameters of `roc_curve` that are useful:

```python
(fprs, tprs, thresholds) = roc_curve(truths, scores, pos_label=1, drop_intermediate=True)
```
The `pos_label` parameter tells `roc_curve` which label is considered the positive-class. By default it is set to `1`.
The `drop_intermediate` parameter tells `roc_curve` to remove thresholds that don't change the FPR or TPR. If you set `drop_intermediate` to `False`, you will notice that the length of `thresholds` equals the number of distinct scores plus one. The extra threshold is the first one, `thresholds[0]`, and its value comes from `max(scores) + 1` (recent versions of scikit-learn use `np.inf` instead). No score can be above or equal to `thresholds[0]`; it is the "nullary" threshold, set so that no elements are predicted to be the positive-class.
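As a quick sketch of this behavior (with made-up labels and scores), the full threshold vector has one entry per distinct score plus the extra leading threshold, which exceeds every score:

```python
import numpy as np
from sklearn.metrics import roc_curve

# Hypothetical data; all six scores are distinct.
truths = np.array([0, 1, 0, 1, 1, 0])
scores = np.array([-0.4, 0.3, -0.1, 0.6, 0.1, -0.7])

fprs, tprs, thresholds = roc_curve(truths, scores, drop_intermediate=False)

print(len(thresholds) == len(np.unique(scores)) + 1)  # True: one threshold per score, plus one extra
print(thresholds[0] > scores.max())                   # True: no score reaches the first threshold
```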
Just a note for when the scores are cosine distances: distances approaching 0 represent the positive-class, while distances approaching 1 or 2 represent the negative-class. To use these scores, you have to flip them on the number line by multiplying by `-1`, so that larger (less negative) values once again point to the positive-class; `pos_label` can stay at its default of `1`. To apply the resulting thresholds back to the original cosine distances, multiply them by `-1` again. You can then use the thresholds as "any distance below or equal to the threshold" being considered positive. The TPR and FPR calculated this way will also make sense.
Example using cosine distances:
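Here is a minimal sketch of that workflow, using made-up cosine distances where the positive-class sits near 0:

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

# Hypothetical cosine distances: positives are close (near 0), negatives are far.
truths    = np.array([1, 1, 0, 0, 1, 0])
distances = np.array([0.10, 0.25, 0.90, 1.40, 0.05, 1.10])

# Flip onto the number line so larger values point to the positive-class.
scores = -distances

fprs, tprs, thresholds = roc_curve(truths, scores)  # pos_label defaults to 1
print(auc(fprs, tprs))  # 1.0 here, since the toy classes are perfectly separated

# Flip the thresholds back so they apply to the original distances:
# a distance at or below the flipped threshold is a positive prediction.
distance_thresholds = -thresholds
for t, dt in zip(thresholds, distance_thresholds):
    assert np.array_equal(scores >= t, distances <= dt)
```

The loop confirms that thresholding the negated scores from above and thresholding the raw distances from below produce identical predictions.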