Receiver Operating Characteristic (ROC) curves are useful for measuring the effectiveness of a classifier.
To calculate a ROC curve, you first need to prepare a dataset and pass it through the classifier. You will end up with these 2 vectors:
- a truth vector with values of `0` (negative-class) and `1` (positive-class)
- a score vector where "smaller" or "negative" values point to the negative-class and "larger" or "positive" values point to the positive-class
Note that the score vector contains values that have not yet been passed through a decision threshold.
Now we can use `sklearn.metrics`:
```python
import numpy as np
from sklearn.metrics import roc_curve, auc

truths = np.array([
    0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0,
    1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0
])
scores = np.array([
    -0.76301132, -0.20224493,  0.11801481, -0.90780855, -0.01116192,
    -0.6048727 ,  0.02283491, -0.61076876, -0.37572754, -0.47017411,
    -0.42224234, -0.3355867 , -0.22723929, -0.07856729, -0.53383361,
     0.12163662, -0.71356947, -0.55111511,  0.37991331, -0.11107635,
    -0.70713712, -0.02392675, -0.25045747,  0.12675547, -0.68210402,
    -0.08001795, -0.03259341, -0.04953425, -0.12974835, -0.19299299,
    -0.3619768 , -0.22818639, -0.06196433, -0.52455061, -0.40026409,
    -0.35056585, -0.05770139, -1.11907501,  0.19599366, -0.04299172,
    -0.48108269,  0.1741885 , -0.41416456, -0.01053513,  0.01645355,
    -0.11932181, -0.70817199, -0.77303401, -0.61489613, -0.96334774,
    -0.31037723, -0.31952657, -0.35306417,  0.12127427, -0.6643231 ,
    -0.55149778, -0.55695146, -0.41111447, -0.49463336,  0.06910059,
    -0.23036784,  0.30342285,  0.17642852, -0.1906155 , -0.42910413,
    -0.67759563, -0.32958811, -0.97119543,  0.02088168, -0.08177305,
    -0.41466962, -0.30436228,  0.18869727,  0.24966175, -0.39980476
])

(fprs, tprs, thresholds) = roc_curve(truths, scores)

print(truths.shape)      # (75,)
print(scores.shape)      # (75,)
print(fprs.shape)        # (22,)
print(tprs.shape)        # (22,)
print(thresholds.shape)  # (22,)
```
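The `auc` import above can summarize the whole curve as a single number: the area under the ROC curve. As a minimal sketch with a small made-up dataset (the labels and scores below are hypothetical):

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

# Hypothetical toy data: two negatives, two positives, perfectly separated.
truths = np.array([0, 0, 1, 1])
scores = np.array([-0.5, -0.2, 0.1, 0.4])

fprs, tprs, _ = roc_curve(truths, scores)
area = auc(fprs, tprs)  # area under the ROC curve, between 0.0 and 1.0
print(area)  # 1.0 here, because every positive outscores every negative
```

An area of 1.0 means a perfect classifier; 0.5 corresponds to random guessing.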
The `fprs` is a vector of false positive rates.
A false positive rate (FPR) is the percentage of negative-class elements being predicted as positive-class.
A false positive rate of 0.6 means 60% of negative-class elements are being predicted as positive-class.
You want a small false positive rate, ideally 0.0 (0%), where no negative-class elements are predicted as positive-class.
The `tprs` is a vector of true positive rates.
A true positive rate (TPR) is the percentage of positive-class elements being predicted as positive-class.
A true positive rate of 0.6 means 60% of positive-class elements are being predicted as positive-class.
You want a large true positive rate, ideally 1.0 (100%), where all positive-class elements are predicted as positive-class.
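To make the two rates concrete, here is a small sketch computing them directly from hypothetical thresholded predictions (the labels and predictions below are made up):

```python
import numpy as np

# Hypothetical labels and thresholded predictions.
truths = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
preds  = np.array([0, 0, 0, 1, 1, 1, 1, 1, 0, 0])

# FPR: fraction of negative-class elements predicted as positive-class.
fpr = np.sum((truths == 0) & (preds == 1)) / np.sum(truths == 0)
# TPR: fraction of positive-class elements predicted as positive-class.
tpr = np.sum((truths == 1) & (preds == 1)) / np.sum(truths == 1)

print(fpr)  # 0.4 -> 2 of the 5 negatives were predicted positive
print(tpr)  # 0.6 -> 3 of the 5 positives were predicted positive
```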
Practically speaking, however, there is a tradeoff between the false positive rate and the true positive rate: as you push the false positive rate down, the true positive rate tends to drop with it, and as you push the true positive rate up, the false positive rate rises as well. There is therefore an optimal point where the true positive rate is large while the false positive rate stays small.
These points are determined by the `thresholds`. The `roc_curve` function gives back a vector of thresholds, containing values from `scores` that determine the points trading off TPR against FPR. It works by considering each score as a potential threshold: every score greater than or equal to that threshold counts as a positive-class prediction.
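This rule can be checked by hand: for each threshold returned by `roc_curve`, predicting positive whenever a score is at or above the threshold should reproduce the reported FPR and TPR. A sketch with small made-up data:

```python
import numpy as np
from sklearn.metrics import roc_curve

# Hypothetical labels and raw classifier scores.
truths = np.array([0, 0, 1, 1, 0, 1])
scores = np.array([-0.8, -0.3, 0.2, 0.5, -0.1, 0.05])

fprs, tprs, thresholds = roc_curve(truths, scores)

for fpr, tpr, t in zip(fprs, tprs, thresholds):
    preds = scores >= t  # each score at or above the threshold is a positive prediction
    manual_fpr = np.sum(preds & (truths == 0)) / np.sum(truths == 0)
    manual_tpr = np.sum(preds & (truths == 1)) / np.sum(truths == 1)
    assert np.isclose(manual_fpr, fpr) and np.isclose(manual_tpr, tpr)
```

The assertions pass for every threshold, including the first one, which no score can reach, so it yields the (0, 0) point of the curve.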
There are 2 extra parameters of `roc_curve` that are useful:

```python
(fprs, tprs, thresholds) = roc_curve(truths, scores, pos_label=1, drop_intermediate=True)
```
The `pos_label` parameter tells `roc_curve` which label is considered the positive-class. By default it is set to `1`.
The `drop_intermediate` parameter tells `roc_curve` to remove thresholds that don't change the FPR or TPR. If you set `drop_intermediate` to `False`, you will notice that the length of `thresholds` equals the number of distinct scores plus one. The extra threshold is the first one, `thresholds[0]`, and its value comes from `max(scores) + 1` (recent versions of scikit-learn use `np.inf` instead). No score can be above or equal to `thresholds[0]`; it is the "nullary" threshold, set so that no elements are predicted to be the positive-class.
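As a quick sketch of this behavior (with made-up labels and scores), the full threshold vector has one entry per distinct score plus the extra leading threshold, which exceeds every score:

```python
import numpy as np
from sklearn.metrics import roc_curve

# Hypothetical data; all six scores are distinct.
truths = np.array([0, 1, 0, 1, 1, 0])
scores = np.array([-0.4, 0.3, -0.1, 0.6, 0.1, -0.7])

fprs, tprs, thresholds = roc_curve(truths, scores, drop_intermediate=False)

print(len(thresholds) == len(np.unique(scores)) + 1)  # True: one threshold per score, plus one extra
print(thresholds[0] > scores.max())                   # True: no score reaches the first threshold
```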
Just a note for when the scores are cosine distances: distances approaching 0 represent the positive-class, while distances approaching 1 or 2 represent the negative-class. To use these scores, you have to flip them on the number line by multiplying by `-1`, so that larger (less negative) values once again point to the positive-class; `pos_label` can stay at its default of `1`. To apply the resulting thresholds back to the original cosine distances, multiply them by `-1` again. You can then use the thresholds as "any distance below or equal to the threshold" being considered positive. The TPR and FPR calculated this way will also make sense.
Example using cosine distances:
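Here is a minimal sketch of that workflow, using made-up cosine distances where the positive-class sits near 0:

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

# Hypothetical cosine distances: positives are close (near 0), negatives are far.
truths    = np.array([1, 1, 0, 0, 1, 0])
distances = np.array([0.10, 0.25, 0.90, 1.40, 0.05, 1.10])

# Flip onto the number line so larger values point to the positive-class.
scores = -distances

fprs, tprs, thresholds = roc_curve(truths, scores)  # pos_label defaults to 1
print(auc(fprs, tprs))  # 1.0 here, since the toy classes are perfectly separated

# Flip the thresholds back so they apply to the original distances:
# a distance at or below the flipped threshold is a positive prediction.
distance_thresholds = -thresholds
for t, dt in zip(thresholds, distance_thresholds):
    assert np.array_equal(scores >= t, distances <= dt)
```

The loop confirms that thresholding the negated scores from above and thresholding the raw distances from below produce identical predictions.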