Let's get some useful thresholds for the models. Generally, these thresholds are going to look a lot worse than they really are -- mostly because the labels we used for training are messy and incomplete. We're targeting at least 70% precision, but we're likely to get that when we ask for 50% precision -- and in some cases, we'll still get it when we target even lower precision.
So! We're going to use ORES's "threshold optimization" querying system. We'll need to make a call for each topic in order to get an appropriate threshold:
- Culture.Biography.Biography* [maximum recall @ precision >= 0.5]
{
"!f1": 0.925,
"!precision": 0.996,
"!recall": 0.863,
"accuracy": 0.877,
"f1": 0.662,
"filter_rate": 0.759,
"fpr": 0.137,
"match_rate": 0.241,
"precision": 0.5,
"recall": 0.977,
"threshold": 0.086
}
- Culture.Biography.Women [maximum recall @ precision >= 0.5]
{
"!f1": 0.993,
"!precision": 0.995,
"!recall": 0.99,
"accuracy": 0.985,
"f1": 0.572,
"filter_rate": 0.981,
"fpr": 0.01,
"match_rate": 0.019,
"precision": 0.501,
"recall": 0.668,
"threshold": 0.667
}
- Culture.Media.Entertainment [maximum recall @ precision >= 0.5]
{
"!f1": 0.998,
"!precision": 0.998,
"!recall": 0.998,
"accuracy": 0.996,
"f1": 0.47,
"filter_rate": 0.997,
"fpr": 0.002,
"match_rate": 0.003,
"precision": 0.503,
"recall": 0.442,
"threshold": 0.646
}
- STEM.Mathematics [maximum recall @ precision >= 0.3]
{
"!f1": 1.0,
"!precision": 1.0,
"!recall": 0.999,
"accuracy": 0.999,
"f1": 0.401,
"filter_rate": 0.999,
"fpr": 0.001,
"match_rate": 0.001,
"precision": 0.309,
"recall": 0.571,
"threshold": 0.903
}
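For reference, the per-topic calls described above could be assembled roughly like this. This is only a sketch: the endpoint URL, the `drafttopic` model name, and the exact `model_info` path syntax are assumptions on my part (none of them are spelled out in this thread), and `threshold_query_url` is a made-up helper name. It only builds the URLs and does not contact the service.

```python
from urllib.parse import urlencode

# ASSUMPTION: endpoint and model name are not stated in the thread.
ORES_SCORES_URL = "https://ores.wikimedia.org/v3/scores/enwiki/"
MODEL = "drafttopic"

# Topic -> minimum precision requested, as listed above.
TARGETS = {
    "Culture.Biography.Biography*": 0.5,
    "Culture.Biography.Women": 0.5,
    "Culture.Media.Entertainment": 0.5,
    "STEM.Mathematics": 0.3,
}


def threshold_query_url(topic, min_precision):
    """Build a threshold-optimization query URL for one topic (hypothetical helper)."""
    optimization = f"maximum recall @ precision >= {min_precision}"
    # ASSUMPTION: the label and the optimization string are both quoted
    # inside the model_info path because they contain dots and spaces.
    model_info = f'statistics.thresholds."{topic}"."{optimization}"'
    query = urlencode({"models": MODEL, "model_info": model_info})
    return f"{ORES_SCORES_URL}?{query}"


# One URL per topic, since each topic needs its own call.
urls = {topic: threshold_query_url(topic, p) for topic, p in TARGETS.items()}

for topic, url in urls.items():
    print(topic, "->", url)
```

Each URL would then be fetched separately, and the JSON that comes back should have the same shape as the statistics blocks quoted above.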
Here, we can see some diversity. Culture.Biography.Biography* is easy to model and it's very common in the labeled data, so we can get very high precision and very high recall and a strict threshold. STEM.Mathematics is on the other end of the spectrum. There are very few math-related articles at all. I've relaxed the minimum precision to 0.3 in order to get a threshold.
I think you mean precision instead of recall here?
Also, are these threshold values inverted? You say the 0.086 for biography is good and the 0.903 for math is bad. Does that mean that to get articles that have a 30% chance of being about math, I should look for scores higher than 1 - 0.903 = 0.097?
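To make the question concrete, here is a small sketch of the non-inverted reading, where a topic counts as a match when its predicted probability is at or above the reported threshold. That rule is an assumption for illustration, not something confirmed in this thread. The thresholds are copied from the statistics above; the example probabilities and the `matched_topics` helper are made up.

```python
# Thresholds copied from the statistics quoted earlier in the thread.
THRESHOLDS = {
    "Culture.Biography.Biography*": 0.086,
    "Culture.Biography.Women": 0.667,
    "Culture.Media.Entertainment": 0.646,
    "STEM.Mathematics": 0.903,
}


def matched_topics(probabilities, thresholds=THRESHOLDS):
    """Return the topics whose probability meets its threshold.

    ASSUMPTION: a match means probability >= threshold, i.e. the
    threshold is applied directly and not as 1 - threshold.
    """
    return sorted(
        topic
        for topic, probability in probabilities.items()
        if topic in thresholds and probability >= thresholds[topic]
    )


# Hypothetical probabilities for one article (made-up numbers).
example = {
    "Culture.Biography.Biography*": 0.12,
    "STEM.Mathematics": 0.45,
}

print(matched_topics(example))
```

Under this reading, the made-up article above matches the biography topic (0.12 >= 0.086) but not math (0.45 < 0.903). Under the inverted reading from the question, 0.45 would clear the 0.097 cutoff for math, so the two interpretations give different answers for the same scores.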