@halfak
Last active February 14, 2020 12:33
ORES Thresholds
$ python get_thresholds.py arwiki
-------------------------------------------  --------  ---------  ---------  ------
label                                        pop rate  threshold  precision  recall
Culture.Biography.Biography*                 0.123     0.338      0.7        0.975
Culture.Biography.Women                      0.015     0.617      0.5        0.661
Culture.Food and drink                       0.002     0.792      0.7        0.61
Culture.Internet culture                     0.004     0.818      0.7        0.702
Culture.Linguistics                          0.007     0.251      0.7        0.739
Culture.Literature                           0.016     0.707      0.7        0.636
Culture.Media.Books                          0.004     0.583      0.7        0.727
Culture.Media.Entertainment                  0.004     0.218      0.15       0.675
Culture.Media.Films                          0.011     0.207      0.7        0.847
Culture.Media.Media*                         0.059     0.635      0.7        0.768
Culture.Media.Music                          0.024     0.268      0.7        0.825
Culture.Media.Radio                          0.002     0.311      0.3        0.618
Culture.Media.Software                       0.001     0.797      0.3        0.598
Culture.Media.Television                     0.009     0.621      0.7        0.539
Culture.Media.Video games                    0.003     0.361      0.7        0.893
Culture.Performing arts                      0.003     0.375      0.3        0.577
Culture.Philosophy and religion              0.011     0.453      0.5        0.52
Culture.Sports                               0.071     0.064      0.7        0.96
Culture.Visual arts.Architecture             0.011     0.454      0.7        0.688
Culture.Visual arts.Comics and Anime         0.002     0.839      0.7        0.692
Culture.Visual arts.Fashion                  0.001     0.489      0.3        0.736
Culture.Visual arts.Visual arts*             0.018     0.6        0.7        0.67
Geography.Geographical                       0.024     0.409      0.7        0.753
Geography.Regions.Africa.Africa*             0.008     0.905      0.7        0.616
Geography.Regions.Africa.Central Africa      0.0       0.9        < 0.15
Geography.Regions.Africa.Eastern Africa      0.0       0.305      0.3        0.814
Geography.Regions.Africa.Northern Africa     0.001     0.834      0.3        0.592
Geography.Regions.Africa.Southern Africa     0.001     0.603      0.5        0.786
Geography.Regions.Africa.Western Africa      0.001     0.519      0.5        0.782
Geography.Regions.Americas.Central America   0.003     0.707      0.7        0.553
Geography.Regions.Americas.North America     0.064     0.381      0.5        0.765
Geography.Regions.Americas.South America     0.006     0.468      0.7        0.764
Geography.Regions.Asia.Asia*                 0.046     0.532      0.7        0.821
Geography.Regions.Asia.Central Asia          0.001     0.924      0.5        0.569
Geography.Regions.Asia.East Asia             0.011     0.507      0.7        0.732
Geography.Regions.Asia.North Asia            0.001     0.457      0.15       0.72
Geography.Regions.Asia.South Asia            0.015     0.092      0.7        0.886
Geography.Regions.Asia.Southeast Asia        0.006     0.257      0.7        0.798
Geography.Regions.Asia.West Asia             0.011     0.651      0.7        0.765
Geography.Regions.Europe.Eastern Europe      0.013     0.542      0.7        0.72
Geography.Regions.Europe.Europe*             0.076     0.639      0.7        0.708
Geography.Regions.Europe.Northern Europe     0.031     0.594      0.7        0.599
Geography.Regions.Europe.Southern Europe     0.013     0.653      0.7        0.669
Geography.Regions.Europe.Western Europe      0.019     0.666      0.7        0.652
Geography.Regions.Oceania                    0.015     0.072      0.7        0.918
History and Society.Business and economics   0.01      0.286      0.3        0.674
History and Society.Education                0.007     0.246      0.3        0.575
History and Society.History                  0.011     0.462      0.3        0.504
History and Society.Military and warfare     0.014     0.748      0.7        0.541
History and Society.Politics and government  0.028     0.647      0.7        0.519
History and Society.Society                  0.013     0.207      0.15       0.669
History and Society.Transportation           0.015     0.198      0.7        0.915
STEM.Biology                                 0.034     0.109      0.7        0.879
STEM.Chemistry                               0.002     0.82       0.5        0.626
STEM.Computing                               0.003     0.587      0.3        0.742
STEM.Earth and environment                   0.005     0.788      0.7        0.508
STEM.Engineering                             0.005     0.799      0.7        0.601
STEM.Libraries & Information                 0.001     0.512      0.3        0.658
STEM.Mathematics                             0.0       0.94       0.5        0.528
STEM.Medicine & Health                       0.006     0.797      0.7        0.596
STEM.Physics                                 0.001     0.418      0.15       0.742
STEM.STEM*                                   0.069     0.429      0.7        0.89
STEM.Space                                   0.006     0.089      0.7        0.948
STEM.Technology                              0.005     0.578      0.3        0.677
-------------------------------------------  --------  ---------  ---------  ------
$ python get_thresholds.py cswiki
-------------------------------------------  --------  ---------  ---------  ------
label                                        pop rate  threshold  precision  recall
Culture.Biography.Biography*                 0.123     0.191      0.7        0.964
Culture.Biography.Women                      0.015     0.498      0.7        0.864
Culture.Food and drink                       0.002     0.77       0.7        0.742
Culture.Internet culture                     0.004     0.791      0.7        0.76
Culture.Linguistics                          0.007     0.25       0.7        0.846
Culture.Literature                           0.016     0.645      0.7        0.707
Culture.Media.Books                          0.004     0.541      0.7        0.84
Culture.Media.Entertainment                  0.004     0.626      0.5        0.506
Culture.Media.Films                          0.011     0.258      0.7        0.899
Culture.Media.Media*                         0.059     0.55       0.7        0.882
Culture.Media.Music                          0.024     0.208      0.7        0.925
Culture.Media.Radio                          0.002     0.723      0.7        0.568
Culture.Media.Software                       0.001     0.833      0.3        0.589
Culture.Media.Television                     0.009     0.291      0.7        0.905
Culture.Media.Video games                    0.003     0.238      0.7        0.957
Culture.Performing arts                      0.003     0.739      0.7        0.616
Culture.Philosophy and religion              0.011     0.588      0.5        0.566
Culture.Sports                               0.071     0.04       0.7        0.965
Culture.Visual arts.Architecture             0.011     0.535      0.7        0.756
Culture.Visual arts.Comics and Anime         0.002     0.338      0.7        0.914
Culture.Visual arts.Fashion                  0.001     0.635      0.5        0.76
Culture.Visual arts.Visual arts*             0.018     0.579      0.7        0.757
Geography.Geographical                       0.024     0.731      0.7        0.56
Geography.Regions.Africa.Africa*             0.008     0.638      0.7        0.652
Geography.Regions.Africa.Central Africa      0.0       0.9        < 0.15
Geography.Regions.Africa.Eastern Africa      0.0       0.415      0.3        0.728
Geography.Regions.Africa.Northern Africa     0.001     0.416      0.3        0.734
Geography.Regions.Africa.Southern Africa     0.001     0.701      0.7        0.521
Geography.Regions.Africa.Western Africa      0.001     0.116      0.3        0.603
Geography.Regions.Americas.Central America   0.003     0.482      0.7        0.695
Geography.Regions.Americas.North America     0.064     0.451      0.7        0.672
Geography.Regions.Americas.South America     0.006     0.348      0.7        0.76
Geography.Regions.Asia.Asia*                 0.046     0.505      0.7        0.812
Geography.Regions.Asia.Central Asia          0.001     0.84       0.5        0.509
Geography.Regions.Asia.East Asia             0.011     0.414      0.7        0.8
Geography.Regions.Asia.North Asia            0.001     0.613      0.15       0.639
Geography.Regions.Asia.South Asia            0.015     0.134      0.7        0.869
Geography.Regions.Asia.Southeast Asia        0.006     0.325      0.7        0.791
Geography.Regions.Asia.West Asia             0.011     0.434      0.7        0.813
Geography.Regions.Europe.Eastern Europe      0.013     0.585      0.5        0.694
Geography.Regions.Europe.Europe*             0.076     0.75       0.7        0.615
Geography.Regions.Europe.Northern Europe     0.031     0.416      0.7        0.706
Geography.Regions.Europe.Southern Europe     0.013     0.66       0.7        0.609
Geography.Regions.Europe.Western Europe      0.019     0.755      0.7        0.579
Geography.Regions.Oceania                    0.015     0.187      0.7        0.813
History and Society.Business and economics   0.01      0.465      0.5        0.655
History and Society.Education                0.007     0.568      0.7        0.553
History and Society.History                  0.011     0.382      0.3        0.724
History and Society.Military and warfare     0.014     0.79       0.7        0.553
History and Society.Politics and government  0.028     0.61       0.7        0.508
History and Society.Society                  0.013     0.428      0.3        0.577
History and Society.Transportation           0.015     0.201      0.7        0.952
STEM.Biology                                 0.034     0.114      0.7        0.915
STEM.Chemistry                               0.002     0.806      0.5        0.75
STEM.Computing                               0.003     0.866      0.5        0.627
STEM.Earth and environment                   0.005     0.767      0.7        0.653
STEM.Engineering                             0.005     0.737      0.7        0.714
STEM.Libraries & Information                 0.001     0.765      0.5        0.629
STEM.Mathematics                             0.0       0.862      0.5        0.789
STEM.Medicine & Health                       0.006     0.641      0.7        0.703
STEM.Physics                                 0.001     0.676      0.3        0.724
STEM.STEM*                                   0.069     0.41       0.7        0.916
STEM.Space                                   0.006     0.096      0.7        0.973
STEM.Technology                              0.005     0.829      0.5        0.547
-------------------------------------------  --------  ---------  ---------  ------
$ python get_thresholds.py enwiki
-------------------------------------------  --------  ---------  ---------  ------
label                                        pop rate  threshold  precision  recall
Culture.Biography.Biography*                 0.123     0.247      0.7        0.946
Culture.Biography.Women                      0.015     0.667      0.5        0.668
Culture.Food and drink                       0.002     0.782      0.7        0.661
Culture.Internet culture                     0.004     0.797      0.7        0.722
Culture.Linguistics                          0.007     0.201      0.7        0.814
Culture.Literature                           0.016     0.763      0.7        0.618
Culture.Media.Books                          0.004     0.858      0.7        0.516
Culture.Media.Entertainment                  0.004     0.387      0.3        0.593
Culture.Media.Films                          0.011     0.318      0.7        0.864
Culture.Media.Media*                         0.059     0.637      0.7        0.814
Culture.Media.Music                          0.024     0.146      0.7        0.908
Culture.Media.Radio                          0.002     0.365      0.7        0.824
Culture.Media.Software                       0.001     0.639      0.15       0.543
Culture.Media.Television                     0.009     0.573      0.7        0.722
Culture.Media.Video games                    0.003     0.335      0.7        0.921
Culture.Performing arts                      0.003     0.816      0.7        0.547
Culture.Philosophy and religion              0.011     0.499      0.5        0.551
Culture.Sports                               0.071     0.03       0.7        0.97
Culture.Visual arts.Architecture             0.011     0.641      0.7        0.682
Culture.Visual arts.Comics and Anime         0.002     0.932      0.7        0.614
Culture.Visual arts.Fashion                  0.001     0.778      0.5        0.645
Culture.Visual arts.Visual arts*             0.018     0.728      0.7        0.66
Geography.Geographical                       0.024     0.416      0.7        0.712
Geography.Regions.Africa.Africa*             0.008     0.74       0.7        0.785
Geography.Regions.Africa.Central Africa      0.0       0.9        < 0.15
Geography.Regions.Africa.Eastern Africa      0.0       0.99       0.7        0.506
Geography.Regions.Africa.Northern Africa     0.001     0.818      0.5        0.627
Geography.Regions.Africa.Southern Africa     0.001     0.884      0.7        0.628
Geography.Regions.Africa.Western Africa      0.001     0.381      0.3        0.838
Geography.Regions.Americas.Central America   0.003     0.755      0.7        0.59
Geography.Regions.Americas.North America     0.064     0.53       0.7        0.678
Geography.Regions.Americas.South America     0.006     0.652      0.7        0.684
Geography.Regions.Asia.Asia*                 0.046     0.473      0.7        0.867
Geography.Regions.Asia.Central Asia          0.001     0.944      0.7        0.61
Geography.Regions.Asia.East Asia             0.011     0.542      0.7        0.762
Geography.Regions.Asia.North Asia            0.001     0.448      0.15       0.689
Geography.Regions.Asia.South Asia            0.015     0.065      0.7        0.94
Geography.Regions.Asia.Southeast Asia        0.006     0.243      0.7        0.853
Geography.Regions.Asia.West Asia             0.011     0.3        0.7        0.84
Geography.Regions.Europe.Eastern Europe      0.013     0.534      0.7        0.746
Geography.Regions.Europe.Europe*             0.076     0.648      0.7        0.678
Geography.Regions.Europe.Northern Europe     0.031     0.607      0.7        0.622
Geography.Regions.Europe.Southern Europe     0.013     0.619      0.7        0.642
Geography.Regions.Europe.Western Europe      0.019     0.71       0.7        0.537
Geography.Regions.Oceania                    0.015     0.117      0.7        0.904
History and Society.Business and economics   0.01      0.395      0.3        0.565
History and Society.Education                0.007     0.211      0.3        0.673
History and Society.History                  0.011     0.364      0.3        0.559
History and Society.Military and warfare     0.014     0.673      0.7        0.647
History and Society.Politics and government  0.028     0.514      0.7        0.628
History and Society.Society                  0.013     0.31       0.3        0.532
History and Society.Transportation           0.015     0.301      0.7        0.898
STEM.Biology                                 0.034     0.067      0.7        0.914
STEM.Chemistry                               0.002     0.588      0.3        0.668
STEM.Computing                               0.003     0.765      0.3        0.511
STEM.Earth and environment                   0.005     0.645      0.7        0.67
STEM.Engineering                             0.005     0.77       0.7        0.645
STEM.Libraries & Information                 0.001     0.702      0.3        0.529
STEM.Mathematics                             0.0       0.903      0.3        0.571
STEM.Medicine & Health                       0.006     0.735      0.7        0.613
STEM.Physics                                 0.001     0.83       0.3        0.51
STEM.STEM*                                   0.069     0.389      0.7        0.895
STEM.Space                                   0.006     0.069      0.7        0.937
STEM.Technology                              0.005     0.63       0.3        0.588
-------------------------------------------  --------  ---------  ---------  ------
"""
Queries for optimal thresholds from ORES.

Usage:
    get_thresholds (-h|--help)
    get_thresholds <wiki>

Options:
    -h --help  Prints this documentation
    <wiki>     The DBname of the wiki to query thresholds for.
"""
import docopt
import requests
from tabulate import tabulate

ORES_HOST = "https://ores.wikimedia.org"
PATH = "/v3/scores"
MODEL = "articletopic"
# Precision targets, tried from strictest to most relaxed.
PRECISION_TARGETS = [0.7, 0.5, 0.3, 0.15]


def main(argv=None):
    args = docopt.docopt(__doc__, argv=argv)
    wiki = args['<wiki>']

    # The header is passed as the first data row of the table.
    headers = [['label', 'pop rate', 'threshold', 'precision', 'recall']]
    table_data = headers
    for label, pop_rate in get_labels(wiki, MODEL):
        threshold, precision, recall = get_best_threshold(wiki, label)
        row = [label, pop_rate, threshold, precision, recall]
        table_data.append(row)

    print(tabulate(table_data))


def get_labels(wiki, model):
    """Returns (label, population rate) pairs for every label of the model."""
    doc = requests.get(
        ORES_HOST + PATH + "/" + wiki + "/",
        params={
            'models': model,
            'model_info': "params|statistics.rates"
        }
    ).json()
    labels = doc[wiki]['models'][model]['params']['labels']
    pop_rates = doc[wiki]['models'][model]['statistics']['rates']['population']
    return [(l, pop_rates[l]) for l in labels]


def get_threshold(wiki, label, target):
    """Returns (threshold, recall) for one precision target, or (None, None)."""
    optimization = "maximum recall @ precision >= {0}".format(target)
    doc = requests.get(
        ORES_HOST + PATH + "/" + wiki + "/",
        params={
            'models': MODEL,
            'model_info': "statistics.thresholds.{0}.'{1}'".format(
                repr(label), optimization)
        }
    ).json()
    thresholds = doc[wiki]['models'][MODEL]['statistics']['thresholds'][label]
    if len(thresholds) == 1 and thresholds[0] is not None:
        return thresholds[0]['threshold'], thresholds[0]['recall']
    else:
        return None, None


def get_best_threshold(wiki, label):
    """Returns (threshold, precision target, recall) for the strictest
    precision target that still reaches a recall of at least 0.5."""
    for target in PRECISION_TARGETS:
        threshold, recall = get_threshold(wiki, label, target)
        if recall is not None and recall >= 0.5:
            return threshold, target, recall

    # No target worked: fall back to a fixed threshold of 0.9.
    return 0.9, "< 0.15", None


if __name__ == '__main__':
    main()

Let's get some useful thresholds for the models. Generally, these thresholds are going to look a lot worse than they really are, mostly because the labels we used for training are messy and incomplete. We're targeting at least 70% precision, but we're likely to get that in practice when we ask for 50% precision, and in some cases we'll still get it when we target even lower precision.
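To see why incomplete labels make the measured numbers look worse than reality, here is a small arithmetic sketch. All the counts are made up for illustration and are not taken from the ORES test data; the helper `precision` is hypothetical.

```python
def precision(true_positives, flagged):
    """Fraction of flagged articles that really belong to the topic."""
    return true_positives / flagged


# Made-up counts, for illustration only.
flagged = 100                # articles scoring above the threshold
labeled_in_topic = 50        # flagged articles that carry the topic label
unlabeled_but_in_topic = 20  # flagged articles on topic but missing the label

# Measured precision only counts the labeled articles as correct.
measured = precision(labeled_in_topic, flagged)
# Actual precision also counts the on-topic articles the labels missed.
actual = precision(labeled_in_topic + unlabeled_but_in_topic, flagged)

print(measured, actual)
```

With these assumed counts, a threshold that measures as 50% precision would behave like a 70% precision threshold in practice.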

So! We're going to use ORES's "threshold optimization" querying system. We'll need to make one call per topic in order to get an appropriate threshold:

Culture.Biography.Biography* (maximum recall @ precision >= 0.5):

    {
      "!f1": 0.925,
      "!precision": 0.996,
      "!recall": 0.863,
      "accuracy": 0.877,
      "f1": 0.662,
      "filter_rate": 0.759,
      "fpr": 0.137,
      "match_rate": 0.241,
      "precision": 0.5,
      "recall": 0.977,
      "threshold": 0.086
    }

Culture.Biography.Women (maximum recall @ precision >= 0.5):

    {
      "!f1": 0.993,
      "!precision": 0.995,
      "!recall": 0.99,
      "accuracy": 0.985,
      "f1": 0.572,
      "filter_rate": 0.981,
      "fpr": 0.01,
      "match_rate": 0.019,
      "precision": 0.501,
      "recall": 0.668,
      "threshold": 0.667
    }

Culture.Media.Entertainment (maximum recall @ precision >= 0.5):

    {
      "!f1": 0.998,
      "!precision": 0.998,
      "!recall": 0.998,
      "accuracy": 0.996,
      "f1": 0.47,
      "filter_rate": 0.997,
      "fpr": 0.002,
      "match_rate": 0.003,
      "precision": 0.503,
      "recall": 0.442,
      "threshold": 0.646
    }

STEM.Mathematics (maximum recall @ precision >= 0.3):

    {
      "!f1": 1.0,
      "!precision": 1.0,
      "!recall": 0.999,
      "accuracy": 0.999,
      "f1": 0.401,
      "filter_rate": 0.999,
      "fpr": 0.001,
      "match_rate": 0.001,
      "precision": 0.309,
      "recall": 0.571,
      "threshold": 0.903
    }

Here, we can see some diversity. Culture.Biography.Biography* is easy to model and very common in the labeled data, so we can get very high precision and very high recall at a strict threshold. STEM.Mathematics is on the other end of the spectrum. There are very few math-related articles at all, so I've relaxed the minimum precision to 0.3 in order to get a threshold.
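As a rough sketch of what one of these per-topic calls looks like, the snippet below only builds the request URL and sends nothing. The host, path, model name and query syntax are the ones used in get_thresholds.py; the helper name `threshold_query_url` is made up for this example.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

ORES_HOST = "https://ores.wikimedia.org"  # same constants as get_thresholds.py
PATH = "/v3/scores"
MODEL = "articletopic"


def threshold_query_url(wiki, label, min_precision):
    """Build the URL of one threshold-optimization query (hypothetical helper)."""
    optimization = "maximum recall @ precision >= {0}".format(min_precision)
    model_info = "statistics.thresholds.{0}.'{1}'".format(
        repr(label), optimization)
    query = urlencode({"models": MODEL, "model_info": model_info})
    return "{0}{1}/{2}/?{3}".format(ORES_HOST, PATH, wiki, query)


url = threshold_query_url("enwiki", "STEM.Mathematics", 0.3)
# Decode the query string again to show what ORES is being asked for.
model_info = parse_qs(urlsplit(url).query)["model_info"][0]
print(url)
print(model_info)
```

Relaxing the precision for a rare topic like STEM.Mathematics then just means passing a lower `min_precision` for that label.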

$ python get_thresholds.py kowiki
-------------------------------------------  --------  ---------  ---------  ------
label                                        pop rate  threshold  precision  recall
Culture.Biography.Biography*                 0.123     0.236      0.7        0.954
Culture.Biography.Women                      0.015     0.739      0.7        0.608
Culture.Food and drink                       0.002     0.688      0.7        0.76
Culture.Internet culture                     0.004     0.851      0.7        0.661
Culture.Linguistics                          0.007     0.276      0.7        0.797
Culture.Literature                           0.016     0.657      0.7        0.705
Culture.Media.Books                          0.004     0.552      0.7        0.759
Culture.Media.Entertainment                  0.004     0.414      0.3        0.627
Culture.Media.Films                          0.011     0.301      0.7        0.876
Culture.Media.Media*                         0.059     0.62       0.7        0.826
Culture.Media.Music                          0.024     0.255      0.7        0.883
Culture.Media.Radio                          0.002     0.264      0.5        0.717
Culture.Media.Software                       0.001     0.827      0.3        0.585
Culture.Media.Television                     0.009     0.484      0.7        0.725
Culture.Media.Video games                    0.003     0.419      0.7        0.91
Culture.Performing arts                      0.003     0.541      0.5        0.623
Culture.Philosophy and religion              0.011     0.458      0.5        0.594
Culture.Sports                               0.071     0.043      0.7        0.948
Culture.Visual arts.Architecture             0.011     0.508      0.7        0.724
Culture.Visual arts.Comics and Anime         0.002     0.876      0.7        0.743
Culture.Visual arts.Fashion                  0.001     0.442      0.3        0.772
Culture.Visual arts.Visual arts*             0.018     0.597      0.7        0.715
Geography.Geographical                       0.024     0.647      0.7        0.574
Geography.Regions.Africa.Africa*             0.008     0.676      0.7        0.677
Geography.Regions.Africa.Central Africa      0.0       0.9        < 0.15
Geography.Regions.Africa.Eastern Africa      0.0       0.083      0.15       0.844
Geography.Regions.Africa.Northern Africa     0.001     0.785      0.5        0.617
Geography.Regions.Africa.Southern Africa     0.001     0.329      0.5        0.804
Geography.Regions.Africa.Western Africa      0.001     0.044      0.3        0.794
Geography.Regions.Americas.Central America   0.003     0.495      0.7        0.715
Geography.Regions.Americas.North America     0.064     0.452      0.7        0.68
Geography.Regions.Americas.South America     0.006     0.432      0.7        0.765
Geography.Regions.Asia.Asia*                 0.046     0.724      0.7        0.725
Geography.Regions.Asia.Central Asia          0.001     0.79       0.5        0.659
Geography.Regions.Asia.East Asia             0.011     0.668      0.5        0.739
Geography.Regions.Asia.North Asia            0.001     0.792      0.3        0.599
Geography.Regions.Asia.South Asia            0.015     0.111      0.7        0.889
Geography.Regions.Asia.Southeast Asia        0.006     0.422      0.7        0.764
Geography.Regions.Asia.West Asia             0.011     0.373      0.7        0.804
Geography.Regions.Europe.Eastern Europe      0.013     0.543      0.7        0.761
Geography.Regions.Europe.Europe*             0.076     0.575      0.7        0.772
Geography.Regions.Europe.Northern Europe     0.031     0.28       0.7        0.825
Geography.Regions.Europe.Southern Europe     0.013     0.543      0.7        0.722
Geography.Regions.Europe.Western Europe      0.019     0.558      0.7        0.712
Geography.Regions.Oceania                    0.015     0.149      0.7        0.85
History and Society.Business and economics   0.01      0.512      0.5        0.614
History and Society.Education                0.007     0.451      0.7        0.683
History and Society.History                  0.011     0.604      0.5        0.543
History and Society.Military and warfare     0.014     0.654      0.7        0.623
History and Society.Politics and government  0.028     0.563      0.7        0.549
History and Society.Society                  0.013     0.387      0.3        0.577
History and Society.Transportation           0.015     0.191      0.7        0.939
STEM.Biology                                 0.034     0.106      0.7        0.909
STEM.Chemistry                               0.002     0.701      0.5        0.762
STEM.Computing                               0.003     0.865      0.5        0.554
STEM.Earth and environment                   0.005     0.663      0.7        0.623
STEM.Engineering                             0.005     0.657      0.7        0.699
STEM.Libraries & Information                 0.001     0.737      0.5        0.634
STEM.Mathematics                             0.0       0.955      0.5        0.562
STEM.Medicine & Health                       0.006     0.561      0.7        0.7
STEM.Physics                                 0.001     0.74       0.3        0.662
STEM.STEM*                                   0.069     0.413      0.7        0.905
STEM.Space                                   0.006     0.051      0.7        0.962
STEM.Technology                              0.005     0.588      0.3        0.704
-------------------------------------------  --------  ---------  ---------  ------
$ python get_thresholds.py viwiki
-------------------------------------------  --------  ---------  ---------  ------
label                                        pop rate  threshold  precision  recall
Culture.Biography.Biography*                 0.123     0.193      0.7        0.954
Culture.Biography.Women                      0.015     0.545      0.5        0.788
Culture.Food and drink                       0.002     0.73       0.7        0.679
Culture.Internet culture                     0.004     0.788      0.7        0.757
Culture.Linguistics                          0.007     0.25       0.7        0.81
Culture.Literature                           0.016     0.583      0.7        0.752
Culture.Media.Books                          0.004     0.522      0.7        0.763
Culture.Media.Entertainment                  0.004     0.557      0.5        0.641
Culture.Media.Films                          0.011     0.334      0.7        0.881
Culture.Media.Media*                         0.059     0.528      0.7        0.873
Culture.Media.Music                          0.024     0.11       0.7        0.95
Culture.Media.Radio                          0.002     0.202      0.7        0.747
Culture.Media.Software                       0.001     0.76       0.3        0.726
Culture.Media.Television                     0.009     0.382      0.7        0.821
Culture.Media.Video games                    0.003     0.27       0.7        0.957
Culture.Performing arts                      0.003     0.767      0.7        0.597
Culture.Philosophy and religion              0.011     0.394      0.5        0.65
Culture.Sports                               0.071     0.028      0.7        0.963
Culture.Visual arts.Architecture             0.011     0.359      0.7        0.804
Culture.Visual arts.Comics and Anime         0.002     0.694      0.7        0.898
Culture.Visual arts.Fashion                  0.001     0.713      0.5        0.807
Culture.Visual arts.Visual arts*             0.018     0.485      0.7        0.807
Geography.Geographical                       0.024     0.661      0.7        0.519
Geography.Regions.Africa.Africa*             0.008     0.859      0.7        0.614
Geography.Regions.Africa.Central Africa      0.0       0.9        < 0.15
Geography.Regions.Africa.Eastern Africa      0.0       0.346      0.3        0.73
Geography.Regions.Africa.Northern Africa     0.001     0.917      0.5        0.502
Geography.Regions.Africa.Southern Africa     0.001     0.67       0.5        0.662
Geography.Regions.Africa.Western Africa      0.001     0.154      0.3        0.905
Geography.Regions.Americas.Central America   0.003     0.833      0.7        0.52
Geography.Regions.Americas.North America     0.064     0.29       0.7        0.812
Geography.Regions.Americas.South America     0.006     0.746      0.7        0.689
Geography.Regions.Asia.Asia*                 0.046     0.658      0.7        0.794
Geography.Regions.Asia.Central Asia          0.001     0.659      0.5        0.785
Geography.Regions.Asia.East Asia             0.011     0.875      0.7        0.614
Geography.Regions.Asia.North Asia            0.001     0.838      0.3        0.584
Geography.Regions.Asia.South Asia            0.015     0.103      0.7        0.911
Geography.Regions.Asia.Southeast Asia        0.006     0.89       0.7        0.534
Geography.Regions.Asia.West Asia             0.011     0.194      0.7        0.894
Geography.Regions.Europe.Eastern Europe      0.013     0.52       0.7        0.834
Geography.Regions.Europe.Europe*             0.076     0.434      0.7        0.846
Geography.Regions.Europe.Northern Europe     0.031     0.238      0.7        0.809
Geography.Regions.Europe.Southern Europe     0.013     0.319      0.7        0.826
Geography.Regions.Europe.Western Europe      0.019     0.304      0.7        0.846
Geography.Regions.Oceania                    0.015     0.253      0.7        0.812
History and Society.Business and economics   0.01      0.719      0.7        0.507
History and Society.Education                0.007     0.497      0.7        0.617
History and Society.History                  0.011     0.376      0.3        0.662
History and Society.Military and warfare     0.014     0.7        0.7        0.719
History and Society.Politics and government  0.028     0.591      0.7        0.517
History and Society.Society                  0.013     0.368      0.3        0.625
History and Society.Transportation           0.015     0.09       0.7        0.964
STEM.Biology                                 0.034     0.094      0.7        0.967
STEM.Chemistry                               0.002     0.829      0.5        0.622
STEM.Computing                               0.003     0.752      0.5        0.74
STEM.Earth and environment                   0.005     0.647      0.7        0.652
STEM.Engineering                             0.005     0.582      0.7        0.828
STEM.Libraries & Information                 0.001     0.612      0.5        0.761
STEM.Mathematics                             0.0       0.779      0.5        0.82
STEM.Medicine & Health                       0.006     0.635      0.7        0.733
STEM.Physics                                 0.001     0.72       0.3        0.7
STEM.STEM*                                   0.069     0.371      0.7        0.937
STEM.Space                                   0.006     0.055      0.7        0.965
STEM.Technology                              0.005     0.842      0.5        0.53
-------------------------------------------  --------  ---------  ---------  ------

halfak commented Feb 4, 2020

Edit: changed 'recall' to 'precision'

I propose a multi-step process.

  1. Try to get a threshold at precision >= 0.5
  2. If that doesn't work, try to get a threshold at precision >= 0.3
  3. If that doesn't work, set the threshold to 90%
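The steps above can be sketched as a small pure function. This is only an illustration of the proposal, not the code in get_thresholds.py: the name `pick_threshold` and the shape of `candidates` are made up, and it assumes "doesn't work" means ORES returned no threshold for that precision target.

```python
def pick_threshold(candidates, targets=(0.5, 0.3), fallback=0.9):
    """Return (threshold, precision_target) following the proposed steps.

    `candidates` maps a precision target to the threshold ORES reported
    for it, or to None when no threshold reaches that precision.
    """
    for target in targets:
        threshold = candidates.get(target)
        if threshold is not None:
            return threshold, target
    # Nothing worked: use the fixed 90% threshold, with no precision target.
    return fallback, None


# Thresholds below are the ones shown in the README examples.
biography = pick_threshold({0.5: 0.086})
mathematics = pick_threshold({0.5: None, 0.3: 0.903})
# A topic for which no target yields a threshold at all.
no_data = pick_threshold({})

print(biography, mathematics, no_data)
```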


catrope commented Feb 5, 2020

> I propose a multi-step process.
>
> 1. Try to get a threshold at `recall >= 0.5`
> 2. If that doesn't work, try to get a threshold at `recall >= 0.3`

I think you mean precision instead of recall here?

Also, are these threshold values inverted? You say the 0.086 for geography is good and the 0.903 for math is bad. Does that mean that to get articles that have a 30% chance of being about math, I should look for scores higher than 1 - 0.903 = 0.097?


halfak commented Feb 5, 2020

Woops. Yes. Precision and not recall. I'll edit.

Being able to get high precision at a low threshold is really good. That means we're also likely to get high recall. If you want to find something that has a 30% chance of being about math (according to our test data, though it's much higher in practice), you'd look for anything with a math prediction above 0.903. I'll have a table of these values for all topics for you to reference in an hour or so.
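To make the direction of the comparison concrete, here's a minimal sketch of applying per-topic thresholds to a set of predictions. The function name `matching_topics` and the example probabilities are made up; the two thresholds are the ones from the examples above.

```python
def matching_topics(probabilities, thresholds):
    """Return the topics whose predicted probability is above their threshold.

    Both arguments are dicts keyed by topic label. Topics without a
    threshold are skipped.
    """
    return sorted(
        topic for topic, probability in probabilities.items()
        if topic in thresholds and probability > thresholds[topic]
    )


# Thresholds from the examples above.
thresholds = {
    "Culture.Biography.Biography*": 0.086,
    "STEM.Mathematics": 0.903,
}

# Hypothetical predictions for two articles.
math_article = {"Culture.Biography.Biography*": 0.05, "STEM.Mathematics": 0.95}
biography_article = {"Culture.Biography.Biography*": 0.4, "STEM.Mathematics": 0.5}

print(matching_topics(math_article, thresholds))
print(matching_topics(biography_article, thresholds))
```

Note that the 0.5 math score on the second article does not count as a match, while a 0.4 biography score does. The thresholds are not inverted; each one is simply a different cutoff on that topic's own score.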


halfak commented Feb 5, 2020

-------------------------------------------  --------  ---------  ------  ---------  ------  ---------  ------  ---------  ------
                                                         precision=0.7      precision=0.5      precision=0.3      precision=0.15
label                                        pop rate  threshold  recall  threshold  recall  threshold  recall  threshold  recall
Culture.Biography.Biography*                 0.123     0.247      0.946   0.086      0.977   0.021      0.995   0.004      1.0
Culture.Biography.Women                      0.015     0.918      0.276   0.667      0.668   0.101      0.933   0.019      0.976
Culture.Food and drink                       0.002     0.782      0.661   0.239      0.821   0.071      0.885   0.021      0.929
Culture.Internet culture                     0.004     0.797      0.722   0.552      0.789   0.254      0.864   0.073      0.931
Culture.Linguistics                          0.007     0.201      0.814   0.065      0.878   0.024      0.912   0.01       0.946
Culture.Literature                           0.016     0.763      0.618   0.427      0.77    0.158      0.882   0.047      0.954
Culture.Media.Books                          0.004     0.858      0.516   0.479      0.712   0.169      0.837   0.047      0.907
Culture.Media.Entertainment                  0.004     0.818      0.334   0.646      0.442   0.387      0.593   0.148      0.756
Culture.Media.Films                          0.011     0.318      0.864   0.093      0.922   0.029      0.948   0.009      0.977
Culture.Media.Media*                         0.059     0.637      0.814   0.265      0.923   0.064      0.981   0.015      0.997
Culture.Media.Music                          0.024     0.146      0.908   0.027      0.958   0.009      0.977   0.004      0.985
Culture.Media.Radio                          0.002     0.365      0.824   0.122      0.876   0.036      0.917   0.011      0.946
Culture.Media.Software                       0.001                                           0.936      0.17    0.639      0.543
Culture.Media.Television                     0.009     0.573      0.722   0.229      0.823   0.071      0.897   0.021      0.948
Culture.Media.Video games                    0.003     0.335      0.921   0.109      0.945   0.03       0.961   0.008      0.978
Culture.Performing arts                      0.003     0.816      0.547   0.478      0.69    0.142      0.819   0.038      0.9
Culture.Philosophy and religion              0.011     0.928      0.238   0.499      0.551   0.151      0.759   0.049      0.872
Culture.Sports                               0.071     0.03       0.97    0.013      0.981   0.006      0.991   0.003      0.997
Culture.Visual arts.Architecture             0.011     0.641      0.682   0.243      0.826   0.063      0.911   0.016      0.962
Culture.Visual arts.Comics and Anime         0.002     0.932      0.614   0.574      0.769   0.153      0.876   0.036      0.936
Culture.Visual arts.Fashion                  0.001     0.992      0.34    0.778      0.645   0.304      0.796   0.087      0.875
Culture.Visual arts.Visual arts*             0.018     0.728      0.66    0.343      0.821   0.116      0.91    0.039      0.961
Geography.Geographical                       0.024     0.416      0.712   0.191      0.823   0.067      0.907   0.019      0.964
Geography.Regions.Africa.Africa*             0.008     0.74       0.785   0.259      0.911   0.102      0.953   0.045      0.972
Geography.Regions.Africa.Central Africa      0.0
Geography.Regions.Africa.Eastern Africa      0.0       0.99       0.506   0.802      0.737   0.231      0.857   0.071      0.902
Geography.Regions.Africa.Northern Africa     0.001     0.983      0.406   0.818      0.627   0.331      0.779   0.074      0.868
Geography.Regions.Africa.Southern Africa     0.001     0.884      0.628   0.37       0.8     0.125      0.881   0.042      0.915
Geography.Regions.Africa.Western Africa      0.001                        1.0        0.147   0.381      0.838   0.054      0.937
Geography.Regions.Americas.Central America   0.003     0.755      0.59    0.252      0.766   0.072      0.862   0.024      0.922
Geography.Regions.Americas.North America     0.064     0.53       0.678   0.179      0.867   0.044      0.959   0.013      0.991
Geography.Regions.Americas.South America     0.006     0.652      0.684   0.135      0.863   0.03       0.931   0.011      0.963
Geography.Regions.Asia.Asia*                 0.046     0.473      0.867   0.157      0.944   0.052      0.976   0.018      0.991
Geography.Regions.Asia.Central Asia          0.001     0.944      0.61    0.648      0.748   0.233      0.839   0.061      0.898
Geography.Regions.Asia.East Asia             0.011     0.542      0.762   0.13       0.892   0.038      0.945   0.014      0.971
Geography.Regions.Asia.North Asia            0.001                                           0.928      0.339   0.448      0.689
Geography.Regions.Asia.South Asia            0.015     0.065      0.94    0.023      0.959   0.01       0.974   0.004      0.986
Geography.Regions.Asia.Southeast Asia        0.006     0.243      0.853   0.069      0.91    0.025      0.939   0.011      0.96
Geography.Regions.Asia.West Asia             0.011     0.3        0.84    0.071      0.917   0.024      0.953   0.009      0.977
Geography.Regions.Europe.Eastern Europe      0.013     0.534      0.746   0.194      0.862   0.059      0.932   0.019      0.972
Geography.Regions.Europe.Europe*             0.076     0.648      0.678   0.263      0.877   0.069      0.966   0.019      0.992
Geography.Regions.Europe.Northern Europe     0.031     0.607      0.622   0.176      0.844   0.045      0.934   0.015      0.976
Geography.Regions.Europe.Southern Europe     0.013     0.619      0.642   0.178      0.818   0.05       0.907   0.017      0.954
Geography.Regions.Europe.Western Europe      0.019     0.71       0.537   0.195      0.801   0.056      0.908   0.02       0.963
Geography.Regions.Oceania                    0.015     0.117      0.904   0.043      0.942   0.019      0.96    0.008      0.976
History and Society.Business and economics   0.01      0.971      0.054   0.77       0.297   0.395      0.565   0.114      0.795
History and Society.Education                0.007     0.918      0.256   0.571      0.475   0.211      0.673   0.075      0.811
History and Society.History                  0.011     0.937      0.111   0.724      0.314   0.364      0.559   0.121      0.769
History and Society.Military and warfare     0.014     0.673      0.647   0.341      0.774   0.121      0.88    0.033      0.959
History and Society.Politics and government  0.028     0.514      0.628   0.237      0.763   0.081      0.885   0.027      0.95
History and Society.Society                  0.013     0.829      0.19    0.601      0.338   0.31       0.532   0.132      0.732
History and Society.Transportation           0.015     0.301      0.898   0.073      0.946   0.022      0.971   0.008      0.985
STEM.Biology                                 0.034     0.067      0.914   0.022      0.951   0.008      0.977   0.004      0.987
STEM.Chemistry                               0.002     0.986      0.265   0.92       0.452   0.588      0.668   0.109      0.862
STEM.Computing                               0.003                        0.944      0.165   0.765      0.511   0.214      0.854
STEM.Earth and environment                   0.005     0.645      0.67    0.264      0.788   0.085      0.856   0.025      0.915
STEM.Engineering                             0.005     0.77       0.645   0.363      0.777   0.144      0.855   0.043      0.919
STEM.Libraries & Information                 0.001                        0.99       0.231   0.702      0.529   0.23       0.709
STEM.Mathematics                             0.0                                             0.903      0.571   0.368      0.784
STEM.Medicine & Health                       0.006     0.735      0.613   0.285      0.768   0.065      0.869   0.019      0.929
STEM.Physics                                 0.001     0.997      0.105   0.983      0.236   0.83       0.51    0.372      0.717
STEM.STEM*                                   0.069     0.389      0.895   0.161      0.944   0.059      0.976   0.018      0.995
STEM.Space                                   0.006     0.069      0.937   0.014      0.964   0.004      0.98    0.002      0.989
STEM.Technology                              0.005     0.965      0.158   0.868      0.35    0.63       0.588   0.232      0.798
-------------------------------------------  --------  ---------  ------  ---------  ------  ---------  ------  ---------  ------


halfak commented Feb 5, 2020

Now, I'll apply the rules above to the script to get a recommended threshold.


halfak commented Feb 5, 2020

-------------------------------------------  --------  ---------  ---------  ------
label                                        pop rate  threshold  precision  recall
-------------------------------------------  --------  ---------  ---------  ------
Culture.Biography.Biography*                 0.123     0.247      0.7        0.946
Culture.Biography.Women                      0.015     0.667      0.5        0.668
Culture.Food and drink                       0.002     0.782      0.7        0.661
Culture.Internet culture                     0.004     0.797      0.7        0.722
Culture.Linguistics                          0.007     0.201      0.7        0.814
Culture.Literature                           0.016     0.763      0.7        0.618
Culture.Media.Books                          0.004     0.858      0.7        0.516
Culture.Media.Entertainment                  0.004     0.387      0.3        0.593
Culture.Media.Films                          0.011     0.318      0.7        0.864
Culture.Media.Media*                         0.059     0.637      0.7        0.814
Culture.Media.Music                          0.024     0.146      0.7        0.908
Culture.Media.Radio                          0.002     0.365      0.7        0.824
Culture.Media.Software                       0.001     0.639      0.15       0.543
Culture.Media.Television                     0.009     0.573      0.7        0.722
Culture.Media.Video games                    0.003     0.335      0.7        0.921
Culture.Performing arts                      0.003     0.816      0.7        0.547
Culture.Philosophy and religion              0.011     0.499      0.5        0.551
Culture.Sports                               0.071     0.03       0.7        0.97
Culture.Visual arts.Architecture             0.011     0.641      0.7        0.682
Culture.Visual arts.Comics and Anime         0.002     0.932      0.7        0.614
Culture.Visual arts.Fashion                  0.001     0.778      0.5        0.645
Culture.Visual arts.Visual arts*             0.018     0.728      0.7        0.66
Geography.Geographical                       0.024     0.416      0.7        0.712
Geography.Regions.Africa.Africa*             0.008     0.74       0.7        0.785
Geography.Regions.Africa.Central Africa      0.0       0.9        < 0.15
Geography.Regions.Africa.Eastern Africa      0.0       0.99       0.7        0.506
Geography.Regions.Africa.Northern Africa     0.001     0.818      0.5        0.627
Geography.Regions.Africa.Southern Africa     0.001     0.884      0.7        0.628
Geography.Regions.Africa.Western Africa      0.001     0.381      0.3        0.838
Geography.Regions.Americas.Central America   0.003     0.755      0.7        0.59
Geography.Regions.Americas.North America     0.064     0.53       0.7        0.678
Geography.Regions.Americas.South America     0.006     0.652      0.7        0.684
Geography.Regions.Asia.Asia*                 0.046     0.473      0.7        0.867
Geography.Regions.Asia.Central Asia          0.001     0.944      0.7        0.61
Geography.Regions.Asia.East Asia             0.011     0.542      0.7        0.762
Geography.Regions.Asia.North Asia            0.001     0.448      0.15       0.689
Geography.Regions.Asia.South Asia            0.015     0.065      0.7        0.94
Geography.Regions.Asia.Southeast Asia        0.006     0.243      0.7        0.853
Geography.Regions.Asia.West Asia             0.011     0.3        0.7        0.84
Geography.Regions.Europe.Eastern Europe      0.013     0.534      0.7        0.746
Geography.Regions.Europe.Europe*             0.076     0.648      0.7        0.678
Geography.Regions.Europe.Northern Europe     0.031     0.607      0.7        0.622
Geography.Regions.Europe.Southern Europe     0.013     0.619      0.7        0.642
Geography.Regions.Europe.Western Europe      0.019     0.71       0.7        0.537
Geography.Regions.Oceania                    0.015     0.117      0.7        0.904
History and Society.Business and economics   0.01      0.395      0.3        0.565
History and Society.Education                0.007     0.211      0.3        0.673
History and Society.History                  0.011     0.364      0.3        0.559
History and Society.Military and warfare     0.014     0.673      0.7        0.647
History and Society.Politics and government  0.028     0.514      0.7        0.628
History and Society.Society                  0.013     0.31       0.3        0.532
History and Society.Transportation           0.015     0.301      0.7        0.898
STEM.Biology                                 0.034     0.067      0.7        0.914
STEM.Chemistry                               0.002     0.588      0.3        0.668
STEM.Computing                               0.003     0.765      0.3        0.511
STEM.Earth and environment                   0.005     0.645      0.7        0.67
STEM.Engineering                             0.005     0.77       0.7        0.645
STEM.Libraries & Information                 0.001     0.702      0.3        0.529
STEM.Mathematics                             0.0       0.903      0.3        0.571
STEM.Medicine & Health                       0.006     0.735      0.7        0.613
STEM.Physics                                 0.001     0.83       0.3        0.51
STEM.STEM*                                   0.069     0.389      0.7        0.895
STEM.Space                                   0.006     0.069      0.7        0.937
STEM.Technology                              0.005     0.63       0.3        0.588
-------------------------------------------  --------  ---------  ---------  ------


halfak commented Feb 5, 2020

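I changed the algorithm for selecting thresholds. For each label, it tries the precision targets from highest to lowest and keeps the first threshold whose recall is above 0.5. If no target qualifies, it falls back to a threshold of 0.9.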

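label_thresholds = {}
for label in labels:
  # Try precision targets from strictest to most lenient.
  for target in [0.7, 0.5, 0.3, 0.15]:
    # The lookup is per label, so the label is passed along with the target.
    optimization = get_threshold_at_precision(label, target)
    # Keep the first target that still gives recall above 0.5.
    if optimization is not None and optimization['recall'] > 0.5:
      label_thresholds[label] = optimization['threshold']
      break

  if label not in label_thresholds:
    label_thresholds[label] = 0.9  # Best guess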
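
As a rough worked example of the selection loop above, the sketch below runs it on three rows copied from the wider threshold/recall table earlier in this thread. It assumes the four threshold/recall pairs in that table correspond to the precision targets 0.7, 0.5, 0.3 and 0.15, in that order. `STATS`, `select_thresholds` and this version of `get_threshold_at_precision` are illustrative stand-ins, not the code in get_thresholds.py.

```python
# Precision targets, tried from strictest to most lenient.
TARGETS = [0.7, 0.5, 0.3, 0.15]

# Assumed layout: for each label, (threshold, recall) per precision target.
# Values are copied from the wide table; None marks an empty cell.
STATS = {
    "History and Society.Politics and government": {
        0.7: (0.514, 0.628), 0.5: (0.237, 0.763),
        0.3: (0.081, 0.885), 0.15: (0.027, 0.95),
    },
    "STEM.Chemistry": {
        0.7: (0.986, 0.265), 0.5: (0.92, 0.452),
        0.3: (0.588, 0.668), 0.15: (0.109, 0.862),
    },
    "STEM.Mathematics": {
        0.7: None, 0.5: None,
        0.3: (0.903, 0.571), 0.15: (0.368, 0.784),
    },
}


def get_threshold_at_precision(stats, label, target):
    """Hypothetical lookup: return threshold and recall, or None if missing."""
    cell = stats[label].get(target)
    if cell is None:
        return None
    threshold, recall = cell
    return {"threshold": threshold, "recall": recall}


def select_thresholds(stats, targets=TARGETS, min_recall=0.5, fallback=0.9):
    """Pick, per label, the first target whose recall exceeds min_recall."""
    label_thresholds = {}
    for label in stats:
        for target in targets:
            optimization = get_threshold_at_precision(stats, label, target)
            if optimization is not None and optimization["recall"] > min_recall:
                label_thresholds[label] = optimization["threshold"]
                break
        else:
            # No target met the recall floor: use the best-guess fallback.
            label_thresholds[label] = fallback
    return label_thresholds


if __name__ == "__main__":
    for label, threshold in select_thresholds(STATS).items():
        print(label, threshold)
```

With these rows the sketch selects 0.514, 0.588 and 0.903, the same thresholds listed for those labels in the recommended table above.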


kostajh commented Feb 13, 2020

@halfak could you please post the latest version of get_thresholds.py? (Or alternately, post JSON output instead of table output from the script, as I need that for kostajh/newcomertasks-drafttopic#1) I don't see how your last comment applies to get_thresholds.py. Thanks!


halfak commented Feb 13, 2020

I have posted the most recent version of get_thresholds.py. See https://gist.github.com/halfak/630dc3fd811995c2a0260d43da462645#file-get_thresholds-py-L68


kostajh commented Feb 14, 2020

Thanks!
