@he7d3r
Last active May 22, 2020 17:19
Comparison of articlequality models for ptwiki, depending on the dataset used

Full period; Bots included

Model Information:
	- type: GradientBoosting
	- version: 0.8.0
	- params: {'min_samples_split': 2, 'label_weights': None, 'max_depth': 7, 'min_impurity_split': None, 'learning_rate': 0.01, 'verbose': 0, 'max_features': 'log2', 'center': True, 'subsample': 1.0, 'n_estimators': 300, 'warm_start': False, 'multilabel': False, 'min_samples_leaf': 1, 'labels': ['1', '2', '3', '4', '5', '6'], 'scale': True, 'presort': 'auto', 'population_rates': None, 'loss': 'deviance', 'random_state': None, 'max_leaf_nodes': None, 'init': None, 'n_iter_no_change': None, 'criterion': 'friedman_mse', 'min_impurity_decrease': 0.0, 'validation_fraction': 0.1, 'tol': 0.0001, 'min_weight_fraction_leaf': 0.0}
	Environment:
	 - revscoring_version: '2.6.9'
	 - platform: 'Linux-4.9.0-11-amd64-x86_64-with-debian-9.12'
	 - machine: 'x86_64'
	 - version: '#1 SMP Debian 4.9.189-3+deb9u1 (2019-09-20)'
	 - system: 'Linux'
	 - processor: ''
	 - python_build: ('default', 'Sep 27 2018 17:25:39')
	 - python_compiler: 'GCC 6.3.0 20170516'
	 - python_branch: ''
	 - python_implementation: 'CPython'
	 - python_revision: ''
	 - python_version: '3.5.3'
	 - release: '4.9.0-11-amd64'

Statistics:
counts (n=8921):
	label       n         ~1    ~2    ~3    ~4    ~5    ~6
	-------  ----  ---  ----  ----  ----  ----  ----  ----
	'1'      1495  -->  1319   107    62     5     2     0
	'2'      1494  -->    90  1057   300    21    21     5
	'3'      1497  -->    31    94  1066   166    97    43
	'4'      1489  -->    15    31   296   332   451   364
	'5'      1483  -->    10    12   116   227   750   368
	'6'      1463  -->    14    17    50   150   293   939
rates:
	              '1'    '2'    '3'    '4'    '5'    '6'
	----------  -----  -----  -----  -----  -----  -----
	sample      0.168  0.167  0.168  0.167  0.166  0.164
	population  0.712  0.188  0.055  0.033  0.006  0.007
match_rate (micro=0.494, macro=0.208):
	    1      2      3      4      5      6
	-----  -----  -----  -----  -----  -----
	0.635  0.161  0.144  0.081  0.118  0.108
filter_rate (micro=0.506, macro=0.792):
	    1      2      3      4      5      6
	-----  -----  -----  -----  -----  -----
	0.365  0.839  0.856  0.919  0.882  0.892
recall (micro=0.815, macro=0.612):
	    1      2      3      4      5      6
	-----  -----  -----  -----  -----  -----
	0.882  0.707  0.712  0.223  0.506  0.642
!recall (micro=0.968, macro=0.923):
	    1      2      3      4      5      6
	-----  -----  -----  -----  -----  -----
	0.978  0.965  0.889  0.923  0.884  0.895
precision (micro=0.878, macro=0.373):
	   1      2     3     4      5      6
	----  -----  ----  ----  -----  -----
	0.99  0.823  0.27  0.09  0.025  0.041
!precision (micro=0.822, macro=0.942):
	    1      2      3      4      5      6
	-----  -----  -----  -----  -----  -----
	0.771  0.935  0.982  0.972  0.997  0.997
f1 (micro=0.834, macro=0.39):
	    1      2      3      4      5      6
	-----  -----  -----  -----  -----  -----
	0.933  0.761  0.392  0.128  0.048  0.076
!f1 (micro=0.886, macro=0.929):
	    1      2      3      4      5      6
	-----  -----  -----  -----  -----  -----
	0.862  0.949  0.933  0.947  0.937  0.944
accuracy (micro=0.909, macro=0.897):
	   1      2      3    4      5      6
	----  -----  -----  ---  -----  -----
	0.91  0.917  0.879  0.9  0.882  0.894
fpr (micro=0.032, macro=0.077):
	    1      2      3      4      5      6
	-----  -----  -----  -----  -----  -----
	0.022  0.035  0.111  0.077  0.116  0.105
roc_auc (micro=0.963, macro=0.889):
	    1      2    3      4      5      6
	-----  -----  ---  -----  -----  -----
	0.982  0.951  0.9  0.764  0.847  0.892
pr_auc (micro=0.9, macro=0.418):
	    1      2      3      4      5      6
	-----  -----  -----  -----  -----  -----
	0.993  0.878  0.449  0.088  0.047  0.053

 - score_schema: {'type': 'object', 'properties': {'probability': {'type': 'object', 'description': 'A mapping of probabilities onto each of the potential output labels', 'properties': {'3': {'type': 'number'}, '4': {'type': 'number'}, '1': {'type': 'number'}, '5': {'type': 'number'}, '2': {'type': 'number'}, '6': {'type': 'number'}}}, 'prediction': {'type': 'string', 'description': 'The most likely label predicted by the estimator'}}, 'title': 'Scikit learn-based classifier score with probability'}
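
The per-label figures above can be approximately reproduced from the "counts" and "rates" tables by treating each label as a one-vs-rest problem and rescaling to the population rate. The sketch below is a reconstruction inferred from the printed numbers, not revscoring's own code; the function name `one_vs_rest_stats` is hypothetical, and because the population rates are printed rounded to three places, results may differ from the report in the last digit.

```python
# Confusion matrix copied from the "counts" table of the
# "Full period; Bots included" model (rows: true label, columns: prediction).
labels = ["1", "2", "3", "4", "5", "6"]
counts = [
    [1319, 107, 62, 5, 2, 0],
    [90, 1057, 300, 21, 21, 5],
    [31, 94, 1066, 166, 97, 43],
    [15, 31, 296, 332, 451, 364],
    [10, 12, 116, 227, 750, 368],
    [14, 17, 50, 150, 293, 939],
]
# Population rates as printed (rounded) in the "rates" table.
population = [0.712, 0.188, 0.055, 0.033, 0.006, 0.007]


def one_vs_rest_stats(labels, counts, population):
    """Hypothetical helper: per-label recall, fpr, match_rate and precision.

    Assumption: each label is scored as positive-vs-rest, and the positive
    class is reweighted to its population rate before computing precision.
    """
    total = sum(sum(row) for row in counts)
    stats = {}
    for i, label in enumerate(labels):
        positives = sum(counts[i])
        true_pos = counts[i][i]
        false_pos = sum(counts[j][i] for j in range(len(counts)) if j != i)
        recall = true_pos / positives
        fpr = false_pos / (total - positives)
        rate = population[i]
        # Share of the population that would be predicted as this label.
        match_rate = rate * recall + (1 - rate) * fpr
        precision = rate * recall / match_rate
        stats[label] = {
            "recall": recall,
            "fpr": fpr,
            "match_rate": match_rate,
            "precision": precision,
        }
    return stats


stats = one_vs_rest_stats(labels, counts, population)
for label in labels:
    row = stats[label]
    print(label, {name: round(value, 3) for name, value in row.items()})
```

Under this reading, the low precision for labels 4 to 6 follows from their small population rates: even a modest false-positive rate on the much larger negative class outweighs the true positives.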

Full period; Bots removed

Model Information:
	- type: GradientBoosting
	- version: 0.8.0
	- params: {'center': True, 'validation_fraction': 0.1, 'init': None, 'presort': 'auto', 'scale': True, 'min_impurity_decrease': 0.0, 'labels': ['1', '2', '3', '4', '5', '6'], 'learning_rate': 0.01, 'subsample': 1.0, 'n_estimators': 300, 'random_state': None, 'max_leaf_nodes': None, 'min_samples_leaf': 1, 'label_weights': None, 'min_weight_fraction_leaf': 0.0, 'criterion': 'friedman_mse', 'warm_start': False, 'loss': 'deviance', 'min_samples_split': 2, 'min_impurity_split': None, 'multilabel': False, 'max_depth': 7, 'tol': 0.0001, 'max_features': 'log2', 'verbose': 0, 'n_iter_no_change': None, 'population_rates': None}
	Environment:
	 - revscoring_version: '2.7.2'
	 - platform: 'Linux-4.9.0-8-amd64-x86_64-with-debian-9.4'
	 - machine: 'x86_64'
	 - version: '#1 SMP Debian 4.9.144-3.1 (2019-02-19)'
	 - system: 'Linux'
	 - processor: ''
	 - python_build: ('default', 'Sep 27 2018 17:25:39')
	 - python_compiler: 'GCC 6.3.0 20170516'
	 - python_branch: ''
	 - python_implementation: 'CPython'
	 - python_revision: ''
	 - python_version: '3.5.3'
	 - release: '4.9.0-8-amd64'

Statistics:
counts (n=8750):
	label       n         ~1    ~2    ~3    ~4    ~5    ~6
	-------  ----  ---  ----  ----  ----  ----  ----  ----
	'1'      1483  -->  1054   358    52    10     6     3
	'2'      1493  -->   245   844   347    24    25     8
	'3'      1484  -->    52   283   832   176    99    42
	'4'      1476  -->    20    80   227   366   467   316
	'5'      1490  -->     7    39    98   239   806   301
	'6'      1324  -->    13    54    38   127   295   797
rates:
	              '1'    '2'    '3'    '4'    '5'    '6'
	----------  -----  -----  -----  -----  -----  -----
	sample      0.169  0.171  0.17   0.169  0.17   0.151
	population  0.712  0.188  0.055  0.033  0.006  0.007
match_rate (micro=0.418, macro=0.192):
	   1      2     3      4      5      6
	----  -----  ----  -----  -----  -----
	0.52  0.197  0.13  0.085  0.125  0.094
filter_rate (micro=0.582, macro=0.808):
	   1      2     3      4      5      6
	----  -----  ----  -----  -----  -----
	0.48  0.803  0.87  0.915  0.875  0.906
recall (micro=0.658, macro=0.538):
	    1      2      3      4      5      6
	-----  -----  -----  -----  -----  -----
	0.711  0.565  0.561  0.248  0.541  0.602
!recall (micro=0.936, macro=0.907):
	    1      2      3      4      5     6
	-----  -----  -----  -----  -----  ----
	0.954  0.888  0.895  0.921  0.877  0.91
precision (micro=0.811, macro=0.319):
	    1      2      3      4      5      6
	-----  -----  -----  -----  -----  -----
	0.974  0.538  0.236  0.096  0.026  0.044
!precision (micro=0.673, macro=0.901):
	    1      2      3      4      5      6
	-----  -----  -----  -----  -----  -----
	0.571  0.898  0.972  0.973  0.997  0.997
f1 (micro=0.712, macro=0.329):
	    1      2      3      4      5      6
	-----  -----  -----  -----  -----  -----
	0.822  0.551  0.332  0.138  0.049  0.082
!f1 (micro=0.77, macro=0.895):
	    1      2      3      4      5      6
	-----  -----  -----  -----  -----  -----
	0.714  0.893  0.932  0.946  0.933  0.951
accuracy (micro=0.8, macro=0.861):
	    1      2      3      4      5      6
	-----  -----  -----  -----  -----  -----
	0.781  0.827  0.877  0.899  0.875  0.908
fpr (micro=0.064, macro=0.093):
	    1      2      3      4      5     6
	-----  -----  -----  -----  -----  ----
	0.046  0.112  0.105  0.079  0.123  0.09
roc_auc (micro=0.923, macro=0.863):
	    1      2      3      4      5      6
	-----  -----  -----  -----  -----  -----
	0.951  0.872  0.852  0.762  0.849  0.889
pr_auc (micro=0.824, macro=0.346):
	    1      2     3      4      5      6
	-----  -----  ----  -----  -----  -----
	0.976  0.567  0.35  0.094  0.047  0.042

 - score_schema: {'type': 'object', 'properties': {'prediction': {'description': 'The most likely label predicted by the estimator', 'type': 'string'}, 'probability': {'type': 'object', 'description': 'A mapping of probabilities onto each of the potential output labels', 'properties': {'4': {'type': 'number'}, '5': {'type': 'number'}, '2': {'type': 'number'}, '1': {'type': 'number'}, '3': {'type': 'number'}, '6': {'type': 'number'}}}}, 'title': 'Scikit learn-based classifier score with probability'}

Since 2014 (inclusive); Bots included

Model Information:
	- type: GradientBoosting
	- version: 0.8.0
	- params: {'learning_rate': 0.01, 'center': True, 'criterion': 'friedman_mse', 'min_impurity_decrease': 0.0, 'scale': True, 'subsample': 1.0, 'max_leaf_nodes': None, 'min_samples_split': 2, 'multilabel': False, 'n_iter_no_change': None, 'validation_fraction': 0.1, 'min_weight_fraction_leaf': 0.0, 'max_depth': 7, 'warm_start': False, 'max_features': 'log2', 'labels': ['1', '2', '3', '4', '5', '6'], 'population_rates': None, 'n_estimators': 300, 'label_weights': None, 'verbose': 0, 'min_samples_leaf': 1, 'tol': 0.0001, 'loss': 'deviance', 'min_impurity_split': None, 'presort': 'auto', 'init': None, 'random_state': None}
	Environment:
	 - revscoring_version: '2.7.2'
	 - platform: 'Linux-4.9.0-8-amd64-x86_64-with-debian-9.4'
	 - machine: 'x86_64'
	 - version: '#1 SMP Debian 4.9.144-3.1 (2019-02-19)'
	 - system: 'Linux'
	 - processor: ''
	 - python_build: ('default', 'Sep 27 2018 17:25:39')
	 - python_compiler: 'GCC 6.3.0 20170516'
	 - python_branch: ''
	 - python_implementation: 'CPython'
	 - python_revision: ''
	 - python_version: '3.5.3'
	 - release: '4.9.0-8-amd64'

Statistics:
counts (n=6220):
	label       n         ~1    ~2    ~3    ~4    ~5    ~6
	-------  ----  ---  ----  ----  ----  ----  ----  ----
	'1'      1429  -->  1072   293    50     7     5     2
	'2'      1499  -->   241   974   228    16    36     4
	'3'      1270  -->    91   330   657    72    87    33
	'4'       688  -->    19    60   139   106   158   206
	'5'       650  -->     6    34    81   109   309   111
	'6'       684  -->    25    30    59   121   101   348
rates:
	              '1'    '2'    '3'    '4'    '5'    '6'
	----------  -----  -----  -----  -----  -----  -----
	sample      0.23   0.241  0.204  0.111  0.105  0.11
	population  0.712  0.188  0.055  0.033  0.006  0.007
match_rate (micro=0.454, macro=0.191):
	    1     2      3      4      5      6
	-----  ----  -----  -----  -----  -----
	0.557  0.25  0.135  0.062  0.072  0.067
filter_rate (micro=0.546, macro=0.809):
	    1     2      3      4      5      6
	-----  ----  -----  -----  -----  -----
	0.443  0.75  0.865  0.938  0.928  0.933
recall (micro=0.696, macro=0.509):
	   1     2      3      4      5      6
	----  ----  -----  -----  -----  -----
	0.75  0.65  0.517  0.154  0.475  0.509
!recall (micro=0.905, macro=0.909):
	   1      2      3      4      5      6
	----  -----  -----  -----  -----  -----
	0.92  0.842  0.887  0.941  0.931  0.936
precision (micro=0.789, macro=0.305):
	    1      2     3      4      5      6
	-----  -----  ----  -----  -----  -----
	0.959  0.487  0.21  0.082  0.039  0.052
!precision (micro=0.695, macro=0.907):
	    1      2     3     4      5      6
	-----  -----  ----  ----  -----  -----
	0.598  0.912  0.97  0.97  0.997  0.996
f1 (micro=0.725, macro=0.328):
	    1      2      3      4      5      6
	-----  -----  -----  -----  -----  -----
	0.842  0.557  0.299  0.107  0.073  0.094
!f1 (micro=0.775, macro=0.902):
	    1      2      3      4      5      6
	-----  -----  -----  -----  -----  -----
	0.725  0.876  0.927  0.956  0.962  0.965
accuracy (micro=0.81, macro=0.875):
	    1      2      3      4      5      6
	-----  -----  -----  -----  -----  -----
	0.799  0.806  0.867  0.915  0.928  0.933
fpr (micro=0.095, macro=0.091):
	   1      2      3      4      5      6
	----  -----  -----  -----  -----  -----
	0.08  0.158  0.113  0.059  0.069  0.064
roc_auc (micro=0.908, macro=0.862):
	    1      2      3      4      5      6
	-----  -----  -----  -----  -----  -----
	0.934  0.856  0.824  0.785  0.888  0.887
pr_auc (micro=0.822, macro=0.341):
	   1      2      3      4      5      6
	----  -----  -----  -----  -----  -----
	0.97  0.605  0.267  0.079  0.065  0.062

 - score_schema: {'properties': {'prediction': {'description': 'The most likely label predicted by the estimator', 'type': 'string'}, 'probability': {'description': 'A mapping of probabilities onto each of the potential output labels', 'properties': {'6': {'type': 'number'}, '5': {'type': 'number'}, '4': {'type': 'number'}, '3': {'type': 'number'}, '2': {'type': 'number'}, '1': {'type': 'number'}}, 'type': 'object'}}, 'title': 'Scikit learn-based classifier score with probability', 'type': 'object'}
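
To compare the three datasets at a glance, the headline micro/macro figures from the reports above can be put side by side. The sketch below only re-tabulates numbers already printed in this gist; the variable names are illustrative. Note that the first model was built with revscoring 2.6.9 and the other two with 2.7.2, and that the sample sizes differ (n=8921, 8750, 6220), so the differences may not be due to the dataset alone.

```python
# (micro, macro) pairs copied from the three reports above.
reports = {
    "Full period; Bots included": {
        "recall": (0.815, 0.612), "f1": (0.834, 0.39),
        "roc_auc": (0.963, 0.889), "pr_auc": (0.9, 0.418),
    },
    "Full period; Bots removed": {
        "recall": (0.658, 0.538), "f1": (0.712, 0.329),
        "roc_auc": (0.923, 0.863), "pr_auc": (0.824, 0.346),
    },
    "Since 2014 (inclusive); Bots included": {
        "recall": (0.696, 0.509), "f1": (0.725, 0.328),
        "roc_auc": (0.908, 0.862), "pr_auc": (0.822, 0.341),
    },
}

metrics = ["recall", "f1", "roc_auc", "pr_auc"]

# Print one row per dataset, showing micro/macro for each metric.
print("dataset".ljust(40) + "".join(m.ljust(16) for m in metrics))
for name, values in reports.items():
    cells = ["{:.3f}/{:.3f}".format(*values[m]).ljust(16) for m in metrics]
    print(name.ljust(40) + "".join(cells))

# Dataset with the highest macro F1 among the three.
best_macro_f1 = max(reports, key=lambda name: reports[name]["f1"][1])
print("highest macro f1:", best_macro_f1)
```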