@Bomberus
Created April 24, 2017 18:21
w2v Validation
stopwords:
'too','was','where','just','how','have','so','this','has','into','or','what','now', 'about', 'when', 'their','will','some','off','all','can','your','his','you','over', 'no','out','more','not','who','its', 'up','it','be', 'after','that','are','by','but', 'from','an', 'as', 'at','with', 'is','on','and','for','of','to','in','a','the'
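The gist does not say how the stopword list was applied. As a minimal sketch, assuming a lowercase, letters-only tokenizer (the helper name `remove_stopwords` and the tokenization rule are illustrative assumptions, not the author's code), filtering could look like this:

```python
import re

# Stopword list copied verbatim from the gist.
STOPWORDS = {
    'too', 'was', 'where', 'just', 'how', 'have', 'so', 'this', 'has', 'into',
    'or', 'what', 'now', 'about', 'when', 'their', 'will', 'some', 'off', 'all',
    'can', 'your', 'his', 'you', 'over', 'no', 'out', 'more', 'not', 'who',
    'its', 'up', 'it', 'be', 'after', 'that', 'are', 'by', 'but', 'from', 'an',
    'as', 'at', 'with', 'is', 'on', 'and', 'for', 'of', 'to', 'in', 'a', 'the',
}

def remove_stopwords(text):
    """Hypothetical helper: lowercase, keep runs of letters, drop stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

print(remove_stopwords("The senate will vote on the health bill"))
```

The remaining tokens would then be looked up in the embedding model; how the vectors were combined into a document feature is not stated in the gist.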
labels:
['Arts' 'Business' 'Health' 'Opinion' 'Politics' 'Science' 'Sports' 'Style' 'Technology']
Title (min 35 characters):
Train: 7204
Test: 1802
Text:
Train: 8010
Test: 991
Training : {'Technology': 897, 'Politics': 889, 'Science': 899, 'Health': 882, 'Business': 905, 'Opinion': 898, 'Style': 893, 'Arts': 867, 'Sports': 880}
=================================================================================================================================
Word2Vec + title
Accuracy: 0.600998890122
              precision    recall  f1-score   support
Arts               0.55      0.52      0.53       215
Business           0.62      0.59      0.60       200
Health             0.64      0.69      0.67       209
Opinion            0.46      0.61      0.52       189
Politics           0.80      0.69      0.74       212
Science            0.64      0.55      0.59       208
Sports             0.64      0.64      0.64       198
Style              0.54      0.59      0.57       187
Technology         0.58      0.52      0.54       184
avg / total        0.61      0.60      0.60      1802
Cohen's kappa: 0.551147612883
array([[111, 12, 4, 21, 4, 18, 12, 20, 13],
[ 14, 118, 11, 16, 3, 4, 13, 10, 11],
[ 4, 10, 144, 10, 7, 9, 10, 7, 8],
[ 8, 4, 13, 115, 6, 6, 4, 20, 13],
[ 5, 11, 13, 15, 147, 6, 9, 1, 5],
[ 20, 7, 9, 18, 5, 115, 8, 14, 12],
[ 10, 11, 9, 18, 3, 10, 127, 8, 2],
[ 19, 12, 6, 23, 1, 3, 6, 111, 6],
[ 12, 6, 15, 16, 8, 9, 10, 13, 95]])
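The scalar printed before each report equals the accuracy, and the scalar after it is consistent with Cohen's kappa; both can be recomputed from the confusion matrix alone. The sketch below does this in plain Python for the "Word2Vec + title" matrix. The function names are illustrative. Rows are treated as true classes and columns as predicted classes, which agrees with the support column (each row sum equals the class support).

```python
# Confusion matrix copied from the "Word2Vec + title" run.
# Rows: true class, columns: predicted class (row sums match the support column).
LABELS = ['Arts', 'Business', 'Health', 'Opinion', 'Politics',
          'Science', 'Sports', 'Style', 'Technology']
CM = [
    [111, 12, 4, 21, 4, 18, 12, 20, 13],
    [14, 118, 11, 16, 3, 4, 13, 10, 11],
    [4, 10, 144, 10, 7, 9, 10, 7, 8],
    [8, 4, 13, 115, 6, 6, 4, 20, 13],
    [5, 11, 13, 15, 147, 6, 9, 1, 5],
    [20, 7, 9, 18, 5, 115, 8, 14, 12],
    [10, 11, 9, 18, 3, 10, 127, 8, 2],
    [19, 12, 6, 23, 1, 3, 6, 111, 6],
    [12, 6, 15, 16, 8, 9, 10, 13, 95],
]

def accuracy(cm):
    """Fraction of samples on the diagonal."""
    total = sum(sum(row) for row in cm)
    return sum(cm[i][i] for i in range(len(cm))) / total

def cohen_kappa(cm):
    """(p_o - p_e) / (1 - p_e), with p_e the chance agreement from the marginals."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    row_sums = [sum(row) for row in cm]
    col_sums = [sum(cm[i][j] for i in range(n)) for j in range(n)]
    p_o = accuracy(cm)
    p_e = sum(r * c for r, c in zip(row_sums, col_sums)) / total ** 2
    return (p_o - p_e) / (1 - p_e)

def precision_recall(cm, labels):
    """Per-class precision (diagonal / column sum) and recall (diagonal / row sum)."""
    n = len(cm)
    result = {}
    for k in range(n):
        col_sum = sum(cm[i][k] for i in range(n))
        row_sum = sum(cm[k])
        result[labels[k]] = (cm[k][k] / col_sum, cm[k][k] / row_sum)
    return result

print("accuracy:", accuracy(CM))
print("kappa:   ", cohen_kappa(CM))
for label, (p, r) in precision_recall(CM, LABELS).items():
    print(f"{label:<12}{p:.2f}  {r:.2f}")
```

Running the same functions on the other matrices is a quick way to cross-check each report against its confusion matrix.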
=================================================================================================================================
Word2Vec + text
Accuracy: 0.860746720484
              precision    recall  f1-score   support
Arts               0.89      0.88      0.88       133
Business           0.79      0.79      0.79        95
Health             0.87      0.88      0.87       119
Opinion            0.76      0.80      0.78       102
Politics           0.87      0.90      0.89       112
Science            0.82      0.81      0.82        99
Sports             0.97      0.98      0.97       121
Style              0.89      0.84      0.86       106
Technology         0.86      0.83      0.84       104
avg / total        0.86      0.86      0.86       991
Cohen's kappa: 0.843118495678
array([[117, 1, 1, 5, 0, 2, 0, 5, 2],
[ 1, 75, 2, 4, 0, 3, 1, 3, 6],
[ 0, 2, 105, 2, 3, 5, 0, 1, 1],
[ 5, 4, 4, 82, 4, 1, 0, 0, 2],
[ 2, 1, 3, 5, 101, 0, 0, 0, 0],
[ 2, 2, 4, 5, 3, 80, 0, 0, 3],
[ 0, 1, 0, 0, 0, 1, 118, 1, 0],
[ 4, 4, 2, 3, 0, 3, 1, 89, 0],
[ 1, 5, 0, 2, 5, 2, 2, 1, 86]])
=================================================================================================================================
Glove 50d + Title
Accuracy: 0.57935627081
              precision    recall  f1-score   support
Arts               0.53      0.52      0.52       215
Business           0.58      0.56      0.57       200
Health             0.57      0.66      0.61       209
Opinion            0.43      0.51      0.47       189
Politics           0.74      0.67      0.70       212
Science            0.61      0.51      0.56       208
Sports             0.63      0.69      0.66       198
Style              0.60      0.60      0.60       187
Technology         0.56      0.48      0.52       184
avg / total        0.58      0.58      0.58      1802
Cohen's kappa: 0.526658624682
array([[111, 9, 14, 16, 3, 15, 17, 23, 7],
[ 11, 111, 15, 21, 5, 7, 14, 10, 6],
[ 5, 11, 138, 10, 9, 8, 9, 7, 12],
[ 13, 12, 17, 97, 8, 8, 8, 10, 16],
[ 4, 10, 16, 18, 143, 6, 7, 4, 4],
[ 26, 4, 12, 22, 12, 107, 7, 7, 11],
[ 14, 9, 8, 13, 1, 6, 136, 5, 6],
[ 12, 8, 9, 20, 1, 8, 8, 112, 9],
[ 14, 16, 14, 11, 12, 9, 9, 10, 89]])
=================================================================================================================================
Glove 50d + Text
Accuracy: 0.835519677094
              precision    recall  f1-score   support
Arts               0.88      0.86      0.87       133
Business           0.73      0.80      0.76        95
Health             0.84      0.85      0.85       119
Opinion            0.72      0.79      0.75       102
Politics           0.84      0.84      0.84       112
Science            0.88      0.78      0.82        99
Sports             0.97      0.95      0.96       121
Style              0.87      0.85      0.86       106
Technology         0.77      0.76      0.76       104
avg / total        0.84      0.84      0.84       991
Cohen's kappa: 0.814743199368
array([[115, 3, 0, 2, 2, 3, 0, 6, 2],
[ 1, 76, 3, 4, 0, 0, 2, 2, 7],
[ 0, 2, 101, 4, 4, 4, 1, 1, 2],
[ 5, 4, 5, 81, 5, 0, 0, 0, 2],
[ 0, 3, 2, 9, 94, 0, 0, 0, 4],
[ 2, 3, 5, 6, 1, 77, 0, 1, 4],
[ 0, 1, 0, 2, 0, 0, 115, 2, 1],
[ 6, 3, 2, 1, 0, 2, 0, 90, 2],
[ 1, 9, 2, 4, 6, 2, 0, 1, 79]])
=================================================================================================================================
Glove 200d + Title
Accuracy: 0.620976692564
              precision    recall  f1-score   support
Arts               0.57      0.52      0.54       215
Business           0.57      0.57      0.57       200
Health             0.63      0.70      0.67       209
Opinion            0.46      0.54      0.50       189
Politics           0.81      0.74      0.77       212
Science            0.65      0.59      0.62       208
Sports             0.68      0.71      0.70       198
Style              0.61      0.67      0.64       187
Technology         0.62      0.53      0.57       184
avg / total        0.63      0.62      0.62      1802
Cohen's kappa: 0.573537233536
array([[111, 12, 13, 14, 2, 18, 13, 25, 7],
[ 13, 115, 13, 19, 4, 9, 9, 9, 9],
[ 4, 13, 147, 14, 4, 6, 8, 6, 7],
[ 13, 13, 12, 103, 6, 11, 5, 13, 13],
[ 3, 10, 10, 13, 157, 5, 9, 3, 2],
[ 17, 6, 10, 19, 6, 122, 8, 9, 11],
[ 10, 11, 7, 15, 2, 5, 141, 6, 1],
[ 14, 9, 7, 13, 1, 3, 6, 125, 9],
[ 11, 11, 14, 13, 12, 8, 8, 9, 98]])
=================================================================================================================================
Glove 200d + Text
Accuracy: 0.855701311806
              precision    recall  f1-score   support
Arts               0.90      0.88      0.89       133
Business           0.77      0.83      0.80        95
Health             0.84      0.86      0.85       119
Opinion            0.81      0.81      0.81       102
Politics           0.87      0.88      0.87       112
Science            0.85      0.83      0.84        99
Sports             0.97      0.95      0.96       121
Style              0.87      0.85      0.86       106
Technology         0.80      0.79      0.79       104
avg / total        0.86      0.86      0.86       991
Cohen's kappa: 0.837472044774
=================================================================================================================================
Glove 300d + Title
Accuracy: 0.617092119867
              precision    recall  f1-score   support
Arts               0.58      0.55      0.57       215
Business           0.59      0.61      0.60       200
Health             0.64      0.71      0.67       209
Opinion            0.47      0.51      0.49       189
Politics           0.80      0.72      0.76       212
Science            0.65      0.58      0.61       208
Sports             0.68      0.68      0.68       198
Style              0.57      0.65      0.61       187
Technology         0.59      0.53      0.56       184
avg / total        0.62      0.62      0.62      1802
Cohen's kappa: 0.569163771336
array([[118, 10, 11, 13, 2, 19, 11, 24, 7],
[ 12, 123, 9, 18, 3, 9, 8, 11, 7],
[ 5, 12, 148, 10, 4, 6, 8, 5, 11],
[ 12, 15, 9, 97, 6, 11, 7, 16, 16],
[ 3, 8, 14, 17, 153, 5, 7, 3, 2],
[ 17, 6, 13, 15, 5, 120, 9, 11, 12],
[ 11, 11, 8, 12, 5, 5, 134, 10, 2],
[ 14, 11, 8, 10, 1, 4, 7, 121, 11],
[ 10, 12, 13, 15, 12, 7, 7, 10, 98]])
=================================================================================================================================
Glove 300d + Text
Accuracy: 0.862764883956
              precision    recall  f1-score   support
Arts               0.88      0.89      0.88       133
Business           0.77      0.81      0.79        95
Health             0.86      0.87      0.86       119
Opinion            0.83      0.80      0.82       102
Politics           0.88      0.89      0.89       112
Science            0.86      0.82      0.84        99
Sports             0.97      0.95      0.96       121
Style              0.88      0.87      0.87       106
Technology         0.81      0.84      0.82       104
avg / total        0.86      0.86      0.86       991
Cohen's kappa: 0.845401185392
array([[118, 2, 0, 4, 1, 1, 0, 6, 1],
[ 2, 77, 1, 3, 0, 2, 1, 2, 7],
[ 0, 3, 103, 2, 3, 5, 0, 2, 1],
[ 6, 4, 5, 82, 2, 1, 0, 0, 2],
[ 1, 2, 4, 2, 100, 1, 0, 0, 2],
[ 1, 2, 4, 3, 1, 81, 0, 1, 6],
[ 1, 1, 0, 2, 0, 0, 115, 1, 1],
[ 5, 3, 2, 0, 0, 2, 1, 92, 1],
[ 0, 6, 1, 1, 6, 1, 1, 1, 87]])
======
Memory usage
Google     : 3,000,000 words * 300 features * 4 bytes/feature ~ 3.6 GB
Glove 50d  :   400,000 words *  50 features * 4 bytes/feature ~ 0.08 GB
Glove 200d :   400,000 words * 200 features * 4 bytes/feature ~ 0.32 GB
Glove 300d :   400,000 words * 300 features * 4 bytes/feature ~ 0.48 GB
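The estimates above are just vocabulary size times vector dimension times 4 bytes per float32 value, using decimal gigabytes (1 GB = 10^9 bytes). A small sketch of that arithmetic follows; the function name `embedding_bytes` is illustrative, and the figure covers only the raw vector matrix, not the vocabulary strings or any loader overhead.

```python
def embedding_bytes(vocab_size, dims, bytes_per_value=4):
    """Size of a dense vocab_size x dims matrix of float32 values, in bytes."""
    return vocab_size * dims * bytes_per_value

# (vocabulary size, dimensions) as listed in the gist.
MODELS = {
    "Google": (3_000_000, 300),
    "Glove 50d": (400_000, 50),
    "Glove 200d": (400_000, 200),
    "Glove 300d": (400_000, 300),
}

for name, (vocab_size, dims) in MODELS.items():
    gigabytes = embedding_bytes(vocab_size, dims) / 1e9
    print(f"{name:<11}: {gigabytes:.2f} GB")
```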