So this is just a review PR :) I want to get some feedback (or probably better ideas?).
- Some sfs are not spottable because the discounts are not enough:
  - Google, Apple, Supreme Court, Office for National Statistics, Eurovision, hipster, Süleymaniye
- There are some negative sf probabilities, e.g. Giants.
I took a look at some numbers inside the raw files:
- I listed the surfaceForms by number of superSurfaceForms and also printed the summed annotated counts of their superSurfaceForms.
At the top of that file you can see things like (sf, sumOfAnnotatedSuperSf, numberOfSuperSf):
river 356957 18086
AED 204197 20057
state 950068 20094
district 372421 22164
John 409929 23890
nationalism 660402 24402
in 197477 25718
school 389412 31380
DE 327536 34451
and 442262 34566
's 689626 60722
the 1710758 107866
of 3337517 189646
As expected, this is very general stuff, e.g. River has many super sfs: Thames River, Seine River.
- We don't want to raise such sfs that much, because they are general and their superSf sumAnnotatedCount is also quite big.
- Cases with such high numbers cover, at a guess, less than 1% of the file.
- At the bottom (and through most of the file, around 90%) we get cases such as:
"Soapy" Smith 2 1
"Squeaky" Fromme 17 1
"Stonewall" Jackson 34 1
"Sue" 2 1
- Just a bit before the top scores I pointed out, we get more interesting cases such as google and apple. Apple is a good example: it doesn't have a massive number of superSfs, but the sum of its superSf annotated counts has a nice value.
...
Google 7800 398
...
Apple 13160 715
...
- These are the sfs we want to give good discounts.
- Plotting these stats gives you a long flat line that suddenly grows very high at the end.
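
For reference, the two statistics listed above could be computed roughly like this. This is only a sketch, not the code that produced the file: it assumes a superSf is any other surface form that contains the sf as a contiguous, case-sensitive, whitespace-tokenised sequence, and the names `super_sf_stats` and `annotated_counts` are made up for illustration.

```python
def contains_tokens(haystack, needle):
    """True if the token list `needle` occurs contiguously inside `haystack`."""
    n = len(needle)
    return any(haystack[i:i + n] == needle
               for i in range(len(haystack) - n + 1))


def super_sf_stats(annotated_counts):
    """Map each sf to (sumOfAnnotatedSuperSf, numberOfSuperSf).

    `annotated_counts` maps a surface form to its annotated count.
    Assumption: a superSf is a strictly longer surface form that
    contains the sf as a contiguous token sequence.
    """
    tokens = {sf: sf.split() for sf in annotated_counts}
    stats = {}
    for sf, needle in tokens.items():
        total, number = 0, 0
        for other, haystack in tokens.items():
            if len(haystack) > len(needle) and contains_tokens(haystack, needle):
                total += annotated_counts[other]
                number += 1
        stats[sf] = (total, number)
    return stats


if __name__ == "__main__":
    # Toy counts, invented for the example.
    toy = {"River": 10, "Thames River": 5, "Seine River": 7, "Apple": 3}
    for sf, (total, number) in sorted(super_sf_stats(toy).items(),
                                      key=lambda kv: kv[1][1]):
        print(sf, total, number)
```

The quadratic loop is fine for a toy input; on the real raw files an index from token to surface forms would be needed.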
So the simplest idea after seeing this is to discount based on the number of superSfs. I commented the formula and the nuts and bolts in the given PR.
- So I calculate the probability of a surfaceForm being general based on its number of superSfs. Given that the actual max number of superSfs is very high, I somewhat arbitrarily picked airport's numberOfSuperSf as the max limit; airport is near the point where numberOfSuperSf skyrockets.
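
To make the idea concrete, here is a minimal sketch of a capped "generality" probability. The actual formula is the one commented in the PR; the linear ratio below is an assumption, and `generality_probability`, `max_super_sf`, and the cap value of 10000 are hypothetical (airport's real numberOfSuperSf is not reproduced here).

```python
def generality_probability(number_of_super_sf, max_super_sf):
    """Probability-like score in [0, 1] that a surface form is 'general'.

    Assumption: a linear ratio of the sf's number of superSfs to a cap,
    where the cap plays the role of airport's numberOfSuperSf. Anything
    at or above the cap is treated as fully general.
    """
    if max_super_sf <= 0:
        raise ValueError("max_super_sf must be positive")
    if number_of_super_sf < 0:
        raise ValueError("number_of_super_sf must be non-negative")
    return min(number_of_super_sf, max_super_sf) / max_super_sf


if __name__ == "__main__":
    HYPOTHETICAL_CAP = 10000  # placeholder, not airport's real value
    # numberOfSuperSf values taken from the listing in this description
    for sf, n in [("Google", 398), ("Apple", 715), ("of", 189646)]:
        print(sf, generality_probability(n, HYPOTHETICAL_CAP))
```

With a cap like this, Google and Apple stay close to 0 (so they can get a good discount), while everything past the cap, such as of or the, saturates at 1.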
A new model was generated using the new discount mechanism. I tried to evaluate how much crap and how many new useful topics I was getting.
I couldn't see any crap coming through, and the sfs named in the problem section got spotted (except for Google).
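
A check like the one above can be made repeatable by diffing the surface forms spotted by the old and new models on the same text. This is a sketch only: `compare_spots` is a hypothetical helper, and the input lists are assumed to come from whatever spotting run you already have.

```python
def compare_spots(old_spots, new_spots):
    """Return which surface forms the new model gained and lost."""
    old, new = set(old_spots), set(new_spots)
    return {
        "gained": sorted(new - old),  # candidates for new useful topics
        "lost": sorted(old - new),    # spots that disappeared
    }


if __name__ == "__main__":
    # Invented example output from two spotting runs.
    old_run = ["Eurovision", "river"]
    new_run = ["Eurovision", "Apple", "Supreme Court"]
    print(compare_spots(old_run, new_run))
```

The "gained" list is the one to inspect by hand for crap versus useful topics.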