@dav009
Last active August 29, 2015 14:00

Trying to improve discounts

So this is just a review PR :) I'd like to get some feedback (or, probably, better ideas?).

Problem

  • some sfs (surface forms) are not spottable because the discounts are not enough:

    • Google, Apple, Supreme Court, Office for National Statistics, Eurovision, hipster, Süleymaniye
  • some sfs end up with negative probabilities, e.g. Giants.

Taking a look at the raw files

I took a look at some numbers inside the raw files:

  • I listed the surfaceForms sorted by their number of superSurfaceForms, and also printed the sum of the annotated counts of those superSurfaceForms.
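A listing like this could be produced roughly as in the sketch below. It is only an illustration, not the code from the PR: it assumes a super surface form is any longer surface form that contains the shorter one as a contiguous, case-sensitive token sequence, and the names `super_sf_stats` and `contains_tokens`, as well as the toy counts, are made up for the example.

```python
def contains_tokens(longer, shorter):
    """True if `shorter` occurs as a contiguous token run inside `longer`."""
    m = len(shorter)
    return any(longer[i:i + m] == shorter for i in range(len(longer) - m + 1))


def super_sf_stats(annotated_counts):
    """Map each sf to (sumOfAnnotatedSuperSf, numberOfSuperSf).

    ASSUMPTION: a super sf is a strictly longer sf containing this sf's tokens.
    The naive pairwise scan is quadratic, so it is only meant for small samples.
    """
    tokens = {sf: tuple(sf.split()) for sf in annotated_counts}
    stats = {}
    for sf, toks in tokens.items():
        total, number = 0, 0
        for other, other_toks in tokens.items():
            if len(other_toks) > len(toks) and contains_tokens(other_toks, toks):
                total += annotated_counts[other]
                number += 1
        stats[sf] = (total, number)
    return stats


# Toy annotated counts, invented for illustration only.
counts = {
    "River": 10,
    "Thames River": 40,
    "Seine River": 25,
    "Apple": 300,
    "Apple Inc.": 120,
}

stats = super_sf_stats(counts)
# Print sf, sum of annotated counts of its super sfs, number of super sfs,
# sorted by the number of super sfs.
for sf, (total, number) in sorted(stats.items(), key=lambda kv: kv[1][1]):
    print(f"{sf}\t{total}\t{number}")
```

On the toy data, `River` gets two super sfs (Thames River, Seine River) and `Apple` gets one, which is the same three-column shape as the listing discussed here.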

At the top of that file you can see things like this (columns: sf, sumOfAnnotatedSuperSf, numberOfSuperSf):

river	356957	18086
AED	204197	20057
state	950068	20094
district	372421	22164
John	409929	23890
nationalism	660402	24402
in	197477	25718
school	389412	31380
DE	327536	34451
and	442262	34566
's	689626	60722
the	1710758	107866
of	3337517	189646

As expected, very general stuff, i.e.:

  • River has many super sfs: Thames River, Seine River
  • we don't want to raise such sfs that much, because they are general and the sum of their superSfs' annotated counts is also quite big
  • cases with such high numbers cover, I guess, less than 1% of the file

  • At the bottom (and through most of the file, about 90%) we get cases such as:
"Soapy" Smith	2	1
"Squeaky" Fromme	17	1
"Stonewall" Jackson	34	1
"Sue"	2	1

  • just a bit before the top scores I pointed out, we get more interesting cases such as Google and Apple. Apple is a good example: it doesn't have a massive number of superSfs, but the sum of its superSfs' annotated counts is fairly high.
...
Google	7800	398
...
Apple	13160	715
...
  • These are the sfs we want to give good discounts to.
  • plotting these stats gives a long flat line that shoots up suddenly at the end

So the simplest idea after seeing this is to discount based on the number of superSfs. I commented the formula and the nuts and bolts in the given PR.

  • So I calculate the probability of a surfaceForm being general based on its number of superSfs. Given that the actual max number of superSfs is very high, I somewhat arbitrarily picked airport's numberOfSuperSf as the max limit; airport is near the point where numberOfSuperSf skyrockets.
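The actual formula lives in the PR, so the following is only a rough sketch of the idea under stated assumptions: the generality probability is taken to be the number of superSfs divided by the cap, clipped at 1, and the discount weight is assumed to be its complement. The function names and the cap value are hypothetical (airport's real numberOfSuperSf is not given here); the Google, Apple and "of" counts are the ones from the listing above.

```python
def generality_probability(num_super_sf, max_super_sf):
    """Probability that an sf is 'general', in [0, 1].

    ASSUMPTION: linear in the number of super sfs, clipped at the cap.
    """
    if max_super_sf <= 0:
        raise ValueError("max_super_sf must be positive")
    return min(num_super_sf, max_super_sf) / max_super_sf


def discount_weight(num_super_sf, max_super_sf):
    """ASSUMPTION: the more general an sf is, the less discount it gets."""
    return 1.0 - generality_probability(num_super_sf, max_super_sf)


# Placeholder cap standing in for airport's numberOfSuperSf (hypothetical value).
HYPOTHETICAL_CAP = 5000

# numberOfSuperSf values taken from the listing above.
for sf, num_super_sf in [("Google", 398), ("Apple", 715), ("of", 189646)]:
    weight = discount_weight(num_super_sf, HYPOTHETICAL_CAP)
    print(f"{sf}\t{num_super_sf}\t{weight:.3f}")
```

With a cap like this, sfs such as Google and Apple keep most of their discount, while anything at or beyond the cap (the, of, and) is treated as fully general and gets none.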

The generated model

A new model was generated using the new discount mechanism. I tried to evaluate how much crap and how many new useful topics I was getting. I couldn't see any crap coming through, and the sfs named in the Problem section got spotted (except for Google).
