Last active
August 29, 2015 14:15
Bayes Factor in Scalding
$ scalding/scripts/scald.rb --repl --local
scalding> case class BuyNotBuy(token:String, app:String, buy:Boolean)
defined class BuyNotBuy
scalding> val pipe = TextLine("bnblist.txt")
  .read
  .map( 'line -> 'line ){x:String =>
    val cols = x.split("\t")
    // the sample data marks paid users with "T"; String.toBoolean only accepts "true"/"false"
    BuyNotBuy(cols(0), cols(1), cols(2).trim == "T")
  }
  .toTypedPipe[BuyNotBuy]('line)
pipe: com.twitter.scalding.typed.TypedPipe[BuyNotBuy] = com.twitter.scalding.typed.TypedPipeInst@7b480442
scalding> pipe.groupBy{ x => x.app }
  .foldLeft((Set[String](), Set[String]())){(a,b) => if (b.buy) (a._1 ++ Set(b.token), a._2) else (a._1, a._2 ++ Set(b.token))} // (paid users, free users) per app
  .map{ x=> (x._1,x._2._1.size, x._2._2.size)} // distinct user counts
  .map{ x=> (x._1, if (x._2 == 0) 0.00001 else x._2, if (x._3 ==0) 0.00001 else x._3)} // avoid a zero numerator or denominator
  .map{ x => (x._1, x._2/(x._3+0.0))} // paid/free ratio
  .groupAll
  .sortBy{ x=> -x._2} // highest ratio first
  .values
  .save(TypedTsv[(String, Double)]("bayesfactors.txt"))
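The fold in the session above may be easier to follow outside Scalding. The sketch below is plain Scala collections, not Scalding, and the names FoldSketch, step, smooth and ratio are made up for illustration. It assumes the rows passed in all belong to one app, which is what groupBy arranges in the session.

```scala
object FoldSketch {
  // Same shape as the session's case class; redefined so the sketch stands alone.
  case class BuyNotBuy(token: String, app: String, buy: Boolean)

  // Accumulator for one app: (distinct paid users, distinct free users).
  type Acc = (Set[String], Set[String])

  // One fold step: add the user to the paid set or the free set.
  def step(acc: Acc, row: BuyNotBuy): Acc =
    if (row.buy) (acc._1 + row.token, acc._2)
    else (acc._1, acc._2 + row.token)

  // Replace a zero count with a small positive value, as the session does.
  def smooth(n: Int): Double = if (n == 0) 0.00001 else n.toDouble

  // Paid/free ratio for the rows of a single app.
  def ratio(rows: Seq[BuyNotBuy]): Double = {
    val start: Acc = (Set.empty[String], Set.empty[String])
    val (paid, free) = rows.foldLeft(start)(step)
    smooth(paid.size) / smooth(free.size)
  }
}
```

Because the accumulator holds sets, a user who appears twice for the same app is counted once.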
Reference: http://en.wikipedia.org/wiki/Bayes_factor
The Bayes factor is a Bayesian model-selection measure: the ratio of the likelihoods of the observed data under two competing hypotheses.
Motivating Example:
Say you sell a freemium app.
All of your users, free and paid, download your app.
a. paid users - A very small percentage of your users will actually buy your app. Great!
b. free users - Based on what other apps they run on their phone, you'd like to predict whether your free users will buy your app.
Say you have a large dataset (millions of rows) that contains the above info:
----- bnblist.txt ---------
UserID  App       Paid?
u1      farmvile  T
u2      viber     F
u2      facebook  F
u2      twitter   F
u3      chrome    T
u3      facebook  T
-----
Columns (tab-separated in the file):
UserID identifies a user (free or paid) of your app.
App is the middle column: you are able to datamine users' phones & figure out what other apps they are using.
Paid? indicates whether the user is a paid user (T) or a free user (F).
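As a rough illustration of the row format, here is a minimal parser in plain Scala. It assumes tab-separated columns and a T/F flag, as in the sample. RowParser and parse are hypothetical names, and returning None for anything that does not match (a header line, a blank line) is a design choice of this sketch rather than something the original session does.

```scala
// Mirrors the case class defined in the REPL session.
case class BuyNotBuy(token: String, app: String, buy: Boolean)

object RowParser {
  // Assumption: exactly three tab-separated columns, with the flag "T" or "F".
  def parse(line: String): Option[BuyNotBuy] =
    line.split("\t") match {
      case Array(user, app, "T") => Some(BuyNotBuy(user, app, buy = true))
      case Array(user, app, "F") => Some(BuyNotBuy(user, app, buy = false))
      case _                     => None // header, blank, or malformed line
    }
}
```

Skipping unparseable lines keeps one bad row from failing a job over millions of rows, at the cost of hiding data problems; counting the skipped lines would be a reasonable addition.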
Procedure:
Group by app.
For every app a, compute the ratio of conditional probabilities:
K = Probability(app a | paid user) / Probability(app a | free user)
If K is high (10 and above), there is strong evidence that a user who runs app a will convert to paid!
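One way to estimate K from the data is with user fractions: Probability(app a | paid user) is taken as (paid users running a) / (all paid users), and likewise for free users. That estimator is an assumption of this sketch. Note that the REPL session divides the two per-app user counts directly, which equals K multiplied by the constant (all paid users) / (all free users). The ranking of apps is the same either way, but comparing against a threshold such as 10 needs the normalized value. The sketch below is plain Scala with hypothetical names (BayesFactorSketch, share, bayesFactors) and uses the sample rows from bnblist.txt.

```scala
object BayesFactorSketch {
  // (user, app, paid) rows; these are the sample rows from bnblist.txt.
  val rows: Seq[(String, String, Boolean)] = Seq(
    ("u1", "farmvile", true),
    ("u2", "viber", false),
    ("u2", "facebook", false),
    ("u2", "twitter", false),
    ("u3", "chrome", true),
    ("u3", "facebook", true)
  )

  // Fraction of the given users that run `app`.
  def share(app: String, users: Set[String], installs: Set[(String, String)]): Double =
    users.count(u => installs((u, app))).toDouble / users.size

  // Estimated K per app. A zero free-user share yields Infinity here,
  // unlike the session, which substitutes 0.00001 for zero counts.
  def bayesFactors(rows: Seq[(String, String, Boolean)]): Map[String, Double] = {
    val paidUsers = rows.collect { case (u, _, true) => u }.toSet
    val freeUsers = rows.collect { case (u, _, false) => u }.toSet
    val installs  = rows.map { case (u, a, _) => (u, a) }.toSet
    rows.map(_._2).distinct.map { a =>
      a -> (share(a, paidUsers, installs) / share(a, freeUsers, installs))
    }.toMap
  }
}
```

On the sample rows, facebook is run by one of the two paid users and by the single free user, so its estimated K is 0.5, while the session's unnormalized ratio for the same app is 1.0.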