@krishnanraman
Last active August 29, 2015 14:15
Bayes Factor in Scalding
$ scalding/scripts/scald.rb --repl --local
scalding> case class BuyNotBuy(token:String, app:String, buy:Boolean)
defined class BuyNotBuy
scalding> val pipe = TextLine("bnblist.txt")
  .read
  .map('line -> 'line) { x: String =>
    val cols = x.split("\t")
    // The third column holds "T" or "F"; "T".toBoolean would throw, so compare explicitly.
    BuyNotBuy(cols(0), cols(1), cols(2) == "T")
  }
  .toTypedPipe[BuyNotBuy]('line)
pipe: com.twitter.scalding.typed.TypedPipe[BuyNotBuy] = com.twitter.scalding.typed.TypedPipeInst@7b480442
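scalding> pipe.groupBy { x => x.app }
  // Per app: (distinct users who bought, distinct users who did not).
  .foldLeft((Set[String](), Set[String]())) { (a, b) =>
    if (b.buy) (a._1 + b.token, a._2) else (a._1, a._2 + b.token)
  }
  .map { x => (x._1, x._2._1.size, x._2._2.size) }
  // Replace zero counts with a small constant to avoid division by zero.
  .map { x => (x._1, if (x._2 == 0) 0.00001 else x._2, if (x._3 == 0) 0.00001 else x._3) }
  .map { x => (x._1, x._2 / (x._3 + 0.0)) }
  // Sort apps by descending ratio and write (app, ratio) pairs.
  .groupAll
  .sortBy { x => -x._2 }
  .values
  .save(TypedTsv[(String, Double)]("bayesfactors.txt"))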
Reference: http://en.wikipedia.org/wiki/Bayes_factor
The Bayes factor is a Bayesian model selection measure: the ratio of the likelihoods of the observed data under two competing hypotheses.
Motivating Example:
Say you sell a freemium app.
Every user, free or paid, has downloaded your app.
a. paid users - A very small percentage of your users will actually buy your app. Great!
b. free users - Based on what other apps they run on their phone, you'd like to predict whether your free users will buy your app.
Say you have a large dataset (millions of rows) that contains the above info:
----- bnblist.txt ---------
UserID App Paid?
u1 farmvile T
u2 viber F
u2 facebook F
u2 twitter F
u3 chrome T
u3 facebook T
-----
Columns (tab-separated):
UserID - the users (free + paid) of your app.
App - another app found on that user's phone. You are able to data-mine users' phones and figure out what other apps they are using.
Paid? - T if the user is a paid user, F otherwise.
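A line of the file above can be turned into a BuyNotBuy record roughly as follows. This is a minimal sketch in plain Scala rather than Scalding. ParseSketch and parseLine are hypothetical names, and it assumes tab-separated fields with T or F in the third column, as in the sample; anything else (including a header row, if the file has one) is skipped.

```scala
object ParseSketch {
  // Same shape as the case class defined in the REPL session.
  case class BuyNotBuy(token: String, app: String, buy: Boolean)

  // Hypothetical helper: returns None for a header row or a malformed line.
  def parseLine(line: String): Option[BuyNotBuy] =
    line.split("\t") match {
      case Array(user, app, "T") => Some(BuyNotBuy(user, app, buy = true))
      case Array(user, app, "F") => Some(BuyNotBuy(user, app, buy = false))
      case _                     => None
    }
}
```

Returning an Option keeps bad rows from aborting a job over millions of lines; whether silently dropping them is acceptable depends on the dataset.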
Procedure:
Group by app.
For every app a, compute the ratio of conditional probabilities:
K = Probability(app a | paid user) / Probability(app a | free user)
If K is high (10 and above), there is strong evidence that a user who runs app a will convert to paid!
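The procedure can be sketched on the six sample rows with plain Scala collections. This is an illustration under stated assumptions, not the Scalding job itself: BayesFactorSketch, bayesFactors, Epsilon and sample are hypothetical names, the 0.00001 stand-in for a zero count is borrowed from the REPL session, and the data is assumed to contain at least one paid and one free user. Note that the REPL session divides raw per-app user counts; aside from the zero-count cases, that ratio equals K times (paid users / free users), a constant, so the ranking of apps is the same but the threshold of 10 applies to K as normalized here.

```scala
object BayesFactorSketch {
  case class BuyNotBuy(token: String, app: String, buy: Boolean)

  // Stand-in for a zero count, mirroring the 0.00001 used in the REPL session.
  val Epsilon = 0.00001

  // K(app) = P(app | paid user) / P(app | free user), estimated from distinct users.
  // Assumes rows contain at least one paid and one free user.
  def bayesFactors(rows: Seq[BuyNotBuy]): Map[String, Double] = {
    val paidUsers = rows.filter(_.buy).map(_.token).toSet.size
    val freeUsers = rows.filterNot(_.buy).map(_.token).toSet.size

    rows.groupBy(_.app).map { case (app, group) =>
      val paidWithApp = group.filter(_.buy).map(_.token).toSet.size
      val freeWithApp = group.filterNot(_.buy).map(_.token).toSet.size
      val pAppGivenPaid = (if (paidWithApp == 0) Epsilon else paidWithApp.toDouble) / paidUsers
      val pAppGivenFree = (if (freeWithApp == 0) Epsilon else freeWithApp.toDouble) / freeUsers
      app -> pAppGivenPaid / pAppGivenFree
    }
  }

  // The six sample rows from bnblist.txt.
  val sample: Seq[BuyNotBuy] = Seq(
    BuyNotBuy("u1", "farmvile", buy = true),
    BuyNotBuy("u2", "viber", buy = false),
    BuyNotBuy("u2", "facebook", buy = false),
    BuyNotBuy("u2", "twitter", buy = false),
    BuyNotBuy("u3", "chrome", buy = true),
    BuyNotBuy("u3", "facebook", buy = true)
  )
}
```

On the sample, facebook is run by one of the two paid users and by the only free user, so K = (1/2) / (1/1) = 0.5. Apps seen only among paid users (farmvile, chrome) get a very large K purely because of the Epsilon substitution, which says more about the tiny sample than about the apps.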