The other day on my regular commute, I listened to another brilliant episode of Linear Digressions, “Better Know a Distribution: The Poisson Distribution”, and I thought it would be a nice topic to explain with some code (in R) in a blog post. So here goes.
According to Wikipedia, the Poisson distribution, named after the French mathematician Siméon Denis Poisson, is a discrete probability distribution that expresses the probability of a given number of events occurring in a fixed interval of time or space, if these events occur with a known constant rate and independently of the time since the last event.
Let’s understand what exactly that means.
https://gist.github.com/3576b8a37d27ee3c67fb5c4c26a21408
https://gist.github.com/d76c6cb2d649c1e6e9c079a124702f2a
For this exercise, I was looking for FIFA match data. Using the new resource from our friends over at Google (Google Dataset Search), I found this amazing dataset: International football results from 1872 to 2018. It covers international soccer matches from 1872 to 2018, all 39,669 of them!
Reading the data in.
https://gist.github.com/036f9b8e854017f20c7a8ba05fb19d4a
## # A tibble: 39,669 x 9
## date home_team away_team home_score away_score tournament city
## <date> <chr> <chr> <int> <int> <chr> <chr>
## 1 1872-11-30 Scotland England 0 0 Friendly Glas…
## 2 1873-03-08 England Scotland 4 2 Friendly Lond…
## 3 1874-03-07 Scotland England 2 1 Friendly Glas…
## 4 1875-03-06 England Scotland 2 2 Friendly Lond…
## 5 1876-03-04 Scotland England 3 0 Friendly Glas…
## 6 1876-03-25 Scotland Wales 4 0 Friendly Glas…
## 7 1877-03-03 England Scotland 1 3 Friendly Lond…
## 8 1877-03-05 Wales Scotland 0 2 Friendly Wrex…
## 9 1878-03-02 Scotland England 7 2 Friendly Glas…
## 10 1878-03-23 Scotland Wales 9 0 Friendly Glas…
## # ... with 39,659 more rows, and 2 more variables: country <chr>,
## # neutral <lgl>
https://gist.github.com/b300c271d37ebb43c81a906f078f1109
## date home_team away_team
## Min. :1872-11-30 Length:39669 Length:39669
## 1st Qu.:1977-02-02 Class :character Class :character
## Median :1996-10-06 Mode :character Mode :character
## Mean :1989-10-17
## 3rd Qu.:2008-01-22
## Max. :2018-07-10
## home_score away_score tournament city
## Min. : 0.000 Min. : 0.000 Length:39669 Length:39669
## 1st Qu.: 1.000 1st Qu.: 0.000 Class :character Class :character
## Median : 1.000 Median : 1.000 Mode :character Mode :character
## Mean : 1.748 Mean : 1.188
## 3rd Qu.: 2.000 3rd Qu.: 2.000
## Max. :31.000 Max. :21.000
## country neutral
## Length:39669 Mode :logical
## Class :character FALSE:29848
## Mode :character TRUE :9821
##
##
##
- We have data from Nov 30th 1872 to July 10th 2018. Whoo!
- A max home_score value of 31 and max away_score of 21?! Some matches to look into!
- About 25% of the matches are played in neutral territory. Are these all World Cup matches?
https://gist.github.com/51eb72325433681fec3d092c430783e6
For a random variable k to be Poisson, the following 4 conditions need to hold (Wikipedia):
- k is the number of times an event occurs in an interval, and k can take values 0, 1, 2, …, i.e. k needs to be an integer (a major distinction from the more popular Gaussian distribution, where the variable is continuous).
- The occurrence of one event does not affect the probability that a second event will occur. That is, events occur independently.
- The rate at which events occur is constant. The rate cannot be higher in some intervals and lower in other intervals.
- Two events cannot occur at exactly the same instant; instead, at each very small sub-interval exactly one event either occurs or does not occur. Or: the actual probability distribution is given by a binomial distribution and the number of trials is sufficiently bigger than the number of successes one is asking about.
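The binomial alternative in the last condition can be made concrete with a small sketch. This is not part of the original analysis: the values of `lambda` and `k` below are illustrative assumptions, and `max_gap` is a helper name made up for this example. The idea is to split an interval into `n` sub-intervals, each with success probability `lambda / n`, and watch the binomial probabilities approach the Poisson ones as `n` grows.

```r
# Illustrative values, not taken from the dataset
lambda <- 3      # assumed mean number of events per interval
k <- 0:10        # event counts to compare

# Largest absolute gap between Binomial(n, lambda / n) and Poisson(lambda)
# probabilities over the counts in k (hypothetical helper for this sketch)
max_gap <- function(n) {
  max(abs(dbinom(k, size = n, prob = lambda / n) - dpois(k, lambda)))
}

# The gap shrinks as the number of trials grows relative to the successes
sapply(c(10, 100, 10000), max_gap)
```

With many trials and comparatively few successes, the two distributions become practically indistinguishable, which is what the condition asks for.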
Now, let’s first identify our k and the interval and see if they satisfy the above 4 conditions. Let’s explore the following 3 options:
- Option 1: k is the total number of goals and the interval is 1 year.
- Option 2: k is the total number of goals and the interval is 1 day.
- Option 3: k is the total number of goals and the interval is 1 match.
We have chosen our 3 options such that conditions 1 and 2 always hold, i.e. the number of goals is always an integer and one goal is independent of another (for the most part). We still need to check conditions 3 and 4 for each of these options.
https://gist.github.com/2408a8f7b6b76721658cb32d3b247c44
## # A tibble: 147 x 6
## year `mean(totalGoal… `sum(totalGoals… `min(totalGoals…
## <fct> <dbl> <int> <dbl>
## 1 1872 0 0 0
## 2 1873 6 6 6
## 3 1874 3 3 3
## 4 1875 4 4 4
## 5 1876 3.5 7 3
## 6 1877 3 6 2
## 7 1878 9 18 9
## 8 1879 5 15 3
## 9 1880 6.67 20 5
## 10 1881 4.67 14 1
## # ... with 137 more rows, and 2 more variables: `max(totalGoals)` <dbl>,
## # `median(totalGoals)` <dbl>
https://gist.github.com/b5d67d1ce84b83da148b49b250908f20
https://gist.github.com/3a525955ae51637ee5530318fcc57cb7
As we see in the above 2 plots, the mean number of goals
remains more or less constant over the years, but the total number of
goals per year increases. This violates condition 3 for a
Poisson distribution. Also, as per condition 4, the number of trials should
be sufficiently bigger than the number of successes, which is also violated
in this case because we have 147 trials (i.e. the number of years in the
dataset) and successes on the order of ~1000 or more (i.e. the total number
of goals per year).
Intuitively, too, if more matches are played in a year, more goals
will be scored in total that year, so the rate is not constant and
condition 3 is violated.
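A quick simulation can illustrate why a growing number of matches breaks the constant-rate condition. This is a sketch on made-up numbers, not the real dataset: `lambda_match` (roughly 3 goals per match) and the `matches_per_year` schedule are assumptions chosen only for illustration. For a single Poisson distribution the variance equals the mean, so yearly totals with a variance far above their mean cannot come from one fixed rate.

```r
set.seed(42)  # reproducible simulation

lambda_match <- 3  # assumed mean goals per match (illustrative)

# Hypothetical schedule: fixtures grow from 5 to 900 per year over 147 years
matches_per_year <- round(seq(5, 900, length.out = 147))

# Total goals in a year = sum of per-match Poisson counts for that year
yearly_goals <- sapply(matches_per_year, function(m) sum(rpois(m, lambda_match)))

mean(yearly_goals)  # average yearly total
var(yearly_goals)   # far larger than the mean, unlike a single Poisson
```

Each simulated match is Poisson, yet the yearly totals are not, purely because the number of matches per year changes.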
Based on the above, we can expect that option 2 (i.e. the total number of goals in 1 day), although closer to a Poisson distribution than option 1, still won’t be one: more matches in a day mean more goals, which violates condition 3 that the rate at which events occur must be constant. Let’s visualize this for option 2.
https://gist.github.com/5c668b62a25387262f42dde2cf3a159f
## # A tibble: 14,863 x 2
## date n
## <date> <int>
## 1 2012-02-29 66
## 2 2016-03-29 63
## 3 2008-03-26 60
## 4 2014-03-05 59
## 5 2012-11-14 56
## 6 2011-10-11 54
## 7 2011-11-11 54
## 8 2008-10-11 53
## 9 2011-11-15 53
## 10 2011-09-02 52
## # ... with 14,853 more rows
https://gist.github.com/712a6381347442f21244ac03fc3b8c79
## # A tibble: 14,863 x 6
## date `mean(totalGoal… `sum(totalGoals… `min(totalGoals…
## <date> <dbl> <int> <dbl>
## 1 2012-02-29 2.73 180 0
## 2 2016-03-29 2.79 176 0
## 3 2008-03-26 2.72 163 0
## 4 2011-10-11 3 162 0
## 5 2014-03-05 2.66 157 0
## 6 2013-10-15 3.08 154 0
## 7 2008-10-11 2.85 151 0
## 8 2011-09-02 2.88 150 0
## 9 2011-09-06 2.87 149 0
## 10 2011-11-15 2.77 147 0
## # ... with 14,853 more rows, and 2 more variables:
## # `max(totalGoals)` <dbl>, `median(totalGoals)` <dbl>
https://gist.github.com/0d42a1ad2cc8b83723cd23fa49436d5f
https://gist.github.com/201dd85fd8eeb58d0ac88c02680acb84
So, even though the number of successes is fairly low compared to the number of
trials (condition 4 satisfied), the rate at which events occur is not constant
and depends on the number of matches played on a given day. Therefore, we
reject option 2 as a Poisson distribution as well.
Let’s finally explore option 3.
https://gist.github.com/e91b8f8f1e8f4f1c9524664c1fa6c46d
## # A tibble: 1 x 1
## `mean(totalGoals)`
## <dbl>
## 1 2.94
https://gist.github.com/01b3e3f4755560570684977b73a04c8c
Eureka! We have a constant rate of goals per match, with a peak at around 3 goals and a mean of 2.935642 goals per match. The number of goals scored (‘the event’ being a goal being scored) is an integer, one goal is independent of another, and the number of matches (i.e. trials) is far higher than the number of goals (i.e. successes) per match. Therefore, we have found our Poisson distribution!
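One way to sanity-check a claim like this is to compare the sample mean with the sample variance (they should be close for Poisson data) and to compare observed frequencies with the Poisson probabilities implied by the mean. The sketch below is an assumption-laden illustration: `poisson_check` is a hypothetical helper, and `simulated_goals` is simulated stand-in data, not the real match results. With the real data one would presumably pass in the per-match total, i.e. `home_score + away_score`.

```r
# Compare observed count frequencies with the Poisson probabilities
# implied by the sample mean. `goals` is a vector of non-negative integers.
poisson_check <- function(goals, max_k = 8) {
  lambda_hat <- mean(goals)
  k <- 0:max_k
  data.frame(
    k = k,
    observed = sapply(k, function(i) mean(goals == i)),  # share of matches with i goals
    expected = dpois(k, lambda_hat)                      # Poisson probability of i goals
  )
}

set.seed(1)
# Stand-in data: simulated, NOT the real match results
simulated_goals <- rpois(39669, lambda = 2.94)

# For Poisson data these two numbers should be close to each other
c(mean = mean(simulated_goals), variance = var(simulated_goals))

poisson_check(simulated_goals)
```

If the observed and expected columns diverge badly on real data, or the variance is much larger than the mean, the Poisson assumption deserves a second look.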
Now that we have our Poisson distribution, we can calculate the
probability of k events happening in an interval using the
following:

$$P(k \text{ events in an interval}) = e^{-\lambda} \cdot \frac{\lambda^k}{k!}$$

where,
$\lambda$ = mean number of events per interval, i.e. mean number of goals per match,
k = number of events for probability estimation, i.e. number of goals,
e = Euler’s number and
k! = the factorial of k.
From our exploration above, the mean number of goals is λ = 2.935642. We can plug this value into the formula above to calculate the probability of any number of goals being scored in a match.
For example,

$$P(5 \text{ goals scored in a match}) = e^{-2.935642} \cdot \frac{2.935642^5}{5!} = 0.09647195841$$
Let’s use R to calculate the above.
https://gist.github.com/1d58b075eeb730598616226ea767c802
## [1] 0.09647199
And we see the same value as calculated above, to seven decimal places.
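As a cross-check, the formula can also be written out by hand and compared with R’s built-in `dpois()`. This is a minimal sketch: `poisson_pmf` is a name made up for this example, and `lambda` is the rounded mean quoted in the text.

```r
# Poisson probability written out directly from the formula:
# P(k) = exp(-lambda) * lambda^k / k!
poisson_pmf <- function(k, lambda) {
  exp(-lambda) * lambda^k / factorial(k)
}

lambda <- 2.935642  # mean goals per match, as quoted in the text

poisson_pmf(5, lambda)  # probability of exactly 5 goals, by hand
dpois(5, lambda)        # the same probability from the built-in function

# The two versions agree for every k from 0 to 8
all.equal(poisson_pmf(0:8, lambda), dpois(0:8, lambda))
```

In practice `dpois()` is the safer choice, since `factorial(k)` overflows for large k, but the hand-written version makes the link to the formula explicit.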
We can also see how the probability varies as we increase the number of
events, i.e. the number of goals, from 0 to 8.
https://gist.github.com/cef2f6ea6d1a96c3d8a6ad75dd4992aa
The Poisson distribution’s probability formula can be a nifty little trick to have up one’s sleeve for evaluating the probability of an event happening. It is also widely used in industry, with applications such as estimating the probability of k customers arriving at a store in order to optimize resources, or the probability that a webpage has seen k updates, so that a search engine can optimize the rate at which it crawls that page.