@adityamangal410
Created November 14, 2018 16:32

The Poisson Distribution

The other day on my regular commute, I listened to another brilliant episode of Linear Digressions, “Better Know a Distribution: The Poisson Distribution”, and thought it would be a nice topic to explain as a blog post, aided by some code (in R). So here goes.

According to Wikipedia, the Poisson distribution, named after the French mathematician Siméon Denis Poisson, is a discrete probability distribution that expresses the probability of a given number of events occurring in a fixed interval of time or space if these events occur with a known constant rate and independently of the time since the last event.

Let's understand what exactly that means.

Environment Setup

Clean up

https://gist.github.com/3576b8a37d27ee3c67fb5c4c26a21408

Load libraries

https://gist.github.com/d76c6cb2d649c1e6e9c079a124702f2a

Data

For this exercise, I was looking for data on FIFA matches and, using Google Dataset Search (the latest resource from our friends over at Google), I found this amazing dataset: International football results from 1872 to 2018. It covers all the international soccer matches from 1872 to 2018, all 39,669 of them!
Reading it in.

https://gist.github.com/036f9b8e854017f20c7a8ba05fb19d4a

## # A tibble: 39,669 x 9
##    date       home_team away_team home_score away_score tournament city 
##    <date>     <chr>     <chr>          <int>      <int> <chr>      <chr>
##  1 1872-11-30 Scotland  England            0          0 Friendly   Glas…
##  2 1873-03-08 England   Scotland           4          2 Friendly   Lond…
##  3 1874-03-07 Scotland  England            2          1 Friendly   Glas…
##  4 1875-03-06 England   Scotland           2          2 Friendly   Lond…
##  5 1876-03-04 Scotland  England            3          0 Friendly   Glas…
##  6 1876-03-25 Scotland  Wales              4          0 Friendly   Glas…
##  7 1877-03-03 England   Scotland           1          3 Friendly   Lond…
##  8 1877-03-05 Wales     Scotland           0          2 Friendly   Wrex…
##  9 1878-03-02 Scotland  England            7          2 Friendly   Glas…
## 10 1878-03-23 Scotland  Wales              9          0 Friendly   Glas…
## # ... with 39,659 more rows, and 2 more variables: country <chr>,
## #   neutral <lgl>

Explore

https://gist.github.com/b300c271d37ebb43c81a906f078f1109

##       date             home_team          away_team        
##  Min.   :1872-11-30   Length:39669       Length:39669      
##  1st Qu.:1977-02-02   Class :character   Class :character  
##  Median :1996-10-06   Mode  :character   Mode  :character  
##  Mean   :1989-10-17                                        
##  3rd Qu.:2008-01-22                                        
##  Max.   :2018-07-10                                        
##    home_score       away_score      tournament            city          
##  Min.   : 0.000   Min.   : 0.000   Length:39669       Length:39669      
##  1st Qu.: 1.000   1st Qu.: 0.000   Class :character   Class :character  
##  Median : 1.000   Median : 1.000   Mode  :character   Mode  :character  
##  Mean   : 1.748   Mean   : 1.188                                        
##  3rd Qu.: 2.000   3rd Qu.: 2.000                                        
##  Max.   :31.000   Max.   :21.000                                        
##    country           neutral       
##  Length:39669       Mode :logical  
##  Class :character   FALSE:29848    
##  Mode  :character   TRUE :9821     
##                                    
##                                    
## 

Looks like the data is complete and tidy. A few interesting observations:

  1. We have data from Nov 30th 1872 to July 10th 2018. Whoo!
  2. A max home_score value of 31 and max away_score of 21?! Some matches to look into!
  3. About 25% of the matches are played in neutral territory. Are these all World Cup matches?

Let's generate some more interesting features.

https://gist.github.com/51eb72325433681fec3d092c430783e6
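The feature code itself lives in the gist above. As a rough sketch of the two derived columns that show up in the later output (`totalGoals` and `year`), here is a base R version on a hand-typed copy of the first three rows of the dataset. The data frame name `results` is my assumption, and the original may well use dplyr instead.

```r
# First three matches, copied by hand from the tibble printed above.
# The name `results` is a placeholder, not taken from the original code.
results <- data.frame(
  date       = as.Date(c("1872-11-30", "1873-03-08", "1874-03-07")),
  home_team  = c("Scotland", "England", "Scotland"),
  away_team  = c("England", "Scotland", "England"),
  home_score = c(0L, 4L, 2L),
  away_score = c(0L, 2L, 1L)
)

# Total goals in a match is the sum of both teams' scores.
results$totalGoals <- results$home_score + results$away_score

# Calendar year as a factor, handy for grouping by year later on.
results$year <- factor(format(results$date, "%Y"))

print(results[, c("date", "totalGoals", "year")])
```

The totals of 0, 6 and 3 goals agree with the 1872, 1873 and 1874 rows of the per-year summary further down.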

When is the Poisson Distribution appropriate?

For a random variable k to follow a Poisson distribution, the following 4 conditions need to hold (Wikipedia):

  1. k is the number of times an event occurs in an interval and k can take values 0, 1, 2, …, i.e. k needs to be a non-negative integer (a major distinction from the more popular Gaussian distribution, where the variable is continuous).

  2. The occurrence of one event does not affect the probability that a second event will occur. That is, events occur independently.

  3. The rate at which events occur is constant. The rate cannot be higher in some intervals and lower in other intervals.

  4. Two events cannot occur at exactly the same instant; instead, at each very small sub-interval exactly one event either occurs or does not occur.
    Or
    The actual probability distribution is given by a binomial distribution and the number of trials is sufficiently bigger than the number of successes one is asking about.
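The binomial alternative in condition 4 can be checked numerically. The sketch below uses illustrative values of my own choosing (a rate of 3 events per interval, split into n small trials with success probability 3/n) and measures how far the binomial probabilities are from the Poisson ones as n grows.

```r
lambda <- 3      # illustrative rate, not taken from the football data
k <- 0:10        # event counts to compare

# Largest absolute gap between Binomial(n, lambda / n) and Poisson(lambda)
# over the counts in k, for an increasing number of trials n.
n_trials <- c(10, 100, 10000)
max_gap <- sapply(n_trials, function(n) {
  max(abs(dbinom(k, size = n, prob = lambda / n) - dpois(k, lambda)))
})

print(data.frame(n_trials = n_trials, max_gap = max_gap))
```

The gap shrinks as the number of trials grows relative to the expected number of successes, which is the sense in which the number of trials has to be "sufficiently bigger".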

Now, let's first identify our k and the interval and see whether they satisfy the above 4 conditions. Let's explore the following 3 options:

  1. k is total number of goals and interval is 1 year.
  2. k is total number of goals and interval is 1 day.
  3. k is total number of goals and interval is 1 match.

We have chosen our 3 options so that Conditions 1 and 2 always hold, i.e. the number of goals is always an integer and one goal is independent of another (for the most part). We still need to explore Conditions 3 and 4 for each of these options.

1. k is total number of goals and interval is 1 year

https://gist.github.com/2408a8f7b6b76721658cb32d3b247c44

## # A tibble: 147 x 6
##    year  `mean(totalGoal… `sum(totalGoals… `min(totalGoals…
##    <fct>            <dbl>            <int>            <dbl>
##  1 1872              0                   0                0
##  2 1873              6                   6                6
##  3 1874              3                   3                3
##  4 1875              4                   4                4
##  5 1876              3.5                 7                3
##  6 1877              3                   6                2
##  7 1878              9                  18                9
##  8 1879              5                  15                3
##  9 1880              6.67               20                5
## 10 1881              4.67               14                1
## # ... with 137 more rows, and 2 more variables: `max(totalGoals)` <dbl>,
## #   `median(totalGoals)` <dbl>

https://gist.github.com/b5d67d1ce84b83da148b49b250908f20

https://gist.github.com/3a525955ae51637ee5530318fcc57cb7

As we see in the above 2 plots, even though the mean number of goals per match remains more or less constant over the years, the total number of goals per year increases. This violates condition 3, so the yearly total cannot be Poisson distributed. Also, as per condition 4, the number of trials should be sufficiently bigger than the number of successes, which is also violated in this case because we have 147 trials (i.e. the number of years in the dataset) and successes of the order of ~1000 or more (i.e. the total number of goals per year).
Intuitively too, if more matches are played in a year, more goals will be scored in that year, which violates condition 3.

Based on the above, we can expect that option 2 (i.e. total number of goals in 1 day), although closer to a Poisson distribution than option 1, still will not be one: more matches in a day mean more goals, which violates condition 3 that the rate at which events occur must be constant. Let's visualize this for option 2.
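A small simulation makes the same point without the real data. In the sketch below every number is an assumption chosen for illustration: each match gets a Poisson number of goals with a fixed mean of 2.94, while the number of matches per year grows linearly from 5 to 900 over 147 years. It relies on the standard fact that a Poisson variable has variance equal to its mean.

```r
set.seed(42)

goals_per_match <- 2.94   # assumed constant per-match rate
# Hypothetical schedule: matches per year growing from 5 to 900.
matches_per_year <- round(seq(5, 900, length.out = 147))

# Total goals in each simulated year.
yearly_totals <- sapply(matches_per_year, function(n) {
  sum(rpois(n, goals_per_match))
})

# Goals per match stay flat from year to year...
yearly_means <- yearly_totals / matches_per_year

# ...but the yearly totals are far more spread out than a single
# Poisson distribution allows (its variance would equal its mean).
dispersion <- var(yearly_totals) / mean(yearly_totals)
print(dispersion)
```

A ratio far above 1 says the yearly totals do not share one constant rate, even though each individual match does.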

2. k is total number of goals and interval is 1 day

https://gist.github.com/5c668b62a25387262f42dde2cf3a159f

## # A tibble: 14,863 x 2
##    date           n
##    <date>     <int>
##  1 2012-02-29    66
##  2 2016-03-29    63
##  3 2008-03-26    60
##  4 2014-03-05    59
##  5 2012-11-14    56
##  6 2011-10-11    54
##  7 2011-11-11    54
##  8 2008-10-11    53
##  9 2011-11-15    53
## 10 2011-09-02    52
## # ... with 14,853 more rows

https://gist.github.com/712a6381347442f21244ac03fc3b8c79

## # A tibble: 14,863 x 6
##    date       `mean(totalGoal… `sum(totalGoals… `min(totalGoals…
##    <date>                <dbl>            <int>            <dbl>
##  1 2012-02-29             2.73              180                0
##  2 2016-03-29             2.79              176                0
##  3 2008-03-26             2.72              163                0
##  4 2011-10-11             3                 162                0
##  5 2014-03-05             2.66              157                0
##  6 2013-10-15             3.08              154                0
##  7 2008-10-11             2.85              151                0
##  8 2011-09-02             2.88              150                0
##  9 2011-09-06             2.87              149                0
## 10 2011-11-15             2.77              147                0
## # ... with 14,853 more rows, and 2 more variables:
## #   `max(totalGoals)` <dbl>, `median(totalGoals)` <dbl>

https://gist.github.com/0d42a1ad2cc8b83723cd23fa49436d5f

https://gist.github.com/201dd85fd8eeb58d0ac88c02680acb84

So, even though the number of successes is fairly low compared to the number of trials (condition 4 satisfied), the rate at which events occur is not constant and depends on the number of matches played that day. Therefore, we reject option 2 as a Poisson distribution as well.
Let's finally explore option 3.

3. k is total number of goals and interval is 1 match.

https://gist.github.com/e91b8f8f1e8f4f1c9524664c1fa6c46d

## # A tibble: 1 x 1
##   `mean(totalGoals)`
##                <dbl>
## 1               2.94

https://gist.github.com/01b3e3f4755560570684977b73a04c8c

Eureka! We have a constant rate of goals per match, with a peak at around 3 goals and a mean of 2.935642 goals per match. The number of goals scored (‘the event’ being a goal scored) is an integer, one goal is independent of another, and the number of matches (i.e. trials) is far higher than the number of goals (i.e. successes) per match. Therefore, we have found our Poisson distribution!
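One way to put numbers behind this kind of visual check is to compare the observed share of matches with k goals against the share a Poisson distribution with the same mean would predict. The sketch below is only an illustration: the helper `compare_to_poisson` is a name I made up, and it is run on simulated counts rather than the real `totalGoals` column.

```r
# Hypothetical helper: observed vs. Poisson-expected proportions per count.
compare_to_poisson <- function(counts) {
  lambda_hat <- mean(counts)          # estimate the rate from the data
  k <- 0:max(counts)
  observed <- as.numeric(table(factor(counts, levels = k))) / length(counts)
  data.frame(k = k, observed = observed, expected = dpois(k, lambda_hat))
}

# Stand-in data: simulated goals for 39,669 matches at the mean found above.
set.seed(1)
simulated_goals <- rpois(39669, lambda = 2.935642)

fit <- compare_to_poisson(simulated_goals)
print(round(fit, 4))
```

On the real data, large gaps between the two columns would be a sign that the Poisson assumption is shaky for per-match goals.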

Probability of events for a Poisson Distribution

Now that we have our Poisson distribution, we can calculate the probability of k events happening in an interval using the following:

$P(k \text{ events in an interval}) = \frac{e^{-\lambda} \lambda^{k}}{k!}$ where,
$\lambda$ = mean number of events per interval, i.e. mean number of goals per match,
$k$ = number of events for probability estimation, i.e. number of goals,
$e$ = Euler's number and
$k!$ = the factorial of k.
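The formula translates almost directly into R. The sketch below is a minimal hand-written version (the function name `poisson_pmf` is mine, not part of any package), checked against R's built-in `dpois` for a handful of counts.

```r
# Poisson probability of exactly k events, written straight from the formula.
poisson_pmf <- function(k, lambda) {
  exp(-lambda) * lambda^k / factorial(k)
}

lambda <- 2.935642   # mean goals per match, as found above
k <- 0:10

by_hand  <- poisson_pmf(k, lambda)
built_in <- dpois(k, lambda)

# TRUE when the two agree up to floating point tolerance.
print(all.equal(by_hand, built_in))
```

For large k the hand-written version can overflow in `factorial(k)`, so `dpois` is the safer choice in practice.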

As per our exploration above, the mean number of goals is λ = 2.935642. We can plug this value into the formula above to calculate the probability of any number of goals being scored in a match.

For example,

$P(5 \text{ goals scored in a match}) = \frac{e^{-2.935642} \cdot 2.935642^{5}}{5!}$
$P(5 \text{ goals scored in a match}) = 0.09647195841$

Let's use R to calculate the above.

https://gist.github.com/1d58b075eeb730598616226ea767c802

## [1] 0.09647199

And we see the same value as calculated above, up to rounding in the last digits.
We can also see how the probability varies as we increase the number of events, i.e. the number of goals, from 0 to 8.

https://gist.github.com/cef2f6ea6d1a96c3d8a6ad75dd4992aa
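Alongside the plot, the same probabilities can be tabulated, together with the running total and the probability of seeing more than 8 goals. This is a sketch using base R's `dpois` and `ppois` with the mean found above; the variable names are my own.

```r
lambda <- 2.935642   # mean goals per match
goals <- 0:8

prob_table <- data.frame(
  goals       = goals,
  probability = dpois(goals, lambda),   # P(exactly this many goals)
  cumulative  = ppois(goals, lambda)    # P(this many goals or fewer)
)
print(round(prob_table, 4))

# Probability of more than 8 goals in a match under this model.
p_more_than_8 <- ppois(8, lambda, lower.tail = FALSE)
print(p_more_than_8)
```

The probabilities for 0 to 8 goals plus the tail probability add up to 1, which is a quick sanity check on the table.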

Summary

The Poisson distribution’s probability formula is a nifty little trick to have under your belt for evaluating the probability of an event happening. It is also widely used in industry, with applications like estimating the probability of k customers arriving at a store in order to optimize resources, or the probability that a webpage has seen k updates in order to optimize the rate at which a search engine crawls it.
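As a sketch of the store example, suppose customers arrive at an average of 12 per hour and the staff can comfortably serve 15 per hour. Both numbers are made up for illustration, and the example assumes arrivals really do follow a Poisson distribution.

```r
customers_per_hour <- 12   # hypothetical average arrival rate
capacity <- 15             # hypothetical number of customers staff can serve

# Probability that more customers arrive in an hour than can be served.
p_overload <- ppois(capacity, customers_per_hour, lower.tail = FALSE)
print(p_overload)

# Smallest capacity that covers at least 95% of hours.
capacity_95 <- qpois(0.95, customers_per_hour)
print(capacity_95)
```

The same two calls, `ppois` for a tail probability and `qpois` for a quantile, carry over to the web crawling example by swapping in the average number of page updates per interval.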
