misho-kr/Statistical Thinking in Python (Part 1).md

## Statistical Thinking in Python (Part 1).md

      
    Statistical Thinking in Python (Part 1)

After acquiring data and getting them into a form you can work with, you want to make clear, succinct conclusions from them. This crucial last step of a data analysis pipeline hinges on the principles of statistical inference. You will start building the foundation to think statistically, speak the language of your data, and understand what your data is telling you. Get up-to-speed and begin thinking statistically.
By Justin Bois, Lecturer at the California Institute of Technology
Graphical exploratory data analysis

Before diving into sophisticated statistical inference techniques, you should first explore your data by plotting them and computing simple summary statistics. This process, called exploratory data analysis, is a crucial first step in statistical analysis of data.

“Exploratory data analysis can never be the whole story, but nothing else can serve as the foundation stone.” —John Tukey


Exploratory data analysis

Plotting a histogram
Setting the bins of a histogram


Seaborn

Generating a bee swarm plot


Empirical cumulative distribution function (ECDF)

The probability that a measured value is less than or equal to the value marked on the x-axis


import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df_swing = pd.read_csv('2008_swing_states.csv')

_ = plt.hist(df_swing['dem_share'])
_ = plt.xlabel('percent of vote for Obama')
_ = plt.ylabel('number of counties')
plt.show()

bin_edges = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
_ = plt.hist(df_swing['dem_share'], bins=bin_edges)
plt.show()
_ = sns.swarmplot(x='state', y='dem_share', data=df_swing)
_ = plt.xlabel('state')
_ = plt.ylabel('percent of vote for Obama')
plt.show()

import numpy as np
x = np.sort(df_swing['dem_share'])
y = np.arange(1, len(x)+1) / len(x)
_ = plt.plot(x, y, marker='.', linestyle='none')
_ = plt.xlabel('percent of vote for Obama')
_ = plt.ylabel('ECDF')
plt.margins(0.02) # Keeps data off plot edges
plt.show()
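
Later snippets in these notes call `ecdf(...)` without defining it. A minimal sketch of such a helper, wrapping the two lines used in the ECDF snippet above, could look like the following. The function name matches the calls in the notes, but the body is an assumption, and the toy data are made up for illustration.

```python
import numpy as np

def ecdf(data):
    """Return x, y arrays for the ECDF of a one-dimensional data set."""
    x = np.sort(data)              # sorted measurements
    n = len(x)
    y = np.arange(1, n + 1) / n    # fraction of points <= each x
    return x, y

# Hypothetical toy data, not taken from the swing-state file
x, y = ecdf([3.0, 1.0, 2.0, 4.0])
# x -> [1.0, 2.0, 3.0, 4.0], y -> [0.25, 0.5, 0.75, 1.0]
```

Reading the result: `y[i]` is the fraction of measurements less than or equal to `x[i]`, so the median sits where `y` crosses 0.5.
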
Quantitative exploratory data analysis

Compute useful summary statistics, which serve to concisely describe salient features of a dataset with a few numbers

Mean and outliers, Median, Mode
Percentiles on an ECDF, outliers and box plots
Variance

The mean squared distance of the data from their mean
Informally, a measure of the spread of data


Computing the standard deviation
Covariance -- a measure of how two quantities vary together
Pearson correlation coefficient

Pearson correlation coefficient = covariance / ((std of x) × (std of y))
In words: variability due to codependence / independent variability
Ranges from -1 to 1; 0 means the data are not linearly correlated at all
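
The definitions of variance, covariance, and the Pearson correlation coefficient listed above can be written out directly. This is a small sketch on made-up paired measurements (the arrays `x` and `y` are hypothetical, not course data), meant only to show how the three quantities relate.

```python
import numpy as np

# Hypothetical paired measurements, made up for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Variance: mean squared distance of the data from their mean
var_x = np.mean((x - np.mean(x)) ** 2)

# Covariance: mean of the product of the deviations from each mean
cov_xy = np.mean((x - np.mean(x)) * (y - np.mean(y)))

# Pearson correlation: covariance / ((std of x) * (std of y))
r = cov_xy / (np.std(x) * np.std(y))
```

Note that `np.cov` divides by n - 1 by default, so its entries differ from `cov_xy` unless `ddof=0` is passed. `np.corrcoef(x, y)[0, 1]` gives the same `r` either way, because the normalization cancels in the ratio.
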


import numpy as np

np.mean(dem_share_PA)
np.median(dem_share_UT)
np.percentile(df_swing['dem_share'], [25, 50, 75])

import matplotlib.pyplot as plt
import seaborn as sns

_ = sns.boxplot(x='east_west', y='dem_share', data=df_all_states)
_ = plt.xlabel('region')
_ = plt.ylabel('percent of vote for Obama')
plt.show()

np.var(dem_share_FL)
np.std(dem_share_FL)
# Same result: the standard deviation is the square root of the variance
np.sqrt(np.var(dem_share_FL))

# Scatter plot of vote share against total votes cast
_ = plt.plot(total_votes/1000, dem_share, marker='.', linestyle='none')
_ = plt.xlabel('total votes (thousands)')
_ = plt.ylabel('percent of vote for Obama')
plt.show()
Thinking probabilistically -- Discrete variables

Statistical inference rests upon probability. Because we can very rarely say anything meaningful with absolute certainty from data, we use probabilistic language to make quantitative statements about data. Think probabilistically about discrete quantities: those that can only take certain values, like integers.

Hacker statistics

Uses simulated repeated measurements to compute probabilities
Simulate the measurement many, many times
The probability is approximately the fraction of trials with the outcome of interest


The np.random module

Random number seed
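
To illustrate what seeding does, here is a short sketch: re-seeding the generator with the same integer restarts the same pseudorandom sequence, which is what makes a simulation reproducible. The seed value 42 is arbitrary.

```python
import numpy as np

np.random.seed(42)
first = np.random.random(size=3)

np.random.seed(42)                   # re-seeding restarts the same sequence
second = np.random.random(size=3)

same = np.array_equal(first, second) # True: both draws are identical
```
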


Bernoulli trial -- an experiment that has two options, success (True) and failure (False)
Probability mass function (PMF) -- the set of probabilities of discrete outcomes
Discrete Uniform PMF -- the outcome of rolling a single fair die is:

Discrete
Uniformly distributed
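
Both ideas above can be simulated in a few lines. The sketch below is an illustration under assumptions: `perform_bernoulli_trials` is a hypothetical helper name, and the values of `n`, `p`, and the number of die rolls are arbitrary choices.

```python
import numpy as np

def perform_bernoulli_trials(n, p):
    """Run n Bernoulli trials with success probability p; return the number of successes."""
    return int(np.sum(np.random.random(size=n) < p))

np.random.seed(42)
n_success = perform_bernoulli_trials(100, 0.05)

# Discrete uniform: roll a fair six-sided die many times
rolls = np.random.randint(1, 7, size=60000)   # upper bound is exclusive
faces, counts = np.unique(rolls, return_counts=True)
pmf_estimate = counts / len(rolls)            # each entry should be near 1/6
```

The estimated PMF is flat because every face is equally likely, which is what "uniformly distributed" means for a discrete outcome.
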


Probability distribution

A mathematical description of outcomes


Binomial distribution

The number r of successes in n Bernoulli trials with probability p of success is Binomially distributed
The number r of heads in 4 coin flips with probability 0.5 of heads is Binomially distributed
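
For the coin-flip example, the Binomial PMF can be computed exactly and compared with sampling. This is a sketch, assuming Python 3.8+ for `math.comb`; the sample size and seed are arbitrary.

```python
import math
import numpy as np

n, p = 4, 0.5

# Exact Binomial PMF: P(r) = C(n, r) * p**r * (1 - p)**(n - r)
pmf = [math.comb(n, r) * p**r * (1 - p) ** (n - r) for r in range(n + 1)]
# pmf -> [0.0625, 0.25, 0.375, 0.25, 0.0625]

np.random.seed(42)
samples = np.random.binomial(n, p, size=100000)
frac_all_heads = np.mean(samples == 4)   # should be close to pmf[4] = 0.0625
```
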


Sampling from the Binomial distribution
Binomial CDF
Poisson process -- the timing of the next event is completely independent of when the previous event happened

Natural births in a given hospital
Hits on a website during a given hour
Meteor strikes
Molecular collisions in a gas
Aviation incidents
Buses in Poissonville


Poisson distribution

The number r of arrivals of a Poisson process in a given time interval with average rate of 'x' arrivals per interval is Poisson distributed
The number r of hits on a website in one hour with an average hit rate of 6 hits per hour is Poisson distributed


Poisson PMF and Distribution

Poisson Distribution - limit of the Binomial distribution for low probability of success and large number of trials
That is, for rare events
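
One way to see this limit numerically is to hold the mean n * p fixed and let successes become rarer. The sketch below uses a mean of 10 and three (n, p) pairs chosen for illustration; none of these values come from the notes.

```python
import numpy as np

np.random.seed(42)
size = 100000

# Poisson with an average of 10 arrivals per interval
samples_poisson = np.random.poisson(10, size=size)

# Binomial with the same mean n * p = 10, for increasingly rare successes
for n, p in [(20, 0.5), (100, 0.1), (1000, 0.01)]:
    samples_binomial = np.random.binomial(n, p, size=size)
    print(f'n = {n:4d}, p = {p}: mean = {np.mean(samples_binomial):.3f}, '
          f'std = {np.std(samples_binomial):.3f}')

print(f'Poisson:           mean = {np.mean(samples_poisson):.3f}, '
      f'std = {np.std(samples_poisson):.3f}')
```

The means all sit near 10, while the Binomial standard deviation, sqrt(n * p * (1 - p)), climbs toward the Poisson value sqrt(10) as p shrinks.
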


Poisson CDF

np.random.seed(42)
coins = np.random.random(size=4)
heads = coins < 0.5
np.sum(heads)

n_all_heads = 0 # Initialize number of 4-heads trials
for _ in range(10000):
  heads = np.random.random(size=4) < 0.5
  n_heads = np.sum(heads)
  if n_heads == 4:
    n_all_heads += 1
n_all_heads / 10000
samples = np.random.binomial(60, 0.1, size=10000)

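sns.set()
# ecdf() is a helper, not a NumPy function: it returns the sorted data and
# np.arange(1, n+1) / n, as in the ECDF snippet above
x, y = ecdf(samples)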
_ = plt.plot(x, y, marker='.', linestyle='none')
_ = plt.xlabel('number of successes')
_ = plt.ylabel('CDF')
plt.show()
samples = np.random.poisson(6, size=10000)
x, y = ecdf(samples)
_ = plt.plot(x, y, marker='.', linestyle='none')
_ = plt.xlabel('number of arrivals')
_ = plt.ylabel('CDF')
plt.show()
Thinking probabilistically -- Continuous variables

Continuous variables can take on any fractional value. Many of the principles are the same, but there are some subtleties. Speak the probabilistic language you need to launch into the inference techniques covered in the sequel to this course.

Michelson's speed of light experiment
Probability density function (PDF)

Continuous analog to the PMF
Mathematical description of the relative likelihood of observing a value of a continuous variable
Normal PDF
Normal CDF
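
To make the link between a PDF and probabilities concrete, the sketch below evaluates the Normal PDF on a grid and sums areas under it. The helper `normal_pdf` and the parameters `mu = 0`, `sigma = 1` are assumptions for illustration, and a simple Riemann sum stands in for exact integration.

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    """Normal PDF with mean mu and standard deviation sigma."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

mu, sigma = 0.0, 1.0                      # assumed parameters for illustration
x = np.linspace(mu - 8 * sigma, mu + 8 * sigma, 16001)
dx = x[1] - x[0]
pdf = normal_pdf(x, mu, sigma)

total_area = np.sum(pdf) * dx             # area under the whole PDF, about 1

# Probability of landing within one standard deviation of the mean
inside = np.abs(x - mu) <= sigma
p_within_1sd = np.sum(pdf[inside]) * dx   # about 0.68

# Same probability estimated from random samples
np.random.seed(42)
samples = np.random.normal(mu, sigma, size=100000)
p_sampled = np.mean(np.abs(samples - mu) <= sigma)
```

The height of the PDF is not itself a probability; only areas under the curve are, and the CDF at a point is the area to its left.
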


Normal distribution -- describes a continuous variable whose PDF has a single symmetric peak

Checking Normality of Michelson data


import numpy as np

mean = np.mean(michelson_speed_of_light)
std = np.std(michelson_speed_of_light)

samples = np.random.normal(mean, std, size=10000)
x, y = ecdf(michelson_speed_of_light)
x_theor, y_theor = ecdf(samples)

import matplotlib.pyplot as plt
import seaborn as sns

sns.set()
_ = plt.plot(x_theor, y_theor)
_ = plt.plot(x, y, marker='.', linestyle='none')
_ = plt.xlabel('speed of light (km/s)')
_ = plt.ylabel('CDF')

plt.show()

The Gaussian distribution
The Exponential distribution

The waiting time between arrivals of a Poisson process is Exponentially distributed
Exponential PDF
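
The connection between the Exponential and Poisson distributions can be checked by simulation: draw Exponential waiting times, accumulate them into arrival times, and count arrivals per interval. In this sketch the rate of 6 arrivals per hour and the number of arrivals are assumed values.

```python
import numpy as np

np.random.seed(42)

rate = 6.0                      # assumed average arrivals per hour
n_arrivals = 600000

# Waiting times between arrivals: Exponential with mean 1 / rate
waits = np.random.exponential(1 / rate, size=n_arrivals)
arrival_times = np.cumsum(waits)

# Count arrivals in each complete one-hour window
n_hours = int(arrival_times[-1])            # last hour may be partial; drop it
counts = np.bincount(arrival_times[arrival_times < n_hours].astype(int),
                     minlength=n_hours)

mean_count = np.mean(counts)    # should be near rate
var_count = np.var(counts)      # Poisson counts have variance equal to the mean
```

The hourly counts come out with mean and variance both near 6, as expected for Poisson-distributed counts.
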


mean = np.mean(inter_times)
samples = np.random.exponential(mean, size=10000)

x, y = ecdf(inter_times)
x_theor, y_theor = ecdf(samples)
_ = plt.plot(x_theor, y_theor)
_ = plt.plot(x, y, marker='.', linestyle='none')
_ = plt.xlabel('time (days)')
_ = plt.ylabel('CDF')

plt.show()