After acquiring data and getting them into a form you can work with, you want to make clear, succinct conclusions from them. This crucial last step of a data analysis pipeline hinges on the principles of statistical inference. You will start building the foundation to think statistically, speak the language of your data, and understand what your data is telling you. Get up-to-speed and begin thinking statistically.
By Justin Bois, Lecturer at the California Institute of Technology
Before diving into sophisticated statistical inference techniques, you should first explore your data by plotting them and computing simple summary statistics. This process, called exploratory data analysis, is a crucial first step in statistical analysis of data.
“Exploratory data analysis can never be the whole story, but nothing else can serve as the foundation stone.” —John Tukey
- Exploratory data analysis
- Plotting a histogram
- Setting the bins of a histogram
- Seaborn
- Generating a bee swarm plot
- Empirical cumulative distribution function (ECDF)
- The probability that a measured value is less than or equal to the value marked on the x axis
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df_swing = pd.read_csv('2008_swing_states.csv')
_ = plt.hist(df_swing['dem_share'])
_ = plt.xlabel('percent of vote for Obama')
_ = plt.ylabel('number of counties')
plt.show()
bin_edges = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
_ = plt.hist(df_swing['dem_share'], bins=bin_edges)
plt.show()
_ = sns.swarmplot(x='state', y='dem_share', data=df_swing)
_ = plt.xlabel('state')
_ = plt.ylabel('percent of vote for Obama')
plt.show()
import numpy as np
x = np.sort(df_swing['dem_share'])
y = np.arange(1, len(x)+1) / len(x)
_ = plt.plot(x, y, marker='.', linestyle='none')
_ = plt.xlabel('percent of vote for Obama')
_ = plt.ylabel('ECDF')
plt.margins(0.02) # Keeps data off plot edges
plt.show()
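Later snippets in these notes call an `ecdf` helper. A minimal sketch of such a helper, assuming it simply wraps the two lines used for the ECDF plot above, could look like this (the toy values are hypothetical and only show the shape of the output):

```python
import numpy as np


def ecdf(data):
    """Return x, y arrays for the ECDF of a one-dimensional data set."""
    x = np.sort(np.asarray(data))            # measurements in ascending order
    y = np.arange(1, len(x) + 1) / len(x)    # fraction of points <= each x
    return x, y


# Hypothetical toy values, not taken from the swing state data
x, y = ecdf([3.0, 1.0, 2.0, 4.0])
print(x)  # [1. 2. 3. 4.]
print(y)  # [0.25 0.5  0.75 1.  ]
```

Each y value is the fraction of measurements at or below the matching x value, so the last point always sits at 1.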
Compute useful summary statistics, which concisely describe the salient features of a dataset with a few numbers.
- Mean and outliers, Median, Mode
- Percentiles on an ECDF, outliers and box plots
- Variance
- The mean squared distance of the data from their mean
- Informally, a measure of the spread of data
- Computing the standard deviation
- Covariance -- a measure of how two quantities vary together
- Pearson correlation coefficient
- covariance / ((std of x) × (std of y))
- variability due to codependence / independent variability
- 0 means the data are not correlated at all
import numpy as np
np.mean(dem_share_PA)
np.median(dem_share_UT)
np.percentile(df_swing['dem_share'], [25, 50, 75])
import matplotlib.pyplot as plt
import seaborn as sns
_ = sns.boxplot(x='east_west', y='dem_share', data=df_all_states)
_ = plt.xlabel('region')
_ = plt.ylabel('percent of vote for Obama')
plt.show()
np.var(dem_share_FL)
np.std(dem_share_FL)
# Same result: the standard deviation is the square root of the variance
np.sqrt(np.var(dem_share_FL))
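To connect the definition (the mean squared distance of the data from their mean) to `np.var`, here is a small sketch on hypothetical toy numbers. It assumes NumPy's default behavior of dividing by the number of points (`ddof=0`):

```python
import numpy as np

# Hypothetical toy measurements, chosen so the arithmetic is easy to follow
data = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mean = np.sum(data) / len(data)
# Variance: mean squared distance of the data from their mean
variance = np.sum((data - mean) ** 2) / len(data)
# Standard deviation: square root of the variance, in the units of the data
std = np.sqrt(variance)

print(variance, np.var(data))
print(std, np.std(data))
```

The hand-computed values should agree with `np.var` and `np.std` up to floating-point rounding.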
# Scatter plot of vote share against total votes (in thousands)
_ = plt.plot(total_votes/1000, dem_share, marker='.', linestyle='none')
_ = plt.xlabel('total votes (thousands)')
_ = plt.ylabel('percent of vote for Obama')
plt.show()
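The covariance and Pearson correlation coefficient listed above are not computed in the snippets. A minimal sketch, using hypothetical toy arrays in place of the election data, computes both by hand and checks them against `np.cov` and `np.corrcoef`:

```python
import numpy as np

# Hypothetical toy data standing in for two measured quantities
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Covariance: mean of the product of the deviations from each mean
cov_xy = np.mean((x - np.mean(x)) * (y - np.mean(y)))

# Pearson correlation: covariance / ((std of x) * (std of y))
pearson_r = cov_xy / (np.std(x) * np.std(y))

# NumPy equivalents; ddof=0 matches the "divide by n" definition used here
cov_matrix = np.cov(x, y, ddof=0)      # entry [0, 1] is the covariance
r_numpy = np.corrcoef(x, y)[0, 1]

print(cov_xy, cov_matrix[0, 1])
print(pearson_r, r_numpy)
```

Because the Pearson coefficient is dimensionless, it stays between -1 and 1 regardless of the units of the two quantities.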
Statistical inference rests upon probability. Because we can very rarely say anything meaningful with absolute certainty from data, we use probabilistic language to make quantitative statements about data. Think probabilistically about discrete quantities: those that can only take certain values, like integers.
- Hacker statistics
- Uses simulated repeated measurements to compute probabilities
- Simulate many many times
- Probability is approximately fraction of trials with the outcome of interest
- The np.random module
- Random number seed
- Bernoulli trial -- an experiment that has two options, success (True) and failure (False)
- Probability mass function (PMF) -- the set of probabilities of discrete outcomes
- Discrete Uniform PMF -- the outcome of rolling a single fair die is:
- Discrete
- Uniformly distributed
- Probability distribution
- A mathematical description of outcomes
- Binomial distribution
- The number r of successes in n Bernoulli trials with probability p of success is Binomially distributed
- The number r of heads in 4 coin flips with probability 0.5 of heads is Binomially distributed
- Sampling from the Binomial distribution
- Binomial CDF
- Poisson process -- the timing of the next event is completely independent of when the previous event happened
- Natural births in a given hospital
- Hits on a website during a given hour
- Meteor strikes
- Molecular collisions in a gas
- Aviation incidents
- Buses in Poissonville
- Poisson distribution
- The number r of arrivals of a Poisson process in a given time interval with average rate of 'x' arrivals per interval is Poisson distributed
- The number r of hits on a website in one hour with an average hit rate of 6 hits per hour is Poisson distributed
- Poisson PMF and Distribution
- Poisson distribution -- the limit of the Binomial distribution for low probability of success and a large number of trials
- That is, for rare events
- Poisson CDF
np.random.seed(42)
coins = np.random.random(size=4)
heads = coins < 0.5
np.sum(heads)
n_all_heads = 0  # Initialize number of 4-heads trials
for _ in range(10000):
    heads = np.random.random(size=4) < 0.5
    n_heads = np.sum(heads)
    if n_heads == 4:
        n_all_heads += 1
# Estimated probability of getting four heads in four flips
n_all_heads / 10000
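The coin-flip simulation above can be wrapped in a reusable function. The sketch below is one way to do it. The name `perform_bernoulli_trials` is an illustrative choice, not something defined in these notes. The exact answer for four fair flips is 0.5 ** 4 = 0.0625, so the simulated fraction should land close to that.

```python
import numpy as np


def perform_bernoulli_trials(n, p):
    """Perform n Bernoulli trials with success probability p.

    Returns the number of successes. (Hypothetical helper name.)
    """
    return int(np.sum(np.random.random(size=n) < p))


np.random.seed(42)  # seed the generator so the run is reproducible

n_repeats = 10000
n_all_heads = 0
for _ in range(n_repeats):
    # One "measurement": four flips of a fair coin
    if perform_bernoulli_trials(4, 0.5) == 4:
        n_all_heads += 1

# Probability is approximately the fraction of trials with the outcome of interest
prob_all_heads = n_all_heads / n_repeats
print(prob_all_heads)
```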
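# Draw 10,000 samples of the number of successes in 60 trials with p = 0.1
samples = np.random.binomial(60, 0.1, size=10000)
sns.set()
# ecdf(data) returns x = sorted data and y = cumulative fractions,
# computed as in the ECDF snippet above
x, y = ecdf(samples)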
_ = plt.plot(x, y, marker='.', linestyle='none')
_ = plt.xlabel('number of successes')
_ = plt.ylabel('CDF')
plt.show()
samples = np.random.poisson(6, size=10000)
x, y = ecdf(samples)
_ = plt.plot(x, y, marker='.', linestyle='none')
_ = plt.xlabel('number of successes')
_ = plt.ylabel('CDF')
plt.show()
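To illustrate the statement that the Poisson distribution is the limit of the Binomial distribution for rare events, the sketch below compares summary statistics of samples from both. The (n, p) pairs are illustrative choices that all keep the mean n * p equal to 6; as p shrinks, the Binomial mean and standard deviation should approach the Poisson values.

```python
import numpy as np

np.random.seed(42)

# Poisson samples with an average of 6 arrivals per interval
samples_poisson = np.random.poisson(6, size=10000)

# Binomial samples with the same mean n * p = 6 and increasingly rare successes
# (illustrative parameter choices)
for n, p in [(20, 0.3), (100, 0.06), (1000, 0.006)]:
    samples_binomial = np.random.binomial(n, p, size=10000)
    print('n =', n, 'p =', p,
          'mean:', np.mean(samples_binomial),
          'std:', np.std(samples_binomial))

print('Poisson',
      'mean:', np.mean(samples_poisson),
      'std:', np.std(samples_poisson))
```

For the Poisson distribution the standard deviation is the square root of the mean, so the last line should show a value near 2.45.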
Continuous variables can take on any fractional value. Many of the principles are the same, but there are some subtleties. Speak the probabilistic language you need to launch into the inference techniques covered in the sequel to this course.
- Michelson's speed of light experiment
- Probability density function (PDF)
- Continuous analog to the PMF
- Mathematical description of the relative likelihood of observing a value of a continuous variable
- Normal PDF
- Normal CDF
- Normal distribution -- describes a continuous variable whose PDF has a single symmetric peak
- Checking Normality of Michelson data
import numpy as np
mean = np.mean(michelson_speed_of_light)
std = np.std(michelson_speed_of_light)
samples = np.random.normal(mean, std, size=10000)
x, y = ecdf(michelson_speed_of_light)
x_theor, y_theor = ecdf(samples)
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
_ = plt.plot(x_theor, y_theor)
_ = plt.plot(x, y, marker='.', linestyle='none')
_ = plt.xlabel('speed of light (km/s)')
_ = plt.ylabel('CDF')
plt.show()
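The same sampling approach gives a rough estimate of the Normal CDF at any point. The sketch below uses hypothetical parameters, not the Michelson values, and estimates the probability of falling at least one standard deviation below the mean (about 0.16 for a Normal distribution) and of falling within one standard deviation of the mean (about 0.68).

```python
import numpy as np

np.random.seed(42)

# Hypothetical parameters, not the Michelson measurements
mu, sigma = 10.0, 2.0
samples = np.random.normal(mu, sigma, size=100000)

# CDF at mu - sigma: fraction of samples at or below that value
prob_below = np.sum(samples <= mu - sigma) / len(samples)

# Fraction of samples within one standard deviation of the mean
within_one_sigma = np.mean(np.abs(samples - mu) <= sigma)

print(prob_below)
print(within_one_sigma)
```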
- The Gaussian distribution
- The Exponential distribution
- The waiting time between arrivals of a Poisson process is Exponentially distributed
- Exponential PDF
mean = np.mean(inter_times)
samples = np.random.exponential(mean, size=10000)
x, y = ecdf(inter_times)
x_theor, y_theor = ecdf(samples)
_ = plt.plot(x_theor, y_theor)
_ = plt.plot(x, y, marker='.', linestyle='none')
_ = plt.xlabel('time (days)')
_ = plt.ylabel('CDF')
plt.show()
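The link between the two distributions can also be checked in the other direction: if waiting times are drawn from an Exponential distribution, the number of arrivals per interval should look Poisson distributed. A minimal sketch, assuming a hypothetical rate of 6 arrivals per hour:

```python
import numpy as np

np.random.seed(42)

rate = 6.0               # assumed average arrivals per hour (hypothetical)
mean_wait = 1.0 / rate   # mean waiting time between arrivals, in hours

# Waiting times between successive arrivals, then absolute arrival times
waits = np.random.exponential(mean_wait, size=60000)
arrival_times = np.cumsum(waits)

# Count arrivals in each complete one-hour interval
n_hours = int(arrival_times[-1])
in_window = arrival_times[arrival_times < n_hours]
counts = np.bincount(in_window.astype(int), minlength=n_hours)

# For a Poisson distribution the mean and variance are both equal to the rate
print(np.mean(counts), np.var(counts))
```

Both printed values should be close to 6, the assumed rate.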