@misho-kr
Last active December 10, 2020 05:53
Summary of "Statistical Thinking in Python (Part 1)" from DataCamp.org

After acquiring data and getting them into a form you can work with, you want to draw clear, succinct conclusions from them. This crucial last step of a data analysis pipeline hinges on the principles of statistical inference. You will start building the foundation to think statistically, speak the language of your data, and understand what your data are telling you. Get up to speed and begin thinking statistically.

By Justin Bois, Lecturer at the California Institute of Technology

Graphical exploratory data analysis

Before diving into sophisticated statistical inference techniques, you should first explore your data by plotting them and computing simple summary statistics. This process, called exploratory data analysis, is a crucial first step in statistical analysis of data.

“Exploratory data analysis can never be the whole story, but nothing else can serve as the foundation stone.” —John Tukey

  • Exploratory data analysis
    • Plotting a histogram
    • Setting the bins of a histogram
  • Seaborn
    • Generating a bee swarm plot
  • Empirical cumulative distribution function (ECDF)
    • The probability that a measured value will be less than the mark on the x-axis
import pandas as pd
import matplotlib.pyplot as plt

df_swing = pd.read_csv('2008_swing_states.csv')

_ = plt.hist(df_swing['dem_share'])
_ = plt.xlabel('percent of vote for Obama')
_ = plt.ylabel('number of counties')
plt.show()

bin_edges = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
_ = plt.hist(df_swing['dem_share'], bins=bin_edges)
plt.show()
import seaborn as sns

_ = sns.swarmplot(x='state', y='dem_share', data=df_swing)
_ = plt.xlabel('state')
_ = plt.ylabel('percent of vote for Obama')
plt.show()

import numpy as np
x = np.sort(df_swing['dem_share'])
y = np.arange(1, len(x)+1) / len(x)
_ = plt.plot(x, y, marker='.', linestyle='none')
_ = plt.xlabel('percent of vote for Obama')
_ = plt.ylabel('ECDF')
plt.margins(0.02) # Keeps data off plot edges
plt.show()

Quantitative exploratory data analysis

Compute useful summary statistics, which concisely describe salient features of a dataset with a few numbers.

  • Mean and outliers, Median, Mode
  • Percentiles on an ECDF, outliers and box plots
  • Variance
    • The mean squared distance of the data from their mean
    • Informally, a measure of the spread of data
  • Computing the standard deviation
  • Covariance -- a measure of how two quantities vary together
  • Pearson correlation coefficient
    • covariance / ((std of x)(std of y))
    • variability due to codependence / independent variability
    • 0 means the data are not correlated at all
import numpy as np

np.mean(dem_share_PA)
np.median(dem_share_UT)
np.percentile(df_swing['dem_share'], [25, 50, 75])

import matplotlib.pyplot as plt
import seaborn as sns

_ = sns.boxplot(x='east_west', y='dem_share', data=df_all_states)
_ = plt.xlabel('region')
_ = plt.ylabel('percent of vote for Obama')
plt.show()

np.var(dem_share_FL)
np.std(dem_share_FL)
# the standard deviation is the square root of the variance
np.sqrt(np.var(dem_share_FL))

# scatter plot
_ = plt.plot(total_votes/1000, dem_share, marker='.', linestyle='none')
_ = plt.xlabel('total votes (thousands)')
_ = plt.ylabel('percent of vote for Obama')
plt.show()
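
The covariance and Pearson correlation coefficient described above can be computed directly with NumPy. A minimal sketch, using small made-up arrays in place of the `total_votes` and `dem_share` columns from the swing-state data:

```python
import numpy as np

# illustrative paired data standing in for total_votes and dem_share
x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
y = np.array([12.0, 25.0, 31.0, 38.0, 52.0])

# covariance matrix: entry [0, 1] is cov(x, y)
cov = np.cov(x, y)[0, 1]

# Pearson correlation = cov(x, y) / (std(x) * std(y))
corr = np.corrcoef(x, y)[0, 1]

print(cov, corr)
```

`np.corrcoef` returns the full correlation matrix, so the off-diagonal entry is the coefficient between the two arrays; it always lies between -1 and 1.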

Thinking probabilistically -- Discrete variables

Statistical inference rests upon probability. Because we can very rarely say anything meaningful with absolute certainty from data, we use probabilistic language to make quantitative statements about data. Think probabilistically about discrete quantities: those that can only take certain values, like integers.

  • Hacker statistics
    • Uses simulated repeated measurements to compute probabilities
    • Simulate many, many times
    • Probability is approximately fraction of trials with the outcome of interest
  • The np.random module
    • Random number seed
  • Bernoulli trial -- an experiment that has two options, success (True) and failure (False)
  • Probability mass function (PMF) -- the set of probabilities of discrete outcomes
  • Discrete Uniform PMF -- the outcome of rolling a single fair die is:
    • Discrete
    • Uniformly distributed
  • Probability distribution
    • A mathematical description of outcomes
  • Binomial distribution
    • The number r of successes in n Bernoulli trials with probability p of success, is Binomially distributed
    • The number r of heads in 4 coin flips with probability 0.5 of heads, is Binomially distributed
  • Sampling from the Binomial distribution
  • Binomial CDF
  • Poisson process -- the timing of the next event is completely independent of when the previous event happened
    • Natural births in a given hospital
    • Hit on a website during a given hour
    • Meteor strikes
    • Molecular collisions in a gas
    • Aviation incidents
    • Buses in Poissonville
  • Poisson distribution
    • The number r of arrivals of a Poisson process in a given time interval with average rate of 'x' arrivals per interval is Poisson distributed
    • The number r of hits on a website in one hour with an average hit rate of 6 hits per hour is Poisson distributed
  • Poisson PMF and Distribution
    • Poisson Distribution - limit of the Binomial distribution for low probability of success and large number of trials
    • That is, for rare events
  • Poisson CDF
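The discrete uniform PMF of a fair die, mentioned above, can be checked by simulation in the hacker-statistics spirit: roll many times and take the fraction of rolls landing on each face. A small sketch (not from the course code):

```python
import numpy as np

np.random.seed(42)

# simulate 100,000 rolls of a fair six-sided die
rolls = np.random.randint(1, 7, size=100_000)

# empirical PMF: fraction of rolls landing on each face
faces, counts = np.unique(rolls, return_counts=True)
pmf = counts / len(rolls)

# each face should come up close to 1/6 of the time
print(dict(zip(faces, pmf)))
```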
np.random.seed(42)
coins = np.random.random(size=4)
heads = coins < 0.5
np.sum(heads)

n_all_heads = 0 # Initialize number of 4-heads trials
for _ in range(10000):
  heads = np.random.random(size=4) < 0.5
  n_heads = np.sum(heads)
  if n_heads == 4:
    n_all_heads += 1
n_all_heads / 10000
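
The `ecdf()` helper used in the plots below is never defined in these notes; a minimal implementation, matching the manual ECDF computation from the first chapter:

```python
import numpy as np

def ecdf(data):
    """Compute ECDF x and y values for a 1-D array of measurements."""
    x = np.sort(data)
    # y goes from 1/n up to 1 in steps of 1/n
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y
```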
samples = np.random.binomial(60, 0.1, size=10000)

sns.set()
x, y = ecdf(samples)
_ = plt.plot(x, y, marker='.', linestyle='none')
_ = plt.xlabel('number of successes')
_ = plt.ylabel('CDF')
plt.show()
samples = np.random.poisson(6, size=10000)
x, y = ecdf(samples)
_ = plt.plot(x, y, marker='.', linestyle='none')
_ = plt.xlabel('number of successes')
_ = plt.ylabel('CDF')
plt.show()
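The claim above, that the Poisson distribution is the limit of the Binomial for low success probability and many trials, can be checked by comparing sample means and standard deviations as n grows with n*p held fixed. A sketch (the specific n, p pairs are illustrative):

```python
import numpy as np

np.random.seed(42)

# Poisson samples with mean 10
samples_poisson = np.random.poisson(10, size=10_000)
print('Poisson:', np.mean(samples_poisson), np.std(samples_poisson))

# Binomial samples with the same mean n*p = 10, with p shrinking
for n, p in [(20, 0.5), (100, 0.1), (1000, 0.01)]:
    samples_binomial = np.random.binomial(n, p, size=10_000)
    print(f'n = {n}:', np.mean(samples_binomial), np.std(samples_binomial))
```

As p gets smaller, the Binomial standard deviation sqrt(n*p*(1-p)) approaches the Poisson's sqrt(mean).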

Thinking probabilistically -- Continuous variables

Continuous variables can take on any fractional value. Many of the principles are the same, but there are some subtleties. Speak the probabilistic language you need to launch into the inference techniques covered in the sequel to this course.

  • Michelson's speed of light experiment
  • Probability density function (PDF)
    • Continuous analog to the PMF
    • Mathematical description of the relative likelihood of observing a value of a continuous variable
    • Normal PDF
    • Normal CDF
  • Normal distribution -- describes a continuous variable whose PDF has a single symmetric peak
    • Checking Normality of Michelson data
import numpy as np

mean = np.mean(michelson_speed_of_light)
std = np.std(michelson_speed_of_light)

samples = np.random.normal(mean, std, size=10000)
x, y = ecdf(michelson_speed_of_light)
x_theor, y_theor = ecdf(samples)

import matplotlib.pyplot as plt
import seaborn as sns

sns.set()
_ = plt.plot(x_theor, y_theor)
_ = plt.plot(x, y, marker='.', linestyle='none')
_ = plt.xlabel('speed of light (km/s)')
_ = plt.ylabel('CDF')

plt.show()
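The Normal PDF itself can be visualized by plotting a normed histogram of samples; the mean and standard deviation below are illustrative stand-ins for the Michelson estimates:

```python
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(42)

# draw samples from a Normal distribution; the Michelson mean and
# std estimates would go here in place of these example values
samples = np.random.normal(20, 1, size=100_000)

# density=True normalizes the histogram so it approximates the PDF
_ = plt.hist(samples, bins=100, density=True, histtype='step')
_ = plt.xlabel('x')
_ = plt.ylabel('PDF')
plt.show()
```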
  • The Gaussian distribution
  • The Exponential distribution
    • The waiting time between arrivals of a Poisson process is Exponentially distributed
    • Exponential PDF
mean = np.mean(inter_times)
samples = np.random.exponential(mean, size=10000)

x, y = ecdf(inter_times)
x_theor, y_theor = ecdf(samples)
_ = plt.plot(x_theor, y_theor)
_ = plt.plot(x, y, marker='.', linestyle='none')
_ = plt.xlabel('time (days)')
_ = plt.ylabel('CDF')

plt.show()