@taixingbi
Last active October 14, 2018 12:27
data_science
1. Correlation
https://en.wikipedia.org/wiki/Correlation_coefficient
+1 strongest possible agreement
0 no linear relationship between the two variables (this alone does not mean they are independent)
−1 strongest possible disagreement
r(x,y) = cov(x,y) / (var(x)*var(y))**.5
---cov(x,y) = E( (x-ux)(y-uy) ), where ux = E(x) and uy = E(y)
---var(x) = E( (x-ux)**2 )
Correlation does not imply causation
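A minimal sketch of the formula above in plain Python. The function name and the sample data are made up for illustration. The second example shows a case where r is 0 even though y is fully determined by x, so a zero correlation only rules out a linear relationship.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation: cov(x, y) / sqrt(var(x) * var(y))."""
    n = len(xs)
    if n == 0 or n != len(ys):
        raise ValueError("xs and ys must be non-empty and of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Population covariance and variances (divide by n); the 1/n factors
    # cancel in the ratio, so dividing by n - 1 would give the same r.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    var_x = sum((x - mean_x) ** 2 for x in xs) / n
    var_y = sum((y - mean_y) ** 2 for y in ys) / n
    return cov / sqrt(var_x * var_y)

# Hypothetical data: y = 2x, a perfect positive linear relationship.
print(pearson_r([1, 2, 3, 4, 5], [2, 4, 6, 8, 10]))

# Hypothetical data: y = x**2 on a symmetric range, dependent but uncorrelated.
print(pearson_r([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4]))
```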
2. K-mean
A/B testing: we are limited by time and resources, so we measure a sample of the potentially infinitely many future visitors to
a site and then use our observations on that sample to predict how visitors would behave in the future.
1. we split our users into two randomly selected groups.
2. we compare the behavior of these two groups
on Key Performance Indicators (like click-through rate, conversion rate).
3. we run a null hypothesis test to check whether the observed difference could be due to chance
X = P - Pc (P, N: rate and sample size of the test group; Pc, Nc: rate and sample size of the control group)
H0 : X<=0
step 1. compute z score
z = (P-Pc) / (P(1-P)/N + Pc(1-Pc)/Nc)**.5
step 2. from zscore find relevant p-value
one case: one-tailed hypothesis, z = 1.645 -> p-value = 0.05
step 3
p-value <= significance level (alpha = .05): null hypothesis is rejected, the result is statistically significant (unlikely to have occurred by chance)
p-value > significance level (alpha = .05): null hypothesis is not rejected, the result is not statistically significant
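The three steps above can be sketched in Python with the standard library. The function name and the visitor counts are hypothetical; the sketch assumes a one-tailed test (H0: P - Pc <= 0) and adds the two variance terms under the square root (unpooled standard error).

```python
from math import sqrt
from statistics import NormalDist

def ab_z_test(conv_test, n_test, conv_ctrl, n_ctrl, alpha=0.05):
    """One-tailed two-proportion z-test for H0: P - Pc <= 0."""
    p = conv_test / n_test    # P: conversion rate of the test group
    pc = conv_ctrl / n_ctrl   # Pc: conversion rate of the control group
    # step 1: z score, using the unpooled standard error of the difference
    std_err = sqrt(p * (1 - p) / n_test + pc * (1 - pc) / n_ctrl)
    z = (p - pc) / std_err
    # step 2: one-tailed p-value = area of the standard normal above z
    p_value = 1 - NormalDist().cdf(z)
    # step 3: compare with the significance level
    return z, p_value, p_value <= alpha

# Hypothetical counts: 120 of 1000 test visitors convert vs 100 of 1000 in control.
z, p_value, reject_h0 = ab_z_test(120, 1000, 100, 1000)
print(f"z = {z:.3f}, p-value = {p_value:.4f}, reject H0: {reject_h0}")
```

With these made-up counts the 2-point lift is not significant at alpha = .05, which illustrates why sample size matters as much as the observed difference.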
-------------------------------------read more -------------------------------------
* hypothesis
a proposed explanation based on insufficient evidence, to be checked by further testing and experimentation
* null hypothesis (H0)
no (statistically significant) relationship between the two variables in the hypothesis
* alternative hypothesis (H1)
a (statistically significant) relationship between the two variables in the hypothesis
* statistical significance
In (statistical) hypothesis testing, a result has statistical significance
when it is very unlikely to have occurred given the null hypothesis(by chance)
* significance level (alpha)
0.05, which corresponds to a confidence level of 1 - alpha = 95%
the threshold used to determine if a result can be judged statistically significant
* Confidence interval
The interval has an associated confidence level
Confidence intervals consist of a range of potential values of the unknown population parameter.
X ± E
E = Z*s/√n (margin of error; s: sample standard deviation, n: sample size)
(Z= 1.96 when confidence level is 95%)
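A small sketch of X ± E using the formula above. The function name and the sample values are invented for illustration. It uses the normal critical value Z as in the notes; for a sample this small a t critical value would usually be preferred, so treat the numbers as a demonstration of the formula only.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def confidence_interval(sample, confidence=0.95):
    """Return (lower, upper) = X ± Z*s/√n for the sample mean."""
    n = len(sample)
    x_bar = mean(sample)   # X: sample mean
    s = stdev(sample)      # s: sample standard deviation (n - 1 denominator)
    # Two-sided critical value, about 1.96 when the confidence level is 95%.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * s / sqrt(n)   # E: margin of error
    return x_bar - margin, x_bar + margin

# Hypothetical measurements.
sample = [10, 12, 11, 13, 9, 11, 12, 10]
low, high = confidence_interval(sample)
print(f"95% CI: ({low:.2f}, {high:.2f})")
```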
* Type I error
rejection of a true null hypothesis (false positive)
* Type II error (beta β)
failure to reject a false null hypothesis (false negative)
* Statistical power
power is inversely related to β (power = 1 – β)
* Cohen's d
an effect size used to indicate the standardised difference between two means
how to plot
http://rpsychologist.com/d3/NHST/
one-tailed tests vs two-tailed tests
An A/B test that only asks whether the variant beats the control covers one scenario, so it is a one-tailed test
one-tailed tests
Pros
Requires less traffic
Gains significance faster (note: significance does not equal validity)
Cons
Only accounts for one scenario
Can lead to inaccurate and biased results
two-tailed tests
Pros
Accounts for all three scenarios (variant better, variant worse, no difference)
Leads to accurate and reliable results
Cons
Requires more traffic
Takes longer to gain significance
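To make the trade-off above concrete, here is a short sketch that converts the same z score into a one-tailed and a two-tailed p-value. The function name and the z score of 1.8 are illustrative assumptions.

```python
from statistics import NormalDist

def tail_p_values(z):
    """Return (one_tailed, two_tailed) p-values for a z score."""
    std_normal = NormalDist()
    # One-tailed: only "variant is better" counts as evidence.
    one_tailed = 1 - std_normal.cdf(z)
    # Two-tailed: a difference in either direction counts as evidence.
    two_tailed = 2 * (1 - std_normal.cdf(abs(z)))
    return one_tailed, two_tailed

one_tailed, two_tailed = tail_p_values(1.8)  # hypothetical z score
print(f"one-tailed p = {one_tailed:.4f}, two-tailed p = {two_tailed:.4f}")
```

For a positive z the two-tailed p-value is twice the one-tailed one, so the same data can be significant at alpha = .05 under a one-tailed test and not under a two-tailed test. That is why the one-tailed test reaches significance with less traffic.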
* t-test
A t-test is an analysis framework used to determine the difference between two sample means from two normally distributed populations with unknown variances.
can be used for an A/B test on conversion rate
tip: repeated pairwise t-tests should not be used for more than two groups
1. every extra comparison increases the chance of a Type I error
2. that is, a true null hypothesis is more likely to be incorrectly rejected
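A rough worked example of why the Type I error chance grows with more groups. The function name is hypothetical, and the formula 1 - (1 - alpha)**m assumes the m pairwise tests are independent, which is only an approximation for comparisons that share groups.

```python
from math import comb

def familywise_error_rate(groups, alpha=0.05):
    """Approximate chance of at least one false positive across all pairwise tests."""
    comparisons = comb(groups, 2)  # number of pairwise t-tests
    # Assumes independent tests; real pairwise comparisons overlap,
    # so this is an approximation, not an exact rate.
    return 1 - (1 - alpha) ** comparisons

for groups in (2, 3, 5):
    rate = familywise_error_rate(groups)
    print(f"{groups} groups -> {comb(groups, 2)} tests -> error rate about {rate:.3f}")
```

With two groups the rate equals alpha; with five groups (ten pairwise tests) it is roughly 0.40 under this approximation.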
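* Tukey HSD
for comparing multiple groups.
. based on the studentized range distribution (closely related to the t distribution).
. takes the number of means being compared into account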
* z score
The z score test for two population proportions is used when you want to know whether two populations or groups
differ significantly on some single characteristic (e.g. proportion)
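* p-value
the probability of seeing a result at least as extreme as the observed one if the null hypothesis is true
If the p-value is less than the significance level (alpha = 0.05), we reject the null hypothesis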
* Power (statistical power)
probability that a test on a particular sample will reject the null hypothesis when the null hypothesis is false.
the greater the sample size, the greater the power of the test
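A sketch of how power grows with sample size. It assumes a one-sided, one-sample z-test with a known standard deviation, where power = Φ(d*√n - z_alpha) and d is the standardised effect size. The function name and the value d = 0.2 are illustrative assumptions.

```python
from math import sqrt
from statistics import NormalDist

def power_one_sided_z(effect_size, n, alpha=0.05):
    """Power of a one-sided one-sample z-test for a standardised effect size."""
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha)  # rejection threshold under H0
    # Under H1 the z statistic is centred at effect_size * sqrt(n).
    return std_normal.cdf(effect_size * sqrt(n) - z_alpha)

for n in (50, 100, 200, 400):
    print(f"n = {n:3d} -> power = {power_one_sided_z(0.2, n):.3f}")
```

When the effect size is 0 the function returns alpha, which matches the definition of a Type I error rate.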
* Cohen's d
Effect size = True value - Hypothesized value
an effect size used to indicate the standardised difference between two means
effect size= mean difference / standard deviation
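A minimal sketch of Cohen's d for two independent samples. It uses the pooled standard deviation as the denominator, which is one common convention; the function name and the data are made up for illustration.

```python
from math import sqrt
from statistics import mean, variance

def cohens_d(group_a, group_b):
    """Cohen's d = mean difference / pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    # Pooled variance: sample variances weighted by their degrees of freedom.
    pooled_var = ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)

# Hypothetical samples: means differ by 1, pooled standard deviation is 2.
print(cohens_d([2, 4, 6], [1, 3, 5]))
```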