Last active
October 14, 2018 12:27
data_science
1. Correlation
https://en.wikipedia.org/wiki/Correlation_coefficient
+1 strongest possible agreement
0 no linear relationship between the two variables (independent variables give 0, but 0 alone does not prove independence)
−1 strongest possible disagreement
r(x,y) = cov(x,y) / (var(x)*var(y))**.5
---cov(x,y) = E( (x-ux)(y-uy) )
---var(x) = E( (x-ux)**2 )
Correlation does not imply causation
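
The formula for r(x,y) above can be sketched in plain Python. This is a minimal illustration that uses population (divide-by-n) moments; the helper names `mean` and `pearson_r` are chosen here for the example and do not come from the notes.

```python
from math import sqrt


def mean(values):
    """Arithmetic mean, standing in for the expectation E(.)."""
    return sum(values) / len(values)


def pearson_r(x, y):
    """r(x,y) = cov(x,y) / sqrt(var(x) * var(y)) using population moments."""
    ux, uy = mean(x), mean(y)
    cov_xy = mean([(a - ux) * (b - uy) for a, b in zip(x, y)])
    var_x = mean([(a - ux) ** 2 for a in x])
    var_y = mean([(b - uy) ** 2 for b in y])
    return cov_xy / sqrt(var_x * var_y)


# Perfect agreement and perfect disagreement.
r_up = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])
r_down = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])

# y is fully determined by x (y = x**2), yet the linear correlation is zero.
xs = [-2, -1, 0, 1, 2]
r_parabola = pearson_r(xs, [v ** 2 for v in xs])
```

The last case shows why r = 0 should be read as "no linear relationship" rather than "independent".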
2. K-means
A/B testing
We are limited by time and resources, so we measure a sample of the potentially infinitely many future visitors to
a site and then use our observations on that sample to predict how visitors would behave in the future.
1. We split our users into two randomly selected groups.
2. We observe the differences in behavior between these two groups,
including on Key Performance Indicators (like click-through rate, conversion rate).
3. Null hypothesis test
X = P - Pc (P, N: rate and size of the variant group; Pc, Nc: rate and size of the control group)
H0 : X <= 0
step 1. compute the z score
z = (P-Pc) / (P(1-P)/N + Pc(1-Pc)/Nc)**.5
step 2. from the z score find the relevant p-value
one case: one-tailed hypothesis, z = 1.645 -> p = 0.05
step 3. compare the p-value with the significance level
p-value <= significance level (alpha = .05): the null hypothesis is rejected; the result is statistically significant (unlikely to have occurred by chance)
p-value > significance level (alpha = .05): the null hypothesis is not rejected; the result is not statistically significant
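
The three steps above can be sketched with the standard library. This is an illustration under stated assumptions: the function name `ab_z_test`, its argument names, and the sample counts are hypothetical. The standard error adds the two variance terms (unpooled form), and the p-value is one-tailed to match H0 : X <= 0.

```python
from math import sqrt
from statistics import NormalDist


def ab_z_test(conv_variant, n_variant, conv_control, n_control, alpha=0.05):
    """One-tailed z-test for H0: P - Pc <= 0 on two conversion rates."""
    p = conv_variant / n_variant      # P, observed rate in the variant group
    pc = conv_control / n_control     # Pc, observed rate in the control group
    # Step 1: z score; the variances of the two groups add up.
    se = sqrt(p * (1 - p) / n_variant + pc * (1 - pc) / n_control)
    z = (p - pc) / se
    # Step 2: one-tailed p-value = upper-tail area of the standard normal.
    p_value = 1 - NormalDist().cdf(z)
    # Step 3: compare with the significance level.
    return z, p_value, p_value <= alpha


# Hypothetical counts: 120/1000 conversions in the variant, 100/1000 in control.
z, p_value, reject_h0 = ab_z_test(120, 1000, 100, 1000)
print(f"z={z:.3f} p={p_value:.4f} reject H0: {reject_h0}")
```

With these made-up counts the lift looks large (12% vs 10%), yet the test does not reject H0 at alpha = .05, which is the kind of case where collecting more traffic is the usual next step.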
-------------------------------------read more -------------------------------------
* hypothesis
a proposed explanation made on limited evidence, to be checked by further testing and experimentation
* null hypothesis (H0)
no (statistically) significant difference between the two variables in the hypothesis
* alternative hypothesis (H1)
a (statistically) significant difference between the two variables in the hypothesis
* statistical significance
In (statistical) hypothesis testing, a result has statistical significance
when it is very unlikely to have occurred given the null hypothesis (by chance)
* significance level (alpha) and confidence level (1 - alpha)
0.05 (95%)
the threshold used to determine if a result can be judged statistically significant
* Confidence interval
The interval has an associated confidence level.
Confidence intervals consist of a range of potential values of the unknown population parameter.
X ± E
E = Z*s/√n (margin of error; s: sample standard deviation, n: sample size)
(Z = 1.96 when the confidence level is 95%)
* Type I error
rejection of a true null hypothesis (false positive)
* Type II error (beta β)
failure to reject a false null hypothesis (false negative)
* Statistical power
the probability of rejecting a false null hypothesis; power = 1 – β, so power rises as β falls
* Cohen's d
an effect size used to indicate the standardised difference between two means
how to plot
http://rpsychologist.com/d3/NHST/
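
The confidence interval entry above (X ± E with E = Z*s/√n) can be sketched as follows. This is a minimal example using the normal approximation; the function name `confidence_interval` and the sample values are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def confidence_interval(sample, confidence=0.95):
    """Return (low, high) = X ± Z*s/sqrt(n) under the normal approximation."""
    n = len(sample)
    x_bar = mean(sample)
    s = stdev(sample)  # sample standard deviation (n - 1 in the denominator)
    # Two-sided critical value: about 1.96 for a 95% confidence level.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * s / sqrt(n)
    return x_bar - margin, x_bar + margin


# Hypothetical measurements.
low, high = confidence_interval([2, 4, 4, 4, 5, 5, 7, 9])
print(f"95% CI: [{low:.3f}, {high:.3f}]")
```

For small samples a t critical value is usually preferred over Z; the sketch keeps Z to stay with the formula in the notes.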
one-tailed tests vs two-tailed tests
The A/B test above is one scenario and is a one-tailed test (H0 : X <= 0).
one-tailed tests
Pros
Requires less traffic
Gains significance faster (read: why significance does not equal validity)
Cons
Only accounts for one scenario (one direction of change)
Can lead to inaccurate and biased results
two-tailed tests
Pros
Accounts for all three scenarios (better, worse, no difference)
Leads to accurate and reliable results
Cons
Requires more traffic
Takes longer to gain significance
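
The trade-off listed above can be illustrated by computing both p-values from the same z score. This is a sketch, and the z value of 1.8 is an arbitrary example, not a number from the notes.

```python
from statistics import NormalDist


def p_values(z):
    """Return (one_tailed, two_tailed) p-values for a z score."""
    std_normal = NormalDist()
    one_tailed = 1 - std_normal.cdf(z)             # only "variant is better"
    two_tailed = 2 * (1 - std_normal.cdf(abs(z)))  # better or worse
    return one_tailed, two_tailed


one_tailed, two_tailed = p_values(1.8)
print(f"one-tailed p={one_tailed:.4f}, two-tailed p={two_tailed:.4f}")
```

For a positive z the two-tailed p-value is twice the one-tailed one, so the same data can pass alpha = .05 one-tailed and fail it two-tailed. That is why the one-tailed test "gains significance faster" and needs less traffic.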
* t-test
A t-test is an analysis framework used to determine whether the means of two samples from two normally distributed populations with unknown variances differ.
It can be used for A/B conversion rates.
tips: running pairwise t-tests on more than two groups is not reliable
1. each extra comparison increases the chance of a Type I error
2. i.e. a true null hypothesis is incorrectly rejected
* Tukey HSD
for multiple groups.
. based on the studentized range distribution (closely related to the t distribution).
. takes the number of means into account
* z score
The z score test for two population proportions is used when you want to know whether two populations or groups
differ significantly on some single characteristic (e.g. a proportion).
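
The tip in the t-test entry about more than two groups can be made concrete with a little arithmetic. The sketch below assumes the pairwise comparisons are independent, which is a simplification, and the function name `familywise_error_rate` is chosen here for illustration.

```python
from math import comb


def familywise_error_rate(groups, alpha=0.05):
    """Chance of at least one Type I error across all pairwise tests.

    Assumes independent comparisons, each run at significance level alpha.
    """
    comparisons = comb(groups, 2)  # number of distinct pairs of groups
    return comparisons, 1 - (1 - alpha) ** comparisons


for k in (2, 3, 4, 5):
    m, fwer = familywise_error_rate(k)
    print(f"{k} groups -> {m} t-tests -> P(at least one false positive) = {fwer:.3f}")
```

With two groups the rate stays at alpha, but it grows quickly as groups are added, which is the motivation for a multiple-comparison procedure such as Tukey HSD.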
* hypothesis
a proposed explanation made on limited evidence, to be checked by further testing and experimentation
* null hypothesis
no (statistically) significant difference between the two variables in the hypothesis
* alternative hypothesis
a (statistically) significant difference between the two variables in the hypothesis
** statistical significance
In (statistical) hypothesis testing, a result has statistical significance
when it is very unlikely to have occurred given the null hypothesis (by chance)
** p-value
the probability, assuming the null hypothesis is true, of a result at least as extreme as the one observed
If the p-value is less than alpha (usually 0.05), we reject the null hypothesis
* significance level (alpha)
the threshold used to determine if a result can be judged statistically significant
* Type I error
rejection of a true null hypothesis (false positive)
* Type II error (beta β)
failure to reject a false null hypothesis (false negative)
** statistical power
the probability that a particular sample will be able to reject the null hypothesis when the null hypothesis is false.
the greater the sample size, the greater the power of the test
* Cohen's d
raw effect size = true value - hypothesized value
an effect size used to indicate the standardised difference between two means
effect size = mean difference / standard deviation
* Statistical power
power = 1 – β, so power rises as β falls
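
The Cohen's d and power entries above can be tied together in a short sketch. Two assumptions are not stated in the notes: the standard deviation is taken to be the pooled standard deviation of the two samples, and power is approximated for a one-tailed two-sample z-test with equal group sizes. The function names and sample values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def cohens_d(a, b):
    """Effect size = mean difference / pooled standard deviation (assumed)."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var)


def approx_power(d, n_per_group, alpha=0.05):
    """Normal-approximation power (1 - beta) of a one-tailed two-sample test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    return NormalDist().cdf(d * sqrt(n_per_group / 2) - z_alpha)


d = cohens_d([2, 4, 6], [1, 3, 5])
power_small = approx_power(d, n_per_group=16)
power_large = approx_power(d, n_per_group=64)
print(f"d={d:.2f} power(n=16)={power_small:.3f} power(n=64)={power_large:.3f}")
```

The same effect size gives higher power with the larger sample, matching the note that power grows with sample size.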