Skip to content

Instantly share code, notes, and snippets.

@cristiano74
Created June 15, 2013 13:47
Show Gist options
  • Save cristiano74/5788196 to your computer and use it in GitHub Desktop.
Save cristiano74/5788196 to your computer and use it in GitHub Desktop.
Chi square - python.md
#Chi square - python
[toc]
##Definizione del processo
1. Generazione di una popolazione a partire da una tupla generica
2. estrazione casuale di un campione di n_elementi dalla pop creata
3. calcolo della frequenze attese e ossevarte in base alla tupla estratta
4. calcolo del chi-quadrato con p_value associato e definizione dei livelli di accuratezza del risulato.
+ State your conclusion in terms of your hypothesis.
+ **If the p value for the calculated chi-square is p > 0.05, accept your hypothesis. 'The deviation is small enough that chance alone accounts for it**.
+ A p value of **0.6**, for example, means that **there is a 60% probability that any deviation from expected is *due to chance only***. This is within the range of acceptable deviation.
+ If the p value for the calculated chi-square is ***p < 0.05*, REJECT your hypothesis, and conclude that some factor other than chance** is operating for the deviation to be so great.
+ For example, a p value of 0.01 means that there is only a 1% chance that this deviation is due to chance alone. Therefore, other factors must be involved.
+ The chi-square test will be used to test for the "goodness to fit" between observed and expected data from several laboratory investigations in this lab manual.
LINK (http://www2.lv.psu.edu/jxm57/irp/chisquar.html)
##Esempio
**La tupla generica estratta è:**
` px [0.0, 0.05, 0.15, 0.1, 0.35, 0.15, 0.2]`
Da una popolazione di 100,000 elementi , estraggo 1000 elementi:
Quindi la freq. **attesa** assoluta (FA) è la seguente (totale = 1000 elementi):
`fa_pre [50, 150, 100, 350, 150, 200]`
*Notare che il **valore 0** è stato eliminato dal vettore, per [bug con il sistema di calcolo python][6]
Quindi la freq. **osservata** assoluta (FO) è :
` fo_pre [53, 155, 96, 332, 162, 202]`
Il confronto fra FA e FO:
`[ 50 150 100 350 150 200] --> FA`
`[ 53 155 96 332 162 202] --> FO`
Calcolo del pvalue: **`Chi square test: 0.789628937062`**
A ciscun valore di p_value corrisponde un livello di accuratezza, come nella tabella sotto:
Dalla elaborazione della [v1.1](#v11), appare come risultato quanto segue:
###Risultato ottenuto con [CSTP](#cstp) v1.1
BASH line$ in linux: `python test4.py 500 100000 1000 10 `
` ('<<<Risultato>>>', 'pop=', 100000, '; campione=', 1000, 'Tuple estratte=', 500, 'High:', 149, ' Med:', 201, ' Low:', 150, 'nullo', 0)
`
name | label
------------------------|
'pop=', 100000, | 100k casi considerati come pop
'campione=', 1000 | 1000 casi estratti (sample)
'Tuple estratte=' 500, | 500 combinazioni di freschezza su 230k totali utilizzate nel processo
' High:', 149, | 149 su 500 combinazioni hanno dato p value alto (vedere [tabella](#ptable))
' Med:', 201, | 201 su 500 combinazioni hanno dato p value medio
' Low:', 150, | 150 su 500 combinazioni hanno dato p value medio
'nullo', 0 | da trascurare - solo per controllo
Il processo è ripetuto n volte, con la stessa numerosità della poopolazione e del campione, ma con differenti tuple estratte.
I risultati sono memorizzati nel file [**SAMPLERESULTS_CHISQUARE.txt**](https://www.pythonanywhere.com/user/cristiano74/files/home/cristiano74/mysite/SAMPLERESULTS_CHISQUARE.txt?edit]).
Un estratto del file :
```
('<<<Risultato>>>', 'pop=', 100000, '; campione=', 1000, 'Tuple estratte=', 500, 'High:', 139, ' Med:', 221, ' Low:', 140, 'nullo', 0)
('<<<Risultato>>>', 'pop=', 100000, '; campione=', 1000, 'Tuple estratte=', 500, 'High:', 163, ' Med:', 177, ' Low:', 160, 'nullo', 0)
('<<<Risultato>>>', 'pop=', 100000, '; campione=', 1000, 'Tuple estratte=', 500, 'High:', 170, ' Med:', 179, ' Low:', 151, 'nullo', 0)
('<<<Risultato>>>', 'pop=', 100000, '; campione=', 1000, 'Tuple estratte=', 500, 'High:', 142, ' Med:', 209, ' Low:', 149, 'nullo', 0)
('<<<Risultato>>>', 'pop=', 100000, '; campione=', 1000, 'Tuple estratte=', 500, 'High:', 146, ' Med:', 195, ' Low:', 159, 'nullo', 0)
('<<<Risultato>>>', 'pop=', 100000, '; campione=', 1000, 'Tuple estratte=', 500, 'High:', 146, ' Med:', 204, ' Low:', 150, 'nullo', 0)
```
###<a id="ptable">Tabella di referimento per il p_value</a>
p value | level
--------- | -----
<0.30 | low accuracy
0.30-0.70 | medium accuracy
\>0.70 | High accuracy
----------
## CODE
### Create chi square test process [CSTP] <a id="cstp"></a>- test4.py
> note: Chi Square Test Process = CSTP
**`CODICE per la generazione del pvalue CHI SQUARE`**
[source code ] [7]
####<a id="v11">V1.1</a>
> - introdotto il ciclo sulle ripetizioni del processo, param 4
> - la lista delle tuple viene ri-estratta ogni ciclo
> - save dei risultati in: [**SAMPLERESULTS_CHISQUARE.txt**](https://www.pythonanywhere.com/user/cristiano74/files/home/cristiano74/mysite/SAMPLERESULTS_CHISQUARE.txt?edit])
> - dipende dalle tuple create con [Cretate tuples v1.0](#createtuplesv1.0)
```
############################################
# USAGE:
# python test4.py 100 100000 5000
# param1= numero estrazioni casuali di tuple
# param2= num pop
# param3= num sample
# param4= num di ripetizioni del processo
############################################
import sys
import random
import collections
import numpy
#QUI SI APRE IL FILE CON LE COMBINAZIONI
#################################
lista=[]
import pickle
#f = file("test.txt", "r")
f = file("outfile.txt", "r")
lista=pickle.load(f)
#############################
#from scipy.stats import mannwhitneyu
#from scipy.stats import chisquare
#from scipy.stats import kruskal
from scipy.stats import rv_discrete
import scipy.stats.mstats as mst
tuple_tot = 0
acc=0
high = 0
low = 0
med = 0
nullo = 0
p_value=0
# a= random.sample(range(100), 100)
n_extract = int(sys.argv[1]) #, indare numero di estrazioni casuali di tuple dalle permutazioni
fa = []
fa_pre = []
fo_pre = []
fo =[]
tot=int(sys.argv[2])
Nsample = int(sys.argv[3])
for i in xrange(int(sys.argv[4])):
lista = random.sample(lista,n_extract)
acc=0
high = 0
low = 0
med = 0
nullo = 0
p_value=0
for e in lista:
print e
#px=[float(e[0]/100),float(e[1]/100),float(e[2]/100),float(e[3]/100),float(e[4]/100),float(e[5]/100),float(e[6]/100)] # 7 rep]
e1= ((e[0]/float(100)))
e2= ((e[1]/float(100)))
e3= ((e[2]/float(100)))
e4= ((e[3]/float(100)))
e5= ((e[4]/float(100)))
e6= ((e[5]/float(100)))
e7= ((e[6]/float(100)))
px=[e1,e2,e3,e4,e5,e6,e7] # 7 rep]
print "px", px
x =[1,2,3,4,5,6,7]
fa_pre=([int(e1*Nsample),int(e2*Nsample),int(e3*Nsample),int(e4*Nsample),int(e5*Nsample),int(e6*Nsample),int(e7*Nsample)])
fa_pre[:] = (value for value in fa_pre if value != 0.0)
print "fa_pre",fa_pre
fa= numpy.array(fa_pre)
#fa = numpy.array([int(e1*Nsample),int(e2*Nsample),int(e3*Nsample),int(e4*Nsample),int(e5*Nsample),int(e6*Nsample),int(e7*Nsample)])
pop = rv_discrete(values=(x, px)).rvs(size=tot)
random.shuffle(pop)
sample1 = random.sample(pop,Nsample)
counter_sample=collections.Counter(sample1)
fo_pre = [counter_sample[1],counter_sample[2],counter_sample[3],counter_sample[4],counter_sample[5],counter_sample[6],counter_sample[7]]
fo_pre[:] = (value for value in fo_pre if value != 0.0)
print "fo_pre",fo_pre
fo = numpy.array(fo_pre)
#fo = numpy.array([counter_sample[1],counter_sample[2],counter_sample[3],counter_sample[4],counter_sample[5],counter_sample[6],counter_sample[7]])
print fa
print fo
#print mst.chisquare(fo, fa)
#print(samples)
try:
u, p_value = mst.chisquare(fo, fa)
#u, p_value = 2*mannwhitneyu(pop, sample1)
#u, p_value = kruskal(pop, sample1)
if p_value > 0.70:
acc += 1
test = "HIGH precision"
high += 1
else:
if p_value < 0.30:
test = "Low precision"
low += 1
else:
test = "Medium precision"
med += 1
except:
nullo +=1
p_value = 0
pass
print "Chi square test:", p_value
final = "<<<Risultato>>>",'pop=', tot, "; campione=", Nsample ,"Estrazioni=",n_extract, "High:",high," Med:",med," Low:",low, "nullo", nullo
print final
log = open("SAMPLERESULTS_CHISQUARE.txt","a")
print >> log,final
log.close()
```
####v 1.0
############################################
# USAGE:
# python test4.py 100 100000 5000
# param1= replicazioni test, casuali tuple
# param2= num pop
# param3= num sample
############################################
import sys
import random
import collections
import numpy
#QUI SI APRE IL FILE CON LE COMBINAZIONI
#################################
lista=[]
import pickle
#f = file("test.txt", "r")
f = file("outfile.txt", "r")
lista=pickle.load(f)
#############################
#from scipy.stats import mannwhitneyu
#from scipy.stats import chisquare
#from scipy.stats import kruskal
from scipy.stats import rv_discrete
import scipy.stats.mstats as mst
tuple_tot = 0
acc=0
high = 0
low = 0
med = 0
nullo = 0
p_value=0
# a= random.sample(range(100), 100)
n_extract = int(sys.argv[1]) #, indare numero di estrazioni casuali di tuple dalle permutazioni
lista = random.sample(lista,n_extract)
fa = []
fa_pre = []
fo_pre = []
fo =[]
for e in lista:
tot=int(sys.argv[2])
Nsample = int(sys.argv[3])
print e
#px=[float(e[0]/100),float(e[1]/100),float(e[2]/100),float(e[3]/100),float(e[4]/100),float(e[5]/100),float(e[6]/100)] # 7 rep]
e1= ((e[0]/float(100)))
e2= ((e[1]/float(100)))
e3= ((e[2]/float(100)))
e4= ((e[3]/float(100)))
e5= ((e[4]/float(100)))
e6= ((e[5]/float(100)))
e7= ((e[6]/float(100)))
px=[e1,e2,e3,e4,e5,e6,e7] # 7 rep]
print "px", px
x =[1,2,3,4,5,6,7]
fa_pre=([int(e1*Nsample),int(e2*Nsample),int(e3*Nsample),int(e4*Nsample),int(e5*Nsample),int(e6*Nsample),int(e7*Nsample)])
fa_pre[:] = (value for value in fa_pre if value != 0.0)
print "fa_pre",fa_pre
fa= numpy.array(fa_pre)
#fa = numpy.array([int(e1*Nsample),int(e2*Nsample),int(e3*Nsample),int(e4*Nsample),int(e5*Nsample),int(e6*Nsample),int(e7*Nsample)])
pop = rv_discrete(values=(x, px)).rvs(size=tot)
random.shuffle(pop)
sample1 = random.sample(pop,Nsample)
counter_sample=collections.Counter(sample1)
fo_pre = [counter_sample[1],counter_sample[2],counter_sample[3],counter_sample[4],counter_sample[5],counter_sample[6],counter_sample[7]]
fo_pre[:] = (value for value in fo_pre if value != 0.0)
print "fo_pre",fo_pre
fo = numpy.array(fo_pre)
#fo = numpy.array([counter_sample[1],counter_sample[2],counter_sample[3],counter_sample[4],counter_sample[5],counter_sample[6],counter_sample[7]])
print fa
print fo
#print mst.chisquare(fo, fa)
#print(samples)
try:
u, p_value = mst.chisquare(fo, fa)
#u, p_value = 2*mannwhitneyu(pop, sample1)
#u, p_value = kruskal(pop, sample1)
if p_value > 0.70:
acc += 1
test = "HIGH precision"
high += 1
else:
if p_value < 0.30:
test = "Low precision"
low += 1
else:
test = "Medium precision"
med += 1
except:
nullo +=1
p_value = 0
pass
print "Chi square test:", p_value
final = "<<<Risultato>>>",'pop=', tot, "; campione=", Nsample ,"Estrazioni=",n_extract, "High:",high," Med:",med," Low:",low, "nullo", nullo
print final
###<a id="createtuples">Create tuples-test1.py</a>
> file per la generazione delle tuple con permutazione , con i seguenti valori di percentuale:
`v=[0,5,10,15,20,25,30,35,40,45,50,55,60,65,70,75,80,85,90,95,100];`
>[Source code for test1.py][8]
Il risultato delle 230230 tuple è memorizzato nel file OUTFILE.TXT
####<a id="createtuplesv1.0">V1.0</a>
> esempio BASH : **python filename.py 1**
> il valore 1 indica che il testmode=1, cioe' falso.
```
##################################################
#QUESTO FILE GENERA LE TUPLE DA UTILIZZARE NEL FILE
#DA ESEGUIRE UNA SOLA VOLTA
#IL FILE : OUTFILE.TXT
##################################################
import itertools;
v=[0,5,10,15,20,25,30,35,40,45,50,55,60,65,70,75,80,85,90,95,100];
import sys
# esempio BASH : python filename.py 1
#testmode=0
testmode=int(sys.argv[1])
target=100 #somma da raggiungere - test
if testmode==1:
n_rep=3
else:
n_rep=7
############### CREAZIONE TUPLA
combos = [a for a in itertools.product(v,repeat=n_rep) if sum(a)==target]
#############################
lista = []
pop = []
############### AGGIUNGE TUPLA ALLA LISTA
for e in combos:
# print e #indica la tupla
lista.append(e)
#########################################
############### SCRIVE LISTA SU FILE OUTFILE.TXT
import pickle
f = file("outfile.txt", "w")
pickle.dump(lista, f)
#print lista
##################################
```
##Notes
####Chi square Goodness of fit: - python
1. [Articolo di un modulo python per chi square GOF][4]
2. [**How to Test Your Discrete Distribution- Minitab**] [5]
3. [Difference between scipy.stats.mstats.chisquare and
scipy.stats.mstats in Python][6]
1. [p VALUE - INTERPRETATION][9]
1. [How to generate numbers based on an arbitrary **discrete distribution**?][2]
2. [ANOVA, Nonparametric Tests, and More Added to SVS via New **Python** Capabilities][3]
####PYTHON CODE. SUM and permutation
1. http://mathoverflow.net/questions/9477/uniquely-generate-all-permutations-of-three-digits-that-sum-to-a-particular-value
2. http://stackoverflow.com/questions/4632322/finding-all-possible-combinations-of-numbers-to-reach-a-given-sum
4. Utilizzo *LIST* in python: http://effbot.org/zone/python-list.htm
5. [Sample size for pilot studies](http://www.statstodo.com/SSiz2Props_Pgm.php)
####Algorithm
1. http://web.eecs.utk.edu/~vose/Publications/random.pdf
2. Bootstrap Python : https://github.com/cgevans/scikits-bootstrap
3. [wilcoxon](http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.wicoxon.html#scipy.stats.wilcoxon)
5. [mann withney](http://www.alglib.net/hypothesistesting/mannwhitneyu.php)
----------
[1]: https://www.pythonanywhere.com/user/cristiano74/files/home/cristiano74/mysite/6_example.py?edit
[2]: http://stats.stackexchange.com/questions/26858/how-to-generate-numbers-based-on-an-arbitrary-discrete-distribution
[3]:[http://blog.goldenhelix.com/?p=595]
[4]: (http://statsmodels.sourceforge.net/stable/stats.html#goodness-of-fit-tests-and-measures)
[5]: http://blog.minitab.com/blog/adventures-in-statistics/how-to-test-your-discrete-distribution
[6]:(http://stackoverflow.com/questions/13913572/difference-between-scipy-stats-mstats-chisquare-and-scipy-stats-mstats-in-python)
[7]:https://www.pythonanywhere.com/user/cristiano74/files/home/cristiano74/mysite/test4.py?edit
[8]:https://www.pythonanywhere.com/user/cristiano74/files/home/cristiano74/mysite/test1.py?edit
[9]: (http://www2.lv.psu.edu/jxm57/irp/chisquar.html)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment