import numpy as np
import pandas as pd
import scipy.stats as stats
# STEP 1: GENERATE A RANDOM DATASET
# Seed the random number generator so the results are reproducible
# https://docs.scipy.org/doc/numpy/reference/generated/numpy.random.seed.html
np.random.seed(10)
# Sample 1000 race labels at fixed probabilities
voter_race = np.random.choice(a=["asian", "black", "hispanic", "other", "white"],
                              p=[0.05, 0.15, 0.25, 0.05, 0.5],
                              size=1000)
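The crosstab step reads `voters.race` and `voters.party`, but this excerpt does not show how `voters` is built. The sketch below is one way it could be assembled. The party labels are taken from the crosstab column names; the party probabilities `[0.4, 0.2, 0.4]` are an assumption for illustration only and do not come from the source.

```python
import numpy as np
import pandas as pd

np.random.seed(10)

voter_race = np.random.choice(a=["asian", "black", "hispanic", "other", "white"],
                              p=[0.05, 0.15, 0.25, 0.05, 0.5],
                              size=1000)

# ASSUMPTION: labels match the crosstab columns; probabilities are illustrative
voter_party = np.random.choice(a=["democrat", "independent", "republican"],
                               p=[0.4, 0.2, 0.4],
                               size=1000)

# One row per voter, with the two categorical variables to be tested
voters = pd.DataFrame({"race": voter_race, "party": voter_party})
```

Because race and party are sampled independently here, the test would be expected to find no significant association between them.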
# Build a crosstab from the DataFrame, then assign the column and row names
voter_tab = pd.crosstab(voters.race, voters.party, margins=True)
voter_tab.columns = ["democrat", "independent", "republican", "row_totals"]
voter_tab.index = ["asian", "black", "hispanic", "other", "white", "col_totals"]
# Display the crosstab to check its contents
voter_tab
"""
Calculate the "expected" table:
The expected count for each cell is given by:
    row_total x column_total / total_observations
These factors come from:
- row totals = voter_tab["row_totals"]
- column totals = voter_tab.loc["col_totals"]
- total_observations = 1000
Note that .loc is used for the column totals because "col_totals" is a
row label: plain [] indexing selects by column name, while .loc selects by row label.
"""
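The formula can be sketched with `np.outer`, which multiplies every row total by every column total in one call. The 2x2 table below is hypothetical toy data, chosen so the arithmetic is easy to check by hand; it is not the voter data.

```python
import numpy as np
import pandas as pd

# Hypothetical observed counts (illustrative only)
counts = pd.DataFrame([[10, 20], [30, 40]],
                      index=["a", "b"], columns=["x", "y"])

row_totals = counts.sum(axis=1)   # a: 30, b: 70
col_totals = counts.sum(axis=0)   # x: 40, y: 60
n = counts.to_numpy().sum()       # 100 observations in total

# expected[i, j] = row_totals[i] * col_totals[j] / n
expected = pd.DataFrame(np.outer(row_totals, col_totals) / n,
                        index=counts.index, columns=counts.columns)
print(expected)
```

For the voter crosstab, the same idea would presumably read `np.outer(voter_tab["row_totals"].iloc[0:5], voter_tab.loc["col_totals"].iloc[0:3]) / 1000`, which drops the margin entries before multiplying.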
# STEP 2: GET THE "OBSERVED" TABLE AND "EXPECTED" TABLE
"""
Calculate the "observed" table:
The observed table is extracted from the crosstab by excluding row_totals and col_totals.
row_totals is the 4th column (position 3, counting from 0)
and col_totals is the 6th row (position 5, counting from 0).
Slices exclude their end position, so [0:5, 0:3] means "take the rows at positions 0 through 4
and the columns at positions 0 through 2, and assign them to a new table
named [observed]".
"""
# Convert the expected counts into a DataFrame, then assign the column and row names
expected = pd.DataFrame(expected)
expected.columns = ["democrat", "independent", "republican"]
expected.index = ["asian", "black", "hispanic", "other", "white"]
# Display the expected table to check its contents
expected
# STEP 3: CALCULATE THE CHI-SQUARE VALUE AND CRITICAL VALUE
"""
Chi-square formula:
chi_square = sum over all cells of (observed - expected)^2 / expected
Note: We call .sum() twice: once to get the column sums
and a second time to add the column sums together,
returning the sum of the entire 2D table.
"""
chi_squared_stat = (((observed - expected) ** 2) / expected).sum().sum()
print(chi_squared_stat)
"""
Find the critical value for a 95% confidence level and 8 degrees of freedom (df)
Why df = 8?
Degrees of freedom formula:
df = (number of rows - 1) x (number of columns - 1)
   = (5 - 1) x (3 - 1)
   = 4 x 2
   = 8
"""
crit = stats.chi2.ppf(q=0.95, df=8)
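Once the critical value is known, the test statistic can be compared against it, or converted to a p-value with the chi-square CDF. The sketch below shows both. The value `7.0` is a hypothetical placeholder for the test statistic, not a result computed from the voter data.

```python
import scipy.stats as stats

df = 8
# Critical value: the point below which 95% of the chi-square(8) distribution lies
crit = stats.chi2.ppf(q=0.95, df=df)

# HYPOTHETICAL test statistic, for illustration only
chi_squared_stat = 7.0

# p-value: probability of a statistic at least this large under the null hypothesis
p_value = 1 - stats.chi2.cdf(x=chi_squared_stat, df=df)

# Reject the null hypothesis of independence only if the statistic exceeds crit
reject_null = chi_squared_stat > crit
print(crit, p_value, reject_null)
```

The two decision rules agree: the statistic exceeds the critical value exactly when the p-value falls below 0.05.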
""" METHODOLOGY 02: CALCULATE USING SCIPY.STATS LIBRARY"""
# Store the result under its own name so the scipy.stats alias `stats` is not overwritten
result = stats.chi2_contingency(observed=observed)
# Display the result to check its contents
# It contains: chi_squared_stat, p_value, df, expected table
result
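The four returned values can be unpacked directly into named variables. The sketch below uses a hypothetical 2x2 table of toy counts. Note that for 2x2 tables (df = 1) `chi2_contingency` applies Yates' continuity correction by default, so `correction=False` is passed here to match the plain chi-square formula; the 5x3 voter table has df = 8 and is not affected by that default.

```python
import numpy as np
import scipy.stats as stats

# Hypothetical observed counts (illustrative only)
observed = np.array([[10, 20],
                     [30, 40]])

# Unpack: test statistic, p-value, degrees of freedom, expected counts
chi2_stat, p_value, dof, expected = stats.chi2_contingency(observed, correction=False)

print(chi2_stat)   # the chi-square statistic
print(p_value)     # probability of a statistic this large under independence
print(dof)         # (rows - 1) x (columns - 1)
print(expected)    # expected counts, same shape as observed
```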
# chi2 and chi2_contingency live in scipy.stats, not the standard-library statistics module
import scipy.stats as stats