Notes for how to approach and do certain statistical tests. The particular details are in Python, but a lot of the notes are generalizable.

How to approach statistics

Before you code, do three things:

  1. Think about what you want to show in your data. For example, how would you like to ultimately visualize it, or what would you ultimately like to say about it? A data science cheatsheet article on Medium has some helpful infographics for this.

  2. Do quick initial visualizations of your data to get a sense of the distributions, whether it's actually worth doing a statistical test (i.e., does it look like there might be a significant difference?), and whether there are any outliers (which could be interesting in themselves! E.g., why did that particular participant choose xyz vs. how the rest of the group chose?). Diverging stacked bar charts are excellent first visualizations for ordinal data (like Likert Scale data), and box plots are excellent first visualizations for other data. JMP is apparently great for these types of initial quick looks, and Vega-Lite (e.g., the Vega-Lite editor, normalized stacked bar chart) is awesome for creating visualizations right in the browser with JSON (e.g., use a csv-to-json converter or python pandas to_json) or with Python/Jupyter.

    1. With these plots, you'll get a sense of:
      1. the distribution (e.g., if it's skewed to the left or right, it might not be normal),
      2. whether the two distributions you're comparing are similar (and if not, whether different subgroups of the data are causing the different distributions),
      3. whether there are outliers (and why this might be the case → is there a particular type of person/variable that's causing the outliers?), and
      4. whether it's worth it to do a statistical test (e.g., the distributions are similar, but the medians are different).
    2. Here are some useful links about how to analyze box plots:
      1. Interpreting box plots by Khan Academy (e.g., T/F statements about box plots)
      2. How to compare box plots by Linh Ngo
      3. More on how to compare box plots by Linh Ngo
    3. See this paper about why diverging stacked bar charts are ideal for visualizing Likert scale results.
      1. Robbins, Naomi B., and Richard M. Heiberger. "Plotting Likert and Other Rating Scales." Proceedings of the 2011 Joint Statistical Meeting. Vol. 1. American Statistical Association, 2011.
  3. Figure out which type of statistical test you need to use, for example by using the flowchart from this article.

    1. The first question to ask yourself is: Is your data normal? In many of the code snippets below, this question is taken care of for you by a normality test, and the appropriate test will automatically be chosen (e.g., if normal --> t-test).

      1. Note also that if you have a sample size n > 30, "according to the central limit theorem, the distribution of the sample mean satisfies the normal distribution when the number of samples is above 30", so you do not necessarily have to check for normality [Normality Test in Clinical Research, Central limit theorem: the cornerstone of modern statistics].
    2. The second question is: Is your data paired or unpaired? Paired data is where you follow the same individuals through the data, like when you're trying to find differences between pre/post surveys. Unpaired data is where you're finding differences between different "groups" of participants, like parents vs. children on a specific survey. For paired data, you use tests like the "paired t-test" (normal) and the "Wilcoxon signed-rank test" (not normal). For unpaired data, you use tests like the "independent t-test" (normal) and the "Mann-Whitney U test" (not normal). (See the sketch just after this list.)

    3. Other tests might include Kendall Tau correlation, Krippendorff's alpha for computing inter-coder reliability for thematic analysis, etc.
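
To make the normality and paired/unpaired questions concrete, here's a minimal sketch (the data are hypothetical, and Shapiro-Wilk is used instead of scipy's normaltest because normaltest needs a larger sample; the box plot assumes matplotlib is installed):

import pandas as pd
from scipy import stats

# Hypothetical paired samples (e.g., the same 10 participants pre and post):
pre = pd.Series([3, 4, 2, 5, 4, 3, 4, 2, 3, 4])
post = pd.Series([4, 5, 3, 4, 5, 4, 5, 3, 4, 5])

# Quick look first: box plots of both distributions.
pd.DataFrame({'pre': pre, 'post': post}).plot.box()

# Then choose the test: for paired data, check normality of the *differences*.
alpha = 0.05
diffs = post - pre
stat, p = stats.shapiro(diffs)
if p < alpha:
    print('differences not normal -> Wilcoxon:', stats.wilcoxon(diffs))
else:
    print('differences look normal -> paired t-test:', stats.ttest_rel(pre, post))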

More complex statistics: When you have many comparisons, factors, covariates, etc.

If you're considering doing a lot of comparisons because you have a large number of factors/covariates that could potentially affect your dependent variable, don't do it! Doing many comparisons (e.g., with a t-test) increases the chances of incorrect results. Use ANOVA/ANCOVA/MANCOVA/regression/etc. instead! Quote from "Biofeedback Heart Rate Training during Exercise" by Goldstein et al.:

The experimental data were analyzed in two steps. First, analyses of
variance (ANOVAs) were performed and then independent means t-tests
comparing the experimental groups. The reason for this approach was that
the data analysis involved multiple between-group comparisons, and the
probability that at least one of these would reveal a statistically significant
difference by chance alone was unacceptably high. ANOVAs provided an
assessment of the significance of the overall effects of feedback, sessions,
etc., on the dependent variables, and obtaining significant effects in the
ANOVAs permitted application of multiple t-tests to the data without fear
of wrongly inferring significant differences between groups.

Here's a link to a fantastic explanation of the differences between ANOVA, ANCOVA, MANOVA, & MANCOVA. The sections below summarize the differences from that site in a nutshell.

Side Note:

  • "Dependent Variable" == output (e.g., someone's heart rate)
  • "Independent Variables" == input variables that don't change (e.g., users' age or the app version), which can be either "factors" or "covariates"
  • "Covariates" == continuous independent variable (e.g., age, number of hours spent studying, or anxiety-trait score)
  • "Factor" == independent variable that comes in "groups" and is not continuous (e.g., gender, level of education, or app version)

Side Note:

ANOVA, ANCOVA, linear regression, etc. methods expect the output ("dependent") variable to be continuous. If you have ordinal data, like Likert scale data, technically you're supposed to use ordinal methods instead, like ordered logistic regression, multinomial logistic regression, or binary logistic regression [Analysing Likert Scale Type Data]. That being said, Likert scale data is often treated as continuous in practice (e.g., when multiple Likert items are averaged into a composite score).
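
For example, here's a minimal ordered-logistic-regression sketch using statsmodels' OrderedModel (the file and column names are hypothetical; distr='logit' gives ordered logistic regression):

import pandas as pd
from statsmodels.miscmodels.ordinal_model import OrderedModel

df = pd.read_csv("likert-responses.csv")  # hypothetical file
# likertQ1 is the ordinal dependent variable; age and anxietyTrait are predictors
mod = OrderedModel(df["likertQ1"], df[["age", "anxietyTrait"]], distr="logit")
res = mod.fit(method="bfgs")
print(res.summary())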

ANOVA

[Diagrams: One-Way ANOVA | Two-Way ANOVA]
  • Input: one or two factors (i.e., with levels/groups)
  • Output: one continuous variable
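
As a minimal sketch of a one-way ANOVA with scipy (the groups and values are hypothetical; each argument to f_oneway is one group's measurements):

from scipy import stats

group_a = [72, 75, 80, 78, 74]  # e.g., avg HR for app version A
group_b = [85, 88, 84, 90, 87]  # app version B
group_c = [79, 81, 77, 83, 80]  # app version C
fstat, p = stats.f_oneway(group_a, group_b, group_c)
print(f'F = {fstat:.2f}, p = {p:.4f}')  # p < alpha -> at least one group mean differs

For a two-way ANOVA (two factors), one route is statsmodels' formula interface plus sm.stats.anova_lm on a fitted ols model, i.e., the same interface the ANCOVA snippet below uses.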

ANCOVA

(Basically ANOVA, but also includes at least one covariate / continuous independent variable.)

  • Input: factors and covariates
  • Output: one continuous variable

How to implement ANCOVA

Here are some links describing ANCOVA and linear regression [ANCOVA expressed as linear regression, How to perform ANCOVA in Python, How to fit models using formulas & categorical variables]. Essentially, you can use the statsmodels.api and statsmodels.formula.api packages and the ols function to model your data (linear regression) and get summary statistics. Use the p-values in the summary to determine whether each variable likely affects the outcome of the dependent variable.

Make sure to test the assumptions for linear regression though! See this post on how to test linear regression assumptions in Python.

Here's a code snippet of how to model data with both categorical (C()) and covariate independent variables.

import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Example dataframe:
df = pd.read_csv("avg-hr-by-app-type.csv")
df.head()
#         avgHR appType  expLevel  gender  anxietyTrait  age  busyness
#     57.956522       A  advanced    male           1.6   19    0.5025
#     88.900000       B    novice  female           1.6   31    0.4775
#    100.388889       A  advanced    male           2.0   42    0.4775
#    105.981132       B    novice  female           1.4   22    0.3450
#     82.645161       A  advanced    male           1.0   39    0.3600

# ANCOVA (Note: C(...) --> categorical variable)
formula = "avgHR ~ C(appType) + C(expLevel) + C(gender) + anxietyTrait + age + busyness"
model = smf.ols(formula=formula, data=df).fit()
model.summary()
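
Before interpreting the summary, it's worth running the assumption checks mentioned above. Here's a minimal sketch of two common ones (Shapiro-Wilk on the residuals and Breusch-Pagan for heteroscedasticity; these particular checks are illustrative choices, and model is the fitted model from the snippet above):

from scipy import stats
from statsmodels.stats.diagnostic import het_breuschpagan

resid = model.resid
print(stats.shapiro(resid))                       # normality of residuals
print(het_breuschpagan(resid, model.model.exog))  # homoscedasticity (Breusch-Pagan)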

At the top of the model summary, find the R-squared and the Prob (F-statistic) values. These tell you the goodness-of-fit. See Regression Analysis by Example by Chatterjee & Hadi for in-depth discussion of linear regression.

E.g., "The goodness-of-fit index, R^2, may be interpreted as the proportion [percent] of the total variability in the response variable Y that is accounted for by the predictor variable[s]"

E.g., "H_0 is rejected if ... p(F) <= alpha", and "H_0: Reduced model is adequate ... H_1: Full model is adequate" (i.e., if p < alpha, the full model/formula you developed is likely significant!)

If you can't satisfy the linearity assumption, or can't come up with a model where p < alpha, R^2 is large enough, etc., consider adding more variables and/or interaction effects to your formula, or transforming the dependent variable (e.g., log(y)).
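
For example, a minimal sketch of a log transform of the dependent variable (reusing the hypothetical df and the smf import from the ANCOVA snippet; np.log requires y > 0), with an interaction effect added via : in the formula:

import numpy as np

formula = "np.log(avgHR) ~ C(appType) + anxietyTrait + age + C(appType):age"
model_log = smf.ols(formula=formula, data=df).fit()
model_log.summary()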

MANOVA

[Diagrams: One-Way MANOVA | Two-Way MANOVA]

(Basically ANOVA, but has multiple output/dependent variables.)

  • Input: one or two factors
  • Output: multiple continuous variables
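
A minimal MANOVA sketch with statsmodels (reusing the hypothetical ANCOVA dataframe from above; the left-hand side of the formula lists multiple dependent variables joined by +):

from statsmodels.multivariate.manova import MANOVA

maov = MANOVA.from_formula("avgHR + busyness ~ C(appType)", data=df)
print(maov.mv_test())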

MANCOVA

(Basically ANCOVA, but has multiple output/dependent variables.)

  • Input: factors and covariates
  • Output: multiple continuous variables
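
Since MANCOVA is MANOVA plus covariates, the same MANOVA.from_formula call works with continuous predictors added on the right-hand side (again a sketch with the hypothetical columns from above):

mancova = MANOVA.from_formula("avgHR + busyness ~ C(appType) + anxietyTrait + age",
                              data=df)
print(mancova.mv_test())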

Stats in Python

See the following code snippets for how to do stats in Python. (The snippets cross-reference each other by filename, e.g., 01-paired-data-prep.py.)

import pandas as pd

###############################################################################
### 01-paired-data-prep.py
### Prepare data for paired statistical tests.
### For paired tests, you need to keep track of participants' IDs and compare
### them across surveys. This shows an example where there was a pre-, mid-,
### and post-survey, and the IDs are stored in a column named "codename".
def onlySameIds(data1, data2, idname='codename'):
    data1_for_data2 = data1.loc[data1[idname].isin(
        data2[idname])].sort_values(by=[idname])
    data2_for_data1 = data2.loc[data2[idname].isin(
        data1[idname])].sort_values(by=[idname])
    return (data1_for_data2, data2_for_data1)

predata = pd.read_excel('path/to/pre-survey.xlsx', sheet_name="cleaned-responses")
middata = pd.read_excel('path/to/mid-survey.xlsx', sheet_name="cleaned-responses")
postdata = pd.read_excel('path/to/post-survey.xlsx', sheet_name="cleaned-responses")

# Get the independent columns either by listing them, like this, or from the
# data itself (e.g., via the df.columns attribute):
indep_cols = [
    "levelprogramming",
    "childparent",
    "age",
    "gender"
]

# Assuming you only collected the demographics/independent-variable information
# in the pre-survey, you'll have to add the independent-variable columns from
# the pre-survey to middata and postdata:
# 1. find the indep_cols missing from the other surveys
cols_to_add = []
for col in indep_cols:
    if col not in middata.columns:  # checking middata or postdata both works here
        cols_to_add.append(col)
cols_to_add.append("codename")  # include the codename column for merging
data_to_add = predata[cols_to_add]
# 2. add the columns to middata and postdata, only keeping rows from the 'left'
#    dataframes, which are middata and postdata (not data_to_add) in this case.
#    (Note: merge "defaults to the intersection of the columns in both
#    DataFrames", which in this case is "codename". You can alternatively
#    specify which column(s) to merge on.)
middata = middata.merge(data_to_add, how='left')
postdata = postdata.merge(data_to_add, how='left')

# Next, clean the data s.t. we're comparing the same/correct students pre vs.
# mid vs. post. Only look at IDs that are in BOTH dataframes of each pair:
(predata_for_mid, middata_for_pre) = onlySameIds(predata, middata, 'codename')
(predata_for_post, postdata_for_pre) = onlySameIds(predata, postdata, 'codename')
(middata_for_post, postdata_for_mid) = onlySameIds(middata, postdata, 'codename')
# The data should be good to go now! See
# 02-paired-t-test-and-wilx-helper-functions.py for the actual comparisons.
################################################################################
################################################################################
### 02-paired-t-test-and-wilx-helper-functions.py
### Helper functions for normality, paired t-test, and wilcoxon tests.
# See first 01-paired-data-prep.py.
# You may want to loop over the following functions for large datasets and
# create a dataframe and/or an Excel spreadsheet of the data. (See below.)
import math
import pandas as pd
from pandas.api.types import is_numeric_dtype
from scipy import stats

###############################
########## functions ##########
# This will perform a t-test or wilcoxon signed-rank test, depending on a
# normality test of the two columns being compared (e.g., pre/post), and print
# out the results.
# e.g., data1 = pre-test with many columns of likert scale values
# e.g., data2 = post-test with many columns of likert scale values
# e.g., comparisonName = 'pre/post' --> this is just for printing
# e.g., col = 'likertQ1' --> the column name in data1/data2 you're comparing
def compareData(data1, data2, col, comparisonName, alpha=0.05):
    # to add to dataframe: depencol | indepcol[s] | datapair[2] | greater/less | testname | pval | stat | init_mean | after_mean
    # actual col names:    depen | indep | comparison | sigdir | test | p | stat | mean1 | mean2
    # e.g., friendcoworker | numagents=3 | pre/post | -1 | wilx | pval | stat | 2.2 | 1.5
    datatoappend = None
    diffs = getDiffs(data1, data2, col)
    if colsAreComparable(data1, data2, col):
        testname = 'n/a'
        res = {'pvalue': None, 'statistic': None, 'z_abs': None}
        if sum(diffs) != 0:
            if isNormal(data1, data2, col, alpha):
                # paired t-test
                restest = stats.ttest_rel(data1[col], data2[col])
                testname = 'ttest'
            else:
                # wilcoxon signed-rank test
                restest = stats.wilcoxon(diffs)
                testname = 'wilx'
        mean1 = data1[col].mean()
        mean2 = data2[col].mean()
        if (sum(diffs) != 0) and (restest.pvalue < alpha):
            res['pvalue'] = restest.pvalue
            res['statistic'] = restest.statistic
            greaterless = 0
            if mean1 > mean2:
                greaterless = -1
            elif mean1 < mean2:
                greaterless = 1
            # summarize data in a dictionary to be added to a dataframe/spreadsheet:
            datatoappend = {'depen': col, 'indep1': None, 'indep2': None, 'indep3': None,
                            'comparison': comparisonName, 'sigdir': greaterless,
                            'testname': testname, 'p': res['pvalue'], 'stat': res['statistic'],
                            'mean1': mean1, 'mean2': mean2}  # still need: indep
        else:
            print(f'not significant ({col}: {comparisonName}; '
                  f'mean1: {mean1}, mean2: {mean2})')
    else:
        print('NOTE: columns are *not* comparable')
        print(f'comparison: {col}: {comparisonName}; '
              f'len data 1: {len(data1)}; len data 2: {len(data2)}')
    return datatoappend

def colsAreComparable(data1, data2, colname):
    comparable = False
    if (colname in data1) and (colname in data2):
        if (len(data1) > 0) and (len(data2) > 0):
            if is_numeric_dtype(data1[colname]) and is_numeric_dtype(data2[colname]):
                comparable = True
        else:
            print('not comparable because len <= 0')
    else:
        print('not comparable because colname not in one of the dataframes')
        print(f'colname: {colname}')
    return comparable

def isNormal(data1, data2, colname, alpha=0.05):
    # scipy's normaltest needs a reasonable sample size (~20+), so treat small
    # samples as non-normal and fall back to a non-parametric test.
    normal = True
    if (len(data1[colname]) < 20) or (len(data2[colname]) < 20):
        normal = False
    else:
        res1 = stats.normaltest(data1[colname])
        res2 = stats.normaltest(data2[colname])
        if (res1.pvalue < alpha) or (res2.pvalue < alpha):
            normal = False
    return normal

def getDiffs(data1, data2, colname):
    diffs = []
    for id in data1['codename'].values:
        diff = data2.loc[data2['codename'] == id][colname].values[0] - \
            data1.loc[data1['codename'] == id][colname].values[0]
        # check for NaN (could be a missing value) before adding to diffs:
        if not math.isnan(diff):
            diffs.append(float(diff))
    return diffs
###################################
########## Example usage ##########
# See 01-paired-data-prep.py to get and prepare the data.
alpha = 0.05
# ...

### Single comparison example ###
nameinit = 'pre'
namefin = 'mid'
depencol = 'unreliabledependable'
limcol1 = 'weird'
lim1 = 0
limcol2 = None
lim2 = None
if nameinit == 'pre' and namefin == 'mid':
    datainit = predata_for_mid
    datafin = middata_for_pre
elif nameinit == 'pre' and namefin == 'post':
    datainit = predata_for_post
    datafin = postdata_for_pre
elif nameinit == 'mid' and namefin == 'post':
    datainit = middata_for_post
    datafin = postdata_for_mid
else:
    print("INCORRECT NAMES FOR DATA")
if limcol2:
    data1 = datainit.loc[(datainit[limcol1] == lim1) &
                         (datainit[limcol2] == lim2)]
    data2 = datafin.loc[(datafin[limcol1] == lim1) &
                        (datafin[limcol2] == lim2)]
    datatoappend = compareData(
        data1, data2, depencol, f'{nameinit}/{namefin}', alpha)
    if datatoappend:  # compareData returns None for non-significant results
        datatoappend['indep1'] = f'{limcol1}={lim1}'
        datatoappend['indep2'] = f'{limcol2}={lim2}'
else:
    data1 = datainit.loc[(datainit[limcol1] == lim1)]
    data2 = datafin.loc[(datafin[limcol1] == lim1)]
    datatoappend = compareData(
        data1, data2, depencol, f'{nameinit}/{namefin}', alpha)
    if datatoappend:
        datatoappend['indep1'] = f'{limcol1}={lim1}'
# PRINT
print(datatoappend)

### Loop over many examples ###
depen_cols = ["likertQ1", "likertQ2", "likertQ3", "..."]
datapairs = [
    [predata_for_mid, middata_for_pre, 'pre/mid'],
    [predata_for_post, postdata_for_pre, 'pre/post'],
    [middata_for_post, postdata_for_mid, 'mid/post']
]
resdf = pd.DataFrame(columns=['depen', 'indep1', 'indep2', 'indep3', 'comparison',
                              'sigdir', 'testname', 'p', 'stat', 'mean1', 'mean2'])
for datapair in datapairs:
    for depencol in depen_cols:
        datatoappend = compareData(datapair[0], datapair[1], depencol,
                                   datapair[2], alpha)
        if datatoappend:
            # (df.append was removed in pandas 2.x, so use pd.concat instead)
            resdf = pd.concat([resdf, pd.DataFrame([datatoappend])],
                              ignore_index=True)
# export the data to an excel spreadsheet
filename = 'paired-results.xlsx'
resdf.to_excel(filename)
print(f"--- Exported to spreadsheet: {filename} ---")
import pandas as pd

###############################################################################
### 03-unpaired-data-prep.py
### Prepare data for unpaired statistical tests.
### For unpaired tests, you *don't* need to keep track of participants' IDs,
### so it's a little simpler than the paired prep.
predata = pd.read_excel('path/to/pre-survey.xlsx', sheet_name="cleaned-responses")
middata = pd.read_excel('path/to/mid-survey.xlsx', sheet_name="cleaned-responses")
postdata = pd.read_excel('path/to/post-survey.xlsx', sheet_name="cleaned-responses")

# Get the independent columns either by listing them, like this, or from the
# data itself (e.g., via the df.columns attribute):
indep_cols = [
    "levelprogramming",
    "childparent",
    "age",
    "gender"
]

# Assuming you only collected the demographics/independent-variable information
# in the pre-survey, you'll have to add the independent-variable columns from
# the pre-survey to middata and postdata:
# 1. find the indep_cols missing from the other surveys
cols_to_add = []
for col in indep_cols:
    if col not in middata.columns:  # checking middata or postdata both works here
        cols_to_add.append(col)
cols_to_add.append("codename")  # include the codename column for merging
data_to_add = predata[cols_to_add]
# 2. add the columns to middata and postdata, only keeping rows from the 'left'
#    dataframes, which are middata and postdata (not data_to_add) in this case.
#    (Note: merge "defaults to the intersection of the columns in both
#    DataFrames", which in this case is "codename". You can alternatively
#    specify which column(s) to merge on.)
middata = middata.merge(data_to_add, how='left')
postdata = postdata.merge(data_to_add, how='left')
# The data should be good to go now! See
# 04-unpaired-t-test-and-mann-whitney-helper-functions.py for the actual
# comparisons.
##################################################################################
### 04-unpaired-t-test-and-mann-whitney-helper-functions.py
### Mann-Whitney U test (or independent t-test, if normal)
# See first 03-unpaired-data-prep.py.
# Assumption: the two samples are independent (e.g., Likert scale responses
# from parents vs. children).
#
# See example: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.mannwhitneyu.html
#
# If the two samples *are* normally distributed, use an independent t-test:
# scipy.stats.ttest_ind
# https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html#scipy.stats.ttest_ind
import pandas as pd
from pandas.api.types import is_numeric_dtype
from scipy import stats

alpha = 0.05
# ... see 03-unpaired-data-prep.py ...

#################
### functions ###
def isNormal(data1, data2, colname, alpha=0.05):
    # scipy's normaltest needs a reasonable sample size (~20+), so treat small
    # samples as non-normal and fall back to a non-parametric test.
    normal = True
    if (len(data1[colname]) < 20) or (len(data2[colname]) < 20):
        normal = False
    else:
        res1 = stats.normaltest(data1[colname])
        res2 = stats.normaltest(data2[colname])
        if (res1.pvalue < alpha) or (res2.pvalue < alpha):
            normal = False
    return normal

def colsAreComparable(data1, data2, colname):
    comparable = False
    if (colname in data1) and (colname in data2):
        if (len(data1) > 0) and (len(data2) > 0):
            if is_numeric_dtype(data1[colname]) and is_numeric_dtype(data2[colname]):
                comparable = True
        else:
            print('not comparable because len <= 0')
    else:
        print('not comparable because colname not in one of the dataframes')
        print(f'colname: {colname}')
    return comparable

def compareData(data1, data2, col, comparisonName, alpha=0.05):
    # to add to dataframe: demographics | comparisonname | pre or mid or post | indepcol[s] | greater/less | testname | pval | stat | init_mean | after_mean
    # actual col names:    depen | comparison | whichdata | indep1 | indep2 | indep3 | sigdir | test | p | stat | mean1 | mean2
    # e.g., levelprogramming | weird-0v1 | predata | children | None | None | -1 | wilx | pval | stat | 2.2 | 1.5
    datatoappend = None
    if colsAreComparable(data1, data2, col):
        testname = 'n/a'
        res = {'pvalue': None, 'statistic': None, 'z_abs': None}
        if isNormal(data1, data2, col, alpha):
            # independent t-test
            restest = stats.ttest_ind(data1[col], data2[col])
            testname = 'ttest'
        else:
            # mann-whitney u test
            restest = stats.mannwhitneyu(data1[col], data2[col])
            testname = 'mann-whitney'
        mean1 = data1[col].mean()
        mean2 = data2[col].mean()
        if restest.pvalue < alpha:
            res['pvalue'] = restest.pvalue
            res['statistic'] = restest.statistic
            greaterless = 0
            if mean1 > mean2:
                greaterless = -1
            elif mean1 < mean2:
                greaterless = 1
            # summarize data in a dictionary to be added to a dataframe/spreadsheet
            # (the 'depen' key matches the resdf columns used below):
            datatoappend = {'depen': col, 'comparison': comparisonName, 'whichdata': None,
                            'indep1': None, 'indep2': None, 'indep3': None, 'sigdir': greaterless,
                            'test': testname, 'p': res['pvalue'], 'stat': res['statistic'],
                            'mean1': mean1, 'mean2': mean2}  # still need: indep[s], whichdata
        else:
            print(f'not significant ({col}: {comparisonName}; '
                  f'mean1: {mean1}, mean2: {mean2})')
    else:
        print('NOTE: columns are *not* comparable')
        print(f'comparison: {col}: {comparisonName}; '
              f'len data 1: {len(data1)}; len data 2: {len(data2)}')
    return datatoappend
###########################
### Example comparisons ###
##### Single comparison #####
### REGULAR ###
data = predata
whichdata = 'predata'
depencol = 'unreliabledependable'
compcol = 'weird'
comp1 = 0
comp2 = 1
limcol = None  # 'weird'
limit = None  # 0
if limcol:
    data1 = data.loc[(data[compcol] == comp1) &
                     (data[limcol] == limit)]
    data2 = data.loc[(data[compcol] == comp2) &
                     (data[limcol] == limit)]
    datatoappend = compareData(
        data1, data2, depencol, f'{limcol}={limit} {compcol}-{comp1}v{comp2}', alpha)
else:
    data1 = data.loc[(data[compcol] == comp1)]
    data2 = data.loc[(data[compcol] == comp2)]
    datatoappend = compareData(
        data1, data2, depencol, f'{compcol}-{comp1}v{comp2}', alpha)

# ### GREATER THAN / LESS THAN ETC. WITH LIM ###
# data = predata
# whichdata = 'predata'
# depencol = 'inflexibleflexible'
# compcol = 'numagents'
# comp1 = 1
# comp2 = 1
# limcol = 'weird'
# limit = 0
# data1 = data.loc[(data[compcol] == comp1) &
#                  (data[limcol] == limit)]
# data2 = data.loc[(data[compcol] > comp2) &
#                  (data[limcol] == limit)]
# datatoappend = compareData(data1, data2, depencol,
#                            f'{limcol}={limit} {compcol}-=={comp1}v>{comp2}', alpha)

# ### GREATER THAN / LESS THAN ETC. NO LIM ###
# data = predata
# whichdata = 'predata'
# depencol = 'trustcas'
# compcol = 'numagents'
# comp1 = 1
# comp2 = 1
# limcol = None
# limit = None
# data1 = data.loc[(data[compcol] == comp1)]
# data2 = data.loc[(data[compcol] > comp2)]
# datatoappend = compareData(data1, data2, depencol,
#                            f'{compcol}-=={comp1}v>{comp2}', alpha)

### PRINT ###
if datatoappend:
    datatoappend['whichdata'] = whichdata
    if limcol:
        datatoappend['indep1'] = limcol
print(datatoappend)

##### Loop over many examples #####
depen_cols = ["likertQ1", "likertQ2", "likertQ3", "..."]
alldata = [predata, middata, postdata]          # from 03-unpaired-data-prep.py
datanames = ['predata', 'middata', 'postdata']
resdf = pd.DataFrame(columns=['depen', 'comparison', 'whichdata', 'indep1',
                              'indep2', 'indep3', 'sigdir', 'test', 'p', 'stat',
                              'mean1', 'mean2'])
for data, dataname in zip(alldata, datanames):   # e.g., within predata
    for indepcol in indep_cols:                  # e.g., levelprogramming
        if indepcol in data:
            categories = data[indepcol].unique()  # e.g., levelprogramming -> 0, 1, 2, 3
            i = 0
            # compare each pair of categories (e.g., childparent='C' vs. childparent='P')
            while i < len(categories):
                j = i + 1
                while j < len(categories):
                    data1 = data.loc[data[indepcol] == categories[i]]
                    data2 = data.loc[data[indepcol] == categories[j]]
                    # compare each depen col ('likertQ1', ...) between the two groups
                    for depencol in depen_cols:
                        datatoappend = compareData(
                            data1, data2, depencol,
                            f'{indepcol}-{categories[i]}v{categories[j]}', alpha)
                        if datatoappend:  # None for non-significant results
                            datatoappend['whichdata'] = dataname
                            # (df.append was removed in pandas 2.x -> pd.concat)
                            resdf = pd.concat([resdf, pd.DataFrame([datatoappend])],
                                              ignore_index=True)
                    j += 1
                i += 1
# export the data to an excel spreadsheet
filename = 'unpaired-results.xlsx'
resdf.to_excel(filename)
print(f"--- Exported to spreadsheet: {filename} ---")
### Potential correlation tests:
# If normal: Pearson correlation, but it has to be interval or ratio data (NOT ordinal/categorical)
# If not normal --> non-parametric tests:
#   (1) Spearman: for interval or ratio data (NOT ordinal/categorical)
#   (2) Kendall: for ordinal or continuous data :)
# --> I think you can use Kendall for any of not-normal/normal/ordinal/continuous
### Test your data for normality. If some of the columns are not normal and/or
### some are ordinal, then use the Kendall correlation.
# test normality (NOTE: you can't do this with string data):
# for s in cordata.columns:
#     k2, p = stats.normaltest(cordata[s])
#     if p < alpha:  # null hypothesis (normality) rejected
#         print('NOT NORMAL: ' + str(s))
#     else:
#         print('normal: ' + str(s))
# # since some are not normal, let's use a non-parametric test
import pandas as pd
import seaborn as sn
import matplotlib.pyplot as plt
import re
from scipy import stats
alpha = 0.05
predata = pd.read_excel('path/to/predata.xlsx')
postdata = pd.read_excel('path/to/postdata.xlsx')
otherdata = pd.read_excel('path/to/any/other/data.xlsx')
alldata = pd.merge(otherdata, pd.merge(predata,postdata, on='ID', suffixes=('_pre','_post')), on='ID')
# only columns we want to correlate:
cordata = pd.DataFrame(alldata, columns=[
    'example_col_pre', 'example_col_post', 'example_col2_pre', 'example_col2_post', 'other_col'
])
# clean data (get rid of rows with NaN's):
cordata = cordata.dropna()
# get the correlation matrix
corrMatrix = cordata.corr(method='kendall')
sn.heatmap(corrMatrix, annot=True)
plt.tight_layout()
plt.savefig('./path/to/figstorage/correlation-matrix.png', dpi=400)
plt.show()
print(corrMatrix)
# test for strong correlations (abs(cor) > ~0.5) and medium (abs(cor) > ~0.3)
# NOTE: it's a symmetric matrix, so we only need the lower triangle
strong = []
med = []
c = 0
for col in corrMatrix.columns:
    c += 1  # skip the diagonal (which is always 1.0)
    for r in range(c, len(corrMatrix)):
        row = corrMatrix.columns[r]
        if abs(corrMatrix[col].iloc[r]) >= 0.5:
            # print('strong correlation: ' + str(col) + ' & ' + str(row))
            coef, p = stats.kendalltau(cordata[col], cordata[row])
            if p < alpha:
                strong.append((col, row))
            # else:
            #     print('  NOT significant')
        elif abs(corrMatrix[col].iloc[r]) >= 0.3:
            # print('med correlation: ' + str(col) + ' & ' + str(row))
            coef, p = stats.kendalltau(cordata[col], cordata[row])
            if p < alpha:
                med.append((col, row))
            # else:
            #     print('  NOT significant')
print('Strong Correlations:')
for t in strong:
print(str(t[0]) + ', ' + str(t[1]))
print()
print('Medium Correlations:')
for t in med:
print(str(t[0]) + ', ' + str(t[1]))
# NOTE: Interpretation of strong/med correlations:
# Values of 0.5 for strong and 0.3 for med come from education psychology:
# “Statistical Power Analysis for the Behavioral Sciences” 2013, pg. 80-81
# NOTE: Meaning of the p-value for strength:
# coef, p = scipy.stats.kendalltau(data1, data2)
# → This gives you the value in the matrix (coef) and the p-value → test whether the p-value is < 0.05
# From paper on User’s guide to correlation coefficients: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6107969/
# - p-value doesn’t mean the strength of the correlation, it means the likelihood that the correlation is actually this
# strength (whether that’s weak strength or strong)
# - e.g., “The low level of the p-value reassures us that 99.99% of the time the correlation is *weak* at an r of 0.31.”
################################################################################
### Use this for inter-coder reliability (ICR, IRR, interrater agreement, etc.)
### when performing thematic analysis / open coding on data where you have
### multiple coders, and can tag items with multiple codes: Krippendorff's
### Alpha with MASI distance.
###
### For more information about many different ICR methods, see, "Intercoder
### Reliability in Qualitative Research: Debates and Practical Guidelines" by
### O'Connor and Joffe.
###
### For a citable paper on this specific ICR method, see "Inter-Coder Agreement
### for Computational Linguistics" by Artstein and Poesio.
###
### This code was taken from dashtaisen on StackOverflow:
### https://stackoverflow.com/questions/45741934/low-alpha-for-nltk-agreement-using-masi-distance
import pandas as pd
import nltk
from nltk.metrics import agreement
from nltk.metrics.distance import masi_distance
from nltk.metrics.distance import jaccard_distance
### get data
# Assumes columns are sorted by content (i.e., items to be tagged) so that all
# of the same content is next to each other in the list, and that there are
# no nulls / empty rows until the very end of the columns. (You can use 10-25%
# of your full dataset https://journals.sagepub.com/doi/10.1177/1609406919899220)
# e.g., name1_tag name1_content name2_tag name2_content
# +Artificial cute robot +Artificial cute robot
# Approachable cute robot Personalized depends on topic
# Personalized depends on topic Convenient not smart
data = pd.read_excel('path/to/spreadsheet.xlsx', sheet_name="krippendorffs_alpha_prep")
### put data into the correct format for krippendorff's alpha: (coder, item, label)
def getkripdata(data, coders):
    tagdata = []
    for coder in coders:
        i = 0
        currTags = []
        while i < len(data[f"{coder}_content"]) - 1:
            currItem = data[f"{coder}_content"][i]
            nextItem = data[f"{coder}_content"][i+1]
            currTag = data[f"{coder}_tag"][i]
            currTags.append(currTag)
            if nextItem != currItem:
                if pd.notna(currItem):
                    tagdata.append((f"{coder}", currItem, frozenset(currTags)))
                currTags = []
            i += 1
        # append the last tags/items
        if data[f"{coder}_content"][i] == currItem:
            currTags.append(data[f"{coder}_tag"][i])
        else:
            currItem = data[f"{coder}_content"][i]
            currTags = [data[f"{coder}_tag"][i]]
        if pd.notna(currItem):
            tagdata.append((coder, currItem, frozenset(currTags)))
    return tagdata
# In this case, we have 3 coders, where each coder overlapped with the other
# coders for a certain portion of the text. (Alternatively, you could just have
# 2 coders, so you'd just need `coders=[name1,name2]`.)
# e.g., name1forname2_tag name1forname2_content name2forname1_tag name2forname1_content name1forname3_tag name1forname3_content name3forname1_tag ...
# (vs. name1_tag name1_content name2_tag name2_content, as shown above)
coders_name1name2 = ["name1forname2","name2forname1"]
tagdata_name1name2 = getkripdata(data, coders_name1name2)
coders_name1name3 = ["name1forname3","name3forname1"]
tagdata_name1name3 = getkripdata(data, coders_name1name3)
coders_name3name2 = ["name2forname3","name3forname2"]
tagdata_name3name2 = getkripdata(data, coders_name3name2)
allcoders = [coders_name1name2, coders_name1name3, coders_name3name2]
alltagdata = [tagdata_name1name2, tagdata_name1name3, tagdata_name3name2]
### make sure the number of items is the same for both coders, so that we can
### compute krippendorff's alpha:
i = 0
for tagdata in alltagdata:
    coders = allcoders[i]
    coder1 = coders[0]
    coder2 = coders[1]
    unique1 = data[f"{coder1}_content"].unique()
    unique1 = unique1[~pd.isnull(unique1)]
    unique2 = data[f"{coder2}_content"].unique()
    unique2 = unique2[~pd.isnull(unique2)]
    # print(f'### for coders: {coders} ###')
    # print(f'unique1: {unique1}\nunique2: {unique2}')
    # print(f'unique1: {len(unique1)}\nunique2: {len(unique2)}')
    if len(unique1) != len(unique2):
        print('### NOT THE SAME NUMBER OF ITEMS ###')
        for item in unique1:
            if not (item in unique2):
                print(f'{coder2} is missing item ~{item}~')
        for item in unique2:
            if not (item in unique1):
                print(f'{coder1} is missing item ~{item}~')
        print()
    i += 1
### Krippendorff's Alpha with MASI distance:
# (coder, item, label)
# exampledata = [('inky','text01',frozenset(['love','gifts'])),
#                ('blinky','text01',frozenset(['love','gifts'])),
#                ('sue','text01',frozenset(['love','gifts'])),
#                ('inky','text02',frozenset(['slime','gaming'])),
#                ('blinky','text02',frozenset(['slime'])),
#                ('sue','text02',frozenset(['slime','gaming']))]
i = 0
for tagdata in alltagdata:
    task = nltk.AnnotationTask(distance=masi_distance)
    # jaccard_task = nltk.AnnotationTask(distance=jaccard_distance)
    coders = allcoders[i]
    task.load_array(tagdata)
    print("################################################################")
    print(f"Statistics for dataset using {task.distance} with coders {coders}")
    print("C: {}\nI: {}\nK: {}".format(task.C, task.I, task.K))
    print("Pi: {}".format(task.pi()))
    print("Kappa: {}".format(task.kappa()))
    print("Multi-Kappa: {}".format(task.multi_kappa()))
    print("Alpha: {}".format(task.alpha()))
    print()
    i += 1
### Krippendorff's Alpha values meaning:
# Values range from 0 to 1, where 0 is perfect disagreement and 1 is perfect
# agreement. Krippendorff suggests: "[I]t is customary to require α ≥ .800.
# Where tentative conclusions are still acceptable, α ≥ .667 is the lowest
# conceivable limit" Krippendorff, K. (2004). Content analysis:
# An introduction to its methodology. Thousand Oaks, California: Sage. p.241
import glob, os
import re
import nltk
import nltk.corpus
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.stem import WordNetLemmatizer
###############################################################################
### Takes in text files (in the dir `path`) containing a long string of words
### (e.g., all the answers from students to a question, all on one line), and
### outputs a csv file with cleaned, lemmatized words in a single column.
def clean_text(text):
    newtext = text.lower()
    # remove @mentions, non-alphanumeric characters, and URLs:
    newtext = re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)|^rt|http.+?", "", newtext)
    newtext = re.sub(r"\d+", "", newtext)  # remove digits
    return newtext
def create_word_freq_pairs(text, filterWordsList):
    wordlist = text.split()
    wordlist = list(filter(lambda w: w not in filterWordsList, wordlist))
    wordfreq = [wordlist.count(p) for p in wordlist]
    freqdict = dict(list(zip(wordlist, wordfreq)))
    sorteddict = [(freqdict[key], key) for key in freqdict]
    sorteddict.sort()
    sorteddict.reverse()
    return sorteddict

filterWordsList = [
    'ai', 'artificial', 'artifical', 'intelligence', 'conversational'
]
path = 'path/to/where/text/files/are/'
os.chdir(path)
stop = stopwords.words('english')
dfs = {}  # dict with dataframes for each of the texts
for filename in glob.glob("*.txt"):
    # clean the text (remove stop words, lemmatize)
    print(filename)
    fr = open(filename, 'r')
    cleaned = clean_text(fr.read())
    nostop = ' '.join([word for word in cleaned.split() if word not in stop])
    tokenized = word_tokenize(nostop)
    lemmatized = [WordNetLemmatizer().lemmatize(i) for i in tokenized]
    donetext = 'words\n'
    for i in lemmatized:
        donetext += i + '\n'
    fr.close()
    # write to a file for use in Tableau or in word_cloud_creator.py
    newfile = filename.split('.txt')[0] + '_cleaned.csv'
    fw = open(newfile, 'w')
    fw.write(donetext)
    fw.close()
    print(newfile)
    print()
###############################################################################
# See first nlp-clean-text-for-word-cloud.py to create *_cleaned.csv files
# Based on https://www.datacamp.com/community/tutorials/wordcloud-python
import glob, os
import numpy as np
import pandas as pd
from os import path
from PIL import Image
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
from collections import Counter
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
from matplotlib.ticker import MaxNLocator
path = 'path/to/csv/files/from/nlp-clean-text-for-word-cloud/'
outputpath = 'path/to/where/you/want/your/figs/stored/'
filterWordsList = [
    'put', 'any', 'words', 'you', 'don\'t', 'want', 'in', 'the', 'analysis', 'here'
]
os.chdir(path)
for filename in glob.glob("*_cleaned.csv"):
    # create title for plots
    title = ''
    if filename.find('pre') != -1:
        title += 'Pre-test'
    else:
        title += 'Post-test'
    title += ' word frequency:'
    print(title)
    # get frequency data
    df = pd.read_csv(filename)
    # filter out obvious words
    for word in filterWordsList:
        df = df[df['words'] != word]
    freq = df['words'].value_counts()
    print(freq)
    # set max number of words for plotting
    N = 7
    indices = list(freq.index[0:N])
    values = list(freq.values[0:N])
    indices.reverse()
    values.reverse()
    # create 2 subplots with relative sizing
    fig = plt.figure(figsize=(6, 2.5))  # (9,3.75))
    fig.suptitle(title)
    gs = gridspec.GridSpec(1, 2, width_ratios=[1, 2.25])
    # gs.update(wspace=0.1, hspace=0.1)  # set the spacing between axes
    ax1 = plt.subplot(gs[0])
    ax2 = plt.subplot(gs[1])
    # plot histogram
    ax1.barh(indices, values)
    labels = ax1.get_yticklabels()
    plt.setp(labels, rotation=45, horizontalalignment='right')
    ax1.set(xlabel='Count')
    ax1.xaxis.set_major_locator(MaxNLocator(integer=True))
    for label in ax1.xaxis.get_ticklabels()[::2]:  # hide every other tick
        label.set_visible(False)
    # plot word cloud
    text = ' '.join(df['words'].tolist())
    wordcloud = WordCloud(max_words=100, background_color="white",
                          width=1600, height=800).generate(text)
    ax2.imshow(wordcloud, interpolation='bilinear')
    ax2.axis("off")
    # save & show figure
    plt.tight_layout()
    fig.subplots_adjust(top=.9)
    plt.savefig(outputpath + filename.split('_cleaned.csv')[0] + '.png', dpi=150)
    plt.show()
import pandas as pd
import scipy.stats as stats
################################################################################
# Fisher's Exact Test
# https://www.statology.org/fishers-exact-test-python/
# E.g., Is there a statistically significant association between gender and
# political party?
# Data:
# Democrat Republican
# Female 8 4
# Male 4 9
#
# E.g., Is there an association between being a child or parent and thinking
# conversational agents should be more human-like or artificial?
# Artificial Human-like
# Child 11 10
# Parent 3 7
#################
### Data prep ###
# E.g., Count the number of tags, like 'human-like' and 'artificial', for children vs. parents
# Spreadsheet might look like this:
# C_Tag1 | P_Tag1 | C_Tag2 | P_Tag2 | ...
# 11 | 3 | 10 | 7 | ...
df = pd.read_excel('path/to/spreadsheet.xlsx', sheet_name="sheet-name")
alpha = 0.05
#####################
### Fisher's test ###
colnames = ['C_Human-like', 'C_Artificial', 'P_Human-like', 'P_Artificial']
# The sheet has a single row of counts, so take the first value in each column:
table = [[float(df[colnames[0]].iloc[0]), float(df[colnames[1]].iloc[0])],
         [float(df[colnames[2]].iloc[0]), float(df[colnames[3]].iloc[0])]]
print(table)
resdiff = stats.fisher_exact(table, alternative='two-sided')
resgreater = stats.fisher_exact(table, alternative='greater')
resless = stats.fisher_exact(table, alternative='less')
# Two-sided ('resdiff'):
# if p <= alpha:
#     "H1: (alternative hypothesis) The two variables are not independent (i.e.,
#     there's an association between the two variables)."
# else:
#     "H0: (null hypothesis) The two variables are independent."
# "When using a one-tailed test (greater/less), you are testing for the possibility of the
# relationship in one direction and completely disregarding the possibility of a
# relationship in the other direction. [...] A one-tailed test will test either if the
# mean is significantly greater than x or if the mean is significantly less than x, but not
# both. [...] The one-tailed test provides more power to detect an effect in one direction
# by not testing the effect in the other direction."
diff = False
if resdiff[1] <= alpha:
    print(f'DIFFERENT: oddsr: {resdiff[0]}, p: {resdiff[1]}')
    diff = True
if resgreater[1] <= alpha:
    print(f'GREATER: oddsr: {resgreater[0]}, p: {resgreater[1]}')
    diff = True
if resless[1] <= alpha:
    print(f'LESS: oddsr: {resless[0]}, p: {resless[1]}')
    diff = True
if not diff:
    print('No significant difference.')
import pandas as pd
from scipy import stats
data = pd.read_excel('path/to/file.xlsx')
##############################################################################################
### test for normality
# Note: if you're doing this for a paired test, like Wilcoxon Signed-Rank test or a paired
# t-test, you'll want to test the differences between the two samples for normality.
alpha = 0.05
res = stats.normaltest(data.colname)
if res.pvalue < alpha:  # null hypothesis: x comes from a normal distribution
    print("The null hypothesis can be rejected — x does not come from a normal distribution")
    print("Use a non-parametric test, like the wilcoxon signed-rank test")
else:
    print("The null hypothesis cannot be rejected — x comes from a normal distribution")
    print("Use a parametric test, like the t-test")
### result: if from normal distribution, use parametric test, otherwise, use non-parametric
##############################################################################################
import pandas as pd
import math
from scipy import stats
predata = pd.read_excel('path/to/predatafile.xlsx')
postdata = pd.read_excel('path/to/postdatafile.xlsx')
################################################################################
### Wilcoxon signed-rank test
# See first, "test for normality". If not from a normal dist, use this; if it is
# normal, use the paired t-test.
# NOTE: See 02-paired-t-test-and-wilx-helper-functions.py for functions you can
# easily loop over (normality test, t-test, wilcoxon test).
# Assumptions:
# 1. Data are paired and come from the same population (e.g., Likert scale from
# students' pre- and post-test).
# 2. Each pair is chosen randomly and independently.
# 3. The data are measured on at least an interval scale when, as is usual,
# within-pair differences are calculated to perform the test (though it does
# suffice that within-pair comparisons are on an ordinal scale).
#
# See example: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.wilcoxon.html
# This test can be done in two ways: by giving the pre- and post-test results in the command,
# or by getting the differences and giving that to the command. Let's give the differences
# here.
# First, let's make sure to clean the data s.t. we're comparing the same/correct students pre
# vs. post. Only look at IDs that are in BOTH pre and post data:
predata_cleaned = predata.loc[predata['ID'].isin(postdata['ID'])].sort_values(by=['ID'])
postdata_cleaned = postdata.loc[postdata['ID'].isin(predata['ID'])].sort_values(by=['ID'])
col = 'name_of_column_to_compare'
diffs = []
for id in predata_cleaned['ID'].values:
    diff = postdata_cleaned.loc[postdata_cleaned['ID'] == id][col].values[0] - \
        predata_cleaned.loc[predata_cleaned['ID'] == id][col].values[0]
    # check for NaN (could be a missing value) before adding to diffs:
    if not math.isnan(diff):
        diffs.append(float(diff))
stat = stats.wilcoxon(diffs)
wilx = {'pvalue': stat.pvalue,
        'statistic': stat.statistic,
        'z_abs': stats.norm.isf(stat.pvalue / 2)}
if wilx['pvalue'] < 0.05:
    # test greater/less than:
    statg = stats.wilcoxon(diffs, alternative='greater')
    statl = stats.wilcoxon(diffs, alternative='less')
    if statg.pvalue < 0.05:
        wilx = {'pvalue': statg.pvalue,
                'statistic': statg.statistic,
                'z_abs': stats.norm.isf(statg.pvalue),
                'greater': True}
        print(f'{col} is sig: post is greater, {wilx}')
    elif statl.pvalue < 0.05:
        wilx = {'pvalue': statl.pvalue,
                'statistic': statl.statistic,
                'z_abs': stats.norm.isf(statl.pvalue),
                'less': True}
        print(f'{col} is sig: post is less, {wilx}')
    else:
        print(f'{col} is sig: {wilx}')
    # print out the sum of diffs so that you can double-check greater/less
    print('sum of {0}: {1}'.format(col, sum(diffs)))
else:
    print(f'{col} is NOT sig.')
################################################################################