@yifeihuang
Last active December 4, 2020 03:45
Multivariate experiment analysis procedure
{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Example analysis of a multivariate experiment using multiple linear regression\n",
"\n",
"### problem statement: we ran a multivariate experiment on the enjoyment of food as a function of food and topping. The food is varied between ice cream and hotdog, and topping is varied between mustard and chocolate sauce\n",
"dataset borrowed from https://statisticsbyjim.com/regression/interaction-effects/\n",
"\n",
"### analysis procedure\n",
"- examine dataset to undertand features\n",
"- create interaction terms of interest, e.g. food x topping\n",
"- create dummy variables (one hot encoding of categorical variables) needed for regression models\n",
"- run regression model with cross validation (to minimize overfitting) and regularization (to eliminate non-impactful features)\n",
"- optional: if the sample size is large (>1000), you may wish to withhold a portion (typically 20%) as a test set to evaluate the true accuracy of the model, which is an assessment models ability to generalize and accuracy of the feature coefficient estimates\n",
"- analyze results"
]
},
{
"cell_type": "code",
"execution_count": 94,
"metadata": {},
"outputs": [],
"source": [
"df = pd.read_csv('Interactions_Categorical.csv')"
]
},
{
"cell_type": "code",
"execution_count": 95,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Enjoyment</th>\n",
" <th>Food</th>\n",
" <th>Condiment</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>81.926957</td>\n",
" <td>Hot Dog</td>\n",
" <td>Mustard</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>84.939774</td>\n",
" <td>Hot Dog</td>\n",
" <td>Mustard</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>90.286479</td>\n",
" <td>Hot Dog</td>\n",
" <td>Mustard</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>89.561802</td>\n",
" <td>Hot Dog</td>\n",
" <td>Mustard</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>97.676826</td>\n",
" <td>Hot Dog</td>\n",
" <td>Mustard</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>75</th>\n",
" <td>61.920134</td>\n",
" <td>Ice Cream</td>\n",
" <td>Mustard</td>\n",
" </tr>\n",
" <tr>\n",
" <th>76</th>\n",
" <td>61.055942</td>\n",
" <td>Ice Cream</td>\n",
" <td>Mustard</td>\n",
" </tr>\n",
" <tr>\n",
" <th>77</th>\n",
" <td>61.976713</td>\n",
" <td>Ice Cream</td>\n",
" <td>Mustard</td>\n",
" </tr>\n",
" <tr>\n",
" <th>78</th>\n",
" <td>61.544813</td>\n",
" <td>Ice Cream</td>\n",
" <td>Mustard</td>\n",
" </tr>\n",
" <tr>\n",
" <th>79</th>\n",
" <td>69.276921</td>\n",
" <td>Ice Cream</td>\n",
" <td>Mustard</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>80 rows × 3 columns</p>\n",
"</div>"
],
"text/plain": [
" Enjoyment Food Condiment\n",
"0 81.926957 Hot Dog Mustard\n",
"1 84.939774 Hot Dog Mustard\n",
"2 90.286479 Hot Dog Mustard\n",
"3 89.561802 Hot Dog Mustard\n",
"4 97.676826 Hot Dog Mustard\n",
".. ... ... ...\n",
"75 61.920134 Ice Cream Mustard\n",
"76 61.055942 Ice Cream Mustard\n",
"77 61.976713 Ice Cream Mustard\n",
"78 61.544813 Ice Cream Mustard\n",
"79 69.276921 Ice Cream Mustard\n",
"\n",
"[80 rows x 3 columns]"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"display(df)"
]
},
{
"cell_type": "code",
"execution_count": 96,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Enjoyment</th>\n",
" <th>Food</th>\n",
" <th>Condiment</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>count</th>\n",
" <td>80.000000</td>\n",
" <td>80</td>\n",
" <td>80</td>\n",
" </tr>\n",
" <tr>\n",
" <th>unique</th>\n",
" <td>NaN</td>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>top</th>\n",
" <td>NaN</td>\n",
" <td>Hot Dog</td>\n",
" <td>Chocolate Sauce</td>\n",
" </tr>\n",
" <tr>\n",
" <th>freq</th>\n",
" <td>NaN</td>\n",
" <td>40</td>\n",
" <td>40</td>\n",
" </tr>\n",
" <tr>\n",
" <th>mean</th>\n",
" <td>77.319827</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>std</th>\n",
" <td>15.044257</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>min</th>\n",
" <td>52.309297</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>25%</th>\n",
" <td>62.331626</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>50%</th>\n",
" <td>77.999388</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>75%</th>\n",
" <td>90.922419</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>max</th>\n",
" <td>102.620440</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Enjoyment Food Condiment\n",
"count 80.000000 80 80\n",
"unique NaN 2 2\n",
"top NaN Hot Dog Chocolate Sauce\n",
"freq NaN 40 40\n",
"mean 77.319827 NaN NaN\n",
"std 15.044257 NaN NaN\n",
"min 52.309297 NaN NaN\n",
"25% 62.331626 NaN NaN\n",
"50% 77.999388 NaN NaN\n",
"75% 90.922419 NaN NaN\n",
"max 102.620440 NaN NaN"
]
},
"execution_count": 96,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.describe(include='all')"
]
},
{
"cell_type": "code",
"execution_count": 97,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Enjoyment</th>\n",
" <th>Food_Hot Dog</th>\n",
" <th>Food_Ice Cream</th>\n",
" <th>Cond_Chocolate Sauce</th>\n",
" <th>Cond_Mustard</th>\n",
" <th>FoodxCond_Chocolate Sauce on Hot Dog</th>\n",
" <th>FoodxCond_Chocolate Sauce on Ice Cream</th>\n",
" <th>FoodxCond_Mustard on Hot Dog</th>\n",
" <th>FoodxCond_Mustard on Ice Cream</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>81.926957</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>84.939774</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>90.286479</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>89.561802</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>97.676826</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>75</th>\n",
" <td>61.920134</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>76</th>\n",
" <td>61.055942</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>77</th>\n",
" <td>61.976713</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>78</th>\n",
" <td>61.544813</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>79</th>\n",
" <td>69.276921</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>80 rows × 9 columns</p>\n",
"</div>"
],
"text/plain": [
" Enjoyment Food_Hot Dog Food_Ice Cream Cond_Chocolate Sauce \\\n",
"0 81.926957 1 0 0 \n",
"1 84.939774 1 0 0 \n",
"2 90.286479 1 0 0 \n",
"3 89.561802 1 0 0 \n",
"4 97.676826 1 0 0 \n",
".. ... ... ... ... \n",
"75 61.920134 0 1 0 \n",
"76 61.055942 0 1 0 \n",
"77 61.976713 0 1 0 \n",
"78 61.544813 0 1 0 \n",
"79 69.276921 0 1 0 \n",
"\n",
" Cond_Mustard FoodxCond_Chocolate Sauce on Hot Dog \\\n",
"0 1 0 \n",
"1 1 0 \n",
"2 1 0 \n",
"3 1 0 \n",
"4 1 0 \n",
".. ... ... \n",
"75 1 0 \n",
"76 1 0 \n",
"77 1 0 \n",
"78 1 0 \n",
"79 1 0 \n",
"\n",
" FoodxCond_Chocolate Sauce on Ice Cream FoodxCond_Mustard on Hot Dog \\\n",
"0 0 1 \n",
"1 0 1 \n",
"2 0 1 \n",
"3 0 1 \n",
"4 0 1 \n",
".. ... ... \n",
"75 0 0 \n",
"76 0 0 \n",
"77 0 0 \n",
"78 0 0 \n",
"79 0 0 \n",
"\n",
" FoodxCond_Mustard on Ice Cream \n",
"0 0 \n",
"1 0 \n",
"2 0 \n",
"3 0 \n",
"4 0 \n",
".. ... \n",
"75 1 \n",
"76 1 \n",
"77 1 \n",
"78 1 \n",
"79 1 \n",
"\n",
"[80 rows x 9 columns]"
]
},
"execution_count": 97,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"test = df\n",
"test['FoodxCond'] = test.apply(lambda x: x['Condiment'] + ' on ' + x['Food'], axis=1)\n",
"test = pd.get_dummies(df, prefix=['Food', 'Cond', 'FoodxCond'], columns=['Food', 'Condiment', 'FoodxCond'])\n",
"test"
]
},
{
"cell_type": "code",
"execution_count": 98,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Baseline enjoyment: 76.0\n",
"\n",
"Element impact on enjoyment\n",
"FoodxCond_Chocolate Sauce on Ice Cream: 17.0\n",
"FoodxCond_Mustard on Hot Dog: 13.6\n",
"Food_Hot Dog: 0.0\n",
"Food_Ice Cream: -0.0\n",
"Cond_Chocolate Sauce: 0.0\n",
"Cond_Mustard: -0.0\n",
"FoodxCond_Chocolate Sauce on Hot Dog: -10.7\n",
"FoodxCond_Mustard on Ice Cream: -14.7\n",
"\n",
"Mean squared error: 23.8\n",
"Coefficient of determination: 89.3%\n"
]
}
],
"source": [
"from sklearn import linear_model\n",
"from sklearn.metrics import mean_squared_error, r2_score\n",
"\n",
"# Create cross validated and l2 regularized regression model\n",
"regr = linear_model.LassoCV(fit_intercept=True, cv=5)\n",
"\n",
"# get all factors and interaction terms as input variables\n",
"X = [c for c in test.columns.values if c!='Enjoyment']\n",
"\n",
"# fit model\n",
"regr.fit(test[X], test['Enjoyment'])\n",
"\n",
"# predict\n",
"y_pred = regr.predict(test[X])\n",
"\n",
"# The results\n",
"print('Baseline enjoyment: {:0.1f}\\n'.format(regr.intercept_))\n",
"print('Element impact on enjoyment')\n",
"coefficients = list(zip(X,regr.coef_))\n",
"coefficients.sort(key=lambda x: x[1], reverse=True)\n",
"for c in coefficients:\n",
" print('{}: {:.1f}'.format(c[0],c[1]))\n",
"\n",
"# The mean squared error\n",
"print('\\nMean squared error: {:0.1f}'.format(mean_squared_error(test['Enjoyment'], y_pred)))\n",
"# The coefficient of determination: 1 is perfect prediction\n",
"print('Coefficient of determination: {:0.1%}'.format(r2_score(test['Enjoyment'], y_pred)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### we can see that coefficient indicate that the main elements of food and condiment actually do not impact the enjoyment (0 coefficient) - people don't necessarily prefer one over the other\n",
"### The interaction between food and topping is actually huge impactful, with chocolate sauce on ice cream having the biggest positive impact, and mustard on ice cream having the biggest negative impact"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.5"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
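
The notebook's conclusion (no main effects, large interaction effects) can be sanity-checked without a regression model by comparing group means. The sketch below is a minimal illustration using only the Python standard library. The `rows` values are hypothetical toy numbers chosen for the example, not the real `Interactions_Categorical.csv` data, and the helper names `cell_means` and `marginal_means` are made up for this sketch.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy rows (enjoyment, food, condiment); NOT the real dataset.
rows = [
    (90.0, "Hot Dog", "Mustard"),
    (88.0, "Hot Dog", "Mustard"),
    (66.0, "Hot Dog", "Chocolate Sauce"),
    (64.0, "Hot Dog", "Chocolate Sauce"),
    (66.0, "Ice Cream", "Mustard"),
    (64.0, "Ice Cream", "Mustard"),
    (90.0, "Ice Cream", "Chocolate Sauce"),
    (88.0, "Ice Cream", "Chocolate Sauce"),
]


def cell_means(rows):
    """Average enjoyment for every (food, condiment) combination."""
    cells = defaultdict(list)
    for enjoyment, food, condiment in rows:
        cells[(food, condiment)].append(enjoyment)
    return {cell: mean(values) for cell, values in cells.items()}


def marginal_means(rows, index):
    """Average enjoyment per level of one factor (index 1 = food, 2 = condiment)."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[index]].append(row[0])
    return {level: mean(values) for level, values in groups.items()}


means = cell_means(rows)
food_means = marginal_means(rows, 1)
condiment_means = marginal_means(rows, 2)

# Interaction contrast: how much the mustard-vs-chocolate gap differs between foods.
interaction = (
    (means[("Hot Dog", "Mustard")] - means[("Hot Dog", "Chocolate Sauce")])
    - (means[("Ice Cream", "Mustard")] - means[("Ice Cream", "Chocolate Sauce")])
)

print("Food means:", food_means)
print("Condiment means:", condiment_means)
print("Interaction contrast:", interaction)
```

In this toy data the food means and the condiment means are identical, so neither factor matters on its own, while the interaction contrast is large. This is the same pattern the lasso coefficients show for the real dataset.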