shlomihod/pre-class.ipynb

## pre-class.ipynb
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "7xPuF47f-ECM"
   },
   "source": [
    "![banner](https://learn.responsibly.ai/assets/banner.jpg)\n",
    "\n",
    "# Class 4 - Discrimination & Fairness: Pre-Class Task\n",
    "\n",
    "https://learn.responsibly.ai"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "Rdojsdf0fGkn"
   },
   "source": [
    "\n",
    "In the fourth class, we will continue our journy into the challenge of fairness in machine learning.\n",
    "\n",
    "The goal of this pre-class task is to get familiar with an healthcare dataset we will work on in the fourth class. It is an Electronic Health Record (EHR) dataset which contains an electronic version of patients' medical history. The dataset based on data from a US health insurance firm, so it conains also cost information. From privacy resons, the data went through an \"anonomization\" process, but we'll get into that in the next classes.\n",
    "\n",
    "The following six tasks will help you to get familiar with this dataset, understand its various columns and their relation to each other. \n",
    "\n",
    "Please go through the whole notebook before you start coding. You could plan your work better if you have first an overview of the task.\n",
    "\n",
    "If you have any questions, please post them in the `#ds` channel in Discord or join the office hours.\n",
    "\n",
    "Let's start!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "FEahmIMGfk0h"
   },
   "source": [
    "## How to write your answers?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "5kaCEPTV-mmy"
   },
   "source": [
    "\n",
    "For each task, you should write a code and `print()` the answers to the questions. You should print them in the following format (see Task 1 example): \n",
    "\n",
    "```python\n",
    "print(f'T{task_number}-Q{question_number}: {answer}')\n",
    "```\n",
    "\n",
    "In some questions, you will need to display a data frame. The format of answering these type of question is:\n",
    "\n",
    "```python\n",
    "print(f'T{task_number}-Q{question_number}:')\n",
    "display(answer_df)\n",
    "```\n",
    "\n",
    "We assume you are familiar with `pandas` package. If you are not, you can read this [toturial](https://realpython.com/pandas-python-explore-dataset/) (note that it contains much more than you need in this task, yet it can be handy as a reference when you answer the questions). "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "GDLFwEBGr3qD"
   },
   "source": [
    "## Setup"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "RnBxHABh_i3Q"
   },
   "outputs": [],
   "source": [
    "!wget -q https://stash.responsibly.ai/4-fairness/activity/patients_data.csv -O patients_data.csv\n",
    "\n",
    "%pip install -qqq git+https://github.com/ResponsiblyAI/railib.git"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "MoDJSDVMgDRZ"
   },
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns\n",
    "from IPython.display import display\n",
    "\n",
    "import pandas as pd\n",
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns\n",
    "from sklearn.model_selection import train_test_split\n",
    "from railib.fairness.second import plot_x_vs_y\n",
    "\n",
    "sns.set()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "vO1b2cRhr_jF"
   },
   "source": [
    "## Dataset\n",
    "\n",
    "### Meaning of column suffixes (endings)\n",
    "* _t: indicates this is a time dependent variable from year t (e.g. t = 2020)\n",
    "* _tm1: indicates this is a time dependent variable from year t minus 1 (t-1) (e.g. if t = 2021 then t - 1 = 2020).\n",
    "\n",
    "### Variables categories\n",
    "\n",
    "#### Can be used as target variables (because they are \"at time t\" ):\n",
    "\n",
    "* Outcomes at time t: \"outcomes\" for a given calendar year (t): cost, health and program enrollment.\n",
    "In particular, we have:\n",
    "\n",
    "| Column Name        | Description                                                                                      | Note                                                                          |\n",
    "|--------------------|--------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------|\n",
    "| cost_t             | Total medical expenditures, rounded to the nearest 100                                           | The actual target variable used to train the model and produce the risk score |\n",
    "| program_enrolled_t | Indicator for whether patient-year was enrolled in program                                       |                                                                               |\n",
    "| illnesses_sum_t    | Total number of active illnesses                                                         |                                                                               |\n",
    "\n",
    "#### Can be used as \"predictors\" (features): \n",
    "\n",
    "* **Demographic**: e.g `gender`, `race`, `age`.\n",
    "* **Comorbidity variables** at time t-1: indicators for specific illnesses that were active in the previous year. <br> E.g `liver_elixhauser_tm1` which is an indicator for liver disease.\n",
    "* **Cost variables** at time t-1: Costs claimed from the patients' insurance payer over the previous year. <br> E.g `cost_laboratory_tm1` which is the total cost for laboratory tests.\n",
    "* **Biomarker\\medication** variables at time t-1: indicators capturing normal or abnormal values (or missingness) of biomarkers or relevant medications, over the previous year. <br> E.g `ghba1c_min-low_tm1` which is an indicator for low (< 4) minimum GHbA1c test result.\n",
    "\n",
    "An indicator is a binary variable: 1 stands for 'True' or 'Has the condition', 0 stands for 'False' or 'Doesn't Have'.\n",
    "\n",
    "For a detailed description of the dataset and the columns see [this document](https://docs.google.com/document/d/1OFGh7Hkqo8FjcPfBGql7mr8to9Il16XwvLdJkXF1IeU/edit?usp=sharing), but you don't need it.\n",
    "\n",
    "The EHR (Electronic Health Record) dataset contains 48,784 rows (patients) and 160 columns/variables. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "UIMjTzE2R-kk"
   },
   "source": [
    "### Loading the data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 317
    },
    "id": "57nnfiuzgEyu",
    "outputId": "abf3912b-a25d-4a07-d5d2-874e9c04572b"
   },
   "outputs": [],
   "source": [
    "data = pd.read_csv('patients_data.csv')\n",
    "data.head(5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "fFdLHI5mSAmR"
   },
   "source": [
    "### Columns declaration"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "IPNKAtHijI6N"
   },
   "source": [
    "It is very difficult to work with all the 160 features all together. Therefore, we grouped them into high-level categoires."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "JmnUViQNSCEb"
   },
   "outputs": [],
   "source": [
    "# Outcome variables at time T (see above)\n",
    "outcomes_t = ['illnesses_sum_t', 'cost_t', 'cost_avoidable_t', 'bps_mean_t', 'ghba1c_mean_t', 'hct_mean_t', 'cre_mean_t', 'ldl_mean_t']\n",
    "\n",
    "# Demographic variables at time T-1\n",
    "demographic_tm1 = ['gender', 'race', 'age']\n",
    "\n",
    "# illness condition variables at time T-1\n",
    "comorbidities_tm1 = ['illnesses_sum_tm1', 'alcohol_elixhauser_tm1', 'anemia_elixhauser_tm1', 'arrhythmia_elixhauser_tm1', 'arthritis_elixhauser_tm1', 'bloodlossanemia_elixhauser_tm1', 'coagulopathy_elixhauser_tm1', 'compdiabetes_elixhauser_tm1', 'depression_elixhauser_tm1', 'drugabuse_elixhauser_tm1', 'electrolytes_elixhauser_tm1', 'hypertension_elixhauser_tm1', 'hypothyroid_elixhauser_tm1', 'liver_elixhauser_tm1', 'neurodegen_elixhauser_tm1', 'obesity_elixhauser_tm1', 'paralysis_elixhauser_tm1', 'psychosis_elixhauser_tm1', 'pulmcirc_elixhauser_tm1', 'pvd_elixhauser_tm1', 'renal_elixhauser_tm1', 'uncompdiabetes_elixhauser_tm1', 'valvulardz_elixhauser_tm1', 'wtloss_elixhauser_tm1', 'cerebrovasculardz_romano_tm1', 'chf_romano_tm1', 'dementia_romano_tm1', 'hemiplegia_romano_tm1', 'hivaids_romano_tm1', 'metastatic_romano_tm1', 'myocardialinfarct_romano_tm1', 'pulmonarydz_romano_tm1', 'tumor_romano_tm1', 'ulcer_romano_tm1']\n",
    "\n",
    "# Cost variables at time T-1\n",
    "costs_tm1 = ['cost_dialysis_tm1', 'cost_emergency_tm1', 'cost_home_health_tm1', 'cost_ip_medical_tm1', 'cost_ip_surgical_tm1', 'cost_laboratory_tm1', 'cost_op_primary_care_tm1', 'cost_op_specialists_tm1', 'cost_op_surgery_tm1', 'cost_other_tm1', 'cost_pharmacy_tm1', 'cost_physical_therapy_tm1', 'cost_radiology_tm1']\n",
    "\n",
    "# Biomarkers (e.g., blood test result) varbles at time T-1\n",
    "biomarkers_tm1 = ['lasix_dose_count_tm1', 'lasix_min_daily_dose_tm1', 'lasix_mean_daily_dose_tm1', 'lasix_max_daily_dose_tm1', 'cre_tests_tm1', 'crp_tests_tm1', 'esr_tests_tm1', 'ghba1c_tests_tm1', 'hct_tests_tm1', 'ldl_tests_tm1', 'nt_bnp_tests_tm1', 'sodium_tests_tm1', 'trig_tests_tm1', 'cre_min-low_tm1', 'cre_min-high_tm1', 'cre_min-normal_tm1', 'cre_mean-low_tm1', 'cre_mean-high_tm1', 'cre_mean-normal_tm1', 'cre_max-low_tm1', 'cre_max-high_tm1', 'cre_max-normal_tm1', 'crp_min-low_tm1', 'crp_min-high_tm1', 'crp_min-normal_tm1', 'crp_mean-low_tm1', 'crp_mean-high_tm1', 'crp_mean-normal_tm1', 'crp_max-low_tm1', 'crp_max-high_tm1', 'crp_max-normal_tm1', 'esr_min-low_tm1', 'esr_min-high_tm1', 'esr_min-normal_tm1', 'esr_mean-low_tm1', 'esr_mean-high_tm1', 'esr_mean-normal_tm1', 'esr_max-low_tm1', 'esr_max-high_tm1', 'esr_max-normal_tm1', 'ghba1c_min-low_tm1', 'ghba1c_min-high_tm1', 'ghba1c_min-normal_tm1', 'ghba1c_mean-low_tm1', 'ghba1c_mean-high_tm1', 'ghba1c_mean-normal_tm1', 'ghba1c_max-low_tm1', 'ghba1c_max-high_tm1', 'ghba1c_max-normal_tm1', 'hct_min-low_tm1', 'hct_min-high_tm1', 'hct_min-normal_tm1', 'hct_mean-low_tm1', 'hct_mean-high_tm1', 'hct_mean-normal_tm1', 'hct_max-low_tm1', 'hct_max-high_tm1', 'hct_max-normal_tm1', 'ldl_min-low_tm1', 'ldl_min-high_tm1', 'ldl_min-normal_tm1', 'ldl-mean-low_tm1', 'ldl-mean-high_tm1', 'ldl-mean-normal_tm1', 'ldl_max-low_tm1', 'ldl_max-high_tm1', 'ldl_max-normal_tm1', 'nt_bnp_min-low_tm1', 'nt_bnp_min-high_tm1', 'nt_bnp_min-normal_tm1', 'nt_bnp_mean-low_tm1', 'nt_bnp_mean-high_tm1', 'nt_bnp_mean-normal_tm1', 'nt_bnp_max-low_tm1', 'nt_bnp_max-high_tm1', 'nt_bnp_max-normal_tm1', 'sodium_min-low_tm1', 'sodium_min-high_tm1', 'sodium_min-normal_tm1', 'sodium_mean-low_tm1', 'sodium_mean-high_tm1', 'sodium_mean-normal_tm1', 'sodium_max-low_tm1', 'sodium_max-high_tm1', 'sodium_max-normal_tm1', 'trig_min-low_tm1', 'trig_min-high_tm1', 'trig_min-normal_tm1', 'trig_mean-low_tm1', 'trig_mean-high_tm1', 'trig_mean-normal_tm1', 
'trig_max-low_tm1', 'trig_max-high_tm1', 'trig_max-normal_tm1']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "TrI8GCOxhsQL"
   },
   "source": [
    "## Tasks"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "6h6wogdvSCrM"
   },
   "source": [
    "### Task 1 (example)\n",
    "\n",
    "1. How many patients (each row is a patient) there are?\n",
    "2. How many males?\n",
    "3. How many females?\n",
    "4. Display a data frame: for each gender, show the average total number of active illnesses at time t (`illnesses_sum_t`) and the average of the total medical expenditures (`cost_t`)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 213
    },
    "id": "s6wWJuLcr-Qo",
    "outputId": "04c3676b-4119-42a3-fc79-096811781912"
   },
   "outputs": [],
   "source": [
    "answer1 = data.shape[0]\n",
    "answer2 = data[data['gender'] == 'Male'].shape[0]\n",
    "answer3 = data[data['gender'] == 'Female'].shape[0]\n",
    "answer4 = data.groupby('gender').agg({'illnesses_sum_t': 'mean', 'cost_t': 'mean'})\n",
    "\n",
    "print(f\"T1-Q1: {answer1}\")\n",
    "print(f\"T1-Q2: {answer2}\")\n",
    "print(f\"T1-Q3: {answer3}\")\n",
    "print(f\"T1-Q4:\")\n",
    "display(answer4)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "1uPt7Rv6sho4"
   },
   "source": [
    "### Task 2\n",
    "\n",
    "1. Display a data frame: for each gender, show the average total number of active illnesses at time tm1 (`illnesses_sum_tm1`) and the average of the total medical expenditures at tm1 (`cost_tm1`).\n",
    "2. How many patients have higher medical expenditures at time t than tm1?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "u0c9srgx3mmi"
   },
   "outputs": [],
   "source": [
    "# Your code here..."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "vlBUnwJcsvI4"
   },
   "source": [
    "### Task 3\n",
    "\n",
    "1. Print all the age groups (unique values of `age` column)\n",
    "2. Display a dataframe: for each age group show the average value of the following variables at time t (`vars_at_t` defined below): `['bps_mean_t', 'ghba1c_mean_t', 'hct_mean_t', 'cre_mean_t', 'ldl_mean_t']`.\n",
    "Look at the Q2 data frame, and answer the following questions:\n",
    "3. Which age group has the lowest systolic blood pressure (`bps_mean_t`)?\n",
    "4. Which age group has the highest HbA1C (glycated hemoglobin, `ghba1c_mean_t`)?\n",
    "5. Which age group has the highest hematocrit (`hct_mean_t`)?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "6Mt5MbScUf-l"
   },
   "outputs": [],
   "source": [
    "vars_at_t = ['bps_mean_t', 'ghba1c_mean_t', 'hct_mean_t', 'cre_mean_t', 'ldl_mean_t']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "gcauikUm3p46"
   },
   "outputs": [],
   "source": [
    "# Your code here..."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "w4pVCAm0so-h"
   },
   "source": [
    "### Task 4\n",
    "\n",
    "1. Create a race indicator column named `'black'`. If the patient is black, the value of the indicator should be 1, otherwise 0. Don't print anything.\n",
    "2. How many black patients are there?\n",
    "3. How many white patients are there?\n",
    "4. Calculate the correlation between the race indicator and `cost_t`. \n",
    "5. Calculate the correlation between the race indicator and HbA1C (glycated hemoglobin, `ghba1c_mean_t`)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "YmBzaM923vnu"
   },
   "outputs": [],
   "source": [
    "# Your code here..."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "rwAD1w0dtg7o"
   },
   "source": [
    "### Task 5\n",
    "\n",
    "In the cell below, we define a visualization function that we will use in class. The following questions will help you understand the visualization better. First, read the function documentation (and code) and answer the questions:\n",
    "\n",
    "1. Call the function with `x_columns = 'cost_tm1'`, `y_column = 'illnesses_sum_tm1'` (leave `group_column` with its default value). \n",
    "\n",
    "Look at the plot and answer the questions (free text questions)\n",
    "\n",
    "2. What do the plotted X markers represent?\n",
    "3. Is the medical costs variable a good indicator of the illness conditions? For its whole range or only for part of it?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "cx5JDtmXUp7E"
   },
   "outputs": [],
   "source": [
    "help(plot_x_vs_y)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 394
    },
    "id": "jtNQGysGUr-E",
    "outputId": "2d36a1e6-43a9-4373-fc26-a8f322db6a3e"
   },
   "outputs": [],
   "source": [
    "# call the function (T5-Q1) here\n",
    "\n",
    "plot_x_vs_y(data,\n",
    "            x_column='cost_tm1',\n",
    "            y_column='illnesses_sum_tm1')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "BQp8JpjD31lg"
   },
   "outputs": [],
   "source": [
    "# Your code here..."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "A3syIYMSUvge"
   },
   "source": [
    "### Task 6\n",
    "\n",
    "We define patients at risk at time t as patients being in the top decile in terms of medical costs at time tm1 (90th+ percentile of `cost_t`).\n",
    "\n",
    "1. What is the 90th percentile of `cost_tm1`?\n",
    "2. What is the 90th percentile of `cost_tm1` of black patients?\n",
    "3. What is the 90th percentile of `cost_tm1` of white patients?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "OC1k-Pme32n6"
   },
   "outputs": [],
   "source": [
    "# Your code here..."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "nITerUlNwto5"
   },
   "source": [
    "## That's all!\n",
    "\n",
    "If you found a mistake / problem in this notebook, or something was unclear, please post at the `#ds` channel.\n",
    "\n",
    "**Prepare to explain to your team about this data and the model you've trained.**\n",
    "\n",
    "### Submission\n",
    "\n",
    "1. Save the notebook as a pdf file (In Colab: File > Print)\n",
    "2. Upload in Gradescope http://go.responsibly.ai/gradescope"
   ]
  }
 ],
 "metadata": {
  "colab": {
   "provenance": []
  },
  "kernel_info": {
   "name": "python3"
  },
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.0"
  },
  "nteract": {
   "version": "0.12.3"
  },
  "vscode": {
   "interpreter": {
    "hash": "55bbdba5d2159c30191d9b81156a2ec7ece345201aa1fcd9b85bbc484276dddb"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}