@shlomihod
Last active February 26, 2024 19:26
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "Fyjq_FPc56mU"
},
"source": [
"![banner](https://learn.responsibly.ai/assets/banner.jpg)\n",
"\n",
"# Class 4 - Discrimination & Fairness: Technical Report Playground\n",
"\n",
"https://learn.responsibly.ai"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "OBu6_muW56mc"
},
"source": [
"## 1. Setup (NOT IMPORTANT)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "zFWBky1K56md",
"inputHidden": false,
"outputHidden": false,
"outputId": "6a497e1f-f9ea-4175-d2e9-71db96fe5723"
},
"outputs": [],
"source": [
"!wget -q https://stash.responsibly.ai/4-fairness/activity/patients_data.csv -O patients_data.csv\n",
"\n",
"%pip install -qqq git+https://github.com/ResponsiblyAI/railib.git"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "Mjl_wbZN56me",
"inputHidden": false,
"outputHidden": false
},
"outputs": [],
"source": [
"import pandas as pd\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns\n",
"from sklearn.model_selection import train_test_split\n",
"from railib.fairness.second import (data_statistics, column_distribution,\n",
" preprocessing, predict_risk_score,\n",
" plot_x_vs_y)\n",
"\n",
"sns.set()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Eu6kOmc556mf"
},
"source": [
"## Dataset\n",
"\n",
"The information here is just for reference and it is copied from the pre-calss task. No need to dive into that as a team, because the Data Science students worked with this dataset at home.\n",
"\n",
"### Meaning of column suffixes (endings)\n",
"* _t: indicates this is a time dependent variable from year t (e.g. t = 2020)\n",
"* _tm1: indicates this is a time dependent variable from year t minus 1 (t-1) (e.g. if t = 2021 then t - 1 = 2020).\n",
"\n",
"### Variables categories\n",
"\n",
"#### Can be used as target variables (because they are \"at time t\" ):\n",
"\n",
"* Outcomes at time t: \"outcomes\" for a given calendar year (t): cost, health, program enrollment, and the commercial risk score. \n",
"\n",
"In particular, we have:\n",
"\n",
"| Column Name | Description | Note |\n",
"|--------------------|--------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------|\n",
"| cost_t | Total medical expenditures, rounded to the nearest 100 | The actual target variable used to train the model and produce the risk score |\n",
"| risk_score_t | Commercial algorithmic risk score prediction for cost in year t, formed using data from year t-1 | The actual risk score used by the healthcare insurance company |\n",
"| program_enrolled_t | Indicator for whether patient-year was enrolled in program | |\n",
"| illnesses_sum_t | Total number of active illnesses | |\n",
"\n",
"#### Can be used as \"predictors\" (features): \n",
"\n",
"* **Demographic**: e.g `gender`, `race`, `age`.\n",
"* **Comorbidity variables** at time t-1: indicators for specific illnesses that were active in the previous year. <br> E.g `liver_elixhauser_tm1` which is an indicator for liver disease.\n",
"* **Cost variables** at time t-1: Costs claimed from the patients' insurance payer over the previous year. <br> E.g `cost_laboratory_tm1` which is the total cost for laboratory tests.\n",
"* **Biomarker\\medication** variables at time t-1: indicators capturing normal or abnormal values (or missingness) of biomarkers or relevant medications, over the previous year. <br> E.g `ghba1c_min-low_tm1` which is an indicator for low (< 4) minimum GHbA1c test result.\n",
"\n",
"An indicator is a binary variable: 1 stands for 'True' or 'Has the condition', 0 stands for 'False' or 'Doesn't Have'.\n",
"\n",
"For a detailed description of the dataset and the columns see [this document](https://docs.google.com/document/d/1OFGh7Hkqo8FjcPfBGql7mr8to9Il16XwvLdJkXF1IeU/edit?usp=sharing), but you don't need it.\n",
"\n",
"The EHR (Electronic Health Record) dataset contains 48,784 rows (patients) and 160 columns/variables. "
]
},
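{
"cell_type": "markdown",
"metadata": {},
"source": [
"The suffix convention described above can be applied programmatically. The following is a minimal sketch, not part of `railib`: the helper name `split_by_suffix` and the short sample list of column names are assumptions made for illustration.\n",
"\n",
"```python\n",
"def split_by_suffix(columns):\n",
"    # Group column names by their time suffix.\n",
"    groups = {'t': [], 'tm1': [], 'other': []}\n",
"    for name in columns:\n",
"        if name.endswith('_tm1'):\n",
"            groups['tm1'].append(name)    # year t-1: candidate predictors\n",
"        elif name.endswith('_t'):\n",
"            groups['t'].append(name)      # year t: candidate targets\n",
"        else:\n",
"            groups['other'].append(name)  # no time suffix\n",
"    return groups\n",
"\n",
"\n",
"sample_columns = ['cost_t', 'risk_score_t', 'cost_laboratory_tm1',\n",
"                  'liver_elixhauser_tm1', 'race', 'age']\n",
"groups = split_by_suffix(sample_columns)\n",
"print(groups['t'])\n",
"print(groups['tm1'])\n",
"print(groups['other'])\n",
"```\n",
"\n",
"Once the data is loaded, the same helper could be called on `data.columns` to separate candidate targets from candidate predictors."
]
},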
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 317
},
"id": "WGuKJRdy56mg",
"outputId": "4b7df329-2223-43be-b429-8387830f14a8"
},
"outputs": [],
"source": [
"data = pd.read_csv('patients_data.csv')\n",
"data.head(5)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "QrQSZSpP56mg"
},
"outputs": [],
"source": [
"# Outcome variables at time T (see above)\n",
"outcomes_t = ['illnesses_sum_t', 'cost_t', 'cost_avoidable_t', 'bps_mean_t', 'ghba1c_mean_t', 'hct_mean_t', 'cre_mean_t', 'ldl_mean_t']\n",
"\n",
"# Demographic variables at time T-1\n",
"demographic_tm1 = ['gender', 'race', 'age']\n",
"\n",
"# illness condition variables at time T-1\n",
"comorbidities_tm1 = ['illnesses_sum_tm1', 'alcohol_elixhauser_tm1', 'anemia_elixhauser_tm1', 'arrhythmia_elixhauser_tm1', 'arthritis_elixhauser_tm1', 'bloodlossanemia_elixhauser_tm1', 'coagulopathy_elixhauser_tm1', 'compdiabetes_elixhauser_tm1', 'depression_elixhauser_tm1', 'drugabuse_elixhauser_tm1', 'electrolytes_elixhauser_tm1', 'hypertension_elixhauser_tm1', 'hypothyroid_elixhauser_tm1', 'liver_elixhauser_tm1', 'neurodegen_elixhauser_tm1', 'obesity_elixhauser_tm1', 'paralysis_elixhauser_tm1', 'psychosis_elixhauser_tm1', 'pulmcirc_elixhauser_tm1', 'pvd_elixhauser_tm1', 'renal_elixhauser_tm1', 'uncompdiabetes_elixhauser_tm1', 'valvulardz_elixhauser_tm1', 'wtloss_elixhauser_tm1', 'cerebrovasculardz_romano_tm1', 'chf_romano_tm1', 'dementia_romano_tm1', 'hemiplegia_romano_tm1', 'hivaids_romano_tm1', 'metastatic_romano_tm1', 'myocardialinfarct_romano_tm1', 'pulmonarydz_romano_tm1', 'tumor_romano_tm1', 'ulcer_romano_tm1']\n",
"\n",
"# Cost variables at time T-1\n",
"costs_tm1 = ['cost_dialysis_tm1', 'cost_emergency_tm1', 'cost_home_health_tm1', 'cost_ip_medical_tm1', 'cost_ip_surgical_tm1', 'cost_laboratory_tm1', 'cost_op_primary_care_tm1', 'cost_op_specialists_tm1', 'cost_op_surgery_tm1', 'cost_other_tm1', 'cost_pharmacy_tm1', 'cost_physical_therapy_tm1', 'cost_radiology_tm1']\n",
"\n",
"# Biomarkers (e.g., blood test result) varbles at time T-1\n",
"biomarkers_tm1 = ['lasix_dose_count_tm1', 'lasix_min_daily_dose_tm1', 'lasix_mean_daily_dose_tm1', 'lasix_max_daily_dose_tm1', 'cre_tests_tm1', 'crp_tests_tm1', 'esr_tests_tm1', 'ghba1c_tests_tm1', 'hct_tests_tm1', 'ldl_tests_tm1', 'nt_bnp_tests_tm1', 'sodium_tests_tm1', 'trig_tests_tm1', 'cre_min-low_tm1', 'cre_min-high_tm1', 'cre_min-normal_tm1', 'cre_mean-low_tm1', 'cre_mean-high_tm1', 'cre_mean-normal_tm1', 'cre_max-low_tm1', 'cre_max-high_tm1', 'cre_max-normal_tm1', 'crp_min-low_tm1', 'crp_min-high_tm1', 'crp_min-normal_tm1', 'crp_mean-low_tm1', 'crp_mean-high_tm1', 'crp_mean-normal_tm1', 'crp_max-low_tm1', 'crp_max-high_tm1', 'crp_max-normal_tm1', 'esr_min-low_tm1', 'esr_min-high_tm1', 'esr_min-normal_tm1', 'esr_mean-low_tm1', 'esr_mean-high_tm1', 'esr_mean-normal_tm1', 'esr_max-low_tm1', 'esr_max-high_tm1', 'esr_max-normal_tm1', 'ghba1c_min-low_tm1', 'ghba1c_min-high_tm1', 'ghba1c_min-normal_tm1', 'ghba1c_mean-low_tm1', 'ghba1c_mean-high_tm1', 'ghba1c_mean-normal_tm1', 'ghba1c_max-low_tm1', 'ghba1c_max-high_tm1', 'ghba1c_max-normal_tm1', 'hct_min-low_tm1', 'hct_min-high_tm1', 'hct_min-normal_tm1', 'hct_mean-low_tm1', 'hct_mean-high_tm1', 'hct_mean-normal_tm1', 'hct_max-low_tm1', 'hct_max-high_tm1', 'hct_max-normal_tm1', 'ldl_min-low_tm1', 'ldl_min-high_tm1', 'ldl_min-normal_tm1', 'ldl-mean-low_tm1', 'ldl-mean-high_tm1', 'ldl-mean-normal_tm1', 'ldl_max-low_tm1', 'ldl_max-high_tm1', 'ldl_max-normal_tm1', 'nt_bnp_min-low_tm1', 'nt_bnp_min-high_tm1', 'nt_bnp_min-normal_tm1', 'nt_bnp_mean-low_tm1', 'nt_bnp_mean-high_tm1', 'nt_bnp_mean-normal_tm1', 'nt_bnp_max-low_tm1', 'nt_bnp_max-high_tm1', 'nt_bnp_max-normal_tm1', 'sodium_min-low_tm1', 'sodium_min-high_tm1', 'sodium_min-normal_tm1', 'sodium_mean-low_tm1', 'sodium_mean-high_tm1', 'sodium_mean-normal_tm1', 'sodium_max-low_tm1', 'sodium_max-high_tm1', 'sodium_max-normal_tm1', 'trig_min-low_tm1', 'trig_min-high_tm1', 'trig_min-normal_tm1', 'trig_mean-low_tm1', 'trig_mean-high_tm1', 'trig_mean-normal_tm1', 
'trig_max-low_tm1', 'trig_max-high_tm1', 'trig_max-normal_tm1']"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "OIP_uxla56mh"
},
"source": [
"## 3. Data Analysis (USEFUL FUNCTIONS, WE'LL USE THEM IN SECTION 5)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "I8opBRYI56mh",
"outputId": "d9a6b5f3-7dc8-4fca-8307-9d7d9bf9b5c4"
},
"outputs": [],
"source": [
"help(data_statistics)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "FLhxuYAB56mh",
"outputId": "5247f18d-7589-43dc-cd40-203665f01428"
},
"outputs": [],
"source": [
"help(column_distribution)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 300
},
"id": "lYZcJ46x56mh",
"outputId": "8913d921-7744-49c5-b61e-aaec2f2dfabd"
},
"outputs": [],
"source": [
"# describing the outcomes data\n",
"\n",
"data_statistics(data[outcomes_t])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 285
},
"id": "8Ycv7nHg56mi",
"inputHidden": false,
"outputHidden": false,
"outputId": "a41f6afb-da28-4ef8-f20a-00df02fe8ba8"
},
"outputs": [],
"source": [
"# distribution of the costs\n",
"\n",
"column_distribution(data, 'cost_t', group_column=None, is_count=False)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 285
},
"id": "CD674vZ256mi",
"outputId": "307344ba-172e-41fa-f1be-be9b16beeafc"
},
"outputs": [],
"source": [
"# distribution of the costs for blacks and whites\n",
"\n",
"column_distribution(data, 'cost_t', group_column='race', is_count=False)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "csBFZSEo56mi"
},
"source": [
"## 4. Model Training (THE TEXT IS IMPORTANT, BUT NOT THE CODE)\n",
"\n",
"The model is **Linear Regression**.\n",
"\n",
"### Features (inputs)\n",
"1. Demographics\n",
"2. illnesses (Comorbidity variables at time t-1)\n",
"3. Detailed healthcare costs (Cost variables at time t-1)\n",
"4. Biomarkers (Biomarker/medication variables at time t-1) \n",
"\n",
"Note: The **race** variable is not included as a feature.\n",
"\n",
"### Objective (output)\n",
"To allocate resources in a cost-efficient manner, the model predicts future expenditures costs (costs at year t: `cost_t`).\n",
"\n",
"### The Risk Score\n",
"The risk score (`risk_score`) of each patient is based on the percentile he belongs to according to the model predictions. Percentiles are important for the company because the assingment to the “high-risk care management” program is based on them.\n",
"\n",
"Through cost-benefit calculations, it was decided that patients above the 97th percentile are automatically identified for enrollment in the program. Those above the 55th percentile are referred to a physician."
]
},
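{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the percentile rule concrete, here is a minimal sketch on made-up predicted costs. It assumes the risk score is a simple percentile rank of the predictions and that the referral band excludes patients who are already auto-enrolled; neither detail is confirmed by this notebook, and `predict_risk_score` may compute the score differently.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Hypothetical predicted costs for ten patients (made-up numbers).\n",
"predicted_cost = pd.Series([1200, 300, 45000, 800, 150,\n",
"                            9800, 2500, 60, 700, 5200])\n",
"\n",
"# Percentile rank of each prediction, on a 0-100 scale.\n",
"risk_score = predicted_cost.rank(pct=True) * 100\n",
"\n",
"# Thresholds taken from the text: above the 97th percentile means\n",
"# automatic enrollment, above the 55th means referral to a physician.\n",
"auto_enroll = risk_score > 97\n",
"refer = (risk_score > 55) & ~auto_enroll\n",
"\n",
"summary = pd.DataFrame({'predicted_cost': predicted_cost,\n",
"                        'risk_score': risk_score,\n",
"                        'auto_enroll': auto_enroll,\n",
"                        'refer': refer})\n",
"print(summary.sort_values('risk_score', ascending=False))\n",
"```\n",
"\n",
"With only ten patients, the single highest prediction lands above the 97th percentile; on the full dataset roughly the top 3% of patients would."
]
},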
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "FqzUl1xg56mj",
"outputId": "c0d3361c-3372-4604-c781-1acce49bdaa4"
},
"outputs": [],
"source": [
"help(preprocessing)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "BCEcAvMX56mj",
"outputId": "0ef8996e-334c-446b-8c31-d75649eab819"
},
"outputs": [],
"source": [
"help(predict_risk_score)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "29dzQIMB56mj",
"outputId": "6f942835-fe72-4393-ba47-9403e2f08a79"
},
"outputs": [],
"source": [
"# splitting the data into train and test\n",
"\n",
"train, test = train_test_split(data, test_size=0.4, random_state=42)\n",
"train, test = train.copy(), test.copy()\n",
"\n",
"# defining the features (X_columns) and the target variable (y_column)\n",
"# IMPORTANT: we don't include the race variable to avoid discrimination.\n",
"\n",
"X_columns = demographic_tm1 + comorbidities_tm1 + costs_tm1 + biomarkers_tm1\n",
"X_columns.remove('race')\n",
"\n",
"y_column = 'cost_t'\n",
"\n",
"# fit ,predict and evaluate\n",
"\n",
"train_risk_scores, test_risk_scores = predict_risk_score(train, test, X_columns, y_column)\n",
"train['risk_score'] = train_risk_scores\n",
"test['risk_score'] = test_risk_scores"
]
},
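{
"cell_type": "markdown",
"metadata": {},
"source": [
"The cell above delegates model fitting to `predict_risk_score`, whose internals are not shown in this notebook. As a rough illustration of the fit, predict, and evaluate steps for a linear regression, here is a sketch on synthetic data that uses scikit-learn directly. The features, coefficients, and noise level are made up, and this is not the `railib` implementation.\n",
"\n",
"```python\n",
"import numpy as np\n",
"from sklearn.linear_model import LinearRegression\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"# Synthetic data: three made-up features and a noisy linear target.\n",
"rng = np.random.default_rng(0)\n",
"X = rng.normal(size=(200, 3))\n",
"y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)\n",
"\n",
"# Same split proportions as in the notebook (60% train, 40% test).\n",
"X_train, X_test, y_train, y_test = train_test_split(\n",
"    X, y, test_size=0.4, random_state=42)\n",
"\n",
"# Fit on the training set, then predict and evaluate on the test set.\n",
"model = LinearRegression().fit(X_train, y_train)\n",
"test_predictions = model.predict(X_test)\n",
"r2 = model.score(X_test, y_test)  # coefficient of determination (R^2)\n",
"print(f'R^2 on the test set: {r2:.3f}')\n",
"```\n",
"\n",
"Evaluating on the held-out test set, rather than on the training set, gives a less optimistic estimate of how well the predictions generalize."
]
},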
{
"cell_type": "markdown",
"metadata": {
"id": "TkNxDaDc56mk"
},
"source": [
"## 5. Risk Score Analysis (THE REALLY IMPORTANT PART)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "kXvMY74U56mk",
"outputId": "2cc5bae6-d325-423f-b7eb-12433680b70c"
},
"outputs": [],
"source": [
"help(plot_x_vs_y)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 394
},
"id": "I4m24WaT56ml",
"outputId": "52db8fc2-a808-454a-fe29-5688116103eb"
},
"outputs": [],
"source": [
"plot_x_vs_y(test, 'risk_score', 'cost_t', group_column=None)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 394
},
"id": "QBD60lyZ56ml",
"inputHidden": false,
"outputHidden": false,
"outputId": "5ac862f2-5dc9-4c58-8d71-ce677b775d46"
},
"outputs": [],
"source": [
"plot_x_vs_y(test,'risk_score', 'cost_t', group_column='race')"
]
},
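{
"cell_type": "markdown",
"metadata": {},
"source": [
"The internals of `plot_x_vs_y` are not shown here. One plausible way to read such a plot numerically is to bin the risk score and compare the mean outcome per bin across groups. The sketch below does this on a made-up toy table; the column values, the two bins, and the group labels `black` and `white` are assumptions for illustration.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Made-up toy data standing in for the notebook's `test` DataFrame.\n",
"toy = pd.DataFrame({\n",
"    'risk_score': [10, 20, 60, 70, 15, 25, 65, 75],\n",
"    'cost_t': [100, 200, 900, 1100, 150, 250, 800, 1200],\n",
"    'race': ['black'] * 4 + ['white'] * 4,\n",
"})\n",
"\n",
"# Bin the risk score, then average the outcome per bin and per group.\n",
"toy['risk_bin'] = pd.cut(toy['risk_score'], bins=[0, 50, 100],\n",
"                         labels=['0-50', '50-100'])\n",
"mean_by_bin = (toy.groupby(['risk_bin', 'race'], observed=True)['cost_t']\n",
"                  .mean()\n",
"                  .unstack('race'))\n",
"print(mean_by_bin)\n",
"```\n",
"\n",
"The same comparison can be repeated with any other outcome column in place of `cost_t`, for example `illnesses_sum_t`."
]
},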
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 6. Appendix\n",
"\n",
"In case you would like to look on the histogram of various columns."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"column_distribution(test, 'risk_score', group_column=None, cumulative=False,\n",
" is_count=True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"column_distribution(test, 'risk_score', group_column='race', cumulative=False)"
]
}
],
"metadata": {
"colab": {
"provenance": []
},
"kernel_info": {
"name": "python3"
},
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.0"
},
"nteract": {
"version": "0.12.3"
}
},
"nbformat": 4,
"nbformat_minor": 1
}