shlomihod/pre-class.ipynb

## pre-class.ipynb
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "7xPuF47f-ECM"
   },
   "source": [
    "![banner](https://learn.responsibly.ai/assets/banner.jpg)\n",
    "\n",
    "# Class 3 - Discrimination & Fairness: Pre-Class Task\n",
    "\n",
    "https://learn.responsibly.ai"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "5kaCEPTV-mmy"
   },
   "source": [
    "In the third class, we will dive into the challenge of fairness of machine learning models.\n",
    "\n",
    "In this pre-class task, you will develop a simple classification model that takes a short textual biography of a person and returns their occupation. The model doesn't need to be fancy; aim for a \"baseline\" model with basic preprocessing that achieves a reasonable accuracy of at least 80% on the test dataset. For example, Logistic Regression on Bag of Words should be sufficient, but feel free to explore more powerful model families. We recommend keeping it simple and using the `sklearn` package. Finally, you will store your model so that you can load it into the notebook in class.\n",
    "\n",
    "Please go through the whole notebook before you start coding. You will plan your work better once you have an overview of the task.\n",
    "\n",
    "If you have any questions, please post them in the `#ds` channel in Discord or join the office hours.\n",
    "\n",
    "Let's start!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "GDLFwEBGr3qD"
   },
   "source": [
    "## Setup"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash\n",
    "\n",
    "wget --quiet https://stash.responsibly.ai/3-fairness/activity/data.zip"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "RnBxHABh_i3Q"
   },
   "outputs": [],
   "source": [
    "!unzip -oq data.zip"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "MoDJSDVMgDRZ"
   },
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "from IPython.display import display, Markdown"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "vO1b2cRhr_jF"
   },
   "source": [
    "## Dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "57nnfiuzgEyu"
   },
   "outputs": [],
   "source": [
    "train_df = pd.read_csv('./data/train.csv')\n",
    "test_df = pd.read_csv('./data/test.csv')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "nr0tl8qwse6a"
   },
   "source": [
    "The training dataset and the test dataset each consist of multiple rows, one per person, and two columns:\n",
    "1. `bio` - The person's biography as text (i.e., `string`). This is the input to the model.\n",
    "2. `occupation` - The person's occupation as text (i.e., `string`). This is the model's output."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "s6wWJuLcr-Qo"
   },
   "outputs": [],
   "source": [
    "train_df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "OAVzyhBC9plG"
   },
   "outputs": [],
   "source": [
    "test_df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "1uPt7Rv6sho4"
   },
   "source": [
    "We used a 75%-25% split between the training and test datasets:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "HaJpWBD_jQpE"
   },
   "outputs": [],
   "source": [
    "print(f'# train: {len(train_df)}')\n",
    "print(f'# test: {len(test_df)}')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "vlBUnwJcsvI4"
   },
   "source": [
    "There are 28 occupations:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "Vh6O8fYksvZc"
   },
   "outputs": [],
   "source": [
    "sorted(train_df['occupation'].unique())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "w4pVCAm0so-h"
   },
   "source": [
    "Each run of the next cell samples 10 random rows and shows their occupations and biographies:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "Kysov0YvgfyP"
   },
   "outputs": [],
   "source": [
    "for _, row in train_df.sample(10).iterrows():\n",
    "    display(Markdown('### Ground-Truth Occupation: ' + row['occupation']))\n",
    "    display(Markdown(row['bio']))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "rwAD1w0dtg7o"
   },
   "source": [
    "## Your turn!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "EQ4cIjub61Hm"
   },
   "source": [
    "### Training"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "i_nkiZFO61TL"
   },
   "source": [
    "Train a model on the training dataset and ensure that your model achieves at least 80% accuracy on the test dataset. Please use the variable name `model` to hold your model object (e.g., sklearn's LogisticRegression, PyTorch model, ...)."
   ]
  },
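  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a rough illustration of the kind of baseline described above, here is a minimal sketch of a bag-of-words + logistic regression pipeline in `sklearn`. The toy bios, the labels, and the names `toy_model`, `train_bios`, and `test_bios` are made up for this example and are not part of the dataset; the hyperparameters are untuned assumptions, not a recommended configuration.\n",
    "\n",
    "```python\n",
    "from sklearn.feature_extraction.text import CountVectorizer\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "from sklearn.metrics import accuracy_score\n",
    "from sklearn.pipeline import make_pipeline\n",
    "\n",
    "# Hypothetical toy data, standing in for train_df / test_df\n",
    "train_bios = [\n",
    "    'She teaches math at a public school',\n",
    "    'He teaches history to children',\n",
    "    'She paints portraits in oil',\n",
    "    'He paints landscapes and murals',\n",
    "]\n",
    "train_occupations = ['teacher', 'teacher', 'painter', 'painter']\n",
    "test_bios = ['She teaches children', 'He paints murals']\n",
    "test_occupations = ['teacher', 'painter']\n",
    "\n",
    "# Bag of words (token counts) followed by a linear classifier\n",
    "toy_model = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))\n",
    "toy_model.fit(train_bios, train_occupations)\n",
    "\n",
    "# Fraction of test bios whose occupation is predicted correctly\n",
    "accuracy = accuracy_score(test_occupations, toy_model.predict(test_bios))\n",
    "print(f'accuracy: {accuracy:.2f}')\n",
    "```\n",
    "\n",
    "With the notebook's data, the same pipeline would be fit on `train_df['bio']` and `train_df['occupation']` and stored in `model`; whether it reaches 80% on the test set is something to check, not something this sketch guarantees."
   ]
  },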
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "tchtyBFBtPjY"
   },
   "outputs": [],
   "source": [
    "# training code"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "56DN8mXy65Ri"
   },
   "source": [
    "### Evaluation I"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "MWfga12HuDNW"
   },
   "source": [
    "Show the accuracy of the model on the test dataset:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "rPN0Q7ApuKNX"
   },
   "outputs": [],
   "source": [
    "# evaluation code 1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "04RWxwgb665v"
   },
   "source": [
    "### `predict` function"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "9WiQVDBCuRuZ"
   },
   "source": [
    "Implement the function `predict`, which takes a sequence of bios (a list or pandas Series of strings) and returns the predicted occupation for each one. You will need to use this function in class."
   ]
  },
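  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One possible shape for such a function is sketched below. It assumes that `model` is a fitted `sklearn` pipeline that accepts raw strings; the two-bio model fitted here is only a toy stand-in so that the example is self-contained, and returning a plain Python list is a design choice of this sketch rather than a requirement of the task.\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "from sklearn.feature_extraction.text import CountVectorizer\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "from sklearn.pipeline import make_pipeline\n",
    "\n",
    "# Toy stand-in for the trained model (hypothetical data)\n",
    "model = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))\n",
    "model.fit(['she teaches math', 'he paints portraits'], ['teacher', 'painter'])\n",
    "\n",
    "\n",
    "def predict(bios):\n",
    "    # Accept a list or a pandas Series of strings and\n",
    "    # return a list with one predicted occupation per bio\n",
    "    return model.predict(list(bios)).tolist()\n",
    "\n",
    "\n",
    "preds_from_list = predict(['she teaches physics'])\n",
    "preds_from_series = predict(pd.Series(['he paints murals']))\n",
    "print(preds_from_list, preds_from_series)\n",
    "```\n",
    "\n",
    "Converting the input with `list(bios)` lets the same function handle both input types without depending on the index of a Series."
   ]
  },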
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "Ow1Px6gtuh14"
   },
   "outputs": [],
   "source": [
    "def predict(bios):\n",
    "    pass"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "G5Kd4XUavGvu"
   },
   "source": [
    "Demonstrate that your `predict` function works by **showing the inputs and the predicted outputs** for a few examples:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "meXGMwtGvJqm"
   },
   "outputs": [],
   "source": [
    "# using predict function"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Evaluation II"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Show the accuracy-per-class on the test dataset:"
   ]
  },
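  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A small sketch of one way to compute this with `pandas`. It assumes that per-class accuracy means the fraction of bios of each true occupation that the model classifies correctly; the labels below are made up, and `y_true` / `y_pred` are hypothetical names standing in for the test labels and the model's predictions.\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# Hypothetical true and predicted occupations\n",
    "y_true = pd.Series(['nurse', 'nurse', 'surgeon', 'surgeon', 'teacher', 'teacher'])\n",
    "y_pred = pd.Series(['nurse', 'surgeon', 'surgeon', 'surgeon', 'teacher', 'nurse'])\n",
    "\n",
    "# Mark each prediction as correct or not, then average within each true occupation\n",
    "per_class_accuracy = (y_true == y_pred).groupby(y_true).mean()\n",
    "print(per_class_accuracy)\n",
    "```\n",
    "\n",
    "Both Series must share the same index for the comparison and the grouping to line up row by row."
   ]
  },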
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# evaluation code 2"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Evaluation III"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, show the **Acceptance Rate**, **False Positive Rate**, and **False Negative Rate** of each occupation, treating that occupation as the positive class (P) and all other occupations as the negative class (N).\n",
    "\n",
    "Bonus: use Seaborn's `PairGrid` with a dot plot to visualize the rates ([demo](https://seaborn.pydata.org/examples/pairgrid_dotplot.html)).\n",
    "\n",
    "\n",
    "#### Confusion Matrix\n",
    "\n",
    "Actual class (rows) / Predicted class (columns) | P | N\n",
    "-----------------------------|---|---\n",
    "P | **TP** | FN\n",
    "N | FP | **TN**\n",
    "\n",
    "\n",
    "#### Metric Definitions\n",
    "\n",
    "\n",
    "**Acceptance Rate (AR)**\n",
    "\n",
    "${\\displaystyle \\mathrm {AR} = {\\frac {\\mathrm {TP} +\\mathrm {FP} }{\\mathrm {TP} +\\mathrm {FN} +\\mathrm {FP} +\\mathrm {TN} }}}$\n",
    "\n",
    "**False Negative Rate (FNR)**\n",
    "\n",
    "${\\displaystyle \\mathrm {FNR} = {\\frac {\\mathrm {FN} }{\\mathrm {FN} +\\mathrm {TP} }}}$\n",
    "\n",
    "**False Positive Rate (FPR)**\n",
    "\n",
    "${\\displaystyle \\mathrm {FPR} = {\\frac {\\mathrm {FP} }{\\mathrm {FP} +\\mathrm {TN} }}}$"
   ]
  },
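  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the definitions concrete, here is a hedged sketch that derives TP, FP, FN, and TN for every class from a multiclass confusion matrix, treating each class in turn as the positive class and all others as negative. The labels are made up, and the names `y_true`, `y_pred`, and `rates` are hypothetical.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "from sklearn.metrics import confusion_matrix\n",
    "\n",
    "# Hypothetical true and predicted occupations\n",
    "y_true = ['nurse', 'nurse', 'surgeon', 'surgeon', 'teacher', 'teacher']\n",
    "y_pred = ['nurse', 'surgeon', 'surgeon', 'surgeon', 'teacher', 'nurse']\n",
    "\n",
    "labels = sorted(set(y_true))\n",
    "# Rows are actual classes, columns are predicted classes\n",
    "cm = confusion_matrix(y_true, y_pred, labels=labels)\n",
    "\n",
    "tp = np.diag(cm)               # predicted as the class and actually the class\n",
    "fp = cm.sum(axis=0) - tp       # predicted as the class, actually another class\n",
    "fn = cm.sum(axis=1) - tp       # actually the class, predicted as another class\n",
    "tn = cm.sum() - tp - fp - fn   # everything else\n",
    "\n",
    "rates = pd.DataFrame(\n",
    "    {\n",
    "        'AR': (tp + fp) / (tp + fn + fp + tn),\n",
    "        'FNR': fn / (fn + tp),\n",
    "        'FPR': fp / (fp + tn),\n",
    "    },\n",
    "    index=labels,\n",
    ")\n",
    "print(rates)\n",
    "```\n",
    "\n",
    "In this toy example, `surgeon` is predicted for 3 of the 6 bios, so its acceptance rate is 0.5, and both true surgeons are found, so its FNR is 0."
   ]
  },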
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# evaluation code 3"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "nITerUlNwto5"
   },
   "source": [
    "## That's all!\n",
    "\n",
    "If you find a mistake or problem in this notebook, or if something is unclear, please post in the `#ds` channel.\n",
    "\n",
    "**Be prepared to explain this data and the model you've trained to your team.**\n",
    "\n",
    "### Submission\n",
    "\n",
    "1. Save the notebook as a PDF file (in Colab: File > Print).\n",
    "2. Upload it to Gradescope: http://go.responsibly.ai/gradescope"
   ]
  }
 ],
 "metadata": {
  "colab": {
   "collapsed_sections": [],
   "name": "pre-class.ipynb",
   "provenance": [],
   "toc_visible": true
  },
  "kernel_info": {
   "name": "python3"
  },
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.0"
  },
  "nteract": {
   "version": "0.12.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}