@qbeer
Created October 4, 2021 10:23
HW5_raw.ipynb
{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"name": "HW5_raw.ipynb",
"provenance": [],
"collapsed_sections": [],
"authorship_tag": "ABX9TyNDWdNEa3X4meGwV4cb9zN1",
"include_colab_link": true
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
},
"language_info": {
"name": "python"
}
},
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "view-in-github",
"colab_type": "text"
},
"source": [
"<a href=\"https://colab.research.google.com/gist/qbeer/c7630c11339b659843e32e39eb732e42/hw5_raw.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "b17I_WoP0eSE"
},
"source": [
"## Logistic regression\n",
"\n",
"### 1. Download data from https://science.sciencemag.org/content/359/6378/926 (supplementary materials).\n",
"\n",
"* read the abstract of the article to get familiar with data origin\n",
"* open the data in excel and get familiar with its content\n",
"* load the protein level data (you need to figure out which one is that) as a pandas dataframe\n",
"* handle missing values and convert features to numeric values when it is needed\n",
"* get rid of the unnecessary (which does not encode protein levels or the tumor type) columns and the CancerSEEK results\n",
"\n",
"### 2. Predict if a sample is cancerous or not\n",
"\n",
"* your need to build a classifier that predicts the probability of a sample coming from a cancerous (tumor type is normal or not) person based on the measured protein levels\n",
"* train a logistic regression (sklearn API) on every second sample (not first 50% of the data (!), use every second line)\n",
"* generate prediction for the samples that were not used during the training\n",
"\n",
"### 3. Comparision to CancerSEEK\n",
"* plot the ROC curve and calculate the confusion matrix for the predictions\n",
"* do the same for the CancerSEEK predictions\n",
"* compare your model's performance to CancerSEEK performance\n",
"\n",
"### 4. Hepatocellular carcinoma\n",
"\n",
"* fit a logistic regression (using statsmodels API this time) to predict if a sample has Hepatocellular carcinoma (liver cancer) or not. You need to keep only the liver and the normal samples for this exercise! For fitting use only the first 25 features and all the rows (which are liver or normal)\n",
"* select the 5 best predictor based on P values.\n",
"* Write down the most important features (based on P value) and compare them to the tumor markers that you find on wikipeida https://en.wikipedia.org/wiki/Hepatocellular_carcinoma or other sources!\n",
"\n",
"### 5. Multiclass classification\n",
"\n",
"* Again, using every second datapoint train a logistic regression (sklearn API) to predict the tumor type. It is a multiclass classification problem.\n",
"* Generate prediction for the rest of the dataset and show the confution matrix for the predictions!\n",
"* Plot the ROC curves for the different cancer types on the same plot!\n",
"* Intepret your results. Which cancer type can be predicted the most reliably?\n",
"\n",
"### Hints:\n",
"\n",
"* On total you can get 10 points for fully completing all tasks.\n",
"* Decorate your notebook with, questions, explanation etc, make it self contained and understandable!\n",
"* Comments you code when necessary\n",
"* Write functions for repetitive tasks!\n",
"* Use the pandas package for data loading and handling\n",
"* Use matplotlib and seaborn for plotting or bokeh and plotly for interactive investigation\n",
"* Use the scikit learn package for almost everything\n",
"* Use for loops only if it is really necessary!\n",
"* Code sharing is not allowed between student! Sharing code will result in zero points.\n",
"* If you use code found on web, it is OK, but, make its source clear!"
]
},
{
"cell_type": "code",
"metadata": {
"id": "inzU96ff1CV1"
},
"source": [
""
],
"execution_count": null,
"outputs": []
}
]
}
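
The "every second sample" split in tasks 2 and 5 is easy to confuse with a first-half/second-half split. Below is a minimal sketch of the alternating split, followed by the ROC curve and confusion matrix from task 3. It assumes a synthetic dataset from scikit-learn's make_classification as a stand-in; it does not use the CancerSEEK data, and the sample count, feature count, and 0.5 decision threshold are illustrative choices, not part of the assignment.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score, roc_curve

# Synthetic stand-in for the protein-level table (NOT the CancerSEEK data).
X, y = make_classification(
    n_samples=400, n_features=10, n_informative=5, random_state=0
)

# Alternating split: rows 0, 2, 4, ... for training, rows 1, 3, 5, ... for testing.
X_train, y_train = X[::2], y[::2]
X_test, y_test = X[1::2], y[1::2]

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Predicted probability of the positive class for the held-out rows.
proba = model.predict_proba(X_test)[:, 1]

# ROC curve points and area under the curve.
fpr, tpr, thresholds = roc_curve(y_test, proba)
auc = roc_auc_score(y_test, proba)

# Confusion matrix at an (assumed) 0.5 probability threshold.
cm = confusion_matrix(y_test, (proba >= 0.5).astype(int))

print(f"AUC: {auc:.3f}")
print(cm)
```

With a pandas dataframe, the same split can be written as df.iloc[::2] and df.iloc[1::2]; fpr and tpr can be passed directly to matplotlib's plot function to draw the curve.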