qbeer/hw7_raw.ipynb

## hw7_raw.ipynb
{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "name": "hw7_raw.ipynb",
      "provenance": [],
      "authorship_tag": "ABX9TyODo3dr1C/WLwPE029zS+pi",
      "include_colab_link": true
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    },
    "language_info": {
      "name": "python"
    }
  },
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "view-in-github",
        "colab_type": "text"
      },
      "source": [
        "<a href=\"https://colab.research.google.com/gist/qbeer/545fa2d88e7541f81a137f6d0363e6c9/hw7_raw.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "62-iRk9ZUEHv"
      },
      "source": [
        "## 1.) Linear SVC in case of linear separation\n",
        "\n",
        "* load the Iris dataset (available in the sklearn API)\n",
        "* scale the data and plot petal length vs petal width in a scatterplot colored by the target\n",
        "* train an SVC model with a linear kernel and otherwise default settings, once with C=1 and once with C=1000\n",
        "* visualize the model's decision boundary and the margins based on the coefficients learned by the model\n",
        "* interpret the results: what is the role of the C hyperparameter?\n",
        "\n",
        "## 2.) Linear SVC but non-linear separation\n",
        "\n",
        "* create a dataset with the following: X, y = sklearn.datasets.make_moons(noise=0.1, random_state=0)\n",
        "* perform the same steps as in the previous exercise, using the linear kernel for the SVC\n",
        "* since a linear SVC cannot separate classes along a non-linear boundary, you will need a workaround, for example adding polynomial features (3rd order is a good choice)\n",
        "* describe in your own words, in a few sentences, how the support vector machine works\n",
        "\n",
        "## 3.) Load the dataset from 2 weeks ago and build/evaluate the SVC with default settings\n",
        "\n",
        "Reminder:\n",
        "\n",
        "* you need to build a classifier that predicts the probability that a sample comes from a person with cancer (i.e. whether the tumor type is normal or not), based on the measured protein levels\n",
        "\n",
        "* train the SVM classifier (SVC in the sklearn API) on every second sample (use every second line, not the first 50% of the data!)\n",
        "\n",
        "* generate predictions for the samples that were not used during training\n",
        "\n",
        "To-do now:\n",
        "\n",
        "* build a default SVC, but set it to predict probabilities\n",
        "* plot the ROC curve and calculate the confusion matrix for the predictions\n",
        "* do the same for the CancerSEEK predictions and compare your model's performance to that of CancerSEEK (as a reference, plot it on the same figure)\n",
        "* how good is the performance of the new model?\n",
        "\n",
        "## 4.) Scale data and try different kernels\n",
        "\n",
        "* scale your data before applying the SVC model\n",
        "* plot the ROC curve and calculate the confusion matrix for the predictions\n",
        "* does your model perform better or worse after scaling?\n",
        "* try out other kernels (linear, poly) and evaluate the performance of the model the same way\n",
        "\n",
        "## 5.) Split the data randomly into 3 parts: 70% train, 15% validation, 15% test data and tune hyperparameters\n",
        "\n",
        "* prepare data as described in the title, then scale all inputs based on the training set\n",
        "* select your best performing SVC model from the previous exercise\n",
        "* check the behaviour of the SVC by modifying at least 3 of its hyperparameters (C, gamma, ...) and plot the AUC value vs the modified parameter (a log scale may be better for visualization)\n",
        "* create at least 2 plots that show the train, validation and test accuracy as a function of a given hyperparameter's value. Is it a good idea to rely on validation data when tuning hyperparameters in this case?\n",
        "* select the best settings, train the SVC and evaluate it against the CancerSEEK results with the ROC curve and the confusion matrix (compare your results with CancerSEEK's results on the same dataset split)"
      ]
    }
  ]
}