bedohazizsolt/HW_03.ipynb

## HW_03.ipynb
{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "name": "lab03.ipynb",
      "provenance": [],
      "collapsed_sections": [],
      "include_colab_link": true
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    },
    "language_info": {
      "name": "python"
    }
  },
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "view-in-github",
        "colab_type": "text"
      },
      "source": [
        "<a href=\"https://colab.research.google.com/gist/bedohazizsolt/623636c1e881c260d520d33c13376907/HW_03.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Jp4J8Pk4B4Os"
      },
      "source": [
        "# Supervised learning introduction, K-Nearest Neighbors (KNN)\n",
        "\n",
        "Your task will be to predict wine quality from physicochemical features with the help of the \n",
        "[Wine Quality Data Set](https://archive.ics.uci.edu/ml/datasets/Wine+Quality). You will have to do it both as a regression and classification task. \n",
        "\n",
        "\n",
        "-------\n",
        "\n",
        "###1. Read data\n",
        "  - Read the provided winequality-red.csv file. \n",
        "  - Check for missing values and that all entries are numerical. Also, check for duplicated entries (rows) and drop them.  \n",
        "  - Use all columns except the last as features and the quality column as target. \n",
        "  - Make 80-20% train test split (use sklearn).\n",
        "  - Prepare a one-hot encoded version of the y_test and y_train values ie. make a six long vector of the 6 quality classes (3-8), with only one non-zero value, e.g. 3->[1,0,0,0,0,0], 4->[0,1,0,0,0,0], 5->[0,0,1,0,0,0] etc. (You can use pandas or sklearn for that.) *You will have to use the one-hot encoded labels in the classification exercise only.*\n",
        "  - Normalize the features by substracting the means and dividing by the standard deviation feature by feature. If you want to be very precise, you should use only the mean and std in the training set for normalization, because generally the test test is not available at training time.\n",
        "\n",
        "----\n",
        "\n",
        "###2. KNN regression\n",
        "- Implement naive K nearest neighbour regression as a function only using python and numpy. The signature of the function should be:\n",
        "```python\n",
        "def knn_regression(x_test, x_train, y_train, k=20):\n",
        "        \"\"\"Return prediction with knn regression.\"\"\"\n",
        "        .\n",
        "        .\n",
        "        .\n",
        "        return y_pred\n",
        "```\n",
        "- Use Euclidean distance as a measure of distance.\n",
        "- Make prediction with k=20 for the test set using the training data.\n",
        "- Plot the true and the predicted values from the test set on a scatterplot.\n",
        "\n",
        "-----\n",
        "\n",
        "### 3. Weighted KNN regression\n",
        "- Modify the knn_regression function by adding a weight to each neighbor that is inversely proportional to the distance.\n",
        "```python\n",
        "def knn_weighted_regression(x_test,x_train,y_train,k=20):\n",
        "    \"\"\"Return prediction with weighted knn regression.\"\"\"\n",
        "    ...\n",
        "    return y_pred\n",
        "```\n",
        "- Make prediction with k=20 for the test set using the training data.\n",
        "- Plot the true and the predicted values from the test set on a scatterplot.\n",
        "\n",
        "-----\n",
        "\n",
        "### 4. KNN classification\n",
        "- Implement the K-nearest neighbors classification algorithm using only pure Python3 and numpy! Use L2 distance to find the neighbors. The prediction for each class should be the number of neighbors supporting the given class divided by k (for example if k is 5 and we have 3 neighbors for class A, 2 for class B and 0 for class C neighbors, then the prediction for class A should be 3/5, for class B 2/5, for class C 0/5). Use the one-hot encoded labels!\n",
        "```python\n",
        "def knn_classifier(X_train, y_train, X_test, k=20):\n",
        "  \"\"\"Return prediction with knn classification.\"\"\"\n",
        "    ...\n",
        "    return y_pred\n",
        "```\n",
        "\n",
        "- Make prediction with k=20 for the test set using the training data.\n",
        "\n",
        "-----\n",
        "\n",
        "### 5. Compare the models\n",
        "- Make a baseline model: this can be the mean value of the training labels for every sample.\n",
        "- Compare the regression and classification models to the baseline: You can do this by rounding the continous predictions of the regression to the nearest integer. Calculate the accuracy (fraction of correctly classified samples) of the models.\n",
        "- Check your KNN implementations by running the sklearn built-in model. \n",
        "You can run it for any model you implented. The predictions should be the same as yours. Some help:\n",
        "  ```python\n",
        "  from sklearn.neighbors import KNeighborsRegressor, KNeighborsClassifier\n",
        "  knn= KNeighborsRegressor(20, weights=\"distance\")\n",
        "  #knn= KNeighborsClassifier(20, weights=\"uniform\")\n",
        "  knn.fit(X_train, y_train)\n",
        "  knn.predict(X_test)\n",
        "  ```\n",
        "- Write down your observations.\n",
        "----\n",
        "### Hints:\n",
        "- On total you can get 10 points for fully completing all tasks.\n",
        "- Decorate your notebook with questions, explanation etc, make it self contained and understandable!\n",
        "- Comment your code when necessary!\n",
        "- Write functions for repetitive tasks!\n",
        "- Use the pandas package for data loading and handling\n",
        "- Use matplotlib and seaborn for plotting or bokeh and plotly for interactive investigation\n",
        "- Use the scikit learn package for almost everything\n",
        "- Use for loops only if it is really necessary!\n",
        "- Code sharing is not allowed between students! Sharing code will result in zero points.\n",
        "- If you use code found on web, it is OK, but, make its source clear!"
      ]
    }
  ]
}