qbeer/HW_02.ipynb

## HW_02.ipynb
{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "name": "02_BLANK.ipynb",
      "provenance": [],
      "collapsed_sections": [],
      "include_colab_link": true
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    },
    "language_info": {
      "name": "python"
    }
  },
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "view-in-github",
        "colab_type": "text"
      },
      "source": [
        "<a href=\"https://colab.research.google.com/gist/qbeer/370770dacb737a35fb06725b69a13c05/02_blank.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "ZadqMhWov-qv"
      },
      "source": [
        "# Unsupervised learning & clustering\n",
        "----\n",
        "### 1. Reading data\n",
        "The worldbank_jobs_2016.tsv (can be found in the same folder with this notebook) file contains the Jobs (and other) data for the 2016 year, downloaded from The World Bank's webpage.\n",
        "\n",
        "- Look at the data in any text editor. Build up an overall sense how the data is built up and how the missing values are represented.\n",
        "- Read the file into a pandas dataframe and tell pandas the delimiter (or separator) that separates the columns and which special pattern means if a value is missing.\n",
        "- Keep only those rows, which represents countries, at the end there are some useless rows (with missing country code).\n",
        "- The data is in a long format. Convert it into a wide format, where each row is a single country (with country code) and the column names are the features i.e. the Series Codes, the values in the columns are the measured values of the 2016 [YR 2016 column]. (eg the first column is 'EG.CFT.ACCS.ZS', the second is 'EG.ELC.ACCS.ZS'. Order of the columns does not matter)! Try to use the pivot method.\n",
        "- Check that the features are in numeric format (dtypes), this will be needed for modeling!\n",
        "-----\n",
        "### 2. Data preprocessing and inspection\n",
        "- Visualize the missing values!\n",
        "- Keep only those countries which has less than 60 missing features in the original table.\n",
        "- After this drop all features which have missing values for the remaining countries. (Imputation would also work but may introduce a bias because there is less data for less developed countries generally.)\n",
        "- How many counties and features do we have left?\n",
        "- Read the kept features' descriptions. In the original table the Series Name describe the meaning of the features. What do you think, based only on these information, which counties are the most similar to Hungary? And Greece?\n",
        "------\n",
        "### 3. PCA\n",
        "- Perform PCA with 3 principal components on the filtered, imputed data (from now on, data refers to the filtered, imputed dataset)\n",
        "- Plot the three embedded 2D combination next to each other (0 vs 1, 0 vs 2 and 1 vs 2)\n",
        "- It seems that the embedding is really dominated by a single direction. Normalize the data (each feature should have zero mean and unit variance after normalization) and re-do the PCA and the plotting (do not delete the previous plots, just make new ones).\n",
        "- Give some explaination for the second principal component: Look at the coefficients of the features which were use the calculate that principal component. For the features with the largest coefficient (in absolute value) look up the Series Name for the Code.\n",
        "-----\n",
        "### 4. T-SNE\n",
        "- Perform T-SNE on the scaled data with 2 components\n",
        "- Plot the embeddings results. Add a text label for each point to make it possible to interpret the results. It will not be possible to read all, but try to make it useful, see the attached image as an example!\n",
        "- Highlight Hungary, Greece, Norway, China, Russia (HUN, GRC, NOR, CHN, RUS)! Which countries are the closest one to Hungary and Greece?\n",
        "-------\n",
        "### 5. Hierarchical and K-Means clustering\n",
        "- Perform hierarchical clustering on the filtered and scaled data (hint: use seaborn)\n",
        "- Try to plot in a way that all country's name is visible\n",
        "- Perform K-Means clustering on the filtered and scaled data with 4 clusters.\n",
        "- Make a plot with text label for each point as in the previous excersice but use different color for every cluster.\n",
        "- Write down your impressions that you got from these two plots! Which cluster are China and Hungary in?\n",
        "----\n",
        "### Hints:\n",
        "- On total you can get 10 points for fully completing all tasks.\n",
        "- Decorate your notebook with questions, explanation etc, make it self contained and understandable!\n",
        "- Comment your code when necessary!\n",
        "- Write functions for repetitive tasks!\n",
        "- Use the pandas package for data loading and handling\n",
        "- Use matplotlib and seaborn for plotting or bokeh and plotly for interactive investigation\n",
        "- Use the scikit learn package for almost everything\n",
        "- Use for loops only if it is really necessary!\n",
        "- Code sharing is not allowed between students! Sharing code will result in zero points.\n",
        "- If you use code found on web, it is OK, but, make its source clear!"
      ]
    }
  ]
}
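
The reshaping in task 1 and the standardized PCA in task 3 can be sketched on a tiny made-up table. This is only an illustration under stated assumptions, not a reference solution: the column names `Country Code`, `Series Code` and `2016 [YR2016]`, the `..` missing-value marker, and all numbers are hypothetical and should be checked against the real worldbank_jobs_2016.tsv.

```python
import io

import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Toy stand-in for the real TSV. Column names, the ".." missing marker
# and every value here are assumptions made for illustration only.
toy_tsv = (
    "Country Code\tSeries Code\t2016 [YR2016]\n"
    "AAA\tF1\t1.0\n"
    "AAA\tF2\t100.0\n"
    "AAA\tF3\t..\n"
    "BBB\tF1\t2.0\n"
    "BBB\tF2\t300.0\n"
    "BBB\tF3\t5.0\n"
    "CCC\tF1\t4.0\n"
    "CCC\tF2\t200.0\n"
    "CCC\tF3\t7.0\n"
    "\tF1\t9.0\n"  # trailing row without a country code
)

# Tell pandas the separator and which pattern marks a missing value.
long_df = pd.read_csv(io.StringIO(toy_tsv), sep="\t", na_values="..")

# Keep only rows that belong to a country.
long_df = long_df.dropna(subset=["Country Code"])

# Long -> wide: one row per country, one column per Series Code.
wide_df = long_df.pivot(
    index="Country Code", columns="Series Code", values="2016 [YR2016]"
)

# Drop every feature that still has a missing value.
complete_df = wide_df.dropna(axis="columns")

# Zero mean and unit variance per feature, then PCA.
# Only 2 components here because the toy table has two complete
# features; the task itself asks for 3.
scaled = StandardScaler().fit_transform(complete_df)
pca = PCA(n_components=2)
embedding = pca.fit_transform(scaled)

print(wide_df.dtypes)
print(pd.DataFrame(pca.components_, columns=complete_df.columns))
```

The rows of `pca.components_` hold the per-feature coefficients of each principal component, which is what the interpretation step in task 3 refers to.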