@JonasSchroeder
Created December 24, 2019 10:02
{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"name": "Feature Engineering.ipynb",
"provenance": [],
"collapsed_sections": [],
"include_colab_link": true
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
}
},
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "view-in-github",
"colab_type": "text"
},
"source": [
"<a href=\"https://colab.research.google.com/gist/JonasSchroeder/2a53bcff82481414d90db0c995c79c61/feature-engineering.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "8BWgAVOyQF2w",
"colab_type": "text"
},
"source": [
"# Feature Engineering\n",
"\n",
"Machine learning algorithms often expect numerical data in a tidy format of[n_samples, n_features]. Since data rarely is available in this format, feature engineering as the practice of turning information about the problem into relevant numbers becomes necessary.\n",
"\n",
"1. Features for Categorical Data\n",
"2. Text Features\n",
"3. Image Features\n",
"4. Derived Features\n",
"5. Imputation of missing data\n",
"\n",
"# 1. Categorical Features\n",
"Names are typical categorical data which can be vectorized using straightforward *numerical mapping*:"
]
},
{
"cell_type": "code",
"metadata": {
"id": "snYq_-02RWD8",
"colab_type": "code",
"colab": {}
},
"source": [
"name = {'Jonas' : 1,\n",
" 'Alice' : 2,\n",
" 'Bobby' : 3}"
],
"execution_count": 0,
"outputs": []
},
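{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick illustration (assuming pandas is available and using a hypothetical sample of names), the mapping above can be applied to a column with pandas:"
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"import pandas as pd\n",
"\n",
"# Apply the numerical mapping defined above to a hypothetical sample of names\n",
"names = pd.Series(['Jonas', 'Bobby', 'Alice', 'Jonas'])\n",
"names.map(name)"
],
"execution_count": 0,
"outputs": []
},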
{
"cell_type": "markdown",
"metadata": {
"id": "Pfi-Rk04RkHM",
"colab_type": "text"
},
"source": [
"However, this approach does not work well with Scikit-Learn, since the models assume numerical features for any algebraic values, like ('Bobby' - ' Alice' == 'Jonas') => True.\n",
"\n",
"An alternative for this case is **one-hot encoding** where the presence or absence of a category is noted with either 1 or 0. Scikit-Learn's DictVectorizer can turn a list of dictionaries into one-hot encoded data."
]
},
{
"cell_type": "code",
"metadata": {
"id": "QWpGc0FDSigt",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 85
},
"outputId": "e97d1ba6-257a-4f3a-b1f9-8f9f5684b86d"
},
"source": [
"from sklearn.feature_extraction import DictVectorizer\n",
"\n",
"data = [\n",
" {'price': 150, 'beds': 1, 'city': 'Frankfurt'},\n",
" {'price': 450, 'beds': 2, 'city': 'New York'}, \n",
" {'price': 100, 'beds': 1, 'city': 'Berlin'}, \n",
" {'price': 20, 'beds': 6, 'city': 'Gaggenau'}\n",
" ] \n",
"\n",
"vec = DictVectorizer(sparse=False, dtype=int)\n",
"vec.fit_transform(data)"
],
"execution_count": 3,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"array([[ 1, 0, 1, 0, 0, 150],\n",
" [ 2, 0, 0, 0, 1, 450],\n",
" [ 1, 1, 0, 0, 0, 100],\n",
" [ 6, 0, 0, 1, 0, 20]])"
]
},
"metadata": {
"tags": []
},
"execution_count": 3
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "8cpWCnNcTipR",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 119
},
"outputId": "9d8156a2-a125-432f-d183-7a4b8e04fc9a"
},
"source": [
"vec.get_feature_names()"
],
"execution_count": 4,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"['beds',\n",
" 'city=Berlin',\n",
" 'city=Frankfurt',\n",
" 'city=Gaggenau',\n",
" 'city=New York',\n",
" 'price']"
]
},
"metadata": {
"tags": []
},
"execution_count": 4
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "dw3i-GBcUNl8",
"colab_type": "text"
},
"source": [
"A big disadvantage of this approach is that when the data has many categories, the size of the dataset will grow extremely. One solution to this issue is to store the 1s and 0s as a sparse matrix."
]
},
{
"cell_type": "code",
"metadata": {
"id": "fjGd_NIlVFfO",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 51
},
"outputId": "7d73d429-cc2c-4cbc-e764-fd52122c8b2f"
},
"source": [
"vec = DictVectorizer(sparse=True, dtype=int)\n",
"vec.fit_transform(data)"
],
"execution_count": 5,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"<4x6 sparse matrix of type '<class 'numpy.int64'>'\n",
"\twith 12 stored elements in Compressed Sparse Row format>"
]
},
"metadata": {
"tags": []
},
"execution_count": 5
}
]
},
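{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a sketch of how to move between the two representations: a sparse result can be converted back to a dense NumPy array with toarray(), which is only sensible while the feature space is small."
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"# Densify the sparse one-hot matrix for estimators that need dense input\n",
"X_sparse = vec.fit_transform(data)\n",
"X_sparse.toarray()"
],
"execution_count": 0,
"outputs": []
},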
{
"cell_type": "markdown",
"metadata": {
"id": "KbCZAIMQVPRZ",
"colab_type": "text"
},
"source": [
"Not all Scikit-Learn models allow sparse inputs when fitting and evaluating the model.\n",
"\n",
"# 2. Text Features\n",
"\n",
"In order to turn text into a model, it needs to be transformed into numbers first. One simple solution is to use *word counts* and encode text snippets as count of occurence of words illustrated in a table with each word as a column."
]
},
{
"cell_type": "code",
"metadata": {
"id": "KwaCENC3V74L",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 142
},
"outputId": "5e136d50-4702-486d-e3ea-e355b1b1d154"
},
"source": [
"from sklearn.feature_extraction.text import CountVectorizer\n",
"import pandas as pd\n",
"\n",
"text_sample = ['today is christmas', 'happy holidays', 'christmas is one of many holidays']\n",
"\n",
"vec = CountVectorizer()\n",
"X = vec.fit_transform(text_sample)\n",
"\n",
"pd.DataFrame(X.toarray(), columns=vec.get_feature_names())"
],
"execution_count": 10,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>christmas</th>\n",
" <th>happy</th>\n",
" <th>holidays</th>\n",
" <th>is</th>\n",
" <th>many</th>\n",
" <th>of</th>\n",
" <th>one</th>\n",
" <th>today</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" christmas happy holidays is many of one today\n",
"0 1 0 0 1 0 0 0 1\n",
"1 0 1 1 0 0 0 0 0\n",
"2 1 0 1 1 1 1 1 0"
]
},
"metadata": {
"tags": []
},
"execution_count": 10
}
]
}
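,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a sketch of how the fitted vectorizer behaves on unseen text (using a hypothetical new snippet): transform() reuses the vocabulary learned by fit_transform(), and words outside that vocabulary are simply ignored."
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"# Encode a new snippet with the already-fitted vocabulary;\n",
"# 'and' and 'tomorrow' are not in the vocabulary and are dropped\n",
"new_text = ['christmas today and tomorrow']\n",
"vec.transform(new_text).toarray()"
],
"execution_count": 0,
"outputs": []
}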
]
}