@JonasSchroeder
Created December 24, 2019 10:02
{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"name": "Feature Engineering.ipynb",
"provenance": [],
"collapsed_sections": [],
"include_colab_link": true
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
}
},
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "view-in-github",
"colab_type": "text"
},
"source": [
"<a href=\"https://colab.research.google.com/gist/JonasSchroeder/2a53bcff82481414d90db0c995c79c61/feature-engineering.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "8BWgAVOyQF2w",
"colab_type": "text"
},
"source": [
"# Feature Engineering\n",
"\n",
"Machine learning algorithms often expect numerical data in a tidy format of[n_samples, n_features]. Since data rarely is available in this format, feature engineering as the practice of turning information about the problem into relevant numbers becomes necessary.\n",
"\n",
"1. Features for Categorical Data\n",
"2. Text Features\n",
"3. Image Features\n",
"4. Derived Features\n",
"5. Imputation of missing data\n",
"\n",
"# 1. Categorical Features\n",
"Names are typical categorical data which can be vectorized using straightforward *numerical mapping*:"
]
},
{
"cell_type": "code",
"metadata": {
"id": "snYq_-02RWD8",
"colab_type": "code",
"colab": {}
},
"source": [
"name = {'Jonas' : 1,\n",
" 'Alice' : 2,\n",
" 'Bobby' : 3}"
],
"execution_count": 0,
"outputs": []
},
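{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick illustration (assuming pandas is available and using a hypothetical sample of names), the mapping above can be applied to a column with pandas:"
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"import pandas as pd\n",
"\n",
"# Apply the numerical mapping defined above to a hypothetical sample of names\n",
"names = pd.Series(['Jonas', 'Bobby', 'Alice', 'Jonas'])\n",
"names.map(name)"
],
"execution_count": 0,
"outputs": []
},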
{
"cell_type": "markdown",
"metadata": {
"id": "Pfi-Rk04RkHM",
"colab_type": "text"
},
"source": [
"However, this approach does not work well with Scikit-Learn, since the models assume numerical features for any algebraic values, like ('Bobby' - ' Alice' == 'Jonas') => True.\n",
"\n",
"An alternative for this case is **one-hot encoding** where the presence or absence of a category is noted with either 1 or 0. Scikit-Learn's DictVectorizer can turn a list of dictionaries into one-hot encoded data."
]
},
{
"cell_type": "code",
"metadata": {
"id": "QWpGc0FDSigt",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 85
},
"outputId": "e97d1ba6-257a-4f3a-b1f9-8f9f5684b86d"
},
"source": [
"from sklearn.feature_extraction import DictVectorizer\n",
"\n",
"data = [\n",
" {'price': 150, 'beds': 1, 'city': 'Frankfurt'},\n",
" {'price': 450, 'beds': 2, 'city': 'New York'}, \n",
" {'price': 100, 'beds': 1, 'city': 'Berlin'}, \n",
" {'price': 20, 'beds': 6, 'city': 'Gaggenau'}\n",
" ] \n",
"\n",
"vec = DictVectorizer(sparse=False, dtype=int)\n",
"vec.fit_transform(data)"
],
"execution_count": 3,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"array([[ 1, 0, 1, 0, 0, 150],\n",
" [ 2, 0, 0, 0, 1, 450],\n",
" [ 1, 1, 0, 0, 0, 100],\n",
" [ 6, 0, 0, 1, 0, 20]])"
]
},
"metadata": {
"tags": []
},
"execution_count": 3
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "8cpWCnNcTipR",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 119
},
"outputId": "9d8156a2-a125-432f-d183-7a4b8e04fc9a"
},
"source": [
"vec.get_feature_names()"
],
"execution_count": 4,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"['beds',\n",
" 'city=Berlin',\n",
" 'city=Frankfurt',\n",
" 'city=Gaggenau',\n",
" 'city=New York',\n",
" 'price']"
]
},
"metadata": {
"tags": []
},
"execution_count": 4
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "dw3i-GBcUNl8",
"colab_type": "text"
},
"source": [
"A big disadvantage of this approach is that when the data has many categories, the size of the dataset will grow extremely. One solution to this issue is to store the 1s and 0s as a sparse matrix."
]
},
{
"cell_type": "code",
"metadata": {
"id": "fjGd_NIlVFfO",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 51
},
"outputId": "7d73d429-cc2c-4cbc-e764-fd52122c8b2f"
},
"source": [
"vec = DictVectorizer(sparse=True, dtype=int)\n",
"vec.fit_transform(data)"
],
"execution_count": 5,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"<4x6 sparse matrix of type '<class 'numpy.int64'>'\n",
"\twith 12 stored elements in Compressed Sparse Row format>"
]
},
"metadata": {
"tags": []
},
"execution_count": 5
}
]
},
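{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a sketch of how to move between the two representations: a sparse result can be converted back to a dense NumPy array with toarray(), which is only sensible while the feature space is small."
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"# Densify the sparse one-hot matrix for estimators that need dense input\n",
"X_sparse = vec.fit_transform(data)\n",
"X_sparse.toarray()"
],
"execution_count": 0,
"outputs": []
},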
{
"cell_type": "markdown",
"metadata": {
"id": "KbCZAIMQVPRZ",
"colab_type": "text"
},
"source": [
"Not all Scikit-Learn models allow sparse inputs when fitting and evaluating the model.\n",
"\n",
"# 2. Text Features\n",
"\n",
"In order to turn text into a model, it needs to be transformed into numbers first. One simple solution is to use *word counts* and encode text snippets as count of occurence of words illustrated in a table with each word as a column."
]
},
{
"cell_type": "code",
"metadata": {
"id": "KwaCENC3V74L",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 142
},
"outputId": "5e136d50-4702-486d-e3ea-e355b1b1d154"
},
"source": [
"from sklearn.feature_extraction.text import CountVectorizer\n",
"import pandas as pd\n",
"\n",
"text_sample = ['today is christmas', 'happy holidays', 'christmas is one of many holidays']\n",
"\n",
"vec = CountVectorizer()\n",
"X = vec.fit_transform(text_sample)\n",
"\n",
"pd.DataFrame(X.toarray(), columns=vec.get_feature_names())"
],
"execution_count": 10,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>christmas</th>\n",
" <th>happy</th>\n",
" <th>holidays</th>\n",
" <th>is</th>\n",
" <th>many</th>\n",
" <th>of</th>\n",
" <th>one</th>\n",
" <th>today</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" christmas happy holidays is many of one today\n",
"0 1 0 0 1 0 0 0 1\n",
"1 0 1 1 0 0 0 0 0\n",
"2 1 0 1 1 1 1 1 0"
]
},
"metadata": {
"tags": []
},
"execution_count": 10
}
]
}
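,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a sketch of how the fitted vectorizer behaves on unseen text (using a hypothetical new snippet): transform() reuses the vocabulary learned by fit_transform(), and words outside that vocabulary are simply ignored."
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"# Encode a new snippet with the already-fitted vocabulary;\n",
"# 'and' and 'tomorrow' are not in the vocabulary and are dropped\n",
"new_text = ['christmas today and tomorrow']\n",
"vec.transform(new_text).toarray()"
],
"execution_count": 0,
"outputs": []
}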
]
}