sarguido/Data Preprocessing

## Data Preprocessing
{
 "metadata": {
  "name": ""
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## Data Preprocessing in Scikit-Learn\n",
      "\n",
      "The first step to any data analysis workflow is getting your data into a usable format. The estimators in scikit-learn have very specific requirements for what they'll take in. This notebook will cover different ways of preprocessing your data into a more scikit-learn-friendly format.\n",
      "\n",
      "### Getting Data into Scikit-Learn\n",
      "\n",
      "One of the great things about scikit-learn is that it comes with several toy datasets. If all you want to is play around with certain algorithms, these built-in datasets are a good way to do that. Here's an example of the classic iris dataset."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from sklearn import datasets\n",
      "\n",
      "iris = datasets.load_iris()\n",
      "print iris.data[0:10]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "[[ 5.1  3.5  1.4  0.2]\n",
        " [ 4.9  3.   1.4  0.2]\n",
        " [ 4.7  3.2  1.3  0.2]\n",
        " [ 4.6  3.1  1.5  0.2]\n",
        " [ 5.   3.6  1.4  0.2]\n",
        " [ 5.4  3.9  1.7  0.4]\n",
        " [ 4.6  3.4  1.4  0.3]\n",
        " [ 5.   3.4  1.5  0.2]\n",
        " [ 4.4  2.9  1.4  0.2]\n",
        " [ 4.9  3.1  1.5  0.1]]\n"
       ]
      }
     ],
     "prompt_number": 4
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Scikit-learn also comes with the handwritten digits dataset, the diabetes dataset, a house prices dataset, and sample images.\n",
      "\n",
      "A really important thing to know is that scikit-learn estimators **only take in continuous data in the form of NumPy arrays**. If your data is already continuous, this isn't a problem. There's a function in NumPy called loadtxt() that can read in a CSV file and convert it to an array with ease. For example, here's both the first five data instances and their labels from the glass identification dataset."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from sklearn import preprocessing\n",
      "import numpy as np\n",
      "\n",
      "glass_data = np.loadtxt('../data/glass_data.csv', delimiter=',')\n",
      "glass_target = np.loadtxt('../data/glass_target.csv')\n",
      "print glass_data[0:5], glass_target[0:5]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "[[  1.52101000e+00   1.36400000e+01   4.49000000e+00   1.10000000e+00\n",
        "    7.17800000e+01   6.00000000e-02   8.75000000e+00   0.00000000e+00\n",
        "    0.00000000e+00]\n",
        " [  1.51761000e+00   1.38900000e+01   3.60000000e+00   1.36000000e+00\n",
        "    7.27300000e+01   4.80000000e-01   7.83000000e+00   0.00000000e+00\n",
        "    0.00000000e+00]\n",
        " [  1.51618000e+00   1.35300000e+01   3.55000000e+00   1.54000000e+00\n",
        "    7.29900000e+01   3.90000000e-01   7.78000000e+00   0.00000000e+00\n",
        "    0.00000000e+00]\n",
        " [  1.51766000e+00   1.32100000e+01   3.69000000e+00   1.29000000e+00\n",
        "    7.26100000e+01   5.70000000e-01   8.22000000e+00   0.00000000e+00\n",
        "    0.00000000e+00]\n",
        " [  1.51742000e+00   1.32700000e+01   3.62000000e+00   1.24000000e+00\n",
        "    7.30800000e+01   5.50000000e-01   8.07000000e+00   0.00000000e+00\n",
        "    0.00000000e+00]] [ 1.  1.  1.  1.  1.]\n"
       ]
      }
     ],
     "prompt_number": 8
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "And you're ready to go. However, with categorical data, it's a bit more complicated.\n",
      "\n",
      "In my presentation and in the sklearn_workflow notebook, I use the car evaluation dataset from the UCI machine learning repository. This is a great dataset to work with because it's simple and is great for classification, but all of the values in the dataset are categorical. This means that I have to transform these categorical values into continuous ones.\n",
      "\n",
      "One of the easiest ways I've found for importing categorical data is to read in a file from a csv and put it into a list of dictionaries, which can easily be encoded into 1s and 0s in scikit-learn. For the target variables, that simply gets read into a list and is then encoded."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import csv\n",
      "\n",
      "car_data = list(csv.DictReader(open('../data/cardata.csv', 'rU')))\n",
      "car_target = list(csv.reader(open('../data/cartarget.csv', 'rU')))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 1
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Here's what the first dictionary in the list looks like:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "car_data[10]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 2,
       "text": [
        "{'buying': 'vhigh',\n",
        " 'doors': '2',\n",
        " 'lug_boot': 'small',\n",
        " 'maint': 'vhigh',\n",
        " 'persons': '4',\n",
        " 'safety': 'med'}"
       ]
      }
     ],
     "prompt_number": 2
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The first step in vectorizing our categorical values is to create a DictVectorizer() object and then use fit_transform() and toarray() to get the values into a NumPy array."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from sklearn.feature_extraction import DictVectorizer\n",
      "\n",
      "vec = DictVectorizer()\n",
      "car_data = vec.fit_transform(car_data).toarray()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 3
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Here's a vectorized item and the unencoded item beneath."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "print 'Vectorized:', car_data[10]\n",
      "print 'Unvectorized:', vec.inverse_transform(car_data[10])"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Vectorized: [ 0.  0.  0.  1.  1.  0.  0.  0.  0.  0.  1.  0.  0.  0.  1.  0.  1.  0.\n",
        "  0.  0.  1.]\n",
        "Unvectorized: [{'persons=4': 1.0, 'buying=vhigh': 1.0, 'safety=med': 1.0, 'lug_boot=small': 1.0, 'doors=2': 1.0, 'maint=vhigh': 1.0}]\n"
       ]
      }
     ],
     "prompt_number": 4
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Because the labels are also categorical, those need to be transformed as well. There's a special LabelEncoder() object specifically for this task."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from sklearn import preprocessing\n",
      "\n",
      "le = preprocessing.LabelEncoder()\n",
      "le.fit([\"unacc\", \"acc\", \"good\", \"vgood\"])\n",
      "target = le.transform(car_target[0])"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 5
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Here's the transformed label and what it means."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "print 'Transformed:', target[10] \n",
      "print 'Inverse transformed:', le.inverse_transform(target[10])"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Transformed: 2\n",
        "Inverse transformed: unacc\n"
       ]
      }
     ],
     "prompt_number": 6
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### Splitting Up the Dataset\n",
      "\n",
      "Another preprocessing step is to split up the dataset, in order to avoid overfitting. The train_test_split() function is a really simple way to do that. By default, the size of the test set is 25%."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from sklearn.cross_validation import train_test_split\n",
      "\n",
      "car_data_train, car_data_test, target_train, target_test = train_test_split(car_data, target)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 7
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The length of the whole data set is 1728 instances. After train_test_split():"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "print 'Training set:', len(car_data_train)\n",
      "print 'Test set:', len(car_data_test)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Training set: 1296\n",
        "Test set: 432\n"
       ]
      }
     ],
     "prompt_number": 8
    }
   ],
   "metadata": {}
  }
 ]
}
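
To make the categorical encoding steps in the notebook above more concrete, here is a minimal plain-Python sketch of the idea behind one-hot encoding a list of dictionaries and integer-encoding labels. It is not scikit-learn's implementation: the helper names (`fit_feature_names`, `one_hot`, `fit_label_codes`) and the two sample rows are hypothetical, and it only assumes that each `key=value` pair becomes its own 0/1 column and that labels are numbered in sorted order.

```python
# Plain-Python sketch of one-hot and label encoding (not scikit-learn's code).

def fit_feature_names(rows):
    """Collect every 'key=value' pair seen in the rows, in sorted order."""
    return sorted({"{}={}".format(k, v) for row in rows for k, v in row.items()})


def one_hot(row, feature_names):
    """Return 1.0 where the row contains that 'key=value' pair, else 0.0."""
    present = {"{}={}".format(k, v) for k, v in row.items()}
    return [1.0 if name in present else 0.0 for name in feature_names]


def fit_label_codes(labels):
    """Map each distinct label to an integer, numbering labels in sorted order."""
    return {label: code for code, label in enumerate(sorted(set(labels)))}


# Two hypothetical rows shaped like the car evaluation data (fewer columns).
rows = [
    {"buying": "vhigh", "doors": "2", "safety": "med"},
    {"buying": "low", "doors": "4", "safety": "high"},
]
feature_names = fit_feature_names(rows)
encoded = [one_hot(row, feature_names) for row in rows]
label_codes = fit_label_codes(["unacc", "acc", "good", "vgood"])

print(feature_names)
print(encoded[0])
print(label_codes["unacc"])
```

Each row ends up with exactly one 1.0 per original column, which is why the 21-element vector in the notebook contains six 1s for its six categorical columns.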
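
The 25% hold-out described in the notebook's "Splitting Up the Dataset" section can be illustrated with a small standard-library sketch. This is only an approximation of the idea, not what train_test_split() does internally: the function name `simple_train_test_split`, the fixed `seed`, the stand-in data, and the choice to round the test size up are all assumptions made for the example.

```python
import math
import random


def simple_train_test_split(data, target, test_size=0.25, seed=0):
    """Shuffle the indices once, then hold out a test_size fraction for testing."""
    indices = list(range(len(data)))
    random.Random(seed).shuffle(indices)
    n_test = int(math.ceil(test_size * len(data)))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return (
        [data[i] for i in train_idx],
        [data[i] for i in test_idx],
        [target[i] for i in train_idx],
        [target[i] for i in test_idx],
    )


# Stand-in data with the same number of instances as the car dataset.
data = [[i] for i in range(1728)]
target = [i % 4 for i in range(1728)]
data_train, data_test, target_train, target_test = simple_train_test_split(data, target)

print(len(data_train), len(data_test))
```

Shuffling the indices rather than the rows keeps each instance paired with its label, and with 1728 instances the split sizes match the 1296 and 432 reported in the notebook.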