JonathanCMitchell/GA Code_2

## GA Code_2
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Student Version (1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "#!/usr/bin/env python\n",
    "\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "\n",
    "# Should be: from sklearn.model_selection import cross_val_score, train_test_split\n",
    "# sklearn.cross_validation was deprecated in scikit-learn 0.18 and removed in 0.20, so the cross_validation imports in this cell fail on current versions\n",
    "\n",
    "# The import below should be: from sklearn.linear_model import LinearRegression\n",
    "from sklearn import LinearRegression\n",
    "\n",
    "from sklearn.cross_validation import cross_val_score\n",
    "\n",
    "# Load data\n",
    "# Check the path: '..' means 'go up one level' while './' means 'the current directory'; use './data/train.csv' if the data directory sits next to this notebook\n",
    "# (1) Rename the variable `d` to `data`, because the code below refers to `data`\n",
    "d = pd.read_csv('../data/train.csv')\n",
    "\n",
    "\n",
    "# Setup data for prediction\n",
    "# `data` is undefined here (NameError): the DataFrame is bound to `d`, so use `d` or follow suggestion (1)\n",
    "x1 = data.SalaryNormalized\n",
    "x2 = pd.get_dummies(data.ContractType)\n",
    "\n",
    "# Setup model\n",
    "model = LinearRegression()\n",
    "\n",
    "# Evaluate model\n",
    "# It is helpful to move your import statements to the first lines in your code\n",
    "# Do not import a module if you do not intend to use it, and do not import the same name twice\n",
    "\n",
    "from sklearn.cross_validation import cross_val_score\n",
    "from sklearn.cross_validation import train_test_split\n",
    "\n",
    "# should include the following ======\n",
    "# specify the fraction of the data to hold out for testing (0.2 = 20% test, 80% train)\n",
    "split_percent = 0.2\n",
    "X_train, X_test, y_train, y_test = train_test_split(x1, x2, test_size = split_percent)\n",
    "\n",
    "# I would also suggest naming your labeled data / ground truth `y` and your features `X`.\n",
    "# ===============\n",
    "# You must change the shape of the training labels,\n",
    "# either by encoding them or by some other method, so that\n",
    "# they are consistent with the training data.\n",
    "# If pd.get_dummies() returns multiple columns,\n",
    "# you may want to encode the data into a single column to reduce the dimensions.\n",
    "\n",
    "# For a hold-out evaluation, call model.fit() on the training split before scoring on the test split:\n",
    "# model.fit(X_train, y_train)\n",
    "\n",
    "# check the order of x1 and x2: cross_val_score takes the features (X) as its 2nd argument and the target (y) as its 3rd\n",
    "# note: cv is the number of folds and must be at least 2, so cv=1 raises an error\n",
    "# on scikit-learn 0.20+ the scorer is named 'neg_mean_absolute_error' (it returns the negated MAE)\n",
    "scores = cross_val_score(model, x2, x1, cv=1, scoring='mean_absolute_error')\n",
    "print(scores.mean())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Notice that your label data (the output of `pd.get_dummies`) has two columns, whereas your training data has one column. This won't work. Consider using a binarizer to encode your classes in a single column, so that each row carries one value to compare: 1 for full-time and 0 for part-time. [See here for more info on LabelBinarizer](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelBinarizer.html)\n",
    "\n",
    "You may also consider using stratified k-fold to split your data instead of `train_test_split`. Stratified k-fold preserves the percentage of samples of each class in every fold. See more [here](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Student Version (2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "from sklearn.linear_model import LinearRegression\n",
    "\n",
    "# The import below should be: from sklearn.model_selection import cross_val_score\n",
    "from sklearn.cross_validation import cross_val_score\n",
    "\n",
    "# Load data\n",
    "data = pd.read_csv('../data/train.csv')\n",
    "\n",
    "\n",
    "# Setup data for prediction\n",
    "# Attribute access (data.SalaryNormalized) works, but bracket access (data['SalaryNormalized']) is the safer habit because it cannot clash with DataFrame method names\n",
    "# Check the roles of X and y: by convention X holds the features and y holds the ground-truth labels, so swap them if ContractType is meant to be the label\n",
    "y = data.SalaryNormalized\n",
    "\n",
    "X = pd.get_dummies(data.ContractType)\n",
    "\n",
    "\n",
    "# Setup model\n",
    "model = LinearRegression()\n",
    "\n",
    "# Evaluate model\n",
    "# cross_val_score takes (estimator, X, y): features 2nd, target 3rd; swap X and y here only if you swap their roles above\n",
    "# on scikit-learn 0.20+ the scorer is named 'neg_mean_absolute_error' (it returns the negated MAE)\n",
    "scores = cross_val_score(model, X, y, cv=5, scoring='mean_absolute_error')\n",
    "print(scores.mean())"
   ]
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python [default]",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
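
The review comments in both cells point toward the same corrected workflow, which can be sketched as follows. This is a minimal sketch, not the student's or the reviewer's final code. It assumes that salary is the target and contract type is the only feature. Because `../data/train.csv` is not included here, the data frame is synthetic: the `full_time`/`part_time` category names and all salary figures are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for ../data/train.csv (all values are made up).
rng = np.random.default_rng(0)
n = 200
contract = rng.choice(["full_time", "part_time"], size=n)
salary = np.where(contract == "full_time", 35000.0, 20000.0)
salary = salary + rng.normal(0.0, 3000.0, size=n)
data = pd.DataFrame({"ContractType": contract, "SalaryNormalized": salary})

# Features: a single 0/1 column. drop_first=True drops the redundant
# second dummy column, so the two classes are encoded in one column.
X = pd.get_dummies(data["ContractType"], drop_first=True).astype(float)
# Target: the continuous salary.
y = data["SalaryNormalized"]

model = LinearRegression()

# cross_val_score(estimator, X, y) fits a fresh copy of the model on each
# fold, so no separate model.fit() call is needed. cv must be >= 2.
scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_absolute_error")

# The scorer returns negated MAE (higher is better), so flip the sign.
mae = -scores.mean()
print(mae)
```

With the real data, replace the synthetic block with `data = pd.read_csv('../data/train.csv')`. The rest should carry over unchanged, provided the file has the `ContractType` and `SalaryNormalized` columns used in the cells above.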