ggodreau/task2.ipynb

## task2.ipynb
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Part 2: Feedback"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## student-sample-1.py"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#!/usr/bin/env python\n",
    "\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "from sklearn import LinearRegression\n",
    "from sklearn.cross_validation import cross_val_score\n",
    "\n",
    "# Load data\n",
    "d = pd.read_csv('part-2-data.train.csv')\n",
    "\n",
    "\n",
    "# Setup data for prediction\n",
    "x1 = data.SalaryNormalized\n",
    "x2 = pd.get_dummies(data.ContractType)\n",
    "\n",
    "# Setup model\n",
    "model = LinearRegression()\n",
    "\n",
    "# Evaluate model\n",
    "from sklearn.cross_validation import cross_val_score\n",
    "from sklearn.cross_validation import train_test_split\n",
    "scores = cross_val_score(model, x2, x1, cv=1, scoring='mean_absolute_error')\n",
    "print(scores.mean())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Student1,\n",
    "\n",
    "Thanks and great work so far on the assignment. Here are a few things to watch out for and consider looking into. Please check them out and let me know what you find:\n",
    "\n",
    "- By convention, Python scripts place all import statements at the top of the file. This shows every dependency up front and makes the code easier to debug. Also keep in mind that once a name has been imported it stays available for the rest of the script, so cross_val_score does not need to be imported a second time.\n",
    "- The statement 'from sklearn import LinearRegression' will raise an ImportError, because LinearRegression lives in a submodule of scikit-learn rather than at the top level of the package. Check out this link to find the correct import path, make changes if necessary, and let me know if you get stuck: http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html\n",
    "- I see you've loaded the data under the variable name 'd', but later refer to it as 'data'. Since 'data' is never defined, those lines will raise a NameError. Which single name would you like to use throughout?\n",
    "- I see you've imported the train_test_split function but haven't used it. Is your data currently split, as-is? Do you think it might benefit your model to split your data into test/train sets using this function? Why or why not?\n",
    "- When performing cross-validation, the number of folds is set by the 'cv' argument of cross_val_score. K-fold cross-validation needs at least 2 folds so that each round has separate training and test data, so cv=1 will raise an error rather than return a score. What value do you think would suit your purposes, and how would it affect your results?\n",
    "- It looks like you're trying to predict salary, x1, from the factor ContractType, x2. What are the conventional variable names for the features and the target? Do you think your code would benefit from following that convention?\n",
    "- I see you've selected ContractType as a factor to predict Salary. How did you arrive at this specific factor to use? Do you think considering the other factors might help? Why or why not?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## student-sample-2.py"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#!/usr/bin/env python\n",
    "\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "from sklearn.linear_model import LinearRegression\n",
    "from sklearn.cross_validation import cross_val_score\n",
    "\n",
    "# Load data\n",
    "data = pd.read_csv('part-2-data.train.csv')\n",
    "\n",
    "\n",
    "# Setup data for prediction\n",
    "y = data.SalaryNormalized\n",
    "X = pd.get_dummies(data.ContractType)\n",
    "\n",
    "# Setup model\n",
    "model = LinearRegression()\n",
    "\n",
    "# Evaluate model\n",
    "scores = cross_val_score(model, X, y, cv=5, scoring='mean_absolute_error')\n",
    "print(scores.mean())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Student2,\n",
    "\n",
    "Great work so far on the assignment. I see you've made a lot of progress since we spoke last and you're really starting to get the hang of it! Here are some things I noticed that I'd like you to take a look at for next time we meet:\n",
    "\n",
    "- Note that the sklearn.cross_validation module was deprecated in scikit-learn v0.18 and removed in v0.20; its functions now live in sklearn.model_selection. Please check which version of scikit-learn you have by typing 'conda list scikit-learn' in your terminal. Upgrade if necessary, then update your imports. Please drop me a line if you need any help with this.\n",
    "- Related to the prior comment, the 'mean_absolute_error' value for the scoring argument of cross_val_score is also deprecated and becomes 'neg_mean_absolute_error' once you upgrade scikit-learn, so please update your code to reflect that change. Note that the new scorer returns negated errors, so the mean you print will be negative.\n",
    "- I see you've used ContractType as a predictor of your target, SalaryNormalized. How many distinct categories does ContractType have? How might this number influence your ability to predict salary? Do you think the introduction of additional factors from your data set would help or hurt your ability to predict salary? Why or why not?\n",
    "- When choosing a number of folds for your cross-validation, you seem to have picked 5. Would it benefit you to increase or decrease this number? What are the tradeoffs?"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
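
The feedback in the notebook above touches on the import path, variable naming, the number of folds, and the scoring string. A minimal sketch of how those points might fit together on a recent scikit-learn is shown below. It is not either student's solution: the DataFrame is a synthetic, hypothetical stand-in for `part-2-data.train.csv` (only the column names `ContractType` and `SalaryNormalized` come from the scripts), and the category labels and salary figures are invented purely so the example runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Hypothetical stand-in for part-2-data.train.csv: the values are synthetic,
# only the two column names are taken from the student scripts.
rng = np.random.default_rng(0)
n_rows = 200
contract_type = rng.choice(["full_time", "part_time"], size=n_rows)
base_salary = np.where(contract_type == "full_time", 35000.0, 22000.0)
data = pd.DataFrame({
    "ContractType": contract_type,
    "SalaryNormalized": base_salary + rng.normal(0.0, 3000.0, size=n_rows),
})

# Conventional names: X for the feature matrix, y for the target.
X = pd.get_dummies(data["ContractType"], dtype=float)
y = data["SalaryNormalized"]

model = LinearRegression()

# cv=1 is rejected, because k-fold cross-validation needs at least 2 folds.
try:
    cross_val_score(model, X, y, cv=1, scoring="neg_mean_absolute_error")
    cv1_failed = False
except ValueError:
    cv1_failed = True

# scikit-learn scorers follow "higher is better", so the error comes back negated.
scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_absolute_error")
mean_abs_error = -scores.mean()
print(mean_abs_error)
```

Flipping the sign of the mean score recovers an ordinary mean absolute error in salary units, which is usually easier to read than the negated value.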