machine_learning_scikit_kaggle.ipynb
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**[Machine Learning Micro-Course Home Page](https://www.kaggle.com/learn/intro-to-machine-learning)**\n",
"\n",
"---\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Introduction\n",
"Machine learning competitions are a great way to improve your data science skills and measure your progress. \n",
"\n",
"In this exercise, you will create and submit predictions for a Kaggle competition. You can then improve your model (e.g. by adding features) to improve and see how you stack up to others taking this micro-course.\n",
"\n",
"The steps in this notebook are:\n",
"1. Build a Random Forest model with all of your data (**X** and **y**)\n",
"2. Read in the \"test\" data, which doesn't include values for the target. Predict home values in the test data with your Random Forest model.\n",
"3. Submit those predictions to the competition and see your score.\n",
"4. Optionally, come back to see if you can improve your model by adding features or changing your model. Then you can resubmit to see how that stacks up on the competition leaderboard."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Recap\n",
"Here's the code you've written so far. Start by running it again."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Validation MAE when not specifying max_leaf_nodes: 29,653\n",
"Validation MAE for best value of max_leaf_nodes: 27,283\n",
"Validation MAE for Random Forest Model: 22,762\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"/opt/conda/lib/python3.6/site-packages/sklearn/ensemble/forest.py:245: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.\n",
" \"10 in version 0.20 to 100 in 0.22.\", FutureWarning)\n"
]
}
],
"source": [
"# Code you have previously used to load data\n",
"import pandas as pd\n",
"from sklearn.ensemble import RandomForestRegressor\n",
"from sklearn.metrics import mean_absolute_error\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.tree import DecisionTreeRegressor\n",
"from learntools.core import *\n",
"\n",
"\n",
"\n",
"# Path of the file to read. We changed the directory structure to simplify submitting to a competition\n",
"iowa_file_path = '../input/train.csv'\n",
"\n",
"home_data = pd.read_csv(iowa_file_path)\n",
"# Create target object and call it y\n",
"y = home_data.SalePrice\n",
"# Create X\n",
"features = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']\n",
"X = home_data[features]\n",
"\n",
"# Split into validation and training data\n",
"train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)\n",
"\n",
"# Specify Model\n",
"iowa_model = DecisionTreeRegressor(random_state=1)\n",
"# Fit Model\n",
"iowa_model.fit(train_X, train_y)\n",
"\n",
"# Make validation predictions and calculate mean absolute error\n",
"val_predictions = iowa_model.predict(val_X)\n",
"val_mae = mean_absolute_error(val_predictions, val_y)\n",
"print(\"Validation MAE when not specifying max_leaf_nodes: {:,.0f}\".format(val_mae))\n",
"\n",
"# Using best value for max_leaf_nodes\n",
"iowa_model = DecisionTreeRegressor(max_leaf_nodes=100, random_state=1)\n",
"iowa_model.fit(train_X, train_y)\n",
"val_predictions = iowa_model.predict(val_X)\n",
"val_mae = mean_absolute_error(val_predictions, val_y)\n",
"print(\"Validation MAE for best value of max_leaf_nodes: {:,.0f}\".format(val_mae))\n",
"\n",
"# Define the model. Set random_state to 1\n",
"rf_model = RandomForestRegressor(random_state=1)\n",
"rf_model.fit(train_X, train_y)\n",
"rf_val_predictions = rf_model.predict(val_X)\n",
"rf_val_mae = mean_absolute_error(rf_val_predictions, val_y)\n",
"\n",
"print(\"Validation MAE for Random Forest Model: {:,.0f}\".format(rf_val_mae))\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Creating a Model For the Competition\n",
"\n",
"Build a Random Forest model and train it on all of **X** and **y**. "
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/opt/conda/lib/python3.6/site-packages/sklearn/ensemble/forest.py:245: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.\n",
" \"10 in version 0.20 to 100 in 0.22.\", FutureWarning)\n"
]
},
{
"data": {
"text/plain": [
"RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,\n",
" max_features='auto', max_leaf_nodes=None,\n",
" min_impurity_decrease=0.0, min_impurity_split=None,\n",
" min_samples_leaf=1, min_samples_split=2,\n",
" min_weight_fraction_leaf=0.0, n_estimators=10,\n",
" n_jobs=None, oob_score=False, random_state=1, verbose=0,\n",
" warm_start=False)"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# To improve accuracy, create a new Random Forest model which you will train on all training data\n",
"rf_model_on_full_data = RandomForestRegressor(random_state=1)\n",
"\n",
"# fit rf_model_on_full_data on all data from the training data\n",
"rf_model_on_full_data.fit(train_X, train_y)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Make Predictions\n",
"Read the file of \"test\" data. And apply your model to make predictions"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"# path to file you will use for predictions\n",
"test_data_path = '../input/test.csv'\n",
"\n",
"# read test data file using pandas\n",
"test_data = pd.read_csv(test_data_path)\n",
"\n",
"test_data.describe()\n",
"# create test_X which comes from test_data but includes only the columns you used for prediction.\n",
"# The list of columns is stored in a variable called features\n",
"test_X = test_data[['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']]\n",
"\n",
"# make predictions which we will submit. \n",
"test_preds = rf_model_on_full_data.predict(test_X)\n",
"\n",
"# The lines below shows how to save predictions in format used for competition scoring\n",
"# Just uncomment them.\n",
"\n",
"output = pd.DataFrame({'Id': test_data.Id,\n",
" 'SalePrice': test_preds})\n",
"output.to_csv('submission.csv', index=False)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Test Your Work\n",
"After filling in the code above:\n",
"1. Click the **Commit** button. \n",
"2. After your code has finished running, click the \"Open Version\" button. This brings you into the \"viewer mode\" for your notebook. You will need to scroll down to get back to these instructions.\n",
"3. Click **Output** button on the left of your screen. \n",
"\n",
"This will bring you to a part of the screen that looks like this: \n",
"![](https://imgur.com/a/QRHL7Uv)\n",
"\n",
"Select the button to submit and you will see your score. You have now successfully submitted to the competition.\n",
"\n",
"4. If you want to keep working to improve your model, select the edit button. Then you can change your model and repeat the process to submit again. There's a lot of room to improve your model, and you will climb up the leaderboard as you work.\n",
"\n",
"# Continuing Your Progress\n",
"There are many ways to improve your model, and **experimenting is a great way to learn at this point.**\n",
"\n",
"The best way to improve your model is to add features. Look at the list of columns and think about what might affect home prices. Some features will cause errors because of issues like missing values or non-numeric data types. \n",
"\n",
"The [Intermediate Machine Learning](https://www.kaggle.com/learn/intermediate-machine-learning) micro-course will teach you how to handle these types of features. You will also learn to use **xgboost**, a technique giving even better accuracy than Random Forest.\n",
"\n",
"\n",
"# Other Micro-Courses\n",
"The **[Pandas Micro-Course](https://kaggle.com/Learn/Pandas)** will give you the data manipulation skills to quickly go from conceptual idea to implementation in your data science projects. \n",
"\n",
"You are also ready for the **[Deep Learning](https://kaggle.com/Learn/Deep-Learning)** micro-course, where you will build models with better-than-human level performance at computer vision tasks."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"**[Machine Learning Micro-Course Home Page](https://www.kaggle.com/learn/intro-to-machine-learning)**\n",
"\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.6"
}
},
"nbformat": 4,
"nbformat_minor": 1
}