@kumarsuraj9450
Created March 31, 2019 10:48
courses/ml1/BHU ml competition/lesson1-rf.ipynb
{
"cells": [
{
"metadata": {},
"cell_type": "markdown",
"source": "**Important: This notebook will only work with fastai-0.7.x. Do not try to run any fastai-1.x code from this path in the repository because it will load fastai-0.7.x**"
},
{
"metadata": {
"heading_collapsed": true
},
"cell_type": "markdown",
"source": "# Intro to Random Forests"
},
{
"metadata": {
"heading_collapsed": true,
"hidden": true
},
"cell_type": "markdown",
"source": "## About this course"
},
{
"metadata": {
"hidden": true
},
"cell_type": "markdown",
"source": "### Teaching approach"
},
{
"metadata": {
"hidden": true
},
"cell_type": "markdown",
"source": "This course is being taught by Jeremy Howard, and was developed by Jeremy along with Rachel Thomas. Rachel has been dealing with a life-threatening illness so will not be teaching as originally planned this year.\n\nJeremy has worked in a number of different areas - feel free to ask about anything that he might be able to help you with at any time, even if not directly related to the current topic:\n\n- Management consultant (McKinsey; AT Kearney)\n- Self-funded startup entrepreneur (Fastmail: first consumer synchronized email; Optimal Decisions: first optimized insurance pricing)\n- VC-funded startup entrepreneur (Kaggle; Enlitic: first deep-learning medical company)"
},
{
"metadata": {
"hidden": true
},
"cell_type": "markdown",
"source": "I'll be using a *top-down* teaching method, which is different from how most math courses operate. Typically, in a *bottom-up* approach, you first learn all the separate components you will be using, and then you gradually build them up into more complex structures. The problems with this are that students often lose motivation, don't have a sense of the \"big picture\", and don't know what they'll need.\n\nIf you took the fast.ai deep learning course, the top-down method is what we used there. You can hear more about my teaching philosophy [in this blog post](http://www.fast.ai/2016/10/08/teaching-philosophy/) or [in this talk](https://vimeo.com/214233053).\n\nHarvard Professor David Perkins has a book, [Making Learning Whole](https://www.amazon.com/Making-Learning-Whole-Principles-Transform/dp/0470633719) in which he uses baseball as an analogy. We don't require kids to memorize all the rules of baseball and understand all the technical details before we let them play the game. Rather, they start playing with just a general sense of it, and then gradually learn more rules/details as time goes on.\n\nAll that to say, don't worry if you don't understand everything at first! You're not supposed to. We will start using some \"black boxes\" such as random forests that haven't yet been explained in detail, and then we'll dig into the lower-level details later.\n\nTo start, focus on what things DO, not what they ARE."
},
{
"metadata": {
"hidden": true
},
"cell_type": "markdown",
"source": "### Your practice"
},
{
"metadata": {
"hidden": true
},
"cell_type": "markdown",
"source": "People learn by:\n1. **doing** (coding and building)\n2. **explaining** what they've learned (by writing or helping others)\n\nTherefore, we suggest that you practice these skills on Kaggle by:\n1. Entering competitions (*doing*)\n2. Creating Kaggle kernels (*explaining*)\n\nIt's OK if you don't get good competition ranks or any kernel votes at first - that's totally normal! Just try to keep improving every day, and you'll see the results over time."
},
{
"metadata": {
"hidden": true
},
"cell_type": "markdown",
"source": "To get better at technical writing, study the top ranked Kaggle kernels from past competitions, and read posts from well-regarded technical bloggers. Some good role models include:\n\n- [Peter Norvig](http://nbviewer.jupyter.org/url/norvig.com/ipython/ProbabilityParadox.ipynb) (more [here](http://norvig.com/ipython/))\n- [Stephen Merity](https://smerity.com/articles/2017/deepcoder_and_ai_hype.html)\n- [Julia Evans](https://codewords.recurse.com/issues/five/why-do-neural-networks-think-a-panda-is-a-vulture) (more [here](https://jvns.ca/blog/2014/08/12/what-happens-if-you-write-a-tcp-stack-in-python/))\n- [Julia Ferraioli](http://blog.juliaferraioli.com/2016/02/exploring-world-using-vision-twilio.html)\n- [Edwin Chen](http://blog.echen.me/2014/10/07/moving-beyond-ctr-better-recommendations-through-human-evaluation/)\n- [Slav Ivanov](https://blog.slavv.com/picking-an-optimizer-for-style-transfer-86e7b8cba84b) (fast.ai student)\n- [Brad Kenstler](https://hackernoon.com/non-artistic-style-transfer-or-how-to-draw-kanye-using-captain-picards-face-c4a50256b814) (fast.ai and USF MSAN student)"
},
{
"metadata": {
"hidden": true
},
"cell_type": "markdown",
"source": "### Books"
},
{
"metadata": {
"hidden": true
},
"cell_type": "markdown",
"source": "The more familiarity you have with numeric programming in Python, the better. If you're looking to improve in this area, we strongly suggest Wes McKinney's [Python for Data Analysis, 2nd ed](https://www.amazon.com/Python-Data-Analysis-Wrangling-IPython/dp/1491957662/ref=asap_bc?ie=UTF8).\n\nFor machine learning with Python, we recommend:\n\n- [Introduction to Machine Learning with Python](https://www.amazon.com/Introduction-Machine-Learning-Andreas-Mueller/dp/1449369413): From one of the scikit-learn authors, which is the main library we'll be using\n- [Python Machine Learning: Machine Learning and Deep Learning with Python, scikit-learn, and TensorFlow, 2nd Edition](https://www.amazon.com/Python-Machine-Learning-scikit-learn-TensorFlow/dp/1787125939/ref=dp_ob_title_bk): New version of a very successful book. A lot of the new material however covers deep learning in Tensorflow, which isn't relevant to this course\n- [Hands-On Machine Learning with Scikit-Learn and TensorFlow](https://www.amazon.com/Hands-Machine-Learning-Scikit-Learn-TensorFlow/dp/1491962291/ref=pd_lpo_sbs_14_t_0?_encoding=UTF8&psc=1&refRID=MBV2QMFH3EZ6B3YBY40K)\n"
},
{
"metadata": {
"hidden": true
},
"cell_type": "markdown",
"source": "### Syllabus in brief"
},
{
"metadata": {
"hidden": true
},
"cell_type": "markdown",
"source": "Depending on time and class interests, we'll cover something like (not necessarily in this order):\n\n- Train vs test\n - Effective validation set construction\n- Trees and ensembles\n - Creating random forests\n - Interpreting random forests\n- What is ML? Why do we use it?\n - What makes a good ML project?\n - Structured vs unstructured data\n - Examples of failures/mistakes\n- Feature engineering\n - Domain specific - dates, URLs, text\n - Embeddings / latent factors\n- Regularized models trained with SGD\n - GLMs, Elasticnet, etc (NB: see what James covered)\n- Basic neural nets\n - PyTorch\n - Broadcasting, Matrix Multiplication\n - Training loop, backpropagation\n- KNN\n- CV / bootstrap (Diabetes data set?)\n- Ethical considerations"
},
{
"metadata": {
"hidden": true
},
"cell_type": "markdown",
"source": "Skip:\n\n- Dimensionality reduction\n- Interactions\n- Monitoring training\n- Collaborative filtering\n- Momentum and LR annealing\n"
},
{
"metadata": {
"heading_collapsed": true,
"hidden": true
},
"cell_type": "markdown",
"source": "## Imports"
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2019-03-30T17:44:46.867890Z",
"start_time": "2019-03-30T17:44:46.012183Z"
},
"hidden": true,
"trusted": true
},
"cell_type": "code",
"source": "%load_ext autoreload\n%autoreload 2\n\n%matplotlib inline",
"execution_count": 1,
"outputs": []
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2019-03-30T17:46:21.886832Z",
"start_time": "2019-03-30T17:46:21.760000Z"
},
"hidden": true,
"trusted": true
},
"cell_type": "code",
"source": "from fastai.imports import *\nfrom fastai.structured import *\n\nfrom pandas_summary import DataFrameSummary\nfrom sklearn.ensemble import RandomForestRegressor, RandomForestClassifier\nfrom IPython.display import display\n\nfrom sklearn import metrics",
"execution_count": 3,
"outputs": []
},
{
"metadata": {
"hidden": true,
"trusted": true
},
"cell_type": "code",
"source": "PATH = \"bulldozers/\"",
"execution_count": 9,
"outputs": []
},
{
"metadata": {
"hidden": true,
"trusted": true
},
"cell_type": "code",
"source": "# Portable directory listing; `! ls {PATH}` fails in the Windows shell\nimport os\nos.listdir(PATH)",
"execution_count": 11,
"outputs": []
},
{
"metadata": {},
"cell_type": "markdown",
"source": "# Introduction to *Blue Book for Bulldozers*"
},
{
"metadata": {
"code_folding": [],
"heading_collapsed": true
},
"cell_type": "markdown",
"source": "## About..."
},
{
"metadata": {
"heading_collapsed": true,
"hidden": true
},
"cell_type": "markdown",
"source": "### ...our teaching"
},
{
"metadata": {
"hidden": true
},
"cell_type": "markdown",
"source": "At fast.ai we have a distinctive [teaching philosophy](http://www.fast.ai/2016/10/08/teaching-philosophy/) of [\"the whole game\"](https://www.amazon.com/Making-Learning-Whole-Principles-Transform/dp/0470633719/ref=sr_1_1?ie=UTF8&qid=1505094653). This is different from how most traditional math & technical courses are taught, where you have to learn all the individual elements before you can combine them (Harvard professor David Perkins calls this *elementitis*), but it is similar to how topics like *driving* and *baseball* are taught. That is, you can start driving without [knowing how an internal combustion engine works](https://medium.com/towards-data-science/thoughts-after-taking-the-deeplearning-ai-courses-8568f132153), and children begin playing baseball before they learn all the formal rules."
},
{
"metadata": {
"heading_collapsed": true,
"hidden": true
},
"cell_type": "markdown",
"source": "### ...our approach to machine learning"
},
{
"metadata": {
"hidden": true
},
"cell_type": "markdown",
"source": "Most machine learning courses will throw at you dozens of different algorithms, with a brief technical description of the math behind them, and maybe a toy example. You're left confused by the enormous range of techniques shown and have little practical understanding of how to apply them.\n\nThe good news is that modern machine learning can be distilled down to a couple of key techniques that are of very wide applicability. Recent studies have shown that the vast majority of datasets can be best modeled with just two methods:\n\n- *Ensembles of decision trees* (i.e. Random Forests and Gradient Boosting Machines), mainly for structured data (such as you might find in a database table at most companies)\n- *Multi-layered neural networks learnt with SGD* (i.e. shallow and/or deep learning), mainly for unstructured data (such as audio, vision, and natural language)\n\nIn this course we'll be doing a deep dive into random forests, and simple models learnt with SGD. You'll be learning about gradient boosting and deep learning in part 2."
},
{
"metadata": {
"heading_collapsed": true,
"hidden": true
},
"cell_type": "markdown",
"source": "### ...this dataset"
},
{
"metadata": {
"hidden": true
},
"cell_type": "markdown",
"source": "We will be looking at the Blue Book for Bulldozers Kaggle Competition: \"The goal of the contest is to predict the sale price of a particular piece of heavy equipment at auction based on its usage, equipment type, and configuration. The data is sourced from auction result postings and includes information on usage and equipment configurations.\"\n\nThis is a very common type of dataset and prediction problem, and similar to what you may see in your project or workplace."
},
{
"metadata": {
"heading_collapsed": true,
"hidden": true
},
"cell_type": "markdown",
"source": "### ...Kaggle Competitions"
},
{
"metadata": {
"hidden": true
},
"cell_type": "markdown",
"source": "Kaggle is an awesome resource for aspiring data scientists or anyone looking to improve their machine learning skills. There is nothing like being able to get hands-on practice and receiving real-time feedback to help you improve your skills.\n\nKaggle provides:\n\n1. Interesting data sets\n2. Feedback on how you're doing\n3. A leaderboard to see what's good, what's possible, and what's state of the art\n4. Blog posts by winning contestants that share useful tips and techniques"
},
{
"metadata": {},
"cell_type": "markdown",
"source": "## The data"
},
{
"metadata": {
"heading_collapsed": true
},
"cell_type": "markdown",
"source": "### Look at the data"
},
{
"metadata": {
"hidden": true
},
"cell_type": "markdown",
"source": "Kaggle provides info about some of the fields of our dataset; on the [Kaggle Data info](https://www.kaggle.com/c/bluebook-for-bulldozers/data) page they say the following:\n\nFor this competition, you are predicting the sale price of bulldozers sold at auctions. The data for this competition is split into three parts:\n\n- **Train.csv** is the training set, which contains data through the end of 2011.\n- **Valid.csv** is the validation set, which contains data from January 1, 2012 - April 30, 2012. You make predictions on this set throughout the majority of the competition. Your score on this set is used to create the public leaderboard.\n- **Test.csv** is the test set, which won't be released until the last week of the competition. It contains data from May 1, 2012 - November 2012. Your score on the test set determines your final rank for the competition.\n\nThe key fields in train.csv are:\n\n- SalesID: the unique identifier of the sale\n- MachineID: the unique identifier of a machine. A machine can be sold multiple times\n- SalePrice: what the machine sold for at auction (only provided in train.csv)\n- saledate: the date of the sale"
},
{
"metadata": {
"hidden": true
},
"cell_type": "markdown",
"source": "*Question*\n\nWhat stands out to you from the above description? What needs to be true of our training and validation sets?"
},
{
"metadata": {
"hidden": true,
"trusted": true
},
"cell_type": "code",
"source": "# low_memory=False reads each column in one pass, avoiding the mixed-type DtypeWarning\ndf_raw = pd.read_csv(f'{PATH}Train.csv', low_memory=False, \n                     parse_dates=[\"saledate\"])",
"execution_count": 80,
"outputs": []
},
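{
"metadata": {
"hidden": true
},
"cell_type": "markdown",
"source": "Because Kaggle's validation and test sets cover later dates than the training set, a held-out set of our own should probably also come from the most recent sales rather than from a random sample. The cell below is a minimal sketch of that idea, assuming `df_raw` as loaded above; `n_valid`, `df_sorted`, `train_part`, and `valid_part` are illustrative names and values, not part of the original notebook."
},
{
"metadata": {
"hidden": true,
"trusted": true
},
"cell_type": "code",
"source": "# Sketch only: hold out the most recent rows, mirroring Kaggle's date-based split.\n# n_valid is an assumed size chosen for illustration.\nn_valid = 12000\ndf_sorted = df_raw.sort_values('saledate')\ntrain_part, valid_part = df_sorted[:-n_valid], df_sorted[-n_valid:]\n\n# Every training sale should be no later than the earliest held-out sale\ntrain_part.saledate.max(), valid_part.saledate.min()",
"execution_count": null,
"outputs": []
},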
{
"metadata": {
"hidden": true
},
"cell_type": "markdown",
"source": "In any sort of data science work, it's **important to look at your data**, to make sure you understand the format, how it's stored, what type of values it holds, etc. Even if you've read descriptions about your data, the actual data may not be what you expect."
},
{
"metadata": {
"hidden": true,
"trusted": true
},
"cell_type": "code",
"source": "def display_all(df):\n with pd.option_context(\"display.max_rows\", 1000, \"display.max_columns\", 1000): \n display(df)",
"execution_count": 81,
"outputs": []
},
{
"metadata": {
"hidden": true,
"trusted": true
},
"cell_type": "code",
"source": "display_all(df_raw.tail().T)",
"execution_count": 82,
"outputs": [
{
"data": {
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>401120</th>\n <th>401121</th>\n <th>401122</th>\n <th>401123</th>\n <th>401124</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>SalesID</th>\n <td>6333336</td>\n <td>6333337</td>\n <td>6333338</td>\n <td>6333341</td>\n <td>6333342</td>\n </tr>\n <tr>\n <th>SalePrice</th>\n <td>10500</td>\n <td>11000</td>\n <td>11500</td>\n <td>9000</td>\n <td>7750</td>\n </tr>\n <tr>\n <th>MachineID</th>\n <td>1840702</td>\n <td>1830472</td>\n <td>1887659</td>\n <td>1903570</td>\n <td>1926965</td>\n </tr>\n <tr>\n <th>ModelID</th>\n <td>21439</td>\n <td>21439</td>\n <td>21439</td>\n <td>21435</td>\n <td>21435</td>\n </tr>\n <tr>\n <th>datasource</th>\n <td>149</td>\n <td>149</td>\n <td>149</td>\n <td>149</td>\n <td>149</td>\n </tr>\n <tr>\n <th>auctioneerID</th>\n <td>1</td>\n <td>1</td>\n <td>1</td>\n <td>2</td>\n <td>2</td>\n </tr>\n <tr>\n <th>YearMade</th>\n <td>2005</td>\n <td>2005</td>\n <td>2005</td>\n <td>2005</td>\n <td>2005</td>\n </tr>\n <tr>\n <th>MachineHoursCurrentMeter</th>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n </tr>\n <tr>\n <th>UsageBand</th>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n </tr>\n <tr>\n <th>saledate</th>\n <td>2011-11-02 00:00:00</td>\n <td>2011-11-02 00:00:00</td>\n <td>2011-11-02 00:00:00</td>\n <td>2011-10-25 00:00:00</td>\n <td>2011-10-25 00:00:00</td>\n </tr>\n <tr>\n <th>fiModelDesc</th>\n <td>35NX2</td>\n <td>35NX2</td>\n <td>35NX2</td>\n <td>30NX</td>\n <td>30NX</td>\n </tr>\n <tr>\n <th>fiBaseModel</th>\n <td>35</td>\n <td>35</td>\n <td>35</td>\n <td>30</td>\n <td>30</td>\n </tr>\n <tr>\n <th>fiSecondaryDesc</th>\n <td>NX</td>\n <td>NX</td>\n 
<td>NX</td>\n <td>NX</td>\n <td>NX</td>\n </tr>\n <tr>\n <th>fiModelSeries</th>\n <td>2</td>\n <td>2</td>\n <td>2</td>\n <td>NaN</td>\n <td>NaN</td>\n </tr>\n <tr>\n <th>fiModelDescriptor</th>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n </tr>\n <tr>\n <th>ProductSize</th>\n <td>Mini</td>\n <td>Mini</td>\n <td>Mini</td>\n <td>Mini</td>\n <td>Mini</td>\n </tr>\n <tr>\n <th>fiProductClassDesc</th>\n <td>Hydraulic Excavator, Track - 3.0 to 4.0 Metric...</td>\n <td>Hydraulic Excavator, Track - 3.0 to 4.0 Metric...</td>\n <td>Hydraulic Excavator, Track - 3.0 to 4.0 Metric...</td>\n <td>Hydraulic Excavator, Track - 2.0 to 3.0 Metric...</td>\n <td>Hydraulic Excavator, Track - 2.0 to 3.0 Metric...</td>\n </tr>\n <tr>\n <th>state</th>\n <td>Maryland</td>\n <td>Maryland</td>\n <td>Maryland</td>\n <td>Florida</td>\n <td>Florida</td>\n </tr>\n <tr>\n <th>ProductGroup</th>\n <td>TEX</td>\n <td>TEX</td>\n <td>TEX</td>\n <td>TEX</td>\n <td>TEX</td>\n </tr>\n <tr>\n <th>ProductGroupDesc</th>\n <td>Track Excavators</td>\n <td>Track Excavators</td>\n <td>Track Excavators</td>\n <td>Track Excavators</td>\n <td>Track Excavators</td>\n </tr>\n <tr>\n <th>Drive_System</th>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n </tr>\n <tr>\n <th>Enclosure</th>\n <td>EROPS</td>\n <td>EROPS</td>\n <td>EROPS</td>\n <td>EROPS</td>\n <td>EROPS</td>\n </tr>\n <tr>\n <th>Forks</th>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n </tr>\n <tr>\n <th>Pad_Type</th>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n </tr>\n <tr>\n <th>Ride_Control</th>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n </tr>\n <tr>\n <th>Stick</th>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n </tr>\n <tr>\n <th>Transmission</th>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n </tr>\n <tr>\n <th>Turbocharged</th>\n 
<td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n </tr>\n <tr>\n <th>Blade_Extension</th>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n </tr>\n <tr>\n <th>Blade_Width</th>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n </tr>\n <tr>\n <th>Enclosure_Type</th>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n </tr>\n <tr>\n <th>Engine_Horsepower</th>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n </tr>\n <tr>\n <th>Hydraulics</th>\n <td>Auxiliary</td>\n <td>Standard</td>\n <td>Auxiliary</td>\n <td>Standard</td>\n <td>Standard</td>\n </tr>\n <tr>\n <th>Pushblock</th>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n </tr>\n <tr>\n <th>Ripper</th>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n </tr>\n <tr>\n <th>Scarifier</th>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n </tr>\n <tr>\n <th>Tip_Control</th>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n </tr>\n <tr>\n <th>Tire_Size</th>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n </tr>\n <tr>\n <th>Coupler</th>\n <td>None or Unspecified</td>\n <td>None or Unspecified</td>\n <td>None or Unspecified</td>\n <td>None or Unspecified</td>\n <td>None or Unspecified</td>\n </tr>\n <tr>\n <th>Coupler_System</th>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n </tr>\n <tr>\n <th>Grouser_Tracks</th>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n </tr>\n <tr>\n <th>Hydraulics_Flow</th>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n </tr>\n <tr>\n <th>Track_Type</th>\n <td>Steel</td>\n <td>Steel</td>\n <td>Steel</td>\n <td>Steel</td>\n <td>Steel</td>\n </tr>\n <tr>\n <th>Undercarriage_Pad_Width</th>\n <td>None or Unspecified</td>\n <td>None or 
Unspecified</td>\n <td>None or Unspecified</td>\n <td>None or Unspecified</td>\n <td>None or Unspecified</td>\n </tr>\n <tr>\n <th>Stick_Length</th>\n <td>None or Unspecified</td>\n <td>None or Unspecified</td>\n <td>None or Unspecified</td>\n <td>None or Unspecified</td>\n <td>None or Unspecified</td>\n </tr>\n <tr>\n <th>Thumb</th>\n <td>None or Unspecified</td>\n <td>None or Unspecified</td>\n <td>None or Unspecified</td>\n <td>None or Unspecified</td>\n <td>None or Unspecified</td>\n </tr>\n <tr>\n <th>Pattern_Changer</th>\n <td>None or Unspecified</td>\n <td>None or Unspecified</td>\n <td>None or Unspecified</td>\n <td>None or Unspecified</td>\n <td>None or Unspecified</td>\n </tr>\n <tr>\n <th>Grouser_Type</th>\n <td>Double</td>\n <td>Double</td>\n <td>Double</td>\n <td>Double</td>\n <td>Double</td>\n </tr>\n <tr>\n <th>Backhoe_Mounting</th>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n </tr>\n <tr>\n <th>Blade_Type</th>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n </tr>\n <tr>\n <th>Travel_Controls</th>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n </tr>\n <tr>\n <th>Differential_Type</th>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n </tr>\n <tr>\n <th>Steering_Controls</th>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n </tr>\n </tbody>\n</table>\n</div>",
"text/plain": " 401120 \\\nSalesID 6333336 \nSalePrice 10500 \nMachineID 1840702 \nModelID 21439 \ndatasource 149 \nauctioneerID 1 \nYearMade 2005 \nMachineHoursCurrentMeter NaN \nUsageBand NaN \nsaledate 2011-11-02 00:00:00 \nfiModelDesc 35NX2 \nfiBaseModel 35 \nfiSecondaryDesc NX \nfiModelSeries 2 \nfiModelDescriptor NaN \nProductSize Mini \nfiProductClassDesc Hydraulic Excavator, Track - 3.0 to 4.0 Metric... \nstate Maryland \nProductGroup TEX \nProductGroupDesc Track Excavators \nDrive_System NaN \nEnclosure EROPS \nForks NaN \nPad_Type NaN \nRide_Control NaN \nStick NaN \nTransmission NaN \nTurbocharged NaN \nBlade_Extension NaN \nBlade_Width NaN \nEnclosure_Type NaN \nEngine_Horsepower NaN \nHydraulics Auxiliary \nPushblock NaN \nRipper NaN \nScarifier NaN \nTip_Control NaN \nTire_Size NaN \nCoupler None or Unspecified \nCoupler_System NaN \nGrouser_Tracks NaN \nHydraulics_Flow NaN \nTrack_Type Steel \nUndercarriage_Pad_Width None or Unspecified \nStick_Length None or Unspecified \nThumb None or Unspecified \nPattern_Changer None or Unspecified \nGrouser_Type Double \nBackhoe_Mounting NaN \nBlade_Type NaN \nTravel_Controls NaN \nDifferential_Type NaN \nSteering_Controls NaN \n\n 401121 \\\nSalesID 6333337 \nSalePrice 11000 \nMachineID 1830472 \nModelID 21439 \ndatasource 149 \nauctioneerID 1 \nYearMade 2005 \nMachineHoursCurrentMeter NaN \nUsageBand NaN \nsaledate 2011-11-02 00:00:00 \nfiModelDesc 35NX2 \nfiBaseModel 35 \nfiSecondaryDesc NX \nfiModelSeries 2 \nfiModelDescriptor NaN \nProductSize Mini \nfiProductClassDesc Hydraulic Excavator, Track - 3.0 to 4.0 Metric... 
\nstate Maryland \nProductGroup TEX \nProductGroupDesc Track Excavators \nDrive_System NaN \nEnclosure EROPS \nForks NaN \nPad_Type NaN \nRide_Control NaN \nStick NaN \nTransmission NaN \nTurbocharged NaN \nBlade_Extension NaN \nBlade_Width NaN \nEnclosure_Type NaN \nEngine_Horsepower NaN \nHydraulics Standard \nPushblock NaN \nRipper NaN \nScarifier NaN \nTip_Control NaN \nTire_Size NaN \nCoupler None or Unspecified \nCoupler_System NaN \nGrouser_Tracks NaN \nHydraulics_Flow NaN \nTrack_Type Steel \nUndercarriage_Pad_Width None or Unspecified \nStick_Length None or Unspecified \nThumb None or Unspecified \nPattern_Changer None or Unspecified \nGrouser_Type Double \nBackhoe_Mounting NaN \nBlade_Type NaN \nTravel_Controls NaN \nDifferential_Type NaN \nSteering_Controls NaN \n\n 401122 \\\nSalesID 6333338 \nSalePrice 11500 \nMachineID 1887659 \nModelID 21439 \ndatasource 149 \nauctioneerID 1 \nYearMade 2005 \nMachineHoursCurrentMeter NaN \nUsageBand NaN \nsaledate 2011-11-02 00:00:00 \nfiModelDesc 35NX2 \nfiBaseModel 35 \nfiSecondaryDesc NX \nfiModelSeries 2 \nfiModelDescriptor NaN \nProductSize Mini \nfiProductClassDesc Hydraulic Excavator, Track - 3.0 to 4.0 Metric... 
\nstate Maryland \nProductGroup TEX \nProductGroupDesc Track Excavators \nDrive_System NaN \nEnclosure EROPS \nForks NaN \nPad_Type NaN \nRide_Control NaN \nStick NaN \nTransmission NaN \nTurbocharged NaN \nBlade_Extension NaN \nBlade_Width NaN \nEnclosure_Type NaN \nEngine_Horsepower NaN \nHydraulics Auxiliary \nPushblock NaN \nRipper NaN \nScarifier NaN \nTip_Control NaN \nTire_Size NaN \nCoupler None or Unspecified \nCoupler_System NaN \nGrouser_Tracks NaN \nHydraulics_Flow NaN \nTrack_Type Steel \nUndercarriage_Pad_Width None or Unspecified \nStick_Length None or Unspecified \nThumb None or Unspecified \nPattern_Changer None or Unspecified \nGrouser_Type Double \nBackhoe_Mounting NaN \nBlade_Type NaN \nTravel_Controls NaN \nDifferential_Type NaN \nSteering_Controls NaN \n\n 401123 \\\nSalesID 6333341 \nSalePrice 9000 \nMachineID 1903570 \nModelID 21435 \ndatasource 149 \nauctioneerID 2 \nYearMade 2005 \nMachineHoursCurrentMeter NaN \nUsageBand NaN \nsaledate 2011-10-25 00:00:00 \nfiModelDesc 30NX \nfiBaseModel 30 \nfiSecondaryDesc NX \nfiModelSeries NaN \nfiModelDescriptor NaN \nProductSize Mini \nfiProductClassDesc Hydraulic Excavator, Track - 2.0 to 3.0 Metric... 
\nstate Florida \nProductGroup TEX \nProductGroupDesc Track Excavators \nDrive_System NaN \nEnclosure EROPS \nForks NaN \nPad_Type NaN \nRide_Control NaN \nStick NaN \nTransmission NaN \nTurbocharged NaN \nBlade_Extension NaN \nBlade_Width NaN \nEnclosure_Type NaN \nEngine_Horsepower NaN \nHydraulics Standard \nPushblock NaN \nRipper NaN \nScarifier NaN \nTip_Control NaN \nTire_Size NaN \nCoupler None or Unspecified \nCoupler_System NaN \nGrouser_Tracks NaN \nHydraulics_Flow NaN \nTrack_Type Steel \nUndercarriage_Pad_Width None or Unspecified \nStick_Length None or Unspecified \nThumb None or Unspecified \nPattern_Changer None or Unspecified \nGrouser_Type Double \nBackhoe_Mounting NaN \nBlade_Type NaN \nTravel_Controls NaN \nDifferential_Type NaN \nSteering_Controls NaN \n\n 401124 \nSalesID 6333342 \nSalePrice 7750 \nMachineID 1926965 \nModelID 21435 \ndatasource 149 \nauctioneerID 2 \nYearMade 2005 \nMachineHoursCurrentMeter NaN \nUsageBand NaN \nsaledate 2011-10-25 00:00:00 \nfiModelDesc 30NX \nfiBaseModel 30 \nfiSecondaryDesc NX \nfiModelSeries NaN \nfiModelDescriptor NaN \nProductSize Mini \nfiProductClassDesc Hydraulic Excavator, Track - 2.0 to 3.0 Metric... \nstate Florida \nProductGroup TEX \nProductGroupDesc Track Excavators \nDrive_System NaN \nEnclosure EROPS \nForks NaN \nPad_Type NaN \nRide_Control NaN \nStick NaN \nTransmission NaN \nTurbocharged NaN \nBlade_Extension NaN \nBlade_Width NaN \nEnclosure_Type NaN \nEngine_Horsepower NaN \nHydraulics Standard \nPushblock NaN \nRipper NaN \nScarifier NaN \nTip_Control NaN \nTire_Size NaN \nCoupler None or Unspecified \nCoupler_System NaN \nGrouser_Tracks NaN \nHydraulics_Flow NaN \nTrack_Type Steel \nUndercarriage_Pad_Width None or Unspecified \nStick_Length None or Unspecified \nThumb None or Unspecified \nPattern_Changer None or Unspecified \nGrouser_Type Double \nBackhoe_Mounting NaN \nBlade_Type NaN \nTravel_Controls NaN \nDifferential_Type NaN \nSteering_Controls NaN "
},
"metadata": {},
"output_type": "display_data"
}
]
},
{
"metadata": {
"hidden": true,
"trusted": true
},
"cell_type": "code",
"source": "display_all(df_raw.describe(include='all').T)",
"execution_count": 83,
"outputs": [
{
"data": {
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>count</th>\n <th>unique</th>\n <th>top</th>\n <th>freq</th>\n <th>first</th>\n <th>last</th>\n <th>mean</th>\n <th>std</th>\n <th>min</th>\n <th>25%</th>\n <th>50%</th>\n <th>75%</th>\n <th>max</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>SalesID</th>\n <td>401125</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>1.91971e+06</td>\n <td>909021</td>\n <td>1.13925e+06</td>\n <td>1.41837e+06</td>\n <td>1.63942e+06</td>\n <td>2.24271e+06</td>\n <td>6.33334e+06</td>\n </tr>\n <tr>\n <th>SalePrice</th>\n <td>401125</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>31099.7</td>\n <td>23036.9</td>\n <td>4750</td>\n <td>14500</td>\n <td>24000</td>\n <td>40000</td>\n <td>142000</td>\n </tr>\n <tr>\n <th>MachineID</th>\n <td>401125</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>1.2179e+06</td>\n <td>440992</td>\n <td>0</td>\n <td>1.0887e+06</td>\n <td>1.27949e+06</td>\n <td>1.46807e+06</td>\n <td>2.48633e+06</td>\n </tr>\n <tr>\n <th>ModelID</th>\n <td>401125</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>6889.7</td>\n <td>6221.78</td>\n <td>28</td>\n <td>3259</td>\n <td>4604</td>\n <td>8724</td>\n <td>37198</td>\n </tr>\n <tr>\n <th>datasource</th>\n <td>401125</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>134.666</td>\n <td>8.96224</td>\n <td>121</td>\n <td>132</td>\n <td>132</td>\n <td>136</td>\n <td>172</td>\n </tr>\n <tr>\n <th>auctioneerID</th>\n <td>380989</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>6.55604</td>\n 
<td>16.9768</td>\n <td>0</td>\n <td>1</td>\n <td>2</td>\n <td>4</td>\n <td>99</td>\n </tr>\n <tr>\n <th>YearMade</th>\n <td>401125</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>1899.16</td>\n <td>291.797</td>\n <td>1000</td>\n <td>1985</td>\n <td>1995</td>\n <td>2000</td>\n <td>2013</td>\n </tr>\n <tr>\n <th>MachineHoursCurrentMeter</th>\n <td>142765</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>3457.96</td>\n <td>27590.3</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>3025</td>\n <td>2.4833e+06</td>\n </tr>\n <tr>\n <th>UsageBand</th>\n <td>69639</td>\n <td>3</td>\n <td>Medium</td>\n <td>33985</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n </tr>\n <tr>\n <th>saledate</th>\n <td>401125</td>\n <td>3919</td>\n <td>2009-02-16 00:00:00</td>\n <td>1932</td>\n <td>1989-01-17 00:00:00</td>\n <td>2011-12-30 00:00:00</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n </tr>\n <tr>\n <th>fiModelDesc</th>\n <td>401125</td>\n <td>4999</td>\n <td>310G</td>\n <td>5039</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n </tr>\n <tr>\n <th>fiBaseModel</th>\n <td>401125</td>\n <td>1950</td>\n <td>580</td>\n <td>19798</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n </tr>\n <tr>\n <th>fiSecondaryDesc</th>\n <td>263934</td>\n <td>175</td>\n <td>C</td>\n <td>43235</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n </tr>\n <tr>\n <th>fiModelSeries</th>\n <td>56908</td>\n <td>128</td>\n <td>II</td>\n <td>13202</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n 
<td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n </tr>\n <tr>\n <th>fiModelDescriptor</th>\n <td>71919</td>\n <td>139</td>\n <td>L</td>\n <td>15875</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n </tr>\n <tr>\n <th>ProductSize</th>\n <td>190350</td>\n <td>6</td>\n <td>Medium</td>\n <td>62274</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n </tr>\n <tr>\n <th>fiProductClassDesc</th>\n <td>401125</td>\n <td>74</td>\n <td>Backhoe Loader - 14.0 to 15.0 Ft Standard Digg...</td>\n <td>56166</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n </tr>\n <tr>\n <th>state</th>\n <td>401125</td>\n <td>53</td>\n <td>Florida</td>\n <td>63944</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n </tr>\n <tr>\n <th>ProductGroup</th>\n <td>401125</td>\n <td>6</td>\n <td>TEX</td>\n <td>101167</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n </tr>\n <tr>\n <th>ProductGroupDesc</th>\n <td>401125</td>\n <td>6</td>\n <td>Track Excavators</td>\n <td>101167</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n </tr>\n <tr>\n <th>Drive_System</th>\n <td>104361</td>\n <td>4</td>\n <td>Two Wheel Drive</td>\n <td>46139</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n </tr>\n <tr>\n <th>Enclosure</th>\n <td>400800</td>\n <td>6</td>\n <td>OROPS</td>\n <td>173932</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n 
<td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n </tr>\n <tr>\n <th>Forks</th>\n <td>192077</td>\n <td>2</td>\n <td>None or Unspecified</td>\n <td>178300</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n </tr>\n <tr>\n <th>Pad_Type</th>\n <td>79134</td>\n <td>4</td>\n <td>None or Unspecified</td>\n <td>70614</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n </tr>\n <tr>\n <th>Ride_Control</th>\n <td>148606</td>\n <td>3</td>\n <td>No</td>\n <td>77685</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n </tr>\n <tr>\n <th>Stick</th>\n <td>79134</td>\n <td>2</td>\n <td>Standard</td>\n <td>48829</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n </tr>\n <tr>\n <th>Transmission</th>\n <td>183230</td>\n <td>8</td>\n <td>Standard</td>\n <td>140328</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n </tr>\n <tr>\n <th>Turbocharged</th>\n <td>79134</td>\n <td>2</td>\n <td>None or Unspecified</td>\n <td>75211</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n </tr>\n <tr>\n <th>Blade_Extension</th>\n <td>25219</td>\n <td>2</td>\n <td>None or Unspecified</td>\n <td>24692</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n </tr>\n <tr>\n <th>Blade_Width</th>\n <td>25219</td>\n <td>6</td>\n <td>14'</td>\n <td>9615</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n 
<td>NaN</td>\n </tr>\n <tr>\n <th>Enclosure_Type</th>\n <td>25219</td>\n <td>3</td>\n <td>None or Unspecified</td>\n <td>21923</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n </tr>\n <tr>\n <th>Engine_Horsepower</th>\n <td>25219</td>\n <td>2</td>\n <td>No</td>\n <td>23937</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n </tr>\n <tr>\n <th>Hydraulics</th>\n <td>320570</td>\n <td>12</td>\n <td>2 Valve</td>\n <td>141404</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n </tr>\n <tr>\n <th>Pushblock</th>\n <td>25219</td>\n <td>2</td>\n <td>None or Unspecified</td>\n <td>19463</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n </tr>\n <tr>\n <th>Ripper</th>\n <td>104137</td>\n <td>4</td>\n <td>None or Unspecified</td>\n <td>83452</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n </tr>\n <tr>\n <th>Scarifier</th>\n <td>25230</td>\n <td>2</td>\n <td>None or Unspecified</td>\n <td>12719</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n </tr>\n <tr>\n <th>Tip_Control</th>\n <td>25219</td>\n <td>3</td>\n <td>None or Unspecified</td>\n <td>16207</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n </tr>\n <tr>\n <th>Tire_Size</th>\n <td>94718</td>\n <td>17</td>\n <td>None or Unspecified</td>\n <td>46339</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n 
</tr>\n <tr>\n <th>Coupler</th>\n <td>213952</td>\n <td>3</td>\n <td>None or Unspecified</td>\n <td>184582</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n </tr>\n <tr>\n <th>Coupler_System</th>\n <td>43458</td>\n <td>2</td>\n <td>None or Unspecified</td>\n <td>40430</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n </tr>\n <tr>\n <th>Grouser_Tracks</th>\n <td>43362</td>\n <td>2</td>\n <td>None or Unspecified</td>\n <td>40515</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n </tr>\n <tr>\n <th>Hydraulics_Flow</th>\n <td>43362</td>\n <td>3</td>\n <td>Standard</td>\n <td>42784</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n </tr>\n <tr>\n <th>Track_Type</th>\n <td>99153</td>\n <td>2</td>\n <td>Steel</td>\n <td>84880</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n </tr>\n <tr>\n <th>Undercarriage_Pad_Width</th>\n <td>99872</td>\n <td>19</td>\n <td>None or Unspecified</td>\n <td>79651</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n </tr>\n <tr>\n <th>Stick_Length</th>\n <td>99218</td>\n <td>29</td>\n <td>None or Unspecified</td>\n <td>78820</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n </tr>\n <tr>\n <th>Thumb</th>\n <td>99288</td>\n <td>3</td>\n <td>None or Unspecified</td>\n <td>83093</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n 
</tr>\n <tr>\n <th>Pattern_Changer</th>\n <td>99218</td>\n <td>3</td>\n <td>None or Unspecified</td>\n <td>90255</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n </tr>\n <tr>\n <th>Grouser_Type</th>\n <td>99153</td>\n <td>3</td>\n <td>Double</td>\n <td>84653</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n </tr>\n <tr>\n <th>Backhoe_Mounting</th>\n <td>78672</td>\n <td>2</td>\n <td>None or Unspecified</td>\n <td>78652</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n </tr>\n <tr>\n <th>Blade_Type</th>\n <td>79833</td>\n <td>10</td>\n <td>PAT</td>\n <td>38612</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n </tr>\n <tr>\n <th>Travel_Controls</th>\n <td>79834</td>\n <td>7</td>\n <td>None or Unspecified</td>\n <td>69923</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n </tr>\n <tr>\n <th>Differential_Type</th>\n <td>69411</td>\n <td>4</td>\n <td>Standard</td>\n <td>68073</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n </tr>\n <tr>\n <th>Steering_Controls</th>\n <td>69369</td>\n <td>5</td>\n <td>Conventional</td>\n <td>68679</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n </tr>\n </tbody>\n</table>\n</div>",
"text/plain": " count unique \\\nSalesID 401125 NaN \nSalePrice 401125 NaN \nMachineID 401125 NaN \nModelID 401125 NaN \ndatasource 401125 NaN \nauctioneerID 380989 NaN \nYearMade 401125 NaN \nMachineHoursCurrentMeter 142765 NaN \nUsageBand 69639 3 \nsaledate 401125 3919 \nfiModelDesc 401125 4999 \nfiBaseModel 401125 1950 \nfiSecondaryDesc 263934 175 \nfiModelSeries 56908 128 \nfiModelDescriptor 71919 139 \nProductSize 190350 6 \nfiProductClassDesc 401125 74 \nstate 401125 53 \nProductGroup 401125 6 \nProductGroupDesc 401125 6 \nDrive_System 104361 4 \nEnclosure 400800 6 \nForks 192077 2 \nPad_Type 79134 4 \nRide_Control 148606 3 \nStick 79134 2 \nTransmission 183230 8 \nTurbocharged 79134 2 \nBlade_Extension 25219 2 \nBlade_Width 25219 6 \nEnclosure_Type 25219 3 \nEngine_Horsepower 25219 2 \nHydraulics 320570 12 \nPushblock 25219 2 \nRipper 104137 4 \nScarifier 25230 2 \nTip_Control 25219 3 \nTire_Size 94718 17 \nCoupler 213952 3 \nCoupler_System 43458 2 \nGrouser_Tracks 43362 2 \nHydraulics_Flow 43362 3 \nTrack_Type 99153 2 \nUndercarriage_Pad_Width 99872 19 \nStick_Length 99218 29 \nThumb 99288 3 \nPattern_Changer 99218 3 \nGrouser_Type 99153 3 \nBackhoe_Mounting 78672 2 \nBlade_Type 79833 10 \nTravel_Controls 79834 7 \nDifferential_Type 69411 4 \nSteering_Controls 69369 5 \n\n top \\\nSalesID NaN \nSalePrice NaN \nMachineID NaN \nModelID NaN \ndatasource NaN \nauctioneerID NaN \nYearMade NaN \nMachineHoursCurrentMeter NaN \nUsageBand Medium \nsaledate 2009-02-16 00:00:00 \nfiModelDesc 310G \nfiBaseModel 580 \nfiSecondaryDesc C \nfiModelSeries II \nfiModelDescriptor L \nProductSize Medium \nfiProductClassDesc Backhoe Loader - 14.0 to 15.0 Ft Standard Digg... 
\nstate Florida \nProductGroup TEX \nProductGroupDesc Track Excavators \nDrive_System Two Wheel Drive \nEnclosure OROPS \nForks None or Unspecified \nPad_Type None or Unspecified \nRide_Control No \nStick Standard \nTransmission Standard \nTurbocharged None or Unspecified \nBlade_Extension None or Unspecified \nBlade_Width 14' \nEnclosure_Type None or Unspecified \nEngine_Horsepower No \nHydraulics 2 Valve \nPushblock None or Unspecified \nRipper None or Unspecified \nScarifier None or Unspecified \nTip_Control None or Unspecified \nTire_Size None or Unspecified \nCoupler None or Unspecified \nCoupler_System None or Unspecified \nGrouser_Tracks None or Unspecified \nHydraulics_Flow Standard \nTrack_Type Steel \nUndercarriage_Pad_Width None or Unspecified \nStick_Length None or Unspecified \nThumb None or Unspecified \nPattern_Changer None or Unspecified \nGrouser_Type Double \nBackhoe_Mounting None or Unspecified \nBlade_Type PAT \nTravel_Controls None or Unspecified \nDifferential_Type Standard \nSteering_Controls Conventional \n\n freq first last \\\nSalesID NaN NaN NaN \nSalePrice NaN NaN NaN \nMachineID NaN NaN NaN \nModelID NaN NaN NaN \ndatasource NaN NaN NaN \nauctioneerID NaN NaN NaN \nYearMade NaN NaN NaN \nMachineHoursCurrentMeter NaN NaN NaN \nUsageBand 33985 NaN NaN \nsaledate 1932 1989-01-17 00:00:00 2011-12-30 00:00:00 \nfiModelDesc 5039 NaN NaN \nfiBaseModel 19798 NaN NaN \nfiSecondaryDesc 43235 NaN NaN \nfiModelSeries 13202 NaN NaN \nfiModelDescriptor 15875 NaN NaN \nProductSize 62274 NaN NaN \nfiProductClassDesc 56166 NaN NaN \nstate 63944 NaN NaN \nProductGroup 101167 NaN NaN \nProductGroupDesc 101167 NaN NaN \nDrive_System 46139 NaN NaN \nEnclosure 173932 NaN NaN \nForks 178300 NaN NaN \nPad_Type 70614 NaN NaN \nRide_Control 77685 NaN NaN \nStick 48829 NaN NaN \nTransmission 140328 NaN NaN \nTurbocharged 75211 NaN NaN \nBlade_Extension 24692 NaN NaN \nBlade_Width 9615 NaN NaN \nEnclosure_Type 21923 NaN NaN \nEngine_Horsepower 23937 NaN NaN 
\nHydraulics 141404 NaN NaN \nPushblock 19463 NaN NaN \nRipper 83452 NaN NaN \nScarifier 12719 NaN NaN \nTip_Control 16207 NaN NaN \nTire_Size 46339 NaN NaN \nCoupler 184582 NaN NaN \nCoupler_System 40430 NaN NaN \nGrouser_Tracks 40515 NaN NaN \nHydraulics_Flow 42784 NaN NaN \nTrack_Type 84880 NaN NaN \nUndercarriage_Pad_Width 79651 NaN NaN \nStick_Length 78820 NaN NaN \nThumb 83093 NaN NaN \nPattern_Changer 90255 NaN NaN \nGrouser_Type 84653 NaN NaN \nBackhoe_Mounting 78652 NaN NaN \nBlade_Type 38612 NaN NaN \nTravel_Controls 69923 NaN NaN \nDifferential_Type 68073 NaN NaN \nSteering_Controls 68679 NaN NaN \n\n mean std min 25% \\\nSalesID 1.91971e+06 909021 1.13925e+06 1.41837e+06 \nSalePrice 31099.7 23036.9 4750 14500 \nMachineID 1.2179e+06 440992 0 1.0887e+06 \nModelID 6889.7 6221.78 28 3259 \ndatasource 134.666 8.96224 121 132 \nauctioneerID 6.55604 16.9768 0 1 \nYearMade 1899.16 291.797 1000 1985 \nMachineHoursCurrentMeter 3457.96 27590.3 0 0 \nUsageBand NaN NaN NaN NaN \nsaledate NaN NaN NaN NaN \nfiModelDesc NaN NaN NaN NaN \nfiBaseModel NaN NaN NaN NaN \nfiSecondaryDesc NaN NaN NaN NaN \nfiModelSeries NaN NaN NaN NaN \nfiModelDescriptor NaN NaN NaN NaN \nProductSize NaN NaN NaN NaN \nfiProductClassDesc NaN NaN NaN NaN \nstate NaN NaN NaN NaN \nProductGroup NaN NaN NaN NaN \nProductGroupDesc NaN NaN NaN NaN \nDrive_System NaN NaN NaN NaN \nEnclosure NaN NaN NaN NaN \nForks NaN NaN NaN NaN \nPad_Type NaN NaN NaN NaN \nRide_Control NaN NaN NaN NaN \nStick NaN NaN NaN NaN \nTransmission NaN NaN NaN NaN \nTurbocharged NaN NaN NaN NaN \nBlade_Extension NaN NaN NaN NaN \nBlade_Width NaN NaN NaN NaN \nEnclosure_Type NaN NaN NaN NaN \nEngine_Horsepower NaN NaN NaN NaN \nHydraulics NaN NaN NaN NaN \nPushblock NaN NaN NaN NaN \nRipper NaN NaN NaN NaN \nScarifier NaN NaN NaN NaN \nTip_Control NaN NaN NaN NaN \nTire_Size NaN NaN NaN NaN \nCoupler NaN NaN NaN NaN \nCoupler_System NaN NaN NaN NaN \nGrouser_Tracks NaN NaN NaN NaN \nHydraulics_Flow NaN NaN NaN NaN 
\nTrack_Type NaN NaN NaN NaN \nUndercarriage_Pad_Width NaN NaN NaN NaN \nStick_Length NaN NaN NaN NaN \nThumb NaN NaN NaN NaN \nPattern_Changer NaN NaN NaN NaN \nGrouser_Type NaN NaN NaN NaN \nBackhoe_Mounting NaN NaN NaN NaN \nBlade_Type NaN NaN NaN NaN \nTravel_Controls NaN NaN NaN NaN \nDifferential_Type NaN NaN NaN NaN \nSteering_Controls NaN NaN NaN NaN \n\n 50% 75% max \nSalesID 1.63942e+06 2.24271e+06 6.33334e+06 \nSalePrice 24000 40000 142000 \nMachineID 1.27949e+06 1.46807e+06 2.48633e+06 \nModelID 4604 8724 37198 \ndatasource 132 136 172 \nauctioneerID 2 4 99 \nYearMade 1995 2000 2013 \nMachineHoursCurrentMeter 0 3025 2.4833e+06 \nUsageBand NaN NaN NaN \nsaledate NaN NaN NaN \nfiModelDesc NaN NaN NaN \nfiBaseModel NaN NaN NaN \nfiSecondaryDesc NaN NaN NaN \nfiModelSeries NaN NaN NaN \nfiModelDescriptor NaN NaN NaN \nProductSize NaN NaN NaN \nfiProductClassDesc NaN NaN NaN \nstate NaN NaN NaN \nProductGroup NaN NaN NaN \nProductGroupDesc NaN NaN NaN \nDrive_System NaN NaN NaN \nEnclosure NaN NaN NaN \nForks NaN NaN NaN \nPad_Type NaN NaN NaN \nRide_Control NaN NaN NaN \nStick NaN NaN NaN \nTransmission NaN NaN NaN \nTurbocharged NaN NaN NaN \nBlade_Extension NaN NaN NaN \nBlade_Width NaN NaN NaN \nEnclosure_Type NaN NaN NaN \nEngine_Horsepower NaN NaN NaN \nHydraulics NaN NaN NaN \nPushblock NaN NaN NaN \nRipper NaN NaN NaN \nScarifier NaN NaN NaN \nTip_Control NaN NaN NaN \nTire_Size NaN NaN NaN \nCoupler NaN NaN NaN \nCoupler_System NaN NaN NaN \nGrouser_Tracks NaN NaN NaN \nHydraulics_Flow NaN NaN NaN \nTrack_Type NaN NaN NaN \nUndercarriage_Pad_Width NaN NaN NaN \nStick_Length NaN NaN NaN \nThumb NaN NaN NaN \nPattern_Changer NaN NaN NaN \nGrouser_Type NaN NaN NaN \nBackhoe_Mounting NaN NaN NaN \nBlade_Type NaN NaN NaN \nTravel_Controls NaN NaN NaN \nDifferential_Type NaN NaN NaN \nSteering_Controls NaN NaN NaN "
},
"metadata": {},
"output_type": "display_data"
}
]
},
{
"metadata": {
"hidden": true
},
"cell_type": "markdown",
"source": "It's important to note what metric is being used for a project. Generally, selecting the metric(s) is an important part of the project setup. However, in this case Kaggle tells us what metric to use: RMSLE (root mean squared log error) between the actual and predicted auction prices. Therefore we take the log of the prices, so that RMSE will give us what we need."
},
{
"metadata": {
"hidden": true,
"trusted": true
},
"cell_type": "code",
"source": "df_raw.SalePrice = np.log(df_raw.SalePrice)",
"execution_count": 84,
"outputs": []
},
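{
"metadata": {
"hidden": true
},
"cell_type": "markdown",
"source": "To see why taking logs first is enough, here is a minimal sketch on made-up prices. The arrays and the helper names `rmse_demo` and `rmsle_demo` are illustrative assumptions, not part of the competition data or of fastai. It follows this notebook's convention of using `np.log`; some RMSLE definitions use `log(1 + x)` instead."
},
{
"metadata": {
"hidden": true
},
"cell_type": "code",
"source": "import numpy as np\n\ndef rmse_demo(pred, actual):\n    # Root mean squared error between two arrays\n    return np.sqrt(np.mean((pred - actual) ** 2))\n\ndef rmsle_demo(pred, actual):\n    # Root mean squared error between the logs of two arrays\n    return np.sqrt(np.mean((np.log(pred) - np.log(actual)) ** 2))\n\nactual_prices = np.array([10000.0, 24000.0, 66000.0])     # hypothetical values\npredicted_prices = np.array([12000.0, 20000.0, 70000.0])  # hypothetical values\n\ndirect = rmsle_demo(predicted_prices, actual_prices)\nvia_logs = rmse_demo(np.log(predicted_prices), np.log(actual_prices))\nnp.isclose(direct, via_logs)",
"execution_count": null,
"outputs": []
},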
{
"metadata": {},
"cell_type": "markdown",
"source": "### Initial processing"
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "m = RandomForestRegressor(n_jobs=-1)\n# The following code is supposed to fail due to string values in the input data\nm.fit(df_raw.drop('SalePrice', axis=1), df_raw.SalePrice)",
"execution_count": 10,
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": "C:\\Users\\HP\\Anaconda3\\envs\\fastai\\lib\\site-packages\\sklearn\\ensemble\\forest.py:246: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.\n \"10 in version 0.20 to 100 in 0.22.\", FutureWarning)\n"
},
{
"ename": "ValueError",
"evalue": "could not convert string to float: 'Low'",
"output_type": "error",
"traceback": [
"\u001b[1;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[1;31mValueError\u001b[0m Traceback (most recent call last)",
"\u001b[1;32m<ipython-input-10-6e70335c9573>\u001b[0m in \u001b[0;36m<module>\u001b[1;34m\u001b[0m\n\u001b[0;32m 1\u001b[0m \u001b[0mm\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mRandomForestRegressor\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mn_jobs\u001b[0m\u001b[1;33m=\u001b[0m\u001b[1;33m-\u001b[0m\u001b[1;36m1\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 2\u001b[0m \u001b[1;31m# The following code is supposed to fail due to string values in the input data\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m----> 3\u001b[1;33m \u001b[0mm\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mfit\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mdf_raw\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mdrop\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;34m'SalePrice'\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0maxis\u001b[0m\u001b[1;33m=\u001b[0m\u001b[1;36m1\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mdf_raw\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mSalePrice\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m",
"\u001b[1;32m~\\Anaconda3\\envs\\fastai\\lib\\site-packages\\sklearn\\ensemble\\forest.py\u001b[0m in \u001b[0;36mfit\u001b[1;34m(self, X, y, sample_weight)\u001b[0m\n\u001b[0;32m 248\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 249\u001b[0m \u001b[1;31m# Validate or convert input data\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m--> 250\u001b[1;33m \u001b[0mX\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mcheck_array\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mX\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0maccept_sparse\u001b[0m\u001b[1;33m=\u001b[0m\u001b[1;34m\"csc\"\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mdtype\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0mDTYPE\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 251\u001b[0m \u001b[0my\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mcheck_array\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0my\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0maccept_sparse\u001b[0m\u001b[1;33m=\u001b[0m\u001b[1;34m'csc'\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mensure_2d\u001b[0m\u001b[1;33m=\u001b[0m\u001b[1;32mFalse\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mdtype\u001b[0m\u001b[1;33m=\u001b[0m\u001b[1;32mNone\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 252\u001b[0m \u001b[1;32mif\u001b[0m \u001b[0msample_weight\u001b[0m \u001b[1;32mis\u001b[0m \u001b[1;32mnot\u001b[0m \u001b[1;32mNone\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n",
"\u001b[1;32m~\\Anaconda3\\envs\\fastai\\lib\\site-packages\\sklearn\\utils\\validation.py\u001b[0m in \u001b[0;36mcheck_array\u001b[1;34m(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)\u001b[0m\n\u001b[0;32m 525\u001b[0m \u001b[1;32mtry\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 526\u001b[0m \u001b[0mwarnings\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0msimplefilter\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;34m'error'\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mComplexWarning\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m--> 527\u001b[1;33m \u001b[0marray\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mnp\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0masarray\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0marray\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mdtype\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0mdtype\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0morder\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0morder\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 528\u001b[0m \u001b[1;32mexcept\u001b[0m \u001b[0mComplexWarning\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 529\u001b[0m raise ValueError(\"Complex data not supported\\n\"\n",
"\u001b[1;32m~\\Anaconda3\\envs\\fastai\\lib\\site-packages\\numpy\\core\\numeric.py\u001b[0m in \u001b[0;36masarray\u001b[1;34m(a, dtype, order)\u001b[0m\n\u001b[0;32m 536\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 537\u001b[0m \"\"\"\n\u001b[1;32m--> 538\u001b[1;33m \u001b[1;32mreturn\u001b[0m \u001b[0marray\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0ma\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mdtype\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mcopy\u001b[0m\u001b[1;33m=\u001b[0m\u001b[1;32mFalse\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0morder\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0morder\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 539\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 540\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n",
"\u001b[1;31mValueError\u001b[0m: could not convert string to float: 'Low'"
]
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "This dataset contains a mix of **continuous** and **categorical** variables.\n\nThe following method extracts particular date fields from a complete datetime for the purpose of constructing categoricals. You should always consider this feature extraction step when working with date-time. Without expanding your date-time into these additional fields, you can't capture any trend/cyclical behavior as a function of time at any of these granularities."
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "add_datepart(df_raw, 'saledate')\ndf_raw.saleYear.head()",
"execution_count": 85,
"outputs": [
{
"data": {
"text/plain": "0 2006\n1 2004\n2 2004\n3 2011\n4 2009\nName: saleYear, dtype: int64"
},
"execution_count": 85,
"metadata": {},
"output_type": "execute_result"
}
]
},
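{
"metadata": {},
"cell_type": "markdown",
"source": "For intuition, the kind of fields that `add_datepart` adds can be approximated with pandas' `.dt` accessor. This is only a sketch on a made-up three-row frame (`demo_dates` is a hypothetical name), not fastai's implementation, and it builds just a few of the columns."
},
{
"metadata": {},
"cell_type": "code",
"source": "import pandas as pd\n\n# Hypothetical sale dates, parsed into a datetime64 column\ndemo_dates = pd.DataFrame({'saledate': pd.to_datetime(['2006-11-16', '2004-03-26', '2011-05-19'])})\nparts = demo_dates['saledate'].dt  # datetime accessor\n\ndemo_dates['saleYear'] = parts.year\ndemo_dates['saleMonth'] = parts.month\ndemo_dates['saleDayofweek'] = parts.dayofweek  # Monday is 0\ndemo_dates['saleIs_month_end'] = parts.is_month_end\ndemo_dates",
"execution_count": null,
"outputs": []
},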
{
"metadata": {},
"cell_type": "markdown",
"source": "The categorical variables are currently stored as strings, which is inefficient, and doesn't provide the numeric coding required for a random forest. Therefore we call `train_cats` to convert strings to pandas categories."
},
{
"metadata": {
"code_folding": [],
"collapsed": true,
"trusted": true
},
"cell_type": "code",
"source": "cat_var=[]\nfor _ in df_raw.columns:\n if df_raw[_].dtype=='object':\n cat_var.append(_)",
"execution_count": 86,
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": "False\nFalse\nFalse\nFalse\nFalse\nFalse\nFalse\nFalse\nTrue\nTrue\nTrue\nTrue\nTrue\nTrue\nTrue\nTrue\nTrue\nTrue\nTrue\nTrue\nTrue\nTrue\nTrue\nTrue\nTrue\nTrue\nTrue\nTrue\nTrue\nTrue\nTrue\nTrue\nTrue\nTrue\nTrue\nTrue\nTrue\nTrue\nTrue\nTrue\nTrue\nTrue\nTrue\nTrue\nTrue\nTrue\nTrue\nTrue\nTrue\nTrue\nTrue\nTrue\nFalse\nFalse\nFalse\nFalse\nFalse\nFalse\nFalse\nFalse\nFalse\nFalse\nFalse\nFalse\nFalse\n"
}
]
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "train_cats(df_raw)",
"execution_count": 99,
"outputs": []
},
{
"metadata": {},
"cell_type": "markdown",
"source": "We can specify the order to use for categorical variables if we wish:"
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "df_raw.UsageBand.cat.categories",
"execution_count": 101,
"outputs": [
{
"data": {
"text/plain": "Index(['High', 'Low', 'Medium'], dtype='object')"
},
"execution_count": 101,
"metadata": {},
"output_type": "execute_result"
}
]
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "df_raw.UsageBand.cat.set_categories(['High', 'Medium', 'Low'], ordered=True, inplace=True)",
"execution_count": 102,
"outputs": []
},
{
"metadata": {},
"cell_type": "markdown",
"source": "Normally, pandas will continue displaying the text categories, while treating them as numerical data internally. Optionally, we can replace the text categories with numbers, which will make this variable non-categorical, like so:."
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "df_raw.UsageBand = df_raw.UsageBand.cat.codes",
"execution_count": 103,
"outputs": []
},
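{
"metadata": {},
"cell_type": "markdown",
"source": "The effect of ordering the categories and then taking their codes can be checked on a toy Series. The values below are made up, and the sketch uses the non-in-place form of `set_categories`, which returns a new Series."
},
{
"metadata": {},
"cell_type": "code",
"source": "import pandas as pd\n\n# Hypothetical usage bands, including one missing value\nusage_demo = pd.Series(['Low', 'High', None, 'Medium'], dtype='category')\nusage_demo = usage_demo.cat.set_categories(['High', 'Medium', 'Low'], ordered=True)\n\n# Each code is the position in the category list; missing values get -1\nusage_codes = usage_demo.cat.codes\nusage_codes.tolist()",
"execution_count": null,
"outputs": []
},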
{
"metadata": {},
"cell_type": "markdown",
"source": "We're still not quite done - for instance we have lots of missing values, which we can't pass directly to a random forest."
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "for _ in cat_var:\n df_raw[_] = df_raw[_].cat.codes",
"execution_count": 104,
"outputs": []
},
{
"metadata": {
"collapsed": true,
"trusted": true
},
"cell_type": "code",
"source": "display_all(df_raw.isnull().sum().sort_index()/len(df_raw))",
"execution_count": 105,
"outputs": [
{
"data": {
"text/plain": "Backhoe_Mounting 0.000000\nBlade_Extension 0.000000\nBlade_Type 0.000000\nBlade_Width 0.000000\nCoupler 0.000000\nCoupler_System 0.000000\nDifferential_Type 0.000000\nDrive_System 0.000000\nEnclosure 0.000000\nEnclosure_Type 0.000000\nEngine_Horsepower 0.000000\nForks 0.000000\nGrouser_Tracks 0.000000\nGrouser_Type 0.000000\nHydraulics 0.000000\nHydraulics_Flow 0.000000\nMachineHoursCurrentMeter 0.644089\nMachineID 0.000000\nModelID 0.000000\nPad_Type 0.000000\nPattern_Changer 0.000000\nProductGroup 0.000000\nProductGroupDesc 0.000000\nProductSize 0.000000\nPushblock 0.000000\nRide_Control 0.000000\nRipper 0.000000\nSalePrice 0.000000\nSalesID 0.000000\nScarifier 0.000000\nSteering_Controls 0.000000\nStick 0.000000\nStick_Length 0.000000\nThumb 0.000000\nTip_Control 0.000000\nTire_Size 0.000000\nTrack_Type 0.000000\nTransmission 0.000000\nTravel_Controls 0.000000\nTurbocharged 0.000000\nUndercarriage_Pad_Width 0.000000\nUsageBand 0.000000\nYearMade 0.000000\nauctioneerID 0.050199\ndatasource 0.000000\nfiBaseModel 0.000000\nfiModelDesc 0.000000\nfiModelDescriptor 0.000000\nfiModelSeries 0.000000\nfiProductClassDesc 0.000000\nfiSecondaryDesc 0.000000\nsaleDay 0.000000\nsaleDayofweek 0.000000\nsaleDayofyear 0.000000\nsaleElapsed 0.000000\nsaleIs_month_end 0.000000\nsaleIs_month_start 0.000000\nsaleIs_quarter_end 0.000000\nsaleIs_quarter_start 0.000000\nsaleIs_year_end 0.000000\nsaleIs_year_start 0.000000\nsaleMonth 0.000000\nsaleWeek 0.000000\nsaleYear 0.000000\nstate 0.000000\ndtype: float64"
},
"metadata": {},
"output_type": "display_data"
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "But let's save this file for now, since it's already in format can we be stored and accessed efficiently."
},
{
"metadata": {
"scrolled": true,
"trusted": true
},
"cell_type": "code",
"source": "os.makedirs('tmp', exist_ok=True)\ndf_raw.to_feather('tmp/bulldozers-raw')",
"execution_count": 107,
"outputs": []
},
{
"metadata": {
"heading_collapsed": true
},
"cell_type": "markdown",
"source": "### Pre-processing"
},
{
"metadata": {
"hidden": true
},
"cell_type": "markdown",
"source": "In the future we can simply read it from this fast format."
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2019-03-29T19:25:21.393793Z",
"start_time": "2019-03-29T19:25:21.389146Z"
},
"hidden": true,
"trusted": true
},
"cell_type": "code",
"source": "df_raw = pd.read_feather('tmp/bulldozers-raw')",
"execution_count": 108,
"outputs": []
},
{
"metadata": {
"hidden": true
},
"cell_type": "markdown",
"source": "We'll replace categories with their numeric codes, handle missing continuous values, and split the dependent variable into a separate variable."
},
{
"metadata": {
"hidden": true,
"trusted": true
},
"cell_type": "code",
"source": "df, y, nas = proc_df(df_raw, 'SalePrice')",
"execution_count": 109,
"outputs": []
},
{
"metadata": {
"hidden": true
},
"cell_type": "markdown",
"source": "We now have something we can pass to a random forest!"
},
{
"metadata": {
"hidden": true,
"trusted": true
},
"cell_type": "code",
"source": "m = RandomForestRegressor(n_jobs=-1)\nm.fit(df, y)\nm.score(df,y)",
"execution_count": 110,
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": "C:\\Users\\HP\\Anaconda3\\envs\\fastai\\lib\\site-packages\\sklearn\\ensemble\\forest.py:246: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.\n \"10 in version 0.20 to 100 in 0.22.\", FutureWarning)\n"
},
{
"data": {
"text/plain": "0.9830487929674454"
},
"execution_count": 110,
"metadata": {},
"output_type": "execute_result"
}
]
},
{
"metadata": {
"hidden": true
},
"cell_type": "markdown",
"source": "In statistics, the coefficient of determination, denoted R2 or r2 and pronounced \"R squared\", is the proportion of the variance in the dependent variable that is predictable from the independent variable(s). https://en.wikipedia.org/wiki/Coefficient_of_determination"
},
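{
"metadata": {
"hidden": true
},
"cell_type": "markdown",
"source": "As a sketch of what that definition computes, $R^2$ can be written as $1 - SS_{res}/SS_{tot}$ and evaluated by hand. The helper name `r_squared_demo` and the numbers are illustrative assumptions; for regressors, scikit-learn's `score` method reports this same quantity."
},
{
"metadata": {
"hidden": true
},
"cell_type": "code",
"source": "import numpy as np\n\ndef r_squared_demo(actual, predicted):\n    actual = np.asarray(actual, dtype=float)\n    predicted = np.asarray(predicted, dtype=float)\n    ss_res = ((actual - predicted) ** 2).sum()      # squared error of the model\n    ss_tot = ((actual - actual.mean()) ** 2).sum()  # squared error of always predicting the mean\n    return 1 - ss_res / ss_tot\n\ntargets = [1.0, 2.0, 3.0, 4.0]  # hypothetical values\n# Perfect predictions score 1; always predicting the mean scores 0\nr_squared_demo(targets, targets), r_squared_demo(targets, [2.5] * 4)",
"execution_count": null,
"outputs": []
},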
{
"metadata": {
"hidden": true
},
"cell_type": "markdown",
"source": "Wow, an r^2 of 0.98 - that's great, right? Well, perhaps not...\n\nPossibly **the most important idea** in machine learning is that of having separate training & validation data sets. As motivation, suppose you don't divide up your data, but instead use all of it. And suppose you have lots of parameters:\n\n<img src=\"images/overfitting2.png\" alt=\"\" style=\"width: 70%\"/>\n<center>\n[Underfitting and Overfitting](https://datascience.stackexchange.com/questions/361/when-is-a-model-underfitted)\n</center>\n\nThe error for the pictured data points is lowest for the model on the far right (the blue curve passes through the red points almost perfectly), yet it's not the best choice. Why is that? If you were to gather some new data points, they most likely would not be on that curve in the graph on the right, but would be closer to the curve in the middle graph.\n\nThis illustrates how using all our data can lead to **overfitting**. A validation set helps diagnose this problem."
},
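{
"metadata": {
"hidden": true
},
"cell_type": "markdown",
"source": "The same effect can be reproduced in a few lines. This is a minimal sketch on synthetic data (a noisy sine curve, not the bulldozers dataset); the polynomial degrees, the alternating split and all `*_demo` names are arbitrary choices for illustration. A high-degree polynomial will typically show a lower training error but a higher validation error than a moderate one."
},
{
"metadata": {
"hidden": true,
"trusted": true
},
"cell_type": "code",
"source": "import numpy as np\n\nrng = np.random.RandomState(0)\nxs_demo = np.sort(rng.uniform(0, 1, 30))\nys_demo = np.sin(2 * np.pi * xs_demo) + rng.normal(0, 0.2, 30)  # noisy sine curve\n\n# Alternate points between a training half and a validation half\nx_fit, y_fit = xs_demo[::2], ys_demo[::2]\nx_hold, y_hold = xs_demo[1::2], ys_demo[1::2]\n\ndef rmse_demo(a, b): return np.sqrt(((a - b) ** 2).mean())\n\nfor degree in (1, 3, 9):\n    coefs = np.polyfit(x_fit, y_fit, degree)  # least-squares polynomial fit\n    # Prints: degree, training RMSE, validation RMSE\n    print(degree, rmse_demo(np.polyval(coefs, x_fit), y_fit), rmse_demo(np.polyval(coefs, x_hold), y_hold))",
"execution_count": null,
"outputs": []
},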
{
"metadata": {
"hidden": true,
"trusted": true
},
"cell_type": "code",
"source": "def split_vals(a,n): return a[:n].copy(), a[n:].copy()\n\nn_valid = 12000 # same as Kaggle's test set size\nn_trn = len(df)-n_valid\nraw_train, raw_valid = split_vals(df_raw, n_trn)\nX_train, X_valid = split_vals(df, n_trn)\ny_train, y_valid = split_vals(y, n_trn)\n\nX_train.shape, y_train.shape, X_valid.shape",
"execution_count": 111,
"outputs": [
{
"data": {
"text/plain": "((389125, 66), (389125,), (12000, 66))"
},
"execution_count": 111,
"metadata": {},
"output_type": "execute_result"
}
]
},
{
"metadata": {
"heading_collapsed": true
},
"cell_type": "markdown",
"source": "# Random Forests"
},
{
"metadata": {
"heading_collapsed": true,
"hidden": true
},
"cell_type": "markdown",
"source": "## Base model"
},
{
"metadata": {
"hidden": true
},
"cell_type": "markdown",
"source": "Let's try our model again, this time with separate training and validation sets."
},
{
"metadata": {
"hidden": true,
"trusted": true
},
"cell_type": "code",
"source": "def rmse(x,y): return math.sqrt(((x-y)**2).mean())\n\ndef print_score(m):\n res = [rmse(m.predict(X_train), y_train), rmse(m.predict(X_valid), y_valid),\n m.score(X_train, y_train), m.score(X_valid, y_valid)]\n if hasattr(m, 'oob_score_'): res.append(m.oob_score_)\n print(res)",
"execution_count": 112,
"outputs": []
},
{
"metadata": {
"hidden": true,
"trusted": true
},
"cell_type": "code",
"source": "m = RandomForestRegressor(n_jobs=-1)\n%time m.fit(X_train, y_train)\nprint_score(m)",
"execution_count": 113,
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": "C:\\Users\\HP\\Anaconda3\\envs\\fastai\\lib\\site-packages\\sklearn\\ensemble\\forest.py:246: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.\n \"10 in version 0.20 to 100 in 0.22.\", FutureWarning)\n"
},
{
"name": "stdout",
"output_type": "stream",
"text": "Wall time: 36.2 s\n[0.09035034440431733, 0.24998246754439177, 0.9829393941017184, 0.8883992597730627]\n"
}
]
},
{
"metadata": {
"hidden": true
},
"cell_type": "markdown",
"source": "An r^2 in the high-80's isn't bad at all (and the RMSLE puts us around rank 100 of 470 on the Kaggle leaderboard), but we can see from the validation set score that we're over-fitting badly. To understand this issue, let's simplify things down to a single small tree."
},
{
"metadata": {
"heading_collapsed": true,
"hidden": true
},
"cell_type": "markdown",
"source": "## Speeding things up"
},
{
"metadata": {
"hidden": true,
"trusted": true
},
"cell_type": "code",
"source": "df_trn, y_trn, nas = proc_df(df_raw, 'SalePrice', subset=30000, na_dict=nas)\nX_train, _ = split_vals(df_trn, 20000)\ny_train, _ = split_vals(y_trn, 20000)",
"execution_count": 114,
"outputs": []
},
{
"metadata": {
"hidden": true,
"trusted": true
},
"cell_type": "code",
"source": "m = RandomForestRegressor(n_jobs=-1)\n%time m.fit(X_train, y_train)\nprint_score(m)",
"execution_count": 115,
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": "C:\\Users\\HP\\Anaconda3\\envs\\fastai\\lib\\site-packages\\sklearn\\ensemble\\forest.py:246: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.\n \"10 in version 0.20 to 100 in 0.22.\", FutureWarning)\n"
},
{
"name": "stdout",
"output_type": "stream",
"text": "Wall time: 1.72 s\n[0.11113934060836367, 0.38117014777495856, 0.9725644877454035, 0.7405308464713501]\n"
}
]
},
{
"metadata": {
"heading_collapsed": true,
"hidden": true
},
"cell_type": "markdown",
"source": "## Single tree"
},
{
"metadata": {
"hidden": true,
"trusted": true
},
"cell_type": "code",
"source": "m = RandomForestRegressor(n_estimators=1, max_depth=3, bootstrap=False, n_jobs=-1)\nm.fit(X_train, y_train)\nprint_score(m)",
"execution_count": 116,
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": "[0.5200246121818117, 0.5804134570275868, 0.39934480244110204, 0.3983790928144195]\n"
}
]
},
{
"metadata": {
"hidden": true,
"scrolled": true,
"trusted": true
},
"cell_type": "code",
"source": "draw_tree(m.estimators_[0], df_trn, precision=3)",
"execution_count": 119,
"outputs": [
{
"data": {
"image/svg+xml": "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"no\"?>\r\n<!DOCTYPE svg PUBLIC \"-//W3C//DTD SVG 1.1//EN\"\r\n \"http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd\">\r\n<!-- Generated by graphviz version 2.38.0 (20140413.2041)\r\n -->\r\n<!-- Title: Tree Pages: 1 -->\r\n<svg width=\"720pt\" height=\"434pt\"\r\n viewBox=\"0.00 0.00 720.00 434.49\" xmlns=\"http://www.w3.org/2000/svg\" xmlns:xlink=\"http://www.w3.org/1999/xlink\">\r\n<g id=\"graph0\" class=\"graph\" transform=\"scale(0.778659 0.778659) rotate(0) translate(4 554)\">\r\n<title>Tree</title>\r\n<polygon fill=\"white\" stroke=\"none\" points=\"-4,4 -4,-554 920.667,-554 920.667,4 -4,4\"/>\r\n<!-- 0 -->\r\n<g id=\"node1\" class=\"node\"><title>0</title>\r\n<polygon fill=\"#e58139\" fill-opacity=\"0.729412\" stroke=\"black\" points=\"177.167,-308.5 30.1667,-308.5 30.1667,-240.5 177.167,-240.5 177.167,-308.5\"/>\r\n<text text-anchor=\"start\" x=\"38.1667\" y=\"-293.3\" font-family=\"Times New Roman,serif\" font-size=\"14.00\">Coupler_System ≤ &#45;0.5</text>\r\n<text text-anchor=\"start\" x=\"72.1667\" y=\"-278.3\" font-family=\"Times New Roman,serif\" font-size=\"14.00\">mse = 0.45</text>\r\n<text text-anchor=\"start\" x=\"56.1667\" y=\"-263.3\" font-family=\"Times New Roman,serif\" font-size=\"14.00\">samples = 20000</text>\r\n<text text-anchor=\"start\" x=\"62.6667\" y=\"-248.3\" font-family=\"Times New Roman,serif\" font-size=\"14.00\">value = 10.112</text>\r\n</g>\r\n<!-- 1 -->\r\n<g id=\"node2\" class=\"node\"><title>1</title>\r\n<polygon fill=\"#e58139\" fill-opacity=\"0.784314\" stroke=\"black\" points=\"409.167,-361.5 282.167,-361.5 282.167,-293.5 409.167,-293.5 409.167,-361.5\"/>\r\n<text text-anchor=\"start\" x=\"290.167\" y=\"-346.3\" font-family=\"Times New Roman,serif\" font-size=\"14.00\">YearMade ≤ 1987.5</text>\r\n<text text-anchor=\"start\" x=\"310.667\" y=\"-331.3\" font-family=\"Times New Roman,serif\" font-size=\"14.00\">mse = 0.395</text>\r\n<text 
text-anchor=\"start\" x=\"298.167\" y=\"-316.3\" font-family=\"Times New Roman,serif\" font-size=\"14.00\">samples = 18334</text>\r\n<text text-anchor=\"start\" x=\"304.667\" y=\"-301.3\" font-family=\"Times New Roman,serif\" font-size=\"14.00\">value = 10.196</text>\r\n</g>\r\n<!-- 0&#45;&gt;1 -->\r\n<g id=\"edge1\" class=\"edge\"><title>0&#45;&gt;1</title>\r\n<path fill=\"none\" stroke=\"black\" d=\"M177.326,-290.546C207.323,-297.17 242.019,-304.832 272.015,-311.456\"/>\r\n<polygon fill=\"black\" stroke=\"black\" points=\"271.474,-314.921 281.994,-313.66 272.984,-308.086 271.474,-314.921\"/>\r\n<text text-anchor=\"middle\" x=\"260.92\" y=\"-323.41\" font-family=\"Times New Roman,serif\" font-size=\"14.00\">True</text>\r\n</g>\r\n<!-- 8 -->\r\n<g id=\"node9\" class=\"node\"><title>8</title>\r\n<polygon fill=\"#e58139\" fill-opacity=\"0.129412\" stroke=\"black\" points=\"409.167,-255.5 282.167,-255.5 282.167,-187.5 409.167,-187.5 409.167,-255.5\"/>\r\n<text text-anchor=\"start\" x=\"290.167\" y=\"-240.3\" font-family=\"Times New Roman,serif\" font-size=\"14.00\">YearMade ≤ 1998.5</text>\r\n<text text-anchor=\"start\" x=\"310.667\" y=\"-225.3\" font-family=\"Times New Roman,serif\" font-size=\"14.00\">mse = 0.116</text>\r\n<text text-anchor=\"start\" x=\"301.667\" y=\"-210.3\" font-family=\"Times New Roman,serif\" font-size=\"14.00\">samples = 1666</text>\r\n<text text-anchor=\"start\" x=\"307.667\" y=\"-195.3\" font-family=\"Times New Roman,serif\" font-size=\"14.00\">value = 9.185</text>\r\n</g>\r\n<!-- 0&#45;&gt;8 -->\r\n<g id=\"edge8\" class=\"edge\"><title>0&#45;&gt;8</title>\r\n<path fill=\"none\" stroke=\"black\" d=\"M177.326,-258.454C207.323,-251.83 242.019,-244.168 272.015,-237.544\"/>\r\n<polygon fill=\"black\" stroke=\"black\" points=\"272.984,-240.914 281.994,-235.34 271.474,-234.079 272.984,-240.914\"/>\r\n<text text-anchor=\"middle\" x=\"260.92\" y=\"-218.19\" font-family=\"Times New Roman,serif\" font-size=\"14.00\">False</text>\r\n</g>\r\n<!-- 2 
-->\r\n<g id=\"node3\" class=\"node\"><title>2</title>\r\n<polygon fill=\"#e58139\" fill-opacity=\"0.600000\" stroke=\"black\" points=\"662.167,-486.5 543.167,-486.5 543.167,-418.5 662.167,-418.5 662.167,-486.5\"/>\r\n<text text-anchor=\"start\" x=\"551.167\" y=\"-471.3\" font-family=\"Times New Roman,serif\" font-size=\"14.00\">ModelID ≤ 4510.5</text>\r\n<text text-anchor=\"start\" x=\"567.667\" y=\"-456.3\" font-family=\"Times New Roman,serif\" font-size=\"14.00\">mse = 0.342</text>\r\n<text text-anchor=\"start\" x=\"558.667\" y=\"-441.3\" font-family=\"Times New Roman,serif\" font-size=\"14.00\">samples = 6955</text>\r\n<text text-anchor=\"start\" x=\"564.667\" y=\"-426.3\" font-family=\"Times New Roman,serif\" font-size=\"14.00\">value = 9.914</text>\r\n</g>\r\n<!-- 1&#45;&gt;2 -->\r\n<g id=\"edge2\" class=\"edge\"><title>1&#45;&gt;2</title>\r\n<path fill=\"none\" stroke=\"black\" d=\"M409.373,-358.239C447.19,-376.776 495.364,-400.391 534.033,-419.346\"/>\r\n<polygon fill=\"black\" stroke=\"black\" points=\"532.62,-422.552 543.14,-423.811 535.701,-416.266 532.62,-422.552\"/>\r\n</g>\r\n<!-- 5 -->\r\n<g id=\"node6\" class=\"node\"><title>5</title>\r\n<polygon fill=\"#e58139\" fill-opacity=\"0.894118\" stroke=\"black\" points=\"683.167,-361.5 522.167,-361.5 522.167,-293.5 683.167,-293.5 683.167,-361.5\"/>\r\n<text text-anchor=\"start\" x=\"530.167\" y=\"-346.3\" font-family=\"Times New Roman,serif\" font-size=\"14.00\">fiProductClassDesc ≤ 6.5</text>\r\n<text text-anchor=\"start\" x=\"571.167\" y=\"-331.3\" font-family=\"Times New Roman,serif\" font-size=\"14.00\">mse = 0.35</text>\r\n<text text-anchor=\"start\" x=\"555.167\" y=\"-316.3\" font-family=\"Times New Roman,serif\" font-size=\"14.00\">samples = 11379</text>\r\n<text text-anchor=\"start\" x=\"561.667\" y=\"-301.3\" font-family=\"Times New Roman,serif\" font-size=\"14.00\">value = 10.369</text>\r\n</g>\r\n<!-- 1&#45;&gt;5 -->\r\n<g id=\"edge5\" class=\"edge\"><title>1&#45;&gt;5</title>\r\n<path 
fill=\"none\" stroke=\"black\" d=\"M409.373,-327.5C440.111,-327.5 477.691,-327.5 511.449,-327.5\"/>\r\n<polygon fill=\"black\" stroke=\"black\" points=\"511.791,-331 521.791,-327.5 511.791,-324 511.791,-331\"/>\r\n</g>\r\n<!-- 3 -->\r\n<g id=\"node4\" class=\"node\"><title>3</title>\r\n<polygon fill=\"#e58139\" fill-opacity=\"0.749020\" stroke=\"black\" points=\"895.667,-550 791.667,-550 791.667,-497 895.667,-497 895.667,-550\"/>\r\n<text text-anchor=\"start\" x=\"812.167\" y=\"-534.8\" font-family=\"Times New Roman,serif\" font-size=\"14.00\">mse = 0.37</text>\r\n<text text-anchor=\"start\" x=\"799.667\" y=\"-519.8\" font-family=\"Times New Roman,serif\" font-size=\"14.00\">samples = 3376</text>\r\n<text text-anchor=\"start\" x=\"802.667\" y=\"-504.8\" font-family=\"Times New Roman,serif\" font-size=\"14.00\">value = 10.142</text>\r\n</g>\r\n<!-- 2&#45;&gt;3 -->\r\n<g id=\"edge3\" class=\"edge\"><title>2&#45;&gt;3</title>\r\n<path fill=\"none\" stroke=\"black\" d=\"M662.439,-469.959C698.571,-480.693 744.791,-494.424 781.379,-505.293\"/>\r\n<polygon fill=\"black\" stroke=\"black\" points=\"780.843,-508.785 791.426,-508.278 782.837,-502.075 780.843,-508.785\"/>\r\n</g>\r\n<!-- 4 -->\r\n<g id=\"node5\" class=\"node\"><title>4</title>\r\n<polygon fill=\"#e58139\" fill-opacity=\"0.462745\" stroke=\"black\" points=\"895.667,-479 791.667,-479 791.667,-426 895.667,-426 895.667,-479\"/>\r\n<text text-anchor=\"start\" x=\"808.667\" y=\"-463.8\" font-family=\"Times New Roman,serif\" font-size=\"14.00\">mse = 0.221</text>\r\n<text text-anchor=\"start\" x=\"799.667\" y=\"-448.8\" font-family=\"Times New Roman,serif\" font-size=\"14.00\">samples = 3579</text>\r\n<text text-anchor=\"start\" x=\"805.667\" y=\"-433.8\" font-family=\"Times New Roman,serif\" font-size=\"14.00\">value = 9.699</text>\r\n</g>\r\n<!-- 2&#45;&gt;4 -->\r\n<g id=\"edge4\" class=\"edge\"><title>2&#45;&gt;4</title>\r\n<path fill=\"none\" stroke=\"black\" d=\"M662.439,-452.5C698.571,-452.5 744.791,-452.5 
781.379,-452.5\"/>\r\n<polygon fill=\"black\" stroke=\"black\" points=\"781.426,-456 791.426,-452.5 781.426,-449 781.426,-456\"/>\r\n</g>\r\n<!-- 6 -->\r\n<g id=\"node7\" class=\"node\"><title>6</title>\r\n<polygon fill=\"#e58139\" fill-opacity=\"0.639216\" stroke=\"black\" points=\"895.667,-408 791.667,-408 791.667,-355 895.667,-355 895.667,-408\"/>\r\n<text text-anchor=\"start\" x=\"808.667\" y=\"-392.8\" font-family=\"Times New Roman,serif\" font-size=\"14.00\">mse = 0.115</text>\r\n<text text-anchor=\"start\" x=\"799.667\" y=\"-377.8\" font-family=\"Times New Roman,serif\" font-size=\"14.00\">samples = 3370</text>\r\n<text text-anchor=\"start\" x=\"805.667\" y=\"-362.8\" font-family=\"Times New Roman,serif\" font-size=\"14.00\">value = 9.976</text>\r\n</g>\r\n<!-- 5&#45;&gt;6 -->\r\n<g id=\"edge6\" class=\"edge\"><title>5&#45;&gt;6</title>\r\n<path fill=\"none\" stroke=\"black\" d=\"M683.25,-345.481C715.336,-352.731 751.76,-360.96 781.684,-367.721\"/>\r\n<polygon fill=\"black\" stroke=\"black\" points=\"781.041,-371.164 791.567,-369.954 782.584,-364.337 781.041,-371.164\"/>\r\n</g>\r\n<!-- 7 -->\r\n<g id=\"node8\" class=\"node\"><title>7</title>\r\n<polygon fill=\"#e58139\" stroke=\"black\" points=\"895.667,-337 791.667,-337 791.667,-284 895.667,-284 895.667,-337\"/>\r\n<text text-anchor=\"start\" x=\"808.667\" y=\"-321.8\" font-family=\"Times New Roman,serif\" font-size=\"14.00\">mse = 0.356</text>\r\n<text text-anchor=\"start\" x=\"799.667\" y=\"-306.8\" font-family=\"Times New Roman,serif\" font-size=\"14.00\">samples = 8009</text>\r\n<text text-anchor=\"start\" x=\"802.667\" y=\"-291.8\" font-family=\"Times New Roman,serif\" font-size=\"14.00\">value = 10.534</text>\r\n</g>\r\n<!-- 5&#45;&gt;7 -->\r\n<g id=\"edge7\" class=\"edge\"><title>5&#45;&gt;7</title>\r\n<path fill=\"none\" stroke=\"black\" d=\"M683.25,-321.839C715.197,-319.567 751.443,-316.989 781.293,-314.865\"/>\r\n<polygon fill=\"black\" stroke=\"black\" points=\"781.841,-318.335 791.567,-314.135 
781.344,-311.353 781.841,-318.335\"/>\r\n</g>\r\n<!-- 9 -->\r\n<g id=\"node10\" class=\"node\"><title>9</title>\r\n<polygon fill=\"#e58139\" fill-opacity=\"0.023529\" stroke=\"black\" points=\"678.667,-255.5 526.667,-255.5 526.667,-187.5 678.667,-187.5 678.667,-255.5\"/>\r\n<text text-anchor=\"start\" x=\"534.667\" y=\"-240.3\" font-family=\"Times New Roman,serif\" font-size=\"14.00\">fiSecondaryDesc ≤ 24.0</text>\r\n<text text-anchor=\"start\" x=\"567.667\" y=\"-225.3\" font-family=\"Times New Roman,serif\" font-size=\"14.00\">mse = 0.088</text>\r\n<text text-anchor=\"start\" x=\"561.667\" y=\"-210.3\" font-family=\"Times New Roman,serif\" font-size=\"14.00\">samples = 891</text>\r\n<text text-anchor=\"start\" x=\"564.667\" y=\"-195.3\" font-family=\"Times New Roman,serif\" font-size=\"14.00\">value = 9.021</text>\r\n</g>\r\n<!-- 8&#45;&gt;9 -->\r\n<g id=\"edge9\" class=\"edge\"><title>8&#45;&gt;9</title>\r\n<path fill=\"none\" stroke=\"black\" d=\"M409.373,-221.5C441.586,-221.5 481.313,-221.5 516.28,-221.5\"/>\r\n<polygon fill=\"black\" stroke=\"black\" points=\"516.484,-225 526.484,-221.5 516.484,-218 516.484,-225\"/>\r\n</g>\r\n<!-- 12 -->\r\n<g id=\"node13\" class=\"node\"><title>12</title>\r\n<polygon fill=\"#e58139\" fill-opacity=\"0.250980\" stroke=\"black\" points=\"686.167,-131.5 519.167,-131.5 519.167,-63.5 686.167,-63.5 686.167,-131.5\"/>\r\n<text text-anchor=\"start\" x=\"527.167\" y=\"-116.3\" font-family=\"Times New Roman,serif\" font-size=\"14.00\">fiProductClassDesc ≤ 39.5</text>\r\n<text text-anchor=\"start\" x=\"567.667\" y=\"-101.3\" font-family=\"Times New Roman,serif\" font-size=\"14.00\">mse = 0.081</text>\r\n<text text-anchor=\"start\" x=\"561.667\" y=\"-86.3\" font-family=\"Times New Roman,serif\" font-size=\"14.00\">samples = 775</text>\r\n<text text-anchor=\"start\" x=\"564.667\" y=\"-71.3\" font-family=\"Times New Roman,serif\" font-size=\"14.00\">value = 9.374</text>\r\n</g>\r\n<!-- 8&#45;&gt;12 -->\r\n<g id=\"edge12\" 
class=\"edge\"><title>8&#45;&gt;12</title>\r\n<path fill=\"none\" stroke=\"black\" d=\"M409.373,-191.007C443.496,-174.414 486.051,-153.721 522.447,-136.022\"/>\r\n<polygon fill=\"black\" stroke=\"black\" points=\"524.097,-139.112 531.559,-131.591 521.035,-132.817 524.097,-139.112\"/>\r\n</g>\r\n<!-- 10 -->\r\n<g id=\"node11\" class=\"node\"><title>10</title>\r\n<polygon fill=\"none\" stroke=\"black\" points=\"892.667,-266 794.667,-266 794.667,-213 892.667,-213 892.667,-266\"/>\r\n<text text-anchor=\"start\" x=\"812.167\" y=\"-250.8\" font-family=\"Times New Roman,serif\" font-size=\"14.00\">mse = 0.08</text>\r\n<text text-anchor=\"start\" x=\"802.667\" y=\"-235.8\" font-family=\"Times New Roman,serif\" font-size=\"14.00\">samples = 741</text>\r\n<text text-anchor=\"start\" x=\"805.667\" y=\"-220.8\" font-family=\"Times New Roman,serif\" font-size=\"14.00\">value = 8.983</text>\r\n</g>\r\n<!-- 9&#45;&gt;10 -->\r\n<g id=\"edge10\" class=\"edge\"><title>9&#45;&gt;10</title>\r\n<path fill=\"none\" stroke=\"black\" d=\"M678.96,-227.171C713.036,-229.737 752.695,-232.724 784.444,-235.115\"/>\r\n<polygon fill=\"black\" stroke=\"black\" points=\"784.222,-238.608 794.457,-235.869 784.748,-231.628 784.222,-238.608\"/>\r\n</g>\r\n<!-- 11 -->\r\n<g id=\"node12\" class=\"node\"><title>11</title>\r\n<polygon fill=\"#e58139\" fill-opacity=\"0.145098\" stroke=\"black\" points=\"892.667,-195 794.667,-195 794.667,-142 892.667,-142 892.667,-195\"/>\r\n<text text-anchor=\"start\" x=\"808.667\" y=\"-179.8\" font-family=\"Times New Roman,serif\" font-size=\"14.00\">mse = 0.086</text>\r\n<text text-anchor=\"start\" x=\"802.667\" y=\"-164.8\" font-family=\"Times New Roman,serif\" font-size=\"14.00\">samples = 150</text>\r\n<text text-anchor=\"start\" x=\"805.667\" y=\"-149.8\" font-family=\"Times New Roman,serif\" font-size=\"14.00\">value = 9.207</text>\r\n</g>\r\n<!-- 9&#45;&gt;11 -->\r\n<g id=\"edge11\" class=\"edge\"><title>9&#45;&gt;11</title>\r\n<path fill=\"none\" stroke=\"black\" 
d=\"M678.96,-204.803C713.036,-197.247 752.695,-188.452 784.444,-181.411\"/>\r\n<polygon fill=\"black\" stroke=\"black\" points=\"785.451,-184.773 794.457,-179.191 783.936,-177.939 785.451,-184.773\"/>\r\n</g>\r\n<!-- 13 -->\r\n<g id=\"node14\" class=\"node\"><title>13</title>\r\n<polygon fill=\"#e58139\" fill-opacity=\"0.203922\" stroke=\"black\" points=\"892.667,-124 794.667,-124 794.667,-71 892.667,-71 892.667,-124\"/>\r\n<text text-anchor=\"start\" x=\"812.167\" y=\"-108.8\" font-family=\"Times New Roman,serif\" font-size=\"14.00\">mse = 0.07</text>\r\n<text text-anchor=\"start\" x=\"802.667\" y=\"-93.8\" font-family=\"Times New Roman,serif\" font-size=\"14.00\">samples = 455</text>\r\n<text text-anchor=\"start\" x=\"805.667\" y=\"-78.8\" font-family=\"Times New Roman,serif\" font-size=\"14.00\">value = 9.298</text>\r\n</g>\r\n<!-- 12&#45;&gt;13 -->\r\n<g id=\"edge13\" class=\"edge\"><title>12&#45;&gt;13</title>\r\n<path fill=\"none\" stroke=\"black\" d=\"M686.252,-97.5C718.598,-97.5 754.946,-97.5 784.456,-97.5\"/>\r\n<polygon fill=\"black\" stroke=\"black\" points=\"784.591,-101 794.591,-97.5 784.591,-94.0001 784.591,-101\"/>\r\n</g>\r\n<!-- 14 -->\r\n<g id=\"node15\" class=\"node\"><title>14</title>\r\n<polygon fill=\"#e58139\" fill-opacity=\"0.321569\" stroke=\"black\" points=\"892.667,-53 794.667,-53 794.667,-0 892.667,-0 892.667,-53\"/>\r\n<text text-anchor=\"start\" x=\"808.667\" y=\"-37.8\" font-family=\"Times New Roman,serif\" font-size=\"14.00\">mse = 0.075</text>\r\n<text text-anchor=\"start\" x=\"802.667\" y=\"-22.8\" font-family=\"Times New Roman,serif\" font-size=\"14.00\">samples = 320</text>\r\n<text text-anchor=\"start\" x=\"805.667\" y=\"-7.8\" font-family=\"Times New Roman,serif\" font-size=\"14.00\">value = 9.482</text>\r\n</g>\r\n<!-- 12&#45;&gt;14 -->\r\n<g id=\"edge14\" class=\"edge\"><title>12&#45;&gt;14</title>\r\n<path fill=\"none\" stroke=\"black\" d=\"M686.252,-72.9664C718.739,-63.3153 755.263,-52.4651 784.842,-43.678\"/>\r\n<polygon 
fill=\"black\" stroke=\"black\" points=\"786.002,-46.9847 794.591,-40.7818 784.009,-40.2745 786.002,-46.9847\"/>\r\n</g>\r\n</g>\r\n</svg>\r\n",
"text/plain": "<graphviz.files.Source at 0x256b03370b8>"
},
"metadata": {},
"output_type": "display_data"
}
]
},
{
"metadata": {
"hidden": true
},
"cell_type": "markdown",
"source": "Let's see what happens if we create a bigger tree."
},
{
"metadata": {
"hidden": true,
"trusted": true
},
"cell_type": "code",
"source": "m = RandomForestRegressor(n_estimators=1, bootstrap=False, n_jobs=-1)\nm.fit(X_train, y_train)\nprint_score(m)",
"execution_count": 120,
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": "[6.153480596427405e-17, 0.5131504668553788, 1.0, 0.5297406592420147]\n"
}
]
},
{
"metadata": {
"hidden": true
},
"cell_type": "markdown",
"source": "The training set result looks great! But the validation set is worse than our original model. This is why we need to use *bagging* of multiple trees to get more generalizable results."
},
{
"metadata": {
"heading_collapsed": true,
"hidden": true
},
"cell_type": "markdown",
"source": "## Bagging"
},
{
"metadata": {
"heading_collapsed": true,
"hidden": true
},
"cell_type": "markdown",
"source": "### Intro to bagging"
},
{
"metadata": {
"hidden": true
},
"cell_type": "markdown",
"source": "To learn about bagging in random forests, let's start with our basic model again."
},
{
"metadata": {
"hidden": true,
"trusted": true
},
"cell_type": "code",
"source": "m = RandomForestRegressor(n_jobs=-1)\nm.fit(X_train, y_train)\nprint_score(m)",
"execution_count": 121,
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": "C:\\Users\\HP\\Anaconda3\\envs\\fastai\\lib\\site-packages\\sklearn\\ensemble\\forest.py:246: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.\n \"10 in version 0.20 to 100 in 0.22.\", FutureWarning)\n"
},
{
"name": "stdout",
"output_type": "stream",
"text": "[0.11084705899567794, 0.37376174155572084, 0.9727086014572663, 0.750518892442501]\n"
}
]
},
{
"metadata": {
"hidden": true
},
"cell_type": "markdown",
"source": "We'll grab the predictions for each individual tree, and look at one example."
},
{
"metadata": {
"hidden": true,
"trusted": true
},
"cell_type": "code",
"source": "preds = np.stack([t.predict(X_valid) for t in m.estimators_])\npreds[:,0], np.mean(preds[:,0]), y_valid[0]",
"execution_count": 122,
"outputs": [
{
"data": {
"text/plain": "(array([9.92818, 9.85219, 9.85219, 9.76996, 9.10498, 9.30565, 9.30565, 9.87817, 9.21034, 9.30565]),\n 9.551296646952316,\n 9.104979856318357)"
},
"execution_count": 122,
"metadata": {},
"output_type": "execute_result"
}
]
},
{
"metadata": {
"hidden": true,
"trusted": true
},
"cell_type": "code",
"source": "preds.shape",
"execution_count": 123,
"outputs": [
{
"data": {
"text/plain": "(10, 12000)"
},
"execution_count": 123,
"metadata": {},
"output_type": "execute_result"
}
]
},
{
"metadata": {
"hidden": true,
"trusted": true
},
"cell_type": "code",
"source": "plt.plot([metrics.r2_score(y_valid, np.mean(preds[:i+1], axis=0)) for i in range(10)]);",
"execution_count": 124,
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": "<Figure size 432x288 with 1 Axes>"
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
]
},
{
"metadata": {
"hidden": true
},
"cell_type": "markdown",
"source": "The shape of this curve suggests that adding more trees isn't going to help us much. Let's check. (Compare this to our original model on a sample)"
},
{
"metadata": {
"hidden": true,
"trusted": true
},
"cell_type": "code",
"source": "m = RandomForestRegressor(n_estimators=20, n_jobs=-1)\nm.fit(X_train, y_train)\nprint_score(m)",
"execution_count": 125,
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": "[0.10172761854001205, 0.36554929675468534, 0.9770144342682152, 0.7613618467588243]\n"
}
]
},
{
"metadata": {
"hidden": true,
"trusted": true
},
"cell_type": "code",
"source": "m = RandomForestRegressor(n_estimators=40, n_jobs=-1)\nm.fit(X_train, y_train)\nprint_score(m)",
"execution_count": 126,
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": "[0.09543202722119888, 0.3520655135469086, 0.9797714040049498, 0.7786421408140615]\n"
}
]
},
{
"metadata": {
"hidden": true,
"trusted": true
},
"cell_type": "code",
"source": "m = RandomForestRegressor(n_estimators=80, n_jobs=-1)\nm.fit(X_train, y_train)\nprint_score(m)",
"execution_count": 127,
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": "[0.09255757883811551, 0.35223397510530247, 0.9809716376423486, 0.7784302529119919]\n"
}
]
},
{
"metadata": {
"heading_collapsed": true,
"hidden": true
},
"cell_type": "markdown",
"source": "### Out-of-bag (OOB) score"
},
{
"metadata": {
"hidden": true
},
"cell_type": "markdown",
"source": "Is our validation set worse than our training set because we're over-fitting, or because the validation set is for a different time period, or a bit of both? With the existing information we've shown, we can't tell. However, random forests have a very clever trick called *out-of-bag (OOB) error* which can handle this (and more!)\n\nThe idea is to calculate error on the training set, but only include the trees in the calculation of a row's error where that row was **not** included in training that tree. This allows us to see whether the model is over-fitting, without needing a separate validation set.\n\nThis also has the benefit of allowing us to see whether our model generalizes, even if we only have a small amount of data so want to avoid separating some out to create a validation set.\n\nThis is as simple as adding one more parameter to our model constructor. We print the OOB error last in our `print_score` function below."
},
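{
"metadata": {
"hidden": true
},
"cell_type": "markdown",
"source": "To make the mechanics concrete, here is a minimal hand-rolled sketch of the OOB idea on synthetic data. It is not how scikit-learn computes `oob_score_` internally; the dataset from `make_regression`, the tree count and all `*_demo` names are assumptions for illustration. Each tree is fit on a bootstrap sample, and each row is scored using only the trees that never saw it."
},
{
"metadata": {
"hidden": true,
"trusted": true
},
"cell_type": "code",
"source": "import numpy as np\nfrom sklearn.datasets import make_regression\nfrom sklearn.metrics import r2_score\nfrom sklearn.tree import DecisionTreeRegressor\n\nX_demo, y_demo = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=0)\nn_rows, n_trees = len(X_demo), 40\nrng = np.random.RandomState(0)\n\npred_sum = np.zeros(n_rows)  # running sum of OOB predictions per row\npred_cnt = np.zeros(n_rows)  # number of trees for which each row was out-of-bag\n\nfor _ in range(n_trees):\n    idxs = rng.randint(0, n_rows, n_rows)  # bootstrap sample (drawn with replacement)\n    oob = np.ones(n_rows, dtype=bool)\n    oob[idxs] = False                      # rows this tree never saw\n    tree_demo = DecisionTreeRegressor().fit(X_demo[idxs], y_demo[idxs])\n    pred_sum[oob] += tree_demo.predict(X_demo[oob])\n    pred_cnt[oob] += 1\n\nseen = pred_cnt > 0                        # rows that were out-of-bag at least once\nr2_score(y_demo[seen], pred_sum[seen] / pred_cnt[seen])",
"execution_count": null,
"outputs": []
},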
{
"metadata": {
"hidden": true,
"trusted": true
},
"cell_type": "code",
"source": "m = RandomForestRegressor(n_estimators=40, n_jobs=-1, oob_score=True)\nm.fit(X_train, y_train)\nprint_score(m)",
"execution_count": 128,
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": "[0.09610888340867661, 0.3576626778889991, 0.9794834418645035, 0.7715478646743503, 0.8551629370189948]\n"
}
]
},
{
"metadata": {
"hidden": true
},
"cell_type": "markdown",
"source": "This shows that our validation set time difference is making an impact, as is model over-fitting."
},
{
"metadata": {
"hidden": true
},
"cell_type": "markdown",
"source": "## Reducing over-fitting"
},
{
"metadata": {
"heading_collapsed": true,
"hidden": true
},
"cell_type": "markdown",
"source": "### Subsampling"
},
{
"metadata": {
"hidden": true
},
"cell_type": "markdown",
"source": "It turns out that one of the easiest ways to avoid over-fitting is also one of the best ways to speed up analysis: *subsampling*. Let's return to using our full dataset, so that we can demonstrate the impact of this technique."
},
{
"metadata": {
"hidden": true,
"trusted": true
},
"cell_type": "code",
"source": "df_trn, y_trn, nas = proc_df(df_raw, 'SalePrice')\nX_train, X_valid = split_vals(df_trn, n_trn)\ny_train, y_valid = split_vals(y_trn, n_trn)",
"execution_count": 129,
"outputs": []
},
{
"metadata": {
"hidden": true
},
"cell_type": "markdown",
"source": "The basic idea is this: rather than limit the total amount of data that our model can access, let's instead limit it to a *different* random subset per tree. That way, given enough trees, the model can still see *all* the data, but for each individual tree it'll be just as fast as if we had cut down our dataset as before."
},
{
"metadata": {
"hidden": true,
"trusted": true
},
"cell_type": "code",
"source": "set_rf_samples(20000)",
"execution_count": 130,
"outputs": []
},
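{
"metadata": {
"hidden": true
},
"cell_type": "markdown",
"source": "Note that `set_rf_samples` is a fastai 0.7 helper. In scikit-learn 0.22 and later, `RandomForestRegressor` accepts a `max_samples` argument that gives a similar per-tree subsample directly. The sketch below assumes such a version is installed (this notebook's environment appears to be older, judging by the `FutureWarning` messages) and uses synthetic data rather than the bulldozers dataset; the `*_demo` names are hypothetical."
},
{
"metadata": {
"hidden": true,
"trusted": true
},
"cell_type": "code",
"source": "from sklearn.datasets import make_regression\nfrom sklearn.ensemble import RandomForestRegressor\n\nX_demo, y_demo = make_regression(n_samples=5000, n_features=10, noise=10.0, random_state=0)\n\n# Requires scikit-learn >= 0.22: each tree is trained on 1000 rows drawn with replacement\nm_demo = RandomForestRegressor(n_estimators=40, max_samples=1000, oob_score=True,\n                               n_jobs=-1, random_state=0)\nm_demo.fit(X_demo, y_demo)\nm_demo.oob_score_",
"execution_count": null,
"outputs": []
},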
{
"metadata": {
"hidden": true,
"trusted": true
},
"cell_type": "code",
"source": "m = RandomForestRegressor(n_jobs=-1, oob_score=True)\n%time m.fit(X_train, y_train)\nprint_score(m)",
"execution_count": 131,
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": "C:\\Users\\HP\\Anaconda3\\envs\\fastai\\lib\\site-packages\\sklearn\\ensemble\\forest.py:246: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.\n \"10 in version 0.20 to 100 in 0.22.\", FutureWarning)\n"
},
{
"name": "stdout",
"output_type": "stream",
"text": "Wall time: 8.15 s\n[0.24122494719400017, 0.2759867342751804, 0.878387206060842, 0.863973228952055, 0.8658550315447264]\n"
}
]
},
{
"metadata": {
"hidden": true
},
"cell_type": "markdown",
"source": "Since each additional tree allows the model to see more data, this approach can make additional trees more useful."
},
{
"metadata": {
"hidden": true,
"trusted": true
},
"cell_type": "code",
"source": "preds = np.stack([t.predict(X_valid) for t in m.estimators_])\npreds[:,0], np.mean(preds[:,0]), y_valid[0]\n\nprint(preds.shape)\n\nplt.plot([metrics.r2_score(y_valid, np.mean(preds[:i+1], axis=0)) for i in range(10)]);",
"execution_count": 132,
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": "(10, 12000)\n"
},
{
"data": {
"image/png": "\n",
"text/plain": "<Figure size 432x288 with 1 Axes>"
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
]
},
{
"metadata": {
"hidden": true,
"trusted": true
},
"cell_type": "code",
"source": "m = RandomForestRegressor(n_estimators=40, n_jobs=-1, oob_score=True)\nm.fit(X_train, y_train)\nprint_score(m)",
"execution_count": 133,
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": "[0.227129484361024, 0.2635912588479528, 0.8921843374704868, 0.8759176578965595, 0.8806396852688025]\n"
}
]
},
{
"metadata": {
"heading_collapsed": true,
"hidden": true
},
"cell_type": "markdown",
"source": "### Tree building parameters"
},
{
"metadata": {
"hidden": true
},
"cell_type": "markdown",
"source": "We revert to using a full bootstrap sample in order to show the impact of other over-fitting avoidance methods."
},
{
"metadata": {
"hidden": true,
"trusted": true
},
"cell_type": "code",
"source": "reset_rf_samples()",
"execution_count": 134,
"outputs": []
},
{
"metadata": {
"hidden": true
},
"cell_type": "markdown",
"source": "Let's get a baseline for this full set to compare to."
},
{
"metadata": {
"hidden": true,
"trusted": true
},
"cell_type": "code",
"source": "def dectree_max_depth(tree):\n children_left = tree.children_left\n children_right = tree.children_right\n\n def walk(node_id):\n if (children_left[node_id] != children_right[node_id]):\n left_max = 1 + walk(children_left[node_id])\n right_max = 1 + walk(children_right[node_id])\n return max(left_max, right_max)\n else: # leaf\n return 1\n\n root_node_id = 0\n return walk(root_node_id)",
"execution_count": 135,
"outputs": []
},
{
"metadata": {
"hidden": true,
"trusted": true
},
"cell_type": "code",
"source": "m = RandomForestRegressor(n_estimators=40, n_jobs=-1, oob_score=True)\nm.fit(X_train, y_train)\nprint_score(m)",
"execution_count": 136,
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": "[0.07837691562104095, 0.23778675763998935, 0.9871615922945133, 0.8990228046123347, 0.9084765309895813]\n"
}
]
},
{
"metadata": {
"hidden": true,
"trusted": true
},
"cell_type": "code",
"source": "t=m.estimators_[0].tree_",
"execution_count": 137,
"outputs": []
},
{
"metadata": {
"hidden": true,
"trusted": true
},
"cell_type": "code",
"source": "dectree_max_depth(t)",
"execution_count": 138,
"outputs": [
{
"data": {
"text/plain": "48"
},
"execution_count": 138,
"metadata": {},
"output_type": "execute_result"
}
]
},
{
"metadata": {
"hidden": true,
"trusted": true
},
"cell_type": "code",
"source": "m = RandomForestRegressor(n_estimators=40, min_samples_leaf=5, n_jobs=-1, oob_score=True)\nm.fit(X_train, y_train)\nprint_score(m)",
"execution_count": 139,
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": "[0.1406976555290867, 0.23379601727087287, 0.9586278064633849, 0.9023837339512348, 0.906973944387718]\n"
}
]
},
{
"metadata": {
"hidden": true,
"trusted": true
},
"cell_type": "code",
"source": "t=m.estimators_[0].tree_",
"execution_count": 140,
"outputs": []
},
{
"metadata": {
"hidden": true,
"trusted": true
},
"cell_type": "code",
"source": "dectree_max_depth(t)",
"execution_count": 141,
"outputs": [
{
"data": {
"text/plain": "38"
},
"execution_count": 141,
"metadata": {},
"output_type": "execute_result"
}
]
},
{
"metadata": {
"hidden": true
},
"cell_type": "markdown",
"source": "Another way to reduce over-fitting is to grow our trees less deeply. We do this by specifying (with `min_samples_leaf`) that we require some minimum number of rows in every leaf node. This has two benefits:\n\n- There are less decision rules for each leaf node; simpler models should generalize better\n- The predictions are made by averaging more rows in the leaf node, resulting in less volatility"
},
{
"metadata": {
"hidden": true,
"trusted": true
},
"cell_type": "code",
"source": "m = RandomForestRegressor(n_estimators=40, min_samples_leaf=3, n_jobs=-1, oob_score=True)\nm.fit(X_train, y_train)\nprint_score(m)",
"execution_count": 142,
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": "[0.11498345368372771, 0.2337647561310838, 0.972368432384591, 0.9024098369769239, 0.9085519934334578]\n"
}
]
},
{
"metadata": {
"hidden": true
},
"cell_type": "markdown",
"source": "We can also increase the amount of variation amongst the trees by not only use a sample of rows for each tree, but to also using a sample of *columns* for each *split*. We do this by specifying `max_features`, which is the proportion of features to randomly select from at each split."
},
{
"metadata": {
"hidden": true
},
"cell_type": "markdown",
"source": "- None\n- 0.5\n- 'sqrt'"
},
{
"metadata": {
"hidden": true
},
"cell_type": "markdown",
"source": "- 1, 3, 5, 10, 25, 100"
},
{
"metadata": {
"hidden": true,
"trusted": true
},
"cell_type": "code",
"source": "m = RandomForestRegressor(n_estimators=40, min_samples_leaf=3, max_features=0.5, n_jobs=-1, oob_score=True)\nm.fit(X_train, y_train)\nprint_score(m)",
"execution_count": 143,
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": "[0.11925952924670367, 0.22912896153648526, 0.9702750585718218, 0.9062420835574635, 0.911538837132081]\n"
}
]
},
{
"metadata": {
"hidden": true
},
"cell_type": "markdown",
"source": "We can't compare our results directly with the Kaggle competition, since it used a different validation set (and we can no longer to submit to this competition) - but we can at least see that we're getting similar results to the winners based on the dataset we have.\n\nThe sklearn docs [show an example](http://scikit-learn.org/stable/auto_examples/ensemble/plot_ensemble_oob.html) of different `max_features` methods with increasing numbers of trees - as you see, using a subset of features on each split requires using more trees, but results in better models:\n![sklearn max_features chart](http://scikit-learn.org/stable/_images/sphx_glr_plot_ensemble_oob_001.png)"
}
],
"metadata": {
"gist": {
"id": "",
"data": {
"description": "courses/ml1/BHU ml competition/lesson1-rf.ipynb",
"public": true
}
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3",
"language": "python"
},
"language_info": {
"name": "python",
"version": "3.6.8",
"mimetype": "text/x-python",
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"pygments_lexer": "ipython3",
"nbconvert_exporter": "python",
"file_extension": ".py"
},
"varInspector": {
"window_display": false,
"cols": {
"lenName": 16,
"lenType": 16,
"lenVar": 40
},
"kernels_config": {
"python": {
"library": "var_list.py",
"delete_cmd_prefix": "del ",
"delete_cmd_postfix": "",
"varRefreshCmd": "print(var_dic_list())"
},
"r": {
"library": "var_list.r",
"delete_cmd_prefix": "rm(",
"delete_cmd_postfix": ") ",
"varRefreshCmd": "cat(var_dic_list()) "
}
},
"types_to_exclude": [
"module",
"function",
"builtin_function_or_method",
"instance",
"_Feature"
]
}
},
"nbformat": 4,
"nbformat_minor": 2
}