pmarcelino/data_cleaning_general_shuffle_data.ipynb

## data_cleaning_general_shuffle_data.ipynb
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Shuffle dataset"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Shuffle a dataset before training a machine learning model, so that any ordering in the rows does not carry over into the train and test sets. It's easy to forget about this, but many datasets are sorted in a way that can bias your analysis.\n",
    "\n",
    "Consider the following example:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>age</th>\n",
       "      <th>name</th>\n",
       "      <th>salary</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>18</td>\n",
       "      <td>pussidonio</td>\n",
       "      <td>780</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>19</td>\n",
       "      <td>benquerenca</td>\n",
       "      <td>767</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>20</td>\n",
       "      <td>jorinho</td>\n",
       "      <td>750</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>19</td>\n",
       "      <td>balsagodes</td>\n",
       "      <td>760</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>56</td>\n",
       "      <td>asdrubal</td>\n",
       "      <td>2580</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>58</td>\n",
       "      <td>tamagnini</td>\n",
       "      <td>2750</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   age         name salary\n",
       "0   18   pussidonio    780\n",
       "1   19  benquerenca    767\n",
       "2   20      jorinho    750\n",
       "3   19   balsagodes    760\n",
       "4   56     asdrubal   2580\n",
       "5   58    tamagnini   2750"
      ]
     },
     "execution_count": 45,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Create an example dataset\n",
    "import pandas as pd\n",
    "\n",
    "df = pd.DataFrame({'name': ['pussidonio', 'benquerenca', 'jorinho', 'balsagodes', 'asdrubal', 'tamagnini'],\n",
    "                   'salary': [780, 767, 750, 760, 2580, 2750],  # numeric, so averages can be computed\n",
    "                   'age': [18, 19, 20, 19, 56, 58]})\n",
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now imagine that you split the data into train and test sets by position, without shuffling first."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Create train and test datasets\n",
    "import numpy as np\n",
    "\n",
    "train_size = 0.8\n",
    "n_train = int(np.shape(df)[0] * train_size)\n",
    "\n",
    "df_train = df.iloc[:n_train]\n",
    "df_test = df.iloc[n_train:]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you split the data this way, you're in trouble. The two sets have completely different distributions: in the train set, ages range from 18 to 20 and salaries average about 764, while in the test set ages range from 56 to 58 and salaries average 2665. A model fitted on this train set learns very little that applies to the test set.\n",
    "\n",
    "The message here is that we should shuffle our dataset to avoid this type of segregation, which biases our analysis. There are two ways to do so:\n",
    "1. Directly shuffle the dataset.\n",
    "1. Create train and test sets with scikit-learn."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Directly shuffle the dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 47,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>age</th>\n",
       "      <th>name</th>\n",
       "      <th>salary</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>18</td>\n",
       "      <td>pussidonio</td>\n",
       "      <td>780</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>19</td>\n",
       "      <td>balsagodes</td>\n",
       "      <td>760</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>56</td>\n",
       "      <td>asdrubal</td>\n",
       "      <td>2580</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>19</td>\n",
       "      <td>benquerenca</td>\n",
       "      <td>767</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>20</td>\n",
       "      <td>jorinho</td>\n",
       "      <td>750</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>58</td>\n",
       "      <td>tamagnini</td>\n",
       "      <td>2750</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   age         name salary\n",
       "0   18   pussidonio    780\n",
       "3   19   balsagodes    760\n",
       "4   56     asdrubal   2580\n",
       "1   19  benquerenca    767\n",
       "2   20      jorinho    750\n",
       "5   58    tamagnini   2750"
      ]
     },
     "execution_count": 47,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Shuffle dataset\n",
    "from sklearn.utils import shuffle\n",
    "\n",
    "df_shuffled = shuffle(df)\n",
    "df_shuffled"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now the data is mixed. Note that shuffling does not fix [imbalanced datasets](https://blog.dominodatalab.com/imbalanced-datasets/); that's a different problem."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Create train and test sets with scikit-learn"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 51,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "X_train:\n",
      "    age        name\n",
      "5   58   tamagnini\n",
      "3   19  balsagodes\n",
      "2   20     jorinho\n",
      "4   56    asdrubal\n",
      "X_test:\n",
      "    age         name\n",
      "1   19  benquerenca\n",
      "0   18   pussidonio\n"
     ]
    }
   ],
   "source": [
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "X = df.drop('salary', axis=1)\n",
    "y = df['salary']\n",
    "\n",
    "# train_test_split shuffles the rows by default (shuffle=True)\n",
    "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)\n",
    "print('X_train:\\n', X_train)\n",
    "print('X_test:\\n', X_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As before, the data is now mixed, but the resulting train and test sets can still be imbalanced."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
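
As a rough companion to the notebook above, the sketch below reproduces the ordered-split problem and one way around it using pandas alone. Several things here are assumptions rather than part of the notebook: salaries are stored as integers so that means can be computed, `ordered_split` is a hypothetical helper, and `DataFrame.sample(frac=1, random_state=0)` stands in for `sklearn.utils.shuffle` (the seed value is arbitrary).

```python
import pandas as pd

# Same toy data as the notebook, with integer salaries (assumption of this sketch).
df = pd.DataFrame({
    'name': ['pussidonio', 'benquerenca', 'jorinho',
             'balsagodes', 'asdrubal', 'tamagnini'],
    'salary': [780, 767, 750, 760, 2580, 2750],
    'age': [18, 19, 20, 19, 56, 58],
})


def ordered_split(frame, train_size=0.8):
    """Hypothetical helper: split a DataFrame by position, without shuffling."""
    n_train = int(len(frame) * train_size)
    return frame.iloc[:n_train], frame.iloc[n_train:]


# Split in the original order: train and test cover different salary ranges.
train, test = ordered_split(df)
ordered_train_mean = train['salary'].mean()
ordered_test_mean = test['salary'].mean()
print('ordered split  -> train mean:', ordered_train_mean,
      '| test mean:', ordered_test_mean)

# Shuffle the rows first. frac=1 returns every row in random order, and
# random_state fixes the permutation so the result is reproducible.
df_shuffled = df.sample(frac=1, random_state=0)
train_s, test_s = ordered_split(df_shuffled)
print('shuffled split -> train mean:', train_s['salary'].mean(),
      '| test mean:', test_s['salary'].mean())
```

With only six rows, a shuffled split can still come out lopsided by chance; shuffling removes the effect of the row order, not the variance of a small sample.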