Data cleaning - General - Shuffle data
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Shuffle dataset"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Shuffle a dataset before training a machine learning model so that the order of the rows does not bias the result. It's easy to forget this step, but many datasets are stored in an order that can bias your analysis.\n",
"\n",
"Consider the following example:"
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style>\n",
" .dataframe thead tr:only-child th {\n",
" text-align: right;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: left;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>age</th>\n",
" <th>name</th>\n",
" <th>salary</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>18</td>\n",
" <td>pussidonio</td>\n",
" <td>780</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>19</td>\n",
" <td>benquerenca</td>\n",
" <td>767</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>20</td>\n",
" <td>jorinho</td>\n",
" <td>750</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>19</td>\n",
" <td>balsagodes</td>\n",
" <td>760</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>56</td>\n",
" <td>asdrubal</td>\n",
" <td>2580</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>58</td>\n",
" <td>tamagnini</td>\n",
" <td>2750</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" age name salary\n",
"0 18 pussidonio 780\n",
"1 19 benquerenca 767\n",
"2 20 jorinho 750\n",
"3 19 balsagodes 760\n",
"4 56 asdrubal 2580\n",
"5 58 tamagnini 2750"
]
},
"execution_count": 45,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Create example\n",
"import pandas as pd\n",
"\n",
"# Salaries are stored as numbers so that they can be averaged and compared\n",
"df = pd.DataFrame({'name': ['pussidonio', 'benquerenca', 'jorinho', 'balsagodes', 'asdrubal', 'tamagnini'],\n",
"                   'salary': [780, 767, 750, 760, 2580, 2750],\n",
"                   'age': [18, 19, 20, 19, 56, 58]})\n",
"df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now imagine that you have the following train and test datasets."
]
},
{
"cell_type": "code",
"execution_count": 46,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Create train and test datasets\n",
"import numpy as np\n",
"\n",
"train_size = 0.8\n",
"n_train = int(np.shape(df)[0] * train_size)\n",
"\n",
"df_train = df.iloc[:n_train]\n",
"df_test = df.iloc[n_train:]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you split the data this way, you're in trouble. Comparing the two sets shows completely different distributions: in the train set, ages range from 18 to 20 and salaries average about 764, while in the test set ages range from 56 to 58 and salaries average about 2665. A model fitted on this train set can learn very little about the test set.\n",
"\n",
"The message is that we should shuffle the dataset to avoid this kind of segregation, which biases the analysis. There are two ways to do so:\n",
"1. Directly shuffle the dataset.\n",
"2. Create train and test sets with scikit-learn."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Directly shuffle the dataset"
]
},
{
"cell_type": "code",
"execution_count": 47,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style>\n",
" .dataframe thead tr:only-child th {\n",
" text-align: right;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: left;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>age</th>\n",
" <th>name</th>\n",
" <th>salary</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>18</td>\n",
" <td>pussidonio</td>\n",
" <td>780</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>19</td>\n",
" <td>balsagodes</td>\n",
" <td>760</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>56</td>\n",
" <td>asdrubal</td>\n",
" <td>2580</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>19</td>\n",
" <td>benquerenca</td>\n",
" <td>767</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>20</td>\n",
" <td>jorinho</td>\n",
" <td>750</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>58</td>\n",
" <td>tamagnini</td>\n",
" <td>2750</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" age name salary\n",
"0 18 pussidonio 780\n",
"3 19 balsagodes 760\n",
"4 56 asdrubal 2580\n",
"1 19 benquerenca 767\n",
"2 20 jorinho 750\n",
"5 58 tamagnini 2750"
]
},
"execution_count": 47,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Shuffle dataset\n",
"from sklearn.utils import shuffle\n",
"\n",
"df_shuffled = shuffle(df)\n",
"df_shuffled"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now the rows are mixed. Note that shuffling does not fix [imbalanced datasets](https://blog.dominodatalab.com/imbalanced-datasets/), but that's a different problem."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Create train and test sets with scikit-learn"
]
},
{
"cell_type": "code",
"execution_count": 51,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"X_train:\n",
" age name\n",
"5 58 tamagnini\n",
"3 19 balsagodes\n",
"2 20 jorinho\n",
"4 56 asdrubal\n",
"X_test:\n",
" age name\n",
"1 19 benquerenca\n",
"0 18 pussidonio\n"
]
}
],
"source": [
"from sklearn.model_selection import train_test_split\n",
"\n",
"X = df.drop('salary', axis=1)\n",
"y = df['salary']\n",
"\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)\n",
"print('X_train:\\n', X_train)\n",
"print('X_test:\\n', X_test)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The same comment applies: the rows are mixed, but the train and test sets can still be imbalanced."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.2"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
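
The comparison the notebook makes between the ordered train and test sets can be checked numerically. The sketch below is not part of the original notebook: it rebuilds the same toy data with numeric salaries (an assumption, so that means can be computed), and the helper names `ordered_split` and `shuffled_split` and the `seed` parameter are hypothetical. It also swaps `sklearn.utils.shuffle` for pandas' `DataFrame.sample(frac=1)`, which should give an equivalent row shuffle without the extra import.

```python
import pandas as pd

# Same toy data as the notebook, with salaries stored as numbers
# (an assumption made here so that means can be computed).
df = pd.DataFrame({
    'name': ['pussidonio', 'benquerenca', 'jorinho',
             'balsagodes', 'asdrubal', 'tamagnini'],
    'salary': [780, 767, 750, 760, 2580, 2750],
    'age': [18, 19, 20, 19, 56, 58],
})


def ordered_split(frame, train_size=0.8):
    """Split by row position without shuffling (the problematic approach)."""
    n_train = int(len(frame) * train_size)
    return frame.iloc[:n_train], frame.iloc[n_train:]


def shuffled_split(frame, train_size=0.8, seed=0):
    """Shuffle all rows with a fixed seed, then split by row position."""
    shuffled = frame.sample(frac=1, random_state=seed)
    return ordered_split(shuffled, train_size)


# Ordered split: the two sets have very different salary means.
train, test = ordered_split(df)
print('ordered mean salary (train, test):',
      train['salary'].mean(), test['salary'].mean())

# Shuffled split: the means depend on the seed, but rows are mixed first.
train_s, test_s = shuffled_split(df)
print('shuffled mean salary (train, test):',
      train_s['salary'].mean(), test_s['salary'].mean())
```

For the ordered split, the means come out as 764.25 for the train set and 2665.0 for the test set, matching the figures quoted in the notebook. Fixing the seed only makes the shuffle repeatable; with six rows, a single shuffle can still produce an unrepresentative split.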