{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Course: Building an Effective ML Workflow with scikit-learn"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Outline:\n",
"\n",
"- Review of the basic Machine Learning workflow\n",
"- Encoding categorical data\n",
"- Using ColumnTransformer and Pipeline\n",
"- Recap\n",
"- Encoding text data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Part 1: Review of the basic Machine Learning workflow"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Check your scikit-learn version:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'0.22.1'"
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import sklearn\n",
"sklearn.__version__"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Load 10 rows from the famous Titanic dataset:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"df = pd.read_csv('http://bit.ly/kaggletrain', nrows=10)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Basic terminology:\n",
"\n",
"- \"Survived\" is the **target** column\n",
"- Target is categorical, thus it's a **classification problem**\n",
"- All other columns are possible **features**\n",
"- Each row is an **observation**, and represents a passenger\n",
"- This is our **training data** because we know the target values"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>PassengerId</th>\n",
" <th>Survived</th>\n",
" <th>Pclass</th>\n",
" <th>Name</th>\n",
" <th>Sex</th>\n",
" <th>Age</th>\n",
" <th>SibSp</th>\n",
" <th>Parch</th>\n",
" <th>Ticket</th>\n",
" <th>Fare</th>\n",
" <th>Cabin</th>\n",
" <th>Embarked</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Braund, Mr. Owen Harris</td>\n",
" <td>male</td>\n",
" <td>22.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>A/5 21171</td>\n",
" <td>7.2500</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Cumings, Mrs. John Bradley (Florence Briggs Th...</td>\n",
" <td>female</td>\n",
" <td>38.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>PC 17599</td>\n",
" <td>71.2833</td>\n",
" <td>C85</td>\n",
" <td>C</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>3</td>\n",
" <td>Heikkinen, Miss. Laina</td>\n",
" <td>female</td>\n",
" <td>26.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>STON/O2. 3101282</td>\n",
" <td>7.9250</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>4</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Futrelle, Mrs. Jacques Heath (Lily May Peel)</td>\n",
" <td>female</td>\n",
" <td>35.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>113803</td>\n",
" <td>53.1000</td>\n",
" <td>C123</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>5</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Allen, Mr. William Henry</td>\n",
" <td>male</td>\n",
" <td>35.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>373450</td>\n",
" <td>8.0500</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>6</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Moran, Mr. James</td>\n",
" <td>male</td>\n",
" <td>NaN</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>330877</td>\n",
" <td>8.4583</td>\n",
" <td>NaN</td>\n",
" <td>Q</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>7</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>McCarthy, Mr. Timothy J</td>\n",
" <td>male</td>\n",
" <td>54.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>17463</td>\n",
" <td>51.8625</td>\n",
" <td>E46</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>8</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Palsson, Master. Gosta Leonard</td>\n",
" <td>male</td>\n",
" <td>2.0</td>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>349909</td>\n",
" <td>21.0750</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>9</td>\n",
" <td>1</td>\n",
" <td>3</td>\n",
" <td>Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)</td>\n",
" <td>female</td>\n",
" <td>27.0</td>\n",
" <td>0</td>\n",
" <td>2</td>\n",
" <td>347742</td>\n",
" <td>11.1333</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>10</td>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>Nasser, Mrs. Nicholas (Adele Achem)</td>\n",
" <td>female</td>\n",
" <td>14.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>237736</td>\n",
" <td>30.0708</td>\n",
" <td>NaN</td>\n",
" <td>C</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" PassengerId Survived Pclass \\\n",
"0 1 0 3 \n",
"1 2 1 1 \n",
"2 3 1 3 \n",
"3 4 1 1 \n",
"4 5 0 3 \n",
"5 6 0 3 \n",
"6 7 0 1 \n",
"7 8 0 3 \n",
"8 9 1 3 \n",
"9 10 1 2 \n",
"\n",
" Name Sex Age SibSp \\\n",
"0 Braund, Mr. Owen Harris male 22.0 1 \n",
"1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 \n",
"2 Heikkinen, Miss. Laina female 26.0 0 \n",
"3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 \n",
"4 Allen, Mr. William Henry male 35.0 0 \n",
"5 Moran, Mr. James male NaN 0 \n",
"6 McCarthy, Mr. Timothy J male 54.0 0 \n",
"7 Palsson, Master. Gosta Leonard male 2.0 3 \n",
"8 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female 27.0 0 \n",
"9 Nasser, Mrs. Nicholas (Adele Achem) female 14.0 1 \n",
"\n",
" Parch Ticket Fare Cabin Embarked \n",
"0 0 A/5 21171 7.2500 NaN S \n",
"1 0 PC 17599 71.2833 C85 C \n",
"2 0 STON/O2. 3101282 7.9250 NaN S \n",
"3 0 113803 53.1000 C123 S \n",
"4 0 373450 8.0500 NaN S \n",
"5 0 330877 8.4583 NaN Q \n",
"6 0 17463 51.8625 E46 S \n",
"7 1 349909 21.0750 NaN S \n",
"8 2 347742 11.1333 NaN S \n",
"9 0 237736 30.0708 NaN C "
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We want to use \"Parch\" and \"Fare\" as initial features:\n",
"\n",
"- \"Parch\" is the number of parents or children aboard with each passenger\n",
"- \"Fare\" is the amount they paid\n",
"- Both are numeric\n",
"\n",
"Define X and y:\n",
"\n",
"- X is the feature matrix\n",
"- y is the target"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Parch</th>\n",
" <th>Fare</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0</td>\n",
" <td>7.2500</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>0</td>\n",
" <td>71.2833</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>0</td>\n",
" <td>7.9250</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>0</td>\n",
" <td>53.1000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>0</td>\n",
" <td>8.0500</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>0</td>\n",
" <td>8.4583</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>0</td>\n",
" <td>51.8625</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>1</td>\n",
" <td>21.0750</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>2</td>\n",
" <td>11.1333</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>0</td>\n",
" <td>30.0708</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Parch Fare\n",
"0 0 7.2500\n",
"1 0 71.2833\n",
"2 0 7.9250\n",
"3 0 53.1000\n",
"4 0 8.0500\n",
"5 0 8.4583\n",
"6 0 51.8625\n",
"7 1 21.0750\n",
"8 2 11.1333\n",
"9 0 30.0708"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X = df[['Parch', 'Fare']]\n",
"X"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 0\n",
"1 1\n",
"2 1\n",
"3 1\n",
"4 0\n",
"5 0\n",
"6 0\n",
"7 0\n",
"8 1\n",
"9 1\n",
"Name: Survived, dtype: int64"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"y = df['Survived']\n",
"y"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Check the object shapes:\n",
"\n",
"- X is a pandas DataFrame with 2 columns, thus it has 2 dimensions\n",
"- y is a pandas Series, thus it has 1 dimension"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(10, 2)"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X.shape"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(10,)"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"y.shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Create a model object:\n",
"\n",
"- Set the \"solver\" to increase the likelihood that we will all get the same results\n",
"- Set the \"random_state\" for reproducibility"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.linear_model import LogisticRegression\n",
"logreg = LogisticRegression(solver='liblinear', random_state=1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Evaluate the model using cross-validation:\n",
"\n",
"- Our goal is to simulate model performance on future data so that we can choose between models\n",
"- Evaluation metric is classification accuracy\n",
"- \"cross_val_score\" does the dataset splitting, training, predictions, and evaluation\n",
"- Your results may differ based on your scikit-learn version\n",
"- We can't take these results seriously because the dataset is tiny"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.6944444444444443"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from sklearn.model_selection import cross_val_score\n",
"cross_val_score(logreg, X, y, cv=3, scoring='accuracy').mean()"
]
},
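{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an aside (not part of the original workflow), here is a minimal sketch of roughly what \"cross_val_score\" does behind the scenes, assuming stratified 3-fold splitting for a classifier:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# rough sketch only: split the rows into 3 stratified folds, then fit and\n",
"# score the model on each fold and average the accuracies\n",
"from sklearn.model_selection import StratifiedKFold\n",
"from sklearn.metrics import accuracy_score\n",
"scores = []\n",
"for train_idx, test_idx in StratifiedKFold(n_splits=3).split(X, y):\n",
"    logreg.fit(X.iloc[train_idx], y.iloc[train_idx])\n",
"    preds = logreg.predict(X.iloc[test_idx])\n",
"    scores.append(accuracy_score(y.iloc[test_idx], preds))\n",
"sum(scores) / len(scores)"
]
},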
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Train the model on the entire dataset:"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,\n",
" intercept_scaling=1, l1_ratio=None, max_iter=100,\n",
" multi_class='auto', n_jobs=None, penalty='l2',\n",
" random_state=1, solver='liblinear', tol=0.0001, verbose=0,\n",
" warm_start=False)"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"logreg.fit(X, y)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Read in a new dataset for which we don't know the target values:"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>PassengerId</th>\n",
" <th>Pclass</th>\n",
" <th>Name</th>\n",
" <th>Sex</th>\n",
" <th>Age</th>\n",
" <th>SibSp</th>\n",
" <th>Parch</th>\n",
" <th>Ticket</th>\n",
" <th>Fare</th>\n",
" <th>Cabin</th>\n",
" <th>Embarked</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>892</td>\n",
" <td>3</td>\n",
" <td>Kelly, Mr. James</td>\n",
" <td>male</td>\n",
" <td>34.5</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>330911</td>\n",
" <td>7.8292</td>\n",
" <td>NaN</td>\n",
" <td>Q</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>893</td>\n",
" <td>3</td>\n",
" <td>Wilkes, Mrs. James (Ellen Needs)</td>\n",
" <td>female</td>\n",
" <td>47.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>363272</td>\n",
" <td>7.0000</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>894</td>\n",
" <td>2</td>\n",
" <td>Myles, Mr. Thomas Francis</td>\n",
" <td>male</td>\n",
" <td>62.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>240276</td>\n",
" <td>9.6875</td>\n",
" <td>NaN</td>\n",
" <td>Q</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>895</td>\n",
" <td>3</td>\n",
" <td>Wirz, Mr. Albert</td>\n",
" <td>male</td>\n",
" <td>27.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>315154</td>\n",
" <td>8.6625</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>896</td>\n",
" <td>3</td>\n",
" <td>Hirvonen, Mrs. Alexander (Helga E Lindqvist)</td>\n",
" <td>female</td>\n",
" <td>22.0</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>3101298</td>\n",
" <td>12.2875</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>897</td>\n",
" <td>3</td>\n",
" <td>Svensson, Mr. Johan Cervin</td>\n",
" <td>male</td>\n",
" <td>14.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>7538</td>\n",
" <td>9.2250</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>898</td>\n",
" <td>3</td>\n",
" <td>Connolly, Miss. Kate</td>\n",
" <td>female</td>\n",
" <td>30.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>330972</td>\n",
" <td>7.6292</td>\n",
" <td>NaN</td>\n",
" <td>Q</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>899</td>\n",
" <td>2</td>\n",
" <td>Caldwell, Mr. Albert Francis</td>\n",
" <td>male</td>\n",
" <td>26.0</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>248738</td>\n",
" <td>29.0000</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>900</td>\n",
" <td>3</td>\n",
" <td>Abrahim, Mrs. Joseph (Sophie Halaut Easu)</td>\n",
" <td>female</td>\n",
" <td>18.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>2657</td>\n",
" <td>7.2292</td>\n",
" <td>NaN</td>\n",
" <td>C</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>901</td>\n",
" <td>3</td>\n",
" <td>Davies, Mr. John Samuel</td>\n",
" <td>male</td>\n",
" <td>21.0</td>\n",
" <td>2</td>\n",
" <td>0</td>\n",
" <td>A/4 48871</td>\n",
" <td>24.1500</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" PassengerId Pclass Name Sex \\\n",
"0 892 3 Kelly, Mr. James male \n",
"1 893 3 Wilkes, Mrs. James (Ellen Needs) female \n",
"2 894 2 Myles, Mr. Thomas Francis male \n",
"3 895 3 Wirz, Mr. Albert male \n",
"4 896 3 Hirvonen, Mrs. Alexander (Helga E Lindqvist) female \n",
"5 897 3 Svensson, Mr. Johan Cervin male \n",
"6 898 3 Connolly, Miss. Kate female \n",
"7 899 2 Caldwell, Mr. Albert Francis male \n",
"8 900 3 Abrahim, Mrs. Joseph (Sophie Halaut Easu) female \n",
"9 901 3 Davies, Mr. John Samuel male \n",
"\n",
" Age SibSp Parch Ticket Fare Cabin Embarked \n",
"0 34.5 0 0 330911 7.8292 NaN Q \n",
"1 47.0 1 0 363272 7.0000 NaN S \n",
"2 62.0 0 0 240276 9.6875 NaN Q \n",
"3 27.0 0 0 315154 8.6625 NaN S \n",
"4 22.0 1 1 3101298 12.2875 NaN S \n",
"5 14.0 0 0 7538 9.2250 NaN S \n",
"6 30.0 0 0 330972 7.6292 NaN Q \n",
"7 26.0 1 1 248738 29.0000 NaN S \n",
"8 18.0 0 0 2657 7.2292 NaN C \n",
"9 21.0 2 0 A/4 48871 24.1500 NaN S "
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_new = pd.read_csv('http://bit.ly/kaggletest', nrows=10)\n",
"df_new"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Define X_new to have the same columns as X:"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Parch</th>\n",
" <th>Fare</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0</td>\n",
" <td>7.8292</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>0</td>\n",
" <td>7.0000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>0</td>\n",
" <td>9.6875</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>0</td>\n",
" <td>8.6625</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>1</td>\n",
" <td>12.2875</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>0</td>\n",
" <td>9.2250</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>0</td>\n",
" <td>7.6292</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>1</td>\n",
" <td>29.0000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>0</td>\n",
" <td>7.2292</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>0</td>\n",
" <td>24.1500</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Parch Fare\n",
"0 0 7.8292\n",
"1 0 7.0000\n",
"2 0 9.6875\n",
"3 0 8.6625\n",
"4 1 12.2875\n",
"5 0 9.2250\n",
"6 0 7.6292\n",
"7 1 29.0000\n",
"8 0 7.2292\n",
"9 0 24.1500"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X_new = df_new[['Parch', 'Fare']]\n",
"X_new"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Use the trained model to make predictions for X_new:"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"scrolled": false
},
"outputs": [
{
"data": {
"text/plain": [
"array([0, 0, 0, 0, 1, 0, 0, 1, 0, 1])"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"logreg.predict(X_new)"
]
},
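{
"cell_type": "markdown",
"metadata": {},
"source": [
"Optionally (not shown in the original workflow), \"predict_proba\" returns the predicted probability of each class rather than just the predicted class:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# each row contains the predicted probability of class 0 and class 1\n",
"logreg.predict_proba(X_new)"
]
},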
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Part 2: Encoding categorical data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We want to use \"Embarked\" and \"Sex\" as additional features:\n",
"\n",
"- \"Embarked\" is the port they embarked from\n",
"- \"Sex\" is male or female\n",
"- They are unordered categorical features\n",
"- They can't be directly passed to the model because they aren't numeric"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>PassengerId</th>\n",
" <th>Survived</th>\n",
" <th>Pclass</th>\n",
" <th>Name</th>\n",
" <th>Sex</th>\n",
" <th>Age</th>\n",
" <th>SibSp</th>\n",
" <th>Parch</th>\n",
" <th>Ticket</th>\n",
" <th>Fare</th>\n",
" <th>Cabin</th>\n",
" <th>Embarked</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Braund, Mr. Owen Harris</td>\n",
" <td>male</td>\n",
" <td>22.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>A/5 21171</td>\n",
" <td>7.2500</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Cumings, Mrs. John Bradley (Florence Briggs Th...</td>\n",
" <td>female</td>\n",
" <td>38.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>PC 17599</td>\n",
" <td>71.2833</td>\n",
" <td>C85</td>\n",
" <td>C</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>3</td>\n",
" <td>Heikkinen, Miss. Laina</td>\n",
" <td>female</td>\n",
" <td>26.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>STON/O2. 3101282</td>\n",
" <td>7.9250</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>4</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Futrelle, Mrs. Jacques Heath (Lily May Peel)</td>\n",
" <td>female</td>\n",
" <td>35.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>113803</td>\n",
" <td>53.1000</td>\n",
" <td>C123</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>5</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Allen, Mr. William Henry</td>\n",
" <td>male</td>\n",
" <td>35.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>373450</td>\n",
" <td>8.0500</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>6</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Moran, Mr. James</td>\n",
" <td>male</td>\n",
" <td>NaN</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>330877</td>\n",
" <td>8.4583</td>\n",
" <td>NaN</td>\n",
" <td>Q</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>7</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>McCarthy, Mr. Timothy J</td>\n",
" <td>male</td>\n",
" <td>54.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>17463</td>\n",
" <td>51.8625</td>\n",
" <td>E46</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>8</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Palsson, Master. Gosta Leonard</td>\n",
" <td>male</td>\n",
" <td>2.0</td>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>349909</td>\n",
" <td>21.0750</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>9</td>\n",
" <td>1</td>\n",
" <td>3</td>\n",
" <td>Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)</td>\n",
" <td>female</td>\n",
" <td>27.0</td>\n",
" <td>0</td>\n",
" <td>2</td>\n",
" <td>347742</td>\n",
" <td>11.1333</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>10</td>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>Nasser, Mrs. Nicholas (Adele Achem)</td>\n",
" <td>female</td>\n",
" <td>14.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>237736</td>\n",
" <td>30.0708</td>\n",
" <td>NaN</td>\n",
" <td>C</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" PassengerId Survived Pclass \\\n",
"0 1 0 3 \n",
"1 2 1 1 \n",
"2 3 1 3 \n",
"3 4 1 1 \n",
"4 5 0 3 \n",
"5 6 0 3 \n",
"6 7 0 1 \n",
"7 8 0 3 \n",
"8 9 1 3 \n",
"9 10 1 2 \n",
"\n",
" Name Sex Age SibSp \\\n",
"0 Braund, Mr. Owen Harris male 22.0 1 \n",
"1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 \n",
"2 Heikkinen, Miss. Laina female 26.0 0 \n",
"3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 \n",
"4 Allen, Mr. William Henry male 35.0 0 \n",
"5 Moran, Mr. James male NaN 0 \n",
"6 McCarthy, Mr. Timothy J male 54.0 0 \n",
"7 Palsson, Master. Gosta Leonard male 2.0 3 \n",
"8 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female 27.0 0 \n",
"9 Nasser, Mrs. Nicholas (Adele Achem) female 14.0 1 \n",
"\n",
" Parch Ticket Fare Cabin Embarked \n",
"0 0 A/5 21171 7.2500 NaN S \n",
"1 0 PC 17599 71.2833 C85 C \n",
"2 0 STON/O2. 3101282 7.9250 NaN S \n",
"3 0 113803 53.1000 C123 S \n",
"4 0 373450 8.0500 NaN S \n",
"5 0 330877 8.4583 NaN Q \n",
"6 0 17463 51.8625 E46 S \n",
"7 1 349909 21.0750 NaN S \n",
"8 2 347742 11.1333 NaN S \n",
"9 0 237736 30.0708 NaN C "
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Encode \"Embarked\" using one-hot encoding:\n",
"\n",
"- This is the same as \"dummy encoding\"\n",
"- Outputs a sparse matrix, which is more efficient and performant when most values in a matrix are zeros\n",
"- Use two brackets around \"Embarked\" to pass a DataFrame instead of a Series"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<10x3 sparse matrix of type '<class 'numpy.float64'>'\n",
"\twith 10 stored elements in Compressed Sparse Row format>"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from sklearn.preprocessing import OneHotEncoder\n",
"ohe = OneHotEncoder()\n",
"ohe.fit_transform(df[['Embarked']])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Ask for a dense (not sparse) matrix so that we can examine the encoding:\n",
"\n",
"- There are 3 columns because there were 3 unique values in \"Embarked\"\n",
"- Each row contains a single 1\n",
"- 100 means \"C\", 010 means \"Q\", 001 means \"S\"\n",
"- The categories are listed in alphabetical order in the \"categories_\" attribute\n",
"- You can think of \"categories_\" as the column headings for the matrix\n",
"- From each of the three features, the model can learn the relationship between the target value and whether or not a given passenger embarked at that port"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[0., 0., 1.],\n",
" [1., 0., 0.],\n",
" [0., 0., 1.],\n",
" [0., 0., 1.],\n",
" [0., 0., 1.],\n",
" [0., 1., 0.],\n",
" [0., 0., 1.],\n",
" [0., 0., 1.],\n",
" [0., 0., 1.],\n",
" [1., 0., 0.]])"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ohe = OneHotEncoder(sparse=False)\n",
"ohe.fit_transform(df[['Embarked']])"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"[array(['C', 'Q', 'S'], dtype=object)]"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ohe.categories_"
]
},
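{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the point above concrete, here is a small illustrative sketch (not in the original notebook) that labels the encoded matrix with \"categories_\" as the column headings:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# illustrative sketch: view the one-hot encoded matrix as a labeled DataFrame,\n",
"# using the learned categories as the column headings\n",
"pd.DataFrame(ohe.transform(df[['Embarked']]), columns=ohe.categories_[0])"
]
},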
{
"cell_type": "markdown",
"metadata": {},
"source": [
"What's the difference between \"fit\" and \"transform\"?\n",
"\n",
"- OneHotEncoder is a \"transformer\", meaning its role is data transformations\n",
"- Transformers usually have a \"fit\" method and always have a \"transform\" method\n",
"- For all transformers: \"fit\" is when they learn something, and \"transform\" is when they use what they learned to do the transformation\n",
"- For OneHotEncoder: \"fit\" is when it learns the categories, and \"transform\" is when it creates the matrix using those categories\n",
"- If you are going to \"fit\" and \"transform\", then you should do it in a single step using \"fit_transform\"\n",
"\n",
"Encode \"Embarked\" and \"Sex\" at the same time:\n",
"\n",
"- First 3 columns represent \"Embarked\" and last two columns represent \"Sex\"\n",
"- For the \"Sex\" columns: 10 means \"female\", 01 means \"male\""
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[0., 0., 1., 0., 1.],\n",
" [1., 0., 0., 1., 0.],\n",
" [0., 0., 1., 1., 0.],\n",
" [0., 0., 1., 1., 0.],\n",
" [0., 0., 1., 0., 1.],\n",
" [0., 1., 0., 0., 1.],\n",
" [0., 0., 1., 0., 1.],\n",
" [0., 0., 1., 0., 1.],\n",
" [0., 0., 1., 1., 0.],\n",
" [1., 0., 0., 1., 0.]])"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ohe.fit_transform(df[['Embarked', 'Sex']])"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[array(['C', 'Q', 'S'], dtype=object), array(['female', 'male'], dtype=object)]"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ohe.categories_"
]
},
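{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the \"fit\" versus \"transform\" distinction concrete, here is an illustrative sketch (not part of the original workflow, using a throwaway encoder named \"demo_ohe\"): fit on the training data, then apply the learned categories to the new data in \"df_new\":"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# illustrative sketch: learn the categories from the training data...\n",
"demo_ohe = OneHotEncoder(sparse=False)\n",
"demo_ohe.fit(df[['Embarked', 'Sex']])\n",
"# ...then reuse those learned categories to encode the new data\n",
"demo_ohe.transform(df_new[['Embarked', 'Sex']])"
]
},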
{
"cell_type": "markdown",
"metadata": {},
"source": [
"How could we include \"Embarked\" and \"Sex\" in the model along with \"Parch\" and \"Fare\"?\n",
"\n",
"- Stack the 2 numeric features side-by-side with the 5 encoded columns, and then train the model with all 7 columns\n",
"- However, we would need to repeat the same process (encoding and stacking) with the new data before making predictions\n",
"- Doing this manually is inefficient and error-prone, and the complexity will only increase as you preprocess additional columns"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Part 3: Using ColumnTransformer and Pipeline"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Goals:\n",
"\n",
"- Use ColumnTransformer to make it easy to apply different preprocessing to different columns\n",
"- Use Pipeline to make it easy to apply the same workflow to training data and new data\n",
"\n",
"Create a list of columns and use that to update X:"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [],
"source": [
"cols = ['Parch', 'Fare', 'Embarked', 'Sex']"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Parch</th>\n",
" <th>Fare</th>\n",
" <th>Embarked</th>\n",
" <th>Sex</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0</td>\n",
" <td>7.2500</td>\n",
" <td>S</td>\n",
" <td>male</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>0</td>\n",
" <td>71.2833</td>\n",
" <td>C</td>\n",
" <td>female</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>0</td>\n",
" <td>7.9250</td>\n",
" <td>S</td>\n",
" <td>female</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>0</td>\n",
" <td>53.1000</td>\n",
" <td>S</td>\n",
" <td>female</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>0</td>\n",
" <td>8.0500</td>\n",
" <td>S</td>\n",
" <td>male</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>0</td>\n",
" <td>8.4583</td>\n",
" <td>Q</td>\n",
" <td>male</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>0</td>\n",
" <td>51.8625</td>\n",
" <td>S</td>\n",
" <td>male</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>1</td>\n",
" <td>21.0750</td>\n",
" <td>S</td>\n",
" <td>male</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>2</td>\n",
" <td>11.1333</td>\n",
" <td>S</td>\n",
" <td>female</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>0</td>\n",
" <td>30.0708</td>\n",
" <td>C</td>\n",
" <td>female</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Parch Fare Embarked Sex\n",
"0 0 7.2500 S male\n",
"1 0 71.2833 C female\n",
"2 0 7.9250 S female\n",
"3 0 53.1000 S female\n",
"4 0 8.0500 S male\n",
"5 0 8.4583 Q male\n",
"6 0 51.8625 S male\n",
"7 1 21.0750 S male\n",
"8 2 11.1333 S female\n",
"9 0 30.0708 C female"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X = df[cols]\n",
"X"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Create an instance of OneHotEncoder with the default options:"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [],
"source": [
"ohe = OneHotEncoder()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Create a ColumnTransformer:\n",
"\n",
"- First argument (a tuple) specifies that we want to one-hot encode the \"Embarked\" and \"Sex\" columns\n",
"- \"remainder\" argument specifies that we want to keep all other columns in the final output (without modifying them)"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.compose import make_column_transformer\n",
"ct = make_column_transformer(\n",
" (ohe, ['Embarked', 'Sex']),\n",
" remainder='passthrough')"
]
},
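{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a side note, \"make_column_transformer\" is a shortcut; a minimal sketch of the equivalent construction with the ColumnTransformer class (assuming you want to pick the transformer name yourself, here 'encoder', and store it as \"ct_named\") would be:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# equivalent construction using the ColumnTransformer class directly,\n",
"# where 'encoder' is a name we chose for the transformer\n",
"from sklearn.compose import ColumnTransformer\n",
"ct_named = ColumnTransformer(\n",
"    transformers=[('encoder', ohe, ['Embarked', 'Sex'])],\n",
"    remainder='passthrough')"
]
},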
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Perform the transformation:\n",
"\n",
"- Output contains 7 columns in this order: 3 columns for \"Embarked\", 2 for \"Sex\", 1 for \"Parch\", and 1 for \"Fare\"\n",
"- Column order is the order in which you listed them in the ColumnTransformer followed by any you passthrough"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[ 0. , 0. , 1. , 0. , 1. , 0. , 7.25 ],\n",
" [ 1. , 0. , 0. , 1. , 0. , 0. , 71.2833],\n",
" [ 0. , 0. , 1. , 1. , 0. , 0. , 7.925 ],\n",
" [ 0. , 0. , 1. , 1. , 0. , 0. , 53.1 ],\n",
" [ 0. , 0. , 1. , 0. , 1. , 0. , 8.05 ],\n",
" [ 0. , 1. , 0. , 0. , 1. , 0. , 8.4583],\n",
" [ 0. , 0. , 1. , 0. , 1. , 0. , 51.8625],\n",
" [ 0. , 0. , 1. , 0. , 1. , 1. , 21.075 ],\n",
" [ 0. , 0. , 1. , 1. , 0. , 2. , 11.1333],\n",
" [ 1. , 0. , 0. , 1. , 0. , 0. , 30.0708]])"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ct.fit_transform(X)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Use Pipeline to chain together sequential steps:\n",
"\n",
"- Step 1 is data preprocessing using ColumnTransformer\n",
"- Step 2 is model building using LogisticRegression"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.pipeline import make_pipeline\n",
"pipe = make_pipeline(ct, logreg)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Fit the Pipeline:\n",
"\n",
"- Step 1: X gets transformed from 4 columns to 7 columns by ColumnTransformer\n",
"- Step 2: LogisticRegression model gets fit, thus it learns the relationship between those 7 features and the y values\n",
"- Step 1 is assigned the name \"columntransformer\" (all lowercase), and step 2 is assigned the name \"logisticregression\""
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {
"scrolled": false
},
"outputs": [
{
"data": {
"text/plain": [
"Pipeline(memory=None,\n",
" steps=[('columntransformer',\n",
" ColumnTransformer(n_jobs=None, remainder='passthrough',\n",
" sparse_threshold=0.3,\n",
" transformer_weights=None,\n",
" transformers=[('onehotencoder',\n",
" OneHotEncoder(categories='auto',\n",
" drop=None,\n",
" dtype=<class 'numpy.float64'>,\n",
" handle_unknown='error',\n",
" sparse=True),\n",
" ['Embarked', 'Sex'])],\n",
" verbose=False)),\n",
" ('logisticregression',\n",
" LogisticRegression(C=1.0, class_weight=None, dual=False,\n",
" fit_intercept=True, intercept_scaling=1,\n",
" l1_ratio=None, max_iter=100,\n",
" multi_class='auto', n_jobs=None,\n",
" penalty='l2', random_state=1,\n",
" solver='liblinear', tol=0.0001, verbose=0,\n",
" warm_start=False))],\n",
" verbose=False)"
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pipe.fit(X, y)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This is what happens \"under the hood\" when you fit the Pipeline:"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,\n",
" intercept_scaling=1, l1_ratio=None, max_iter=100,\n",
" multi_class='auto', n_jobs=None, penalty='l2',\n",
" random_state=1, solver='liblinear', tol=0.0001, verbose=0,\n",
" warm_start=False)"
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"logreg.fit(ct.fit_transform(X), y)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can select the steps of a Pipeline by name in order to inspect them:\n",
"\n",
"- These are the 7 coefficients of the logistic regression model"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[ 0.26491287, -0.19848033, -0.22907928, 1.0075062 , -1.17015293,\n",
" 0.20056557, 0.01597307]])"
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pipe.named_steps.logisticregression.coef_"
]
},
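{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an optional sketch (not in the original notebook, using throwaway names \"fitted_ohe\" and \"feature_names\"), the 7 coefficients can be paired with readable feature names by pulling the learned categories out of the fitted encoder, assuming the column order described above (\"Embarked\" categories, \"Sex\" categories, then the passthrough columns):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# illustrative sketch: pair each coefficient with its feature name\n",
"fitted_ohe = pipe.named_steps.columntransformer.named_transformers_['onehotencoder']\n",
"feature_names = list(fitted_ohe.categories_[0]) + list(fitted_ohe.categories_[1]) + ['Parch', 'Fare']\n",
"list(zip(feature_names, pipe.named_steps.logisticregression.coef_[0]))"
]
},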
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Update X_new to have the same columns as X:"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Parch</th>\n",
" <th>Fare</th>\n",
" <th>Embarked</th>\n",
" <th>Sex</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0</td>\n",
" <td>7.8292</td>\n",
" <td>Q</td>\n",
" <td>male</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>0</td>\n",
" <td>7.0000</td>\n",
" <td>S</td>\n",
" <td>female</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>0</td>\n",
" <td>9.6875</td>\n",
" <td>Q</td>\n",
" <td>male</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>0</td>\n",
" <td>8.6625</td>\n",
" <td>S</td>\n",
" <td>male</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>1</td>\n",
" <td>12.2875</td>\n",
" <td>S</td>\n",
" <td>female</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>0</td>\n",
" <td>9.2250</td>\n",
" <td>S</td>\n",
" <td>male</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>0</td>\n",
" <td>7.6292</td>\n",
" <td>Q</td>\n",
" <td>female</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>1</td>\n",
" <td>29.0000</td>\n",
" <td>S</td>\n",
" <td>male</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>0</td>\n",
" <td>7.2292</td>\n",
" <td>C</td>\n",
" <td>female</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>0</td>\n",
" <td>24.1500</td>\n",
" <td>S</td>\n",
" <td>male</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Parch Fare Embarked Sex\n",
"0 0 7.8292 Q male\n",
"1 0 7.0000 S female\n",
"2 0 9.6875 Q male\n",
"3 0 8.6625 S male\n",
"4 1 12.2875 S female\n",
"5 0 9.2250 S male\n",
"6 0 7.6292 Q female\n",
"7 1 29.0000 S male\n",
"8 0 7.2292 C female\n",
"9 0 24.1500 S male"
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X_new = df_new[cols]\n",
"X_new"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Use the fitted Pipeline to make predictions for X_new:"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([0, 1, 0, 0, 1, 0, 1, 0, 1, 0])"
]
},
"execution_count": 30,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pipe.predict(X_new)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This is what happens \"under the hood\" when you make predictions using the Pipeline:\n",
"\n",
"- It uses \"transform\" rather than \"fit_transform\" so that the exact encoding scheme learned from the training data (during the \"fit\" step) will be applied to the new data"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([0, 1, 0, 0, 1, 0, 1, 0, 1, 0])"
]
},
"execution_count": 31,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"logreg.predict(ct.transform(X_new))"
]
},
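{
"cell_type": "markdown",
"metadata": {},
"source": [
"One consequence of reusing the learned encoding, shown here as a sketch with a hypothetical passenger whose \"Embarked\" value is 'Z': by default, OneHotEncoder raises an error when it encounters a category it never saw during the \"fit\" step:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# hypothetical row with an Embarked value that was not seen during fit\n",
"unseen = pd.DataFrame({'Parch': [0], 'Fare': [10.0],\n",
"                       'Embarked': ['Z'], 'Sex': ['male']})\n",
"try:\n",
"    pipe.predict(unseen)\n",
"except ValueError as e:\n",
"    print(e)"
]
},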
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Recap\n",
"\n",
"This is all of the code that is necessary to recreate our workflow up to this point:\n",
"\n",
"- You can copy/paste this code from http://bit.ly/basic-pipeline\n",
"- There are no calls to \"fit_transform\" or \"transform\" because all of that functionality is encapsulated by the Pipeline"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"from sklearn.preprocessing import OneHotEncoder\n",
"from sklearn.linear_model import LogisticRegression\n",
"from sklearn.compose import make_column_transformer\n",
"from sklearn.pipeline import make_pipeline"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [],
"source": [
"cols = ['Parch', 'Fare', 'Embarked', 'Sex']"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [],
"source": [
"df = pd.read_csv('http://bit.ly/kaggletrain', nrows=10)\n",
"X = df[cols]\n",
"y = df['Survived']"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [],
"source": [
"df_new = pd.read_csv('http://bit.ly/kaggletest', nrows=10)\n",
"X_new = df_new[cols]"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [],
"source": [
"ohe = OneHotEncoder()"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {},
"outputs": [],
"source": [
"ct = make_column_transformer(\n",
" (ohe, ['Embarked', 'Sex']),\n",
" remainder='passthrough')"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"outputs": [],
"source": [
"logreg = LogisticRegression(solver='liblinear', random_state=1)"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([0, 1, 0, 0, 1, 0, 1, 0, 1, 0])"
]
},
"execution_count": 39,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pipe = make_pipeline(ct, logreg)\n",
"pipe.fit(X, y)\n",
"pipe.predict(X_new)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Summary of our ColumnTransformer:\n",
"\n",
"- It selected 2 categorical columns and transformed them, resulting in 5 columns\n",
"- It selected 2 numerical columns and did nothing to them, resulting in 2 columns\n",
"- It stacked the 7 columns side-by-side\n",
"\n",
"Summary of our Pipeline:\n",
"\n",
"- Step 1 transformed the data from 4 columns to 7 columns using ColumnTransformer\n",
"- Step 2 used a LogisticRegression model for fitting and predicting\n",
"\n",
"Comparing Pipeline and ColumnTransformer:\n",
"\n",
"- ColumnTransformer pulls out subsets of columns and transforms them, and then stacks the results side-by-side\n",
"- Pipeline is a series of steps that occur in order, and the output of each step passes to the next step"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![](https://user-images.githubusercontent.com/6509492/80138958-aee1fa80-8573-11ea-9533-340e87d135b1.jpg)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Part 4: Encoding text data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We want to use \"Name\" as an additional feature:\n",
"\n",
"- It can't be directly passed to the model because it isn't numeric"
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>PassengerId</th>\n",
" <th>Survived</th>\n",
" <th>Pclass</th>\n",
" <th>Name</th>\n",
" <th>Sex</th>\n",
" <th>Age</th>\n",
" <th>SibSp</th>\n",
" <th>Parch</th>\n",
" <th>Ticket</th>\n",
" <th>Fare</th>\n",
" <th>Cabin</th>\n",
" <th>Embarked</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Braund, Mr. Owen Harris</td>\n",
" <td>male</td>\n",
" <td>22.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>A/5 21171</td>\n",
" <td>7.2500</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Cumings, Mrs. John Bradley (Florence Briggs Th...</td>\n",
" <td>female</td>\n",
" <td>38.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>PC 17599</td>\n",
" <td>71.2833</td>\n",
" <td>C85</td>\n",
" <td>C</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>3</td>\n",
" <td>Heikkinen, Miss. Laina</td>\n",
" <td>female</td>\n",
" <td>26.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>STON/O2. 3101282</td>\n",
" <td>7.9250</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>4</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Futrelle, Mrs. Jacques Heath (Lily May Peel)</td>\n",
" <td>female</td>\n",
" <td>35.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>113803</td>\n",
" <td>53.1000</td>\n",
" <td>C123</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>5</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Allen, Mr. William Henry</td>\n",
" <td>male</td>\n",
" <td>35.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>373450</td>\n",
" <td>8.0500</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>6</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Moran, Mr. James</td>\n",
" <td>male</td>\n",
" <td>NaN</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>330877</td>\n",
" <td>8.4583</td>\n",
" <td>NaN</td>\n",
" <td>Q</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>7</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>McCarthy, Mr. Timothy J</td>\n",
" <td>male</td>\n",
" <td>54.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>17463</td>\n",
" <td>51.8625</td>\n",
" <td>E46</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>8</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Palsson, Master. Gosta Leonard</td>\n",
" <td>male</td>\n",
" <td>2.0</td>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>349909</td>\n",
" <td>21.0750</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>9</td>\n",
" <td>1</td>\n",
" <td>3</td>\n",
" <td>Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)</td>\n",
" <td>female</td>\n",
" <td>27.0</td>\n",
" <td>0</td>\n",
" <td>2</td>\n",
" <td>347742</td>\n",
" <td>11.1333</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>10</td>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>Nasser, Mrs. Nicholas (Adele Achem)</td>\n",
" <td>female</td>\n",
" <td>14.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>237736</td>\n",
" <td>30.0708</td>\n",
" <td>NaN</td>\n",
" <td>C</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" PassengerId Survived Pclass \\\n",
"0 1 0 3 \n",
"1 2 1 1 \n",
"2 3 1 3 \n",
"3 4 1 1 \n",
"4 5 0 3 \n",
"5 6 0 3 \n",
"6 7 0 1 \n",
"7 8 0 3 \n",
"8 9 1 3 \n",
"9 10 1 2 \n",
"\n",
" Name Sex Age SibSp \\\n",
"0 Braund, Mr. Owen Harris male 22.0 1 \n",
"1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 \n",
"2 Heikkinen, Miss. Laina female 26.0 0 \n",
"3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 \n",
"4 Allen, Mr. William Henry male 35.0 0 \n",
"5 Moran, Mr. James male NaN 0 \n",
"6 McCarthy, Mr. Timothy J male 54.0 0 \n",
"7 Palsson, Master. Gosta Leonard male 2.0 3 \n",
"8 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female 27.0 0 \n",
"9 Nasser, Mrs. Nicholas (Adele Achem) female 14.0 1 \n",
"\n",
" Parch Ticket Fare Cabin Embarked \n",
"0 0 A/5 21171 7.2500 NaN S \n",
"1 0 PC 17599 71.2833 C85 C \n",
"2 0 STON/O2. 3101282 7.9250 NaN S \n",
"3 0 113803 53.1000 C123 S \n",
"4 0 373450 8.0500 NaN S \n",
"5 0 330877 8.4583 NaN Q \n",
"6 0 17463 51.8625 E46 S \n",
"7 1 349909 21.0750 NaN S \n",
"8 2 347742 11.1333 NaN S \n",
"9 0 237736 30.0708 NaN C "
]
},
"execution_count": 40,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Use CountVectorizer to convert text into a matrix of token counts:\n",
"\n",
"- Use single brackets around \"Name\" to pass a Series, because CountVectorizer expects 1-dimensional input\n",
"- Outputs a document-term matrix containing 10 rows (one for each name) and 40 columns (one for each unique word)"
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<10x40 sparse matrix of type '<class 'numpy.int64'>'\n",
"\twith 46 stored elements in Compressed Sparse Row format>"
]
},
"execution_count": 41,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from sklearn.feature_extraction.text import CountVectorizer\n",
"vect = CountVectorizer()\n",
"dtm = vect.fit_transform(df['Name'])\n",
"dtm"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Examine the feature names:\n",
"\n",
"- It found 40 unique words in the \"Name\" Series after lowercasing the words, removing punctuation, and removing words that were only 1 character long"
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['achem', 'adele', 'allen', 'berg', 'bradley', 'braund', 'briggs', 'cumings', 'elisabeth', 'florence', 'futrelle', 'gosta', 'harris', 'heath', 'heikkinen', 'henry', 'jacques', 'james', 'john', 'johnson', 'laina', 'leonard', 'lily', 'master', 'may', 'mccarthy', 'miss', 'moran', 'mr', 'mrs', 'nasser', 'nicholas', 'oscar', 'owen', 'palsson', 'peel', 'thayer', 'timothy', 'vilhelmina', 'william']\n"
]
}
],
"source": [
"print(vect.get_feature_names())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Examine the document-term matrix as a DataFrame:\n",
"\n",
"- In each row, CountVectorizer counted how many times each word appeared\n",
"- For example, the first row contains 36 zeros and 4 ones (under \"braund\", \"mr\", \"owen\", and \"harris\")\n",
"- This encoding is known as the \"Bag of Words\" representation\n",
"- From each of the 40 features, the model can learn the relationship between the target value and how many times that word appeared in each passenger's name"
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>achem</th>\n",
" <th>adele</th>\n",
" <th>allen</th>\n",
" <th>berg</th>\n",
" <th>bradley</th>\n",
" <th>braund</th>\n",
" <th>briggs</th>\n",
" <th>cumings</th>\n",
" <th>elisabeth</th>\n",
" <th>florence</th>\n",
" <th>...</th>\n",
" <th>nasser</th>\n",
" <th>nicholas</th>\n",
" <th>oscar</th>\n",
" <th>owen</th>\n",
" <th>palsson</th>\n",
" <th>peel</th>\n",
" <th>thayer</th>\n",
" <th>timothy</th>\n",
" <th>vilhelmina</th>\n",
" <th>william</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>10 rows × 40 columns</p>\n",
"</div>"
],
"text/plain": [
" achem adele allen berg bradley braund briggs cumings elisabeth \\\n",
"0 0 0 0 0 0 1 0 0 0 \n",
"1 0 0 0 0 1 0 1 1 0 \n",
"2 0 0 0 0 0 0 0 0 0 \n",
"3 0 0 0 0 0 0 0 0 0 \n",
"4 0 0 1 0 0 0 0 0 0 \n",
"5 0 0 0 0 0 0 0 0 0 \n",
"6 0 0 0 0 0 0 0 0 0 \n",
"7 0 0 0 0 0 0 0 0 0 \n",
"8 0 0 0 1 0 0 0 0 1 \n",
"9 1 1 0 0 0 0 0 0 0 \n",
"\n",
" florence ... nasser nicholas oscar owen palsson peel thayer \\\n",
"0 0 ... 0 0 0 1 0 0 0 \n",
"1 1 ... 0 0 0 0 0 0 1 \n",
"2 0 ... 0 0 0 0 0 0 0 \n",
"3 0 ... 0 0 0 0 0 1 0 \n",
"4 0 ... 0 0 0 0 0 0 0 \n",
"5 0 ... 0 0 0 0 0 0 0 \n",
"6 0 ... 0 0 0 0 0 0 0 \n",
"7 0 ... 0 0 0 0 1 0 0 \n",
"8 0 ... 0 0 1 0 0 0 0 \n",
"9 0 ... 1 1 0 0 0 0 0 \n",
"\n",
" timothy vilhelmina william \n",
"0 0 0 0 \n",
"1 0 0 0 \n",
"2 0 0 0 \n",
"3 0 0 0 \n",
"4 0 0 1 \n",
"5 0 0 0 \n",
"6 1 0 0 \n",
"7 0 0 0 \n",
"8 0 1 0 \n",
"9 0 0 0 \n",
"\n",
"[10 rows x 40 columns]"
]
},
"execution_count": 43,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.DataFrame(dtm.toarray(), columns=vect.get_feature_names())"
]
},
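{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an illustrative sketch (using a hypothetical passenger name), \"transform\" reuses the vocabulary learned during \"fit\": known words are counted, while unseen words such as 'Smith' are simply ignored rather than creating new columns:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# hypothetical new name: 'mr' and 'john' are counted, 'smith' is ignored\n",
"pd.DataFrame(vect.transform(['Smith, Mr. John']).toarray(),\n",
"             columns=vect.get_feature_names())"
]
},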
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Update X to include the \"Name\" column:"
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Parch</th>\n",
" <th>Fare</th>\n",
" <th>Embarked</th>\n",
" <th>Sex</th>\n",
" <th>Name</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0</td>\n",
" <td>7.2500</td>\n",
" <td>S</td>\n",
" <td>male</td>\n",
" <td>Braund, Mr. Owen Harris</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>0</td>\n",
" <td>71.2833</td>\n",
" <td>C</td>\n",
" <td>female</td>\n",
" <td>Cumings, Mrs. John Bradley (Florence Briggs Th...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>0</td>\n",
" <td>7.9250</td>\n",
" <td>S</td>\n",
" <td>female</td>\n",
" <td>Heikkinen, Miss. Laina</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>0</td>\n",
" <td>53.1000</td>\n",
" <td>S</td>\n",
" <td>female</td>\n",
" <td>Futrelle, Mrs. Jacques Heath (Lily May Peel)</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>0</td>\n",
" <td>8.0500</td>\n",
" <td>S</td>\n",
" <td>male</td>\n",
" <td>Allen, Mr. William Henry</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>0</td>\n",
" <td>8.4583</td>\n",
" <td>Q</td>\n",
" <td>male</td>\n",
" <td>Moran, Mr. James</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>0</td>\n",
" <td>51.8625</td>\n",
" <td>S</td>\n",
" <td>male</td>\n",
" <td>McCarthy, Mr. Timothy J</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>1</td>\n",
" <td>21.0750</td>\n",
" <td>S</td>\n",
" <td>male</td>\n",
" <td>Palsson, Master. Gosta Leonard</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>2</td>\n",
" <td>11.1333</td>\n",
" <td>S</td>\n",
" <td>female</td>\n",
" <td>Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>0</td>\n",
" <td>30.0708</td>\n",
" <td>C</td>\n",
" <td>female</td>\n",
" <td>Nasser, Mrs. Nicholas (Adele Achem)</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Parch Fare Embarked Sex \\\n",
"0 0 7.2500 S male \n",
"1 0 71.2833 C female \n",
"2 0 7.9250 S female \n",
"3 0 53.1000 S female \n",
"4 0 8.0500 S male \n",
"5 0 8.4583 Q male \n",
"6 0 51.8625 S male \n",
"7 1 21.0750 S male \n",
"8 2 11.1333 S female \n",
"9 0 30.0708 C female \n",
"\n",
" Name \n",
"0 Braund, Mr. Owen Harris \n",
"1 Cumings, Mrs. John Bradley (Florence Briggs Th... \n",
"2 Heikkinen, Miss. Laina \n",
"3 Futrelle, Mrs. Jacques Heath (Lily May Peel) \n",
"4 Allen, Mr. William Henry \n",
"5 Moran, Mr. James \n",
"6 McCarthy, Mr. Timothy J \n",
"7 Palsson, Master. Gosta Leonard \n",
"8 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) \n",
"9 Nasser, Mrs. Nicholas (Adele Achem) "
]
},
"execution_count": 44,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"cols = ['Parch', 'Fare', 'Embarked', 'Sex', 'Name']\n",
"X = df[cols]\n",
"X"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Update the ColumnTransformer:\n",
"\n",
"- Add another tuple to specify that CountVectorizer should be applied to the \"Name\" column\n",
"- There are no brackets around \"Name\" because CountVectorizer expects 1-dimensional input"
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {},
"outputs": [],
"source": [
"ct = make_column_transformer(\n",
" (ohe, ['Embarked', 'Sex']),\n",
" (vect, 'Name'),\n",
" remainder='passthrough')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Perform the transformation:\n",
"\n",
"- Output contains 47 columns in this order: 3 columns for \"Embarked\", 2 for \"Sex\", 40 for \"Name\", 1 for \"Parch\", and 1 for \"Fare\""
]
},
{
"cell_type": "code",
"execution_count": 46,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<10x47 sparse matrix of type '<class 'numpy.float64'>'\n",
"\twith 78 stored elements in Compressed Sparse Row format>"
]
},
"execution_count": 46,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ct.fit_transform(X)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Update the Pipeline to contain the modified ColumnTransformer:"
]
},
{
"cell_type": "code",
"execution_count": 47,
"metadata": {},
"outputs": [],
"source": [
"pipe = make_pipeline(ct, logreg)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Fit the Pipeline and examine the steps:"
]
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'columntransformer': ColumnTransformer(n_jobs=None, remainder='passthrough', sparse_threshold=0.3,\n",
" transformer_weights=None,\n",
" transformers=[('onehotencoder',\n",
" OneHotEncoder(categories='auto', drop=None,\n",
" dtype=<class 'numpy.float64'>,\n",
" handle_unknown='error',\n",
" sparse=True),\n",
" ['Embarked', 'Sex']),\n",
" ('countvectorizer',\n",
" CountVectorizer(analyzer='word', binary=False,\n",
" decode_error='strict',\n",
" dtype=<class 'numpy.int64'>,\n",
" encoding='utf-8',\n",
" input='content',\n",
" lowercase=True, max_df=1.0,\n",
" max_features=None, min_df=1,\n",
" ngram_range=(1, 1),\n",
" preprocessor=None,\n",
" stop_words=None,\n",
" strip_accents=None,\n",
" token_pattern='(?u)\\\\b\\\\w\\\\w+\\\\b',\n",
" tokenizer=None,\n",
" vocabulary=None),\n",
" 'Name')],\n",
" verbose=False),\n",
" 'logisticregression': LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,\n",
" intercept_scaling=1, l1_ratio=None, max_iter=100,\n",
" multi_class='auto', n_jobs=None, penalty='l2',\n",
" random_state=1, solver='liblinear', tol=0.0001, verbose=0,\n",
" warm_start=False)}"
]
},
"execution_count": 48,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pipe.fit(X, y)\n",
"pipe.named_steps"
]
},
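{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sanity check (not in the original notebook), the refitted logistic regression should now have 47 coefficients, one for each column produced by the updated ColumnTransformer:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# the model now learns one coefficient for each of the 47 columns\n",
"pipe.named_steps.logisticregression.coef_.shape"
]
},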
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Update X_new to include the \"Name\" column:"
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {},
"outputs": [],
"source": [
"X_new = df_new[cols]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Use the fitted Pipeline to make predictions for X_new:"
]
},
{
"cell_type": "code",
"execution_count": 50,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([0, 1, 0, 0, 1, 0, 1, 0, 1, 0])"
]
},
"execution_count": 50,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pipe.predict(X_new)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.2"
}
},
"nbformat": 4,
"nbformat_minor": 2
}