Demonstration of the p2p_loans_470k dataset
{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import pathlib\n",
"\n",
"import category_encoders\n",
"import pandas as pd\n",
"import sklearn.impute\n",
"import sklearn.linear_model\n",
"import sklearn.metrics\n",
"import sklearn.pipeline\n",
"import sklearn.preprocessing\n",
"import yaml"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Intro\n",
"This notebook walks through a quick example of using the p2p_lending_470k dataset. We breifly show how to import the dataset into Pandas, remove a few fields that do not play nicely with fitting ML models out-of-the box, and fit a logistic regression model to predict likelihood of loan default.\n",
"\n",
"We use the [`category_encoders`](https://github.com/scikit-learn-contrib/categorical-encoding) package to seamlessly translate categorical features imported using Pandas' native support for categorical variables into one-hot encoded variables that will play nicely with the scikit-learn implementation of logistic regression which we use."
]
},
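{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick illustration of that last point, the toy example below (made-up values, not rows from the dataset) shows how `category_encoders.OneHotEncoder` expands a Pandas categorical column into indicator columns; the real encoding of the dataset happens inside the model pipeline in Step 3."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Toy illustration (made-up values, not rows from the dataset):\n",
"# a Pandas categorical column becomes one indicator column per category.\n",
"toy = pd.DataFrame({\n",
"    'home_ownership': pd.Categorical(['RENT', 'OWN', 'MORTGAGE', 'RENT']),\n",
"    'mths_since_recent_inq': [3.0, 11.0, None, 7.0],\n",
"})\n",
"toy_encoder = category_encoders.OneHotEncoder(cols=['home_ownership'], handle_unknown='ignore')\n",
"toy_encoder.fit_transform(toy)"
]
},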
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"# define paths to all the files\n",
"data_dir = pathlib.Path('p2p_loans_470k/')\n",
"\n",
"feature_schema_yaml = data_dir / 'feature_schema.yaml'\n",
"label_schema_yaml = data_dir / 'label_schema.yaml'\n",
"\n",
"train_feature_csv = data_dir / 'train' / 'train_features.csv.gz'\n",
"train_label_csv = data_dir / 'train' / 'train_labels.csv.gz'\n",
"test_feature_csv = data_dir / 'test' / 'test_features.csv.gz'\n",
"test_label_csv = data_dir / 'test' / 'test_labels.csv.gz'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Step 1: Importing the data\n",
"We take advantage of the Pandas-focused schema files provided with the dataset to import the dataset quickly with all datatype information correctly inferred. To load the schema files, we use the [`pyyaml`](https://github.com/yaml/pyyaml) package. The schema objects are dictionaries with keys that match the exact convention of the Pandas `read_csv` function, so we can simply pass the schemas directly into the `read_csv` call via the python dictionary-expansion syntax."
]
},
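{
"cell_type": "markdown",
"metadata": {},
"source": [
"The exact schema contents ship with the dataset in the YAML files above; purely as a hypothetical illustration, a feature schema is a dictionary along the lines of the sketch below, whose keys mirror `pd.read_csv` keyword arguments and can therefore be passed straight through with `**`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Hypothetical sketch of the schema idea (the real schemas are loaded from the\n",
"# YAML files below and may differ): keys mirror pd.read_csv keyword arguments.\n",
"example_schema = {\n",
"    'dtype': {'home_ownership': 'category', 'mths_since_recent_inq': 'float64'},\n",
"    'parse_dates': ['earliest_cr_line'],\n",
"}\n",
"# Dictionary expansion turns each key into a keyword argument, e.g.:\n",
"# pd.read_csv(train_feature_csv, **example_schema)"
]
},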
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 5.47 s, sys: 420 ms, total: 5.89 s\n",
"Wall time: 5.9 s\n"
]
}
],
"source": [
"%%time\n",
"# load the schemas\n",
"with feature_schema_yaml.open() as yaml_file:\n",
" feature_schema = yaml.load(yaml_file)\n",
"with label_schema_yaml.open() as yaml_file:\n",
" label_schema = yaml.load(yaml_file)\n",
"\n",
"# use them to intelligently import the data\n",
"train_features = pd.read_csv(train_feature_csv, **feature_schema)\n",
"train_labels = pd.read_csv(train_label_csv, **label_schema)\n",
"test_features = pd.read_csv(test_feature_csv, **feature_schema)\n",
"test_labels = pd.read_csv(test_label_csv, **label_schema)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Step 2: Preparing the data\n",
"Although the p2p_loans_470k dataset is constructed to be ML-friendly, it does contain a few features that would require some preprocessing to use in a model. For the sake of this example, we will remove these features instead of trying to engineer features from them that can be fed directly into ML models.\n",
"\n",
"Additionally, since we intend to use a logistic regression model, we have to be cognizant of our features that contain a large number of missing values. Since logistic regression does not support missing values, we choose to drop additional features which are almost always missing. For the features with less frequent missing values, we use the standard approach of mean imputation to fill them in.\n",
"\n",
"Lastly, we split off our target variable 'loan status' from the rest of our labels. The p2p_loans_470k dataset contains a rich body of information on the outcome of the loans it tracks, but for the purpose of creditworthiness modeling we are primarily concerned with the ultimate status of the loan: whether it was fully paid or not."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Dropping: ['emp_title', 'desc', 'title', 'earliest_cr_line']\n"
]
}
],
"source": [
"# drop labels that aren't useful out-of-the-box\n",
"# i.e. those that are strings or dates\n",
"drop_features = train_features.select_dtypes(['datetime', 'O']).columns.tolist()\n",
"print('Dropping: ', drop_features)\n",
"train_features.drop(columns=drop_features, inplace=True)\n",
"test_features.drop(columns=drop_features, inplace=True)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"mths_since_last_record 0.844162\n",
"mths_since_last_major_derog 0.745195\n",
"mths_since_recent_revol_delinq 0.657223\n",
"mths_since_last_delinq 0.515181\n",
"mths_since_recent_inq 0.104034\n",
"dtype: float64"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# identify fields that are almost always missing\n",
"most_missing_cols = train_features.isna().mean().sort_values(ascending=False)\n",
"most_missing_cols.head()"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Dropping: ['mths_since_last_record', 'mths_since_last_major_derog', 'mths_since_recent_revol_delinq', 'mths_since_last_delinq']\n"
]
}
],
"source": [
"drop_features = most_missing_cols.index.tolist()[:4]\n",
"print('Dropping: ', drop_features)\n",
"train_features.drop(columns=drop_features, inplace=True)\n",
"test_features.drop(columns=drop_features, inplace=True)"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"# define target variable: loan status\n",
"train_target = train_labels['loan_status']\n",
"test_target = test_labels['loan_status']"
]
},
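{
"cell_type": "markdown",
"metadata": {},
"source": [
"As mentioned above, mean imputation is handled for us by `SimpleImputer` inside the pipeline in Step 3. Conceptually it amounts to the single-column sketch below (illustration only; the column is just one of the remaining numeric features with missing values)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustration of mean imputation on a single column: missing values are\n",
"# replaced with the *training-set* mean, in both the train and test data.\n",
"col = 'mths_since_recent_inq'\n",
"train_mean = train_features[col].mean()\n",
"imputed_train = train_features[col].fillna(train_mean)\n",
"imputed_test = test_features[col].fillna(train_mean)\n",
"print(f'{col}: filling {train_features[col].isna().mean():.1%} missing values with {train_mean:.2f}')"
]
},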
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Step 3: Training a model\n",
"Here we define our model as a scikit-learn pipeline. The core of the model is a simple logistic regression with default hyperparameters, but in order to get our data into a format appropriate to feed into this model, we need to run a few data-dependent preprocessing steps.\n",
"1. **One-hot encoding of categorical variables:** A logistic regression model has no way of handling categorical features. It only deals with continuous variables. In order to translate our categories into something the model understands, we one-hot encode them.\n",
"1. **Mean imputation of missing values:** A logistic regression model has no way of dealing with missing values. We use mean imputation to fill in missing values with the mean of that feature from the training set.\n",
"1. **Standard scaling:** Rescaling the model inputs to have zero mean and unit variance is an important step to ensure fast convergence in the model fitting step as well as appropriate regularization. Since the scikit-learn implementation of logistic regression by default comes with l2 regularization, we must be very careful that the model inputs are all roughly on the same scale as one another. \n",
"\n",
"Thanks to the clean scikit-learn API, we can combine all of these preprocessing steps with the model to build a single pipeline object. We then fit this model on the training data."
]
},
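{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a brief aside on point 3: scikit-learn's logistic regression minimizes (roughly, ignoring the intercept) the l2-regularized objective\n",
"\n",
"$$\\min_w \\sum_i \\log\\left(1 + e^{-y_i w^\\top x_i}\\right) + \\frac{1}{2C}\\lVert w \\rVert_2^2,$$\n",
"\n",
"and the penalty term treats every coefficient identically. A feature measured on a large numeric scale only needs a tiny coefficient, so without rescaling it is penalized very differently from a feature on a small scale; standardizing every input to zero mean and unit variance puts the coefficients on a comparable footing."
]
},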
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"model = sklearn.pipeline.Pipeline([\n",
" ('onehot_encoding', category_encoders.OneHotEncoder(\n",
" cols=train_features.select_dtypes('category').columns.tolist(),\n",
" handle_unknown='ignore')),\n",
" ('mean_imputation', sklearn.impute.SimpleImputer()),\n",
" ('standard_scaler', sklearn.preprocessing.StandardScaler()),\n",
" ('logistic_regression', sklearn.linear_model.LogisticRegression(\n",
" solver='lbfgs', max_iter=500))\n",
"])"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 47.5 s, sys: 5.67 s, total: 53.2 s\n",
"Wall time: 13.6 s\n"
]
},
{
"data": {
"text/plain": [
"Pipeline(memory=None,\n",
" steps=[('onehot_encoding', OneHotEncoder(cols=['emp_length', 'home_ownership', 'purpose', 'addr_state'],\n",
" drop_invariant=False, handle_unknown='ignore', impute_missing=True,\n",
" return_df=True, use_cat_names=False, verbose=0)), ('mean_imputation', SimpleImputer(copy=True, fill_value=None, m...enalty='l2', random_state=None, solver='lbfgs',\n",
" tol=0.0001, verbose=0, warm_start=False))])"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%time\n",
"model.fit(train_features, train_target)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Step 4: Evaluation\n",
"Now that we have fit our model, we can measure the model's performance on out-of-sample data. Here we choose the area under the ROC curve as our performance metric (ROC-AUC)."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Our logistic regression scores an ROC-AUC of 0.6838\n"
]
}
],
"source": [
"predictions = model.predict_proba(test_features)[:, 1]\n",
"roc_auc = sklearn.metrics.roc_auc_score(test_target, predictions)\n",
"print(f'Our logistic regression scores an ROC-AUC of {roc_auc:.4f}')"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.2"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
@works123

Thanks for sharing this notebook. It provides an excellent start for working with the data. Let's see what else can be done.
