wphicks/Preprocessing.ipynb

## Preprocessing.ipynb
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# cuML Preprocessing\n",
    "Users of cuML are certainly familiar with its ability to run machine learning models on GPUs and the significant training and inference speedup that can entail, but the models themselves are only part of the story. In this notebook, we will demonstrate how cuML allows you to develop an entire machine learning _pipeline_ in order to preprocess and prepare your data without _ever_ leaving the GPU.\n",
    "\n",
    "We will use the [BNP Paribas Cardif Claims Management dataset](https://www.kaggle.com/c/bnp-paribas-cardif-claims-management) to showcase a few of the many methods that cuML offers for GPU-accelerated feature engineering. This dataset offers an interesting challenge because:\n",
    "1. It is somewhat messy, including missing data of various kinds.\n",
    "2. It includes both quantitative data (represented as floating point values) and categorical data (represented as both integers and strings).\n",
    "3. It is anonymized, so we cannot use _a priori_ domain-specific knowledge to guide our approach.\n",
    "\n",
    "Our goal here is not necessarily to achieve the best possible model performance but to showcase the cuML features that you could use to improve model performance on your own. For a deeper dive into how to maximize performance on this dataset, check out the solutions and associated discussion for [the top Kaggle entries](https://www.kaggle.com/c/bnp-paribas-cardif-claims-management/leaderboard)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Data Ingest\n",
    "Our first step is to acquire the data and read it into a data frame for subsequent processing. This process should be quite familiar for Pandas users, though we will be making use of cuDF, the equivalent GPU-accelerated module."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# To acquire the dataset, we will make use of the Kaggle CLI tool.\n",
    "# If you do not have this tool set up, you can download the data directly\n",
    "# from the Kaggle competition page: https://www.kaggle.com/c/bnp-paribas-cardif-claims-management/data\n",
    "# Note that you may still need to visit this page even if you have the CLI\n",
    "# tool in order to agree to the terms of data usage.\n",
    "\n",
    "!kaggle competitions download -c bnp-paribas-cardif-claims-management\n",
    "!unzip -o bnp-paribas-cardif-claims-management.zip"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import cudf\n",
    "\n",
    "data_cudf = cudf.read_csv(\"./train.csv.zip\")\n",
    "data_pd = data_cudf.to_pandas()\n",
    "\n",
    "data_cudf.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Looking at the first few rows of these data, we can already understand some of the problems we might expect in working with the full dataset. We have a \"target\" column representing a binary classification target that we would like to predict with our model. As input to that model, we have over a hundred features, some represented as floats, some as ints, and some as strings. We can also see that quite a bit of the data is missing, as denoted by the numerous \"\\<NA\\>\" entries."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. Evaluation Procedure\n",
    "As a general principle, it is helpful to clearly define an evaluation procedure before jumping into model building and training. In this case, we are interested in finding a robust preprocessing protocol to apply to unseen data, so we will perform [k-fold cross-validation](https://en.wikipedia.org/wiki/Cross-validation_(statistics)#k-fold_cross-validation) and average performance across folds.\n",
    "\n",
    "Because the RAPIDS packages have maintained such close compatibility with their non-GPU-accelerated counterparts, sklearn's k-fold cross-validation implementation can be directly applied to our data on the GPU. Moreover, this is one of several sklearn algorithms that can be applied without incurring any device-to-host copies, so we will use it directly in our evaluation protocol.\n",
    "\n",
    "For demonstrations purposes, we will use accuracy (the default scoring metric for random forest models in sklearn) as our metric, but remember that accuracy [should](https://en.wikipedia.org/wiki/Accuracy_paradox) [not](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0084217) [be](https://www.fharrell.com/post/class-damage/) [used](https://medium.com/@limavallantin/why-you-should-not-trust-only-in-accuracy-to-measure-machine-learning-performance-a72cf00b4516) as a model-selection metric for any serious application."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import warnings\n",
    "\n",
    "import numpy\n",
    "from sklearn.model_selection import KFold\n",
    "\n",
    "def evaluate(pipeline, data, n_splits=5, target_col='target'):\n",
    "    \"\"\"\"\"\"\n",
    "    x = data[data.columns.difference([target_col])]\n",
    "    y = data[[target_col]]\n",
    "\n",
    "    folds = KFold(n_splits=n_splits, shuffle=False)\n",
    "    scores = numpy.empty(folds.get_n_splits(x), dtype=numpy.float32)\n",
    "    for i, (train_indices, test_indices) in enumerate(folds.split(x)):\n",
    "        x_train, x_test = x.iloc[train_indices], x.iloc[test_indices]\n",
    "        y_train, y_test = y.iloc[train_indices], y.iloc[test_indices]\n",
    "        pipeline.fit(x_train, y_train)\n",
    "        scores[i] = pipeline.score(x_test, y_test)\n",
    "\n",
    "    return numpy.average(scores)\n",
    "\n",
    "def cu_evaluate(pipeline):\n",
    "    \"\"\"Convenience wrapper for evaluating cuML-based pipelines\"\"\"\n",
    "    with warnings.catch_warnings():\n",
    "        warnings.simplefilter(\"ignore\")\n",
    "        return evaluate(pipeline, data_cudf)\n",
    "\n",
    "def sk_evaluate(pipeline):\n",
    "    \"\"\"Convenience wrapper for evaluating sklearn-based pipelines\"\"\"\n",
    "    # Suppress sklearn data conversion warnings\n",
    "    with warnings.catch_warnings():\n",
    "        warnings.simplefilter(\"ignore\")\n",
    "        return evaluate(pipeline, data_pd)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "With these two convenience functions, we can quickly assess performance of full processing-and-classification pipelines with a single call."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. The Model\n",
    "For the moment, we are focusing on the preprocessing portion of our pipeline, so we will stick with a random forest model with a fixed set of hyperparameters. We will set `n_jobs` to `-1` for the sklearn model in order to make use of all available CPU processors, but we will otherwise stick with defaults.\n",
    "\n",
    "You will probably notice a small difference in the accuracy achieved by the cuML random forest implementation and that achieved by sklearn. RAPIDS is in the process of transitioning to a new random forest implementation that performs much more comparably to sklearn. If you'd like to try out this (currently experimental) implementation, uncommon the indicated lines below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from cuml.ensemble import RandomForestClassifier as cuRandomForestClassifier\n",
    "from sklearn.ensemble import RandomForestClassifier as skRandomForestClassifier\n",
    "\n",
    "cu_classifier = cuRandomForestClassifier()\n",
    "sk_classifier = skRandomForestClassifier(n_jobs=-1)\n",
    "\n",
    "\n",
    "# Uncomment the following lines to try out the new experimental RF\n",
    "# implementation in cuML\n",
    "\n",
    "# cu_classifier = cuRandomForestClassifier(max_features=1.0,\n",
    "#                                          max_depth=13,\n",
    "#                                          use_experimental_backend=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Intermezzo: Helper Code\n",
    "\n",
    "One of the standout features of sklearn is its consistent API for algorithms that fill the same role. Introducing a new algorithm that can be slotted into an sklearn pipeline is as easy as defining a class that fits that API. In this section, we'll define a few helper classes that will help us easily apply whatever preprocessing transformations we desire as part of our pipeline.\n",
    "\n",
    "Feel free to skip over the details of these implementations; the docstrings should give a sufficient sense of their purpose and usage."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "hide-cell"
    ]
   },
   "outputs": [],
   "source": [
    "import pandas\n",
    "from sklearn.base import BaseEstimator, TransformerMixin\n",
    "\n",
    "class LambdaTransformer(BaseEstimator, TransformerMixin):\n",
    "    \"\"\"An sklearn-compatible class for simple transformation functions\n",
    "    \n",
    "    This helper class is useful for transforming data with a straightforward\n",
    "    function requiring no fitting\n",
    "    \"\"\"\n",
    "    def __init__(self, transform_function):\n",
    "        self.transform_function = transform_function\n",
    "\n",
    "    def fit(self, X, y=None):\n",
    "        return self\n",
    "\n",
    "    def transform(self, X, y=None):\n",
    "        return self.transform_function(X)\n",
    "\n",
    "\n",
    "# Workaround for https://github.com/rapidsai/cuml/issues/3041\n",
    "\n",
    "class PerFeatureTransformer(BaseEstimator, TransformerMixin):\n",
    "    \"\"\"An sklearn-compatible class for fitting and transforming on\n",
    "    each feature independently\n",
    "    \n",
    "    Some preprocessing algorithms need to be applied independently to\n",
    "    each feature. This wrapper facilitates that process.\n",
    "    \"\"\"\n",
    "    def __init__(self,\n",
    "                 transformer_class,\n",
    "                 transformer_args=(),\n",
    "                 transformer_kwargs={},\n",
    "                 copy=True):\n",
    "        self.transformer_class = transformer_class\n",
    "        self.transformer_args = transformer_args\n",
    "        self.transformer_kwargs = transformer_kwargs\n",
    "        self.transformers = {}\n",
    "        self.copy = copy\n",
    "        \n",
    "    def fit(self, X, y=None):\n",
    "        for col in X.columns:\n",
    "            self.transformers[col] = self.transformer_class(\n",
    "                *self.transformer_args,\n",
    "                **self.transformer_kwargs\n",
    "            )\n",
    "            try:\n",
    "                self.transformers[col].fit(X[col], y=y)\n",
    "            except TypeError:  # https://github.com/rapidsai/cuml/issues/3053\n",
    "                self.transformers[col].fit(X[col])\n",
    "        return self\n",
    "    \n",
    "    def transform(self, X, y=None):\n",
    "        if self.copy:\n",
    "            X = X.copy()\n",
    "        for col in X.columns:\n",
    "            try:\n",
    "                X[col] = self.transformers[col].transform(X[col], y=y)\n",
    "            except TypeError:  # https://github.com/rapidsai/cuml/issues/3053\n",
    "                X[col] = self.transformers[col].transform(X[col])\n",
    "            \n",
    "        return X\n",
    "    \n",
    "    def fit_transform(self, X, y=None):\n",
    "        for col in X.columns:\n",
    "            self.transformers[col] = self.transformer_class(\n",
    "                *self.transformer_args,\n",
    "                **self.transformer_kwargs\n",
    "            )\n",
    "            try:\n",
    "                X[col] = self.transformers[col].fit_transform(X[col], y=y)\n",
    "            except TypeError:  # https://github.com/rapidsai/cuml/issues/3053\n",
    "                X[col] = self.transformers[col].fit_transform(X[col])\n",
    "        return X\n",
    "\n",
    "\n",
    "\n",
    "class FeatureGenerator(BaseEstimator, TransformerMixin):\n",
    "    \"\"\"An sklearn-compatible class for adding new features to existing\n",
    "    data\n",
    "    \"\"\"\n",
    "    def __init__(self,\n",
    "                 generator,\n",
    "                 include_dtypes=None,\n",
    "                 exclude_dtypes=None,\n",
    "                 columns=None,\n",
    "                 copy=True):\n",
    "        self.include_dtypes = include_dtypes\n",
    "        self.exclude_dtypes = exclude_dtypes\n",
    "        self.columns = columns\n",
    "        self.copy = copy\n",
    "        self.generator = generator\n",
    "        \n",
    "    def _get_subset(self, X):\n",
    "        subset = X\n",
    "        if self.columns is not None:\n",
    "            subset = X[self.columns]\n",
    "        if self.include_dtypes or self.exclude_dtypes:\n",
    "            subset = subset.select_dtypes(\n",
    "                include=self.include_dtypes,\n",
    "                exclude=self.exclude_dtypes\n",
    "            )\n",
    "        return subset\n",
    "    \n",
    "    def fit(self, X, y=None):\n",
    "        subset = self._get_subset(X)\n",
    "        try:\n",
    "            self.generator.fit(subset, y=y)\n",
    "        except TypeError:  # https://github.com/rapidsai/cuml/issues/3053\n",
    "            self.generator.fit(subset)\n",
    "    \n",
    "    def transform(self, X, y=None):\n",
    "        subset = self._get_subset(X)\n",
    "        try:\n",
    "            new_features = self.generator.transform(subset, y=y)\n",
    "        except TypeError:  # https://github.com/rapidsai/cuml/issues/3053\n",
    "            new_features = self.generator.transform(subset)\n",
    "        if isinstance(X, cudf.DataFrame):\n",
    "            return cudf.concat((X.reset_index(), new_features), axis=1)\n",
    "        else:\n",
    "            new_features = pandas.DataFrame(\n",
    "                new_features,\n",
    "                columns=[\"new_{}\".format(i) for i in range(new_features.shape[1])]\n",
    "            )\n",
    "            return pandas.concat((X.reset_index(), new_features), axis=1)\n",
    "    \n",
    "    def fit_transform(self, X, y=None):\n",
    "        subset = self._get_subset(X)\n",
    "        try:\n",
    "            new_features = self.generator.fit_transform(subset, y=y)\n",
    "        except TypeError:  # https://github.com/rapidsai/cuml/issues/3053\n",
    "            new_features = self.generator.fit_transform(subset)\n",
    "        if isinstance(X, cudf.DataFrame):\n",
    "            return cudf.concat((X.reset_index(), new_features), axis=1)\n",
    "        else:\n",
    "            new_features = pandas.DataFrame(\n",
    "                new_features,\n",
    "                columns=[\"new_{}\".format(i) for i in range(new_features.shape[1])]\n",
    "            )\n",
    "            return pandas.concat((X.reset_index(), new_features), axis=1)\n",
    "\n",
    "\n",
    "class SubsetTransformer(BaseEstimator, TransformerMixin):\n",
    "    \"\"\"An sklearn-compatible class for fitting and transforming on\n",
    "    a subset of features\n",
    "    \n",
    "    This allows a transformation to be applied to only data in a\n",
    "    specific column of a dataframe or only data of a particular dtype.\n",
    "    \"\"\"\n",
    "    def __init__(self,\n",
    "                 transformer,\n",
    "                 include_dtypes=None,\n",
    "                 exclude_dtypes=None,\n",
    "                 columns=None,\n",
    "                 copy=True):\n",
    "        self.transformer = transformer\n",
    "        self.include_dtypes = include_dtypes\n",
    "        self.exclude_dtypes = exclude_dtypes\n",
    "        self.columns = columns\n",
    "        self.copy = copy\n",
    "        \n",
    "    def _get_subset(self, X):\n",
    "        subset = X\n",
    "        if self.columns is not None:\n",
    "            subset = X[self.columns]\n",
    "        if self.include_dtypes or self.exclude_dtypes:\n",
    "            subset = subset.select_dtypes(\n",
    "                include=self.include_dtypes,\n",
    "                exclude=self.exclude_dtypes\n",
    "            )\n",
    "        return subset\n",
    "        \n",
    "    def fit(self, X, y=None):\n",
    "        subset = self._get_subset(X)\n",
    "        try:\n",
    "            self.transformer.fit(subset, y=y)\n",
    "        except TypeError:  # https://github.com/rapidsai/cuml/issues/3053\n",
    "            self.transformer.fit(subset)\n",
    "        return self\n",
    "    \n",
    "    def transform(self, X, y=None):\n",
    "        if self.copy:\n",
    "            X = X.copy()\n",
    "        subset = self._get_subset(X)\n",
    "        try:\n",
    "            X[subset.columns] = self.transformer.transform(subset, y=y)\n",
    "        except TypeError:  # https://github.com/rapidsai/cuml/issues/3053\n",
    "            X[subset.columns] = self.transformer.transform(subset)\n",
    "        \n",
    "        return X\n",
    "    \n",
    "    def fit_transform(self, X, y=None):\n",
    "        if self.copy:\n",
    "            X = X.copy()\n",
    "        subset = self._get_subset(X)\n",
    "        try:\n",
    "            X[subset.columns] = self.transformer.fit_transform(subset, y=y)\n",
    "        except TypeError:  # https://github.com/rapidsai/cuml/issues/3053\n",
    "            X[subset.columns] = self.transformer.fit_transform(subset)\n",
    "        \n",
    "        return X\n",
    "\n",
    "\n",
    "class DeviceSpecificTransformer(BaseEstimator, TransformerMixin):\n",
    "    \"\"\"An sklearn-compatible class for performing different\n",
    "    transformations based on whether it receives a cuDF or Pandas\n",
    "    dataframe\"\"\"\n",
    "    def __init__(self, pandas_transformer, cudf_transformer):\n",
    "        self.pandas_transformer = pandas_transformer\n",
    "        self.cudf_transformer = cudf_transformer\n",
    "        self.transformer = None\n",
    "        self.is_cuml_transformer = None\n",
    "\n",
    "    def fit(self, X, y=None):\n",
    "        if hasattr(X, 'to_pandas'):\n",
    "            self.transformer = self.cudf_transformer\n",
    "            self.is_cuml_transformer = True\n",
    "        else:\n",
    "            self.transformer = self.pandas_transformer\n",
    "            self.is_cuml_transformer = False\n",
    "            \n",
    "        try:\n",
    "            self.transformer.fit(X, y=y)\n",
    "        except TypeError:  # https://github.com/rapidsai/cuml/issues/3053\n",
    "            self.transformer.fit(X)\n",
    "        return self\n",
    "\n",
    "    def transform(self, X, y=None):\n",
    "        try:\n",
    "            return self.transformer.transform(X, y=y)\n",
    "        except TypeError:  # https://github.com/rapidsai/cuml/issues/3053\n",
    "            return self.transformer.transform(X)\n",
    "\n",
    "    def fit_transform(self, X, y=None):\n",
    "        if hasattr(X, 'to_pandas'):\n",
    "            self.transformer = self.cudf_transformer\n",
    "        else:\n",
    "            self.transformer = self.pandas_transformer\n",
    "\n",
    "        try:\n",
    "            return self.transformer.fit_transform(X, y=y)\n",
    "        except TypeError:  # https://github.com/rapidsai/cuml/issues/3053\n",
    "            return self.transformer.fit_transform(X)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that much of the logic here is necessary because of the relative messiness of the dataset we intend to work with or because we will be using these transformers in both cuML and sklearn pipelines. Simpler, cleaner datasets may not require any of this helper logic if they are processed solely with cuML"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4. Feature Engineering\n",
    "With an evaluation protocol in place, a fixed model defined, and helper classes written, we can now turn to the actual task of cleaning up our data and exploring the available tools for creating useful features."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 4.1 A Naive Approach\n",
    "\n",
    "We'll start by defining a few cleaning steps that will be needed simply to pass off the data to our classifiers. Specifically, we will:\n",
    "1. Drop the `ID` column, since we do not want to take the arbitrarily-assigned ID into account in our training.\n",
    "2. Replace null and NaN values with something our classifier can work with.\n",
    "3. Drop any non-numeric features, since our classifier does not currently support such data.\n",
    "4. Convert remaining (numeric) features to 32-bit floats, since cuML's random forest implementation requires this.\n",
    "\n",
    "This approach is quite naive. Categorical integer data is treated in the same way as quantitative float data. Categorical strings are ignored entirely, and missing data is replaced with a constant value that may not be appropriate in the context of the full dataframe. We will address all of these concerns and more as we build up more complex preprocessing pipelines."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "drop_id = LambdaTransformer(lambda x: x[x.columns.difference(['ID'])])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "replace_numeric_na = SubsetTransformer(\n",
    "    LambdaTransformer(lambda x: x.fillna(0)),\n",
    "    include_dtypes=['integer', 'floating']\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "replace_string_na = SubsetTransformer(\n",
    "    LambdaTransformer(lambda x: x.fillna('UNKNOWN')),\n",
    "    include_dtypes=['object']\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "filter_numeric = LambdaTransformer(lambda x: x.select_dtypes('number'))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "convert_to_float32 = LambdaTransformer(lambda x: x.astype('float32'))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "preprocessing_steps = [\n",
    "    (\"Drop ID\", drop_id),\n",
    "    (\"Replace numeric NA\", replace_numeric_na),\n",
    "    (\"Replace string NA\", replace_string_na),\n",
    "    (\"Numeric filter\", filter_numeric),\n",
    "    (\"32-bit Conversion\", convert_to_float32)\n",
    "]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "With these naive preprocessing steps defined, let's create an sklearn `Pipeline` for both the cuML classifier and the sklearn classifier. We can then apply our previously-defined evaluation protocol to each and assess both runtime and accuracy performance."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.pipeline import Pipeline\n",
    "cuml_pipeline = Pipeline(\n",
    "    preprocessing_steps + [(\"Classifier\", cu_classifier)],\n",
    "    verbose=1  # Detailed timing information\n",
    ")\n",
    "sklearn_pipeline = Pipeline(\n",
    "    preprocessing_steps + [(\"Classifier\", sk_classifier)],\n",
    "    verbose=1  # Detailed timing information\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%time cu_evaluate(cuml_pipeline)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%script false --no-raise-error\n",
    "# WARNING: Takes several minutes\n",
    "%time sk_evaluate(sklearn_pipeline)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Given the known runtime improvement of cuML's GPU-accelerated random forest implementation, it is no surprise that the cuML pipeline executed faster than its CPU-only equivalent. Digging into the timings of individual pipeline steps, we do indeed see that the majority of our performance gain with cuML comes from the classifier itself, but we also see some improvement in runtimes for the preprocessing steps. We'll take a closer look at that once we have a slightly more interesting pipeline in place.\n",
    "\n",
    "Since the sklearn pipeline takes several minutes to run and the observed accuracy is similar to what we see with cuML, most of the remaining sklearn cells in this notebook will be disabled with the `%%script false --no-raise-error` magic tag. You can simply delete this tag from the cell if you wish to run the sklearn version of a particular section of code."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 4.2 Data Imputation\n",
    "\n",
    "As a marginal improvement on our initial approach, let's use a slightly more sophisticated method for dealing with missing values. Specifically, let's fill in missing quantitative features with the mean value for that feature in our training data. For this, we will make use of the `SimpleImputer` class, newly available in RAPIDS v0.16 through the `cuml.experimental.preprocessing` module.\n",
    "\n",
    "#### Aside: cuML's Experimental Preprocessing\n",
    "It is no secret that cuML stands on the shoulders of the sklearn giant and benefits enormously from sklearn's brilliant design, thoughtful implementation, and enthusiastic community. In v0.16, cuML has benefitted even more directly through its new (and currently experimental) preprocessing features.\n",
    "\n",
    "Because cuML has maintained such strong compatibility with sklearn, the RAPIDS team was able to incorporate sklearn code (still distributed under the terms of the sklearn license, of course) directly into cuML with only minor modifications. This became cuML's experimental preprocessing module. So if you appreciate having these features available in cuML, remember that it is thanks to the consistently stellar work of the sklearn developers and community, and be sure to [cite sklearn](https://scikit-learn.org/stable/about.html#citing-scikit-learn) in any scientific publications based on these features.\n",
    "\n",
    "As an experimental feature, we are actively seeking feedback on these newly-introduced preprocessing algorithms. Please do report any problems you encounter via the [cuML issue tracker](https://github.com/rapidsai/cuml/issues)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.impute import SimpleImputer as skSimpleImputer\n",
    "from cuml.experimental.preprocessing import SimpleImputer as cuSimpleImputer\n",
    "\n",
    "sk_mean_imputer = SubsetTransformer(\n",
    "    skSimpleImputer(missing_values=numpy.nan, strategy='mean'),\n",
    "    include_dtypes=['floating']\n",
    ")\n",
    "cu_mean_imputer = SubsetTransformer(\n",
    "    cuSimpleImputer(missing_values=numpy.nan, strategy='mean'),\n",
    "    include_dtypes=['floating']\n",
    ")\n",
    "mean_imputer = DeviceSpecificTransformer(sk_mean_imputer, cu_mean_imputer)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Because cupy does not currently support null values, we will need to add one other step to our pipeline: converting null data to NaNs or another known invalid value before performing imputation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def _replace_nulls(data):\n",
    "    data = data.copy()\n",
    "    replacements = [\n",
    "        (numpy.floating, numpy.nan),\n",
    "        (numpy.integer, -1),\n",
    "        (object, 'UNKNOWN')\n",
    "    ]\n",
    "    for col_type, value in replacements:\n",
    "        subset = data.select_dtypes(col_type)\n",
    "        data[subset.columns] = subset.fillna(value)\n",
    "    return data\n",
    "\n",
    "null_filler = LambdaTransformer(_replace_nulls)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "preprocessing_steps = [\n",
    "    (\"Drop ID\", drop_id),\n",
    "    (\"Replace nulls\", null_filler),\n",
    "    (\"Imputation\", mean_imputer),\n",
    "    (\"Numeric filter\", filter_numeric),\n",
    "    (\"32-bit Conversion\", convert_to_float32)\n",
    "]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "cuml_pipeline = Pipeline(preprocessing_steps + [(\"Classifier\", cu_classifier)])\n",
    "sklearn_pipeline = Pipeline(preprocessing_steps + [(\"Classifier\", sk_classifier)])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "cu_evaluate(cuml_pipeline)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%script false --no-raise-error\n",
    "sk_evaluate(sklearn_pipeline)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We see an almost negligible increase in accuracy using mean imputation, but you can try experimenting with other imputation strategies, including \"median\" and \"most_frequent\" to see what impact it has on performance."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 4.3 Scaling\n",
    "For some machine learning algorithms, it is helpful to adjust the average value of a feature and scale it so that its \"spread\" is comparable to other features. There are a few strategies for doing this, but one of the most common is to subtract off the mean and then divide by the variance. We can do precisely this using the `StandardScaler` algorithm."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.preprocessing import StandardScaler as skStandardScaler\n",
    "from cuml.experimental.preprocessing import StandardScaler as cuStandardScaler\n",
    "\n",
    "sk_scaler = SubsetTransformer(\n",
    "    skStandardScaler(),\n",
    "    include_dtypes=['floating']\n",
    ")\n",
    "cu_scaler = SubsetTransformer(\n",
    "    cuStandardScaler(),\n",
    "    include_dtypes=['floating']\n",
    ")\n",
    "scaler = DeviceSpecificTransformer(sk_scaler, cu_scaler)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "preprocessing_steps = [\n",
    "    (\"Drop ID\", drop_id),\n",
    "    (\"Replace nulls\", null_filler),\n",
    "    (\"Imputation\", mean_imputer),\n",
    "    (\"Scaling\", scaler),\n",
    "    (\"Numeric filter\", filter_numeric),\n",
    "    (\"32-bit Conversion\", convert_to_float32)\n",
    "]\n",
    "cuml_pipeline = Pipeline(preprocessing_steps + [(\"Classifier\", cu_classifier)])\n",
    "sklearn_pipeline = Pipeline(preprocessing_steps + [(\"Classifier\", sk_classifier)])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "cu_evaluate(cuml_pipeline)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%script false --no-raise-error\n",
    "sk_evaluate(sklearn_pipeline)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In general, random forest models do not benefit from this kind of scaling, but other model types, especially logistic regression and neural networks can see improved accuracy or better convergence with this sort of preprocessing."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 4.4 Encoding Categorical Data\n",
    "Up to this point, we have not taken advantage of the categorical features in our data at all. In order to do so, we must encode them in some numeric representation. cuML offers a number of strategies for doing this, including one-hot encoding, label encoding, and target encoding. We will demonstrate just one of these algorithms (label encoding) here.\n",
    "\n",
    "Using encoders on different training and testing data can be tricky because our training split may be missing some labels from our testing split. cuML's `LabelEncoder` includes the `handle_unknown` param which allows us to mark previously-unseen categories as null. Since all integer entries in our dataset are whole numbers, we can then replace these nulls with a value of -1 using two quick helper transformations.\n",
    "\n",
    "In sklearn, we must use a slightly different workaround."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from cuml.preprocessing import LabelEncoder as cuLabelEncoder\n",
    "from sklearn.preprocessing import LabelEncoder as skLabelEncoder\n",
    "\n",
    "cu_encoder = SubsetTransformer(\n",
    "    PerFeatureTransformer(cuLabelEncoder, transformer_kwargs={'handle_unknown': 'ignore'}),\n",
    "    include_dtypes=['integer', 'object']\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# cuML workarounds for unseen data\n",
    "def standard_ints(data):\n",
    "    subset = data.select_dtypes('integer')\n",
    "    data[subset.columns] = subset.astype('int32')\n",
    "    return data\n",
    "\n",
    "int_standardizer = LambdaTransformer(standard_ints)\n",
    "replace_unknown_labels = LambdaTransformer(lambda x: x.fillna(-1))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# sklearn workarounds for unseen data\n",
    "class SKUnknownEncoder(BaseEstimator, TransformerMixin):\n",
    "    UNKNOWN = 'UNKNOWN'\n",
    "    \n",
    "    def __init__(self, base_encoder, copy=True):\n",
    "        self.base_encoder = base_encoder\n",
    "        self.copy = copy\n",
    "        \n",
    "    def fit(self, X, y=None):\n",
    "        self.base_encoder.fit(list(X) + [self.UNKNOWN])\n",
    "    \n",
    "    def transform(self, X):\n",
    "        if self.copy:\n",
    "            X = X.copy()\n",
    "        missing = set(X.unique()) - set(self.base_encoder.classes_)\n",
    "        X = X.replace(list(missing), self.UNKNOWN)\n",
    "        return self.base_encoder.transform(X)\n",
    "    \n",
    "    def fit_transform(self, X, y=None):\n",
    "        return self.base_encoder.fit_transform(X)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sk_encoder = SubsetTransformer(\n",
    "    PerFeatureTransformer(SKUnknownEncoder, transformer_args=(skLabelEncoder(),)),\n",
    "    include_dtypes=['integer', 'object']\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "label_encoder = DeviceSpecificTransformer(sk_encoder, cu_encoder)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "preprocessing_steps = [\n",
    "    (\"Drop ID\", drop_id),\n",
    "    (\"Replace nulls\", null_filler),\n",
    "    (\"Encoding\", label_encoder),\n",
    "    (\"Imputation\", mean_imputer),\n",
    "    (\"Standardize ints\", int_standardizer),\n",
    "    (\"Handle unknown labels\", replace_unknown_labels),\n",
    "    (\"Scaling\", scaler),\n",
    "    (\"Numeric filter\", filter_numeric),\n",
    "    (\"32-bit Conversion\", convert_to_float32)\n",
    "]\n",
    "cuml_pipeline = Pipeline(preprocessing_steps + [(\"Classifier\", cu_classifier)])\n",
    "sklearn_pipeline = Pipeline(preprocessing_steps + [(\"Classifier\", sk_classifier)])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "cu_evaluate(cuml_pipeline)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%script false --no-raise-error\n",
    "sk_evaluate(sklearn_pipeline)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 4.5 Discretization\n",
    "While encoding gives us a way of converting discrete labels into numeric values, it is sometimes useful to do the reverse. When quantitative data falls into obviously useful categories (like \"zero\" vs \"non-zero\") or when the noise in quantitative data does not yield meaningful information about our prediction target, it can help our model to preprocess that quantitative data by converting it into categorical \"bins\". We will give just one example of this (`KBinsDiscretizer`), which we will naively apply across all categorical data. For more serious feature engineering, we would perform a more careful analysis of the meaning and distribution of each quantitative feature."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.preprocessing import KBinsDiscretizer as skKBinsDiscretizer\n",
    "from cuml.experimental.preprocessing import KBinsDiscretizer as cuKBinsDiscretizer\n",
    "\n",
    "sk_discretizer = SubsetTransformer(\n",
    "    skKBinsDiscretizer(encode='ordinal'),\n",
    "    include_dtypes=['floating']\n",
    ")\n",
    "cu_discretizer = SubsetTransformer(\n",
    "    cuKBinsDiscretizer(encode='ordinal'),\n",
    "    include_dtypes=['floating']\n",
    ")\n",
    "discretizer = DeviceSpecificTransformer(sk_discretizer, cu_discretizer)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "preprocessing_steps = [\n",
    "    (\"Drop ID\", drop_id),\n",
    "    (\"Replace nulls\", null_filler),\n",
    "    (\"Encoding\", label_encoder),\n",
    "    (\"Imputation\", mean_imputer),\n",
    "    (\"Standardize ints\", int_standardizer),\n",
    "    (\"Handle unknown labels\", replace_unknown_labels),\n",
    "    (\"Scaling\", scaler),\n",
    "    (\"Discretization\", discretizer),\n",
    "    (\"Numeric filter\", filter_numeric),\n",
    "    (\"32-bit Conversion\", convert_to_float32)\n",
    "]\n",
    "cuml_pipeline = Pipeline(preprocessing_steps + [(\"Classifier\", cu_classifier)])\n",
    "sklearn_pipeline = Pipeline(preprocessing_steps + [(\"Classifier\", sk_classifier)])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "cu_evaluate(cuml_pipeline)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%script false --no-raise-error\n",
    "sk_evaluate(sklearn_pipeline)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 4.7 Generating New Features\n",
    "We have looked at several ways of processing existing features that may help a machine learning model converge faster or perform better, but we can also generate new features from the existing data to help create the best possible representation of those data.\n",
    "\n",
    "One of the most straightforward examples of this technique is expemplified by the `PolynomialFeatureGenerator` algorithm. This algorithm works by looking at the products of existing features up to a certain order. Thus, if we have features `a`, `b`, and `c`, it might be useful to let the model see `ab`, `ac`, `bc` and potentially even `a**2`, `b**2`, and `c**2`.\n",
    "\n",
    "In our case, we will again take a fairly naive approach, adding all of the interaction terms of order 2 (corresponding to `ab`, `ac`, and `bc` in the above example) as new features."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from cuml.experimental.preprocessing import PolynomialFeatures as cuPolynomialFeatures\n",
    "from sklearn.preprocessing import PolynomialFeatures as skPolynomialFeatures\n",
    "\n",
    "sk_generator = FeatureGenerator(\n",
    "    skPolynomialFeatures(interaction_only=True, degree=2),\n",
    "    include_dtypes=['integer']\n",
    ")\n",
    "cu_generator = FeatureGenerator(\n",
    "    cuPolynomialFeatures(interaction_only=True, degree=2),\n",
    "    include_dtypes=['integer']\n",
    ")\n",
    "generator = DeviceSpecificTransformer(sk_generator, cu_generator)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "preprocessing_steps = [\n",
    "    (\"Drop ID\", drop_id),\n",
    "    (\"Replace nulls\", null_filler),\n",
    "    (\"Encoding\", label_encoder),\n",
    "    (\"Imputation\", mean_imputer),\n",
    "    (\"Standardize ints\", int_standardizer),\n",
    "    (\"Handle unknown labels\", replace_unknown_labels),\n",
    "    (\"Generate products\", generator),\n",
    "    (\"Scaling\", scaler),\n",
    "    (\"Discretization\", discretizer),\n",
    "    (\"Numeric filter\", filter_numeric),\n",
    "    (\"32-bit Conversion\", convert_to_float32)\n",
    "]\n",
    "cuml_pipeline = Pipeline(preprocessing_steps + [(\"Classifier\", cu_classifier)])\n",
    "sklearn_pipeline = Pipeline(preprocessing_steps + [(\"Classifier\", sk_classifier)])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "cu_evaluate(cuml_pipeline)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%script false --no-raise-error\n",
    "sk_evaluate(sklearn_pipeline)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5. Final Assessment\n",
    "Blindly applying the techniques presented thus far, we have seen a very modest increase in accuracy due solely to preprocessing. As evidenced by the ingenious solutions presented for the Kaggle competition associated with this dataset, a more careful and thorough exploration of preprocessing can yield much more impressive performance.\n",
    "\n",
    "A key factor in finding an effective preprocessing protocol is how long it takes to iterate through possibilities and assess their impact. Indeed, this is one of the key benefits of cuML's new preprocessing tools. Using them, we can load data onto the GPU then tweak, transform, and use it for training and inference without ever incurring the cost of device-to-host transfers.\n",
    "\n",
    "With this in mind, let's take one final look at execution time for our final pipeline, breaking it down and analyzing the specific benefits of GPU-accelerated preprocessing."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Increase verbosity to provide timing details\n",
    "cuml_pipeline = Pipeline(\n",
    "    preprocessing_steps + [(\"Classifier\", cu_classifier)],\n",
    "    verbose=1\n",
    ")\n",
    "sklearn_pipeline = Pipeline(\n",
    "    preprocessing_steps + [(\"Classifier\", sk_classifier)],\n",
    "    verbose=1\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%time cu_evaluate(cuml_pipeline)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "preprocessing_steps = [\n",
    "    (\"Drop ID\", drop_id),\n",
    "    (\"Replace nulls\", null_filler),\n",
    "    (\"Encoding\", label_encoder),\n",
    "    (\"Imputation\", mean_imputer),\n",
    "    (\"Standardize ints\", int_standardizer),\n",
    "    (\"Handle unknown labels\", replace_unknown_labels),\n",
    "    (\"Generate products\", generator),\n",
    "    (\"Scaling\", scaler),\n",
    "    (\"Discretization\", discretizer),\n",
    "    (\"Numeric filter\", filter_numeric),\n",
    "    (\"32-bit Conversion\", convert_to_float32)\n",
    "]\n",
    "sklearn_pipeline = Pipeline(\n",
    "    preprocessing_steps + [(\"Classifier\", sk_classifier)],\n",
    "    verbose=1\n",
    ")\n",
    "%time sk_evaluate(sklearn_pipeline)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "preproc_only_pipeline = Pipeline(preprocessing_steps)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "# Suppress warnings from naive application of discretizer to\n",
    "# all features\n",
    "with warnings.catch_warnings():\n",
    "    warnings.simplefilter(\"ignore\")\n",
    "    preproc_only_pipeline.fit_transform(data_cudf[data_cudf.columns.difference(['target'])], data_cudf.target)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "# Suppress warnings from naive application of discretizer to\n",
    "# all features\n",
    "with warnings.catch_warnings():\n",
    "    warnings.simplefilter(\"ignore\")\n",
    "    preproc_only_pipeline.fit_transform(data_pd[data_pd.columns.difference(['target'])], data_pd.target)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Looking at these results, we can see the runtime benefit of GPU acceleration in both the entire preprocessing and classification pipeline and the preprocessing portion alone. For feature engineering, this means faster iteration, lower compute costs, and the possibility of conducting more systematic hyper-parameter optimization over even the preprocessing steps themselves. Those with an interest in HPO might check out our [detailed walkthroughs](https://rapids.ai/hpo) on performing HPO with RAPIDS in the cloud. The techniques explored there could easily be combined with those demonstrated in this notebook to rapidly search the space of available preprocessing and model hyperparameters."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 6. Conclusions\n",
    "Thanks to the newly-expanded cuML preprocessing features in RAPIDS v0.16, it is now possible to keep your entire machine learning pipeline on the GPU, without copying data back to the host to make use of CPU-only algorithms. This offers substantial benefits in terms of runtime, which can in turn lead to more thorough exploration of the feature engineering space and dramatically lower compute times and costs.\n",
    "\n",
    "While this notebook primarily offers a high-level demonstration of available preprocessing features rather than an in-depth optimization of features on a particular dataset, you may be interested in using it to play more with the BNP dataset yourself to engineer the perfect combination of curated features. Or better yet, try it with your own data.\n",
    "\n",
    "If you like what you see here, there is plenty more to explore in our [other demo notebooks](https://github.com/rapidsai/notebooks). Please feel free to report any problems you find or ask questions via [the cuML issue tracker](https://github.com/rapidsai/cuml/issues), and keep an eye out for the next release of cuML (v0.17), which we expect to have an even smoother preprocessing experience as we start to transition the new preprocessing features out of experimental."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}