Skip to content

Instantly share code, notes, and snippets.

@bafonso
Created November 26, 2019 18:01
Show Gist options
  • Save bafonso/11db41be78df152e657bc3268b68e40d to your computer and use it in GitHub Desktop.
Save bafonso/11db41be78df152e657bc3268b68e40d to your computer and use it in GitHub Desktop.
Display the source blob
Display the rendered blob
Raw
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Random Forest and Pickling\n",
"The Random Forest algorithm is a classification algorithm which builds several decision trees, and aggregates each of their outputs to make a prediction.\n",
"\n",
"In this notebook we will train a scikit-learn and a cuML Random Forest Classification model. Then the cuML model is pickled and loaded for prediction. We also compare the results of the scikit-learn, non-pickled and pickled cuML models.\n",
"\n",
"Note that the underlying algorithm for tree node splits differs from that used in scikit-learn.\n",
"\n",
"For information on converting your dataset to cuDF format, refer to the [cuDF documentation](https://rapidsai.github.io/projects/cudf/en/latest/)\n",
"\n",
"For additional information cuML's random forest model: https://rapidsai.github.io/projects/cuml/en/0.10.0/api.html#random-forest"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import cudf\n",
"import numpy as np\n",
"import pandas as pd\n",
"import pickle\n",
"\n",
"from cuml.ensemble import RandomForestClassifier as curfc\n",
"\n",
"from sklearn.ensemble import RandomForestClassifier as skrfc\n",
"from sklearn.metrics import accuracy_score\n",
"from sklearn.datasets import make_classification\n",
"from sklearn.model_selection import train_test_split"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Define Parameters"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"n_samples = 2**13\n",
"n_features = 399\n",
"n_info = 300\n",
"data_type = np.float32"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Generate Data\n",
"\n",
"### Host"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 261 ms, sys: 538 ms, total: 799 ms\n",
"Wall time: 165 ms\n"
]
}
],
"source": [
"%%time\n",
"X,y = make_classification(n_samples=n_samples,\n",
" n_features=n_features,\n",
" n_informative=n_info,\n",
" random_state=123, n_classes=2)\n",
"\n",
"X = pd.DataFrame(X.astype(data_type))\n",
"y = pd.Series(y.astype(np.int32))\n",
"\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y,\n",
" test_size = 0.2,\n",
" random_state=0)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### GPU"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 928 ms, sys: 194 ms, total: 1.12 s\n",
"Wall time: 1.12 s\n"
]
}
],
"source": [
"%%time\n",
"X_cudf_train = cudf.DataFrame.from_pandas(X_train)\n",
"X_cudf_test = cudf.DataFrame.from_pandas(X_test)\n",
"\n",
"y_cudf_train = cudf.Series(y_train.values)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Scikit-learn Model\n",
"\n",
"### Fit"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 50.9 s, sys: 226 ms, total: 51.2 s\n",
"Wall time: 7.43 s\n"
]
}
],
"source": [
"%%time\n",
"sk_model = skrfc(n_estimators=40,\n",
" max_depth=12,\n",
" min_samples_split=2,\n",
" max_features=1.0,\n",
" random_state=10,\n",
" n_jobs = -1 )\n",
"\n",
"sk_model.fit(X_train, y_train)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Evaluate"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 19.6 ms, sys: 7.84 ms, total: 27.4 ms\n",
"Wall time: 107 ms\n"
]
}
],
"source": [
"%%time\n",
"sk_predict = sk_model.predict(X_test)\n",
"sk_acc = accuracy_score(y_test, sk_predict)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## cuML Model"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Fit"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/opt/conda/envs/rapids/lib/python3.6/site-packages/ipykernel_launcher.py:5: UserWarning: Random seed requires n_streams=1.\n",
" \"\"\"\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 5.71 s, sys: 499 ms, total: 6.21 s\n",
"Wall time: 1.75 s\n"
]
}
],
"source": [
"%%time\n",
"cuml_model = curfc(n_estimators=40,\n",
" max_depth=12,\n",
" min_rows_per_node=2,\n",
" max_features=1.0,\n",
" seed=10)\n",
"\n",
"cuml_model.fit(X_cudf_train, y_cudf_train)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Evaluate"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 484 ms, sys: 12.2 ms, total: 496 ms\n",
"Wall time: 485 ms\n"
]
}
],
"source": [
"%%time\n",
"fil_preds_orig = cuml_model.predict(X_cudf_test)\n",
"\n",
"fil_acc_orig = accuracy_score(y_test, fil_preds_orig)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Dask GridSearchCV on CPU"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"GridSearchCV(cache_cv=True, cv=None, error_score='raise',\n",
" estimator=RandomForestClassifier(bootstrap=True, class_weight=None,\n",
" criterion='gini', max_depth=12,\n",
" max_features=1.0,\n",
" max_leaf_nodes=None,\n",
" min_impurity_decrease=0.0,\n",
" min_impurity_split=None,\n",
" min_samples_leaf=1,\n",
" min_samples_split=2,\n",
" min_weight_fraction_leaf=0.0,\n",
" n_estimators=40, n_jobs=-1,\n",
" oob_score=False, random_state=10,\n",
" verbose=0, warm_start=False),\n",
" iid=True, n_jobs=-1,\n",
" param_grid={'max_depth': [12, 13], 'n_estimators': [40, 41]},\n",
" refit=True, return_train_score=False, scheduler=None,\n",
" scoring=None)"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import dask_ml.model_selection as dcv\n",
"\n",
"dask_parameters = {\n",
" 'n_estimators': [40, 41],\n",
" 'max_depth': [12,13],\n",
" }\n",
"\n",
"skmodel_dask_grid = dcv.GridSearchCV(\n",
" sk_model,\n",
" dask_parameters,\n",
" )\n",
"\n",
"skmodel_dask_grid.fit(X,y)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Dask GridSearchCV on GPU"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"ename": "TypeError",
"evalue": "X matrix format <class 'pandas.core.series.Series'> not supported",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mTypeError\u001b[0m Traceback (most recent call last)",
"\u001b[0;32m<ipython-input-11-17b12c53bb0e>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m 10\u001b[0m )\n\u001b[1;32m 11\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 12\u001b[0;31m \u001b[0mcurf_dask_grid\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mfit\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mX\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0my\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
"\u001b[0;32m/opt/conda/envs/rapids/lib/python3.6/site-packages/dask_ml/model_selection/_search.py\u001b[0m in \u001b[0;36mfit\u001b[0;34m(self, X, y, groups, **fit_params)\u001b[0m\n\u001b[1;32m 1260\u001b[0m \u001b[0mout\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0mresult_map\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mk\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;32mfor\u001b[0m \u001b[0mk\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mkeys\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1261\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 1262\u001b[0;31m \u001b[0mout\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mscheduler\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mdsk\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mkeys\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mnum_workers\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mn_jobs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 1263\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1264\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0miid\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m/opt/conda/envs/rapids/lib/python3.6/site-packages/dask/threaded.py\u001b[0m in \u001b[0;36mget\u001b[0;34m(dsk, result, cache, num_workers, pool, **kwargs)\u001b[0m\n\u001b[1;32m 79\u001b[0m \u001b[0mget_id\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0m_thread_get_id\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 80\u001b[0m \u001b[0mpack_exception\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mpack_exception\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 81\u001b[0;31m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 82\u001b[0m )\n\u001b[1;32m 83\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m/opt/conda/envs/rapids/lib/python3.6/site-packages/dask/local.py\u001b[0m in \u001b[0;36mget_async\u001b[0;34m(apply_async, num_workers, dsk, result, cache, get_id, rerun_exceptions_locally, pack_exception, raise_exception, callbacks, dumps, loads, **kwargs)\u001b[0m\n\u001b[1;32m 484\u001b[0m \u001b[0m_execute_task\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mtask\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mdata\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;31m# Re-execute locally\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 485\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 486\u001b[0;31m \u001b[0mraise_exception\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mexc\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mtb\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 487\u001b[0m \u001b[0mres\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mworker_id\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mloads\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mres_info\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 488\u001b[0m \u001b[0mstate\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m\"cache\"\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mkey\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mres\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m/opt/conda/envs/rapids/lib/python3.6/site-packages/dask/local.py\u001b[0m in \u001b[0;36mreraise\u001b[0;34m(exc, tb)\u001b[0m\n\u001b[1;32m 314\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mexc\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m__traceback__\u001b[0m \u001b[0;32mis\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0mtb\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 315\u001b[0m \u001b[0;32mraise\u001b[0m \u001b[0mexc\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mwith_traceback\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mtb\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 316\u001b[0;31m \u001b[0;32mraise\u001b[0m \u001b[0mexc\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 317\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 318\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m/opt/conda/envs/rapids/lib/python3.6/site-packages/dask/local.py\u001b[0m in \u001b[0;36mexecute_task\u001b[0;34m(key, task_info, dumps, loads, get_id, pack_exception)\u001b[0m\n\u001b[1;32m 220\u001b[0m \u001b[0;32mtry\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 221\u001b[0m \u001b[0mtask\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mdata\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mloads\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mtask_info\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 222\u001b[0;31m \u001b[0mresult\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0m_execute_task\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mtask\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mdata\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 223\u001b[0m \u001b[0mid\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mget_id\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 224\u001b[0m \u001b[0mresult\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mdumps\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mresult\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mid\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m/opt/conda/envs/rapids/lib/python3.6/site-packages/dask/core.py\u001b[0m in \u001b[0;36m_execute_task\u001b[0;34m(arg, cache, dsk)\u001b[0m\n\u001b[1;32m 117\u001b[0m \u001b[0mfunc\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0margs\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0marg\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;36m0\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0marg\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 118\u001b[0m \u001b[0margs2\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0m_execute_task\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0ma\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mcache\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;32mfor\u001b[0m \u001b[0ma\u001b[0m \u001b[0;32min\u001b[0m \u001b[0margs\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 119\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mfunc\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0margs2\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 120\u001b[0m \u001b[0;32melif\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0mishashable\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0marg\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 121\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0marg\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m/opt/conda/envs/rapids/lib/python3.6/site-packages/dask_ml/model_selection/methods.py\u001b[0m in \u001b[0;36mfit_and_score\u001b[0;34m(est, cv, X, y, n, scorer, error_score, fields, params, fit_params, return_train_score)\u001b[0m\n\u001b[1;32m 315\u001b[0m \u001b[0mX_test\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mcv\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mextract\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mX\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0my\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mn\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;32mTrue\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;32mFalse\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 316\u001b[0m \u001b[0my_test\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mcv\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mextract\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mX\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0my\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mn\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;32mFalse\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;32mFalse\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 317\u001b[0;31m \u001b[0mest_and_time\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mfit\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mest\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mX_train\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0my_train\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0merror_score\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mfields\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mparams\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mfit_params\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 318\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0mreturn_train_score\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 319\u001b[0m \u001b[0mX_train\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0my_train\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m/opt/conda/envs/rapids/lib/python3.6/site-packages/dask_ml/model_selection/methods.py\u001b[0m in \u001b[0;36mfit\u001b[0;34m(est, X, y, error_score, fields, params, fit_params)\u001b[0m\n\u001b[1;32m 235\u001b[0m \u001b[0;32mtry\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 236\u001b[0m \u001b[0mest\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mset_params\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mest\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mfields\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mparams\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 237\u001b[0;31m \u001b[0mest\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mfit\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mX\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0my\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mfit_params\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 238\u001b[0m \u001b[0;32mexcept\u001b[0m \u001b[0mException\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0me\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 239\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0merror_score\u001b[0m \u001b[0;34m==\u001b[0m \u001b[0;34m\"raise\"\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32mcuml/ensemble/randomforestclassifier.pyx\u001b[0m in \u001b[0;36mcuml.ensemble.randomforestclassifier.RandomForestClassifier.fit\u001b[0;34m()\u001b[0m\n",
"\u001b[0;32m/opt/conda/envs/rapids/lib/python3.6/site-packages/cuml/utils/input_utils.py\u001b[0m in \u001b[0;36minput_to_dev_array\u001b[0;34m(X, order, deepcopy, check_dtype, convert_to_dtype, check_cols, check_rows, fail_on_order)\u001b[0m\n\u001b[1;32m 178\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 179\u001b[0m \u001b[0mmsg\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m\"X matrix format \"\u001b[0m \u001b[0;34m+\u001b[0m \u001b[0mstr\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mX\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m__class__\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;34m+\u001b[0m \u001b[0;34m\" not supported\"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 180\u001b[0;31m \u001b[0;32mraise\u001b[0m \u001b[0mTypeError\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mmsg\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 181\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 182\u001b[0m \u001b[0mdtype\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mX_m\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mdtype\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;31mTypeError\u001b[0m: X matrix format <class 'pandas.core.series.Series'> not supported"
]
}
],
"source": [
"\n",
"dask_parameters = {\n",
" 'n_estimators': [40, 41],\n",
" 'max_depth': [12,13],\n",
" }\n",
"\n",
"curf_dask_grid = dcv.GridSearchCV(\n",
" cuml_model,\n",
" dask_parameters,\n",
" )\n",
"\n",
"curf_dask_grid.fit(X,y)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Pickle the cuML random forest classification model"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"dump_orig_model = pickle.dumps(cuml_model)\n",
"# delete the previous model to ensure there is no leakage of pointers\n",
"del cuml_model\n",
"# load the pickled model \n",
"pickled_cuml_model = pickle.loads(dump_orig_model)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Predict using the loaded model"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 747 ms, sys: 29.7 ms, total: 777 ms\n",
"Wall time: 789 ms\n"
]
}
],
"source": [
"%%time\n",
"pred_after_pickling = pickled_cuml_model.predict(X_cudf_test)\n",
"\n",
"fil_acc_after_pickling = accuracy_score(y_test, pred_after_pickling)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Compare Results"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(\"CUML accuracy of the RF model before pickling: %s\" % fil_acc_orig)\n",
"print(\"CUML accuracy of the RF model after pickling: %s\" % fil_acc_after_pickling)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(\"SKL accuracy: %s\" % sk_acc)\n",
"print(\"CUML accuracy: %s\" % fil_acc_orig)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.7"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment