{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Why fit a data transformer on the training dataset only\n",
"\n",
"It is important to report an accurate test performance estimate for your model. Inaccurate claims about the precision of a predictive model can have serious consequences; for example, [inaccurate accuracy claims can even result in a lawsuit](https://www.cnbc.com/video/2016/05/23/fitbit-faces-a-lawsuit-over-highly-inaccurate-trackers.html).\n",
"\n",
"To obtain an accurate test performance estimate, one needs to simulate training and model use as closely as possible to the real world application of the model. One aspect that is sometimes overlooked in practice is the fitting of data preprocessing models. Such models include [missing value imputation models](http://scikit-learn.org/stable/modules/classes.html#module-sklearn.impute), as well as scaling models that [subtract the mean](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) from feature values and divide by the standard deviation. It is important to calculate the parameters of such models on the training dataset only; see the reasoning [here](https://stats.stackexchange.com/questions/319514/why-feature-scaling-only-to-training-set) and [here](https://stackoverflow.com/questions/43675665/when-scale-the-data-why-the-train-dataset-use-fit-and-transform-but-the-te). In this notebook, a concrete example shows this effect on the Boston Housing dataset. A minimal sketch of the correct pattern follows."
]
},
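{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a minimal sketch of this pattern (on synthetic data, not the actual dataset used below; the names `X_demo`, `X_tr`, `X_te` are illustrative), the scaler statistics are computed on the training partition only, and the same fitted scaler is then applied to both partitions:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"from sklearn.preprocessing import StandardScaler\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"# synthetic feature matrix, standing in for any real dataset\n",
"X_demo = np.random.randn(100, 3) * 10 + 5\n",
"X_tr, X_te = train_test_split(X_demo, test_size=0.25)\n",
"\n",
"scaler = StandardScaler()\n",
"scaler.fit(X_tr)  # mean and std come from the training part only\n",
"X_tr = scaler.transform(X_tr)\n",
"X_te = scaler.transform(X_te)  # the test part reuses the training statistics"
]
},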
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## What an accurate test set performance estimate means\n",
"\n",
"Let's introduce some notation. Let `X_train`, `y_train` be the training data, and `X_test`, `y_test` be the testing data. The performance of a model trained on the `*_train` partition can be estimated on the `*_test` partition by calling `.score` on the trained model. Let such a score be `score`.\n",
"\n",
"If data for a new sample of houses is collected, the test accuracy on such new data should be very close to the previously estimated test accuracy; otherwise, the performance is estimated inaccurately. Specifically, for new incoming data `X_new`, `y_new`, the corresponding score on such data should be close to the original `score`.\n",
"\n",
"Such a situation is simulated below by splitting the Boston Housing dataset into multiple partitions. The accuracy of the estimate is compared between fitting a `StandardScaler` on the whole dataset and fitting it on the training part only. Results for subsets of the Boston Housing dataset of different sizes are summarized below."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"from sklearn.datasets import load_boston\n",
"from sklearn.preprocessing import StandardScaler\n",
"from sklearn.svm import LinearSVR\n",
"from sklearn.model_selection import train_test_split, GridSearchCV\n",
"\n",
"# Boston housing price estimation dataset\n",
"Xf, yf = load_boston(return_X_y=True)\n",
"\n",
"def experiment(fit_all):\n",
"    # randomly select 20% of the 506 samples, i.e. ~100 samples\n",
"    _, X, _, y = train_test_split(Xf, yf, test_size=0.2)\n",
"\n",
"    # ~50 samples for train / test, ~50 samples for final evaluation\n",
"    X, X_new, y, y_new = train_test_split(X, y, test_size=0.5)\n",
"    sc = StandardScaler()\n",
"\n",
"    if fit_all:\n",
"        # leaky variant: fit the scaler on the whole train / test pool\n",
"        X = sc.fit_transform(X)\n",
"\n",
"    # usual train / test split\n",
"    X_train, X_test, y_train, y_test = train_test_split(X, y)\n",
"\n",
"    if not fit_all:\n",
"        # proper variant: fit the scaler on the training part only\n",
"        sc.fit(X_train)\n",
"        X_train = sc.transform(X_train)\n",
"        X_test = sc.transform(X_test)\n",
"\n",
"    model = GridSearchCV(\n",
"        estimator=LinearSVR(),\n",
"        param_grid={\n",
"            'C': [10 ** i for i in [-3, -2, -1, 0]]\n",
"        }\n",
"    )\n",
"\n",
"    model.fit(X_train, y_train)\n",
"    score = model.score(X_test, y_test)\n",
"\n",
"    # score on unseen data; the gap to `score` measures estimate accuracy\n",
"    score_new = model.score(sc.transform(X_new), y_new)\n",
"    return abs(score - score_new)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The script below is wrapped in a string literal so that it does not run on every notebook execution; remove the triple quotes to reproduce the results."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"\"\"\"\n",
"from joblib import Parallel, delayed\n",
"import bootstrapped.bootstrap as bs\n",
"import bootstrapped.stats_functions as bs_stats\n",
"\n",
"# how many times to repeat the experiment, in order to average out randomness\n",
"N_reps = 10000\n",
"\n",
"for fit_all in [False, True]:\n",
"    # run the experiments in parallel\n",
"    errors = Parallel(n_jobs=-1, verbose=1)(delayed(experiment)(fit_all) for _ in range(N_reps))\n",
"\n",
"    # estimate confidence bounds on the mean error via bootstrap\n",
"    conf = bs.bootstrap(np.array(errors), stat_func=bs_stats.mean)\n",
"\n",
"    # communicate results\n",
"    print(\"Average error of test estimate,\", 'fit on all' if fit_all else 'fit train only', \":\")\n",
"    print(conf)\n",
"\"\"\"\n",
"pass"
]
},
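{
"cell_type": "markdown",
"metadata": {},
"source": [
"If the `bootstrapped` package is not installed, a plain NumPy bootstrap of the mean gives comparable confidence bounds. This is a sketch of the general technique, not the exact procedure used by the package, and the function name `bootstrap_mean_ci` is illustrative:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def bootstrap_mean_ci(values, n_boot=10000, alpha=0.05):\n",
"    # resample with replacement and collect the mean of each resample\n",
"    values = np.asarray(values)\n",
"    idx = np.random.randint(0, len(values), size=(n_boot, len(values)))\n",
"    means = values[idx].mean(axis=1)\n",
"    # percentile bootstrap confidence interval for the mean\n",
"    lower = np.percentile(means, 100 * alpha / 2)\n",
"    upper = np.percentile(means, 100 * (1 - alpha / 2))\n",
"    return values.mean(), lower, upper\n",
"\n",
"# example usage on dummy error values\n",
"bootstrap_mean_ci(np.random.rand(100))"
]
},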
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"# Results obtained by running the script above\n",
"# Dataset sizes, controlled via the first `test_size` in `experiment`\n",
"# (test_size=0.1 yields ~50 samples, 0.2 -> ~100, 0.4 -> ~200, 0.8 -> ~400)\n",
"sizes = [50, 100, 200, 400]\n",
"fit_on_train = [\n",
"    # mean, lower_bound, upper_bound\n",
"    [1.090, 0.992, 1.184],\n",
"    [0.269, 0.262, 0.275],\n",
"    [0.171, 0.168, 0.174],\n",
"    [0.117, 0.115, 0.119]\n",
"]\n",
"fit_on_all = [\n",
"    [3.318, 3.134, 3.480],\n",
"    [0.343, 0.333, 0.352],\n",
"    [0.173, 0.170, 0.176],\n",
"    [0.118, 0.116, 0.120]\n",
"]"
]
},
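{
"cell_type": "markdown",
"metadata": {},
"source": [
"The hard-coded results above can be compared directly; the ratio of mean errors shows that the leakage effect is large for small datasets and nearly disappears as the dataset grows:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"for size, train_only, on_all in zip(sizes, fit_on_train, fit_on_all):\n",
"    print('n=%3d: fit train only %.3f, fit on all %.3f (%.2fx larger error)'\n",
"          % (size, train_only[0], on_all[0], on_all[0] / train_only[0]))"
]
},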
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Conclusion\n",
"\n",
"It is shown that fitting preprocessing models on the training and testing partitions together can lead to inaccurate estimates of test performance. The effect is especially pronounced for the `StandardScaler` model when the dataset is relatively small. It is important to avoid fitting any model on the test set, as this can lead to overfitting. Even if such overfitting is small per model, across multiple models it could compound into a highly inaccurate estimate."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.2"
}
},
"nbformat": 4,
"nbformat_minor": 2
}