@Orbifold
Created November 3, 2016 08:12

Synthetic Minority Over-sampling Technique.
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Some more SMOTE\n",
"\n",
"This is a short excusrion on the SMOTE variations I found and which allow to manipulate in various ways the creation of synthetic samples. \n",
"\n",
"See [this Github work](https://github.com/fmfn/UnbalancedDataset) and note that the code below is Python 3.5."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import sys\n",
" \n",
"import sklearn.datasets\n",
" \n",
"from unbalanced_dataset import SMOTE\n",
"from sklearn import tree\n",
"from sklearn.ensemble import RandomForestClassifier\n",
"from sklearn import decomposition\n",
"import time, os\n",
"import pandas as pd, numpy as np\n",
"from ggplot import *\n",
"import matplotlib.pyplot as plt\n",
"from sklearn.metrics import confusion_matrix, roc_auc_score\n",
"%matplotlib inline\n",
"%load_ext autoreload\n",
"%autoreload 2\n",
"\n",
"import sklearn.datasets\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns\n",
"import datetime\n",
"from sklearn.decomposition import PCA\n",
"import pandas as pd, numpy as np, os, time\n",
"\n",
"sns.set()\n",
"\n",
"\n",
"def plotClassificationData(x, y, title=\"\"):\n",
" palette = sns.color_palette()\n",
" plt.scatter(x[y == 0, 0], x[y == 0, 1], label=\"Class #0\", alpha=0.5,\n",
" facecolor=palette[0], linewidth=0.15)\n",
" plt.scatter(x[y == 1, 0], x[y == 1, 1], label=\"Class #1\", alpha=0.5,\n",
" facecolor=palette[2], linewidth=0.15)\n",
" plt.title(title)\n",
" plt.legend()\n",
" plt.show()\n",
"\n",
"\n",
"def linePlot(x, title=\"\"):\n",
" palette = sns.color_palette()\n",
" plt.plot(x, alpha=0.5, label=title, linewidth=0.2)\n",
" plt.legend()\n",
" plt.show()\n",
"\n",
"\n",
"def savePlotClassificationData(x, y):\n",
" palette = sns.color_palette()\n",
" plt.scatter(x[y == 0, 0], x[y == 0, 1], label=\"Class #0\", alpha=0.5,\n",
" facecolor=palette[0], linewidth=0.15)\n",
" plt.scatter(x[y == 1, 0], x[y == 1, 1], label=\"Class #1\", alpha=0.5,\n",
" facecolor=palette[2], linewidth=0.15)\n",
"\n",
" plt.legend()\n",
" # plt.show()\n",
" filePath = \"/Users/Swa/Desktop/\" + str(datetime.datetime.now(datetime.timezone.utc).timestamp()) + \".png\"\n",
" plt.savefig(filePath)\n",
"\n",
"\n",
"def plotHistogram(x, bins=10):\n",
" plt.hist(x, bins=bins)\n",
" plt.show()\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's create some classification data and we take 200 informative features. \n",
"The usage of PCA to turn it into a 2D dataset is simply a projection technique so things can be plotted."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"x,y = sklearn.datasets.make_classification(n_classes=2, class_sep=2, weights=[0.1, 0.9],\n",
" n_informative=200, n_redundant=0, flip_y=0,\n",
" n_features=200, n_clusters_per_class=1,\n",
" n_samples=5000, random_state=10)\n",
"\n",
"y = 1 - y\n",
"\n",
"def count_classifieds(z): return sum(z)\n",
"def count_unclassifieds(z): return len(z) - sum(z)\n",
"def imbalance_ratio(z): return round(count_classifieds(z)/count_unclassifieds(z),1)\n",
"num_classified = count_classifieds(y)\n",
"num_unclassified = count_unclassifieds(y)\n",
"print(\"Number of classified clients: %s\"%num_classified)\n",
"print(\"Number of unclassified clients: %s\"%num_unclassified )\n",
"print(\"Imbalance ratio: %s\"%imbalance_ratio(y))\n",
"pca = decomposition.PCA(n_components=2)\n",
"xv = pca.fit_transform(x)\n",
"plotClassificationData(xv,y)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In order to measure how the synthetic samples influence the classification we will use a **random forest** and **naive Bayes** classifiers.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def makeForest(X:np.array, Y:np.array, treecount=10): \n",
" if treecount <= 1: raise ValueError(\"The forest should not have less than one tree.\")\n",
" model = RandomForestClassifier(n_estimators=treecount)\n",
" return model.fit(X, Y)\n",
"\n",
"def getForestArea(x, y, treecount=10): \n",
" forest = makeForest(x, y, treecount)\n",
" predicted = forest.predict(x)\n",
" try:\n",
" area = roc_auc_score(y, predicted)\n",
" except Exception as exc:\n",
" area = 0 \n",
" return round(area,2)"
]
},
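{
"cell_type": "markdown",
"metadata": {},
"source": [
"The sweeps below only use the forest helper. For completeness, here is an analogous naive Bayes helper, a minimal sketch using scikit-learn's GaussianNB (the helper name is mine, not part of any library)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"from sklearn.naive_bayes import GaussianNB\n",
"\n",
"def getBayesArea(x, y):\n",
"    # Same in-sample AUC measurement as getForestArea, with Gaussian naive Bayes.\n",
"    model = GaussianNB().fit(x, y)\n",
"    predicted = model.predict(x)\n",
"    try:\n",
"        area = roc_auc_score(y, predicted)\n",
"    except ValueError:  # raised when y contains a single class\n",
"        area = 0\n",
"    return round(area, 2)"
]
},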
{
"cell_type": "markdown",
"metadata": {},
"source": [
"So, in the default setup without any SMOTE we have"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"print(\"Baseline AUC area: %s\"%getForestArea(x,y))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Standard SMOTE algorithm\n",
"\n",
"The idea of the algorithm is to take k-nearest neighbors which define through the barycenter a direction and use a random factor in this direction. The larger the value of k the more the synthetic sample blur the existing ones. By default the value is 5.\n",
"\n",
"Let's first consider how the amount/ratio influence the accuracy of the predictions."
]
},
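{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before the ratio sweep, here is a minimal sketch of how a single synthetic sample is generated (my own illustration, not the library's code): pick a minority point, pick one of its k nearest minority neighbors, and interpolate with a random factor in [0, 1]."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from sklearn.neighbors import NearestNeighbors\n",
"\n",
"def smoteOneSample(minority, k=5, rng=np.random):\n",
"    # k+1 neighbors because every point is its own nearest neighbor.\n",
"    nn = NearestNeighbors(n_neighbors=k + 1).fit(minority)\n",
"    i = rng.randint(len(minority))\n",
"    _, idx = nn.kneighbors(minority[i].reshape(1, -1))\n",
"    # Pick a random neighbor, skipping column 0 (the point itself).\n",
"    j = idx[0][rng.randint(1, k + 1)]\n",
"    # The synthetic point lies on the segment between the point and its neighbor.\n",
"    return minority[i] + rng.uniform(0, 1) * (minority[j] - minority[i])\n",
"\n",
"smoteOneSample(x[y == 1])"
]
},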
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"scrolled": false
},
"outputs": [],
"source": [
"areas = []\n",
"division = np.arange(0.2,4,0.1)\n",
"for k in division:\n",
" smote = SMOTE(kind=\"regular\", ratio=k)\n",
" sx, sy = smote.fit_transform(x, y)\n",
" areas.append(getForestArea(sx,sy))\n",
"plt.plot(division, areas)\n",
"plt.xlabel('Ratio vs. area.')\n",
"plt.ylabel('Area')\n",
"plt.title('AUC')\n",
"plt.ylim([0.97,1.005])\n",
"plt.show() \n",
"print(\"Ended with %s classified and %s unclassifieds.\"%(count_classifieds(sy),count_unclassifieds(sy))) "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"No surprise here, the more samples the more the accuracy increases. We end up with an almost 1:1 ratio.\n",
"\n",
"Let's assume now a fixed ration but increase the amount of nearest neighbors."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"areas = []\n",
"division = np.arange(2,20,1)\n",
"for nn in division:\n",
" smote = SMOTE(kind=\"regular\", k=nn, ratio=0.5)\n",
" sx, sy = smote.fit_transform(x, y)\n",
" areas.append(getForestArea(sx,sy))\n",
"plt.plot(division, areas)\n",
"plt.xlabel('Neighbors vs. area.')\n",
"plt.ylabel('Area')\n",
"plt.title('AUC')\n",
"plt.ylim([0.93,1.005])\n",
"plt.show() \n",
"print(\"At each turn we had %s classified and %s unclassifieds.\"%(count_classifieds(sy),count_unclassifieds(sy))) "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"So, the amount of neighbors diminishes the accuracy somewhat but there is no clear effect. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Borderline 1 variation\n",
"\n",
"The core idea of SMOTE is to use nearest neighbors to create new samples. However, if a minority point is close to another class then that point should rather not be considered since it would pull towards more noise and a less clear distinction between classes. So, the basic premise of the borderline SMOTE method is to identify points which potentially increase the confusion and not include these in the vectors creating new samples.\n",
"It's clear that this method will have no effect if the classes are well separated and mostly effective when mixture is moderate.\n",
"[The algorithm is described in this article. ](http://sci2s.ugr.es/keel/keel-dataset/pdfs/2005-Han-LNCS.pdf)"
]
},
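{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a sketch of the selection step (my reading of the paper above, not the library's code): a minority point is 'in danger' when at least half, but not all, of its m nearest neighbors belong to the majority class; only these points seed new samples."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from sklearn.neighbors import NearestNeighbors\n",
"\n",
"def danger_points(X, Y, m=10):\n",
"    # Indices of minority points whose neighborhood is dominated, but not engulfed,\n",
"    # by the majority class; borderline SMOTE interpolates only from these.\n",
"    nn = NearestNeighbors(n_neighbors=m + 1).fit(X)\n",
"    minority_idx = np.where(Y == 1)[0]\n",
"    _, idx = nn.kneighbors(X[minority_idx])\n",
"    # Count majority neighbors, skipping column 0 (each point itself).\n",
"    majority_counts = (Y[idx[:, 1:]] == 0).sum(axis=1)\n",
"    in_danger = (majority_counts >= m / 2) & (majority_counts < m)\n",
"    return minority_idx[in_danger]\n",
"\n",
"print(\"Minority points in danger: %s\" % len(danger_points(x, y)))"
]
},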
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"areas = []\n",
"division = np.arange(0.2,4,0.1)\n",
"for k in division:\n",
" smote = SMOTE(kind=\"borderline1\", ratio=k)\n",
" sx, sy = smote.fit_transform(x, y)\n",
" areas.append(getForestArea(sx,sy))\n",
"plt.plot(division, areas)\n",
"plt.xlabel('Ratio vs. area.')\n",
"plt.ylabel('Area')\n",
"plt.title('AUC')\n",
"plt.ylim([0.97,1.005])\n",
"plt.show() \n",
"print(\"Ended with %s classified and %s unclassifieds.\"%(count_classifieds(sy),count_unclassifieds(sy))) "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see that the accuracy goes more directly to its max due to the fact that our sample has indeed some noisy overlap between the clases and that the borderline SMOTE is really ideal in this case."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# SVM variation\n",
"\n",
"This approach is similar to the borderline idea but one uses a support vector machine to detect boundary points and separate them for the creation of synthetic samples."
]
},
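{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a rough illustration of the boundary detection (my own sketch, not the library's implementation): fit an SVM and treat the minority-class support vectors as the boundary points from which synthetic samples would be grown."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from sklearn.svm import SVC\n",
"\n",
"# Fit an SVM; its support vectors approximate the class boundary.\n",
"svm = SVC(kernel=\"linear\").fit(x, y)\n",
"support_idx = svm.support_\n",
"# Keep only the minority-class support vectors; these would seed the synthetic samples.\n",
"minority_support = support_idx[y[support_idx] == 1]\n",
"print(\"Minority support vectors: %s\" % len(minority_support))"
]
},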
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import warnings\n",
"warnings.filterwarnings('ignore')\n",
"areas = []\n",
"division = np.arange(0.2,4,0.1)\n",
"for k in division:\n",
" smote = SMOTE(kind=\"svm\", ratio=k)\n",
" sx, sy = smote.fit_transform(x, y)\n",
" areas.append(getForestArea(sx,sy))\n",
"plt.plot(division, areas)\n",
"plt.xlabel('Ratio vs. area.')\n",
"plt.ylabel('Area')\n",
"plt.title('AUC')\n",
"plt.ylim([0.97,1.005])\n",
"plt.show() \n",
"print(\"Ended with %s classified and %s unclassifieds.\"%(count_classifieds(sy),count_unclassifieds(sy))) "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This gives a slightly better speed towards maximization (around 1.4 vs 1.6 with the borderline approach) but at the cost of a more lengthy computing time."
]
}
],
"metadata": {
"anaconda-cloud": {},
"kernelspec": {
"display_name": "Python [Root]",
"language": "python",
"name": "Python [Root]"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.2"
}
},
"nbformat": 4,
"nbformat_minor": 0
}