corochann/dataset_indexer_exp.ipynb

## dataset_indexer_exp.ipynb
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "# Feature extraction\n",
    "\n",
    "Extract features from a dataset easily with the `features` indexer."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Prepare the data: load the MNIST training and test datasets."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [],
   "source": [
    "from IPython.display import display\n",
    "from IPython.core.interactiveshell import InteractiveShell\n",
    "InteractiveShell.ast_node_interactivity = \"all\"\n",
    "\n",
    "import numpy as np\n",
    "import chainer\n",
    "from chainer.datasets import TupleDataset\n",
    "\n",
    "train, test = chainer.datasets.get_mnist()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Extract data easily without explicitly calling `concat_examples`\n",
    "\n",
    "Images and labels can be extracted with the `features` indexer.\n",
    "\n",
    "Note that `features` itself is only an indexer; the actual data is extracted when it is accessed by index.\n",
    "\n",
    " - axis=0 denotes the dataset (example) index\n",
    " - axis=1 denotes the feature index (0 is images, 1 is labels for the MNIST dataset)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "imgs (60000, 784)\nlabels (60000,)\nlabels[2] 4\n"
     ]
    }
   ],
   "source": [
    "imgs, labels = train.features[:, :]\n",
    "#imgs, labels = train.features[:]  # same\n",
    "\n",
    "print('imgs', imgs.shape)\n",
    "print('labels', labels.shape)\n",
    "print('labels[2]', labels[2])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. Extract only the labels, without extracting the unneeded image data\n",
    "\n",
    "Passing 1 as the index on axis=1 extracts only the label data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "labels, len 60000\nlabels [5 0 4 1 9 2 1 3 1 4]\n"
     ]
    }
   ],
   "source": [
    "labels = train.features[:, 1]\n",
    "print('labels, len', len(labels))\n",
    "print('labels', labels[:10])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. Data selection according to the label\n",
    "\n",
    "Even a slightly complicated selection can be written in a few lines of code."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "label3_index [  7  10  12  27  30  44  49  50  74  86  98 107 111 130 135 136 149 157\n 179 181 198 203 207 215 228 235 242 250 254 255 279 281 291 298 321 327\n 330 341 356 361 392 405 425 433 452 459 479 486 490 495 500 509 540 546\n 549 557 561 574 581 613 629 643 645 659 670 675 695 715 731 752 760 767\n 789 808 811 840 843 856 857 861 867 874 875 878 890 895 909 953 966 975\n 983 992 998]\nimgs3 (93, 784)\n"
     ]
    }
   ],
   "source": [
    "# Extract the indices whose label is 3, among the first 1000 examples only.\n",
    "label3_index = np.argwhere(train.features[:1000, 1] == 3).ravel()\n",
    "print('label3_index', label3_index)\n",
    "\n",
    "# Get the images whose label is 3\n",
    "imgs3 = train.features[label3_index, 0]\n",
    "print('imgs3', imgs3.shape)  # 93 images match"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It also supports boolean mask indexing, as in pandas."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "imgs5 (5421, 784)\n"
     ]
    }
   ],
   "source": [
    "# Extract the images whose label is 5 (boolean mask indexing along axis=0)\n",
    "imgs5 = train.features[train.features[:, 1] == 5, 0]\n",
    "print('imgs5', imgs5.shape)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2.0
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
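
The notebook does not show how `features` is implemented. As a rough illustration of the indexing semantics it describes (axis=0 selects examples, axis=1 selects features), here is a minimal sketch built only on NumPy. The `FeatureIndexer` class and the toy data are hypothetical stand-ins for illustration; they are not Chainer's API or the author's implementation.

```python
import numpy as np


class FeatureIndexer:
    """Hypothetical stand-in for a `features`-style indexer (not Chainer's API)."""

    def __init__(self, *columns):
        # Each column is one feature array; all share the same length on axis 0.
        self.columns = [np.asarray(c) for c in columns]

    def __getitem__(self, key):
        # Accept features[rows] as shorthand for features[rows, :].
        rows, cols = key if isinstance(key, tuple) else (key, slice(None))
        if isinstance(cols, (int, np.integer)):
            # A single feature index returns one array; other columns are not touched.
            return self.columns[cols][rows]
        # A slice over features returns a tuple with one array per selected column.
        return tuple(c[rows] for c in self.columns[cols])


# Toy data standing in for MNIST: 6 "images" of 4 pixels each, and 6 labels.
imgs = np.arange(24, dtype=np.float32).reshape(6, 4)
labels = np.array([5, 0, 3, 1, 3, 5])
features = FeatureIndexer(imgs, labels)

all_imgs, all_labels = features[:, :]            # every feature, every example
only_labels = features[:, 1]                     # labels only
idx3 = np.argwhere(features[:, 1] == 3).ravel()  # integer indices of label 3
imgs3 = features[idx3, 0]                        # integer-array indexing on axis 0
imgs5 = features[features[:, 1] == 5, 0]         # boolean mask on axis 0

print(all_imgs.shape, all_labels.shape)
print(idx3, imgs3.shape, imgs5.shape)
```

Because the row index is handed straight to NumPy, slices, integer arrays, and boolean masks all work on axis 0 without extra code, and asking for a single feature never reads the other columns.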