Neeratyoy/using openml.ipynb

## using openml.ipynb
{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "import openml\n",
    "\n",
    "import numpy as np\n",
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We want to work with the Iris data. Note that we are not importing scikit-learn here. Since the idea of OpenML is to provide a repository of datasets and the performance results associated with them, we first search the OpenML database for the dataset we need."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(2958, 16)\n",
      "<class 'pandas.core.frame.DataFrame'>\n",
      "did\n",
      "name\n",
      "version\n",
      "uploader\n",
      "status\n",
      "format\n",
      "MajorityClassSize\n",
      "MaxNominalAttDistinctValues\n",
      "MinorityClassSize\n",
      "NumberOfClasses\n",
      "NumberOfFeatures\n",
      "NumberOfInstances\n",
      "NumberOfInstancesWithMissingValues\n",
      "NumberOfMissingValues\n",
      "NumberOfNumericFeatures\n",
      "NumberOfSymbolicFeatures\n"
     ]
    }
   ],
   "source": [
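    "# Fetch the list of all datasets available on OpenML as a pandas DataFrame\n",
    "d = openml.datasets.list_datasets(output_format='dataframe')\n",
    "print(d.shape)  # (number of datasets, number of metadata columns)\n",
    "print(type(d))\n",
    "# Print the name of each metadata column\n",
    "for name in d.columns:\n",
    "    print(name)"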
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The returned dataframe contains basic information about each dataset, i.e., its ID, name, version, format, etc. It also contains additional columns with meta-information about each dataset, such as the number of classes, features, and instances. Such features can be useful when comparing the Iris dataset with other datasets. In this exercise, we shall ignore them for now.\n",
    "\n",
    "We now filter for datasets whose name contains 'iris'. <br>\n",
    "We also sort the results by version number."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>did</th>\n",
       "      <th>name</th>\n",
       "      <th>version</th>\n",
       "      <th>uploader</th>\n",
       "      <th>status</th>\n",
       "      <th>format</th>\n",
       "      <th>MajorityClassSize</th>\n",
       "      <th>MaxNominalAttDistinctValues</th>\n",
       "      <th>MinorityClassSize</th>\n",
       "      <th>NumberOfClasses</th>\n",
       "      <th>NumberOfFeatures</th>\n",
       "      <th>NumberOfInstances</th>\n",
       "      <th>NumberOfInstancesWithMissingValues</th>\n",
       "      <th>NumberOfMissingValues</th>\n",
       "      <th>NumberOfNumericFeatures</th>\n",
       "      <th>NumberOfSymbolicFeatures</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>61</th>\n",
       "      <td>61</td>\n",
       "      <td>iris</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>active</td>\n",
       "      <td>ARFF</td>\n",
       "      <td>50.0</td>\n",
       "      <td>3.0</td>\n",
       "      <td>50.0</td>\n",
       "      <td>3.0</td>\n",
       "      <td>5.0</td>\n",
       "      <td>150.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>4.0</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>41950</th>\n",
       "      <td>41950</td>\n",
       "      <td>iris_test_upload</td>\n",
       "      <td>1</td>\n",
       "      <td>4030</td>\n",
       "      <td>active</td>\n",
       "      <td>ARFF</td>\n",
       "      <td>50.0</td>\n",
       "      <td>3.0</td>\n",
       "      <td>50.0</td>\n",
       "      <td>3.0</td>\n",
       "      <td>5.0</td>\n",
       "      <td>150.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>4.0</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>451</th>\n",
       "      <td>451</td>\n",
       "      <td>irish</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>active</td>\n",
       "      <td>ARFF</td>\n",
       "      <td>278.0</td>\n",
       "      <td>10.0</td>\n",
       "      <td>222.0</td>\n",
       "      <td>2.0</td>\n",
       "      <td>6.0</td>\n",
       "      <td>500.0</td>\n",
       "      <td>32.0</td>\n",
       "      <td>32.0</td>\n",
       "      <td>2.0</td>\n",
       "      <td>4.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>969</th>\n",
       "      <td>969</td>\n",
       "      <td>iris</td>\n",
       "      <td>3</td>\n",
       "      <td>2</td>\n",
       "      <td>active</td>\n",
       "      <td>ARFF</td>\n",
       "      <td>100.0</td>\n",
       "      <td>2.0</td>\n",
       "      <td>50.0</td>\n",
       "      <td>2.0</td>\n",
       "      <td>5.0</td>\n",
       "      <td>150.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>4.0</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>41510</th>\n",
       "      <td>41510</td>\n",
       "      <td>iris</td>\n",
       "      <td>9</td>\n",
       "      <td>348</td>\n",
       "      <td>active</td>\n",
       "      <td>ARFF</td>\n",
       "      <td>NaN</td>\n",
       "      <td>3.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>5.0</td>\n",
       "      <td>150.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>4.0</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "         did              name  version uploader  status format  \\\n",
       "61        61              iris        1        1  active   ARFF   \n",
       "41950  41950  iris_test_upload        1     4030  active   ARFF   \n",
       "451      451             irish        1        2  active   ARFF   \n",
       "969      969              iris        3        2  active   ARFF   \n",
       "41510  41510              iris        9      348  active   ARFF   \n",
       "\n",
       "       MajorityClassSize  MaxNominalAttDistinctValues  MinorityClassSize  \\\n",
       "61                  50.0                          3.0               50.0   \n",
       "41950               50.0                          3.0               50.0   \n",
       "451                278.0                         10.0              222.0   \n",
       "969                100.0                          2.0               50.0   \n",
       "41510                NaN                          3.0                NaN   \n",
       "\n",
       "       NumberOfClasses  NumberOfFeatures  NumberOfInstances  \\\n",
       "61                 3.0               5.0              150.0   \n",
       "41950              3.0               5.0              150.0   \n",
       "451                2.0               6.0              500.0   \n",
       "969                2.0               5.0              150.0   \n",
       "41510              NaN               5.0              150.0   \n",
       "\n",
       "       NumberOfInstancesWithMissingValues  NumberOfMissingValues  \\\n",
       "61                                    0.0                    0.0   \n",
       "41950                                 0.0                    0.0   \n",
       "451                                  32.0                   32.0   \n",
       "969                                   0.0                    0.0   \n",
       "41510                                 0.0                    0.0   \n",
       "\n",
       "       NumberOfNumericFeatures  NumberOfSymbolicFeatures  \n",
       "61                         4.0                       1.0  \n",
       "41950                      4.0                       1.0  \n",
       "451                        2.0                       4.0  \n",
       "969                        4.0                       1.0  \n",
       "41510                      4.0                       1.0  "
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
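    "# Keep datasets whose name contains 'iris' (this also matches names such as 'irish'),\n",
    "# sort them by version, and show the first five rows\n",
    "d[d['name'].str.contains('iris')].sort_values(by='version').head()"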
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
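
The substring filter in the notebook above also matches names such as `irish` and `iris_test_upload`. The following is a minimal sketch of an exact-name filter, run on a small hand-built DataFrame so that it works offline. The rows are copied from the notebook output and only four columns are kept. The names `listing`, `substring_hits`, `exact_hits`, and `first_iris_id` are illustrative assumptions and are not part of the notebook or of the OpenML API.

```python
import pandas as pd

# Hand-built stand-in for a few rows of the dataset listing
# (values copied from the notebook output; most columns omitted).
listing = pd.DataFrame(
    {
        "did": [61, 41950, 451, 969, 41510],
        "name": ["iris", "iris_test_upload", "irish", "iris", "iris"],
        "version": [1, 1, 1, 3, 9],
        "status": ["active"] * 5,
    },
    index=[61, 41950, 451, 969, 41510],
)

# Substring match, as in the notebook: 'irish' and 'iris_test_upload' are kept too.
substring_hits = listing[listing["name"].str.contains("iris")]

# Exact match keeps only datasets literally named 'iris', ordered by version.
exact_hits = listing[listing["name"] == "iris"].sort_values(by="version")

# After sorting, the first row holds the lowest version of 'iris'.
first_iris_id = int(exact_hits["did"].iloc[0])

print(len(substring_hits), len(exact_hits), first_iris_id)
```

Whether an exact match is preferable depends on the goal: the substring search is useful for exploring related uploads, while the exact match is a safer way to pick one specific dataset ID.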