@Neeratyoy
Created October 17, 2019 15:56
{
"cells": [
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"import openml\n",
"\n",
"import numpy as np\n",
"import pandas as pd"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We want to work with the Iris data. Notice that we are not importing scikit-learn here. Since, the idea of OpenML is to provide a repository of datasets and associated performances on them, we shall first try to look up the OpenML database to see if we can find what we are looking for."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(2958, 16)\n",
"<class 'pandas.core.frame.DataFrame'>\n",
"did\n",
"name\n",
"version\n",
"uploader\n",
"status\n",
"format\n",
"MajorityClassSize\n",
"MaxNominalAttDistinctValues\n",
"MinorityClassSize\n",
"NumberOfClasses\n",
"NumberOfFeatures\n",
"NumberOfInstances\n",
"NumberOfInstancesWithMissingValues\n",
"NumberOfMissingValues\n",
"NumberOfNumericFeatures\n",
"NumberOfSymbolicFeatures\n"
]
}
],
"source": [
"# Fetching the list of all available datasets on OpenML\n",
"d = openml.datasets.list_datasets(output_format='dataframe')\n",
"print(d.shape)\n",
"for name in d.columns:\n",
" print(name)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We see that the returned dataframe contains some basic information about the dataset, i.e., it's ID, name, version, format, etc. Along with that, it also contains some additional features which provide meta information, on the Iris dataset. Such features can be useful when comparing the Iris dataset across other datasets. In this workout, we shall ignore them for now.\n",
"\n",
"We shall now filter datasets based on 'iris' being present in the dataset name. <br>\n",
"We shall also sort the results based on their version number."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>did</th>\n",
" <th>name</th>\n",
" <th>version</th>\n",
" <th>uploader</th>\n",
" <th>status</th>\n",
" <th>format</th>\n",
" <th>MajorityClassSize</th>\n",
" <th>MaxNominalAttDistinctValues</th>\n",
" <th>MinorityClassSize</th>\n",
" <th>NumberOfClasses</th>\n",
" <th>NumberOfFeatures</th>\n",
" <th>NumberOfInstances</th>\n",
" <th>NumberOfInstancesWithMissingValues</th>\n",
" <th>NumberOfMissingValues</th>\n",
" <th>NumberOfNumericFeatures</th>\n",
" <th>NumberOfSymbolicFeatures</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>61</th>\n",
" <td>61</td>\n",
" <td>iris</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>active</td>\n",
" <td>ARFF</td>\n",
" <td>50.0</td>\n",
" <td>3.0</td>\n",
" <td>50.0</td>\n",
" <td>3.0</td>\n",
" <td>5.0</td>\n",
" <td>150.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>4.0</td>\n",
" <td>1.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>41950</th>\n",
" <td>41950</td>\n",
" <td>iris_test_upload</td>\n",
" <td>1</td>\n",
" <td>4030</td>\n",
" <td>active</td>\n",
" <td>ARFF</td>\n",
" <td>50.0</td>\n",
" <td>3.0</td>\n",
" <td>50.0</td>\n",
" <td>3.0</td>\n",
" <td>5.0</td>\n",
" <td>150.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>4.0</td>\n",
" <td>1.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>451</th>\n",
" <td>451</td>\n",
" <td>irish</td>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>active</td>\n",
" <td>ARFF</td>\n",
" <td>278.0</td>\n",
" <td>10.0</td>\n",
" <td>222.0</td>\n",
" <td>2.0</td>\n",
" <td>6.0</td>\n",
" <td>500.0</td>\n",
" <td>32.0</td>\n",
" <td>32.0</td>\n",
" <td>2.0</td>\n",
" <td>4.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>969</th>\n",
" <td>969</td>\n",
" <td>iris</td>\n",
" <td>3</td>\n",
" <td>2</td>\n",
" <td>active</td>\n",
" <td>ARFF</td>\n",
" <td>100.0</td>\n",
" <td>2.0</td>\n",
" <td>50.0</td>\n",
" <td>2.0</td>\n",
" <td>5.0</td>\n",
" <td>150.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>4.0</td>\n",
" <td>1.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>41510</th>\n",
" <td>41510</td>\n",
" <td>iris</td>\n",
" <td>9</td>\n",
" <td>348</td>\n",
" <td>active</td>\n",
" <td>ARFF</td>\n",
" <td>NaN</td>\n",
" <td>3.0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>5.0</td>\n",
" <td>150.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>4.0</td>\n",
" <td>1.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" did name version uploader status format \\\n",
"61 61 iris 1 1 active ARFF \n",
"41950 41950 iris_test_upload 1 4030 active ARFF \n",
"451 451 irish 1 2 active ARFF \n",
"969 969 iris 3 2 active ARFF \n",
"41510 41510 iris 9 348 active ARFF \n",
"\n",
" MajorityClassSize MaxNominalAttDistinctValues MinorityClassSize \\\n",
"61 50.0 3.0 50.0 \n",
"41950 50.0 3.0 50.0 \n",
"451 278.0 10.0 222.0 \n",
"969 100.0 2.0 50.0 \n",
"41510 NaN 3.0 NaN \n",
"\n",
" NumberOfClasses NumberOfFeatures NumberOfInstances \\\n",
"61 3.0 5.0 150.0 \n",
"41950 3.0 5.0 150.0 \n",
"451 2.0 6.0 500.0 \n",
"969 2.0 5.0 150.0 \n",
"41510 NaN 5.0 150.0 \n",
"\n",
" NumberOfInstancesWithMissingValues NumberOfMissingValues \\\n",
"61 0.0 0.0 \n",
"41950 0.0 0.0 \n",
"451 32.0 32.0 \n",
"969 0.0 0.0 \n",
"41510 0.0 0.0 \n",
"\n",
" NumberOfNumericFeatures NumberOfSymbolicFeatures \n",
"61 4.0 1.0 \n",
"41950 4.0 1.0 \n",
"451 2.0 4.0 \n",
"969 4.0 1.0 \n",
"41510 4.0 1.0 "
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"d[d['name'].str.contains('iris')].sort_values(by='version').head()"
]
},
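{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that `str.contains('iris')` is a substring match, so the result above also includes `irish` and `iris_test_upload`. As a minimal sketch, assuming we only want datasets named exactly `iris`, the next cell rebuilds three columns of the rows shown above by hand and compares the two filters. The `sample` frame is a hypothetical stand-in for `d`, so the cell runs without contacting the OpenML server."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Hypothetical stand-in for `d`: three columns of the five rows shown above\n",
"sample = pd.DataFrame({\n",
"    'did': [61, 41950, 451, 969, 41510],\n",
"    'name': ['iris', 'iris_test_upload', 'irish', 'iris', 'iris'],\n",
"    'version': [1, 1, 1, 3, 9],\n",
"})\n",
"\n",
"# Substring match: keeps every name that contains 'iris'\n",
"substring = sample[sample['name'].str.contains('iris')]\n",
"\n",
"# Exact match: keeps only rows whose name is exactly 'iris'\n",
"exact = sample[sample['name'] == 'iris'].sort_values(by='version')\n",
"\n",
"print(len(substring), len(exact))\n",
"print(exact['did'].tolist())"
]
},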
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.8"
}
},
"nbformat": 4,
"nbformat_minor": 2
}