@WassimBenzarti
Created May 16, 2020 15:55
Data Mining TP3
{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"name": "Data Mining TP3",
"provenance": [],
"collapsed_sections": [],
"include_colab_link": true
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
}
},
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "view-in-github",
"colab_type": "text"
},
"source": [
"<a href=\"https://colab.research.google.com/gist/WassimBenzarti/4b65253832ea4e5a0d4e17c74cf355f6/data-mining-tp3.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "nFEGHDEZdfa3",
"colab_type": "text"
},
"source": [
"### Loading data"
]
},
{
"cell_type": "code",
"metadata": {
"id": "iQgvKD-eYFqF",
"colab_type": "code",
"outputId": "24c215fe-0905-449b-a06a-9063ea36a927",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 1000
}
},
"source": [
"from sklearn import datasets\n",
"# Load the dataset\n",
"irisData = datasets.load_iris()\n",
"# Display the data\n",
"print (irisData.data)\n",
"print (irisData.target)\n",
"# Dimensions of the dataset (a bare expression is only displayed when it ends the cell)\n",
"irisData.data.shape, irisData.target.shape\n",
"# Show the description\n",
"print(irisData.DESCR)\n",
"\n",
"# Uppercase name -> 2-D feature matrix\n",
"# Lowercase name -> 1-D target vector\n",
"X = irisData.data\n",
"y = irisData.target"
],
"execution_count": 1,
"outputs": [
{
"output_type": "stream",
"text": [
"[[5.1 3.5 1.4 0.2]\n",
" [4.9 3. 1.4 0.2]\n",
" [4.7 3.2 1.3 0.2]\n",
" [4.6 3.1 1.5 0.2]\n",
" [5. 3.6 1.4 0.2]\n",
" [5.4 3.9 1.7 0.4]\n",
" [4.6 3.4 1.4 0.3]\n",
" [5. 3.4 1.5 0.2]\n",
" [4.4 2.9 1.4 0.2]\n",
" [4.9 3.1 1.5 0.1]\n",
" [5.4 3.7 1.5 0.2]\n",
" [4.8 3.4 1.6 0.2]\n",
" [4.8 3. 1.4 0.1]\n",
" [4.3 3. 1.1 0.1]\n",
" [5.8 4. 1.2 0.2]\n",
" [5.7 4.4 1.5 0.4]\n",
" [5.4 3.9 1.3 0.4]\n",
" [5.1 3.5 1.4 0.3]\n",
" [5.7 3.8 1.7 0.3]\n",
" [5.1 3.8 1.5 0.3]\n",
" [5.4 3.4 1.7 0.2]\n",
" [5.1 3.7 1.5 0.4]\n",
" [4.6 3.6 1. 0.2]\n",
" [5.1 3.3 1.7 0.5]\n",
" [4.8 3.4 1.9 0.2]\n",
" [5. 3. 1.6 0.2]\n",
" [5. 3.4 1.6 0.4]\n",
" [5.2 3.5 1.5 0.2]\n",
" [5.2 3.4 1.4 0.2]\n",
" [4.7 3.2 1.6 0.2]\n",
" [4.8 3.1 1.6 0.2]\n",
" [5.4 3.4 1.5 0.4]\n",
" [5.2 4.1 1.5 0.1]\n",
" [5.5 4.2 1.4 0.2]\n",
" [4.9 3.1 1.5 0.2]\n",
" [5. 3.2 1.2 0.2]\n",
" [5.5 3.5 1.3 0.2]\n",
" [4.9 3.6 1.4 0.1]\n",
" [4.4 3. 1.3 0.2]\n",
" [5.1 3.4 1.5 0.2]\n",
" [5. 3.5 1.3 0.3]\n",
" [4.5 2.3 1.3 0.3]\n",
" [4.4 3.2 1.3 0.2]\n",
" [5. 3.5 1.6 0.6]\n",
" [5.1 3.8 1.9 0.4]\n",
" [4.8 3. 1.4 0.3]\n",
" [5.1 3.8 1.6 0.2]\n",
" [4.6 3.2 1.4 0.2]\n",
" [5.3 3.7 1.5 0.2]\n",
" [5. 3.3 1.4 0.2]\n",
" [7. 3.2 4.7 1.4]\n",
" [6.4 3.2 4.5 1.5]\n",
" [6.9 3.1 4.9 1.5]\n",
" [5.5 2.3 4. 1.3]\n",
" [6.5 2.8 4.6 1.5]\n",
" [5.7 2.8 4.5 1.3]\n",
" [6.3 3.3 4.7 1.6]\n",
" [4.9 2.4 3.3 1. ]\n",
" [6.6 2.9 4.6 1.3]\n",
" [5.2 2.7 3.9 1.4]\n",
" [5. 2. 3.5 1. ]\n",
" [5.9 3. 4.2 1.5]\n",
" [6. 2.2 4. 1. ]\n",
" [6.1 2.9 4.7 1.4]\n",
" [5.6 2.9 3.6 1.3]\n",
" [6.7 3.1 4.4 1.4]\n",
" [5.6 3. 4.5 1.5]\n",
" [5.8 2.7 4.1 1. ]\n",
" [6.2 2.2 4.5 1.5]\n",
" [5.6 2.5 3.9 1.1]\n",
" [5.9 3.2 4.8 1.8]\n",
" [6.1 2.8 4. 1.3]\n",
" [6.3 2.5 4.9 1.5]\n",
" [6.1 2.8 4.7 1.2]\n",
" [6.4 2.9 4.3 1.3]\n",
" [6.6 3. 4.4 1.4]\n",
" [6.8 2.8 4.8 1.4]\n",
" [6.7 3. 5. 1.7]\n",
" [6. 2.9 4.5 1.5]\n",
" [5.7 2.6 3.5 1. ]\n",
" [5.5 2.4 3.8 1.1]\n",
" [5.5 2.4 3.7 1. ]\n",
" [5.8 2.7 3.9 1.2]\n",
" [6. 2.7 5.1 1.6]\n",
" [5.4 3. 4.5 1.5]\n",
" [6. 3.4 4.5 1.6]\n",
" [6.7 3.1 4.7 1.5]\n",
" [6.3 2.3 4.4 1.3]\n",
" [5.6 3. 4.1 1.3]\n",
" [5.5 2.5 4. 1.3]\n",
" [5.5 2.6 4.4 1.2]\n",
" [6.1 3. 4.6 1.4]\n",
" [5.8 2.6 4. 1.2]\n",
" [5. 2.3 3.3 1. ]\n",
" [5.6 2.7 4.2 1.3]\n",
" [5.7 3. 4.2 1.2]\n",
" [5.7 2.9 4.2 1.3]\n",
" [6.2 2.9 4.3 1.3]\n",
" [5.1 2.5 3. 1.1]\n",
" [5.7 2.8 4.1 1.3]\n",
" [6.3 3.3 6. 2.5]\n",
" [5.8 2.7 5.1 1.9]\n",
" [7.1 3. 5.9 2.1]\n",
" [6.3 2.9 5.6 1.8]\n",
" [6.5 3. 5.8 2.2]\n",
" [7.6 3. 6.6 2.1]\n",
" [4.9 2.5 4.5 1.7]\n",
" [7.3 2.9 6.3 1.8]\n",
" [6.7 2.5 5.8 1.8]\n",
" [7.2 3.6 6.1 2.5]\n",
" [6.5 3.2 5.1 2. ]\n",
" [6.4 2.7 5.3 1.9]\n",
" [6.8 3. 5.5 2.1]\n",
" [5.7 2.5 5. 2. ]\n",
" [5.8 2.8 5.1 2.4]\n",
" [6.4 3.2 5.3 2.3]\n",
" [6.5 3. 5.5 1.8]\n",
" [7.7 3.8 6.7 2.2]\n",
" [7.7 2.6 6.9 2.3]\n",
" [6. 2.2 5. 1.5]\n",
" [6.9 3.2 5.7 2.3]\n",
" [5.6 2.8 4.9 2. ]\n",
" [7.7 2.8 6.7 2. ]\n",
" [6.3 2.7 4.9 1.8]\n",
" [6.7 3.3 5.7 2.1]\n",
" [7.2 3.2 6. 1.8]\n",
" [6.2 2.8 4.8 1.8]\n",
" [6.1 3. 4.9 1.8]\n",
" [6.4 2.8 5.6 2.1]\n",
" [7.2 3. 5.8 1.6]\n",
" [7.4 2.8 6.1 1.9]\n",
" [7.9 3.8 6.4 2. ]\n",
" [6.4 2.8 5.6 2.2]\n",
" [6.3 2.8 5.1 1.5]\n",
" [6.1 2.6 5.6 1.4]\n",
" [7.7 3. 6.1 2.3]\n",
" [6.3 3.4 5.6 2.4]\n",
" [6.4 3.1 5.5 1.8]\n",
" [6. 3. 4.8 1.8]\n",
" [6.9 3.1 5.4 2.1]\n",
" [6.7 3.1 5.6 2.4]\n",
" [6.9 3.1 5.1 2.3]\n",
" [5.8 2.7 5.1 1.9]\n",
" [6.8 3.2 5.9 2.3]\n",
" [6.7 3.3 5.7 2.5]\n",
" [6.7 3. 5.2 2.3]\n",
" [6.3 2.5 5. 1.9]\n",
" [6.5 3. 5.2 2. ]\n",
" [6.2 3.4 5.4 2.3]\n",
" [5.9 3. 5.1 1.8]]\n",
"[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n",
" 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1\n",
" 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2\n",
" 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2\n",
" 2 2]\n",
".. _iris_dataset:\n",
"\n",
"Iris plants dataset\n",
"--------------------\n",
"\n",
"**Data Set Characteristics:**\n",
"\n",
" :Number of Instances: 150 (50 in each of three classes)\n",
" :Number of Attributes: 4 numeric, predictive attributes and the class\n",
" :Attribute Information:\n",
" - sepal length in cm\n",
" - sepal width in cm\n",
" - petal length in cm\n",
" - petal width in cm\n",
" - class:\n",
" - Iris-Setosa\n",
" - Iris-Versicolour\n",
" - Iris-Virginica\n",
" \n",
" :Summary Statistics:\n",
"\n",
" ============== ==== ==== ======= ===== ====================\n",
" Min Max Mean SD Class Correlation\n",
" ============== ==== ==== ======= ===== ====================\n",
" sepal length: 4.3 7.9 5.84 0.83 0.7826\n",
" sepal width: 2.0 4.4 3.05 0.43 -0.4194\n",
" petal length: 1.0 6.9 3.76 1.76 0.9490 (high!)\n",
" petal width: 0.1 2.5 1.20 0.76 0.9565 (high!)\n",
" ============== ==== ==== ======= ===== ====================\n",
"\n",
" :Missing Attribute Values: None\n",
" :Class Distribution: 33.3% for each of 3 classes.\n",
" :Creator: R.A. Fisher\n",
" :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)\n",
" :Date: July, 1988\n",
"\n",
"The famous Iris database, first used by Sir R.A. Fisher. The dataset is taken\n",
"from Fisher's paper. Note that it's the same as in R, but not as in the UCI\n",
"Machine Learning Repository, which has two wrong data points.\n",
"\n",
"This is perhaps the best known database to be found in the\n",
"pattern recognition literature. Fisher's paper is a classic in the field and\n",
"is referenced frequently to this day. (See Duda & Hart, for example.) The\n",
"data set contains 3 classes of 50 instances each, where each class refers to a\n",
"type of iris plant. One class is linearly separable from the other 2; the\n",
"latter are NOT linearly separable from each other.\n",
"\n",
".. topic:: References\n",
"\n",
" - Fisher, R.A. \"The use of multiple measurements in taxonomic problems\"\n",
" Annual Eugenics, 7, Part II, 179-188 (1936); also in \"Contributions to\n",
" Mathematical Statistics\" (John Wiley, NY, 1950).\n",
" - Duda, R.O., & Hart, P.E. (1973) Pattern Classification and Scene Analysis.\n",
" (Q327.D83) John Wiley & Sons. ISBN 0-471-22361-1. See page 218.\n",
" - Dasarathy, B.V. (1980) \"Nosing Around the Neighborhood: A New System\n",
" Structure and Classification Rule for Recognition in Partially Exposed\n",
" Environments\". IEEE Transactions on Pattern Analysis and Machine\n",
" Intelligence, Vol. PAMI-2, No. 1, 67-71.\n",
" - Gates, G.W. (1972) \"The Reduced Nearest Neighbor Rule\". IEEE Transactions\n",
" on Information Theory, May 1972, 431-433.\n",
" - See also: 1988 MLC Proceedings, 54-64. Cheeseman et al\"s AUTOCLASS II\n",
" conceptual clustering system finds 3 classes in the data.\n",
" - Many, many more ...\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "e7F8m4ytdVdd",
"colab_type": "text"
},
"source": [
"### 1. How is the data laid out in the arrays?\n",
"The dataset holds 150 instances, split into two arrays: the inputs (`data`) and the outputs (`target`). Each input instance has 4 numeric columns (features):\n",
"- sepal length in cm\n",
"- sepal width in cm\n",
"- petal length in cm\n",
"- petal width in cm\n",
"\n",
"Each output is a single integer that encodes the iris species of the instance:\n",
"- 0: Iris-Setosa\n",
"- 1: Iris-Versicolour\n",
"- 2: Iris-Virginica"
]
},
{
"cell_type": "code",
"metadata": {
"id": "Cnb8D2cdeT9k",
"colab_type": "code",
"outputId": "467e6e0a-e954-4987-f690-3a442772327f",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 35
}
},
"source": [
"# Show the number of instances in each class.\n",
"# A boolean mask (y == k) selects the rows of X\n",
"# that belong to class k.\n",
"X[y==0].shape,X[y==1].shape,X[y==2].shape"
],
"execution_count": 2,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"((50, 4), (50, 4), (50, 4))"
]
},
"metadata": {
"tags": []
},
"execution_count": 2
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "H7SSVm-eeTZv",
"colab_type": "text"
},
"source": [
"#### How many data points are there in each class?\n",
"Each class contains 50 instances."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "1qEDWYSEemKo",
"colab_type": "text"
},
"source": [
"### What are the attributes and the class of the 32nd element of the sample?\n",
"\n",
"|Sepal length|Sepal width|Petal length|Petal width|Class|\n",
"|-|-|-|-|-|\n",
"|5.4 | 3.4| 1.5| 0.4|0|"
]
},
{
"cell_type": "code",
"metadata": {
"id": "hF8HHpWUeV6w",
"colab_type": "code",
"outputId": "449e5902-375c-43c3-c816-cb9184014cb5",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 35
}
},
"source": [
"X[31], y[31]"
],
"execution_count": 3,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"(array([5.4, 3.4, 1.5, 0.4]), 0)"
]
},
"metadata": {
"tags": []
},
"execution_count": 3
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Fc-qEuxkex5d",
"colab_type": "text"
},
"source": [
"### Understand, comment and run the following source code"
]
},
{
"cell_type": "code",
"metadata": {
"id": "wWPHCm3dexVN",
"colab_type": "code",
"outputId": "f1a261ce-aba7-42e6-8c7d-378552a1ea6c",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 765
}
},
"source": [
"# Import the dependencies: cycle, matplotlib and pylab\n",
"from itertools import cycle\n",
"import matplotlib\n",
"import pylab as pl\n",
"\n",
"# Draw the scatter plot of every class\n",
"def plot_2D(data, target, target_names):\n",
"    colors = cycle('rgbcmykw') # color cycle: (r) red, (g) green, (b) blue, (c) cyan, (m) magenta, (y) yellow, (k) black, (w) white\n",
"    target_ids = range(len(target_names))\n",
"    pl.figure()\n",
"    # Iterate over the classes, using a distinct color in each iteration\n",
"    for i, c, label in zip(target_ids, colors, target_names):\n",
"        # Use only sepal length and sepal width as the two axes\n",
"        pl.scatter(data[target == i, 0], data[target == i, 1], c=c, label=label)\n",
"        # Shown inside the loop, which matches the three separate figures in the recorded output\n",
"        pl.legend()\n",
"        pl.show()\n",
"\n",
"# Call plot_2D with the data, the class of each instance\n",
"# and the names of the classes\n",
"plot_2D(irisData.data, irisData.target, [\"Iris-Setosa\", \"Iris-Versicolour\", \"Iris-Virginica\"])"
],
"execution_count": 4,
"outputs": [
{
"output_type": "display_data",
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"tags": [],
"needs_background": "light"
}
},
{
"output_type": "display_data",
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"tags": [],
"needs_background": "light"
}
},
{
"output_type": "display_data",
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"tags": [],
"needs_background": "light"
}
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "jsgwd__WeqMd",
"colab_type": "text"
},
"source": [
"You can also change which attributes are plotted by editing this line:\n",
"```python\n",
"pl.scatter(data[target == i, 0], data[target == i, 1], c=c, label=label)\n",
"```\n",
"**Example**\n",
"To plot the 'Petal length' and 'Petal width' attributes, change it as follows:\n",
"```python\n",
"pl.scatter(data[target == i, 2], data[target == i, 3], c=c, label=label)\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Vt8Pmfwou3hT",
"colab_type": "text"
},
"source": [
"### Finding a straight line that separates the examples of one class from the examples of the two other classes\n",
"To do this, we overlay the scatter plots of all classes in a single figure to see how far apart the points of each class are."
]
},
{
"cell_type": "code",
"metadata": {
"id": "3ddlcmdCdZu2",
"colab_type": "code",
"outputId": "1837d836-edf7-4bbb-e45c-1f0c8b882c4c",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 268
}
},
"source": [
"# Import the dependencies: cycle, matplotlib, numpy and pylab\n",
"from itertools import cycle\n",
"import matplotlib\n",
"import numpy as np\n",
"import pylab as pl\n",
"\n",
"# x values (sepal length) used to draw the candidate separating line\n",
"line_x= np.linspace(4.5,6.5,1000)\n",
"\n",
"# Draw the scatter plot of every class on the same axes\n",
"def plot_2D(data, target, target_names):\n",
"    colors = cycle('rgbcmykw') # color cycle\n",
"    target_ids = range(len(target_names))\n",
"    pl.figure()\n",
"    ax = pl.subplot(111)\n",
"    # Candidate separating line: y = x - 2.3\n",
"    ax.plot(line_x,(line_x*1)-2.3)\n",
"    # Iterate over the classes, using a distinct color in each iteration\n",
"    for i, c, label in zip(target_ids, colors, target_names):\n",
"        # Use only sepal length and sepal width as the two axes\n",
"        ax.scatter(data[target == i, 0], data[target == i, 1], c=c, label=label)\n",
"    ax.legend()\n",
"    ax.figure.show()\n",
"\n",
"# Call plot_2D with the data, the class of each instance\n",
"# and the names of the classes\n",
"plot_2D(irisData.data, irisData.target, [\"Iris-Setosa\", \"Iris-Versicolour\", \"Iris-Virginica\"])"
],
"execution_count": 5,
"outputs": [
{
"output_type": "display_data",
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"tags": [],
"needs_background": "light"
}
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Py56ReH5wfKB",
"colab_type": "text"
},
"source": [
"The plot shows that a straight line separates the `Iris-Setosa` class from the other two classes.\n",
"With $x$ the sepal length and $y$ the sepal width (both in cm), the equation of this line is:\n",
"$$\n",
"y = x - 2.3\n",
"$$"
]
},
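{
"cell_type": "markdown",
"metadata": {},
"source": [
"The separating line can also be checked numerically rather than only by eye. The sketch below is an illustrative addition, not part of the original exercise; the names `sepal_length`, `sepal_width` and `above_line` are assumptions introduced here. It counts, for each class, how many points lie above the line $y = x - 2.3$.\n",
"```python\n",
"from sklearn import datasets\n",
"\n",
"iris = datasets.load_iris()\n",
"sepal_length = iris.data[:, 0]\n",
"sepal_width = iris.data[:, 1]\n",
"\n",
"# True where a point lies strictly above the line y = x - 2.3\n",
"above_line = sepal_width > sepal_length - 2.3\n",
"\n",
"# If the line separates Iris-Setosa from the rest, every Setosa point\n",
"# falls on one side and every other point on the other side.\n",
"for class_id, name in enumerate(iris.target_names):\n",
"    in_class = iris.target == class_id\n",
"    print(name, int(above_line[in_class].sum()), \"of\", int(in_class.sum()), \"above the line\")\n",
"```\n",
"A class with all of its points on one side, and the two others entirely on the opposite side, confirms the separation for these two features."
]
},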
{
"cell_type": "markdown",
"metadata": {
"id": "owons9o4zgGx",
"colab_type": "text"
},
"source": [
"## Training a first classifier"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "VTsoJj873w1T",
"colab_type": "text"
},
"source": [
"### Naïve Bayes\n",
"The example below:\n",
"1. Initializes a multinomial Naïve Bayes classifier\n",
"2. Trains it on the whole dataset except the last instance\n",
"3. Runs predictions:\n",
"   - on the 32nd element\n",
"   - on the last element\n",
"   - on the whole dataset\n",
"\n",
"##### **Conclusion**\n",
"The classifier's accuracy is 96.6% on the training data (every instance except the last one). This score is likely optimistic, because it is measured on instances the model has already seen rather than on new ones."
]
},
{
"cell_type": "code",
"metadata": {
"id": "qDu5K6FZzeaL",
"colab_type": "code",
"outputId": "d893b12f-94ee-422a-ce21-44763bea8fbb",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 199
}
},
"source": [
"from sklearn import naive_bayes\n",
"# Initialize the classifier\n",
"nb = naive_bayes.MultinomialNB(fit_prior=True)# the learning algorithm\n",
"# Load the data\n",
"irisData = datasets.load_iris()\n",
"\n",
"# Run the training phase on every instance except the last one\n",
"nb.fit(irisData.data[:-1], irisData.target[:-1])\n",
"\n",
"# Run the predictions\n",
"p31 = nb.predict([irisData.data[31]])\n",
"#print(irisData.target[31], p31)\n",
"\n",
"plast = nb.predict([irisData.data[-1]])\n",
"#print(plast)\n",
"predicted_results = nb.predict(irisData.data[:])\n",
"\n",
"print(\"Prediction results\", predicted_results)\n",
"print(\"Actual results\", irisData.target)\n"
],
"execution_count": 6,
"outputs": [
{
"output_type": "stream",
"text": [
"Prediction results [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n",
" 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1\n",
" 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2\n",
" 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 1 2 1 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2\n",
" 2 2]\n",
"Actual results [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n",
" 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1\n",
" 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2\n",
" 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2\n",
" 2 2]\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "HhVae88W9Gl9",
"colab_type": "text"
},
"source": [
"#### b- Second program\n",
"This program splits the data in two parts: the training data and the test data."
]
},
{
"cell_type": "code",
"metadata": {
"id": "58xVzW769J4m",
"colab_type": "code",
"outputId": "195edc24-ea43-42bc-c330-95f08db67a5d",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 90
}
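},
"source": [
"from sklearn import naive_bayes\n",
"\n",
"nb = naive_bayes.MultinomialNB(fit_prior=True)\n",
"\n",
"# Train on the first 99 instances (classes 0 and 1 only, since the data is sorted by class)\n",
"nb.fit(irisData.data[:99], irisData.target[:99])\n",
"\n",
"# Predict on the remaining data\n",
"# Note: [:99] and [100:149] leave out instances 99 and 149; [:100] and [100:] would cover all 150\n",
"prediction_result = nb.predict(irisData.data[100:149])\n",
"\n",
"print(\"Prediction results\", prediction_result)\n",
"print(\"Actual results\", irisData.target[100:149])"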
],
"execution_count": 7,
"outputs": [
{
"output_type": "stream",
"text": [
"Prediction results [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1\n",
" 1 1 1 1 1 1 1 1 1 1 1 1]\n",
"Actual results [2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2\n",
" 2 2 2 2 2 2 2 2 2 2 2 2]\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "8kqI4h_MTHIM",
"colab_type": "text"
},
"source": [
"##### **Conclusion**\n",
"The predictions are all wrong. The problem is that the classes are no longer balanced after the split: the dataset is sorted by class, so the training part contains only the first two species and the model has never seen the third one, 'Iris-Virginica'. To fix this, shuffle the dataset before splitting it into train and test sets.\n",
"\n",
"\n",
"Below is the same program, shuffling before the split with the `shuffle` function from Python's `random` module."
]
},
{
"cell_type": "code",
"metadata": {
"id": "vKfk_de_prA6",
"colab_type": "code",
"outputId": "6c577745-c1a4-4c58-e296-7b2eaa5fa2d1",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 74
}
},
"source": [
"from sklearn import naive_bayes\n",
"from random import shuffle\n",
"\n",
"nb = naive_bayes.MultinomialNB(fit_prior=True)\n",
"# Pair each instance with its label so that both are shuffled together\n",
"data = list(zip(irisData.data, irisData.target))\n",
"\n",
"shuffle(data)\n",
"# Unzip the shuffled pairs (this rebinds X and y to tuples)\n",
"X, y = list(zip(*data))\n",
"\n",
"nb.fit(X[:99], y[:99])\n",
"\n",
"# Predict on the remaining data\n",
"prediction_result = nb.predict(X[100:149])\n",
"\n",
"print(\"Prediction \", list(prediction_result))\n",
"print(\"Actual \", list(y[100:149]))"
],
"execution_count": 8,
"outputs": [
{
"output_type": "stream",
"text": [
"Prediction [2, 1, 0, 2, 0, 1, 1, 2, 2, 2, 2, 0, 0, 1, 1, 2, 2, 1, 2, 2, 2, 0, 0, 2, 2, 2, 2, 1, 1, 1, 0, 0, 2, 1, 2, 0, 0, 2, 2, 1, 2, 0, 0, 1, 2, 0, 0, 2, 2]\n",
"Actual [1, 1, 0, 2, 0, 1, 1, 2, 2, 1, 1, 0, 0, 1, 1, 2, 1, 1, 1, 1, 2, 0, 0, 2, 2, 1, 2, 1, 1, 1, 0, 0, 2, 1, 2, 0, 0, 2, 2, 1, 2, 0, 0, 1, 2, 0, 0, 2, 2]\n"
],
"name": "stdout"
}
]
},
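{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an alternative to shuffling by hand, scikit-learn's `train_test_split` shuffles the data before splitting and, with `stratify`, keeps the class proportions in both parts. A minimal sketch, assuming a one-third test set and a `random_state` chosen here only for reproducibility (neither value comes from the exercise):\n",
"```python\n",
"import numpy as np\n",
"from sklearn import datasets\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.naive_bayes import MultinomialNB\n",
"\n",
"iris = datasets.load_iris()\n",
"X_train, X_test, y_train, y_test = train_test_split(\n",
"    iris.data, iris.target, test_size=0.33, stratify=iris.target, random_state=0)\n",
"\n",
"# Every class should now appear in both the training and the test part\n",
"print(\"train class counts:\", np.bincount(y_train))\n",
"print(\"test class counts:\", np.bincount(y_test))\n",
"\n",
"clf = MultinomialNB().fit(X_train, y_train)\n",
"print(\"test error rate:\", 1 - clf.score(X_test, y_test))\n",
"```\n",
"Printing the class counts makes it easy to spot the failure seen above, where one species was missing from the training part."
]
},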
{
"cell_type": "markdown",
"metadata": {
"id": "_yYTEpZbV_pG",
"colab_type": "text"
},
"source": [
"## III- Evaluating the performance of a classifier"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "ympqHUKH-Keh",
"colab_type": "text"
},
"source": [
"### Performance on the training set"
]
},
{
"cell_type": "code",
"metadata": {
"id": "l44ua0BqWDab",
"colab_type": "code",
"outputId": "69d61d05-38f3-456c-d785-b60ad3e981be",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 35
}
},
"source": [
"nb.fit(irisData.data, irisData.target)\n",
"prediction_result = nb.predict(irisData.data)\n",
"\n",
"# Compute the error rate by counting the mismatches\n",
"def calculate_error(P, Y):\n",
"    ea = 0\n",
"    # Iterate over the labels passed in, not over the global dataset\n",
"    for i in range(len(Y)):\n",
"        if (P[i] != Y[i]):\n",
"            ea = ea+1\n",
"    return (ea/len(Y))\n",
"\n",
"calculate_error(prediction_result, irisData.target)"
],
"execution_count": 9,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"0.04666666666666667"
]
},
"metadata": {
"tags": []
},
"execution_count": 9
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "PKoaw_bPuk0l",
"colab_type": "text"
},
"source": [
"Array operators make it possible to do this count in a single statement. \n",
"\n",
"The non-zero values of the difference between predictions and targets mark the errors made by the model, so we can count them with the `count_nonzero()` function from the `numpy` package."
]
},
{
"cell_type": "code",
"metadata": {
"id": "3bceMW0Vl6hS",
"colab_type": "code",
"outputId": "a2d4e68d-ab2c-4081-a6d0-fda4f522d4c7",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 35
}
},
"source": [
"diff = prediction_result - irisData.target\n",
"np.count_nonzero(diff)/len(irisData.target)"
],
"execution_count": 10,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"0.04666666666666667"
]
},
"metadata": {
"tags": []
},
"execution_count": 10
}
]
},
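{
"cell_type": "markdown",
"metadata": {},
"source": [
"Comparing the two arrays directly gives the same rate without relying on subtraction, which also works when the labels are not numeric. A small sketch on made-up labels (the arrays `predicted` and `actual` are illustrative and not taken from the notebook's results):\n",
"```python\n",
"import numpy as np\n",
"\n",
"predicted = np.array([0, 1, 2, 2, 1])\n",
"actual = np.array([0, 1, 1, 2, 1])\n",
"\n",
"# Element-wise comparison gives a boolean array; its mean is the error rate\n",
"error_rate = np.mean(predicted != actual)\n",
"print(error_rate)  # prints 0.2 (one mismatch out of five)\n",
"```"
]
},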
{
"cell_type": "markdown",
"metadata": {
"id": "5-9SCdUW7P0g",
"colab_type": "text"
},
"source": [
"We can also compute the model's accuracy (rate of correct classification)\n",
"and its error rate with the `.score()` method."
]
},
{
"cell_type": "code",
"metadata": {
"id": "-Ae_EgV17Erj",
"colab_type": "code",
"outputId": "8228985e-77d9-4933-d9be-d45db8f06a31",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 54
}
},
"source": [
"print(f\"Accuracy = {nb.score(irisData.data, irisData.target)}\")\n",
"print(f\"Taux d'erreur = {1-nb.score(irisData.data, irisData.target)}\")"
],
"execution_count": 11,
"outputs": [
{
"output_type": "stream",
"text": [
"Accuracy = 0.9533333333333334\n",
"Taux d'erreur = 0.046666666666666634\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "O8lYa_p8-ycf",
"colab_type": "text"
},
"source": [
"### Generalization performance"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "XjRZajoDZCUt",
"colab_type": "text"
},
"source": [
"The `split(S)` function randomly selects the indices that will go into the set `S2`. It then fills the lists `dataS1`, `targetS1`, `dataS2` and `targetS2` index by index: if an index is among the `S2` indices, the instance goes into the `S2` lists, otherwise into the `S1` lists."
]
},
{
"cell_type": "code",
"metadata": {
"id": "IsO1sd11-ztg",
"colab_type": "code",
"outputId": "6d4aa013-8bd2-4bde-e616-e6b1aea81fe6",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 54
}
},
"source": [
"from random import sample\n",
"S = list(zip(irisData.data, irisData.target))\n",
"def split(S):\n",
"    # Size of the test set S2: about one third of the sample\n",
"    s2_size = len(S)*0.33\n",
"    # Draw the indices of S2 at random, without replacement\n",
"    s2_indexes = sample(range(len(S)),k=round(s2_size))\n",
"\n",
"    dataS1 = []\n",
"    targetS1 = []\n",
"    dataS2 = []\n",
"    targetS2 = []\n",
"    for i in range(len(S)):\n",
"        if i in s2_indexes:\n",
"            dataS2.append(S[i][0])\n",
"            targetS2.append(S[i][1])\n",
"        else:\n",
"            dataS1.append(S[i][0])\n",
"            targetS1.append(S[i][1])\n",
"    return [dataS1, targetS1, dataS2, targetS2]\n",
"\n",
"\n",
"[dataS1, targetS1, dataS2, targetS2] = split(S)\n",
"print(f\"dataS1 = {len(dataS1)} targetS1 = {len(targetS1)}\")\n",
"print(f\"dataS2 = {len(dataS2)} targetS2 = {len(targetS2)}\")"
],
"execution_count": 20,
"outputs": [
{
"output_type": "stream",
"text": [
"dataS1 = 100 targetS1 = 100\n",
"dataS2 = 50 targetS2 = 50\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "EjIEbA0rsgw5",
"colab_type": "text"
},
"source": [
"To test the `split()` function, we write a `test` function that trains the model and computes the error rate."
]
},
{
"cell_type": "code",
"metadata": {
"id": "WEsNxkd8ZrNZ",
"colab_type": "code",
"outputId": "557f7a64-7535-4d03-c0a1-a1f131f04676",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 35
}
},
"source": [
"def test(S, clf):\n",
"    '''\n",
"    - S is the list containing (X,y) pairs\n",
"    - clf is the sklearn model\n",
"    Returns the error rate of clf on the held-out part S2.\n",
"    '''\n",
"    [dataS1, targetS1, dataS2, targetS2] = split(S)\n",
"    clf.fit(dataS1, targetS1)\n",
"    return 1-clf.score(dataS2, targetS2)\n",
"\n",
"test(S,nb)\n"
],
"execution_count": 25,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"0.1333333333333333"
]
},
"metadata": {
"tags": []
},
"execution_count": 25
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "E7bZawE7c46Z",
"colab_type": "text"
},
"source": [
"**Is the estimated error smaller than the apparent error computed earlier?**\n",
"- No, it is larger than the one computed earlier.\n",
"\n",
"**Do we always get the same estimate of the true error?**\n",
"- No, each run gives a different estimate of the true error, because the split is drawn at random.\n"
]
},
{
"cell_type": "code",
"metadata": {
"id": "AVW6c2_bdofz",
"colab_type": "code",
"outputId": "56c4c548-bf88-48e8-d53e-9fd34c4db441",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 126
}
},
"source": [
"# Run the test t times and return the mean\n",
"# error rate\n",
"def repeat_test(S,nb, t):\n",
"    result = 0\n",
"    for i in range(t):\n",
"        result += test(S,nb)\n",
"    return result / t\n",
"\n",
"# For each t, run the test and print the mean error rate\n",
"for t in [10,50,100,200,500,1000]:\n",
"    print(f\"t = {t} | moyenne = {repeat_test(S,nb, t)}\")\n"
],
"execution_count": 22,
"outputs": [
{
"output_type": "stream",
"text": [
"t = 10 | moyenne = 0.24\n",
"t = 50 | moyenne = 0.20159999999999997\n",
"t = 100 | moyenne = 0.19619999999999993\n",
"t = 200 | moyenne = 0.19640000000000005\n",
"t = 500 | moyenne = 0.2028000000000002\n",
"t = 1000 | moyenne = 0.20078000000000007\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "E3f2BB_Mi5rg",
"colab_type": "code",
"outputId": "88f65b75-cb8b-41ab-d712-293491ea3c43",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 126
}
},
"source": [
"# For each t, repeat the test 20 times\n",
"# and print the mean error rate\n",
"for t in [10,50,100,200,500,1000]:\n",
"    moy = 0\n",
"    for i in range(20):\n",
"        moy += repeat_test(S,nb, t)\n",
"    print(f\"t = {t} | 20 fois | moyenne = {moy/20}\")"
],
"execution_count": 15,
"outputs": [
{
"output_type": "stream",
"text": [
"t = 10 | 20 fois | moyenne = 0.18760000000000004\n",
"t = 50 | 20 fois | moyenne = 0.19938000000000003\n",
"t = 100 | 20 fois | moyenne = 0.20166999999999993\n",
"t = 200 | 20 fois | moyenne = 0.19890499999999997\n",
"t = 500 | 20 fois | moyenne = 0.20429200000000014\n",
"t = 1000 | 20 fois | moyenne = 0.2014140000000002\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "TDD57gy1joAF",
"colab_type": "text"
},
"source": [
"**Is the mean error stable or not?**\n",
"- The mean error becomes more and more stable as the number of repetitions `t` grows.\n",
"\n",
"**Can you interpret this result?**\n",
"- The error rates we obtain are estimates produced by one and the same (theoretical) estimator. The more random splits we average over, the closer the mean error rate gets to the expected value of that estimator."
]
},
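{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the stabilization visible, one can look at the spread of the averaged estimate as well as its value. The sketch below is an illustration under assumptions: it uses `train_test_split` with a one-third test set in place of the notebook's `split(S)`, and the helper names `holdout_error` and `mean_error` are introduced here.\n",
"```python\n",
"import numpy as np\n",
"from sklearn import datasets\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.naive_bayes import MultinomialNB\n",
"\n",
"iris = datasets.load_iris()\n",
"\n",
"def holdout_error(test_size=0.33):\n",
"    # One random train/test split, returning the test error rate\n",
"    X_tr, X_te, y_tr, y_te = train_test_split(iris.data, iris.target, test_size=test_size)\n",
"    return 1 - MultinomialNB().fit(X_tr, y_tr).score(X_te, y_te)\n",
"\n",
"def mean_error(t):\n",
"    # Average of t independent hold-out estimates\n",
"    return np.mean([holdout_error() for _ in range(t)])\n",
"\n",
"for t in [10, 50, 100]:\n",
"    # Repeat the t-run average 20 times and measure how much it varies\n",
"    means = [mean_error(t) for _ in range(20)]\n",
"    print(f\"t = {t} | mean = {np.mean(means):.4f} | std = {np.std(means):.4f}\")\n",
"```\n",
"For independent runs, the standard deviation of the averaged estimate is expected to shrink as `t` grows, roughly like $1/\\sqrt{t}$."
]
},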
{
"cell_type": "markdown",
"metadata": {
"id": "_lt8Witzkw-H",
"colab_type": "text"
},
"source": [
"**Is the estimated error (in its stable version) the same when the test sample is only one tenth of the initial sample as it is with the proportion of 1/3?**\n",
"\n",
"To make this comparison, we need to modify the `split(S)` function.\n",
"- The error with the 1/3 proportion is `0.20`\n",
"- The error with one tenth of the initial sample is `0.13`"
]
},
{
"cell_type": "code",
"metadata": {
"id": "8qlTT9AulF7s",
"colab_type": "code",
"colab": {}
},
"source": [
"def split(S):\n",
"    s2_size = len(S)*0.1 # One tenth of the initial sample\n",
"    s2_indexes = sample(range(len(S)),k=round(s2_size))\n",
"\n",
"    dataS1 = []\n",
"    targetS1 = []\n",
"    dataS2 = []\n",
"    targetS2 = []\n",
"    for i in range(len(S)):\n",
"        if i in s2_indexes:\n",
"            dataS2.append(S[i][0])\n",
"            targetS2.append(S[i][1])\n",
"        else:\n",
"            dataS1.append(S[i][0])\n",
"            targetS1.append(S[i][1])\n",
"    return [dataS1, targetS1, dataS2, targetS2]"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "9Pb1zMufymFi",
"colab_type": "text"
},
"source": [
"Using the `train_test_split` function, we try out different test sizes."
]
},
{
"cell_type": "code",
"metadata": {
"id": "PJp5CtmzyuHz",
"colab_type": "code",
"outputId": "7f5bc246-3ceb-4a9b-c049-9550b7629b7e",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 217
}
},
"source": [
"from sklearn.model_selection import train_test_split\n",
"from numpy import arange\n",
"\n",
"def test_it(test_size=.33):\n",
"    X_train, X_test, y_train, y_test = train_test_split(irisData.data, irisData.target,test_size=test_size)\n",
"    nb.fit(X_train, y_train)\n",
"    print(f\"Test size= {test_size} Accuracy = {nb.score(X_test, y_test)} Taux d'erreur = {1-nb.score(X_test, y_test)}\")\n",
"    return 1-nb.score(X_test, y_test)\n",
"\n",
"min_error_test_size = None\n",
"min_error = None\n",
"avg_error = 0\n",
"test_sizes = arange(.05, 0.5, .05)\n",
"for test_size in test_sizes:\n",
"    error = test_it(test_size)\n",
"    avg_error += error\n",
"    if (min_error_test_size is None):\n",
"        min_error_test_size = test_size\n",
"        min_error = error\n",
"    else:\n",
"        if error < min_error:\n",
"            min_error = error\n",
"            min_error_test_size = test_size\n",
"\n",
"print(f\"min_error is {min_error} with test_size = {min_error_test_size}\")\n",
"print(f\"Erreur moyenne = {avg_error/ len(test_sizes)}\")"
],
"execution_count": 17,
"outputs": [
{
"output_type": "stream",
"text": [
"Test size= 0.05 Accuracy = 1.0 Taux d'erreur = 0.0\n",
"Test size= 0.1 Accuracy = 0.8 Taux d'erreur = 0.19999999999999996\n",
"Test size= 0.15000000000000002 Accuracy = 0.8695652173913043 Taux d'erreur = 0.13043478260869568\n",
"Test size= 0.2 Accuracy = 0.7333333333333333 Taux d'erreur = 0.2666666666666667\n",
"Test size= 0.25 Accuracy = 0.8947368421052632 Taux d'erreur = 0.10526315789473684\n",
"Test size= 0.3 Accuracy = 0.8222222222222222 Taux d'erreur = 0.1777777777777778\n",
"Test size= 0.35000000000000003 Accuracy = 0.9622641509433962 Taux d'erreur = 0.037735849056603765\n",
"Test size= 0.4 Accuracy = 0.75 Taux d'erreur = 0.25\n",
"Test size= 0.45 Accuracy = 0.8676470588235294 Taux d'erreur = 0.13235294117647056\n",
"min_error is 0.0 with test_size = 0.05\n",
"Erreur moyenne = 0.14447013057566124\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "6R-sZBQS-iM_",
"colab_type": "text"
},
"source": [
"### 2.2 Estimer l'erreur réelle par validation croisée"
]
},
{
"cell_type": "code",
"metadata": {
"id": "YqsE5aGx8tUD",
"colab_type": "code",
"outputId": "ff6f5a2a-9473-49d4-9559-37694ff62381",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 108
}
},
"source": [
"from sklearn.model_selection import cross_val_score\n",
"\n",
"def test_fold(fold):\n",
" error_rate = 1 - cross_val_score(nb, irisData.data, irisData.target, cv=fold)\n",
" return 1-error_rate\n",
"def test_it(params,fn):\n",
" for param in params:\n",
" scores = fn(param)\n",
" print(\"K= %d folds | Taux d'erreur: %0.4f (+/- %0.2f)\" % (param,1-scores.mean(), scores.std() * 2))\n",
"\n",
"test_it([10,2,3,5,8],test_fold)\n"
],
"execution_count": 18,
"outputs": [
{
"output_type": "stream",
"text": [
"K= 10 folds | Taux d'erreur: 0.0467 (+/- 0.13)\n",
"K= 2 folds | Taux d'erreur: 0.0467 (+/- 0.01)\n",
"K= 3 folds | Taux d'erreur: 0.0533 (+/- 0.04)\n",
"K= 5 folds | Taux d'erreur: 0.0467 (+/- 0.09)\n",
"K= 8 folds | Taux d'erreur: 0.0526 (+/- 0.12)\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "XNl0NaKbIQcs",
"colab_type": "text"
},
"source": [
"# Génération du tableau des erreurs obtenues"
]
},
{
"cell_type": "code",
"metadata": {
"id": "7fhi6YlcISO_",
"colab_type": "code",
"outputId": "0b053f72-3eca-4be3-90cf-1520a10b51ba",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 145
}
},
"source": [
"from sklearn.naive_bayes import MultinomialNB\n",
"from sklearn.tree import DecisionTreeClassifier\n",
"from sklearn.ensemble import RandomForestClassifier\n",
"from sklearn.svm import SVC\n",
"from sklearn.neighbors import KNeighborsClassifier\n",
"\n",
"# Performances sur l'ensemble d'apprentissage\n",
"def performance_method1(clf, X, y):\n",
" clf.fit(X,y)\n",
" return 1 - clf.score(X,y)\n",
"# Par division de l'échantillon d'apprentissage\n",
"def performance_method2(clf, X, y):\n",
" X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=.33)\n",
" clf.fit(X_train, y_train)\n",
" return 1 - clf.score(X_test, y_test)\n",
"# Par la validation croisée\n",
"def performance_method3(clf, X, y):\n",
" error_rate = 1 - cross_val_score(clf, X, y).mean()\n",
" return error_rate\n",
"\n",
"performances = {\n",
" \"Performances sur l'ensemble d'apprentissage\": performance_method1,\n",
" \"Par division de l'échantillon d'apprentissage\": performance_method2,\n",
" \"Par la validation croisée\":performance_method3\n",
"}\n",
"\n",
"classifiers = {\n",
" \"Naive Bayes\": MultinomialNB(fit_prior=True),\n",
" \"Arbre de décision\": DecisionTreeClassifier(),\n",
" \"Random forest\": RandomForestClassifier(),\n",
" \"SVM Classifier\": SVC(),\n",
" \"KNN\": KNeighborsClassifier(),\n",
"}\n",
"\n",
"def generate_markdown_report(X, y):\n",
" print(\"| |\"+\"|\".join([perf_label for perf_label in performances])+\"|\")\n",
" print(\"|-|\"+\"|\".join([\"-\" for temp in performances])+\"|\")\n",
" for clf_label, clf in classifiers.items():\n",
" print(\"|\", clf_label, end=\"|\")\n",
" for perf_label, performance_fn in performances.items():\n",
" print(\"%.4f\"%performance_fn(clf, X, y), end=\"|\")\n",
" print()\n",
"\n",
"generate_markdown_report(irisData.data, irisData.target)\n"
],
"execution_count": 19,
"outputs": [
{
"output_type": "stream",
"text": [
"| |Performances sur l'ensemble d'apprentissage|Par division de l'échantillon d'apprentissage|Par la validation croisée|\n",
"|-|-|-|-|\n",
"| Naive Bayes|0.0467|0.4600|0.0467|\n",
"| Arbre de décision|0.0000|0.0800|0.0333|\n",
"| Random forest|0.0000|0.0400|0.0400|\n",
"| SVM Classifier|0.0267|0.0000|0.0333|\n",
"| KNN|0.0333|0.0600|0.0267|\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "hQUK4dAZOhG0",
"colab_type": "text"
},
"source": [
"##### **tableau recapitulatif des taux d'erreurs**\n",
"| |Performances sur l'ensemble d'apprentissage|Par division de l'échantillon d'apprentissage|Par la validation croisée|\n",
"|-|-|-|-|\n",
"| Naive Bayes|0.0467|**0.0000**|0.0467|\n",
"| Arbre de décision|**0.0000**|0.0600|0.0400|\n",
"| Random forest|**0.0000**|0.0400|0.0400|\n",
"| SVM Classifier|0.0267|0.0800|0.0333|\n",
"| KNN|0.0333|0.0600|**0.0267**|"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "hGsyhwBnRWlj",
"colab_type": "text"
},
"source": [
"**Est-ce que les erreurs obtenues par la méthode naïve Bayes et par la méthode arbre de décision, vous permettent d'indiquer si une méthode de classification est meilleure qu'une autre ?**\n",
"- On ne peut pas considérer qu'une méthode de classification est meilleure qu'une autre seulement par la comparaison de taux d'erreur. Cela dépend fortement à la nature du dataset. Certains algorithmes sont plus adapté à certains problèmes que d'autres."
]
}
]
}
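The final answer in the notebook says that error rates alone do not rank the classifiers. The sketch below illustrates one reason, using only numbers already printed in the notebook: the nine hold-out error rates from the `test_size` loop (rounded to four decimals) and the cross-validation errors of Naive Bayes (0.0467) and the decision tree (0.0333) from the generated table. The helper name `summarize_errors` and the comparison rule are assumptions made for illustration, not a formal statistical test. The nine splits also use different test sizes, so their spread is only a rough gauge of split-to-split noise.

```python
from statistics import mean, stdev

# Hold-out error rates printed by the test_size loop, rounded to 4 decimals
holdout_errors = [0.0, 0.2, 0.1304, 0.2667, 0.1053, 0.1778, 0.0377, 0.25, 0.1324]

# Cross-validation error rates taken from the generated table
cv_error_naive_bayes = 0.0467
cv_error_decision_tree = 0.0333


def summarize_errors(errors):
    """Return (mean, sample standard deviation) of a list of error rates."""
    return mean(errors), stdev(errors)


avg, spread = summarize_errors(holdout_errors)
gap = abs(cv_error_naive_bayes - cv_error_decision_tree)

print(f"Hold-out error: mean = {avg:.4f}, std = {spread:.4f}")
print(f"Naive Bayes vs decision tree gap = {gap:.4f}")

# Heuristic (an assumption, not a formal test): a gap smaller than the
# split-to-split spread cannot be told apart from sampling noise.
gap_is_within_noise = gap < spread
print("Gap is within split-to-split noise:", gap_is_within_noise)
```

With only 150 samples, a more careful comparison would score both classifiers on the same folds and repeat the cross-validation several times before drawing a conclusion.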