Last active
January 2, 2019 16:24
Wine Dataset - Clustering
{ | |
"nbformat": 4, | |
"nbformat_minor": 0, | |
"metadata": { | |
"colab": { | |
"name": "Clustering.ipynb", | |
"version": "0.3.2", | |
"provenance": [], | |
"collapsed_sections": [], | |
"include_colab_link": true | |
}, | |
"kernelspec": { | |
"name": "python2", | |
"display_name": "Python 2" | |
} | |
}, | |
"cells": [ | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "view-in-github", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"<a href=\"https://colab.research.google.com/gist/denismazzu96/d61d84c0027183e5bc22d0befe6801d1/clustering.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>" | |
] | |
}, | |
{ | |
"metadata": { | |
"id": "DdMyzU1xSRhD", | |
"colab_type": "code", | |
"colab": {} | |
}, | |
"cell_type": "code", | |
"source": [ | |
"# python 2\n", | |
"\n", | |
"import numpy as np\n", | |
"from sklearn.cluster import FeatureAgglomeration as fa\n", | |
"from sklearn.datasets import load_wine\n", | |
"from sklearn import svm\n", | |
"from sklearn.model_selection import train_test_split as tts\n", | |
"from sklearn.metrics import accuracy_score, precision_score, recall_score\n" | |
], | |
"execution_count": 0, | |
"outputs": [] | |
}, | |
{ | |
"metadata": { | |
"id": "oM3NDvZBSZAh", | |
"colab_type": "text" | |
}, | |
"cell_type": "markdown", | |
"source": [ | |
"# Part One\n",
"\n",
"\n",
"---\n",
"\n",
"I apply clustering to the **Wine** dataset from sklearn.\n",
"Specifically, **FeatureAgglomeration** is used to reduce the number of features. It is based on agglomerative (bottom-up) clustering, which keeps merging features until the desired number of clusters is reached; in this example we go from 13 features to 12.\n",
"\n",
"The two features merged into a single cluster are those at zero-based indices 5 and 11, namely \"total_phenols\" and \"od280/od315_of_diluted_wines\".\n",
"\n",
"The linkage criterion is \"ward\", which at each step merges the pair of clusters whose union gives the smallest increase in within-cluster variance."
] | |
}, | |
{ | |
"metadata": { | |
"id": "uKz2pjhxVOwM", | |
"colab_type": "code", | |
"colab": { | |
"base_uri": "https://localhost:8080/", | |
"height": 202 | |
}, | |
"outputId": "5d0fb7e1-cad6-4e76-a2fa-bf29318144d0" | |
}, | |
"cell_type": "code", | |
"source": [ | |
"wine = load_wine()\n",
"X = wine.data\n",
"Y = wine.target\n",
"\n",
"print('> Initial number of features:')\n",
"print(X.shape[1])\n",
"\n",
"# The number of rows (samples) stays the same, but the columns (features) are reduced.\n",
"# linkage='ward' minimizes the variance of the clusters being merged.\n",
"C = fa(n_clusters=12, linkage='ward')\n",
"N = C.fit_transform(X)\n",
"print('> Final number of features:')\n",
"print(N.shape[1])\n",
"# C.labels_ holds the cluster label assigned to each feature.\n",
"print('\\n> Cluster labels for each feature:')\n",
"print(C.labels_)\n",
"print('> Names of the initial features:')\n",
"print(wine.feature_names)"
],
"execution_count": 5,
"outputs": []
},
{
"metadata": {
"id": "OFks0xJ-Ve7b",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"**Evaluation:**\n",
"\n",
"\n",
"```\n",
"> Initial number of features:\n",
"13\n",
"> Final number of features:\n",
"12\n",
"\n",
"> Cluster labels for each feature:\n",
"[ 9  7 10  8 11  0  4  5  3  6  2  0  1]\n",
"> Names of the initial features:\n",
"['alcohol', 'malic_acid', 'ash', 'alcalinity_of_ash', 'magnesium', 'total_phenols', 'flavanoids', 'nonflavanoid_phenols', 'proanthocyanins', 'color_intensity', 'hue', 'od280/od315_of_diluted_wines', 'proline']\n",
"\n", | |
"```\n", | |
"\n" | |
] | |
}, | |
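{
"metadata": {
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"The merged features can also be read off `labels_` programmatically instead of by eye. The cell below is a minimal sketch that refits the same `FeatureAgglomeration` configuration used above; the names `agglo`, `reduced` and `groups` are illustrative and not part of the original notebook. It also checks that, with the default `pooling_func` (`np.mean`), the reduced column of a merged cluster is the mean of the original columns it replaces."
]
},
{
"metadata": {
"colab_type": "code",
"colab": {}
},
"cell_type": "code",
"source": [
"from __future__ import print_function\n",
"from collections import defaultdict\n",
"\n",
"import numpy as np\n",
"from sklearn.cluster import FeatureAgglomeration\n",
"from sklearn.datasets import load_wine\n",
"\n",
"wine = load_wine()\n",
"agglo = FeatureAgglomeration(n_clusters=12, linkage='ward')\n",
"reduced = agglo.fit_transform(wine.data)\n",
"\n",
"# Group feature indices by the cluster label assigned to each feature.\n",
"groups = defaultdict(list)\n",
"for index, label in enumerate(agglo.labels_):\n",
"    groups[label].append(index)\n",
"\n",
"# Report only the clusters that contain more than one original feature.\n",
"for label, indices in sorted(groups.items()):\n",
"    if len(indices) > 1:\n",
"        names = [wine.feature_names[i] for i in indices]\n",
"        print('cluster', label, 'merges indices', indices, names)\n",
"        # Compare the reduced column with the mean of the merged original columns.\n",
"        pooled = wine.data[:, indices].mean(axis=1)\n",
"        print('  equals mean of merged columns:', np.allclose(reduced[:, label], pooled))"
],
"execution_count": null,
"outputs": []
},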
{ | |
"metadata": { | |
"id": "jm-VgJoyVxas", | |
"colab_type": "text" | |
}, | |
"cell_type": "markdown", | |
"source": [ | |
"# Part Two\n",
"\n",
"Now I evaluate the actual effect of the clustering step on performance.\n",
"We train an SVM with a **linear kernel** on the original data and on the reduced data, and compare the two.\n",
"\n",
"The two models perform essentially the same: the reduced data match the scores obtained on the ungrouped data. Performance is measured in terms of:\n",
"- accuracy\n",
"- precision\n",
"- recall\n",
"\n",
"*Note: lowering the requested number of features further makes the model worse.*"
] | |
}, | |
{ | |
"metadata": { | |
"id": "dtJoTKVVP7I5", | |
"colab_type": "code", | |
"colab": { | |
"base_uri": "https://localhost:8080/", | |
"height": 201 | |
}, | |
"outputId": "27c41855-8a0c-477c-aa3b-fc9f5bcda99f" | |
}, | |
"cell_type": "code", | |
"source": [ | |
"\n",
"def evaluate(y_test, y_pred):\n",
"    # Print overall accuracy plus per-class precision and recall (average=None).\n",
"    print('accuracy  {}'.format(accuracy_score(y_test, y_pred)))\n",
"    print('precision {}'.format(precision_score(y_test, y_pred, average=None)))\n",
"    print('recall    {}'.format(recall_score(y_test, y_pred, average=None)))\n",
"\n",
"kernel = 'linear'\n",
"# Split the original data X.\n",
"Xtrain, Xtest, Ytrain, Ytest = tts(X, Y, test_size=0.3, random_state=5)\n",
"\n",
"Svm2 = svm.SVC(kernel=kernel)\n",
"Svm2.fit(Xtrain, Ytrain)\n",
"\n",
"print('\\n> svm kernel ' + kernel + ' on the original dataset:')\n",
"evaluate(Ytest, Svm2.predict(Xtest))\n",
"\n",
"# Split the reduced data N; the same random_state selects the same rows as above.\n",
"Ntrain, Ntest, Ytrain, Ytest = tts(N, Y, test_size=0.3, random_state=5)\n",
"\n",
"Svm = svm.SVC(kernel=kernel)\n",
"Svm.fit(Ntrain, Ytrain)\n",
"\n",
"print('\\n> svm kernel ' + kernel + ' after the clustering step:')\n",
"evaluate(Ytest, Svm.predict(Ntest))"
],
"execution_count": 10,
"outputs": []
},
{
"metadata": {
"id": "YYXh7RL0W58e",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"**Evaluation:**\n",
"\n",
"```\n",
"> svm kernel linear on the original dataset:\n",
"accuracy  0.9444444444444444\n",
"precision [0.95652174 0.94444444 0.92307692]\n",
"recall    [0.95652174 0.94444444 0.92307692]\n",
"\n",
"> svm kernel linear after the clustering step:\n",
"accuracy  0.9444444444444444\n",
"precision [0.95652174 0.94444444 0.92307692]\n",
"recall    [0.95652174 0.94444444 0.92307692]\n",
"```\n", | |
"\n" | |
] | |
} | |
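,
{
"metadata": {
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"The note in Part Two, that requesting fewer features eventually hurts the model, can be checked with a small sweep over `n_clusters`. This is a sketch under stated assumptions: the values in `cluster_counts` are arbitrary choices, and the agglomeration is fitted on the training split only, which differs from the cells above, where it is fitted on the full dataset before splitting. No output is shown because the scores may vary with the scikit-learn version."
]
},
{
"metadata": {
"colab_type": "code",
"colab": {}
},
"cell_type": "code",
"source": [
"from __future__ import print_function\n",
"\n",
"from sklearn.cluster import FeatureAgglomeration\n",
"from sklearn.datasets import load_wine\n",
"from sklearn.metrics import accuracy_score\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.svm import SVC\n",
"\n",
"wine = load_wine()\n",
"X_train, X_test, y_train, y_test = train_test_split(\n",
"    wine.data, wine.target, test_size=0.3, random_state=5)\n",
"\n",
"cluster_counts = [12, 8, 4, 2]  # illustrative values, not taken from the notebook\n",
"for n_clusters in cluster_counts:\n",
"    agglo = FeatureAgglomeration(n_clusters=n_clusters, linkage='ward')\n",
"    # Learn the feature grouping on the training split, then reuse it on the test split.\n",
"    reduced_train = agglo.fit_transform(X_train)\n",
"    reduced_test = agglo.transform(X_test)\n",
"    model = SVC(kernel='linear').fit(reduced_train, y_train)\n",
"    print(n_clusters, accuracy_score(y_test, model.predict(reduced_test)))"
],
"execution_count": null,
"outputs": []
}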
] | |
} |