@lucaspg96
Created July 13, 2018 13:50
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Email Classification Example\n",
"---\n",
"\n",
"This graph dataset contains the email exchanges within an enterprise. Each node *u* represents an employee's email address, labeled by the employee's department; each edge *(u,v)* indicates that *u* sent at least one email to *v*.\n",
"\n",
"Our objective is to predict the department in which each employee works."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Definition of some useful functions\n",
"---"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"#basic imports\n",
"%matplotlib inline\n",
"import matplotlib.pyplot as plt\n",
"import networkx as nx\n",
"from networkx import DiGraph\n",
"from node2vec import Node2Vec\n",
"import telegram\n",
"\n",
"\n",
"#Telegram Bot configurations\n",
"\n",
"my_token = 'YOUR_BOT_TOKEN' #token generated by @BotFather (never publish a real token)\n",
"my_chat_id = 169036135 #your telegram Id (can be obtained from @userinfobot)\n",
"\n",
"def send(msg, chat_id=my_chat_id, token=my_token):\n",
" \"\"\"\n",
" Send a message to a chat\n",
" \n",
" Parameters\n",
" ----------\n",
" msg : String\n",
" text that will be sent\n",
"\n",
" chat_id : int or String\n",
" id of the chat to which the message will be sent.\n",
" If it is a group or user, the id MUST be an integer (negative numbers for groups)\n",
" If it is a channel, the id CAN be the channel's tag\n",
" \n",
" token : String\n",
" Token of the bot that will send the message. \n",
" When sending a message to a user, the user must have talked\n",
" at least once with the bot (usually, via the /start command).\n",
" When sending to a group, the bot must be allowed to talk\n",
" in groups.\n",
" When sending to a channel, the bot must be an admin.\n",
" \"\"\"\n",
" \n",
" bot = telegram.Bot(token=token)\n",
" bot.sendMessage(chat_id=chat_id, text=msg)\n",
"\n",
"\n",
"#Defining log object to notify the steps updates\n",
"\n",
"class AbstractSimpleLog:\n",
" def log(self, msg):\n",
" raise NotImplementedError(\"log method must be implemented\")\n",
"\n",
"class PrintLog(AbstractSimpleLog):\n",
" def log(self, msg):\n",
" print(msg)\n",
" \n",
"class BotLog(AbstractSimpleLog):\n",
" def __init__(self, reason=None):\n",
" if reason:\n",
" send(\"Bot started to log: {}\".format(reason))\n",
" else:\n",
" send(\"Log Started -------------------------\")\n",
" def log(self, msg):\n",
" send(msg)\n",
"\n",
"\n",
"#Default name for the embedding file\n",
"EMBEDDING_FILE = \"embeddings.emb\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Computing the Embedding\n",
"---"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"First, we define the paths to the dataset and to its edges file"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"#We define a log object. For long computations, we use BotLog to keep track\n",
"# from a cellphone via the Telegram app\n",
"\n",
"logger = PrintLog()\n",
"# logger = BotLog()\n",
"\n",
"dataset = \"Email_dataset/\" #path to dataset\n",
"graph_file = dataset+\"edges.ssv\" #path to edges file"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, we compute the graph embedding"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"\r",
"Computing transition probabilities: 0%| | 0/1005 [00:00<?, ?it/s]"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Loading graph from Email_dataset/edges.ssv\n",
"Graph loaded\n",
"Computing transition probabilities\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Computing transition probabilities: 100%|██████████| 1005/1005 [00:03<00:00, 301.44it/s]\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Transitions probabilities computed\n",
"Starting Node2Vec embedding\n",
"Node2Vec embedding created\n",
"Saving embedding file\n",
"Embedding file saved\n"
]
}
],
"source": [
"logger.log(\"Loading graph from {}\".format(graph_file))\n",
"graph = nx.read_edgelist(graph_file, delimiter=\" \", create_using=DiGraph())\n",
"logger.log(\"Graph loaded\")\n",
"\n",
"logger.log(\"Computing transition probabilities\")\n",
"n2v = Node2Vec(graph, dimensions=128, walk_length=80, num_walks=50, workers=4, p=1, q=1)\n",
"logger.log(\"Transitions probabilities computed\")\n",
"\n",
"logger.log(\"Starting Node2Vec embedding\")\n",
"n2v_model = n2v.fit(window=80, min_count=1, batch_words=64)\n",
"logger.log(\"Node2Vec embedding created\")\n",
"\n",
"logger.log(\"Saving embedding file\")\n",
"n2v_model.wv.save_word2vec_format(dataset+EMBEDDING_FILE)\n",
"logger.log(\"Embedding file saved\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we are going to use the generated embedding to predict the employees' departments"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Data Processing\n",
"---\n",
"\n",
"To process the data, we are going to use NumPy's functions"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import numpy as np\n",
"from numpy import array"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Since we saved the embedding, we can load it as a NumPy matrix"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"vectors = np.loadtxt(dataset+EMBEDDING_FILE,delimiter=' ',skiprows=1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can also define a function to get a node's embedded representation"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def to_embedded(n):\n",
" #note: this indexes the embedding matrix by row position; the first column of each row is the node id\n",
" return vectors[n,:]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, we load the data from the *labels.ssv* file and generate the dataset in the form:\n",
"\n",
"NODE_EMBEDDING, DEPARTMENT"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[ 1.00000000e+00, -5.52356600e-01, -1.31899000e+00, ...,\n",
" 2.73627400e+00, 3.10003760e-01, 1.00000000e+00],\n",
" [ 1.30000000e+02, 1.60013030e+00, -4.14573250e-01, ...,\n",
" -1.19659230e+00, -4.13204510e-02, 1.00000000e+00],\n",
" [ 5.32000000e+02, -9.19102910e-01, -1.96323200e-01, ...,\n",
" 2.04122850e+00, -2.22895600e+00, 2.10000000e+01],\n",
" ..., \n",
" [ 7.50000000e+02, -2.84327940e-03, -2.77449560e-03, ...,\n",
" 3.72426860e-03, -9.51730650e-04, 1.00000000e+00],\n",
" [ 7.90000000e+02, -5.00332680e-04, -2.21990440e-03, ...,\n",
" -1.75051400e-03, 2.45084990e-03, 6.00000000e+00],\n",
" [ 9.44000000e+02, 6.54737760e-04, -1.85789620e-03, ...,\n",
" -1.21631000e-04, -3.08797140e-03, 2.20000000e+01]])"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data = [] #data matrix initialized empty\n",
"\n",
"with open(\"Email_dataset/labels.ssv\") as f:\n",
" for line in f:\n",
" node,department = line.split() #get the node id and its department (class)\n",
" node_embedded = to_embedded(int(node)) #get the embedded representation of the node\n",
" data.append(np.append(node_embedded,array([department]))) #insert the embedding and the class inside the data matrix\n",
"\n",
"data = array(data,dtype=float) #transform the data matrix in a Numpy array\n",
"data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We shuffle the data and split it into train and test subsets, using the *train_percentage* factor"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"np.random.shuffle(data)\n",
"train_percentage = 0.8\n",
"train_size = int(len(data)*train_percentage)\n",
"\n",
"train_data = array(data[0:train_size])\n",
"test_data = array(data[train_size:])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Predictions\n",
"---\n",
"We are going to use the following models to predict the classes"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"from sklearn.linear_model import LogisticRegression\n",
"from sklearn.linear_model import SGDClassifier\n",
"from sklearn.linear_model import Perceptron\n",
"\n",
"from sklearn.svm import SVC\n",
"\n",
"from sklearn.neural_network import MLPClassifier\n",
"\n",
"from sklearn.neighbors import KNeighborsClassifier\n",
"\n",
"from sklearn.ensemble import GradientBoostingClassifier\n",
"from sklearn.ensemble import RandomForestClassifier\n",
"from sklearn.ensemble import ExtraTreesClassifier\n",
"from sklearn.ensemble import AdaBoostClassifier\n",
"\n",
"from sklearn.gaussian_process import GaussianProcessClassifier\n",
"\n",
"from sklearn.tree import DecisionTreeClassifier\n",
"\n",
"from sklearn.naive_bayes import BernoulliNB\n",
"from sklearn.naive_bayes import GaussianNB"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We also define a function to train a model and report its score"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def train_and_eval(model):\n",
" model.fit(train_data[:,0:-1], train_data[:,-1])\n",
" return model.score(test_data[:,:-1], test_data[:,-1])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally, we train each model and compute its score"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"LogisticRegression score: 5.970149253731343%\n",
"SGDClassifier score: 1.4925373134328357%\n",
"Perceptron score: 2.9850746268656714%\n",
"SVC score: 8.955223880597014%\n",
"MLPClassifier score: 7.960199004975125%\n",
"KNeighborsClassifier score: 5.970149253731343%\n",
"GaussianProcessClassifier score: 0.4975124378109453%\n",
"DecisionTreeClassifier score: 4.975124378109453%\n",
"BernoulliNB score: 3.482587064676617%\n",
"GaussianNB score: 6.467661691542288%\n",
"GradientBoostingClassifier score: 5.970149253731343%\n",
"RandomForestClassifier score: 7.960199004975125%\n",
"ExtraTreesClassifier score: 5.472636815920398%\n",
"AdaBoostClassifier score: 9.45273631840796%\n"
]
}
],
"source": [
"score = train_and_eval(LogisticRegression())\n",
"logger.log(\"LogisticRegression score: {}%\".format(score*100))\n",
"score = train_and_eval(SGDClassifier(max_iter=100, tol=0.001))\n",
"logger.log(\"SGDClassifier score: {}%\".format(score*100))\n",
"score = train_and_eval(Perceptron(max_iter=100, tol=0.001))\n",
"logger.log(\"Perceptron score: {}%\".format(score*100))\n",
"\n",
"score = train_and_eval(SVC())\n",
"logger.log(\"SVC score: {}%\".format(score*100))\n",
"\n",
"score = train_and_eval(MLPClassifier())\n",
"logger.log(\"MLPClassifier score: {}%\".format(score*100))\n",
"\n",
"score = train_and_eval(KNeighborsClassifier())\n",
"logger.log(\"KNeighborsClassifier score: {}%\".format(score*100))\n",
"\n",
"score = train_and_eval(GaussianProcessClassifier())\n",
"logger.log(\"GaussianProcessClassifier score: {}%\".format(score*100))\n",
"\n",
"score = train_and_eval(DecisionTreeClassifier())\n",
"logger.log(\"DecisionTreeClassifier score: {}%\".format(score*100))\n",
"\n",
"score = train_and_eval(BernoulliNB())\n",
"logger.log(\"BernoulliNB score: {}%\".format(score*100))\n",
"score = train_and_eval(GaussianNB())\n",
"logger.log(\"GaussianNB score: {}%\".format(score*100))\n",
"\n",
"score = train_and_eval(GradientBoostingClassifier())\n",
"logger.log(\"GradientBoostingClassifier score: {}%\".format(score*100))\n",
"score = train_and_eval(RandomForestClassifier())\n",
"logger.log(\"RandomForestClassifier score: {}%\".format(score*100))\n",
"score = train_and_eval(ExtraTreesClassifier())\n",
"logger.log(\"ExtraTreesClassifier score: {}%\".format(score*100))\n",
"score = train_and_eval(AdaBoostClassifier())\n",
"logger.log(\"AdaBoostClassifier score: {}%\".format(score*100))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Our accuracy is very low for all the models. To understand why, let's analyze the distribution of departments in the data"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAeoAAAHYCAYAAACC36ucAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAGvVJREFUeJzt3X2wbWddH/Dvj1xFIQIJuV4iUC61QUQtoFekYgvTODUa\nx2QsMuhoA0ObP6pAqVO9Fp1Yq5I6DtaOQicFMYJgI1oTiyAxvtVWQi4ECSFBEBJezMsVELQ6ysvT\nP/ZK3dk5J9n77LPv+eWcz2dmz157rWev51kve33Xy95r1xgjAEBPD9jrBgAA2xPUANCYoAaAxgQ1\nADQmqAGgMUENAI0JagBoTFADQGOCGgAaO7TXDUiSs846axw9enSvmwEAp8zb3va2PxtjHL6vci2C\n+ujRozlx4sReNwMATpmqunWZck59A0BjghoAGhPUANCYoAaAxgQ1ADQmqAGgMUENAI0JagBoTFAD\nQGOCGgAaE9QA0JigBoDGBDUANCaoAaAxQQ0AjQlqAGhMUANAY4IaABoT1ADQ2KG9bsCio8ffcI9+\nt1x6/h60BAD2niNqAGhMUANAY4IaABoT1ADQmKAGgMYENQA0JqgBoDFBDQCNCWoAaExQA0BjghoA\nGhPUANCYoAaAxgQ1ADQmqAGgMUENAI0JagBoTFADQGOCGgAaE9QA0JigBoDGBDUANCaoAaAxQQ0A\njQlqAGhMUANAY4IaABoT1ADQmKAGgMYENQA0dp9BXVU/V1V3VtW75vqdWVVXV9V7p+cz5ob9QFW9\nr6reU1XfsKmGA8BBsMwR9c8nOW+h3/Ek14wxzklyzfQ6VfWEJM9O8mXTe15WVaftWmsB4IC5z6Ae\nY/x+ko8t9L4gyeVT9+VJLpzr/0tjjL8ZY3wgyfuSPGWX2goAB85Or1EfGWPcNnXfnuTI1P3IJB+a\nK/fhqd89VNXFVXWiqk6cPHlyh80AgP1t7S+TjTFGkrGD9102xjg2xjh2+PDhdZsBAPvSToP6jqo6\nO0mm5zun/h9J8ui5co+a+gEAO7DToL4qyUVT90VJrpzr/+yqemBVPTbJOUneul4TAeDgOnRfBarq\ndUmekeSsqvpwkkuSXJrkiqp6XpJbkzwrScYYN1bVFUneneTTSb57jPGZDbUdAPa9+wzqMca3bzPo\n3G3K/1iSH1unUQDAjDuTAUBjghoAGhPUANCYoAaAxgQ1ADQmqAGgMUENAI0JagBoTFADQGOCGgAa\nE9QA0JigBoDGBDUANCaoAaAxQQ0AjQlqAGhMUANAY4IaABoT1ADQmKAGgMYENQA0JqgBoDFBDQCN\nCWoAaOzQXjdgHUePv+Ee/W659Pw9aAkAbIYjagBoTFADQGOCGgAaE9QA0JigBoDGBDUANCaoAaAx\nQQ0AjQlqAGhMUANAY4IaABoT1ADQmKAGgMYENQA0JqgBoDFBDQCNCWoAaExQA0BjghoAGhPUANCY\noAaAxgQ1ADQmqAGgMUENAI0JagBoTFADQGOCGgAaE9QA0JigBoDGBDUANCaoAaAxQQ0AjQlqAGhM\nUANAY4IaABoT1ADQmKAGgMYENQA0JqgBoDFBDQCNCWoAaExQA0BjawV1Vb2oqm6sqndV1euq6vOq\n6syqurqq3js9n7FbjQWAg2bHQV1Vj0zygiTHxhhfnuS0JM9OcjzJNWOMc5JcM70GAHZg3VPfh5J8\nflUdSvKgJH+a5IIkl0/DL09y4Zp1AMCBteOgHmN8JMlPJvlgktuSfGKM8eYkR8YYt03Fbk9yZKv3\nV9XFVXWiqk6cPHlyp80AgH1tnVPfZ2R29PzYJF+U5MFV9Z3zZcYYI8nY6v1jjMvGGMfGGMcOHz68\n02YAwL62zqnvr0/ygTHGyTHGp5L8apKvTXJHVZ2dJNPznes3EwAOpnWC+oNJnlpVD6qqSnJukpuS\nXJXkoqnMRUmuXK+JAHBwHdrpG8cY11bV65O8
Pcmnk1yf5LIkpye5oqqel+TWJM/ajYYCwEG046BO\nkjHGJUkuWej9N5kdXQMAa3JnMgBoTFADQGOCGgAaE9QA0JigBoDGBDUANCaoAaAxQQ0AjQlqAGhM\nUANAY4IaABoT1ADQmKAGgMYENQA0JqgBoDFBDQCNCWoAaExQA0BjghoAGhPUANCYoAaAxgQ1ADQm\nqAGgMUENAI0JagBoTFADQGOCGgAaE9QA0JigBoDGBDUANCaoAaAxQQ0AjQlqAGhMUANAY4IaABoT\n1ADQmKAGgMYENQA0JqgBoDFBDQCNCWoAaExQA0BjghoAGhPUANCYoAaAxgQ1ADQmqAGgMUENAI0J\nagBoTFADQGOCGgAaO7TXDTgVjh5/w5b9b7n0/FPcEgBYjSNqAGhMUANAY4IaABoT1ADQmKAGgMYE\nNQA0JqgBoDFBDQCNCWoAaExQA0BjghoAGhPUANCYoAaAxgQ1ADR2IP7mchVb/SWmv8MEYK+sdURd\nVQ+rqtdX1c1VdVNV/aOqOrOqrq6q907PZ+xWYwHgoFn31PdPJ3nTGOPxSZ6Y5KYkx5NcM8Y4J8k1\n02sAYAd2HNRV9dAk/yTJK5NkjPG3Y4w/T3JBksunYpcnuXDdRgLAQbXOEfVjk5xM8qqqur6qXlFV\nD05yZIxx21Tm9iRHtnpzVV1cVSeq6sTJkyfXaAYA7F/rBPWhJF+Z5OVjjCcn+b9ZOM09xhhJxlZv\nHmNcNsY4NsY4dvjw4TWaAQD71zpB/eEkHx5jXDu9fn1mwX1HVZ2dJNPznes1EQAOrh0H9Rjj9iQf\nqqovmXqdm+TdSa5KctHU76IkV67VQgA4wNb9HfXzk/xiVX1ukvcneW5m4X9FVT0vya1JnrVmHQBw\nYK0V1GOMdyQ5tsWgc9cZLwAw4xaiANCYoAaAxgQ1ADQmqAGgMUENAI0JagBoTFADQGOCGgAaE9QA\n0JigBoDGBDUANCaoAaAxQQ0Aja37N5cH1tHjb9iy/y2Xnn+KWwLAfuaIGgAaE9QA0JigBoDGBDUA\nNCaoAaAxQQ0AjQlqAGhMUANAY4IaABoT1ADQmKAGgMYENQA0JqgBoDFBDQCN+ZvLU2Crv8T0d5gA\nLMMRNQA0JqgBoDFBDQCNCWoAaExQA0BjghoAGhPUANCY31E34zfXAMxzRA0AjQlqAGhMUANAY4Ia\nABoT1ADQmKAGgMYENQA0JqgBoDFBDQCNCWoAaExQA0BjghoAGhPUANCYoAaAxgQ1ADQmqAGgMUEN\nAI0JagBoTFADQGOCGgAaE9QA0JigBoDGBDUANCaoAaAxQQ0AjQlqAGhMUANAY4IaABoT1ADQmKAG\ngMbWDuqqOq2qrq+q/zm9PrOqrq6q907PZ6zfTAA4mHbjiPqFSW6ae308yTVjjHOSXDO9BgB2YK2g\nrqpHJTk/ySvmel+Q5PKp+/IkF65TBwAcZOseUf/nJN+X5LNz/Y6MMW6bum9PcmSrN1bVxVV1oqpO\nnDx5cs1mAMD+tOOgrqpvTnLnGONt25UZY4wkY5thl40xjo0xjh0+fHinzQCAfe3QGu99WpJvqapv\nSvJ5SR5SVa9JckdVnT3GuK2qzk5y5240FAAOoh0fUY8xfmCM8agxxtEkz07y22OM70xyVZKLpmIX\nJbly7VYCwAG1zhH1di5NckVVPS/JrUmetYE6Dryjx9+wZf9bLj3/FLcEgE3alaAeY/xukt+duj+a\n5NzdGC8AHHTuTAYAjQlqAGhMUANAY4IaABoT1ADQmKAGgMYENQA0JqgBoDFBDQCNCWoAaExQA0Bj\nghoAGhPUANDYJv7mkma2+ktMf4cJcP/giBoAGhPUANCYoAaAxgQ1ADQmqAGgMUENAI0JagBoTFAD\nQGOCGgAaE9QA0JhbiHI3bjcK0IsjagBoTFADQGOCGgAaE9QA0JigBoDGBDUANCaoAaAxQQ0AjQlq\nAGhMUANA
Y4IaABpzr292ZKt7gidb3xfc/cMBds4RNQA0JqgBoDFBDQCNuUZNK65nA9ydI2oAaExQ\nA0BjTn1zv+QUOXBQOKIGgMYENQA0JqgBoDHXqNn3XM8G7s8cUQNAY4IaABoT1ADQmGvUMPHXnUBH\njqgBoDFBDQCNCWoAaExQA0BjghoAGhPUANCYoAaAxgQ1ADQmqAGgMUENAI25hShsmNuNAutwRA0A\njQlqAGhMUANAY4IaABrbcVBX1aOr6neq6t1VdWNVvXDqf2ZVXV1V752ez9i95gLAwbLOEfWnk3zv\nGOMJSZ6a5Lur6glJjie5ZoxxTpJrptcAwA7sOKjHGLeNMd4+df9FkpuSPDLJBUkun4pdnuTCdRsJ\nAAfVrlyjrqqjSZ6c5NokR8YYt02Dbk9yZJv3XFxVJ6rqxMmTJ3ejGQCw76wd1FV1epJfSfJvxhif\nnB82xhhJxlbvG2NcNsY4NsY4dvjw4XWbAQD70lpBXVWfk1lI/+IY41en3ndU1dnT8LOT3LleEwHg\n4FrnW9+V5JVJbhpjvHRu0FVJLpq6L0py5c6bBwAH2zr3+n5aku9KckNVvWPq9++TXJrkiqp6XpJb\nkzxrvSbCwbDVPcET9wWHg27HQT3G+IMktc3gc3c6XgDg77gzGQA0JqgBoDH/Rw33Q/7jGg4OR9QA\n0JigBoDGBDUANCaoAaAxQQ0AjQlqAGhMUANAY4IaABoT1ADQmKAGgMbcQhT2Obcbhfs3R9QA0Jig\nBoDGBDUANOYaNZDEtWzoyhE1ADQmqAGgMUENAI25Rg2sbNnr2VuVW6Wsa+TgiBoAWhPUANCYU9/A\n/Y7T5BwkjqgBoDFBDQCNCWoAaExQA0BjghoAGhPUANCYoAaAxvyOGti33MKU/cARNQA0JqgBoDFB\nDQCNuUYNsKJN/M0nbMcRNQA0JqgBoDFBDQCNuUYN0IDr3mzHETUANCaoAaAxQQ0AjQlqAGhMUANA\nY4IaABrz8yyAfWqVv+70N599OaIGgMYENQA0JqgBoDHXqAFYmluYnnqOqAGgMUENAI0JagBozDVq\nADbC77h3hyNqAGhMUANAY4IaABpzjRqA+41Vfse9X657O6IGgMYENQA05tQ3AAda99uiOqIGgMYE\nNQA0JqgBoDHXqAFgSXtxW9SNHVFX1XlV9Z6qel9VHd9UPQCwn20kqKvqtCQ/m+QbkzwhybdX1RM2\nURcA7GebOqJ+SpL3jTHeP8b42yS/lOSCDdUFAPtWjTF2f6RVz0xy3hjjX06vvyvJ14wxvmeuzMVJ\nLp5efkmS92wxqrOS/NkSVS5bblNl7y/jPOj178dp2uv69+M07XX9+3GaDnr925V7zBjj8H2+e4yx\n648kz0zyirnX35XkZ3YwnhO7WW5TZe8v4zzo9e/Hadrr+vfjNO11/ftxmg56/auMc6vHpk59fyTJ\no+deP2rqBwCsYFNBfV2Sc6rqsVX1uUmeneSqDdUFAPvWRn5HPcb4dFV9T5LfTHJakp8bY9y4g1Fd\ntsvlNlX2/jLOg17/fpymva5/P07TXte/H6fpoNe/yjjvYSNfJgMAdodbiAJAY4IaABoT1ADQWNs/\n5aiqXxhj/Iu9bsc6quoFSf7HGONDS5R9fGZ3b3vk1OsjSa4aY9y0wSbO1//3k3xrZj+r+0ySP07y\n2jHGJ09B3V+T5KYxxier6vOTHE/ylUneneTHxxif2HQbToWqevgY46Nb9L/rlxF/Osb4rar6jiRf\nm+SmJJeNMT61zfi+LrO7AL5rjPHmDTZ9W1X1hWOMO09hfU9JMsYY1023JT4vyc1jjN9YY5yPz+xz\nd+0Y4y/n+p83xnjTfbx3y2Xa0aleVuyeFkfUVXXVwuPXk3zrXa+XeP8X7qDO5y5Z7o2rjnvOf0xy\nbVX9r6r611W15R1oqur7M7vNaiV56/SoJK87FX9oMu1Q/Nckn5fkq5M8ML
PAfktVPWPT9Sf5uSR/\nNXX/dJKHJvlPU79XLbT1EVX18qr62ap6eFX9cFXdUFVXVNXZu9Wgqnr4Fv0eUlUvqapXT2E6P+xl\nC68vraqzpu5jVfX+zNaFW6vq6QujflWS85O8sKpeneTbklyb2bJ4xdw43zrX/a+S/EySL0hyyeJ6\nUlVvr6ofrKovXmJaHzq19+aq+lhVfbSqbpr6PWyu3JkLj4cneWtVnVFVZy6M81hV/U5VvaaqHl1V\nV1fVJ6rquqp68ly58xba8cqqemdVvbaqjiyM85Ik/yXJy6vqJdP0PzjJ8ap68U6mf1r3r0zy/CTv\nqqr5Wx3/+ELZpZfpKuvKvbTtjQuvl5qnU9mlltWK8//0qvqRqrpxqvdkVb2lqp6zUG6p9WnVssta\ntp1LjGdx/p+Sbc+W1rlbym49krw9yWuSPCPJ06fn26bupy+UPXPh8fAktyQ5I8mZK9T5wbnur9zm\n8VVJblt433lz3Q9N8sok70zy2iRHFspen9nO0D+byp1M8qYkFyX5grlyf5zkc7Zo4+cmee9Cv4cm\nuTTJzUk+luSjmR15XZrkYVvM1x9M8sX3MS9uSHLa1P2gJL87df+9JNcvlD09yY8kuTHJJ6ZpekuS\n56y4zN84133TfJsXyr1j4fWbMtuoHp/m+/dntlPx/CRXLpR9RJKXZ/YHMQ9P8sPTtF6R5Oy5cpcm\nOWvqPpbk/Unel+TW+fUvya9MZS/M7L4Av5Lkgdu0+4a57t9J8tVT9+OycJeiJO+cng8luWNuWdRd\nw+5an+a6r0tyeOp+8Hx9U78PJPnJJB/MbMfvRUm+aJtl8ZvTfHzEwrz7/iRvnuv32Wm8849PTc/v\nXxjnWzP7U55vT/KhJM+c+p+b5A+3Wt6Z7ZT8aJLHTO39ta3W02kd/WSSh0z9P39+Pq0y/dM4T5+6\njyY5keSFi/N7B8t0qXUlq217lpqnqyyrFef/lUmek9kNrP5tkh9Kck6SyzM787XS+rSDsg9J8pIk\nr07yHQvDXrZqO3cw/1fZ9hyb1pHXTGWuzmx7eV2SJ9/btnHLz+iqb9jEI7Mwe9E0MU+a+r1/m7Kr\nbCzeuc3jhiR/M1fuM0l+e5qxi4+/XhjnKiv24sb7c5J8S5LXJTk51//mzO75ujitj0nynjVW7FU2\nVndtRM7I3EYns9OqK39YV/kQJPnlJM+dul+V5NjU/bgk1y2Mcz6sPrgwbEehniU3wFuM/8VJ/ndm\nOwGLy/qmJIem7rcszu+F1+/KbKfsjCR/kWmHM7MzHPM7MX80ldmqvsVQmV9P/3GSlyW5fZq+ixfK\n3m0d225Yku+d5ulXzK9j27zv3pbT9du0c3H+Lr6+fqvubcouNf1Jblx43+nTNL50i3GuskyXWley\n2rZnqXm6yrJacf7/0cLr66bnB2R2+WGl9WkHZZfd+Vmqnbs8/xfn1dI7Vcs8Viq86UdmG/9fzuyU\n1ge3KbPKxuKOJE/KLPDmH0czux54V7l3JTlnm3F8aI0V+/qtxjkNe9Bc93mZHcG9MbMfxl82TeP7\nMncEv4MVe9mN1QszC7L/ltlOw12heTjJ7y+Mc9c/BJmdJfj5JH+S2SnfT2V2VPt7SZ64Xf1JfnRh\n2OLGcqkPVpbcAE/lHrAw/DmZnV24daH/85O8Ock/zexI/qczO0P0H5K8eqHsi6bpvTXJC5JcMy2L\nG5JcMlfulqncB6bns6f+p2+x7t0tyKd+p03r2qsW+r85yfdl7oxQkiOZ7dj81jaf0Zdmdtp9ux3q\nP8zsTNK3TdN14dT/6bn7zs+HM9vh+95pumpu2OJR8rWZPjfzy2FafxZ3XJaa/mn9fNJCuUNJfiHJ\nZ9ZYpkutK1lt27PUPF1lWa04//9Pkq+bur8lyW/ODZvfoVtlfVql7LI7P0u1cwfz/962PYvzaumd\nqmUeKxU+VY/Mrtf9+L0MX3Zj8cq7Ft
gWw1471/3MJF+yTbkL11ixH7fCND8gyVOT/PPp8dRMp0AX\nyq2yYq+ysf6yaT48/j7auZEPwdTvIUmemNkR95Ft3vcjmU5VLvT/B0lev9BvqVDPkhvgJD+R5Ou3\nqPu8LFyimPo/I8l/z+wSyA1JfiOzf4zb6jLHF2U625HkYdOyeMqS686Dkjx2od8vrbDunZHZdwJu\nTvLxzC6p3DT12/Jy0rTs35Lk9m2GPzGzsz9vTPL4aZ5+PLOgetpcuUsWHnedzn9Ekl9YGOcDt6nr\nrMztuK8y/ZltSx6xzbCnbdFvqWW67LqS1bY9W83TP5/m6dfeyzRuu6xWnP9PzOxI8eNJ/uCudme2\nQ/+Ce1mfPj6tTz+xuD6tsu5l+Z2fxXY+bqt27mD+r7LtWWmn6j7X01Xf0OlxbyvgDsb1+MxOS5y+\n0H/xiHbpFXtD0zy/Yn9sYcU+Y6Hs0hvrFer/h5v4EGxoWa3ywXpGtt4AH1qy7m/cop6n5O9Oo39Z\nZjt337TpdWSb+p+Q2Q7mlvVP0/X1S8zT+XF+RWbfgdhunF+zzPSv0s5NrCebGuey07XicvrSFer/\n/23N7Dr+l2+zTFeZpi9dZj3Z4n2vvpdhS31OssKO8sK6t9I6lXvZjq+wTHe0U7VtvTtdYbs8FlbA\n5+5wHC/I7P+wfy2z04sXzA27x1HpvYxnR/Xv4rxYuv5NtPVU1J/Z0e8pW1bz5VapO7OduLdk9uWk\nl2R2OvuHkvx+khefgnVhsf7f3q7+Zdf/VaZp2bKbmk+7tZ7sZD6tMv93sJxuXrL+ZZfpKuv0UvVn\ndv148fGXd3Vv4nOSu39OV5mni+389RXauu14l23r0u9Zd4PQ6ZFtrmsv8b6lv/m5ifr3Yvo30dZT\nUf+pXla5+68DVvqGcJb8hvKG1oVVviG91HTtYJz3WXZT82m31pOdjnMT07+D+pddprs9zlV+xbMr\nyz/3/JwuO0+v38u2Lvtoe8OT7VTVO7cblNm12p14wJhudDDGuKVmvx1+fVU9Zhrvputf2ir1b6Kt\ne11/NrCsVmjn0nUn+fQY4zNJ/qqq/mRMN44ZY/x1VX12ielc1yr1Lztdq4xz2bKbmk+rLKtNjHMT\n079K/cuW3cQ4j2X2BdUXJ/l3Y4x3VNVfjzF+7x5zdIXpX+Fzuso8/ao9butS7ndBndlEfkNm10jn\nVWZfdNqJO6rqSWOMdyTJGOMvq+qbM7sRx1ecgvpXsUr9m2jrXte/iWW1bLlV6v7bqnrQGOOvMtsY\nzEZY9dDMfmK4aavUv+x0rTLOZctuaj6tsqw2Mc5NTP8q9S9bdtfHOcb4bJKfqqpfnp7vyPZZs8r0\nL/s5XXqcDdq6nFUPwff6kSW/yb3iOJf+5ucm6t/U9G9oXu11/bu+rFYot0rdS39DeUPrySrfkF5q\nulYc51JlNzWfVllWG1r3dn36V6x/2WW66+PcYti2v+JZcfqX/ZzueJ061W1d9uH/qAGgsRb3+gYA\ntiaoAaAxQQ0AjQlqAGhMUANAY/8PsAMboF4aYNEAAAAASUVORK5CYII=\n",
"text/plain": [
"<matplotlib.figure.Figure at 0x1257084a8>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"import pandas as pd\n",
"df = pd.read_csv(\"Email_dataset/labels.ssv\", delimiter=\" \", names=[\"Node\",\"Dep\"])\n",
"_ = df[\"Dep\"].value_counts().plot(kind=\"bar\", figsize=(8,8))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Analyzing the plot, we can see that the dataset is unbalanced, which can explain the poor accuracies we got.\n",
"So we are going to take a subgraph with a balanced distribution of departments."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Retrieving the Subgraph\n",
"---\n",
"First, we build a dictionary mapping nodes to their departments"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [],
"source": [
"labels_dict = {}\n",
"with open(\"Email_dataset/labels.ssv\") as f:\n",
" for line in f:\n",
" node,department = line.split()\n",
" labels_dict[node] = department"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, we generate a subset of edges containing only nodes from the departments listed in *labels_filter*"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [],
"source": [
"labels_filter = [\"4\",\"14\"]\n",
"with open(\"Email_dataset/edges.ssv\",'r') as f_in:\n",
" with open(\"Email_dataset/edges_filtered.ssv\",'w') as f_out:\n",
" for line in f_in:\n",
" src,trg = line.split()\n",
" if labels_dict[src] in labels_filter and labels_dict[trg] in labels_filter:\n",
" f_out.write(line)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, we reload the graph and compute its embedding"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Computing transition probabilities: 28%|██▊ | 54/194 [00:00<00:00, 537.90it/s]"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Loading graph from Email_dataset/edges_filtered.ssv\n",
"Graph loaded\n",
"Computing transition probabilities\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Computing transition probabilities: 100%|██████████| 194/194 [00:00<00:00, 1001.05it/s]\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Transitions probabilities computed\n",
"Starting Node2Vec embedding\n",
"Node2Vec embedding created\n",
"Saving embedding file\n",
"Embedding file saved\n"
]
}
],
"source": [
"logger = PrintLog()\n",
"\n",
"filtered_graph_file = \"Email_dataset/edges_filtered.ssv\"\n",
"logger.log(\"Loading graph from {}\".format(filtered_graph_file))\n",
"graph = nx.read_edgelist(filtered_graph_file, delimiter=\" \", create_using=DiGraph())\n",
"logger.log(\"Graph loaded\")\n",
"\n",
"logger.log(\"Computing transition probabilities\")\n",
"n2v = Node2Vec(graph, dimensions=128, walk_length=50, num_walks=30, workers=4, p=1, q=1)\n",
"logger.log(\"Transitions probabilities computed\")\n",
"\n",
"logger.log(\"Starting Node2Vec embedding\")\n",
"n2v_model = n2v.fit(window=50, min_count=1, batch_words=64)\n",
"logger.log(\"Node2Vec embedding created\")\n",
"\n",
"logger.log(\"Saving embedding file\")\n",
"n2v_model.wv.save_word2vec_format(dataset+EMBEDDING_FILE)\n",
"logger.log(\"Embedding file saved\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And now we process the data again, keeping only the employees that are in the graph"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [],
"source": [
"vectors = np.loadtxt(dataset+EMBEDDING_FILE,delimiter=' ',skiprows=1)\n",
"\n",
"def to_embedded(n):\n",
" return vectors[n,:]\n",
"\n",
"data = []\n",
"\n",
"with open(\"Email_dataset/labels.ssv\") as f:\n",
" for line in f:\n",
" node,department = line.split()\n",
" if department in labels_filter:\n",
" try:\n",
" #list() keeps this working with networkx 2.x, where nodes() returns a view without .index()\n",
" node_embedded = to_embedded(list(graph.nodes()).index(node))\n",
" data.append(np.append(node_embedded,array([department])))\n",
" except ValueError: #node is not present in the filtered graph\n",
" pass\n",
"\n",
"data = array(data,dtype=float)\n",
"\n",
"np.random.shuffle(data)\n",
"train_percentage = 0.7\n",
"train_size = int(len(data)*train_percentage)\n",
"\n",
"train_data = array(data[0:train_size])\n",
"test_data = array(data[train_size:])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally, we run the models again"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"LogisticRegression score: 55.932203389830505%\n",
"SGDClassifier score: 45.76271186440678%\n",
"Perceptron score: 54.23728813559322%\n",
"SVC score: 42.3728813559322%\n",
"MLPClassifier score: 55.932203389830505%\n",
"KNeighborsClassifier score: 45.76271186440678%\n",
"GaussianProcessClassifier score: 40.67796610169492%\n",
"DecisionTreeClassifier score: 52.54237288135594%\n",
"BernoulliNB score: 45.76271186440678%\n",
"GaussianNB score: 55.932203389830505%\n",
"GradientBoostingClassifier score: 55.932203389830505%\n",
"RandomForestClassifier score: 54.23728813559322%\n",
"ExtraTreesClassifier score: 57.6271186440678%\n",
"AdaBoostClassifier score: 52.54237288135594%\n"
]
}
],
"source": [
"score = train_and_eval(LogisticRegression())\n",
"logger.log(\"LogisticRegression score: {}%\".format(score*100))\n",
"score = train_and_eval(SGDClassifier(max_iter=100, tol=0.001))\n",
"logger.log(\"SGDClassifier score: {}%\".format(score*100))\n",
"score = train_and_eval(Perceptron(max_iter=100, tol=0.001))\n",
"logger.log(\"Perceptron score: {}%\".format(score*100))\n",
"\n",
"score = train_and_eval(SVC())\n",
"logger.log(\"SVC score: {}%\".format(score*100))\n",
"\n",
"score = train_and_eval(MLPClassifier())\n",
"logger.log(\"MLPClassifier score: {}%\".format(score*100))\n",
"\n",
"score = train_and_eval(KNeighborsClassifier())\n",
"logger.log(\"KNeighborsClassifier score: {}%\".format(score*100))\n",
"\n",
"score = train_and_eval(GaussianProcessClassifier())\n",
"logger.log(\"GaussianProcessClassifier score: {}%\".format(score*100))\n",
"\n",
"score = train_and_eval(DecisionTreeClassifier())\n",
"logger.log(\"DecisionTreeClassifier score: {}%\".format(score*100))\n",
"\n",
"score = train_and_eval(BernoulliNB())\n",
"logger.log(\"BernoulliNB score: {}%\".format(score*100))\n",
"score = train_and_eval(GaussianNB())\n",
"logger.log(\"GaussianNB score: {}%\".format(score*100))\n",
"\n",
"score = train_and_eval(GradientBoostingClassifier())\n",
"logger.log(\"GradientBoostingClassifier score: {}%\".format(score*100))\n",
"score = train_and_eval(RandomForestClassifier())\n",
"logger.log(\"RandomForestClassifier score: {}%\".format(score*100))\n",
"score = train_and_eval(ExtraTreesClassifier())\n",
"logger.log(\"ExtraTreesClassifier score: {}%\".format(score*100))\n",
"score = train_and_eval(AdaBoostClassifier())\n",
"logger.log(\"AdaBoostClassifier score: {}%\".format(score*100))"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.1"
}
},
"nbformat": 4,
"nbformat_minor": 2
}