@Z30G0D
Created February 13, 2018 09:29
A simple naive bayes classifier for spam emails
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Naive Bayes Classifier\n",
"## Spam email classifier\n",
"\n",
"Hello all, I wanted to build a spam classifier based on the Naive Bayes classifier.<br>\n",
"The dataset used here is Spambase from the UCI Machine Learning Repository, located <a href=\"https://archive.ics.uci.edu/ml/datasets/spambase\">here</a>.<br>\n",
"The dataset contains 4601 emails, each represented by an array of numbers.<br>\n",
"39.4% of the dataset is spam (1813 emails).<br>\n",
"You can find more details in the dataset documentation.<br>\n",
"As always, please refer to my email for any comments or remarks: tomer@nahshon.net\n"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"from sklearn.model_selection import train_test_split\n",
"from IPython.display import Math\n",
"from IPython.display import Image\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"data = open('Spambase/spambase/spambase.data', 'r')"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"data1 = []\n",
"for line in data:\n",
"    line = [float(element) for element in line.rstrip('\\n').split(',')]\n",
"    data1.append(np.asarray(line))\n",
"data.close()  # done reading; release the file handle\n",
"#print(len(data1))"
]
},
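{
"cell_type": "markdown",
"metadata": {},
"source": [
"(Side note: the same parsing can be done in a single call with NumPy's `loadtxt`. A minimal sketch, assuming the same file path as above; `data2` is just an illustrative name:)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Compact alternative to the manual loop above (sketch, not used below):\n",
"# np.loadtxt parses the comma-separated file straight into a 2D float array.\n",
"data2 = np.loadtxt('Spambase/spambase/spambase.data', delimiter=',')\n",
"data2.shape  # expected (4601, 58): 57 features plus the spam/non-spam label"
]
},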
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"num_features = 48\n",
"X = [data1[i][:num_features] for i in range(len(data1))]\n",
"y = [data1[i][-1] for i in range(len(data1))]"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"((4601, 48), (4601,))"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X = np.array(X)\n",
"y = np.array(y)\n",
"X.shape, y.shape\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Ok, we have our dataset. In \"X\" we have 4601 emails, each with 48 attributes; each attribute is the frequency of one word (buy, sell, etc.), given by the formula below. In \"y\" we have 4601 labels of 1 or 0, where 1 marks a spam email and 0 a non-spam email."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\\begin{eqnarray}\n",
"\\text{Frequency of word} = 100 \\times \\frac{\\text{Number of times word appears in mail}}{\\text{Total number of words in the mail}} \\\\\n",
"\\end{eqnarray}"
]
},
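{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the formula concrete, here is a toy example with made-up numbers (not taken from the dataset):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Toy illustration of the frequency formula (hypothetical numbers):\n",
"word_count = 3       # times the word \"buy\" appears in an email\n",
"total_words = 200    # total number of words in that email\n",
"100 * word_count / total_words  # 1.5 -- the value stored in the attribute"
]
},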
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Ok, let's split the data into a training set and a test set."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"((3082, 48), (3082,), (1519, 48), (1519,))"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X_train.shape, y_train.shape, X_test.shape, y_test.shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now let's find the class-specific likelihood ratios.\n",
"\\begin{eqnarray}\n",
"P(f_i \\mid y) = \\frac{count(f_i, y)}{\\sum\\nolimits_{f_j \\in \\text{vocabulary}} count(f_j, y)} \\\\\n",
"\\end{eqnarray}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We want to find the likelihood of a specific word given the class (spam or not).<br>\n",
"So we count the number of times the word (fi) appears given the class (y) and divide it by the total number of words given the class.<br>\n",
"To obtain this expression, we look at all the emails in our training set and average them column-wise:<br>\n",
"in every email a given cell holds the frequency of the same word, so averaging each column and dividing by 100 (undoing the scaling in the \"frequency of word\" formula above) gives the desired estimate.<br>\n",
"Let's separate the two classes first:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"((1852, 48), (1230, 48))"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X_train_class_1 = np.array([X_train[i,:] for i in range(len(X_train)) if y_train[i]==1])\n",
"X_train_class_0 = np.array([X_train[i,:] for i in range(len(X_train)) if y_train[i]==0])\n",
"\n",
"X_train_class_0.shape, X_train_class_1.shape"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"likelihood_class_1 = np.mean(X_train_class_1, axis=0)/100.0\n",
"likelihood_class_0 = np.mean(X_train_class_0, axis=0)/100.0"
]
},
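{
"cell_type": "markdown",
"metadata": {},
"source": [
"One caveat: if a word never appears in the training emails of one class, its estimated likelihood is exactly 0, and log10(0) = -inf (this is the source of the RuntimeWarning we will see later). A common remedy is Laplace smoothing or simply clipping the estimates away from 0 and 1. A minimal sketch (not applied here, so the results below are unaffected; eps is an arbitrary illustrative constant):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional: keep the likelihood estimates strictly inside (0, 1) so that\n",
"# np.log10 never sees a zero argument (sketch only, not used below).\n",
"eps = 1e-4  # arbitrary small constant for illustration\n",
"likelihood_class_1_smoothed = np.clip(likelihood_class_1, eps, 1 - eps)\n",
"likelihood_class_0_smoothed = np.clip(likelihood_class_0, eps, 1 - eps)"
]
},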
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(48,)"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"likelihood_class_1.shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"So now we have our likelihood ratios, great.<br>\n",
"Let's remember what we want to find.<br>\n"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<img src=\"https://inst.eecs.berkeley.edu/~cs188/fa09/projects/classification/images/img4_new.png\"/>"
],
"text/plain": [
"<IPython.core.display.Image object>"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"Image(url= \"https://inst.eecs.berkeley.edu/~cs188/fa09/projects/classification/images/img4_new.png\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Consider \"y\" in the formula above as our desired class (1 or 0) and \"f\" as our given data words.<br>\n",
"We would like to know the class given the data (f1...fm). The first transition is simply <a href=\"https://en.wikipedia.org/wiki/Bayes%27_theorem\">Bayes' theorem</a>.<br>\n",
"The second transition relies on the very strong assumption that our features are independent of one another, and hence their probabilities can be multiplied to get the joint probability.<br>\n",
"So what we obtained by computing the likelihood ratios in the previous cell are the multiplied terms P(fi|y); P(fi|y) in this formula is equivalent to P(word|spam) for our dataset.<br>\n",
"To find the class with the highest probability given the data, we simply take the class with the highest score (as can be seen from the third row onwards).<br>\n",
"Note that the denominator is dropped from the argmax expression: it is the same for all classes and does not affect which class attains the maximum.<br><br>\n",
"Next, we move to the log-likelihood expression. There are two reasons for this:<br>\n",
"1) Avoiding underflow - the probabilities are so small that their product falls below what floating point can represent; taking logs keeps the values in a manageable range.<br>\n",
"2) Multiplication to addition - instead of multiplying we can add the terms (log rules, go review them please :) )"
]
},
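{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before implementing it, a quick illustration (with made-up numbers) of the underflow problem that logs avoid:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Multiplying many tiny probabilities underflows to 0.0 in double precision,\n",
"# while the sum of their logs stays perfectly representable.\n",
"p = np.full(300, 1e-5)      # 300 independent probabilities of 1e-5\n",
"print(np.prod(p))           # 0.0 -- underflow (true value is 1e-1500)\n",
"print(np.sum(np.log10(p)))  # -1500.0 -- no problem in log space"
]
},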
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [],
"source": [
"def log_likelihood(feature_vector, Class):\n",
"    assert len(feature_vector) == num_features  # feature vector length must match the number of features\n",
"    likelihood = likelihood_class_1 if Class == 1 else likelihood_class_0  # pick the class-specific estimates\n",
"    log_likelihood = 0.0  # init log-likelihood accumulator\n",
"    for i in range(num_features):\n",
"        if feature_vector[i] == 1:    # word exists in the email\n",
"            log_likelihood += np.log10(likelihood[i])\n",
"        elif feature_vector[i] == 0:  # word doesn't exist\n",
"            log_likelihood += np.log10(1.0 - likelihood[i])\n",
"    return log_likelihood"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The input of this function is a feature vector and a class. The feature vector is a binary array.<br>\n",
"For every word (feature), the vector states whether the word exists in the mail (1) or not (0); this is easy to obtain from the original vector (the dataset), since any cell whose word frequency is higher than 0 is set to 1.<br>\n",
"The function implements the product of the probabilities under the independence assumption,<br>\n",
"but since we work in log space, we simply add the log-probabilities.<br>\n",
"Notice that when the word doesn't exist we subtract the likelihood from 1 (there are only two options: a word either exists or it doesn't).<br><br>\n",
"Now let's calculate the prior probability P(y) for each class."
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(-0.221191652036485, -0.39892752294300254)"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"prior_class_1 = np.log10(len(X_train_class_1)/ (len(X_train_class_1) + (len(X_train_class_0))))\n",
"prior_class_0 = np.log10(len(X_train_class_0)/ (len(X_train_class_1) + (len(X_train_class_0))))\n",
"prior_class_0, prior_class_1"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This is the prior probability P(y) for every class y, before seeing the data. Usually, from my understanding, it is not that easy to estimate, but in our case it is fairly easy since we have a closed dataset.<br>\n",
"In \"real life\", estimating this number is quite tricky.<br>\n",
"So we now have the summed independent log-probabilities P(fi|y), and also the prior probability P(y).<br>\n",
"All that remains is to find the class with the maximum score.<br>\n",
"We basically need to find the MAP (maximum a posteriori probability), i.e. P(y|f1,f2...fn), or in other words: the most probable class given the data."
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [],
"source": [
"def calculate_posterior(feature_vector):\n",
"    posterior_prob_class1 = prior_class_1 + log_likelihood(feature_vector, 1)\n",
"    posterior_prob_class0 = prior_class_0 + log_likelihood(feature_vector, 0)\n",
"\n",
"    return posterior_prob_class0, posterior_prob_class1"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Ok, now we have our posterior probability P(y|f1,f2...fn) for every class y (1 or 0).<br>\n",
"Our task is to calculate the posterior probability for each class given the feature vector and choose the higher one.<br>\n",
"Let's calculate the posteriors for a random email and choose the MAP."
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {},
"outputs": [],
"source": [
"def classify(feature_vector):\n",
"    posterior_class_0, posterior_class_1 = calculate_posterior(feature_vector)\n",
"    if posterior_class_0 > posterior_class_1:\n",
"        return 0\n",
"    else:\n",
"        return 1"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's feed in a random sample from the training set to see that it works."
]
},
{
"cell_type": "code",
"execution_count": 52,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.0"
]
},
"execution_count": 52,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"A = X_train[5,:].copy()  # copy so the training data isn't modified in place\n",
"A[A>0] = 1  # binarize the frequency vector, as explained above\n",
"y_train[5]"
]
},
{
"cell_type": "code",
"execution_count": 51,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0"
]
},
"execution_count": 51,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"prediction = classify(A)\n",
"prediction"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It works! Let's try this on our test set."
]
},
{
"cell_type": "code",
"execution_count": 77,
"metadata": {},
"outputs": [],
"source": [
"def predict(X):\n",
"    pred = np.ones(len(X))\n",
"    for i in range(len(X)):\n",
"        pred[i] = classify(X[i,:])\n",
"    return pred"
]
},
{
"cell_type": "code",
"execution_count": 79,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"C:\\Users\\zeogo\\Miniconda2\\envs\\jupy\\lib\\site-packages\\ipykernel_launcher.py:13: RuntimeWarning: divide by zero encountered in log10\n",
" del sys.path[0]\n"
]
},
{
"data": {
"text/plain": [
"array([1., 0., 1., ..., 0., 1., 0.])"
]
},
"execution_count": 79,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X_test[X_test>0] = 1 #changing all the emails in our test set to binary arrays\n",
"pred = predict(X_test)\n",
"pred"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Ok, we got our predictions. Note the RuntimeWarning above: some words never appear in one of the classes, so their likelihood is exactly 0 and np.log10 returns -inf (the smoothing sketch earlier would avoid this).<br>\n",
"Now let's build a function that measures our accuracy by comparing \"pred\" to \"y_test\"."
]
},
{
"cell_type": "code",
"execution_count": 80,
"metadata": {},
"outputs": [],
"source": [
"def accuracy(predictions, truth):\n",
"    count = 0\n",
"    for i in range(len(predictions)):\n",
"        if predictions[i] == truth[i]:\n",
"            count += 1\n",
"    return count / len(truth)\n"
]
},
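{
"cell_type": "markdown",
"metadata": {},
"source": [
"(Since \"pred\" and \"y_test\" are NumPy arrays, the loop above is equivalent to a one-line vectorized computation:)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Vectorized equivalent of the accuracy loop: fraction of element-wise matches\n",
"np.mean(pred == y_test)"
]
},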
{
"cell_type": "code",
"execution_count": 94,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The accuracy is: 88.479263%\n"
]
}
],
"source": [
"print('The accuracy is: {0:f}%'.format((accuracy(pred, y_test))*100))"
]
},
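{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a sanity check, one could compare against scikit-learn's BernoulliNB, which implements the same Bernoulli Naive Bayes model. A sketch: its exact accuracy will differ slightly because BernoulliNB applies Laplace smoothing (alpha=1.0) by default, but it should land in the same ballpark."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Cross-check with scikit-learn's Bernoulli Naive Bayes (sketch).\n",
"from sklearn.naive_bayes import BernoulliNB\n",
"clf = BernoulliNB()  # default alpha=1.0 Laplace smoothing\n",
"clf.fit(X_train > 0, y_train)  # binarize the training features the same way\n",
"clf.score(X_test, y_test)      # X_test was already binarized above"
]
},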
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Ok, we got our accuracy to about 88.5%.\n",
"This is it for this notebook.<br>\n",
"Please send any comments or remarks to my personal email box: tomer@nahshon.net<br>\n",
"Thank you!<br>\n",
"Tomer"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python (myenv)",
"language": "python",
"name": "myenv"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.4"
}
},
"nbformat": 4,
"nbformat_minor": 2
}