@Esaslow
Last active July 9, 2018 20:48
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### IMPORTS"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"from collections import Counter\n",
"import numpy as np\n",
"from importlib import reload\n",
"import pandas as pd\n",
"import matplotlib.pyplot as plt\n",
"import functions as F"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Classifying spam emails using Naive Bayes and TF/IDF\n",
"Using bayesian updating and by vectorizing words we can classify whether or not an email that comes through is spam or not. How does this work? I'll explain the process a little it below, and then go throught the implimentation.\n",
"\n",
"#### 1. Turning the documents into vectors\n",
"##### 1.1 Matrix with one column, and each row is a complete doc\n",
"The first step to this process is taking all of the documents (emails) that you are trying to predict whether or not are spam and put them into a single matrix. This matrix would look like\n",
"\n",
"|**Document Body**|\n",
"|:--:|\n",
"|'This would be document one including all of the text'|\n",
"|'This is document 2'|\n",
"|'This is document 3... getting the idea?'|\n",
"|'This matrix for the emails contains 702 emails for my test set'|\n",
"|'I made a matrix for just the body text of each email'|\n",
"|'I also made a matrix for just the subject of each email'|\n",
"\n",
"\n",
"##### 1.2 Matrix with same number of rows, but a col that represents each unique word\n",
"\n",
"Now we are going to transform our matrix from one that only has a single col to a matrix that contains a col for every unique word. What if the word is contained twice in our document? Well, we put the number of counts in that document into this matrix:\n",
"\n",
"|**Doc Name**| word1|word2|...|word n-1|word n|\n",
"|:----------:|:----:|:---:|:-:|:------:|:----:|\n",
"|**doc1**|1|1|...|0|1|\n",
"|**doc2**|2|0|...|1|1|\n",
"|**doc1**|2|2|...|1|1|\n",
"\n",
"The structure of this matrix is crucial. Play with it in your mind for a while, the longer the better. This is what allows us to map our results back to specific words which gives us a better understanding of what is actually going on. \n",
"\n",
"##### 1.3 Normalizing the word counts\n",
"This is where things get a little difficult, but for an intuitive understand, try this out. If we have to vectors pointing in the same direction, but one has a much greater magnitude, our model will not see them as similar. To account for this, we need to normalize our matrix. Sklearn has an implimintation of this using a method TFIDF. This also has some interesting features that can process the words for you, and get rid of nonsensical words. Along with other cool features, this is a great way to vectorize your data. I will leave it up to the user to explore the available options with the sklearn package.\n",
"\n",
"### 2. Creating a pipeline that takes in emails and predicts the probability of the email being spam\n",
"\n",
"Start by creating a function that is called **extract_features.** This function takes in the directory that all of the emails are in. We take the emails, and read the body of the email. Then add the body of each document to a list of documents so that I have all the text saved as a new line. \n",
"\n",
"The structure of this matrix is 1 col, and each row is just the document. I did this for both the subjects and the bodies of each of the emails.\n",
"\n",
"Once I have this corpus _(collection of all of my documents as a list)_ I can vectorize it using the sklearn implimentation. Doing this by creating **two** different vectorizers. One for the subject and one for the body of the email. "
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"::::::::::::::::::::::::::::::::::::::::::::::::::\n",
"\n",
"The size of the subject line vectors is: (702, 5385)\n",
"--------------------------------------------------\n",
"The size of the body vectors is: (702, 509782)\n",
"--------------------------------------------------\n",
"The size of the target vector is: 702 \n",
"\n",
"::::::::::::::::::::::::::::::::::::::::::::::::::\n"
]
}
],
"source": [
"from sklearn.feature_extraction.text import TfidfVectorizer \n",
"reload(F)\n",
"train_dir = 'ling-spam/train-mails'\n",
"test_dir = 'ling-spam/test-mails'\n",
"\n",
"#Create vectorizor\n",
"Body_vectorizor = TfidfVectorizer(analyzer='word', ngram_range=(1,5), min_df = 0, stop_words = 'english')\n",
"Subject_vectorizor = TfidfVectorizer(analyzer='word', ngram_range=(1,5), min_df = 0, stop_words = 'english')\n",
"\n",
"#extract a corpus for the bodys, subjects, and targets\n",
"docs,subjects, target = F.extract_features(train_dir)\n",
"\n",
"#transform into matrix format\n",
"body_matrix = Body_vectorizor.fit_transform(docs)\n",
"subject_matrix = Subject_vectorizor.fit_transform(subjects)\n",
"print(':'*50)\n",
"print('\\nThe size of the subject line vectors is: ',subject_matrix.shape)\n",
"print('-'*50)\n",
"print('The size of the body vectors is: ',body_matrix.shape)\n",
"print('-'*50)\n",
"print('The size of the target vector is: ',len(target),'\\n')\n",
"print(':'*50)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Modeling Part 1:\n",
"Since I created two different vectorized systems of the words, I will need to do one of two things to train models\n",
"1. Train a model on the seperate vector systems\n",
"2. Concatinate the vector systems and then create a model\n",
"\n",
"I am going to first train a model on the seperate vector systems to see the feature importance. This will be cool because we will get to see what is driving the classification of spam for both of the models. After this, I will impliment a system that puts both of these vectors together and then predicts off of the concatinated matricies. \n",
"\n",
"It is much harder to map back to the origin words after I have put the matricies together so that is why I am going through this process first."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.ensemble import GradientBoostingClassifier as GB\n",
"\n",
"#Initialize and train a model for the body of the email\n",
"body_model = GB(n_estimators = 10, learning_rate = .5)\n",
"body_model.fit(body_matrix,target);\n",
"\n",
"#Initialize and train a model for the subject line of the email\n",
"subject_model = GB(n_estimators = 10, learning_rate = .5)\n",
"subject_model.fit(subject_matrix,target);"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"<matplotlib.figure.Figure at 0x1a1bb8bac8>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"_,ax = plt.subplots(1,2,figsize = (10,10))\n",
"reload(F)\n",
"F.plot_feat_importance(ax[0],subject_model,Subject_vectorizor)\n",
"F.plot_feat_importance(ax[1],body_model,Body_vectorizor);"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Fitting the model for the concatenated matricies\n",
"The first thing I need to do is to concatinate the two matricies I have created into a single matrix that contains both but has a higher weight for the subject lines. \n",
"\n",
"This contains the matrix for both the subject lines and the body elements of each email.\n",
"\n",
"Here I am weighting the subject line 3x more heavily than the body words, Cool\n",
"\n",
"Below is the code to do that:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"--------------------------------------------------\n",
"The resulting size after concatenation is: (702, 515167)\n",
"--------------------------------------------------\n"
]
}
],
"source": [
"from scipy.sparse import hstack\n",
"from scipy import sparse\n",
"\n",
"#concat\n",
"sub_and_body = np.concatenate((body_matrix.todense(),3*subject_matrix.todense()),axis = 1)\n",
"\n",
"#convert back to sparse\n",
"sub_and_body = sparse.csr_matrix(sub_and_body)\n",
"\n",
"#print the resulting matrix size\n",
"print('-'*50)\n",
"print('The resulting size after concatenation is: ',\n",
" sub_and_body.shape)\n",
"print('-'*50)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Training the Final Model:\n",
"Now that we concatenated the matricies together, we can predict off of the new dataset. We need to first fit our model, then load the testing dataset and apply the same vectorization to it. Once we get this done we can predict off the test dataset.\n",
"\n",
"For this I am going to use the **GradientBoostingClassifier** from Sklearn. This is a great classifier and fairly robust. More information can be found from sklean documentation on this specific classifier."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.ensemble import GradientBoostingClassifier as GB\n",
"import pandas as pd\n",
"\n",
"#Train the final model\n",
"total_model = GB(n_estimators = 200, learning_rate = .01)\n",
"total_model.fit(sub_and_body,target);"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Total clasfication percent correct: 98.14814814814815 %\n"
]
}
],
"source": [
"print('Total clasfication percent correct: ',\n",
" sum(target == total_model.predict(sub_and_body))/len(target)*100,\n",
" '%')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Wow! Thats really great prediction on the training data. Lets move on and apply this to the test dataset which was already split out:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Load in the testing dataset, vectorize, and then concatinate\n",
"**Make sure to apply the same weight as before to the subject line matrix**"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"#extract a corpus for the bodys, subjects, and targets\n",
"docs,subjects, target = F.extract_features(test_dir)\n",
"\n",
"#transform into matrix format\n",
"body_matrix = Body_vectorizor.transform(docs)\n",
"subject_matrix = Subject_vectorizor.transform(subjects)\n",
"\n",
"sub_and_body = np.concatenate((body_matrix.todense(),3*subject_matrix.todense()),axis = 1)\n"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Total clasfication percent correct: 91.15384615384615 %\n"
]
}
],
"source": [
"print('Total clasfication percent correct: ',\n",
" sum(target == total_model.predict(sub_and_body))/len(target)*100,\n",
" '%')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Wow we are getting really great results. \n",
"\n",
"I was very careful to ensure that there was no data-leakage, and I am excited to impliment this system as my personal spam detection software.\n",
"\n",
"This is really cool though. Using this method I was able to classify whether or not something was spam to over 90% accuracy. \n",
"\n",
"If I where to use this on my own personal email list, I could classify specific senders as spam along with the application of predicting based on the subject line and text body. I would need to optimize this to do a better job by using ROC curves and using information about the number of false postitives and true negatives. Since I have such great accuracy, I could ensure that there are virtually zero false positives and using this make sure that I never miss an important email. Rad!\n",
"\n",
"### Next steps\n",
"- Look at the number of true positives and false positive\n",
"- plot roc curve\n",
"- Decide where a good cutoff is for my spam filter"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.4"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
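
Sections 1.2 and 1.3 of the notebook describe building a word-count matrix and then normalizing it. The idea can be sketched with the standard library alone. This is an illustration and not sklearn's `TfidfVectorizer` code: the toy corpus, the whitespace tokenizer, the helper names `count_matrix` and `tfidf_rows`, and the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` are all assumptions made for the example.

```python
import math
from collections import Counter


def count_matrix(docs):
    """Return (vocab, rows) where rows[i][j] counts word vocab[j] in document i."""
    tokenized = [doc.lower().split() for doc in docs]  # naive whitespace tokenizer
    vocab = sorted({word for tokens in tokenized for word in tokens})
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        rows.append([counts[word] for word in vocab])
    return vocab, rows


def tfidf_rows(rows):
    """Weight counts by a smoothed IDF, then scale each row to unit length."""
    n_docs = len(rows)
    n_words = len(rows[0])
    # Document frequency: in how many documents does each word appear?
    df = [sum(1 for row in rows if row[j] > 0) for j in range(n_words)]
    idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]
    weighted = []
    for row in rows:
        vec = [count * weight for count, weight in zip(row, idf)]
        norm = math.sqrt(sum(v * v for v in vec))
        weighted.append([v / norm if norm else 0.0 for v in vec])
    return weighted


# Hypothetical toy corpus: the second document is the first one repeated twice.
docs = [
    "free money now",
    "free money now free money now",
    "meeting agenda attached",
]
vocab, counts = count_matrix(docs)
tfidf = tfidf_rows(counts)

print(vocab)
print(counts)
```

The second document has exactly twice the counts of the first, so the raw rows differ in magnitude. After normalization the two rows coincide, which is the point made in section 1.3.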
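
The "Next steps" list in the notebook (count true and false positives, plot a ROC curve, pick a cutoff) could start from something like the sketch below. It uses only the standard library, and the labels and spam probabilities are made-up numbers, not results from the notebook. It also assumes that label `1` means spam, which the notebook does not show because `F.extract_features` is not included. In the notebook the scores would presumably come from something like `total_model.predict_proba(sub_and_body)[:, 1]`. The helper names are hypothetical.

```python
def confusion_counts(y_true, spam_prob, threshold):
    """Return (tp, fp, tn, fn) when emails scoring >= threshold are flagged as spam."""
    tp = fp = tn = fn = 0
    for label, prob in zip(y_true, spam_prob):
        flagged = prob >= threshold
        if flagged and label == 1:
            tp += 1
        elif flagged:
            fp += 1  # a legitimate email sent to the spam folder
        elif label == 1:
            fn += 1  # spam that reached the inbox
        else:
            tn += 1
    return tp, fp, tn, fn


def roc_points(y_true, spam_prob):
    """Return (threshold, false_positive_rate, true_positive_rate) per distinct score."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    points = []
    for threshold in sorted(set(spam_prob), reverse=True):
        tp, fp, _, _ = confusion_counts(y_true, spam_prob, threshold)
        points.append((threshold, fp / n_neg, tp / n_pos))
    return points


def lowest_cutoff_without_false_positives(points):
    """Return the smallest threshold whose false positive rate is still zero."""
    safe = [threshold for threshold, fpr, _ in points if fpr == 0.0]
    return min(safe) if safe else None


# Hypothetical labels (1 = spam, 0 = not spam) and predicted spam probabilities.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
spam_prob = [0.95, 0.80, 0.40, 0.30, 0.10, 0.60, 0.70, 0.05]

points = roc_points(y_true, spam_prob)
cutoff = lowest_cutoff_without_false_positives(points)

print(confusion_counts(y_true, spam_prob, 0.5))
print(cutoff)
```

On this toy data a 0.5 threshold flags one legitimate email, while raising the cutoff to 0.7 removes that false positive at the cost of still missing one spam email. That trade-off is what the ROC curve makes visible.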