@daniel-covelli
Created April 16, 2020 01:38
SPAM/HAM.ipynb
{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"name": "SPAM/HAM.ipynb",
"provenance": [],
"mount_file_id": "1WPoOppqzkURZj9mayZA_qNwtbyBQQpEm",
"authorship_tag": "ABX9TyPMyWVbZ1W+6/H6jvolyokR",
"include_colab_link": true
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
}
},
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "view-in-github",
"colab_type": "text"
},
"source": [
"<a href=\"https://colab.research.google.com/gist/daniel-covelli/055306a9d551e97cd7cfc5cd27e735e7/spam-ham.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "tTId221qAfIW",
"colab_type": "text"
},
"source": [
"# Spam/Ham Classification\n",
"The following code aims to distinguish spam emails from non-spam (ham) emails."
]
},
{
"cell_type": "code",
"metadata": {
"id": "J0ONaCJMAeBN",
"colab_type": "code",
"colab": {}
},
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"import matplotlib.pyplot as plt\n",
"\n",
"import seaborn as sns"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "RpvjlKJRBYCJ",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 204
},
"outputId": "6f62a87f-2a04-4d0d-8001-830c376dd7fd"
},
"source": [
"data = pd.read_csv(\"/content/drive/My Drive/Colab Notebooks/DATA/SPAM HAM/spam.csv\", encoding='latin-1')\n",
"data.head()"
],
"execution_count": 87,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>v1</th>\n",
" <th>v2</th>\n",
" <th>Unnamed: 2</th>\n",
" <th>Unnamed: 3</th>\n",
" <th>Unnamed: 4</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>ham</td>\n",
" <td>Go until jurong point, crazy.. Available only ...</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>ham</td>\n",
" <td>Ok lar... Joking wif u oni...</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>spam</td>\n",
" <td>Free entry in 2 a wkly comp to win FA Cup fina...</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>ham</td>\n",
" <td>U dun say so early hor... U c already then say...</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>ham</td>\n",
" <td>Nah I don't think he goes to usf, he lives aro...</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" v1 ... Unnamed: 4\n",
"0 ham ... NaN\n",
"1 ham ... NaN\n",
"2 spam ... NaN\n",
"3 ham ... NaN\n",
"4 ham ... NaN\n",
"\n",
"[5 rows x 5 columns]"
]
},
"metadata": {
"tags": []
},
"execution_count": 87
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "p1Icxz18D-cH",
"colab_type": "text"
},
"source": [
"Let's clean the data a bit and check whether there are any NaN values that we should be concerned about.\n"
]
},
{
"cell_type": "code",
"metadata": {
"id": "eeAkVch7DM-B",
"colab_type": "code",
"colab": {}
},
"source": [
"def clean(df):\n",
"    '''\n",
"    Args:\n",
"        df: a pandas DataFrame\n",
"\n",
"    Returns:\n",
"        A cleaned DataFrame with only the relevant columns\n",
"    '''\n",
"    cleaned = df.rename(columns={'v2': \"email\"})\n",
"    # encode the label column: ham -> 0, spam -> 1\n",
"    cleaned['spam'] = cleaned.v1.map({'ham':0,'spam':1})\n",
"    cleaned = cleaned[['spam', 'email']]\n",
"    return cleaned\n",
"\n",
"data = clean(data)"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "l63MMWC4M9PB",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 68
},
"outputId": "07c01fe1-efa4-41fd-f087-94a04690ad39"
},
"source": [
"data.isnull().sum(axis = 0)"
],
"execution_count": 89,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"spam 0\n",
"email 0\n",
"dtype: int64"
]
},
"metadata": {
"tags": []
},
"execution_count": 89
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "GYDLGKPKD9XX",
"colab_type": "text"
},
"source": [
"The clean **data** DataFrame contains labeled data that will be used to train the model. It contains the following columns:\n",
"\n",
"\n",
"1. **spam**: 1 if an email is spam, 0 if an email is ham (not spam)\n",
"2. **email**: The text of the email\n",
"\n"
]
},
{
"cell_type": "code",
"metadata": {
"id": "D4MjKS8uITHr",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 204
},
"outputId": "6e4381f6-caad-4503-9564-7c6f8edbacf9"
},
"source": [
"data.head()"
],
"execution_count": 90,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>spam</th>\n",
" <th>email</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0</td>\n",
" <td>Go until jurong point, crazy.. Available only ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>0</td>\n",
" <td>Ok lar... Joking wif u oni...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>1</td>\n",
" <td>Free entry in 2 a wkly comp to win FA Cup fina...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>0</td>\n",
" <td>U dun say so early hor... U c already then say...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>0</td>\n",
" <td>Nah I don't think he goes to usf, he lives aro...</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" spam email\n",
"0 0 Go until jurong point, crazy.. Available only ...\n",
"1 0 Ok lar... Joking wif u oni...\n",
"2 1 Free entry in 2 a wkly comp to win FA Cup fina...\n",
"3 0 U dun say so early hor... U c already then say...\n",
"4 0 Nah I don't think he goes to usf, he lives aro..."
]
},
"metadata": {
"tags": []
},
"execution_count": 90
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "pBfOY0_JD3Bp",
"colab_type": "text"
},
"source": [
"Now let's split the data into training and test sets.\n"
]
},
{
"cell_type": "code",
"metadata": {
"id": "akeHue7HJRi6",
"colab_type": "code",
"colab": {}
},
"source": [
"from sklearn.model_selection import train_test_split\n",
"\n",
"train, test = train_test_split(data, test_size=0.1, random_state=83)"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "c5gyrbZ_iyna",
"colab_type": "text"
},
"source": [
"# Data Exploration"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "poLlZlXme2T0",
"colab_type": "text"
},
"source": [
"Let's try to identify some features that distinguish our spam emails from our ham emails."
]
},
{
"cell_type": "code",
"metadata": {
"id": "aozUilmoDuVO",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 105
},
"outputId": "e6690fc2-f5f2-434a-8f69-10eecd52c107"
},
"source": [
"spam_emails = train[train['spam']==1]['email']\n",
"print(spam_emails.iloc[0])\n",
"print(spam_emails.iloc[3])\n",
"print(spam_emails.iloc[33])\n",
"print(spam_emails.iloc[49])\n"
],
"execution_count": 92,
"outputs": [
{
"output_type": "stream",
"text": [
"Congratulations ur awarded 500 of CD vouchers or 125gift guaranteed & Free entry 2 100 wkly draw txt MUSIC to 87066 TnCs www.Ldew.com1win150ppmx3age16\n",
"Someone U know has asked our dating service 2 contact you! Cant Guess who? CALL 09058091854 NOW all will be revealed. PO BOX385 M6 6WU\n",
"Sunshine Hols. To claim ur med holiday send a stamped self address envelope to Drinks on Us UK, PO Box 113, Bray, Wicklow, Eire. Quiz Starts Saturday! Unsub Stop\n",
"Text PASS to 69669 to collect your polyphonic ringtones. Normal gprs charges apply only. Enjoy your tones\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "zp9_fhnofy1w",
"colab_type": "text"
},
"source": [
"As we can see, many of the spam emails mention 'winning' something for free or 'matching' with someone romantically, along with a specific request to 'text' or 'call' a particular number. Knowing this, we can target keywords that we think appear more frequently in spam emails. Let's define a function that will do this."
]
},
{
"cell_type": "code",
"metadata": {
"id": "HIufYVKZhG0E",
"colab_type": "code",
"colab": {}
},
"source": [
"def words_in_texts(words, texts):\n",
"    '''\n",
"    Arguments:\n",
"        words: words to find\n",
"        texts: strings to search through\n",
"\n",
"    Returns:\n",
"        NumPy array of 0s and 1s with shape (n, p)\n",
"        where n is the number of texts and p is the number of words.\n",
"    '''\n",
"    master = []\n",
"    for text in texts:\n",
"        minor = []\n",
"        for word in words:\n",
"            # case-sensitive substring match: 'win' also matches 'winner'\n",
"            if word in text:\n",
"                minor.append(1)\n",
"            else:\n",
"                minor.append(0)\n",
"        master.append(minor)\n",
"    indicator_array = np.asarray(master)\n",
"    return indicator_array"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "Y6Z9SwCO321W",
"colab_type": "text"
},
"source": [
"Let's check whether we can find words that appear in spam emails more frequently than in ham emails."
]
},
{
"cell_type": "code",
"metadata": {
"id": "tk9NHq1x4DG1",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 457
},
"outputId": "fcf81466-693b-45f0-a9c3-487f0a32c281"
},
"source": [
"def word_viz(df, words):\n",
"    '''\n",
"    Args:\n",
"        df: DataFrame containing emails\n",
"        words: a list of 5 words to check\n",
"    Returns:\n",
"        A seaborn bar plot of the proportion of spam and ham emails\n",
"        that contain each word\n",
"    '''\n",
"    # indicator matrix: entry (i, j) is 1 if word j appears in email i\n",
"    hits = words_in_texts(words, df['email'])\n",
"    # append the ham/spam label as a final column\n",
"    hits = np.c_[hits, df.spam.map({0:'ham', 1:'spam'})]\n",
"\n",
"    # gets column i in matrix m\n",
"    def column(m, i):\n",
"        return [row[i] for row in m]\n",
"\n",
"    # constructs DataFrame into catplot readable format w/ df.melt( )\n",
"    df = pd.DataFrame({\n",
"        words[0]: column(hits, 0),\n",
"        words[1]: column(hits, 1),\n",
"        words[2]: column(hits, 2),\n",
"        words[3]: column(hits, 3),\n",
"        words[4]: column(hits, 4),\n",
"        'type': column(hits, 5)\n",
"    })\n",
"    df = df.melt('type')\n",
"\n",
"    # create barplot of new DataFrame\n",
"    g = sns.catplot(x=\"variable\", y=\"value\", hue=\"type\", data=df,\n",
"                    height=6, kind=\"bar\", palette=\"muted\", ci=None,\n",
"                    legend_out=False)\n",
"    g.set_ylabels(\"Proportion of Emails\")\n",
"    g.set_xlabels(\"Words\")\n",
"    plt.title('Frequency of words in Spam/Ham Emails')\n",
"    plt.legend(prop={'size': 16}, title = None)\n",
"    g.fig.set_figwidth(10)\n",
"    return g\n",
"\n",
"words = ['win', 'txt', 'text', 'call', 'award']\n",
"word_viz(train, words);"
],
"execution_count": 121,
"outputs": [
{
"output_type": "display_data",
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 720x432 with 1 Axes>"
]
},
"metadata": {
"tags": [],
"needs_background": "light"
}
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Ph1xF53qWpt9",
"colab_type": "text"
},
"source": [
"We might also want to look at email length as an indicator of spam. Let's see if that is the case."
]
},
{
"cell_type": "code",
"metadata": {
"id": "iFmEPzvXWpDf",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 279
},
"outputId": "72d4ec6d-48fe-44cd-ee85-742646d26b27"
},
"source": [
"def email_count(df, cats):\n",
"    '''\n",
"    Args:\n",
"        df: the DataFrame from which the data will be graphed\n",
"        cats: email categories of interest\n",
"    Return:\n",
"        A visualization of the distribution of email length\n",
"        for the different categories\n",
"    '''\n",
"    copy = df.copy()\n",
"    copy['spam'] = copy.spam.map({0:'ham', 1:'spam'})\n",
"    copy['len'] = copy['email'].str.len()\n",
"    # draw one density curve per category on the same axes\n",
"    for cat in cats:\n",
"        subset = copy[copy['spam'] == cat]\n",
"\n",
"        sns.distplot(subset['len'], hist = False, kde = True,\n",
"                     kde_kws = {'linewidth': 3},\n",
"                     label = cat)\n",
"\n",
"    plt.legend(prop={'size': 16}, title = None)\n",
"    plt.xlabel('Length of Email Body')\n",
"    plt.ylabel('Distribution')\n",
"    plt.xlim(0, 500)\n",
"\n",
"email_count(train, ['ham', 'spam'])\n"
],
"execution_count": 118,
"outputs": [
{
"output_type": "display_data",
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"tags": [],
"needs_background": "light"
}
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "lZ4nisHdiodQ",
"colab_type": "text"
},
"source": [
"# Creating A Model"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "urcKUwiljGLQ",
"colab_type": "text"
},
"source": [
"For our first model, let's use the words we selected above to create two matrices:\n",
"\n",
"1. **X_train**: whose columns are the words of interest and whose rows are the emails\n",
"2. **Y_train**: whose single column is the spam label and whose rows are the emails\n",
"\n",
"\n"
]
},
{
"cell_type": "code",
"metadata": {
"id": "zUn8uXTE4EjQ",
"colab_type": "code",
"colab": {}
},
"source": [
"words = ['win', 'txt', 'text', 'call', 'award']\n",
"\n",
"X_train = words_in_texts(words, train['email']) \n",
"Y_train = train['spam']"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "uFEFg7BAkk0D",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 102
},
"outputId": "9c38944d-0a13-4a11-8758-018c8cd72327"
},
"source": [
"X_train[:5]"
],
"execution_count": 134,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"array([[0, 0, 0, 0, 0],\n",
" [0, 0, 0, 0, 0],\n",
" [0, 0, 0, 0, 0],\n",
" [0, 0, 0, 1, 0],\n",
" [0, 0, 0, 0, 0]])"
]
},
"metadata": {
"tags": []
},
"execution_count": 134
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "qVsMgccbj_Fb",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 119
},
"outputId": "92829106-8b90-4353-819c-3d4d36cbed61"
},
"source": [
"Y_train[:5]"
],
"execution_count": 132,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"3701 0\n",
"5420 0\n",
"3650 0\n",
"1151 0\n",
"2764 0\n",
"Name: spam, dtype: int64"
]
},
"metadata": {
"tags": []
},
"execution_count": 132
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "VIxA3J6ukouH",
"colab_type": "text"
},
"source": [
"Now that we have the matrices, we can use sklearn to train our model using logistic regression. "
]
},
{
"cell_type": "code",
"metadata": {
"id": "rliDSg_ZkiYr",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 34
},
"outputId": "291f39da-cb7f-4ede-b2dd-ddf24c8329a2"
},
"source": [
"from sklearn.linear_model import LogisticRegression\n",
"from sklearn import metrics\n",
"\n",
"model = LogisticRegression()\n",
"model.fit(X_train, Y_train)\n",
"\n",
"training_accuracy = metrics.accuracy_score(Y_train, model.predict(X_train))\n",
"print('Training Accuracy: ', training_accuracy)"
],
"execution_count": 139,
"outputs": [
{
"output_type": "stream",
"text": [
"Training Accuracy: 0.8960909453530116\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "1Y9Inp-goLCH",
"colab_type": "text"
},
"source": [
"Great! Our first model classifies about 90% of the training emails correctly. This is a good start, but it is accuracy on the training set only, and we can do more."
]
},
{
"cell_type": "code",
"metadata": {
"id": "9KQFj38UoSuN",
"colab_type": "code",
"colab": {}
},
"source": [
""
],
"execution_count": 0,
"outputs": []
}
]
}