This is the notebook for SMS Spam Classification Using SVMs.
{
"cells": [
{
"metadata": {},
"cell_type": "markdown",
"source": "# SMS Spam Classification using SVMs:-\n\nyou can refer dataset here [SMS Spam Collection Dataset](https://gist.githubusercontent.com/rajvijen/51255cf4875372b904bdb812a3b85b28/raw/816dcd4cdc7553faea396186067e814487046c74/sms_spam_classification_data.csv). For details about dataset refer this kaggle dataset [link](https://www.kaggle.com/uciml/sms-spam-collection-dataset)."
},
{
"metadata": {},
"cell_type": "markdown",
"source": "## Required Libraries:-\nFirst of all import all required libraries."
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "import numpy as np\nimport pandas as pd\nimport matplotlib.pyplot as plt\nfrom collections import Counter\nfrom sklearn import feature_extraction, model_selection, metrics, svm\n\nfrom IPython.display import Image\nimport warnings\nwarnings.filterwarnings(\"ignore\")\n%matplotlib inline",
"execution_count": 1,
"outputs": []
},
{
"metadata": {},
"cell_type": "markdown",
"source": "## EDA:-\nObserve the dataset in tabular format."
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "data = pd.read_csv('https://gist.githubusercontent.com/rajvijen/51255cf4875372b904bdb812a3b85b28/raw/816dcd4cdc7553faea396186067e814487046c74/sms_spam_classification_data.csv', encoding='latin-1')\ndata.head(10)",
"execution_count": 2,
"outputs": [
{
"output_type": "execute_result",
"execution_count": 2,
"data": {
"text/plain": " v1 v2 Unnamed: 2 \\\n0 ham Go until jurong point, crazy.. Available only ... NaN \n1 ham Ok lar... Joking wif u oni... NaN \n2 spam Free entry in 2 a wkly comp to win FA Cup fina... NaN \n3 ham U dun say so early hor... U c already then say... NaN \n4 ham Nah I don't think he goes to usf, he lives aro... NaN \n5 spam FreeMsg Hey there darling it's been 3 week's n... NaN \n6 ham Even my brother is not like to speak with me. ... NaN \n7 ham As per your request 'Melle Melle (Oru Minnamin... NaN \n8 spam WINNER!! As a valued network customer you have... NaN \n9 spam Had your mobile 11 months or more? U R entitle... NaN \n\n Unnamed: 3 Unnamed: 4 \n0 NaN NaN \n1 NaN NaN \n2 NaN NaN \n3 NaN NaN \n4 NaN NaN \n5 NaN NaN \n6 NaN NaN \n7 NaN NaN \n8 NaN NaN \n9 NaN NaN ",
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>v1</th>\n <th>v2</th>\n <th>Unnamed: 2</th>\n <th>Unnamed: 3</th>\n <th>Unnamed: 4</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>ham</td>\n <td>Go until jurong point, crazy.. Available only ...</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n </tr>\n <tr>\n <th>1</th>\n <td>ham</td>\n <td>Ok lar... Joking wif u oni...</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n </tr>\n <tr>\n <th>2</th>\n <td>spam</td>\n <td>Free entry in 2 a wkly comp to win FA Cup fina...</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n </tr>\n <tr>\n <th>3</th>\n <td>ham</td>\n <td>U dun say so early hor... U c already then say...</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n </tr>\n <tr>\n <th>4</th>\n <td>ham</td>\n <td>Nah I don't think he goes to usf, he lives aro...</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n </tr>\n <tr>\n <th>5</th>\n <td>spam</td>\n <td>FreeMsg Hey there darling it's been 3 week's n...</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n </tr>\n <tr>\n <th>6</th>\n <td>ham</td>\n <td>Even my brother is not like to speak with me. ...</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n </tr>\n <tr>\n <th>7</th>\n <td>ham</td>\n <td>As per your request 'Melle Melle (Oru Minnamin...</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n </tr>\n <tr>\n <th>8</th>\n <td>spam</td>\n <td>WINNER!! As a valued network customer you have...</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n </tr>\n <tr>\n <th>9</th>\n <td>spam</td>\n <td>Had your mobile 11 months or more? U R entitle...</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n </tr>\n </tbody>\n</table>\n</div>"
},
"metadata": {}
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "Get some insights from data.\n### Data Visualization:-"
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "count_class = pd.value_counts(data[\"v1\"], sort = True)\ncount_class.plot(kind = 'bar', color = ['blue', 'orange'])\nplt.title('Spam vs Non-spam distribution of data')\nplt.show()",
"execution_count": 3,
"outputs": [
{
"output_type": "display_data",
"data": {
"text/plain": "<Figure size 432x288 with 1 Axes>",
"image/png": "\n"
},
"metadata": {}
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "**`Pie-plot`**"
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "count_class.plot(kind = 'pie', autopct = '% 1.0f%%')\nplt.title('Percentage distribution of data')\nplt.ylabel('')\nplt.show()",
"execution_count": 4,
"outputs": [
{
"output_type": "display_data",
"data": {
"text/plain": "<Figure size 432x288 with 1 Axes>",
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAPEAAAD7CAYAAAC7UHJvAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDMuMC4yLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvOIA7rQAAHu9JREFUeJzt3XmcU+W9x/HPbxZ2ZK0IIhwrbiBqURAVBalWbRSt+y5e16rXVm01rh0L2lCL9rbuV627Vm3Vi9G6L7ggLmCpuCJRFFEUCQzDbJnn/vGckTBkZjIzmTznJL/36zWvmclyzjcn+eacnJxFjDEopcKrxHUApVTHaImVCjktsVIhpyVWKuS0xEqFnJZYqZDTEgeYiFSIyD3+38NEpFJESnM07JtE5DL/70ki8kUuhusPbw8R+TBXw2vDeLcWkXkislpEzsni9j9M3zDLa4lFJCEia/0X49ci8jcR6ZXPDK3xM+7tOkdTxpjPjTG9jDGplm4nIlNF5JUshneGMWZaLrKJiBGREWnDnm2M2ToXw26jC4AXjTG9jTF/yeWAReQOEZmey2Hmios58YHGmF7AGGAscGlbByAiZTlPVURyNTcPoOHAe65D5J0xJm8/QALYO+3/q4HH/b/7ALcBXwFfAtOBUv+6qcCrwLXACmC6f/mpwPvAamAhMMa/fAjwD2A5sBg4J22cFcCDwF3+/d4DdvavuxtoANYClcAF/uUPAcuAJPAyMCpteAOAWcAq4E0/9ytp128DPOPn/hA4ooXpsznwkp/rGeA64B7/Og8wQFnaNPnUv+1i4FhgW6AaSPn5V/q3vQO4EXgCWAPs7V/WOB0nAV8AFwPf+s/TsWm5XgROSft/auNj9KeH8YdbCRzZOLy022/rD2OlP72npF13B3A9EPcfyxvAFi1Moyn+MFb6w9zWv/x5/3FX+zm2asv0bel5Bk4D6oBaf9iz/MujwCLWvf5+kc8+/ZDbVYmBzfwnY5r//6PAzUBPYGNgLnB62oumHvhvoAzoDhyOLftYQIAR2HfiEuBt4HKgC/Bj7It937QSVwM/B0qBPwBzmnuj8S/7L6A30BX4MzA/7boH/J8ewEhgSdoLvKf//0l+7jHYkoxqZvq8Dlzjj2dP/8WxQYn94a4CtvavG5z2gptK2ptIWlGSwO7+9OnGhiWuTxv3RGwpG4f/Is2U2P/fACPS/p+EX2KgHPgE+wbRBZjsP66t07KtAMb5j+1e4IFmps9Wfq59/OFe4A+7S6acbZm+WTzPP0yvtMsOx84wSrBvXmuAwcVQ4krsu+hnwA3YQg4CaoDuabc9Gngh7UXzeZNhPQX8KsM4dslw24uAv6WV+Nm060YCa1sqcZNh9fVftH2wbwJ1jS9I//of5sT+Ezu7yf1vBn6XYbjDsEXqmXbZfTRf4pXAoenTLFPB0l6Ad2W4rGmJ08f9IHBZpnI0HQctl3gP7NytJO36+4GKtBy3pl33c+CDZqb9ZcCDaf+XYN/IJ2XK2Zbp29Lz3HR6tfDamA8clM9OGWNw8dnyYGPMs+kXiMho7DvrVyLSeHEJdi7WKP1vsHPyRRmGPxwYIiIr0y4rBWan/b8s7e8qoJuIlBlj6psOzP/8eCX2XfdH2MVtgIHYN6CyFnIOB3ZpkqUMu9je1BDge2PMmrTLPsM+zvUYY9aIyJHAb4DbRORV4HxjzAcZhpspVyaZxj2klftkYwiwxBjTkHbZZ8Cmaf83fT6aW9k5xL8vAMaYBhFZ0mRYLeVodvq28jwnMw1QRE4AzsO+weLnHphFlpwKygqiJdg58cBMRfI13d1qCbBFM8NabIzZsp1Zmo7nGOAg7OfIBHYO/D12EX459t19KPCRf/v00i0BXjLG7JPFeL8C+olIz7QX2rAMeWxIY54CnhKR7ti5//9i53rN7ZbW2u5qmcb9H//vNdiPC402aWVY6ZYCm4lISVqRh7FuerXFUm
B04z9i3/E3w86NW9Pa9G3peYYm009EhmOn+U+B140xKRGZn3b7vAnE98TGmK+Ap4GZIrKRiJSIyBYiMrGFu90K/EZEdhJrhD9h5wKrRORCEekuIqUisp2IjM0yztfYz9GNemPfYL7DvpCvSsudAv4JVIhIDxHZBjgh7b6PA1uJyPEiUu7/jBWRbTNMg8+At4ArRKSLiEwADswUUEQGicgUEenpZ6vErtRpzD9URLpk+XjTNY57D+AA7IoesIuJh/iPcQRwcpP7NZ1m6d7Avglc4D/+Sf7jeqAd+R4EIiLyUxEpB87HPv7XWrtjFtO32efZ1/Qx9sQWezmAiJwEbNeOx9RhgSix7wTsio+F2HfAh7ErbDIyxjyEXfy5D7uC4lGgv1+sA4EdsWttv8UWvk+WOf4AXCoiK0XkN9i12J9h3+0XAnOa3P5sf9jLsIvJ92NfDBhjVgM/A47CzkWWATOwK04yOQb7mX4F8Dt/3JmUYF/AS/3bTgTO9K97HrvCcJmIfJvlY8bP9r0/zHuBM9IWz6/Frpn9GrjTvz5dBXCnP82OSL/CGFOLXaO8P/a5uAE4oZVF/4yMMR8CxwF/9Yd1IPYry9osB9HS9G3teb4NGOk/xkeNMQuBmdiVZV9jlxBebetjygXxP5CrHBGRGcAmxpgTXWdRxSFIc+JQEpFtRGR7f5F+HHZR8xHXuVTxCMqKrTDrjV2EHgJ8g13EesxpIlVUdHFaqZDTxWmlQk5LrFTIaYmVCjktsVIhpyVWKuS0xEqFnJZYqZDTEisVclpipUJOS6xUyGmJlQo5LbFSIaclVirktMRKhZyWWKmQ0xIrFXJaYqVCTkusVMhpiZUKOS2xUiGnJVYq5LTESoWcllipkNODx4eYF433AgZgT6nZ0//plfa7HHuq0CrsSc3S/64EliZikbr8J1e5pAePDzgvGh8ObANs6f9sgT0l5zCyP0lccxqwJ1BLNPlZDCxMxCLLmrmfChAtcYB40XhvYBww3v/ZBXvCa1eWYk8H+hb2LIFzErHIaod5VAZaYoe8aLwHsC/2tJ+7AiMJ9nqKFPbE4y8As4CXE7FIcyeFV3miJc4zLxofhD2vbuNZ6bu5TdQhK4Engf8DnkzEIknHeYqSljgPvGh8CPbk2AdjF5GDPLdtrzrgZeAh4P5ELLLKcZ6ioSXuJF40LsDPgDOAAyiubwLWAA8ANydikTddhyl0WuIc86LxjYH/Ak4Ffuw4ThDMA24B7tWVYp1DS5wjXjS+PRAFDgW6OI4TRGuA24BYIhb5ynWYQqIl7iAvGt8JuAyYAojjOGFQjZ0za5lzREvcTl40vh0wHbuWWbVdNXAztsy6UUkHaInbyN+CajpwDIW5ljnf1gI3AtMTscj3rsOEkZY4S140Xg6cB1wO9HAcpxAtBy4GbkvEIvqibAMtcRa8aHx34CZgO9dZisBc4IxELDLPdZCw0BK3wIvG+wMzgJPRlVb5lAL+DFyeiEWqXIcJOi1xM7xo/Gjgf3C7A0KxSwAnJmKRl10HCTItcRP+TgnXA1MdR1FWCvg9dsVXg+swQaQlTuN/bfQgsK3rLGoDLwDH6nfLG9KvSHxeNH4qdqWKFjiY9gLe9aLx/VwHCZqinxP7O+LfDBztOovKigH+BFys+zJbRV1iLxrfDLs/7CjXWVSbPQkckYhFKl0Hca1oS+zvsPAEsKnrLKrd3gEixb7ZZlF+Jvai8Z8Cs9ECh90YYI4XjRf1eoyiK7EXjR+LXRTbyHUWlRPDgde8aHyi6yCuFFWJvWj8QuBu7PGYVeHoCzzlReNHuA7iQtGU2IvGfw/E0M0nC1VX4D4vGj/MdZB8K4oVW140fhl2qx9V+OqAQxKxyOOug+RLwZfYX4SOuc6h8qoGODARizzjOkg+FHSJvWj8dOwuhKr4VAH7JWKR2a6DdLaCLbG/kuN+iuhzv9rAamDvRCwy13WQzlSQJfai8T2BZ9CjTir4DtglEYssch2ksxRcif
1NKd9G9wNW6ywEdi3Us1IU1KKmF413Ax5BC6zWNxJ4wIvGC+r13qjQHtQtwE6uQ6hA2h97lNKCUzCL0140/mvgWtc5VKAZ7HfIj7oOkksFUWIvGt8LeJriOmmZap9VwNhELPKR6yC5EvoSe9H4AOA9YJDrLCo03sSu6Eq5DpILhfCZ+Hq0wKptxmJPflcQQj0n9jd2f8h1DhVKtcC4RCzyrusgHRXaEnvR+I+wi9H6dZJqr3exn4/rXAfpiDAvTt+AFlh1zA7Yc2uFWijnxF40fiTwgOscqiDUA7slYpE3XQdpr9CV2IvGNwI+QefCKnfewn4+DlcZfGFcnL4YLbDKrZ2BY12HaK82zYlFxAMeN8Y4OcWnF40PAz4EurkYvypoS4CtE7HIWtdB2ipsc+Kr0AKrzrEZcK7rEO3Rnjnxk8ArwG7Al8BBwHHAadj9dz8BjjfGVInIHcBaYBvsoUVPAk4EdgXeMMZMzXbcXjS+E3ZLGz3Qneosq4EtE7HI166DtEV75sRbAtcbY0YBK4FDgX8aY8YaY3YA3seelLtRP2Ay9l1uFnYnhVHAaBHZsQ3j/RNaYNW5egNXuA7RVu0p8WJjzHz/77cBD9hORGaLyALsCoL0cxvNMnZ2vwD42hizwBjTgN1Qw8tmhF40fgAwqR1ZlWqrU7xofCvXIdqiPSWuSfs7hd1z6A7gbGPMaOw7WbcMt29oct8Gst/r6NJ25FSqPUqB81yHaItcrdjqDXwlIuXkeFW9F41PAnbJ5TCVasUJ/ma9oZCrEl8GvIE9ON0HORpmowtzPDylWtMdOMt1iGwFeostLxofif3srFS+LQeGh+F746B/T3yO6wCqaP0IOMF1iGwEdk7sReP9gC+AHq6zqKL1EbBN0LepDvKceCpaYOXWVsA+rkO0JsglPs51AKWwWxgGWiAXp/0v2z90nUMp7GbDgxKxyGrXQZoT1Dnx0a4DKOXrDhzuOkRLtMRKte4o1wFaErjFaS8aH4PdJlupoKgHBidikW9dB8kkiHPiY1wHUKqJMuzeeoEUxNOeHJavEa1Z+BLJ1x8EEUp79WfgAedT2qMPyx+bQd2KLwBoqF5DSbeeDDnpr1R/sZAVT9+AlJYzcMpvKe83hIbqSpY/NoONj/g9IrqnZAGbAtzsOkQmgVqc9qLxEcDH+RiXaUjxxfUnMOTkGyjt0YfvX7gdKe9K3wnr77+x4vlbKenak767H803j1xJv4lTqU9+w9rFb9N/8imseP5WeozYhW7DRucjtnKnEugfxGNUB21xenLexmQMGIOpq8EYQ0NtFaW9BjS5iaHqg1foue2eAEhJGaa+FlNfg5SUUff9V6RWf6cFLg69gHGuQ2QStMXpvJVYSsvo/7MzWXr7WZSUd6Os3xD67/PL9W5T88V7lPbsS3n/TQHoM/5wvvvXdUh5FwZGzuf7F26j7x66TUoRmQy86jpEU0GbE++VrxGZVD2V859g8NS/sOlZd9FlY4/knPVP67Rm4Us/zIUBugz6MYNPmMkmR/+B+uQySnv1B2D5YzP4dtafSK35Pl/xlRs/dR0gk8CU2IvGRwEb52t8td98CkB5v8GICD222YOaL9//4XrTkKLqo9fpsc2eG9zXGEPytb/TZ/ejWfnqffSdcAw9R+3Fqrdn5Su+cmNXLxrv7jpEU4EpMfn8PAyU9hpA3bdLSFUlAahePI/yAZv9cH11Yj7lA4ZSttHADe675j/P0X2LnSnt1gtTVwNSAiL2b1XIugATXIdoKkifiSfmc2RlvQfQZ/ejWXbvhUhpGWUb/YgBkXWHHV7z/svrLUo3aqirpvI/zzHoiGkAbDT2YJY/chVSWsbAKRfkLb9yZgL2CDaBEZivmLxo/GNghOscSrXikUQscojrEOkCsTjtf874sescSmXBySmMWhKIEmPPEBGULEq1ZAsvGg/UqYSCUpyRrgMolaUSAvZ6DUqJR7V+E6UCI1CL1FpipdpOS5yBlliFybauA6
RzXmIvGhdgmOscSrXBYNcB0jkvMTAAKHcdQqk2yNvmwdkIQok3cR1AqTYK1MnWglDiQa4DKNVG3bxofCPXIRoFocQDWr+JUoETmEXqIJS4n+sASrVDYBapg1Di/q4DKNUOOidO09N1AKXaITAn+wtCiZUKo8Dsix+EEgdjh2al2qbUdYBGQXg3aXAdoFCNlMSiQ0pnf+k6RyH61vSpgYjrGEAwSqxz4k7ykRk67ODSV5MDZdUY11kK0N+CckIIXZwuYPWUlU+suXbLKtNVz/Wce/WuAzTSEhe4NXTvPblmZt86U/qF6ywFRkucJuU6QKFbRv9BB9ReWddgZIXrLAUkMOdkCkKJ9YWVBx+aYZufUBf90hjWus5SICpdB2gUhBJ/7TpAsXilYfToi+pP+bcxuvSTA4FZ668lLjIPpCbvclPqwMCdFCyEtMRptMR5NqP+6D2fSu38ouscIbaWimRgzp4XhBJ/4zpAMTq97rxJ7zUMf8V1jpAKzFwYAlDiRCyyCqh2naMYTamdPn6Z6fem6xwhFKiv65yX2KeL1A6kKC2bXDNzZKXpttB1lpDROXEGn7oOUKyq6NZzYs21G9ea0s9dZwkRLXEGC1wHKGbf0WfgfrUzTIOR5a6zhISWOAMtsWOfmiHDj6q9dLkxrHGdJQQCteSoJVY/mGu2HXlu3ZnvGxOc7YIDaq7rAOmCUuL30B0hAuHRhgk7X1t/2BzXOQIsQUUyUF+LBqLEiVikEljsOoey/pI6ZMKjqd1e6sxxXPJcNZtdu5peV61a7/Kb3qpl9I2V7HhTJRNuX8PC5XYL0Vc/r2f7GysZ+7+VfLLCHkdiZbVh33vWYExe3/8DNReGgJTYp4vUAfLrurMnvtMw4uXOGv6BW5cx95QNj5F4zOhyFvyyF/PP6MUFu3fhvKfsJgQzX6/lH0d056rJ3bjxzVoApr1Uw8UTuiIinRUzkzfyObJsBKnE77gOoNZ3aG3FhCUNAzvlRTt+aBmDe2/48tuo67pCrqmFxn6Wl8LaeqiqM5SXwqIVDXy5uoGJXt4PThO4Egfh8DyNOnXxTbWdoaRkn9qrt3+j61kL+kjV6HyN9/q5tVwzp4baFDx/gj0y7EUTunLarGq6l8Pdv+jOb56uZtpeXfMVqVE9AZzZBGlOPAfd/DJwqunafWLNtUNrTFnevlY5a1wXFp3Tmxl7d2P6bLvovOMmpcw5pScvnNiTT79vYEjvEgxw5MNVHPfPtXxdmZfjLS6gIhm4/bEDU+JELFIDvO46h9rQSnr326f26i4pI3ndPPao7cp49IP1D6BhjGH6yzVctmdXrniphismdeW47cv5yxu1+YgUuEVpCFCJfc+6DqAy+9wMGnpo7RUrjWFV67duv4+/W3e8gvhH9WzZf/2X6J3v1hHZsox+3YWqOigR+1OVn4PlBPIjX9BK/C/XAVTz5psRW/+y7lefGEOHZ3sXPFPN0GtWU1UHQ69ZTcWL9pPUdXPrGHWD/Yrpmjm13Hlw9x/uU1VnuPPdOs4c2wWA88Z34dAH13LRc9X8cmynn6e+Goh39kjaQ/L8HVuLvGhcgGUE6GRVakOnlT7+6kVl9+0mQl6/23Hs/6hIHuQ6RCaBmhMnYhEDPOE6h2rZLakDdn8gtVenfYccUA+7DtCcQJXYd5/rAKp1F9WfOvG11MhAfkbsBLXALNchmhPEEj8HfOU6hGrdMXWX7PlpwybF8I3Cc1QkV7oO0ZzAlTgRizQAD7jOobIhsl/tjDErTO/5rpN0ssAuSkMAS+y7x3UAlZ1ayrtOrLlm82pT/rHrLJ2kHnjMdYiWBLLEiVjkHeB91zlUdlbTs8/kmpm96k3JUtdZOsGLVCS/cx2iJYEsse9e1wFU9pYycPBBtdOqGgyB/ezYTre5DtCaIJf4LgJ05jnVuvfM5iNOrvvtZ8ZQ4zpLjnwF/MN1iNYEtsSJWGQJ8JDrHKptXmj4yQ6X10+dZwx52SOhk91ERTIwZz9sTm
BL7LvadQDVdnenfjb+9tT+s13n6KBa4GbXIbIR6BInYpF52O+NVchMqz9+4gupHcK8Mcj9VCRDcVKDQJfYp3PjkDqp7oI9P2wYGsYzMBpgRms3EpGeIhIXkXdF5D8icqSIJERkhojM9X9G+Lc9UETeEJF5IvKsiAzyL68QkTtF5Gn/voeIyB9FZIGI/EtEWt2zI/AlTsQiTwH/dp1DtYdIpPaqcctNn7ddJ2mjx6hIZvMV537AUmPMDsaY7Vi3F94qY8w44Drgz/5lrwDjjTE/wW7MdEHacLYAIsBB2G0kXjDGjAbW+pe3KPAl9v3JdQDVPvWUlU+quWarKtP1A9dZ2iCW5e0WAHv7c949jDFJ//L7037v6v89FHhKRBYAvwVGpQ3nSWNMnT+8Uta9GSwAvNZChKXE9wMfug6h2mcN3XvvVTOzf50pXeI6SxaeoiKZ1RE8jDEfATthy/YHEbm88ar0m/m//wpc589hTwe6pd2mxh9eA1Bn1u0f3EAWx8ELRYkTsUg99t1LhdTX9N84UntVfYORFa6ztKAeODfbG4vIEKDKGHMPdmlxjH/VkWm/G3cQ6cO6czid2PGo64SixACJWGQWuqY61D4ym21+fN1FS42hynWWZlyX5WfhRqOBuSIyH7gEmO5f3lVE3gB+xbo3hQrgIRGZDXybo7xAwI7s0RovGt8emEeI3nzUhg4vfXHuH8tu2UmEUtdZ0iwHtqQimWz1li0QkQSwszEmp0VtSajKkIhF/g3c7jqH6piHUpPG3ZCa8prrHE1c0tECuxKqEvsuBVa7DqE65ur6o/Z4IjXuRdc5fPPI0Y4Oxhgvn3NhCGGJE7HI16z77KFC7My6X09a0OAFYfPMc6hIhnZb79CV2HcNELYNCFQGB9dO2/Ur0/9NhxEeoCL5isPxd1goS+x/5XQSdPz4x8qtFKVlk2tmjlptur/nYPTfAec7GG9OhbLEAIlYZAG6WF0Q1tK1x6SaawbVmrJEnkc9lYpk6I9GEtoS+67CnohNhdx39Bm4b22sJGVkeZ5G+T9UJB/P07g6VahLnIhFUsBxQKXrLKrjFpshw46qvexbYzr9+XyH9XdACLVQlxggEYssAv7bdQ6VG2+abbY9p+7sD4zptEMzVQJHUZEsmPUpoS8xQCIWuQO40XUOlRuzGnbbeWb94Z31MeksKpIFdXjdgiix71cE9NSTqu2uS/1iwj9TE3L9fN5NRfKuHA/TuYIpcSIWqQMOAz5znUXlxnl1Z058q2GrXJ247X3gzBwNK1BCtQNENrxofAfgVaCn6yyq44SGhpe6nDt3WMny8R0YzJfAblQkP89VriApmDlxo0Qs8i4wlfV3zFYhZSgp2af26h2Tpkd7D9H0PbBvoRYYCrDEAIlY5GHsjhKqANTQpdueNX8eVmPKF7XxrmuBKVQkXWwNljcFWWKARCxyFXZjEFUAkvTqu3ft1d1SRpZleZcUcGTYt4vORsGWGCARi1wCXOs6h8qNJWbjTQ+pvWKVMazK4uanUZEM7InBc6mgSwyQiEXOA25wnUPlxrtmxFan1527yJgWd365hIpk0Rw8ouBL7DsbPSJIwXi6YexPptcf95YxGVdezqAiWVQfo4qixIlYxACnAne7zqJy47bUz3e7LzW56XfIF1KRjDoJ5FDBfU/cEi8aF+yBwQtm4/did2/5lS/tXvreBOAMKpK3us7jQlGVuJEXjZ+BPcVGkI62qNrFVN1XfuURu017Pe46iStFWWIALxr/OfB3oJfrLKrdvgWmJGKR11u9ZQEr2hIDeNH4GOBxYLDrLKrNPgb293dFLWpFsWKrOYlY5B1gPDDfdRbVJv8AxmqBraKeEzfyovFu2FNQnu46i2pRLXB+Iha5znWQINESp/Gi8WOAm4DerrOoDXwKHJGIRfRQxU0U9eJ0U4lY5D7gJ+jB94LmYWCMFjgznRNn4EXjZcBl2DPd6ddQ7iSBCxKxyC2ugwSZlrgF/gEGbmTd2d
5V/vwd+HUiFsl2r6WipSVuhb+V18nYLb0GOI5TDBYDZyZikX+5DhIWWuIsedH4AGyRTwbEcZxCVA/MBK5IxCJrXYcJEy1xG3nR+HjsJps7uc5SQB4FLk3EIgV9BI7OoiVuJy8aPwi4HBjjOkuIPQFcrmudO0ZL3EFeNH4g8Dt0ztwWzwKXJWIR/SovB7TEOeJF4wdgy7yz6ywB9hwwLRGL6EH+c0hLnGNeNL4n9gAEhwHdHMcJgjXAXcB1iVhkoeswhUhL3Em8aLwfcDy20Ns5juPCPOBW4N5ELJJ0HaaQaYnzwF+jfSpwKNDHcZzOtBi7pvkefw8xlQda4jzyovFyYDLwC2AKhbEf83xscR/1z76h8kxL7Ii/JdiOwH7A/sAuQBenobKzEruDyFPY4ibcxlFa4oDwovGu2FKPA8b6v7fC7dZhDcBC4HX/Zw7wgX/0UBUQWuIA86LxPthCbw94wPC0n745HNUK7P66i/zfnwKfAO8kYpFszragHNISh5Rf8OHAZtiD/fUAumf4bbAnFmv8WYVdJF4JLAcWJ2KRlfnOr3JHS6xUyOmRPZQKOS2xUiGnJVYq5LTESoWcllipkNMSKxVyWmKlQk5LrFTIaYmVCjktsVIhpyVWKuS0xEqFnJZYqZDTEisVclpipUJOS6xUyGmJlQo5LbFSIaclVirktMRKhZyWWKmQ0xIrFXJaYqVCTkusVMhpiZUKOS2xUiGnJVYq5LTESoWcllipkNMSKxVyWmKlQu7/AbnBmmHSc0bRAAAAAElFTkSuQmCC\n"
},
"metadata": {}
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "we need to know a little bit more about text data\n### Text Analytics:-\nlet's look at frequencies of words in spam and non-spam(ham) messages and plot that out."
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "count1 = Counter(\" \".join(data[data['v1']=='ham'][\"v2\"]).split()).most_common(20)\ndf1 = pd.DataFrame.from_dict(count1)\ndf1 = df1.rename(columns = {0: \"non-spam words\", 1 : \"count\"})\n\ncount2 = Counter(\" \".join(data[data['v1']=='spam'][\"v2\"]).split()).most_common(20)\ndf2 = pd.DataFrame.from_dict(count2)\ndf2 = df2.rename(columns={0: \"spam words\", 1 : \"count_\"})",
"execution_count": 5,
"outputs": []
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "#Plots:-\ndf1.plot.bar(legend = False)\ny_pos = np.arange(len(df1[\"non-spam words\"]))\nplt.xticks(y_pos, df1[\"non-spam words\"])\nplt.title('More frequent words in non-spam messages')\nplt.xlabel('words')\nplt.ylabel('number')\nplt.show()",
"execution_count": 6,
"outputs": [
{
"output_type": "display_data",
"data": {
"text/plain": "<Figure size 432x288 with 1 Axes>",
"image/png": "\n"
},
"metadata": {}
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "and for `spam words in messages`"
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "df2.plot.bar(legend = False, color = 'orange')\ny_pos = np.arange(len(df2[\"spam words\"]))\nplt.xticks(y_pos, df2[\"spam words\"])\nplt.title('More frequent words in spam messages')\nplt.xlabel('words')\nplt.ylabel('number')\nplt.show()",
"execution_count": 7,
"outputs": [
{
"output_type": "display_data",
"data": {
"text/plain": "<Figure size 432x288 with 1 Axes>",
"image/png": "\n"
},
"metadata": {}
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "By visualizing and observation we can say that most of frequest words in both classes are [`stop words`](https://en.wikipedia.org/wiki/Stop_words) such as 'to', 'a', 'or'.\nFor better accuracy in model it's better to remove stop words. \n\n### Feature Engineering:-\nThe features in our data are important to the [predictive models](https://en.wikipedia.org/wiki/Predictive_modelling) we use and will influence the results we are going to achieve. The quality and quantity of the features will have great influence on whether the model is good or not.\nso, first or most remove stop words."
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "f = feature_extraction.text.CountVectorizer(stop_words = 'english')\nX = f.fit_transform(data[\"v2\"])\nnp.shape(X)",
"execution_count": 8,
"outputs": [
{
"output_type": "execute_result",
"execution_count": 8,
"data": {
"text/plain": "(5572, 8409)"
},
"metadata": {}
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "So, finally our goal is to detect spam words.\n**Predictive Modelling**:-\n\nFirst of all transform the categorical variables(spam/non-spam) into binary variable(1/0) by using *[label encoding](https://medium.com/@contactsunny/label-encoder-vs-one-hot-encoder-in-machine-learning-3fc273365621)*.\n\nNow, split the data into train and test set."
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "data[\"v1\"] = data[\"v1\"].map({'spam':1, 'ham':0})",
"execution_count": 9,
"outputs": []
},
{
"metadata": {},
"cell_type": "markdown",
"source": "Now, split the data into train and test set."
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "X_train, X_test, y_train, y_test = model_selection.train_test_split(X, data['v1'], test_size=0.33, random_state=42)\nprint([np.shape(X_train), np.shape(X_test)])",
"execution_count": 10,
"outputs": [
{
"output_type": "stream",
"text": "[(3733, 8409), (1839, 8409)]\n",
"name": "stdout"
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "So, now use sci-kit learn's in-built `SVC`(support vector classifier) with `gaussian kernel` for predictive modelling.\nWe train the model by tuning `regularization` parameter C, and evaluate the accuracy, recall and precision of the model with the test set."
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "# make a list of parameter's to tune for training\nlist_C = np.arange(500, 2000, 100) #100000\n# zeros initialization\nscore_train = np.zeros(len(list_C))\nscore_test = np.zeros(len(list_C))\nrecall_test = np.zeros(len(list_C))\nprecision_test= np.zeros(len(list_C))\n\ncount = 0\nfor C in list_C:\n # Create a classifier: a support vector classifier\n clf = svm.SVC(C = C)#, kernel=’rbf’, degree=3, gamma=’auto_deprecated’)\n \n # learn the texts\n clf.fit(X_train, y_train)\n score_train[count] = clf.score(X_train, y_train)\n score_test[count]= clf.score(X_test, y_test)\n recall_test[count] = metrics.recall_score(y_test, clf.predict(X_test))\n precision_test[count] = metrics.precision_score(y_test, clf.predict(X_test))\n count = count + 1 ",
"execution_count": 11,
"outputs": []
},
{
"metadata": {},
"cell_type": "markdown",
"source": "**Acuraccy metrices**\n![confusion_matrix_01](../images/confusion_matrix_1.png)\nLet's look at accuracy metrics with parameter C."
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "matrix = np.matrix(np.c_[list_C, score_train, score_test, recall_test, precision_test])\nmodels = pd.DataFrame(data = matrix, columns = \n ['C', 'Train Accuracy', 'Test Accuracy', 'Test Recall', 'Test Precision'])\nmodels.head(10)",
"execution_count": 12,
"outputs": [
{
"output_type": "execute_result",
"execution_count": 12,
"data": {
"text/plain": " C Train Accuracy Test Accuracy Test Recall Test Precision\n0 500.0 0.994910 0.982599 0.873016 1.0\n1 600.0 0.995714 0.982599 0.873016 1.0\n2 700.0 0.996785 0.982599 0.873016 1.0\n3 800.0 0.997053 0.982599 0.873016 1.0\n4 900.0 0.997589 0.983143 0.876984 1.0\n5 1000.0 0.998125 0.983143 0.876984 1.0\n6 1100.0 0.998928 0.983143 0.876984 1.0\n7 1200.0 0.999732 0.983143 0.876984 1.0\n8 1300.0 1.000000 0.983143 0.876984 1.0\n9 1400.0 1.000000 0.983143 0.876984 1.0",
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>C</th>\n <th>Train Accuracy</th>\n <th>Test Accuracy</th>\n <th>Test Recall</th>\n <th>Test Precision</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>500.0</td>\n <td>0.994910</td>\n <td>0.982599</td>\n <td>0.873016</td>\n <td>1.0</td>\n </tr>\n <tr>\n <th>1</th>\n <td>600.0</td>\n <td>0.995714</td>\n <td>0.982599</td>\n <td>0.873016</td>\n <td>1.0</td>\n </tr>\n <tr>\n <th>2</th>\n <td>700.0</td>\n <td>0.996785</td>\n <td>0.982599</td>\n <td>0.873016</td>\n <td>1.0</td>\n </tr>\n <tr>\n <th>3</th>\n <td>800.0</td>\n <td>0.997053</td>\n <td>0.982599</td>\n <td>0.873016</td>\n <td>1.0</td>\n </tr>\n <tr>\n <th>4</th>\n <td>900.0</td>\n <td>0.997589</td>\n <td>0.983143</td>\n <td>0.876984</td>\n <td>1.0</td>\n </tr>\n <tr>\n <th>5</th>\n <td>1000.0</td>\n <td>0.998125</td>\n <td>0.983143</td>\n <td>0.876984</td>\n <td>1.0</td>\n </tr>\n <tr>\n <th>6</th>\n <td>1100.0</td>\n <td>0.998928</td>\n <td>0.983143</td>\n <td>0.876984</td>\n <td>1.0</td>\n </tr>\n <tr>\n <th>7</th>\n <td>1200.0</td>\n <td>0.999732</td>\n <td>0.983143</td>\n <td>0.876984</td>\n <td>1.0</td>\n </tr>\n <tr>\n <th>8</th>\n <td>1300.0</td>\n <td>1.000000</td>\n <td>0.983143</td>\n <td>0.876984</td>\n <td>1.0</td>\n </tr>\n <tr>\n <th>9</th>\n <td>1400.0</td>\n <td>1.000000</td>\n <td>0.983143</td>\n <td>0.876984</td>\n <td>1.0</td>\n </tr>\n </tbody>\n</table>\n</div>"
},
"metadata": {}
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "Check the model with the most test precision."
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "best_index = models['Test Precision'].idxmax()\nmodels.iloc[best_index, :]",
"execution_count": 13,
"outputs": [
{
"output_type": "execute_result",
"execution_count": 13,
"data": {
"text/plain": "C 500.000000\nTrain Accuracy 0.994910\nTest Accuracy 0.982599\nTest Recall 0.873016\nTest Precision 1.000000\nName: 0, dtype: float64"
},
"metadata": {}
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "This model doesn't produce any `false-positive`, which is expected.\nLet's check if there is more than one model with 100% precision."
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "models[models['Test Precision']==1].head(5)",
"execution_count": 14,
"outputs": [
{
"output_type": "execute_result",
"execution_count": 14,
"data": {
"text/plain": " C Train Accuracy Test Accuracy Test Recall Test Precision\n0 500.0 0.994910 0.982599 0.873016 1.0\n1 600.0 0.995714 0.982599 0.873016 1.0\n2 700.0 0.996785 0.982599 0.873016 1.0\n3 800.0 0.997053 0.982599 0.873016 1.0\n4 900.0 0.997589 0.983143 0.876984 1.0",
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>C</th>\n <th>Train Accuracy</th>\n <th>Test Accuracy</th>\n <th>Test Recall</th>\n <th>Test Precision</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>500.0</td>\n <td>0.994910</td>\n <td>0.982599</td>\n <td>0.873016</td>\n <td>1.0</td>\n </tr>\n <tr>\n <th>1</th>\n <td>600.0</td>\n <td>0.995714</td>\n <td>0.982599</td>\n <td>0.873016</td>\n <td>1.0</td>\n </tr>\n <tr>\n <th>2</th>\n <td>700.0</td>\n <td>0.996785</td>\n <td>0.982599</td>\n <td>0.873016</td>\n <td>1.0</td>\n </tr>\n <tr>\n <th>3</th>\n <td>800.0</td>\n <td>0.997053</td>\n <td>0.982599</td>\n <td>0.873016</td>\n <td>1.0</td>\n </tr>\n <tr>\n <th>4</th>\n <td>900.0</td>\n <td>0.997589</td>\n <td>0.983143</td>\n <td>0.876984</td>\n <td>1.0</td>\n </tr>\n </tbody>\n</table>\n</div>"
},
"metadata": {}
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "Among these models with the highest possible precision, we are going to select which has more test accuracy."
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "best_index = models[models['Test Precision']==1]['Test Accuracy'].idxmax()\n\n# check with the best parameter(C) value \nclf = svm.SVC(C=list_C[best_index])\nclf.fit(X_train, y_train)\nmodels.iloc[best_index, :]",
"execution_count": 15,
"outputs": [
{
"output_type": "execute_result",
"execution_count": 15,
"data": {
"text/plain": "C 900.000000\nTrain Accuracy 0.997589\nTest Accuracy 0.983143\nTest Recall 0.876984\nTest Precision 1.000000\nName: 4, dtype: float64"
},
"metadata": {}
}
]
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "confusion_matrix_test = metrics.confusion_matrix(y_test, clf.predict(X_test))\npd.DataFrame(data = confusion_matrix_test, columns = ['Predicted, non-spam(0)', 'Predicted, spam(1)'], index = ['Actual, non-spam(0)', 'Actual, spam(1)'])",
"execution_count": 16,
"outputs": [
{
"output_type": "execute_result",
"execution_count": 16,
"data": {
"text/plain": " Predicted, non-spam(0) Predicted, spam(1)\nActual, non-spam(0) 1587 0\nActual, spam(1) 31 221",
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>Predicted, non-spam(0)</th>\n <th>Predicted, spam(1)</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>Actual, non-spam(0)</th>\n <td>1587</td>\n <td>0</td>\n </tr>\n <tr>\n <th>Actual, spam(1)</th>\n <td>31</td>\n <td>221</td>\n </tr>\n </tbody>\n</table>\n</div>"
},
"metadata": {}
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "We misclassify 31 spam messages as non-spam messages whereas we don't misclassify any non-spam message.\n\n### Results:-\nWe got 98.3143% accuracy, which is quite well with SVM classifier.\n\nIt classifies every non-spam message correctly (Model precision)\n\nIt classifies the 87.7% of spam messages correctly (Model recall)"
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "",
"execution_count": null,
"outputs": []
}
],
"metadata": {
"kernelspec": {
"name": "python3",
"display_name": "Python 3",
"language": "python"
},
"toc": {
"nav_menu": {},
"number_sections": true,
"sideBar": true,
"skip_h1_title": false,
"base_numbering": 1,
"title_cell": "Table of Contents",
"title_sidebar": "Contents",
"toc_cell": false,
"toc_position": {},
"toc_section_display": true,
"toc_window_display": false
},
"language_info": {
"name": "python",
"version": "3.6.8",
"mimetype": "text/x-python",
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"pygments_lexer": "ipython3",
"nbconvert_exporter": "python",
"file_extension": ".py"
},
"gist": {
"id": "",
"data": {
"description": "This is the notebook for SMS Spam Classification Using SVMs.",
"public": true
}
}
},
"nbformat": 4,
"nbformat_minor": 2
}
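
As a sanity check on the notebook's reported results, the accuracy, precision and recall can be recomputed by hand from the test-set confusion matrix (1587, 0, 31, 221). This is a minimal sketch in plain Python; the variable names are illustrative assumptions and do not appear in the notebook, and only the four counts are taken from its output.

```python
# Counts copied from the notebook's test-set confusion matrix
# (rows = actual class, columns = predicted class).
tn, fp = 1587, 0    # actual non-spam: predicted non-spam / predicted spam
fn, tp = 31, 221    # actual spam:     predicted non-spam / predicted spam

total = tn + fp + fn + tp

# accuracy: share of all test messages classified correctly
accuracy = (tp + tn) / total
# precision: share of messages flagged as spam that really are spam
precision = tp / (tp + fp)
# recall: share of actual spam messages that were caught
recall = tp / (tp + fn)

print(f"total={total} accuracy={accuracy:.6f} "
      f"precision={precision:.6f} recall={recall:.6f}")
```

The total of 1839 matches the size of the test split. The computed accuracy and recall agree with the `Test Accuracy` (0.983143) and `Test Recall` (0.876984) values in the row for C = 900, and precision is 1.0 because there are no false positives.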