Text SMS - Spam Classification Model

The base requirement of this project is to analyse the SMS dataset and come up with a machine learning model to predict or classify the SMS text as spam or ham. For my latest code and datasets, please do visit my github.com account.

The following is the list of actions we will take to solve this problem:
  1. Reading a text-based dataset into pandas
  2. Vectorizing our dataset
  3. Building and evaluating a model
  4. Comparing models
  5. Examining a model for further insight

1. Reading a text-based dataset into pandas

If you already have a good, well-defined dataset, this is the very first step in a data science job. The main job of this section is loading/reading the dataset into a pandas object for data analysis without any errors. You can download the dataset here. This dataset is really cool to analyse. Please do read the output carefully; it will help you a lot later.

import pandas as pd
import os

spam_file = 'data/sms.csv'

if not os.path.isfile(spam_file):
    print(spam_file, ' is missing.')
    exit()

# 1. Loading dataset
sms_df = pd.read_csv(spam_file, sep='\t')
# getting the shape of the dataset (number of rows, number of columns)
sms_df.shape
(5574, 2)
# getting the column names of the dataset
sms_df.columns
Index(['label', 'message'], dtype='object')
# getting the random sample values from the dataset
sms_df.sample(5)
label message
43 ham WHO ARE YOU SEEING?
1536 spam You have won a Nokia 7250i. This is what you g...
5231 ham It means u could not keep ur words.
655 ham Did u got that persons story
3461 ham I am back. Bit long cos of accident on a30. Ha...
# reading the top / head values from the dataset
sms_df.head(5)
label message
0 ham Go until jurong point, crazy.. Available only ...
1 ham Ok lar... Joking wif u oni...
2 spam Free entry in 2 a wkly comp to win FA Cup fina...
3 ham U dun say so early hor... U c already then say...
4 ham Nah I don't think he goes to usf, he lives aro...
# getting the last / tail values from the dataset
sms_df.tail(5)
label message
5569 spam This is the 2nd time we have tried 2 contact u...
5570 ham Will ü b going to esplanade fr home?
5571 ham Pity, * was in mood for that. So...any other s...
5572 ham The guy did some bitching but I acted like i'd...
5573 ham Rofl. Its true to its name
# Let us see the unique values in the label column
sms_df.label.unique()
array(['ham', 'spam'], dtype=object)
# Let us see how many spam values and ham values are in the dataset.
sms_df.label.value_counts()
ham     4827
spam     747
Name: label, dtype: int64
# converting label to a numerical value 
sms_df['label_num'] = sms_df['label'].map({'ham': 0, 'spam': 1})
# 2. Feature matrix (X), response vector (y) and train_test_split
X = sms_df['message']
y = sms_df['label_num']
# Let's check the shape of X and y
X.shape
(5574,)
y.shape
(5574,)
# split X and y into train and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=5)
for df in [X_train, X_test, y_train, y_test]:
    print(df.shape)
(4180,)
(1394,)
(4180,)
(1394,)

2. Vectorizing our dataset

CountVectorizer converts each message into a row of token counts (a "bag of words" document-term matrix). The vocabulary is learned from the training set only, and the test set is transformed with that same vocabulary.

from sklearn.feature_extraction.text import CountVectorizer
# instantiate the vectorizer
vector = CountVectorizer()
# learn training data vocabulary, then use it to create a document-term matrix
vector.fit(X_train)
X_train_dtm = vector.transform(X_train)
# equivalent to the previous two steps (fit, then transform)
X_train_dtm = vector.fit_transform(X_train)
# examine the fitted vocabulary 
# getting the last 10 features (in reverse order)
vector.get_feature_names()[:-11:-1]
['〨ud',
 'èn',
 'zouk',
 'zogtorius',
 'zoe',
 'zeros',
 'zed',
 'zealand',
 'zac',
 'yupz']
# total features/tokens/columns in the matrix
len(vector.get_feature_names())
7465
# examine the document-term matrix
X_train_dtm
<4180x7465 sparse matrix of type '<class 'numpy.int64'>'
	with 54983 stored elements in Compressed Sparse Row format>
# transform the testing dataset (using the fitted vocabulary) into a document-term matrix
X_test_dtm = vector.transform(X_test)
# examine the document-term matrix
X_test_dtm
<1394x7465 sparse matrix of type '<class 'numpy.int64'>'
	with 17831 stored elements in Compressed Sparse Row format>
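
To make the structure of a document-term matrix concrete, here is a toy sketch (the two messages below are made up for illustration, not taken from the dataset): each row is one document and each column counts one vocabulary token.

# toy example: two made-up messages become a 2 x vocabulary count matrix
toy = CountVectorizer()
toy_dtm = toy.fit_transform(['call me now', 'win a free prize now'])
pd.DataFrame(toy_dtm.toarray(), columns=toy.get_feature_names())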

3. Building and evaluating a model

We will use multinomial Naive Bayes:

The multinomial Naive Bayes classifier is suitable for classification with discrete features (e.g., word counts for text classification). The multinomial distribution normally requires integer feature counts. However, in practice, fractional counts such as tf-idf may also work.

Meaning of discrete: separate, distinct, individual, detached, unattached, disconnected, discontinuous, disjunct, disjoined.
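
The note about fractional counts is easy to try: a minimal sketch (not part of the original gist) that swaps CountVectorizer for TfidfVectorizer and feeds the fractional tf-idf values to MultinomialNB.

# tf-idf produces fractional feature values; MultinomialNB accepts them
# in practice even though it is defined over integer counts
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
tfidf = TfidfVectorizer()
nb_tfidf = MultinomialNB().fit(tfidf.fit_transform(X_train), y_train)
nb_tfidf.score(tfidf.transform(X_test), y_test)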

# import and instantiate a Multinomial Naive Bayes model
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()
# train the model using X_train_dtm and target values y_train (timing it with an IPython "magic command")
%time nb.fit(X_train_dtm, y_train)
CPU times: user 12 ms, sys: 0 ns, total: 12 ms
Wall time: 14 ms
MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
# make the class predictions for X_test_dtm
y_predict = nb.predict(X_test_dtm)
# calculate accuracy of class predictions
from sklearn import metrics
metrics.accuracy_score(y_test, y_predict)
0.98995695839311337
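
Accuracy alone is hard to judge on an imbalanced dataset, so it helps to compare against the null accuracy (always predicting the majority class, 'ham'); this quick check is an addition, not part of the original transcript.

# null accuracy: fraction of the test set in the majority class
y_test.value_counts(normalize=True).max()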
# print the confusion matrix
metrics.confusion_matrix(y_test, y_predict)
array([[1193,    2],
       [  12,  187]])
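
The matrix layout is rows = actual class (ham=0, spam=1) and columns = predicted class; unpacking it into named counts (a small addition for readability) makes the next two cells easier to follow.

# unpack into true negatives, false positives, false negatives, true positives
tn, fp, fn, tp = metrics.confusion_matrix(y_test, y_predict).ravel()
print('TN:', tn, 'FP:', fp, 'FN:', fn, 'TP:', tp)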
# FALSE POSITIVES (ham incorrectly classified as spam)
X_test[(y_predict==1) & (y_test==0)]
1290    Hey...Great deal...Farm tour 9am to 5pm $95/pa...
5046    We have sent JD for Customer Service cum Accou...
Name: message, dtype: object
# FALSE NEGATIVES (spam incorrectly classified as ham)
X_test[(y_predict==0) & (y_test==1)]
2663    Hello darling how are you today? I would love ...
1940    More people are dogging in your area now. Call...
5429    Santa Calling! Would your little ones like a c...
2402    Babe: U want me dont u baby! Im nasty and have...
869     Hello. We need some posh birds and chaps to us...
3064    Hi babe its Jordan, how r u? Im home from abro...
3530    Xmas & New Years Eve tickets are now on sale f...
1663    Hi if ur lookin 4 saucy daytime fun wiv busty ...
2430    Guess who am I?This is the first time I create...
1469    Hi its LUCY Hubby at meetins all day Fri & I w...
4676    Hi babe its Chloe, how r u? I was smashed on s...
1458    CLAIRE here am havin borin time & am now alone...
Name: message, dtype: object
X_test[1458]
'CLAIRE here am havin borin time & am now alone U wanna cum over 2nite? Chat now 09099725823 hope 2 C U Luv CLAIRE xx Calls£1/minmoremobsEMSPOBox45PO139WA'
# calculate predicted probabilities for X_test_dtm
y_pred_prob =  nb.predict_proba(X_test_dtm)[:, 1]
y_pred_prob
array([  1.30006229e-02,   9.95618101e-04,   6.15130955e-04, ...,
         9.77996362e-01,   1.41464847e-04,   1.32646379e-01])
# Calculate AUC
# note: this passes the hard class predictions; passing the predicted
# probabilities (y_pred_prob, as in the next section) is the usual practice
metrics.roc_auc_score(y_test, y_predict)
0.96901242614747374

4. Comparing models

We will compare multinomial Naive Bayes with logistic regression:

Logistic regression, despite its name, is a linear model for classification rather than regression. Logistic regression is also known in the literature as logit regression, maximum-entropy classification (MaxEnt) or the log-linear classifier. In this model, the probabilities describing the possible outcomes of a single trial are modeled using a logistic function.

# import and instantiate the logistic regression model
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
# train the model using X_train_dtm and y_train
%time logreg.fit(X_train_dtm, y_train)
CPU times: user 48 ms, sys: 0 ns, total: 48 ms
Wall time: 47.5 ms
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
# make the class predictions for X_test_dtm
y_predict = logreg.predict(X_test_dtm)
# to calculate the accuracy
metrics.accuracy_score(y_test, y_predict)
0.98206599713055953
# calculate the predicted probabilities for X_test_dtm (logistic regression probabilities are well calibrated)
y_pred_prob = logreg.predict_proba(X_test_dtm)[:, 1]
# to calculate AUC
metrics.roc_auc_score(y_test, y_pred_prob)
0.98607682765290894
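
As a sanity check (not in the original transcript), the predicted probabilities above are just the logistic function p = 1 / (1 + e^-z) applied to the model's linear scores:

import numpy as np
# predict_proba for the positive class equals the sigmoid of decision_function
z = logreg.decision_function(X_test_dtm)
np.allclose(logreg.predict_proba(X_test_dtm)[:, 1], 1 / (1 + np.exp(-z)))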

5. Examining a model for further insight

We will examine our trained Naive Bayes model to calculate the approximate "spamminess" of each token.

# store the vocabulary/tokens/feature_names/column values of X_train
X_train_tokens = vector.get_feature_names()
# total tokens/features extracted by CountVectorizer
len(X_train_tokens)
7465
# first 20 tokens 
print(X_train_tokens[:20])
['00', '000', '000pes', '008704050406', '0089', '0121', '01223585236', '01223585334', '0125698789', '02', '0207', '02073162414', '02085076972', '021', '03', '04', '0430', '05', '050703', '0578']
# last 20 tokens
print(X_train_tokens[:-21:-1])
['〨ud', 'èn', 'zouk', 'zogtorius', 'zoe', 'zeros', 'zed', 'zealand', 'zac', 'yupz', 'yup', 'yuou', 'yuo', 'yunny', 'yun', 'yummy', 'yummmm', 'ything', 'ystrday', 'yrs']
# Naive Bayes counts the number of times each token appears in each class
nb.feature_count_
array([[  0.,   0.,   1., ...,   0.,   1.,   1.],
       [  6.,  21.,   0., ...,   1.,   0.,   0.]])
# rows represent classes, columns represent tokens
nb.feature_count_.shape
(2, 7465)
# number of times each token appears across all HAM messages 
ham_token_count = nb.feature_count_[0, :]
ham_token_count 
array([ 0.,  0.,  1., ...,  0.,  1.,  1.])
# number of times each token appears across all SPAM messages 
spam_token_count = nb.feature_count_[1, :]
spam_token_count 
array([  6.,  21.,   0., ...,   1.,   0.,   0.])
# create a DataFrame of tokens with their separate ham and spam counts 
import pandas as pd
tokens = pd.DataFrame({'token': X_train_tokens, 'ham': ham_token_count, 'spam': spam_token_count}).set_index('token')
tokens.head()
ham spam
token
00 0 6
000 0 21
000pes 1 0
008704050406 0 1
0089 0 1
# examine random DataFrame rows
tokens.sample(5, random_state=6)
ham spam
token
872 0 2
tantrum 1 0
little 23 1
known 1 0
massive 2 0
# Number of samples encountered for each class during fitting. 
# This value is weighted by the sample weight when provided.
# Naive Bayes counts the number of observations in each class
nb.class_count_
array([ 3632.,   548.])

Before we can calculate the "spamminess" of each token, we need to avoid dividing by zero and account for the class imbalance. Adding 1 to every count is the same add-one (Laplace) smoothing that MultinomialNB applies internally by default (alpha=1.0).

# add 1 to ham and spam counts to avoid dividing by 0
tokens.ham += 1
tokens.spam += 1
tokens.sample(5, random_state=6)
ham spam
token
872 1 3
tantrum 2 1
little 24 2
known 2 1
massive 3 1
# convert the ham and spam counts into per-class frequencies
tokens['ham_freq'] = tokens.ham / nb.class_count_[0]
tokens['spam_freq'] = tokens.spam / nb.class_count_[1]
tokens.sample(5, random_state=6)
ham spam ham_freq spam_freq
token
872 1 3 0.000275 0.005474
tantrum 2 1 0.000551 0.001825
little 24 2 0.006608 0.003650
known 2 1 0.000551 0.001825
massive 3 1 0.000826 0.001825
tokens['spam_ratio'] = tokens.spam_freq / tokens.ham_freq
tokens.sample(5, random_state=6)
ham spam ham_freq spam_freq spam_ratio
token
872 1 3 0.000275 0.005474 19.883212
tantrum 2 1 0.000551 0.001825 3.313869
little 24 2 0.006608 0.003650 0.552311
known 2 1 0.000551 0.001825 3.313869
massive 3 1 0.000826 0.001825 2.209246
# examine the most "spammy" tokens (highest spam ratio)
tokens.sort_values('spam_ratio', ascending=False)[:20]
ham spam ham_freq spam_freq spam_ratio
token
claim 1 82 0.000275 0.149635 543.474453
prize 1 71 0.000275 0.129562 470.569343
uk 1 61 0.000275 0.111314 404.291971
150p 1 53 0.000275 0.096715 351.270073
tone 1 45 0.000275 0.082117 298.248175
16 1 40 0.000275 0.072993 265.109489
18 1 37 0.000275 0.067518 245.226277
guaranteed 1 37 0.000275 0.067518 245.226277
1000 1 34 0.000275 0.062044 225.343066
500 1 32 0.000275 0.058394 212.087591
100 1 29 0.000275 0.052920 192.204380
cs 1 27 0.000275 0.049270 178.948905
ringtone 1 26 0.000275 0.047445 172.321168
10p 1 24 0.000275 0.043796 159.065693
awarded 1 24 0.000275 0.043796 159.065693
www 3 69 0.000826 0.125912 152.437956
000 1 22 0.000275 0.040146 145.810219
5000 1 21 0.000275 0.038321 139.182482
weekly 1 21 0.000275 0.038321 139.182482
mob 1 21 0.000275 0.038321 139.182482
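
The hand-built ratio can be cross-checked against the fitted model itself. A sketch (note: feature_log_prob_ uses smoothed token-count denominators rather than the per-message counts above, so the scale differs even though the ranking is similar):

import numpy as np
# ratio of the model's smoothed per-class token probabilities,
# P(token | spam) / P(token | ham)
log_ratio = nb.feature_log_prob_[1, :] - nb.feature_log_prob_[0, :]
model_ratio = pd.Series(np.exp(log_ratio), index=X_train_tokens)
model_ratio.sort_values(ascending=False).head(10)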
# examine the least "spammy" tokens (lowest spam ratio)
tokens.sort_values('spam_ratio', ascending=True)[:20]
ham spam ham_freq spam_freq spam_ratio
token
gt 240 1 0.066079 0.001825 0.027616
lt 237 1 0.065253 0.001825 0.027965
he 176 1 0.048458 0.001825 0.037658
lor 123 1 0.033866 0.001825 0.053884
she 118 1 0.032489 0.001825 0.056167
later 116 1 0.031938 0.001825 0.057136
da 111 1 0.030562 0.001825 0.059709
ask 67 1 0.018447 0.001825 0.098921
but 332 5 0.091410 0.009124 0.099815
amp 66 1 0.018172 0.001825 0.100420
said 65 1 0.017896 0.001825 0.101965
doing 65 1 0.017896 0.001825 0.101965
home 129 2 0.035518 0.003650 0.102756
really 63 1 0.017346 0.001825 0.105202
morning 60 1 0.016520 0.001825 0.110462
come 175 3 0.048183 0.005474 0.113618
lol 57 1 0.015694 0.001825 0.116276
its 170 3 0.046806 0.005474 0.116960
anything 55 1 0.015143 0.001825 0.120504
cos 55 1 0.015143 0.001825 0.120504
test_value = input('Enter any single text words - ').lower().strip()
while test_value not in ['q', 'end', 'quit']:
    if test_value:
        try:
            print('Spam Ratio of {} is {}'.format(test_value, tokens.loc[test_value, 'spam_ratio']))
        except KeyError:
            print('Try again! The word {} is not found in the training dictionary.'.format(test_value))
    test_value = input('Enter any single text words - ').lower().strip()
Enter any single text words - money
Spam Ratio of money is 0.6762997169670787
Enter any single text words - vijay
Spam Ratio of vijay is 1.3255474452554743
Enter any single text words - anand
Spam Ratio of anand is 3.313868613138686
Enter any single text words - cool
Spam Ratio of cool is 0.7101147028154328
Enter any single text words - honey
Spam Ratio of honey is 1.1046228710462287
Enter any single text words - click
Spam Ratio of click is 13.255474452554745
Enter any single text words - won
Spam Ratio of won is 36.15129396151294
Enter any single text words - q

In my repo I have a grid search for the model. Please visit my github - SMS Spam predictor - for more info.
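
For reference, a minimal sketch of what such a grid search might look like (the pipeline and parameter grid here are illustrative assumptions, not the repo's actual configuration):

from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

pipe = Pipeline([('vect', CountVectorizer()), ('clf', MultinomialNB())])
# illustrative grid: vectorizer n-gram range and NB smoothing strength
param_grid = {'vect__ngram_range': [(1, 1), (1, 2)],
              'clf__alpha': [0.1, 0.5, 1.0]}
grid = GridSearchCV(pipe, param_grid, scoring='roc_auc', cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)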
