vijayanandrp/spam_predict_1.md

## spam_predict_1.md

      
            
          
    Text SMS - Spam Classification Model

The goal of this project is to analyse an SMS dataset and build machine learning models that classify each SMS text as spam or ham. For my latest code and datasets, please visit my github.com account.
We will work through the following steps to solve this problem:


Reading a text-based dataset into pandas
Vectorizing our dataset
Building and evaluating a model
Comparing models
Examining a model for further insight

1. Reading a text-based dataset into pandas

If you already have a clean, well-defined dataset, loading it is the very first step of a data science job. The goal of this section is to read the dataset into a pandas object without errors so that it is ready for analysis. You can download the dataset here. This dataset is really cool to analyse. Please read the output carefully; it will help you a lot later on.
import pandas as pd
import os

spam_file = 'data/sms.csv'

if not os.path.isfile(spam_file):
    print(spam_file, ' is missing.')
    exit()

# 1. Loading dataset
sms_df = pd.read_csv(spam_file, sep='\t')
# getting the shape of dataset 
# means getting number of rows * columns values
sms_df.shape
(5574, 2)

# getting the column names of dataset
sms_df.columns
Index(['label', 'message'], dtype='object')

# getting the random sample values from the dataset
sms_df.sample(5)


|      | label | message |
|------|-------|---------|
| 43   | ham   | WHO ARE YOU SEEING? |
| 1536 | spam  | You have won a Nokia 7250i. This is what you g... |
| 5231 | ham   | It means u could not keep ur words. |
| 655  | ham   | Did u got that persons story |
| 3461 | ham   | I am back. Bit long cos of accident on a30. Ha... |

# reading the top / head values from the dataset
sms_df.head(5)


|   | label | message |
|---|-------|---------|
| 0 | ham   | Go until jurong point, crazy.. Available only ... |
| 1 | ham   | Ok lar... Joking wif u oni... |
| 2 | spam  | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham   | U dun say so early hor... U c already then say... |
| 4 | ham   | Nah I don't think he goes to usf, he lives aro... |

# getting the last / tail values from the dataset
sms_df.tail(5)


|      | label | message |
|------|-------|---------|
| 5569 | spam  | This is the 2nd time we have tried 2 contact u... |
| 5570 | ham   | Will ü b going to esplanade fr home? |
| 5571 | ham   | Pity, * was in mood for that. So...any other s... |
| 5572 | ham   | The guy did some bitching but I acted like i'd... |
| 5573 | ham   | Rofl. Its true to its name |

# Let us see what the unique values in the label column
sms_df.label.unique()
array(['ham', 'spam'], dtype=object)

# Let us see how many spam values and ham values are in the dataset.
sms_df.label.value_counts()
ham     4827
spam     747
Name: label, dtype: int64
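
The counts above show a clear class imbalance, which is worth keeping in mind when reading the accuracy figures later on. As a minimal sketch in plain Python, using only the two counts printed above (the variable names are illustrative), the share of spam and the accuracy of a trivial classifier that always answers ham can be worked out like this:

```python
# Class counts copied from the value_counts() output above
ham_count = 4827
spam_count = 747
total = ham_count + spam_count

# Share of messages that are spam
spam_share = spam_count / total

# Accuracy of a baseline that labels every message as ham
baseline_accuracy = ham_count / total

print("spam share: {:.3f}".format(spam_share))
print("always-ham baseline accuracy: {:.3f}".format(baseline_accuracy))
```

Any model worth keeping should clearly beat the always-ham baseline of roughly 87%.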

# converting label to a numerical value 
sms_df['label_num'] = sms_df['label'].map({'ham': 0, 'spam': 1})
# 2. Feature matrix (X), response vector (y) and train_test_split
X = sms_df['message']
y = sms_df['label_num']
# Let's check the shape of X and y
X.shape
(5574,)

y.shape
(5574,)

# split X and y into train and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=5)
for df in [X_train, X_test, y_train, y_test]:
    print(df.shape)
(4180,)
(1394,)
(4180,)
(1394,)
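
The 4180 / 1394 split above was not specified explicitly. A small sketch of where these numbers probably come from, assuming scikit-learn's documented default of `test_size=0.25` and that the test size is rounded up:

```python
import math

n_samples = 5574   # rows in the dataset
test_size = 0.25   # assumed train_test_split default when no size is given

# the number of test rows is rounded up, the remainder goes to training
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```

To use a different proportion, pass `test_size` to `train_test_split` explicitly.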

2. Vectorizing our dataset

from sklearn.feature_extraction.text import CountVectorizer
# instantiate the vectorizer
vector = CountVectorizer()
# learn training data vocabulary, then use it to create a document-term matrix
vector.fit(X_train)
X_train_dtm = vector.transform(X_train)
# equivalent step for previous cell
X_train_dtm = vector.fit_transform(X_train)
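# examine the fitted vocabulary
# getting the last 10 features
# (note: get_feature_names() was removed in scikit-learn 1.2; on newer versions use get_feature_names_out())
vector.get_feature_names()[:-11:-1]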
['〨ud',
 'èn',
 'zouk',
 'zogtorius',
 'zoe',
 'zeros',
 'zed',
 'zealand',
 'zac',
 'yupz']

# total features/tokens/columns in the matrix
len(vector.get_feature_names())
7465

# examine the document matrix
X_train_dtm
<4180x7465 sparse matrix of type '<class 'numpy.int64'>'
	with 54983 stored elements in Compressed Sparse Row format>

# transform testing dataset (using fitted vocabulary) in to a document-term matrix
X_test_dtm = vector.transform(X_test)
# examine the document matrix 
X_test_dtm
<1394x7465 sparse matrix of type '<class 'numpy.int64'>'
	with 17831 stored elements in Compressed Sparse Row format>
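
To make the fit / transform steps above more concrete, here is a rough plain-Python sketch of the same idea on made-up messages. It is not how `CountVectorizer` is implemented; it only mimics its default behaviour of lowercasing and keeping tokens of two or more word characters. The helper names `tokenize`, `fit_vocabulary` and `transform` are hypothetical.

```python
import re
from collections import Counter

# Same default token pattern as CountVectorizer: words of 2+ characters
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(text):
    """Lowercase the text and split it into tokens."""
    return TOKEN_PATTERN.findall(text.lower())

def fit_vocabulary(documents):
    """Learn a sorted token -> column index mapping from the training documents."""
    tokens = sorted({tok for doc in documents for tok in tokenize(doc)})
    return {tok: idx for idx, tok in enumerate(tokens)}

def transform(documents, vocabulary):
    """Build a dense document-term matrix; tokens outside the vocabulary are ignored."""
    rows = []
    for doc in documents:
        counts = Counter(tok for tok in tokenize(doc) if tok in vocabulary)
        rows.append([counts.get(tok, 0) for tok in vocabulary])
    return rows

# Made-up messages, for illustration only
train_docs = ["Free entry to win a prize", "Ok lar joking wif u", "win win prize now"]
test_docs = ["win a free holiday"]

vocabulary = fit_vocabulary(train_docs)         # learned from training data only
train_dtm = transform(train_docs, vocabulary)
test_dtm = transform(test_docs, vocabulary)     # "holiday" is unseen, so it is dropped

print(list(vocabulary))
print(test_dtm)
```

The important point is that the test messages are transformed with the vocabulary learned from the training messages, which is why both matrices above have the same 7465 columns.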

3. Building and evaluating a model

We will use multinomial Naive Bayes:

The multinomial Naive Bayes classifier is suitable for classification with discrete features  (e.g., word counts for text classification). The multinomial distribution normally requires integer feature counts. However, in practice, fractional counts such as tf-idf may also work.

Meaning of discrete: separate, distinct, individual, detached, unattached, disconnected, discontinuous, disjunct, disjoined.
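
Before handing the work to scikit-learn, it may help to see the idea on a tiny example. The following is a minimal sketch of multinomial Naive Bayes with add-one (alpha = 1) smoothing, written in plain Python on made-up messages. The data and the helper names `log_posterior` and `predict` are assumptions for illustration, not part of the original notebook.

```python
import math
from collections import Counter

# Made-up training messages: 1 = spam, 0 = ham
train = [
    ("win prize now", 1),
    ("claim your prize", 1),
    ("see you later", 0),
    ("call me later", 0),
    ("are you home", 0),
]

alpha = 1.0  # smoothing, same default value as MultinomialNB
vocab = sorted({word for text, _ in train for word in text.split()})

# Number of messages per class and token counts per class
class_docs = Counter(label for _, label in train)
token_counts = {0: Counter(), 1: Counter()}
for text, label in train:
    token_counts[label].update(text.split())

def log_posterior(text, label):
    """Unnormalised log P(label | text) = log prior + sum of smoothed log likelihoods."""
    total_tokens = sum(token_counts[label].values())
    score = math.log(class_docs[label] / len(train))
    for word in text.split():
        if word in vocab:  # words never seen in training are ignored
            likelihood = (token_counts[label][word] + alpha) / (total_tokens + alpha * len(vocab))
            score += math.log(likelihood)
    return score

def predict(text):
    """Return the class with the higher log posterior."""
    return max((0, 1), key=lambda label: log_posterior(text, label))

print(predict("win a prize"))     # words that are common in the spam messages
print(predict("see you later"))   # words that are common in the ham messages
```

Working in log space avoids multiplying many small probabilities together.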
# import and instantiate a Multinomial Naive Bayes model
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()
# train the  model using X_train_dtm and target values y_train (timing it with an IPython "magic command")
%time nb.fit(X_train_dtm, y_train)
CPU times: user 12 ms, sys: 0 ns, total: 12 ms
Wall time: 14 ms


MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

# make the class predictions for X_test_dtm
y_predict = nb.predict(X_test_dtm)
# calculate accuracy of class predictions
from sklearn import metrics
metrics.accuracy_score(y_test, y_predict)
0.98995695839311337

# print the confusion matrix
metrics.confusion_matrix(y_test, y_predict)
array([[1193,    2],
       [  12,  187]])
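
The confusion matrix above can be read as `[[TN, FP], [FN, TP]]` for labels 0 (ham) and 1 (spam). As a small sketch in plain Python, the accuracy reported earlier, along with precision and recall for the spam class, can be recomputed from those four numbers:

```python
# Values copied from the confusion matrix above
tn, fp = 1193, 2    # true ham: predicted ham, predicted spam
fn, tp = 12, 187    # true spam: predicted ham, predicted spam

accuracy = (tp + tn) / (tn + fp + fn + tp)
precision = tp / (tp + fp)   # of the messages flagged as spam, how many really are spam
recall = tp / (tp + fn)      # of the real spam messages, how many were caught

print("accuracy:  {:.4f}".format(accuracy))
print("precision: {:.4f}".format(precision))
print("recall:    {:.4f}".format(recall))
```

Recall is noticeably lower than precision here: the model rarely flags ham as spam, but it lets some spam through.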

# FALSE POSITIVES (ham incorrectly classified as spam)
X_test[(y_predict==1) & (y_test==0)]
1290    Hey...Great deal...Farm tour 9am to 5pm $95/pa...
5046    We have sent JD for Customer Service cum Accou...
Name: message, dtype: object

# FALSE NEGATIVES (spam incorrectly classified as ham)
X_test[(y_predict==0) & (y_test==1)]
2663    Hello darling how are you today? I would love ...
1940    More people are dogging in your area now. Call...
5429    Santa Calling! Would your little ones like a c...
2402    Babe: U want me dont u baby! Im nasty and have...
869     Hello. We need some posh birds and chaps to us...
3064    Hi babe its Jordan, how r u? Im home from abro...
3530    Xmas & New Years Eve tickets are now on sale f...
1663    Hi if ur lookin 4 saucy daytime fun wiv busty ...
2430    Guess who am I?This is the first time I create...
1469    Hi its LUCY Hubby at meetins all day Fri & I w...
4676    Hi babe its Chloe, how r u? I was smashed on s...
1458    CLAIRE here am havin borin time & am now alone...
Name: message, dtype: object

X_test[1458]
'CLAIRE here am havin borin time & am now alone U wanna cum over 2nite? Chat now 09099725823 hope 2 C U Luv CLAIRE xx Calls£1/minmoremobsEMSPOBox45PO139WA'

# calculate predicted probabilities for X_test_dtm
y_pred_prob =  nb.predict_proba(X_test_dtm)[:, 1]
y_pred_prob
array([  1.30006229e-02,   9.95618101e-04,   6.15130955e-04, ...,
         9.77996362e-01,   1.41464847e-04,   1.32646379e-01])

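# Calculate AUC
# NOTE: this call passes the hard class predictions (y_predict), so the score below is the AUC
# of the 0/1 predictions; pass y_pred_prob instead to score the predicted probabilities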
metrics.roc_auc_score(y_test, y_predict)
0.96901242614747374
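
The call above passes the 0/1 class predictions (`y_predict`) to `roc_auc_score`, not the predicted probabilities. In that case the AUC reduces to the average of the true positive rate and the true negative rate, so it can be reproduced from the confusion matrix alone. The sketch below does that, and then uses a made-up four-message example with a hypothetical helper, `auc_from_scores`, to show that scoring probabilities can give a different answer from scoring hard labels.

```python
def auc_from_scores(y_true, scores):
    """AUC as the share of (spam, ham) pairs ranked correctly; ties count as half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# 1. Reproduce the value above from the Naive Bayes confusion matrix
tn, fp, fn, tp = 1193, 2, 12, 187
tpr = tp / (tp + fn)            # true positive rate
tnr = tn / (tn + fp)            # true negative rate
auc_hard = (tpr + tnr) / 2
print(auc_hard)

# 2. Made-up example: probabilities vs. labels thresholded at 0.5
y_true = [0, 0, 1, 1]
probabilities = [0.1, 0.3, 0.4, 0.8]
hard_labels = [1 if p >= 0.5 else 0 for p in probabilities]

auc_prob = auc_from_scores(y_true, probabilities)
auc_labels = auc_from_scores(y_true, hard_labels)
print(auc_prob, auc_labels)
```

For a like-for-like comparison with another model, both AUC values should be computed from predicted probabilities.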

4. Comparing models

We will compare multinomial Naive Bayes with logistic regression:

Logistic regression, despite its name, is a linear model for classification rather than regression. Logistic regression is also known in the literature as logit regression, maximum-entropy classification (MaxEnt) or the log-linear classifier. In this model, the probabilities describing the possible outcomes of a single trial are modeled using a logistic function.
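
As a rough illustration of the last sentence, the sketch below passes a weighted sum of token counts through the logistic function to get a spam probability. The weights and bias are made-up numbers, not values learned from the SMS dataset, and `spam_probability` is a hypothetical helper.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical per-token weights and bias, for illustration only
weights = {"prize": 2.0, "claim": 1.5, "later": -1.0}
bias = -1.0

def spam_probability(token_counts):
    """P(spam) = sigmoid(bias + sum of weight * count over the tokens)."""
    z = bias + sum(weights.get(tok, 0.0) * cnt for tok, cnt in token_counts.items())
    return sigmoid(z)

p_spam_like = spam_probability({"claim": 1, "prize": 1})   # z = -1 + 1.5 + 2.0 = 2.5
p_ham_like = spam_probability({"later": 2})                # z = -1 - 2.0 = -3.0

print(p_spam_like, p_ham_like)
```

Training a logistic regression model amounts to finding the weights and bias that make these probabilities match the labels as closely as possible.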

# import and instantiate the logistic regression model
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
# train the model using X_train_dtm
%time logreg.fit(X_train_dtm, y_train)
CPU times: user 48 ms, sys: 0 ns, total: 48 ms
Wall time: 47.5 ms


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

# to make the class predictions of X_test_dtm
y_predict = logreg.predict(X_test_dtm)
# to calculate the accuracy
metrics.accuracy_score(y_test, y_predict)
0.98206599713055953

# to calculate the predicted probabilities for X_test_dtm (well calibrated)
y_pred_prob = logreg.predict_proba(X_test_dtm)[:, 1]
# to calculate AUC
metrics.roc_auc_score(y_test, y_pred_prob)
0.98607682765290894

5. Examining a model for further insight

We will examine our trained Naive Bayes model to calculate the approximate "spamminess" of each token.
# store the vocabulary/tokens/feature_names/column values of X_train
X_train_tokens = vector.get_feature_names()
# total tokens or feature extracted from CountVectorizer 
len(X_train_tokens)
7465

# first 20 tokens 
print(X_train_tokens[:20])
['00', '000', '000pes', '008704050406', '0089', '0121', '01223585236', '01223585334', '0125698789', '02', '0207', '02073162414', '02085076972', '021', '03', '04', '0430', '05', '050703', '0578']

# last 20 tokens
print(X_train_tokens[:-21:-1])
['〨ud', 'èn', 'zouk', 'zogtorius', 'zoe', 'zeros', 'zed', 'zealand', 'zac', 'yupz', 'yup', 'yuou', 'yuo', 'yunny', 'yun', 'yummy', 'yummmm', 'ything', 'ystrday', 'yrs']

# Naive Bayes counts the number of times each token appears in each class
nb.feature_count_
array([[  0.,   0.,   1., ...,   0.,   1.,   1.],
       [  6.,  21.,   0., ...,   1.,   0.,   0.]])

# rows represent classes, columns represent tokens
nb.feature_count_.shape
(2, 7465)

# number of times each token appears across all HAM messages 
ham_token_count = nb.feature_count_[0, :]
ham_token_count 
array([ 0.,  0.,  1., ...,  0.,  1.,  1.])

# number of times each token appears across all SPAM messages 
spam_token_count = nb.feature_count_[1, :]
spam_token_count 
array([  6.,  21.,   0., ...,   1.,   0.,   0.])

# create a DataFrame of tokens with their separate ham and spam counts 
import pandas as pd
tokens = pd.DataFrame({'token': X_train_tokens, 'ham': ham_token_count, 'spam': spam_token_count}).set_index('token')
tokens.head()


| token        | ham | spam |
|--------------|-----|------|
| 00           | 0   | 6    |
| 000          | 0   | 21   |
| 000pes       | 1   | 0    |
| 008704050406 | 0   | 1    |
| 0089         | 0   | 1    |

# Examine random Dataframe rows
tokens.sample(5, random_state=6)


| token   | ham | spam |
|---------|-----|------|
| 872     | 0   | 2    |
| tantrum | 1   | 0    |
| little  | 23  | 1    |
| known   | 1   | 0    |
| massive | 2   | 0    |

# Number of samples encountered for each class during fitting. 
# This value is weighted by the sample weight when provided.
# Naive Bayes counts the number of observations in each class
nb.class_count_
array([ 3632.,   548.])

Before we can calculate the "spamminess" of each token, we need to avoid dividing by zero and account for the class imbalance.
# add 1 to ham and spam counts to avoid dividing by 0
tokens.ham += 1
tokens.spam += 1
tokens.sample(5, random_state=6)


| token   | ham | spam |
|---------|-----|------|
| 872     | 1   | 3    |
| tantrum | 2   | 1    |
| little  | 24  | 2    |
| known   | 2   | 1    |
| massive | 3   | 1    |

# convert spam and ham tokens into frequencies
tokens['ham_freq'] = tokens.ham / nb.class_count_[0]
tokens['spam_freq'] = tokens.spam / nb.class_count_[1]
tokens.sample(5, random_state=6)


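| token   | ham | spam | ham_freq | spam_freq |
|---------|-----|------|----------|-----------|
| 872     | 1   | 3    | 0.000275 | 0.005474  |
| tantrum | 2   | 1    | 0.000551 | 0.001825  |
| little  | 24  | 2    | 0.006608 | 0.003650  |
| known   | 2   | 1    | 0.000551 | 0.001825  |
| massive | 3   | 1    | 0.000826 | 0.001825  |
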
# calculate the ratio of spam frequency to ham frequency for each token
tokens['spam_ratio'] = tokens.spam_freq / tokens.ham_freq
tokens.sample(5, random_state=6)


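| token   | ham | spam | ham_freq | spam_freq | spam_ratio |
|---------|-----|------|----------|-----------|------------|
| 872     | 1   | 3    | 0.000275 | 0.005474  | 19.883212  |
| tantrum | 2   | 1    | 0.000551 | 0.001825  | 3.313869   |
| little  | 24  | 2    | 0.006608 | 0.003650  | 0.552311   |
| known   | 2   | 1    | 0.000551 | 0.001825  | 3.313869   |
| massive | 3   | 1    | 0.000826 | 0.001825  | 2.209246   |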
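
The spam ratio can be checked by hand. The sketch below repeats the steps above (add one to each count, divide by the number of messages in each class, then take the ratio) in plain Python, using the raw counts and class sizes printed earlier. The function name `spam_ratio` is only illustrative.

```python
def spam_ratio(ham_count, spam_count, n_ham, n_spam):
    """Smoothed ratio of a token's spam frequency to its ham frequency."""
    ham_freq = (ham_count + 1) / n_ham      # +1 avoids dividing by zero
    spam_freq = (spam_count + 1) / n_spam   # dividing by class size accounts for imbalance
    return spam_freq / ham_freq

# Class sizes from nb.class_count_
n_ham, n_spam = 3632, 548

# Raw counts (before adding 1) taken from the sampled rows above
ratio_872 = spam_ratio(0, 2, n_ham, n_spam)
ratio_little = spam_ratio(23, 1, n_ham, n_spam)

print(round(ratio_872, 6), round(ratio_little, 6))
```

A ratio above 1 means the token is relatively more common in spam, and a ratio below 1 means it is relatively more common in ham.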
    
  
# examine the 20 tokens with the highest spam ratio
tokens.sort_values('spam_ratio', ascending=False)[:20]


| token      | ham | spam | ham_freq | spam_freq | spam_ratio |
|------------|-----|------|----------|-----------|------------|
| claim      | 1   | 82   | 0.000275 | 0.149635  | 543.474453 |
| prize      | 1   | 71   | 0.000275 | 0.129562  | 470.569343 |
| uk         | 1   | 61   | 0.000275 | 0.111314  | 404.291971 |
| 150p       | 1   | 53   | 0.000275 | 0.096715  | 351.270073 |
| tone       | 1   | 45   | 0.000275 | 0.082117  | 298.248175 |
| 16         | 1   | 40   | 0.000275 | 0.072993  | 265.109489 |
| 18         | 1   | 37   | 0.000275 | 0.067518  | 245.226277 |
| guaranteed | 1   | 37   | 0.000275 | 0.067518  | 245.226277 |
| 1000       | 1   | 34   | 0.000275 | 0.062044  | 225.343066 |
| 500        | 1   | 32   | 0.000275 | 0.058394  | 212.087591 |
| 100        | 1   | 29   | 0.000275 | 0.052920  | 192.204380 |
| cs         | 1   | 27   | 0.000275 | 0.049270  | 178.948905 |
| ringtone   | 1   | 26   | 0.000275 | 0.047445  | 172.321168 |
| 10p        | 1   | 24   | 0.000275 | 0.043796  | 159.065693 |
| awarded    | 1   | 24   | 0.000275 | 0.043796  | 159.065693 |
| www        | 3   | 69   | 0.000826 | 0.125912  | 152.437956 |
| 000        | 1   | 22   | 0.000275 | 0.040146  | 145.810219 |
| 5000       | 1   | 21   | 0.000275 | 0.038321  | 139.182482 |
| weekly     | 1   | 21   | 0.000275 | 0.038321  | 139.182482 |
| mob        | 1   | 21   | 0.000275 | 0.038321  | 139.182482 |

# examine the 20 tokens with the lowest spam ratio (the most ham-like tokens)
tokens.sort_values('spam_ratio', ascending=True)[:20]


| token    | ham | spam | ham_freq | spam_freq | spam_ratio |
|----------|-----|------|----------|-----------|------------|
| gt       | 240 | 1    | 0.066079 | 0.001825  | 0.027616   |
| lt       | 237 | 1    | 0.065253 | 0.001825  | 0.027965   |
| he       | 176 | 1    | 0.048458 | 0.001825  | 0.037658   |
| lor      | 123 | 1    | 0.033866 | 0.001825  | 0.053884   |
| she      | 118 | 1    | 0.032489 | 0.001825  | 0.056167   |
| later    | 116 | 1    | 0.031938 | 0.001825  | 0.057136   |
| da       | 111 | 1    | 0.030562 | 0.001825  | 0.059709   |
| ask      | 67  | 1    | 0.018447 | 0.001825  | 0.098921   |
| but      | 332 | 5    | 0.091410 | 0.009124  | 0.099815   |
| amp      | 66  | 1    | 0.018172 | 0.001825  | 0.100420   |
| said     | 65  | 1    | 0.017896 | 0.001825  | 0.101965   |
| doing    | 65  | 1    | 0.017896 | 0.001825  | 0.101965   |
| home     | 129 | 2    | 0.035518 | 0.003650  | 0.102756   |
| really   | 63  | 1    | 0.017346 | 0.001825  | 0.105202   |
| morning  | 60  | 1    | 0.016520 | 0.001825  | 0.110462   |
| come     | 175 | 3    | 0.048183 | 0.005474  | 0.113618   |
| lol      | 57  | 1    | 0.015694 | 0.001825  | 0.116276   |
| its      | 170 | 3    | 0.046806 | 0.005474  | 0.116960   |
| anything | 55  | 1    | 0.015143 | 0.001825  | 0.120504   |
| cos      | 55  | 1    | 0.015143 | 0.001825  | 0.120504   |

# interactively look up the spam ratio of single words; type q, end or quit to stop
test_value = input('Enter any single text words - ').lower().strip()
while test_value not in ['q', 'end', 'quit']:
    if test_value:
        try:
            print('Spam Ratio of {} is {}'.format(test_value, tokens.loc[test_value, 'spam_ratio']))
        except KeyError:
            # the word never appeared in the training messages
            print('Try again! The word {} is not found in the training dictionary.'.format(test_value))
    test_value = input('Enter any single text words - ').lower().strip()
Enter any single text words - money
Spam Ratio of money is 0.6762997169670787
Enter any single text words - vijay
Spam Ratio of vijay is 1.3255474452554743
Enter any single text words - anand
Spam Ratio of anand is 3.313868613138686
Enter any single text words - cool
Spam Ratio of cool is 0.7101147028154328
Enter any single text words - honey
Spam Ratio of honey is 1.1046228710462287
Enter any single text words - click
Spam Ratio of click is 13.255474452554745
Enter any single text words - won
Spam Ratio of won is 36.15129396151294
Enter any single text words - q
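
The same lookup can be wrapped in a small function so it can be reused without the interactive prompt. This is only a sketch: the dictionary below holds a few ratios copied from the session above (in the notebook they would come from `tokens['spam_ratio']`), and the names `lookup_spam_ratio` and `describe` are hypothetical.

```python
# A few ratios copied from the session above, for illustration only
spam_ratios = {
    "money": 0.6762997169670787,
    "click": 13.255474452554745,
    "won": 36.15129396151294,
}

def lookup_spam_ratio(word, ratios):
    """Return the spam ratio for a word, or None if it was never seen in training."""
    return ratios.get(word.lower().strip())

def describe(word, ratios):
    """Summarise whether a word leans towards spam (ratio > 1) or ham."""
    ratio = lookup_spam_ratio(word, ratios)
    if ratio is None:
        return "{} is not found in the training dictionary".format(word)
    leaning = "spam" if ratio > 1 else "ham"
    return "{} leans {} (spam ratio {:.2f})".format(word, leaning, ratio)

for word in ["won", "money", "unicorn"]:
    print(describe(word, spam_ratios))
```

Returning None for unknown words keeps the lookup free of exception handling.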

My repo also includes a grid search for the model. Please visit my GitHub repository, SMS Spam predictor, for more information.