The goal of this project is to analyse the SMS dataset and build machine learning models that predict or classify each SMS text as spam or ham. For my latest code and datasets, please visit my GitHub account.
The walkthrough covers:
- Reading a text-based dataset into pandas
- Vectorizing our dataset
- Building and evaluating a model
- Comparing models
- Examining a model for further insight
If you already have a clean, well-defined dataset, loading it is the very first step of a data science job. The goal of this section is to read the dataset into a pandas object without errors so that it can be analysed. You can download the dataset here. It is an interesting dataset to analyse. Please read each output carefully; it will help you a lot later on.
import pandas as pd
import os
spam_file = 'data/sms.csv'
if not os.path.isfile(spam_file):
    print(spam_file, ' is missing.')
    exit()
# 1. Loading dataset
sms_df = pd.read_csv(spam_file, sep='\t')
# getting the shape of the dataset,
# i.e. (number of rows, number of columns)
sms_df.shape
(5574, 2)
# getting the column names of dataset
sms_df.columns
Index(['label', 'message'], dtype='object')
# getting the random sample values from the dataset
sms_df.sample(5)
| | label | message |
|---|---|---|
| 43 | ham | WHO ARE YOU SEEING? |
| 1536 | spam | You have won a Nokia 7250i. This is what you g... |
| 5231 | ham | It means u could not keep ur words. |
| 655 | ham | Did u got that persons story |
| 3461 | ham | I am back. Bit long cos of accident on a30. Ha... |
# reading the top / head values from the dataset
sms_df.head(5)
| | label | message |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |
# getting the last / tail values from the dataset
sms_df.tail(5)
| | label | message |
|---|---|---|
| 5569 | spam | This is the 2nd time we have tried 2 contact u... |
| 5570 | ham | Will ü b going to esplanade fr home? |
| 5571 | ham | Pity, * was in mood for that. So...any other s... |
| 5572 | ham | The guy did some bitching but I acted like i'd... |
| 5573 | ham | Rofl. Its true to its name |
# Let us see the unique values in the label column
sms_df.label.unique()
array(['ham', 'spam'], dtype=object)
# Let us see how many spam values and ham values are in the dataset.
sms_df.label.value_counts()
ham 4827
spam 747
Name: label, dtype: int64
# converting label to a numerical value
sms_df['label_num'] = sms_df['label'].map({'ham': 0, 'spam': 1})
# 2. Feature matrix (X), response vector (y) and train_test_split
X = sms_df['message']
y = sms_df['label_num']
# Let's check the shape of X and y
X.shape
(5574,)
y.shape
(5574,)
# split X and y into train and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=5)
for df in [X_train, X_test, y_train, y_test]:
    print(df.shape)
(4180,)
(1394,)
(4180,)
(1394,)
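Because ham outnumbers spam by roughly six to one in this dataset, it can be useful to keep that proportion identical in the train and test sets. The sketch below is not part of the original notebook: it uses made-up toy data (`X_toy`, `y_toy`) and the `stratify` argument of `train_test_split` to illustrate the idea.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the SMS data (hypothetical): 8 ham (0) and 2 spam (1)
X_toy = pd.Series([f"message {i}" for i in range(10)])
y_toy = pd.Series([0] * 8 + [1] * 2)

# stratify=y_toy asks for the same ham/spam proportion in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.5, stratify=y_toy, random_state=5
)

# share of spam in the full data, the training split and the test split
print(y_toy.mean(), y_tr.mean(), y_te.mean())
```

Without `stratify`, a small or unlucky split could end up with very few spam messages on one side, which makes the evaluation less reliable.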
from sklearn.feature_extraction.text import CountVectorizer
# instantiate the vectorizer
vector = CountVectorizer()
# learn training data vocabulary, then use it to create a document-term matrix
vector.fit(X_train)
X_train_dtm = vector.transform(X_train)
# equivalent to the previous two steps, combined into a single call
X_train_dtm = vector.fit_transform(X_train)
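# examine the fitted vocabulary
# getting the last 10 features (in reverse order)
# note: scikit-learn 1.2 and later removed get_feature_names(); use get_feature_names_out() there
vector.get_feature_names()[:-11:-1]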
['〨ud',
'èn',
'zouk',
'zogtorius',
'zoe',
'zeros',
'zed',
'zealand',
'zac',
'yupz']
# total features/tokens/columns in the matrix
len(vector.get_feature_names())
7465
# examine the document matrix
X_train_dtm
<4180x7465 sparse matrix of type '<class 'numpy.int64'>'
with 54983 stored elements in Compressed Sparse Row format>
# transform the testing dataset (using the fitted vocabulary) into a document-term matrix
X_test_dtm = vector.transform(X_test)
# examine the document matrix
X_test_dtm
<1394x7465 sparse matrix of type '<class 'numpy.int64'>'
with 17831 stored elements in Compressed Sparse Row format>
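To make the vectorizing step easier to picture, here is a small sketch on three made-up training messages and one made-up test message (none of them come from the SMS dataset). It shows that the vocabulary is learned from the training text only, and that a test word missing from that vocabulary is silently dropped.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini corpus, not taken from the SMS dataset
train_docs = ["call you tonight", "Call me a cab", "please call me PLEASE"]
test_docs = ["please don't call me"]

vect = CountVectorizer()

# learn the vocabulary from the training text and build its document-term matrix
train_dtm = vect.fit_transform(train_docs)

# the default settings lowercase the text and ignore one-character tokens such as "a"
vocab = sorted(vect.vocabulary_)
print(vocab)  # ['cab', 'call', 'me', 'please', 'tonight', 'you']

# one row per message, one column per token (columns follow the sorted vocabulary)
print(train_dtm.toarray())

# the test message is encoded with the training vocabulary; "don" is unknown and ignored
test_dtm = vect.transform(test_docs)
print(test_dtm.toarray())  # [[0 1 1 1 0 0]]
```

This is why the test matrix above has the same 7465 columns as the training matrix: both are expressed in the vocabulary fitted on `X_train`.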
We will use multinomial Naive Bayes:
The multinomial Naive Bayes classifier is suitable for classification with discrete features (e.g., word counts for text classification). The multinomial distribution normally requires integer feature counts. However, in practice, fractional counts such as tf-idf may also work.
Here, "discrete" means that the features take separate, distinct values (such as whole-number word counts) rather than continuous ones.
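As a rough illustration of what the multinomial model learns from count features, the sketch below fits `MultinomialNB` on a tiny made-up count matrix and rebuilds its per-class token log-probabilities by hand. The smoothing formula used here, (token count + alpha) / (total count in the class + alpha * number of tokens), is stated as an assumption and then checked against the fitted model.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical count matrix: 4 messages, 3 tokens
X_counts = np.array([[2, 1, 0],
                     [1, 0, 1],
                     [0, 3, 1],
                     [0, 2, 2]])
y_labels = np.array([0, 0, 1, 1])  # 0 = ham, 1 = spam

alpha = 1.0  # additive (Laplace) smoothing, the scikit-learn default
nb_toy = MultinomialNB(alpha=alpha).fit(X_counts, y_labels)

# total count of each token within each class
class_token_counts = np.vstack(
    [X_counts[y_labels == c].sum(axis=0) for c in (0, 1)]
)

# smoothed log P(token | class), computed by hand
manual_log_prob = np.log(
    (class_token_counts + alpha)
    / (class_token_counts.sum(axis=1, keepdims=True) + alpha * X_counts.shape[1])
)

matches = np.allclose(manual_log_prob, nb_toy.feature_log_prob_)
print(matches)
```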
# import and instantiate a Multinomial Naive Bayes model
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()
# train the model using X_train_dtm and target values y_train (timing it with an IPython "magic command")
%time nb.fit(X_train_dtm, y_train)
CPU times: user 12 ms, sys: 0 ns, total: 12 ms
Wall time: 14 ms
MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
# make the class predictions for X_test_dtm
y_predict = nb.predict(X_test_dtm)
# calculate accuracy of class predictions
from sklearn import metrics
metrics.accuracy_score(y_test, y_predict)
0.98995695839311337
# print the confusion matrix
metrics.confusion_matrix(y_test, y_predict)
array([[1193, 2],
[ 12, 187]])
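The confusion matrix can be read as [[true negatives, false positives], [false negatives, true positives]]. As a quick check, the sketch below recomputes the accuracy from those four numbers and also derives precision and recall for the spam class, which the accuracy alone does not show.

```python
import numpy as np

# confusion matrix reported above: rows = actual class, columns = predicted class
cm = np.array([[1193, 2],
               [12, 187]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # share of all messages classified correctly
precision = tp / (tp + fp)        # share of predicted spam that really is spam
recall = tp / (tp + fn)           # share of real spam that was caught

print(round(accuracy, 6), round(precision, 6), round(recall, 6))
# 0.989957 0.989418 0.939698
```

The accuracy matches the score printed earlier, while the recall shows that about 6% of the spam messages still slip through as ham.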
# FALSE POSITIVES (ham incorrectly classified as spam)
X_test[(y_predict==1) & (y_test==0)]
1290 Hey...Great deal...Farm tour 9am to 5pm $95/pa...
5046 We have sent JD for Customer Service cum Accou...
Name: message, dtype: object
# FALSE NEGATIVES (spam incorrectly classified as ham)
X_test[(y_predict==0) & (y_test==1)]
2663 Hello darling how are you today? I would love ...
1940 More people are dogging in your area now. Call...
5429 Santa Calling! Would your little ones like a c...
2402 Babe: U want me dont u baby! Im nasty and have...
869 Hello. We need some posh birds and chaps to us...
3064 Hi babe its Jordan, how r u? Im home from abro...
3530 Xmas & New Years Eve tickets are now on sale f...
1663 Hi if ur lookin 4 saucy daytime fun wiv busty ...
2430 Guess who am I?This is the first time I create...
1469 Hi its LUCY Hubby at meetins all day Fri & I w...
4676 Hi babe its Chloe, how r u? I was smashed on s...
1458 CLAIRE here am havin borin time & am now alone...
Name: message, dtype: object
X_test[1458]
'CLAIRE here am havin borin time & am now alone U wanna cum over 2nite? Chat now 09099725823 hope 2 C U Luv CLAIRE xx Calls£1/minmoremobsEMSPOBox45PO139WA'
# calculate predicted probabilities for X_test_dtm
y_pred_prob = nb.predict_proba(X_test_dtm)[:, 1]
y_pred_prob
array([ 1.30006229e-02, 9.95618101e-04, 6.15130955e-04, ...,
9.77996362e-01, 1.41464847e-04, 1.32646379e-01])
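# Calculate AUC
# NOTE: this call passes the hard class predictions (y_predict), so the score is the
# mean of sensitivity and specificity; pass y_pred_prob for the usual probability-based AUC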
metrics.roc_auc_score(y_test, y_predict)
0.96901242614747374
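The Naive Bayes score above was computed from `y_predict` (hard 0/1 labels) rather than from `y_pred_prob`. With hard labels the ROC curve has a single interior point, so the AUC reduces to the average of sensitivity and specificity. The sketch below rebuilds label arrays from the confusion matrix counts to confirm this; the arrays are reconstructed for illustration and are not the original `y_test`.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# counts taken from the Naive Bayes confusion matrix shown earlier
tn, fp, fn, tp = 1193, 2, 12, 187

# rebuild label arrays with the same counts (ordering does not matter for AUC)
y_true = np.array([0] * (tn + fp) + [1] * (fn + tp))
y_hard = np.array([0] * tn + [1] * fp + [0] * fn + [1] * tp)

sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
balanced = (sensitivity + specificity) / 2

auc_from_labels = roc_auc_score(y_true, y_hard)
print(round(balanced, 6), round(auc_from_labels, 6))  # 0.969012 0.969012
```

For a like-for-like comparison with the logistic regression AUC below, the Naive Bayes AUC would need to be computed from `y_pred_prob` as well.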
We will compare multinomial Naive Bayes with logistic regression:
Logistic regression, despite its name, is a linear model for classification rather than regression. Logistic regression is also known in the literature as logit regression, maximum-entropy classification (MaxEnt) or the log-linear classifier. In this model, the probabilities describing the possible outcomes of a single trial are modeled using a logistic function.
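To illustrate the last sentence, here is a minimal sketch on made-up one-dimensional data. It assumes the default binary setup of scikit-learn's `LogisticRegression`, where the predicted probability of the positive class is the logistic function applied to the linear score returned by `decision_function`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def logistic(z):
    """Logistic (sigmoid) function: maps any real score to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical toy data: one feature, two classes
X_small = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y_small = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression().fit(X_small, y_small)

linear_score = clf.decision_function(X_small)   # w * x + b for each sample
proba_spam = clf.predict_proba(X_small)[:, 1]   # P(class 1 | x)

same = np.allclose(proba_spam, logistic(linear_score))
print(same)
```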
# import and instantiate the logistic regression model
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
# train the model using X_train_dtm
%time logreg.fit(X_train_dtm, y_train)
CPU times: user 48 ms, sys: 0 ns, total: 48 ms
Wall time: 47.5 ms
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
verbose=0, warm_start=False)
# to make the class predictions of X_test_dtm
y_predict = logreg.predict(X_test_dtm)
# to calculate the accuracy
metrics.accuracy_score(y_test, y_predict)
0.98206599713055953
# to calculate the predicted probabilities for X_test_dtm (well calibrated)
y_pred_prob = logreg.predict_proba(X_test_dtm)[:, 1]
# to calculate AUC
metrics.roc_auc_score(y_test, y_pred_prob)
0.98607682765290894
We will examine our trained Naive Bayes model to calculate the approximate "spamminess" of each token.
# store the vocabulary/tokens/feature_names/column values of X_train
X_train_tokens = vector.get_feature_names()
# total number of tokens (features) extracted by CountVectorizer
len(X_train_tokens)
7465
# first 20 tokens
print(X_train_tokens[:20])
['00', '000', '000pes', '008704050406', '0089', '0121', '01223585236', '01223585334', '0125698789', '02', '0207', '02073162414', '02085076972', '021', '03', '04', '0430', '05', '050703', '0578']
# last 20 tokens
print(X_train_tokens[:-21:-1])
['〨ud', 'èn', 'zouk', 'zogtorius', 'zoe', 'zeros', 'zed', 'zealand', 'zac', 'yupz', 'yup', 'yuou', 'yuo', 'yunny', 'yun', 'yummy', 'yummmm', 'ything', 'ystrday', 'yrs']
# Naive Bayes counts the number of times each token appears in each class
nb.feature_count_
array([[ 0., 0., 1., ..., 0., 1., 1.],
[ 6., 21., 0., ..., 1., 0., 0.]])
# rows represent classes, columns represent tokens
nb.feature_count_.shape
(2, 7465)
# number of times each token appears across all HAM messages
ham_token_count = nb.feature_count_[0, :]
ham_token_count
array([ 0., 0., 1., ..., 0., 1., 1.])
# number of times each token appears across all SPAM messages
spam_token_count = nb.feature_count_[1, :]
spam_token_count
array([ 6., 21., 0., ..., 1., 0., 0.])
# create a DataFrame of tokens with their separate ham and spam counts
import pandas as pd
tokens = pd.DataFrame({'token': X_train_tokens, 'ham': ham_token_count, 'spam': spam_token_count}).set_index('token')
tokens.head()
token | ham | spam |
---|---|---|
00 | 0 | 6 |
000 | 0 | 21 |
000pes | 1 | 0 |
008704050406 | 0 | 1 |
0089 | 0 | 1 |
# Examine 5 random DataFrame rows
tokens.sample(5, random_state=6)
token | ham | spam |
---|---|---|
872 | 0 | 2 |
tantrum | 1 | 0 |
little | 23 | 1 |
known | 1 | 0 |
massive | 2 | 0 |
# Number of samples encountered for each class during fitting.
# This value is weighted by the sample weight when provided.
# Naive Bayes counts the number of observations in each class
nb.class_count_
array([ 3632., 548.])
Before we can calculate the "spamminess" of each token, we need to avoid dividing by zero and account for the class imbalance.
# add 1 to ham and spam counts to avoid dividing by 0
tokens.ham += 1
tokens.spam += 1
tokens.sample(5, random_state=6)
token | ham | spam |
---|---|---|
872 | 1 | 3 |
tantrum | 2 | 1 |
little | 24 | 2 |
known | 2 | 1 |
massive | 3 | 1 |
# convert spam and ham tokens into frequencies
tokens['ham_freq'] = tokens.ham / nb.class_count_[0]
tokens['spam_freq'] = tokens.spam / nb.class_count_[1]
tokens.sample(5, random_state=6)
token | ham | spam | ham_freq | spam_freq |
---|---|---|---|---|
872 | 1 | 3 | 0.000275 | 0.005474 |
tantrum | 2 | 1 | 0.000551 | 0.001825 |
little | 24 | 2 | 0.006608 | 0.003650 |
known | 2 | 1 | 0.000551 | 0.001825 |
massive | 3 | 1 | 0.000826 | 0.001825 |
# calculate the ratio of spam frequency to ham frequency for each token
tokens['spam_ratio'] = tokens.spam_freq / tokens.ham_freq
tokens.sample(5, random_state=6)
token | ham | spam | ham_freq | spam_freq | spam_ratio |
---|---|---|---|---|---|
872 | 1 | 3 | 0.000275 | 0.005474 | 19.883212 |
tantrum | 2 | 1 | 0.000551 | 0.001825 | 3.313869 |
little | 24 | 2 | 0.006608 | 0.003650 | 0.552311 |
known | 2 | 1 | 0.000551 | 0.001825 | 3.313869 |
massive | 3 | 1 | 0.000826 | 0.001825 | 2.209246 |
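The whole calculation for a single token fits in a few lines. The sketch below wraps it in a small helper (`spam_ratio` is a name introduced here for illustration) and reproduces two rows of the table from their raw counts, using the class counts 3632 (ham) and 548 (spam) reported by `nb.class_count_`.

```python
def spam_ratio(ham_count, spam_count, n_ham=3632, n_spam=548):
    """Spam ratio of one token from its raw counts.

    Adds 1 to each count (to avoid dividing by zero), converts the counts
    to per-class frequencies (to account for the class imbalance), then
    divides the spam frequency by the ham frequency.
    """
    ham_freq = (ham_count + 1) / n_ham
    spam_freq = (spam_count + 1) / n_spam
    return spam_freq / ham_freq

# raw counts before the +1 adjustment, as shown in the earlier sample
print(round(spam_ratio(0, 2), 6))    # token "872":    19.883212
print(round(spam_ratio(23, 1), 6))   # token "little":  0.552311
```

A ratio well above 1 means the token is relatively more common in spam; a ratio well below 1 means it is relatively more common in ham.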
# examine the 20 tokens with the highest spam ratio (the "spammiest" tokens)
tokens.sort_values('spam_ratio', ascending=False)[:20]
token | ham | spam | ham_freq | spam_freq | spam_ratio |
---|---|---|---|---|---|
claim | 1 | 82 | 0.000275 | 0.149635 | 543.474453 |
prize | 1 | 71 | 0.000275 | 0.129562 | 470.569343 |
uk | 1 | 61 | 0.000275 | 0.111314 | 404.291971 |
150p | 1 | 53 | 0.000275 | 0.096715 | 351.270073 |
tone | 1 | 45 | 0.000275 | 0.082117 | 298.248175 |
16 | 1 | 40 | 0.000275 | 0.072993 | 265.109489 |
18 | 1 | 37 | 0.000275 | 0.067518 | 245.226277 |
guaranteed | 1 | 37 | 0.000275 | 0.067518 | 245.226277 |
1000 | 1 | 34 | 0.000275 | 0.062044 | 225.343066 |
500 | 1 | 32 | 0.000275 | 0.058394 | 212.087591 |
100 | 1 | 29 | 0.000275 | 0.052920 | 192.204380 |
cs | 1 | 27 | 0.000275 | 0.049270 | 178.948905 |
ringtone | 1 | 26 | 0.000275 | 0.047445 | 172.321168 |
10p | 1 | 24 | 0.000275 | 0.043796 | 159.065693 |
awarded | 1 | 24 | 0.000275 | 0.043796 | 159.065693 |
www | 3 | 69 | 0.000826 | 0.125912 | 152.437956 |
000 | 1 | 22 | 0.000275 | 0.040146 | 145.810219 |
5000 | 1 | 21 | 0.000275 | 0.038321 | 139.182482 |
weekly | 1 | 21 | 0.000275 | 0.038321 | 139.182482 |
mob | 1 | 21 | 0.000275 | 0.038321 | 139.182482 |
# examine the 20 tokens with the lowest spam ratio (the "hammiest" tokens)
tokens.sort_values('spam_ratio', ascending=True)[:20]
token | ham | spam | ham_freq | spam_freq | spam_ratio |
---|---|---|---|---|---|
gt | 240 | 1 | 0.066079 | 0.001825 | 0.027616 |
lt | 237 | 1 | 0.065253 | 0.001825 | 0.027965 |
he | 176 | 1 | 0.048458 | 0.001825 | 0.037658 |
lor | 123 | 1 | 0.033866 | 0.001825 | 0.053884 |
she | 118 | 1 | 0.032489 | 0.001825 | 0.056167 |
later | 116 | 1 | 0.031938 | 0.001825 | 0.057136 |
da | 111 | 1 | 0.030562 | 0.001825 | 0.059709 |
ask | 67 | 1 | 0.018447 | 0.001825 | 0.098921 |
but | 332 | 5 | 0.091410 | 0.009124 | 0.099815 |
amp | 66 | 1 | 0.018172 | 0.001825 | 0.100420 |
said | 65 | 1 | 0.017896 | 0.001825 | 0.101965 |
doing | 65 | 1 | 0.017896 | 0.001825 | 0.101965 |
home | 129 | 2 | 0.035518 | 0.003650 | 0.102756 |
really | 63 | 1 | 0.017346 | 0.001825 | 0.105202 |
morning | 60 | 1 | 0.016520 | 0.001825 | 0.110462 |
come | 175 | 3 | 0.048183 | 0.005474 | 0.113618 |
lol | 57 | 1 | 0.015694 | 0.001825 | 0.116276 |
its | 170 | 3 | 0.046806 | 0.005474 | 0.116960 |
anything | 55 | 1 | 0.015143 | 0.001825 | 0.120504 |
cos | 55 | 1 | 0.015143 | 0.001825 | 0.120504 |
# look up the spam ratio of words typed by the user until 'q', 'end' or 'quit' is entered
test_value = input('Enter any single text words - ').strip().lower()
while test_value not in ['q', 'end', 'quit']:
    if test_value:
        try:
            print('Spam Ratio of {} is {}'.format(test_value, tokens.loc[test_value, 'spam_ratio']))
        except KeyError:
            # the word is not one of the tokens learned from the training data
            print('Try again! The word {} is not found in the training dictionary.'.format(test_value))
    test_value = input('Enter any single text words - ').strip().lower()
Enter any single text words - money
Spam Ratio of money is 0.6762997169670787
Enter any single text words - vijay
Spam Ratio of vijay is 1.3255474452554743
Enter any single text words - anand
Spam Ratio of anand is 3.313868613138686
Enter any single text words - cool
Spam Ratio of cool is 0.7101147028154328
Enter any single text words - honey
Spam Ratio of honey is 1.1046228710462287
Enter any single text words - click
Spam Ratio of click is 13.255474452554745
Enter any single text words - won
Spam Ratio of won is 36.15129396151294
Enter any single text words - q
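For scripts or tests, a non-interactive version of the same lookup can be handy. The sketch below is only an illustration: `lookup_spam_ratio` is a hypothetical helper, and `toy_tokens` is a three-row stand-in for the real `tokens` DataFrame, filled with rounded values from the session above.

```python
import pandas as pd

# Stand-in for the real `tokens` DataFrame (values rounded from the session above)
toy_tokens = pd.DataFrame(
    {"spam_ratio": [0.676300, 13.255474, 36.151294]},
    index=["money", "click", "won"],
)

def lookup_spam_ratio(tokens_df, word):
    """Return the spam ratio of `word`, or None if it is not a known token."""
    key = word.strip().lower()
    if key in tokens_df.index:
        return float(tokens_df.loc[key, "spam_ratio"])
    return None

print(lookup_spam_ratio(toy_tokens, " Won "))   # 36.151294
print(lookup_spam_ratio(toy_tokens, "xyz"))     # None
```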
My repo also includes a grid search for the model. Please visit my GitHub project, SMS Spam predictor, for more information.
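The grid search in the repo is not shown in this post, so the following is only a rough sketch of what such a search could look like, not the repo's actual code. It assumes a pipeline of `CountVectorizer` and `MultinomialNB`, searches over the smoothing parameter `alpha`, and runs on a dozen made-up messages so that it is self-contained.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy messages: 6 ham (0) followed by 6 spam (1)
toy_messages = [
    "see you at home later", "are you coming home", "ok see you then",
    "call me when you are home", "lol that was funny", "i will ask her later",
    "claim your free prize now", "you have won a prize call now",
    "free ringtone claim now", "win cash prize call this number",
    "guaranteed prize claim today", "urgent you won free cash",
]
toy_labels = [0] * 6 + [1] * 6

# vectorizer and classifier chained so each CV fold fits its own vocabulary
pipe = make_pipeline(CountVectorizer(), MultinomialNB())

# make_pipeline names each step after its lowercased class name
param_grid = {"multinomialnb__alpha": [0.1, 0.5, 1.0]}

grid = GridSearchCV(pipe, param_grid, cv=3, scoring="accuracy")
grid.fit(toy_messages, toy_labels)

print(grid.best_params_, grid.best_score_)
```

Putting the vectorizer inside the pipeline means the vocabulary is re-learned on each training fold, so no information from the held-out fold leaks into the features.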