Gist by @mdfarragher, created November 8, 2019 15:31

Assignment: Detect spam SMS messages

In this assignment you're going to build an app that can automatically detect spam SMS messages.

The first thing you'll need is a file with lots of SMS messages, correctly labelled as being spam or not spam. You will use a dataset compiled by Caroline Tagg in her 2009 PhD thesis. This dataset has 5574 messages.

Download the list of messages and save it as spam.tsv.

The data file looks like this:

[Screenshot: spam message list]

It’s a TSV file with only 2 columns of information:

  • Label: ‘spam’ for a spam message and ‘ham’ for a normal message.
  • Message: the full text of the SMS message.

You will build a binary classification network that reads in all messages and then makes a prediction for each message if it is spam or ham.

Let’s get started. You need to build a new application from scratch by opening a terminal and creating a new .NET Core console project:

https://gist.github.com/f1b66c3aa4bf4e9ede53117afa9d24fb
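The gist isn't rendered here, but creating the project presumably looks something like this (the folder name `SpamDetection` is illustrative, not taken from the original):

```shell
# create a new .NET Core console project and move into it
dotnet new console -o SpamDetection
cd SpamDetection
```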

Also make sure to copy the dataset file spam.tsv into this folder because the code you're going to type next will expect it here.

Now install the following packages

https://gist.github.com/26248296a3e007aed031f83a2ee074fd
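Based on the package names described below, the installation commands are presumably along these lines (exact versions omitted):

```shell
# machine learning pipeline and data loading
dotnet add package Microsoft.ML
# Microsoft Cognitive Toolkit with GPU support
dotnet add package CNTK.GPU
# plotting library, plus its F# dependency
dotnet add package XPlot.Plotly
dotnet add package FSharp.Core
```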

Microsoft.ML is the Microsoft machine learning package. We will use it to load and process the data from the dataset. The CNTK.GPU library is Microsoft's Cognitive Toolkit, which can train and run deep neural networks. And XPlot.Plotly is an awesome plotting library based on Plotly. That library is designed for F#, so we also need to pull in the FSharp.Core library.

The CNTK.GPU package will train and run deep neural networks using your GPU. You'll need an NVIDIA GPU and CUDA graphics drivers for this to work.

If you don't have an NVIDIA GPU or suitable drivers, the library will fall back to the CPU instead. This will still work, but training neural networks will take significantly longer.

CNTK is a low-level tensor library for building, training, and running deep neural networks. The code to build a deep neural network can get a bit verbose, so I've developed a little wrapper called CNTKUtil that will help you write code faster.

Please download the CNTKUtil files in a new CNTKUtil folder at the same level as your project folder.

Then make sure you're in the console project folder and create a project reference like this:

https://gist.github.com/66f65a16bfab1de6054a26f9eabccbcd
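Since the CNTKUtil folder sits at the same level as the project folder, the reference command is presumably something like:

```shell
# reference the CNTKUtil project from the console project
dotnet add reference ../CNTKUtil/CNTKUtil.csproj
```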

Now you are ready to start writing code. Edit the Program.cs file with Visual Studio Code and add the following code:

https://gist.github.com/32751809253c6d475db99a1269edd482

The SpamData class holds all the data for one single spam message. Note how each field is tagged with a LoadColumn attribute that will tell the TSV data loading code from which column to import the data.
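Based on that description, the SpamData class is probably shaped like this sketch (the property names are assumptions; only the LoadColumn attributes and the two-column TSV layout come from the text):

```csharp
using Microsoft.ML.Data;

/// <summary>
/// Holds the data for one single SMS message from the TSV file.
/// </summary>
public class SpamData
{
    [LoadColumn(0)] public string Label { get; set; }    // 'spam' or 'ham'
    [LoadColumn(1)] public string Message { get; set; }  // the full SMS text
}
```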

Unfortunately we can't train a deep neural network on text data directly. We first need to convert the data to numbers, for example with the sparse vector encoding trick you learned about in previous lectures.

We'll get to that conversion later. For now we'll add a class here that will contain the converted text:

https://gist.github.com/01ecd04a211db0bf89507d3883586b16

There's the Label again, but notice how the message has now been converted to a VBuffer and stored in the Features field.

The VBuffer type is a sparse vector. It's going to store the sparse vector-encoded message text so we can train a neural network on it. The nice thing about this .NET type is that it only stores the nonzero values. The zeroes are not stored and do not occupy any space in memory.

The GetFeatures method calls DenseValues to return the complete sparse vector and returns it as a float[] that our neural network understands.

And there's a GetLabel method that returns 1 if the message is spam (indicated by the Label field containing the word 'spam') and 0 if the message is not spam.

The features represent the sparse vector-encoded text that we will use to train the neural network on, and the label is the output variable that we're trying to predict. So here we're training on encoded text to predict if that text is spam or not.
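Putting the pieces described above together, the ProcessedData class presumably looks something like this sketch (member names taken from the text, exact shape assumed):

```csharp
using System.Linq;
using Microsoft.ML.Data;

/// <summary>
/// Holds one message after the text has been converted to a sparse vector.
/// </summary>
public class ProcessedData
{
    public string Label { get; set; }
    public VBuffer<float> Features { get; set; }

    // expand the sparse vector into the dense float[] the network expects
    public float[] GetFeatures() => Features.DenseValues().ToArray();

    // 1 = spam, 0 = ham
    public float GetLabel() => Label == "spam" ? 1.0f : 0.0f;
}
```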

Now it's time to start writing the main program method:

https://gist.github.com/a41971907f1ddb08418028096d706a92

When working with the ML.NET library we always need to set up a machine learning context represented by the MLContext class.

The code calls the LoadFromTextFile method to load the TSV data in memory. Note the SpamData type argument that tells the method which class to use to load the data.

We then use TrainTestSplit to split the data in a training partition containing 70% of the data and a testing partition containing 30% of the data.

Note that we're deviating from the usual 80-20 split here. This is because the data file is quite small, and so 20% of the data is simply not enough to test the neural network on.
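The loading and splitting steps described above can be sketched like this (whether the file has a header row is an assumption):

```csharp
using Microsoft.ML;

// set up the ML.NET machine learning context
var context = new MLContext();

// load the TSV file in memory
var data = context.Data.LoadFromTextFile<SpamData>(
    "spam.tsv",
    separatorChar: '\t',
    hasHeader: true);   // assumption: the file may or may not have a header row

// split into 70% training and 30% testing partitions
var partitions = context.Data.TrainTestSplit(data, testFraction: 0.3);
```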

Now it's time to build a pipeline to convert the text to sparse vector-encoded data. We'll use the FeaturizeText component in the ML.NET machine learning library:

https://gist.github.com/3011dcf404ec1deaefa959530c697a81

Machine learning pipelines in ML.NET are built by stacking transformation components. Here we're using a single component, FeaturizeText, that converts the text messages in SpamData.Message into sparse vector-encoded data in a new column called 'Features'.

We call the Fit method to initialize the pipeline, and then call Transform twice to transform the text in the training and testing partitions.

Finally we call CreateEnumerable to convert the training and testing data to an enumeration of ProcessedData instances. So now we have the training data in training and the testing data in testing. Both are enumerations of ProcessedData instances.
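A sketch of that pipeline, assuming the output column is called 'Features' as the text says:

```csharp
// build a pipeline with a single FeaturizeText component that
// converts the Message column into a sparse 'Features' column
var pipeline = context.Transforms.Text.FeaturizeText(
    "Features", nameof(SpamData.Message));

// fit the pipeline once, then transform both partitions
var model = pipeline.Fit(partitions.TrainSet);
var trainingData = model.Transform(partitions.TrainSet);
var testingData = model.Transform(partitions.TestSet);

// materialize both partitions as enumerations of ProcessedData
var training = context.Data.CreateEnumerable<ProcessedData>(
    trainingData, reuseRowObject: false).ToArray();
var testing = context.Data.CreateEnumerable<ProcessedData>(
    testingData, reuseRowObject: false).ToArray();
```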

But CNTK can't train on an enumeration of class instances. It requires a float[][] for features and float[] for labels.

So we need to set up four float arrays:

https://gist.github.com/67c82b91b54ee8052aeea4a6b8e40cd8

These LINQ expressions set up four arrays containing the feature and label data for the training and testing partitions.
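Those LINQ expressions presumably look something like this (the array names come from later in the text):

```csharp
// convert the enumerations into the arrays CNTK can train on
var training_data   = training.Select(v => v.GetFeatures()).ToArray(); // float[][]
var training_labels = training.Select(v => v.GetLabel()).ToArray();    // float[]
var testing_data    = testing.Select(v => v.GetFeatures()).ToArray();
var testing_labels  = testing.Select(v => v.GetLabel()).ToArray();
```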

Now we need to tell CNTK what shape the input data has that we'll train the neural network on, and what shape the output data of the neural network will have:

https://gist.github.com/5c186691bad0b045ffc877a1083570e8

We don't know in advance how many dimensions the FeaturizeText component will create, so we simply check the width of the training_data array.

The first Var method tells CNTK that our neural network will use a 1-dimensional tensor of nodeCount float values as input. This shape matches the array returned by the ProcessedData.GetFeatures method.

And the second Var method tells CNTK that we want our neural network to output a single float value. This shape matches the single value returned by the ProcessedData.GetLabel method.
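A sketch of those two declarations, assuming `Var` is a static helper in the CNTKUtil wrapper (the class name `NetUtil` is a guess, not from the original):

```csharp
using CNTK;

// the input dimension equals the width of one feature vector
var nodeCount = training_data[0].Length;

// 1-dimensional input tensor of nodeCount floats, single float output
var features = NetUtil.Var(new int[] { nodeCount }, DataType.Float);
var labels   = NetUtil.Var(new int[] { 1 }, DataType.Float);
```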

Our next step is to design the neural network.

We will use a deep neural network with a 16-node input layer, a 16-node hidden layer, and a single-node output layer. We'll use the ReLU activation function for the input and hidden layers, and Sigmoid activation for the output layer.

Remember: the sigmoid function forces the output to a range of 0..1 which means we can treat it as a binary classification probability. So we can turn any regression network into a binary classification network by simply adding the sigmoid activation function to the output layer.

Here's how to build this neural network:

https://gist.github.com/f9a1b8bbf52a9aede11c3b39a45fc12d

Each Dense call adds a new dense feedforward layer to the network. We're stacking two layers, both using ReLU activation, and then add a final layer with a single node using Sigmoid activation.

Then we use the ToSummary method to output a description of the architecture of the neural network to the console.
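Assuming CNTKUtil exposes Dense and ToSummary as fluent extension methods (the exact signatures are the wrapper's, not standard CNTK), the network construction is presumably along these lines:

```csharp
// 16 -> 16 -> 1 network: ReLU, ReLU, then Sigmoid on the output
var network = features
    .Dense(16, CNTKLib.ReLU)
    .Dense(16, CNTKLib.ReLU)
    .Dense(1, CNTKLib.Sigmoid)
    .ToNetwork();

// print the architecture to the console
Console.WriteLine(network.ToSummary());
```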

Now we need to decide which loss function to use to train the neural network, and how we are going to track the prediction error of the network during each training epoch.

For this assignment we'll use BinaryCrossEntropy as the loss function because it's the standard metric for measuring binary classification loss.

We'll track the error with the BinaryClassificationError metric. This is the number of times (expressed as a percentage) that the model predictions are wrong. An error of 0 means the predictions are correct all the time, and an error of 1 means the predictions are wrong all the time.

https://gist.github.com/656310928c1a899ab4b44d459f5abb94
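The gist isn't rendered, but setting up the loss and error functions presumably resembles this sketch (BinaryCrossEntropy is a real CNTK function; BinaryClassificationError is assumed to be a CNTKUtil helper, per the text):

```csharp
// loss: binary cross-entropy between prediction and label
var lossFunc = CNTKLib.BinaryCrossEntropy(network.Output, labels);

// error metric: fraction of wrong predictions (CNTKUtil helper, name assumed)
var errorFunc = NetUtil.BinaryClassificationError(network.Output, labels);
```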

Next we need to decide which algorithm to use to train the neural network. There are many possible algorithms derived from Gradient Descent that we can use here.

For this assignment we're going to use the AdamLearner. You can learn more about the Adam algorithm here: https://machinelearningmastery.com/adam...

https://gist.github.com/c1f1563f1d42b05695a295fbef72112b

These configuration values are a good starting point for many machine learning scenarios, but you can tweak them if you like to try and improve the quality of your predictions.
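Assuming the wrapper exposes a GetAdamLearner helper, the setup might look like this (the learning rate and momentum values are illustrative defaults, not taken from the original):

```csharp
// set up an Adam learner to train the network
var learner = network.GetAdamLearner(
    learningRateSchedule: (0.001, 1),   // assumption: typical Adam learning rate
    momentumSchedule: (0.9, 1));        // assumption: typical momentum value
```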

We're almost ready to train. Our final step is to set up a trainer and an evaluator for calculating the loss and the error during each training epoch:

https://gist.github.com/4291c166bace79255a784d03ba4573bc

The GetTrainer method sets up a trainer which will track the loss and the error for the training partition. And GetEvaluator will set up an evaluator that tracks the error in the test partition.
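A minimal sketch of that step, using the method names the text mentions (argument lists are assumptions about the CNTKUtil wrapper):

```csharp
// trainer tracks loss and error on the training partition,
// evaluator tracks the error on the test partition
var trainer   = network.GetTrainer(learner, lossFunc, errorFunc);
var evaluator = network.GetEvaluator(errorFunc);
```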

Now we're finally ready to start training the neural network!

Add the following code:

https://gist.github.com/5045457520a49f4f1599a75550378fe4

We're training the network for 10 epochs using a batch size of 64. During training we'll track the loss and errors in the loss, trainingError and testingError arrays.

Once training is done, we show the final testing error on the console. This is the percentage of mistakes the network makes when predicting spam messages.

Note that the error and the accuracy are related: accuracy = 1 - error. So we also report the final accuracy of the neural network.
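The surrounding epoch loop presumably has this skeleton (variable names taken from the text; the loop body is filled in by the training and testing code shown in the gists that follow):

```csharp
var maxEpochs = 10;
var batchSize = 64;

// track loss and errors per epoch
var loss = new double[maxEpochs];
var trainingError = new double[maxEpochs];
var testingError = new double[maxEpochs];

for (int epoch = 0; epoch < maxEpochs; epoch++)
{
    // training and testing code for one epoch goes here
}

// report the final error and the corresponding accuracy
Console.WriteLine($"Final test error: {testingError[maxEpochs - 1]:0.00}");
Console.WriteLine($"Final test accuracy: {1 - testingError[maxEpochs - 1]:0.00}");
```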

Here's the code to train the neural network. Put this inside the for loop:

https://gist.github.com/f06f055c39f963c1d987ff8a8285e27f

The Index().Shuffle().Batch() sequence randomizes the data and splits it up into a collection of 64-record batches. The second argument to Batch() is a function that will be called for every batch.

Inside the batch function we call GetBatch twice to get a feature batch and a corresponding label batch. Then we call TrainBatch to train the neural network on these two batches of training data.

The TrainBatch method returns the loss and error, but only for training on the 64-record batch. So we simply add up all these values and divide them by the number of batches in the dataset. That gives us the average loss and error for the predictions on the training partition during the current epoch, and we report this to the console.
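Put together, the training step described above presumably resembles this sketch (the Index/Shuffle/Batch/GetBatch/TrainBatch calls are the CNTKUtil wrapper's; their exact signatures are assumptions):

```csharp
// inside the epoch loop: train on shuffled 64-record batches
var batchCount = 0;
training_data.Index().Shuffle().Batch(batchSize, (indices, begin, end) =>
{
    // get a feature batch and the corresponding label batch
    var featureBatch = features.GetBatch(training_data, indices, begin, end);
    var labelBatch   = labels.GetBatch(training_labels, indices, begin, end);

    // train the network on this batch and accumulate loss and error
    var result = trainer.TrainBatch(
        new[] { (features, featureBatch), (labels, labelBatch) });
    loss[epoch] += result.Loss;
    trainingError[epoch] += result.Evaluation;
    batchCount++;
});

// average over all batches to get this epoch's training loss and error
loss[epoch] /= batchCount;
trainingError[epoch] /= batchCount;
Console.WriteLine($"Epoch {epoch}: loss {loss[epoch]:0.00}, error {trainingError[epoch]:0.00}");
```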

So now we know the training loss and error for one single training epoch. The next step is to test the network by making predictions about the data in the testing partition and calculate the testing error.

Put this code inside the epoch loop and right below the training code:

https://gist.github.com/6341d7f48081988feb2cc10455d7309f

We don't need to shuffle the data for testing, so now we can call Batch directly. Again we're calling GetBatch to get feature and label batches, but note that we're now providing the testing_data and testing_labels arrays.

We call TestBatch to test the neural network on the 64-record test batch. The method returns the error for the batch, and we again add up the errors for each batch and divide by the number of batches.

That gives us the average error in the neural network predictions on the test partition for this epoch.
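The testing step is presumably the same pattern without the shuffle (again, the Batch/GetBatch/TestBatch signatures are assumptions about the CNTKUtil wrapper):

```csharp
// test on un-shuffled 64-record batches from the testing partition
var testBatchCount = 0;
testing_data.Batch(batchSize, (data, begin, end) =>
{
    // note: now using the testing_data and testing_labels arrays
    var featureBatch = features.GetBatch(testing_data, begin, end);
    var labelBatch   = labels.GetBatch(testing_labels, begin, end);

    // accumulate the prediction error for this batch
    testingError[epoch] += evaluator.TestBatch(
        new[] { (features, featureBatch), (labels, labelBatch) });
    testBatchCount++;
});

// average error over all test batches for this epoch
testingError[epoch] /= testBatchCount;
```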

After training completes, the training and testing errors for each epoch will be available in the trainingError and testingError arrays. Let's use XPlot to create a nice plot of the two error curves so we can check for overfitting:

https://gist.github.com/45400f585d56575e4f302f8ee9e5d0ee

This code creates a Plot with two Scatter graphs. The first one plots the trainingError values and the second one plots the testingError values.

Finally we use File.WriteAllText to write the plot to disk as a HTML file.
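Using the standard XPlot.Plotly API, that plotting code presumably looks something like this sketch:

```csharp
using System.IO;
using System.Linq;
using XPlot.Plotly;

// one Scatter trace per error curve
var chart = Chart.Plot(new[]
{
    new Graph.Scatter
    {
        x = Enumerable.Range(0, maxEpochs).ToArray(),
        y = trainingError,
        name = "training error"
    },
    new Graph.Scatter
    {
        x = Enumerable.Range(0, maxEpochs).ToArray(),
        y = testingError,
        name = "testing error"
    }
});
chart.WithTitle("Training and testing error per epoch");

// write the plot to disk as an HTML file
File.WriteAllText("chart.html", chart.GetHtml());
```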

We're now ready to build the app, so this is a good moment to save your work ;)

Go to the CNTKUtil folder and type the following:

https://gist.github.com/e14a4e9ecf1b5fb436b3069e7af7dd37

This will build the CNTKUtil project. Note how we're specifying the x64 platform because the CNTK library requires a 64-bit build.
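The build command presumably specifies the 64-bit platform explicitly, something like this (the output path and target framework are illustrative):

```shell
# build for the x64 platform, as required by CNTK
dotnet build -o bin/Debug/netcoreapp3.0 -p:Platform=x64
```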

Now go to your console project folder and type:

https://gist.github.com/13b74ce2ebe2ad596f326ee969b79733

This will build your app. Note how we're again specifying the x64 platform.

Now run the app:

https://gist.github.com/15947eed60dc97d48d0228e79fc34dc6

The app will create the neural network, load the dataset, train the network on the data, and create a plot of the training and testing errors for each epoch.

The plot is written to disk in a new file called chart.html. Open the file now and take a look at the training and testing curves.

What are your final classification errors on training and testing? What is the final testing accuracy? And what do the curves look like? Is the neural network overfitting?

Do you think this model is good at predicting spam?

Try to improve the neural network by changing the network architecture. You can add more nodes or extra layers. You can also change the number of epochs, the batch size, or the learner parameters.

Did the changes help? Is the network overfitting? What is the best accuracy you can achieve?

Post your results in our support group.
