cpragadeesh/GSoC_2017_work.md

## GSoC_2017_work.md

      
# Corpus testing and Automatic Symbol score generation

Link to repository
## Introduction

rspamd scans each email and produces a list of symbols that apply to it (such as MISSING_SUBJECT or SPF_FAIL). Each symbol has a score, and an email's total score is the sum of the scores of its symbols. This total determines the action taken on the email. Until now, we have set symbol scores manually. This project aims to generate an optimal set of symbol scores using neural networks, in order to improve email classification accuracy.
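
The scoring rule described above can be sketched in a few lines of Python. This is only an illustration: the score values, the `EXAMPLE_BULK` symbol, the threshold numbers, and the helper names `total_score` and `action_for` are all assumptions made for this example, not values or functions taken from rspamd.

```python
# Hypothetical symbol scores; real values come from the rspamd configuration.
SYMBOL_SCORES = {
    "MISSING_SUBJECT": 2.0,
    "SPF_FAIL": 1.0,
    "EXAMPLE_BULK": 4.5,  # made-up symbol, used only in this sketch
}

# Hypothetical action thresholds, ordered from highest to lowest.
ACTION_THRESHOLDS = [
    ("reject", 15.0),
    ("add header", 6.0),
    ("greylist", 4.0),
]


def total_score(symbols, scores=SYMBOL_SCORES):
    """Sum the scores of all symbols attached to one email."""
    return sum(scores.get(name, 0.0) for name in symbols)


def action_for(score, thresholds=ACTION_THRESHOLDS):
    """Return the first action whose threshold the score reaches."""
    for action, limit in thresholds:
        if score >= limit:
            return action
    return "no action"


if __name__ == "__main__":
    symbols = ["MISSING_SUBJECT", "SPF_FAIL", "EXAMPLE_BULK"]
    score = total_score(symbols)
    print(score, action_for(score))
```

Because the total is a plain weighted sum, changing a single symbol's score shifts every email that carries that symbol, which is why choosing the scores well matters for classification accuracy.
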
## Project

### Dataset

Any publicly available spam and ham corpus can be used as a dataset for the rescoring module. However, the best results come from a dataset that users build themselves, since the generated scores are then tailored to their own inbox.
### Model

We use a perceptron to find optimal symbol weights. A perceptron is a natural fit because it models precisely how rspamd computes an email's score: a linear combination of symbol scores. A sigmoid transfer function maps the perceptron's output to a value between 0 and 1, which determines the email's class. Stochastic gradient descent is then used to learn the optimal parameters. tanh or ReLU can also be used as the transfer function. Use the -h switch of rescore.lua to learn more about the options.
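
The idea can be illustrated with a small, dependency-free Python sketch. It is not the code of rescore.lua (which is written in Lua with Torch). The log-loss objective, the learning rate, the epoch count, the toy data, and the names `train` and `predict` are all assumptions made for this example. Each feature is 1 if the corresponding symbol fired for an email and 0 otherwise, so the learned weights play the role of symbol scores.

```python
import math
import random


def sigmoid(z):
    """Map a raw score to a value between 0 and 1 (numerically stable form)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train(samples, n_symbols, epochs=200, lr=0.5, seed=0):
    """Learn one weight per symbol with SGD on the log loss (an assumption).

    samples: list of (features, label); features[i] is 1 if symbol i fired,
    label is 1 for spam and 0 for ham.
    """
    rng = random.Random(seed)
    weights = [0.0] * n_symbols
    bias = 0.0
    samples = list(samples)  # copy so the caller's list is not shuffled
    for _ in range(epochs):
        rng.shuffle(samples)
        for features, label in samples:
            z = bias + sum(w * x for w, x in zip(weights, features))
            error = sigmoid(z) - label  # gradient of the log loss w.r.t. z
            for i, x in enumerate(features):
                weights[i] -= lr * error * x
            bias -= lr * error
    return weights, bias


def predict(weights, bias, features):
    """Classify one email: 1 for spam, 0 for ham."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1 if sigmoid(z) >= 0.5 else 0


# Toy data with three hypothetical symbols: two spam-like, one ham-like.
TOY_DATA = [
    ([1, 1, 0], 1),
    ([1, 0, 0], 1),
    ([0, 1, 0], 1),
    ([0, 0, 1], 0),
    ([0, 0, 0], 0),
]

if __name__ == "__main__":
    weights, bias = train(TOY_DATA, n_symbols=3)
    print("weights:", weights, "bias:", bias)
```

On this toy data the two spam-like symbols end up with positive weights and the ham-like symbol with a negative one, mirroring how positive and negative symbol scores push an email toward or away from a spam verdict.
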
### Output

In rescore.lua, the -o option writes the new score set in JSON format. The --diff switch shows the differences between the old and new scores. Use the -h switch of rescore.lua to learn more about the options.
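
To make the two outputs concrete, here is a hedged Python sketch of writing a score set as JSON and comparing it with a previous one. The flat symbol-to-score layout is an assumption; the actual file written by rescore.lua may be structured differently. The function names `scores_to_json` and `diff_scores` are hypothetical as well.

```python
import json


def scores_to_json(scores):
    """Serialize a symbol -> score mapping (assumed layout) as JSON text."""
    return json.dumps(scores, indent=2, sort_keys=True)


def diff_scores(old, new, tolerance=1e-9):
    """Return (symbol, old_score, new_score) for every symbol that changed.

    A symbol present in only one of the two sets is reported with None
    on the side where it is missing.
    """
    changes = []
    for symbol in sorted(set(old) | set(new)):
        before = old.get(symbol)
        after = new.get(symbol)
        if before is None or after is None or abs(before - after) > tolerance:
            changes.append((symbol, before, after))
    return changes


if __name__ == "__main__":
    # Made-up score values, for illustration only.
    old = {"MISSING_SUBJECT": 2.0, "SPF_FAIL": 1.0}
    new = {"MISSING_SUBJECT": 2.0, "SPF_FAIL": 1.75}
    print(scores_to_json(new))
    for symbol, before, after in diff_scores(old, new):
        print(f"{symbol}: {before} -> {after}")
```
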
### Corpus testing and Statistics generation

corpus_test.lua generates log files from your corpus. These logs can be used for statistics and for rescoring.
statistics.lua generates statistics such as overall accuracy, false positive rate, and false negative rate. It also generates per-symbol statistics that can be used to test new symbols and configurations.
More information about these scripts is available in the repository or through their -h switch.
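
As an illustration of the overall statistics mentioned above, the following Python sketch computes them from a list of labelled results. The input layout (pairs of a ground-truth flag and a score), the spam threshold, and the function name `classification_stats` are assumptions for this example. The log format produced by corpus_test.lua and the exact definitions used by statistics.lua may differ.

```python
def classification_stats(results, spam_threshold):
    """Compute accuracy, false positive rate, and false negative rate.

    results: iterable of (is_spam, score) pairs, where is_spam is the
    ground-truth label and score is the email's total score.
    An email is predicted to be spam when score >= spam_threshold.
    """
    tp = fp = tn = fn = 0
    for is_spam, score in results:
        predicted_spam = score >= spam_threshold
        if is_spam and predicted_spam:
            tp += 1
        elif is_spam:
            fn += 1  # spam that slipped through
        elif predicted_spam:
            fp += 1  # ham wrongly flagged as spam
        else:
            tn += 1
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "false_positive_rate": fp / (fp + tn) if (fp + tn) else 0.0,
        "false_negative_rate": fn / (fn + tp) if (fn + tp) else 0.0,
    }


if __name__ == "__main__":
    # Made-up results: (is_spam, score)
    sample = [(True, 20.0), (True, 3.0), (False, 1.0), (False, 16.0), (False, 0.0)]
    print(classification_stats(sample, spam_threshold=15.0))
```

Here the false positive rate is taken over all ham messages and the false negative rate over all spam messages, which is one common convention.
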
## Code

An initial prototype was written in Python. The final version was then rewritten in Lua with Torch.
### Repository

Link
### Significant Commits

#### Python prototype commits

- Corpus testing
- Statistics Generator
- Multi-threaded Corpus testing
- Rescoring Script

#### Lua commits

- Corpus testing
- Statistics Generation
- Rescoring module
- Final Version + Docs

## Future work

- Scripts for automatic nightly rescoring need to be written.
- Composite symbols should be given a try.
- Integration with rspamadm.

## Mentors

I would like to thank Andrew Lewis, Steve Freegard and Vsevolod Stakhov for being very helpful throughout. This project wouldn't have been possible without their help.