Rspamd scans each email and produces a list of symbols that matched it (such as MISSING_SUBJECT or SPF_FAIL). Each symbol has a score, and an email's total score is the sum of the scores of its symbols. This total determines the action taken on the email. Until now, we have set symbol scores manually. This project aims to generate an optimal set of symbol scores using neural networks, in order to improve email classification accuracy.
Any publicly available spam and ham corpus can be used as a dataset for the rescoring module. However, the best results are produced when users build a dataset from their own mail, since the generated scores are then tailored to their inbox.
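To make the scoring rule concrete, here is a minimal sketch in Python. The symbol scores, the DKIM_VALID entry, the action names and the thresholds are illustrative assumptions, not values taken from any rspamd configuration.

```python
# Hypothetical symbol scores; real values live in the rspamd configuration.
SYMBOL_SCORES = {"MISSING_SUBJECT": 2.0, "SPF_FAIL": 1.0, "DKIM_VALID": -0.5}

# Hypothetical (minimum total score, action) pairs, highest threshold first.
ACTION_THRESHOLDS = [(15.0, "reject"), (6.0, "add header"), (4.0, "greylist")]


def total_score(symbols, scores=SYMBOL_SCORES):
    """Sum the scores of the symbols attached to one email."""
    # Symbols without a configured score contribute nothing.
    return sum(scores.get(name, 0.0) for name in symbols)


def choose_action(score, thresholds=ACTION_THRESHOLDS):
    """Return the action of the highest threshold the score reaches."""
    for minimum, action in thresholds:
        if score >= minimum:
            return action
    return "no action"


if __name__ == "__main__":
    score = total_score(["MISSING_SUBJECT", "SPF_FAIL"])
    print(score, choose_action(score))
```

Because the total is a plain sum, changing a single symbol's score shifts every email carrying that symbol by the same amount, which is what makes the scores learnable as linear weights.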
We use a perceptron to find optimal symbol scores, because it precisely models the way rspamd determines an email's score (a linear transformation). A sigmoid transfer function maps the perceptron's output to a value between 0 and 1, which determines the email's class. We then use stochastic gradient descent to learn the optimal parameters. tanh or ReLU can also be used as the transfer function.
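The following is a minimal sketch of that idea in plain Python, not the code in rescore.lua. It assumes each email is represented by the set of symbols it triggered, uses log loss as the training objective, and adds a bias term; the function names, the learning rate and the toy data are all hypothetical.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train(samples, symbols, epochs=200, lr=0.1, seed=0):
    """Learn one weight per symbol with stochastic gradient descent.

    samples: list of (set_of_symbol_names, label), label 1 = spam, 0 = ham.
    """
    rng = random.Random(seed)
    weights = {name: 0.0 for name in symbols}
    bias = 0.0  # assumption: a bias term, not part of rspamd's symbol scores
    data = list(samples)
    for _ in range(epochs):
        rng.shuffle(data)  # visit samples in random order each epoch
        for present, label in data:
            z = bias + sum(weights[s] for s in present if s in weights)
            error = sigmoid(z) - label  # gradient of log loss w.r.t. z
            for s in present:
                if s in weights:
                    weights[s] -= lr * error  # each present symbol has input 1
            bias -= lr * error
    return weights, bias


def predict(present, weights, bias):
    """Return the estimated probability that an email is spam."""
    return sigmoid(bias + sum(weights.get(s, 0.0) for s in present))


if __name__ == "__main__":
    toy = [
        ({"SPF_FAIL", "MISSING_SUBJECT"}, 1),
        ({"SPF_FAIL"}, 1),
        ({"DKIM_VALID"}, 0),
        (set(), 0),
    ]
    w, b = train(toy, ["SPF_FAIL", "MISSING_SUBJECT", "DKIM_VALID"])
    print(w, b)
```

The learned weights play the role of symbol scores: symbols that appear mostly in spam end up positive, and symbols that appear mostly in ham end up negative.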
In rescore.lua, the -o option generates the new score set in JSON format. You can also view the difference between the old and new scores using the --diff switch. Use the -h switch in rescore.lua to learn more about the options.
corpus_test.lua can be used to generate log files from your corpus. These logs can be used for statistics and for rescoring.
statistics.lua can be used to generate statistics such as overall accuracy, false positive rate and false negative rate. It also generates per-symbol statistics that can be used to test new symbols and configurations.
Find out more about these scripts inside the repository or by using the -h switch.
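As an illustration of these metrics, the sketch below computes them from pairs of actual and predicted classes. It uses the conventional definitions (false positive rate as the share of ham flagged as spam, false negative rate as the share of spam that was missed), which are an assumption here and may differ from what statistics.lua reports; the function name and input format are hypothetical.

```python
def classification_stats(results):
    """Compute accuracy and error rates.

    results: iterable of (actual_is_spam, predicted_is_spam) booleans.
    """
    tp = fp = tn = fn = 0
    for actual, predicted in results:
        if actual and predicted:
            tp += 1  # spam caught
        elif actual and not predicted:
            fn += 1  # spam missed
        elif predicted:
            fp += 1  # ham flagged as spam
        else:
            tn += 1  # ham passed
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "false_positive_rate": fp / (fp + tn) if (fp + tn) else 0.0,
        "false_negative_rate": fn / (fn + tp) if (fn + tp) else 0.0,
    }


if __name__ == "__main__":
    # 3 spam emails (2 caught) and 5 ham emails (1 wrongly flagged)
    sample = [(True, True)] * 2 + [(True, False)] + [(False, False)] * 4 + [(False, True)]
    print(classification_stats(sample))
```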
An initial prototype was created in Python. The final version was then rewritten in Lua + Torch.
- Scripts for automatic nightly rescoring need to be written.
- Composite symbols need to be tried.
- Integration with rspamadm.
I would like to thank Andrew Lewis, Steve Freegard and Vsevolod Stakhov for being very helpful throughout. This project wouldn't have been possible without their help.