@cpragadeesh
Last active August 28, 2017 14:37
Google Summer of Code 2017 Rspamd symbol re-scoring project.

Corpus testing and Automatic Symbol score generation

Link to repository

Introduction

Rspamd scans each email and produces a list of symbols associated with it (such as MISSING_SUBJECT, SPF_FAIL). Each symbol has a score. An email's score is the sum of the scores of its symbols, and this total determines the action taken on the email. Until now, we have set symbol scores manually. This project aims to generate an optimal set of symbol scores, using neural networks, to improve email classification accuracy.
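
The scoring rule described above can be sketched in a few lines of Python. This is only an illustration, not rspamd's implementation: the score values, the action names, the thresholds and the EXAMPLE_HAM_HINT symbol are all made up for the example.

```python
# Hypothetical symbol scores. MISSING_SUBJECT and SPF_FAIL are real symbol
# names from the text, but the numbers and EXAMPLE_HAM_HINT are invented.
SYMBOL_SCORES = {
    "MISSING_SUBJECT": 2.0,
    "SPF_FAIL": 1.0,
    "EXAMPLE_HAM_HINT": -0.5,
}

# Hypothetical (action, minimum score) pairs, highest threshold first.
ACTION_THRESHOLDS = [
    ("reject", 15.0),
    ("add header", 6.0),
    ("greylist", 4.0),
]


def email_score(symbols, scores=SYMBOL_SCORES):
    """Sum the scores of the symbols found in an email (unknown symbols count as 0)."""
    return sum(scores.get(name, 0.0) for name in symbols)


def action_for(score, thresholds=ACTION_THRESHOLDS):
    """Return the first action whose threshold the score reaches."""
    for action, minimum in thresholds:
        if score >= minimum:
            return action
    return "no action"


total = email_score(["MISSING_SUBJECT", "SPF_FAIL"])
print(total, action_for(total))
```

Because the total is a plain sum, changing a single symbol's score shifts the total of every email that carries that symbol, which is why the scores are worth optimising as a set.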

Project

Dataset

Any publicly available spam and ham corpus can be used as a dataset for the rescoring module. However, the best results come from a dataset that users build themselves, since the generated scores are then tailored to their own inbox.

Model

We use a perceptron to rescore symbols. A perceptron fits because it models precisely how rspamd computes an email's score: a linear transformation. A sigmoid transfer function maps the perceptron's output to a value between 0 and 1, which determines the email's class. Stochastic gradient descent is then used to learn the optimal parameters. tanh or ReLU can also be used as the transfer function. Use the -h switch in rescore.lua to learn more about the options.
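
As a rough illustration of the model, here is a minimal pure-Python sketch of a perceptron with a sigmoid output trained by stochastic gradient descent. It is not the code in rescore.lua: the function names, the bias term, the cross-entropy loss, the hyperparameters and the EXAMPLE_HAM_HINT symbol are assumptions made for this example.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function, mapping any real value into (0, 1)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_scores(samples, symbols, epochs=200, lr=0.1, seed=0):
    """Learn one weight per symbol with SGD.

    samples: list of (set_of_symbol_names, label), label 1 = spam, 0 = ham.
    Assumes a cross-entropy loss, so the gradient w.r.t. the linear
    output z is simply sigmoid(z) - label.
    """
    rng = random.Random(seed)
    weights = {name: 0.0 for name in symbols}
    bias = 0.0  # assumed extra parameter, playing the role of a threshold
    data = list(samples)
    for _ in range(epochs):
        rng.shuffle(data)  # "stochastic": one randomly ordered sample per update
        for present, label in data:
            z = bias + sum(weights[s] for s in present if s in weights)
            error = sigmoid(z) - label
            for s in present:
                if s in weights:
                    weights[s] -= lr * error  # input is 1 when the symbol fired
            bias -= lr * error
    return weights, bias


# Toy corpus; EXAMPLE_HAM_HINT is a made-up symbol name.
toy = [
    ({"MISSING_SUBJECT", "SPF_FAIL"}, 1),
    ({"SPF_FAIL"}, 1),
    ({"EXAMPLE_HAM_HINT"}, 0),
    (set(), 0),
]
weights, bias = train_scores(toy, ["MISSING_SUBJECT", "SPF_FAIL", "EXAMPLE_HAM_HINT"])
print(weights, bias)
```

The learned weights play the role of the new symbol scores: symbols seen mostly in spam drift positive, and symbols seen mostly in ham drift negative.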

Output

In rescore.lua, the -o option generates the new score set in JSON format. The --diff switch shows a diff of the old and new scores. Use the -h switch in rescore.lua to learn more about the options.
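
To show what comparing two score sets involves, the sketch below diffs an old and a new set in Python. The JSON layout used here (a flat object mapping symbol name to score) is an assumption, since the actual format written by -o is not described in this document; the score values are invented too.

```python
import json

# Hypothetical score sets, assuming a flat {"SYMBOL": score} JSON layout.
OLD_JSON = '{"MISSING_SUBJECT": 2.0, "SPF_FAIL": 1.0}'
NEW_JSON = '{"MISSING_SUBJECT": 2.5, "SPF_FAIL": 1.0, "EXAMPLE_NEW_SYMBOL": 0.5}'


def score_diff(old, new):
    """Return (symbol, old_score, new_score) for every symbol whose score changed.

    A score of None means the symbol is missing from that set.
    """
    changes = []
    for name in sorted(set(old) | set(new)):
        before, after = old.get(name), new.get(name)
        if before != after:
            changes.append((name, before, after))
    return changes


for name, before, after in score_diff(json.loads(OLD_JSON), json.loads(NEW_JSON)):
    print(f"{name}: {before} -> {after}")
```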

Corpus testing and Statistics generation

corpus_test.lua generates log files from your corpus. These logs can be used for statistics and for rescoring.

statistics.lua generates statistics such as overall accuracy, false positive rate and false negative rate. It also generates symbol-wise statistics that can be used to test new symbols and configurations.
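
For reference, the overall statistics named above can be computed as in the following Python sketch. It is not taken from statistics.lua: the input shape (a list of score and label pairs), the spam threshold and the exact rate definitions (false positives over all ham, false negatives over all spam) are assumptions.

```python
def classification_stats(results, threshold):
    """Compute accuracy, false positive rate and false negative rate.

    results: iterable of (score, is_spam) pairs.
    An email is predicted as spam when its score reaches the threshold.
    """
    tp = fp = tn = fn = 0
    for score, is_spam in results:
        predicted_spam = score >= threshold
        if predicted_spam and is_spam:
            tp += 1
        elif predicted_spam and not is_spam:
            fp += 1  # ham flagged as spam
        elif not predicted_spam and is_spam:
            fn += 1  # spam that slipped through
        else:
            tn += 1
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "false_positive_rate": fp / (fp + tn) if (fp + tn) else 0.0,
        "false_negative_rate": fn / (fn + tp) if (fn + tp) else 0.0,
    }


# Made-up scores: two spam emails and two ham emails.
sample = [(10.0, True), (2.0, True), (1.0, False), (8.0, False)]
print(classification_stats(sample, threshold=6.0))
```

Running the same computation before and after rescoring gives a direct way to check whether a new score set improves classification on a given corpus.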

Find out more about these scripts inside the repository or by using the -h switch.

Code

An initial prototype was created in Python. The final version was then rewritten in Lua + Torch.

Repository

Link

Significant Commits

Python prototype commits
lua commits

Future work

  • Scripts for automatic nightly rescoring need to be written.
  • Composite symbols need to be given a try.
  • Integration with rspamadm.

Mentors

I would like to thank Andrew Lewis, Steve Freegard and Vsevolod Stakhov for being very helpful throughout. This project wouldn't have been possible without their help.
