I ran a preliminary training pass on the data I had collected three years ago in a private Alpha version of Waegis. The data set consists of 26,991 items: 3,780 ham and 23,211 spam.
Using *** spam rules with an initial, sensible configuration based on my experience, I got good results: an overall accuracy of 92.75%, with 88.63% accuracy on ham items (the false-positive side) and 93.42% accuracy on spam items (the false-negative side).
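As a quick sanity check on these numbers, the overall accuracy should be the weighted average of the two per-class figures. The sketch below assumes that 88.63% and 93.42% are the accuracies on ham and spam respectively (my reading of the figures, not something the data itself states); the function name is made up for illustration.

```python
def overall_accuracy(ham_count, spam_count, ham_accuracy, spam_accuracy):
    """Weighted average of per-class accuracies (hypothetical helper)."""
    # Number of correctly classified items in each class.
    correct = ham_count * ham_accuracy + spam_count * spam_accuracy
    return correct / (ham_count + spam_count)


# Counts and percentages quoted above; the per-class interpretation
# of the two percentages is an assumption.
acc = overall_accuracy(3780, 23211, 0.8863, 0.9342)
print(f"{acc:.2%}")
```

Under that assumption the three figures are consistent with each other: because spam makes up about 86% of the items, the overall accuracy sits much closer to the spam figure than to the ham figure.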
I stored the data in a single-table database, which I've committed to a new Git repository named ***. The table has several columns, including the content of each comment, trackback/pingback, or forum post, as well as a SpamScore column that holds the overall spam score Waegis calculated for the item. It also has *** columns named ***, ..., and *** that hold the score each rule assigned to the item.
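To make the table layout more concrete, here is a minimal sketch of how per-class accuracy could be computed from such a table. Only the SpamScore column name comes from the description above. The table name `Items`, the `Content` and `IsSpam` columns, the 0 to 1 score scale, the 0.5 threshold, and the sample rows are all assumptions invented for illustration; they are not the real schema or data.

```python
import sqlite3

THRESHOLD = 0.5  # hypothetical cut-off; the real decision rule is not shown here


def per_class_accuracy(conn, threshold):
    """Return (ham accuracy, spam accuracy, overall accuracy)."""
    rows = conn.execute("SELECT IsSpam, SpamScore FROM Items").fetchall()
    totals = {0: 0, 1: 0}   # items per class (0 = ham, 1 = spam)
    correct = {0: 0, 1: 0}  # correctly classified items per class
    for is_spam, score in rows:
        predicted = 1 if score >= threshold else 0
        totals[is_spam] += 1
        if predicted == is_spam:
            correct[is_spam] += 1
    ham_acc = correct[0] / totals[0]
    spam_acc = correct[1] / totals[1]
    overall = (correct[0] + correct[1]) / len(rows)
    return ham_acc, spam_acc, overall


# In-memory stand-in for the real database, with made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Items (Content TEXT, IsSpam INTEGER, SpamScore REAL)")
conn.executemany(
    "INSERT INTO Items VALUES (?, ?, ?)",
    [
        ("nice post, thanks", 0, 0.10),
        ("see my blog", 0, 0.70),   # ham scored too high: a false positive
        ("cheap pills", 1, 0.95),
        ("win money now", 1, 0.80),
        ("hello friend", 1, 0.30),  # spam scored too low: a false negative
        ("buy followers", 1, 0.90),
    ],
)

ham_acc, spam_acc, overall = per_class_accuracy(conn, THRESHOLD)
print(f"ham: {ham_acc:.2%}, spam: {spam_acc:.2%}, overall: {overall:.2%}")
```

Keeping the per-rule scores next to the overall score in the same row means the same kind of query could be repeated for a single rule's column to see how much that rule contributes on its own.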
One of the rules plays no role here, since it is designed to track trends in incoming data online, which doesn't happen here. Another rule also had a low effect (surprisingly) beca