keyvan/Waegis Development Conversation on Best Calculation Method

## Waegis Development Conversation on Best Calculation Method
Hello,

I ran a preliminary training on the data that I collected three years ago in a private Alpha version of Waegis. It consists of 26,991 items: 3,780 ham and 23,211 spam.

Using *** spam rules with an initial, logical configuration based on my experience, I got good results: an overall accuracy of 92.75%, with 88.63% of ham items classified correctly (an 11.37% false-positive rate) and 93.42% of spam items classified correctly (a 6.58% false-negative rate).
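As a quick sanity check on these figures, here is a minimal sketch that combines the two per-class percentages into the overall number, weighted by class size. It assumes the two percentages are the share of ham and the share of spam classified correctly; the function and constant names are only illustrative.

```python
# Item counts quoted in the message above.
HAM_COUNT = 3780
SPAM_COUNT = 23211

# Per-class figures quoted above, read here as per-class accuracy
# (an assumption about what the two percentages mean).
HAM_ACCURACY = 0.8863
SPAM_ACCURACY = 0.9342


def overall_accuracy(ham_count, spam_count, ham_accuracy, spam_accuracy):
    """Combine per-class accuracies into one figure, weighted by class size."""
    correct = ham_count * ham_accuracy + spam_count * spam_accuracy
    return correct / (ham_count + spam_count)


if __name__ == "__main__":
    accuracy = overall_accuracy(HAM_COUNT, SPAM_COUNT, HAM_ACCURACY, SPAM_ACCURACY)
    print(f"overall accuracy: {accuracy:.2%}")
```

Under that reading the weighted result rounds to the quoted 92.75%, so the three numbers are consistent with each other.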

I stored the data in a database with a single table, which I've committed to a new Git repository named ***. The table has several columns, including the contents of the comment, trackback/pingback, or forum post, as well as a SpamScore column holding the overall spam score that Waegis calculated for each item. It also has *** columns named ***, ..., and *** that hold the score each rule assigned to each item.

One of the rules plays no role here, since it is designed to track trends in incoming live data, which this offline data set does not have. Another rule had surprisingly little effect because spammers were smart enough to get past it. Time has passed and I suppose they're smarter now, so I'm not sure yet whether I want to keep those rules. Essentially, these performance numbers come from a subset of the rules mentioned in the list. With better configuration, a live environment, and new rules (particularly an algorithmic one that we don't have now), I'm hopeful of reaching an overall accuracy above 96-97%, which is something academics only dream about at night with their small prototypes, even though the current accuracy is already much higher than what they achieve in their papers!

As for the data set, here is some information:

The current calculation engine computes the scores from all the rules for an item and then takes the average of the top *** largest ones. If the result is greater than ***, the item is marked as spam. Therefore, if the SpamScore column holds a number strictly greater than *** (not equal to it), the item is marked as spam under the current Waegis settings; otherwise, it is ham (legitimate content).
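The calculation described above can be sketched roughly as follows. This is not the actual Waegis engine: `TOP_K` and `SPAM_THRESHOLD` are hypothetical placeholders, because the real values are elided in this message, and the function names are made up for illustration.

```python
from typing import Sequence

# Hypothetical values: the real top-k size and threshold are elided ("***")
# in the message, so these are placeholders for illustration only.
TOP_K = 3
SPAM_THRESHOLD = 50.0


def spam_score(rule_scores: Sequence[float], top_k: int = TOP_K) -> float:
    """Average the top_k largest rule scores (a truncated arithmetic mean)."""
    if not rule_scores:
        return 0.0
    largest = sorted(rule_scores, reverse=True)[:top_k]
    return sum(largest) / len(largest)


def is_spam(rule_scores: Sequence[float],
            top_k: int = TOP_K,
            threshold: float = SPAM_THRESHOLD) -> bool:
    """Mark as spam only when the score is strictly greater than the threshold."""
    return spam_score(rule_scores, top_k) > threshold
```

For example, with these placeholder settings, `spam_score([10, 90, 40, 80, 5])` averages 90, 80, and 40, and an item scoring exactly 50 would still count as ham because the comparison is strict.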

The IsSpam column holds a Boolean value giving the actual nature of the content, which we are sure about. If it is true, the item is spam; otherwise it is ham. You can run the following queries to see how accurate the current version of Waegis is:

SELECT COUNT([ID]) as TotalHams
  FROM [dbo].[TrainingContents]
  WHERE IsSpam = 0

SELECT COUNT([ID]) as FailedHams
  FROM [dbo].[TrainingContents]
  WHERE IsSpam = 0 AND SpamScore > ***

SELECT COUNT([ID]) as TotalSpams
  FROM [dbo].[TrainingContents]
  WHERE IsSpam = 1

SELECT COUNT([ID]) as FailedSpams
  FROM [dbo].[TrainingContents]
  WHERE IsSpam = 1 AND SpamScore <= ***
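The four counts returned by these queries can be turned into accuracy figures. The sketch below mirrors the same logic in Python over `(is_spam, spam_score)` pairs, as one might do after exporting the IsSpam and SpamScore columns. The `AccuracyReport` type, the function name, and the threshold argument are illustrative assumptions, and the real threshold remains elided.

```python
from typing import Iterable, NamedTuple, Tuple


class AccuracyReport(NamedTuple):
    ham_accuracy: float
    spam_accuracy: float
    overall_accuracy: float


def _ratio(numerator: int, denominator: int) -> float:
    """Return numerator / denominator, or 0.0 when there is nothing to count."""
    return numerator / denominator if denominator else 0.0


def accuracy_report(rows: Iterable[Tuple[bool, float]],
                    threshold: float) -> AccuracyReport:
    """rows holds (is_spam, spam_score) pairs, mirroring IsSpam and SpamScore."""
    total_hams = failed_hams = total_spams = failed_spams = 0
    for is_spam, score in rows:
        if is_spam:
            total_spams += 1
            if score <= threshold:  # spam that slipped through (FailedSpams)
                failed_spams += 1
        else:
            total_hams += 1
            if score > threshold:  # ham wrongly flagged (FailedHams)
                failed_hams += 1
    correct = (total_hams - failed_hams) + (total_spams - failed_spams)
    return AccuracyReport(
        ham_accuracy=_ratio(total_hams - failed_hams, total_hams),
        spam_accuracy=_ratio(total_spams - failed_spams, total_spams),
        overall_accuracy=_ratio(correct, total_hams + total_spams),
    )
```

Keeping the ham and spam figures separate matters here because the data set is heavily skewed toward spam, so the overall figure alone is dominated by the spam class.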

In the coming days I'm going to apply better configurations to the current rules and use some more recent data sets to train and test the system (after I finish some work on campus that piled up last week while I was sick). In the meantime, it would be great to know the best way, in terms of both accuracy and efficiency, to perform the calculations, and to get answers to the following questions:

    * Is it better to use a scale of 0 to 100, or to assign an unbounded integer score whose magnitude represents the spamminess of the item?
    * Which method of final calculation is better: an average of the top few scores, a neural network (NN), or k-nearest neighbors (KNN)?
    * If we're going to take an average, which type is better: arithmetic (plain or weighted), geometric, or harmonic? At the moment I'm using a truncated, unweighted arithmetic average.
    * Do you have any other suggestions that work better than these proposals, especially for an online system?

Final note: replace any use of the word "spams" in the above text with "spam" or "spam content", since the plural of spam is not spams, but it's hard to always remember that ;-)