AWegnerGitHub/Spam Analaysis Engine.md Secret

## Spam Analaysis Engine.md

      
Spam Detection Engine

Background

Once upon a time, there was a baby SmokeDetector. It had a few regular expressions, a small group of people sitting behind it and an entire network of Stack Exchange sites to watch and monitor. Over time, the human spam validators expanded the regular expressions to catch more spam and be more accurate. Stack Exchange flourished in a spam-free world. But the validators expanded the system further. They had the system flag the spam automatically, and spam died even more quickly. The validators enjoyed their victory, yet were still unsettled to see spam attempt to take root. So they turned their gaze to the internet at large. Where others had cowered at the vast scope of spam and garbage, these brave spam fighters (and a mature SmokeDetector) took it as a challenge and made their first move...
That move was to reach out to the platforms that host the spam and explain how it violates their own Terms of Service. Some of the platforms have been helpful and removed it; others haven't done much at all. Medium, however, asked if there was a way to integrate SmokeDetector into their platform in some form. Initial tests against our existing blacklists were promising. There were, however, differences in how Stack Exchange handles and classifies spam and how Medium handles and classifies spam. These differences, between just two systems, indicate that a number of things will need to be customizable by platform to handle this correctly.
This proposal is a radical departure from SmokeDetector/metasmoke's existing infrastructure, and that is important to keep in mind from this point on. If we are truly committed to building a system that can help any other platform classify spam, we'll need a lot of changes. On top of that, there are some important nice-to-haves that will make life much better. This proposal covers the high-level architecture of the entire system. Each of the three proposed subsystems would need a detailed proposal as well (especially the Analysis Engine).
Proposal

The basic structure of our system is: get text, analyze text/report, do something based on the report. The most important bit, in my opinion, is that middle step - analyze text/report. That is the product that other platforms want to use. My proposal is to break these three parts out into three distinct systems. This allows us to logically separate the various components.
Get text

In this new system, the "Get Text" layer's job is to gather all the information we need to analyze from the source system, transform it into the expected object(s) and pass it to the Analysis Engine. Information that could be gathered at this level includes the text, username, email address, IP address, user agent, referrer, other HTTP headers, etc. Obviously not all of this will be provided, and other platforms may not want to share it. However, some existing spam platforms do use this information to determine if a post is legitimate. Adding new attributes to retrieve should be simple.
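As a rough illustration of "transform it into the expected object(s)", the sketch below normalizes a source-specific payload into one platform-neutral object. Everything named here (`Submission`, `to_submission`, `EXAMPLE_FIELD_MAP`, and the payload keys) is a hypothetical placeholder, not an existing SmokeDetector or metasmoke interface.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass(frozen=True)
class Submission:
    """Platform-neutral object handed to the Analysis Engine (hypothetical shape)."""
    platform: str
    text: str
    username: Optional[str] = None
    email: Optional[str] = None
    ip_address: Optional[str] = None
    user_agent: Optional[str] = None
    referrer: Optional[str] = None
    # Anything the source sent that is not modelled yet is kept, not dropped.
    extra: Dict[str, Any] = field(default_factory=dict)


# Hypothetical mapping for one source: payload key -> Submission attribute.
EXAMPLE_FIELD_MAP = {"body": "text", "author": "username", "ua": "user_agent"}


def to_submission(platform: str, payload: Dict[str, Any],
                  field_map: Dict[str, str]) -> Submission:
    """Translate a source-specific payload into a Submission.

    Attributes the source did not provide stay None; unmapped keys go to `extra`.
    """
    known: Dict[str, Any] = {}
    extra: Dict[str, Any] = {}
    for key, value in payload.items():
        if key in field_map:
            known[field_map[key]] = value
        else:
            extra[key] = value
    text = known.pop("text", "")
    return Submission(platform=platform, text=text, extra=extra, **known)
```

With this shape, supporting a new attribute means adding one field to `Submission` and one entry to a source's field map, which is one way to keep "adding new attributes" simple.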
Do something based on the report

This step is going to be specific to the target platform. Additionally, if we are doing something like allowing hosted message boards to integrate with us, it may be specific to the "customer", not the platform. Example: Users A and B host their own Discourse forums. User A wants certain actions to occur based on one set of rules. User B wants certain actions to occur based on other rules. There may be some overlap, but it's not guaranteed. Ideally, this type of thing could be handled by a plugin they configure on their boards - it sends data to us, we send a response back, the plugin looks at the response and makes the POST/NO POST decision.
Analyze Text
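A minimal sketch of the plugin side of that exchange, assuming the engine replies with a JSON body containing an `is_spam` flag and a list of `reasons`. That response shape, the reason names, and the `decide` helper are all assumptions made for illustration; the proposal does not define a wire format.

```python
import json
from typing import Set


def decide(response_body: str, blocked_reasons: Set[str]) -> str:
    """Return "POST" or "NO POST" for one engine response.

    `blocked_reasons` is the customer's own configuration: the reasons that
    this particular forum refuses to publish.
    """
    response = json.loads(response_body)
    matched = set(response.get("reasons", []))
    if matched & blocked_reasons:
        return "NO POST"
    return "POST"


# The same engine response, judged by two differently configured forums.
engine_response = '{"is_spam": true, "reasons": ["fiverr_solicitation"]}'
user_a_blocks = {"pharmacy_spam"}
user_b_blocks = {"fiverr_solicitation", "pharmacy_spam"}

print(decide(engine_response, user_a_blocks))  # POST
print(decide(engine_response, user_b_blocks))  # NO POST
```

Keeping the final decision in the plugin means the engine's response can stay identical for every customer while each forum applies its own rules.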

This is the heart of the system. It uses our existing tools to determine a spam/not spam rating. It does this by looking at regular expressions, WHOIS information, usernames, etc. As the system grows, it'd also use the newer information, such as spam from specific IP addresses or IP address ranges, common user agents, email domains, etc. It spits the spam/not spam result out with some additional stats (e.g. which regular expression reasons were hit, whether a common spammer IP was hit, etc). This part of the system has to be extremely flexible though. It has to be able to handle individual rules per platform (or customer). Going back to the two Discourse forums, User A may want to see Fiverr solicitation posts while User B doesn't want any of that type of traffic. Medium, on the other hand, wants to allow it but doesn't want to allow it if the text also matches a "Pharmacy Spam" pattern. In short, rules need to be combined/enabled/disabled with some advanced logic per integration point.
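One way to picture "combined/enabled/disabled per integration point" is to separate detection from policy: every rule reports whether it matched, and each integration supplies its own function over the set of matches. The sketch below is illustrative only. The two patterns, the integration names, and the `any_of`/`all_of` helpers are invented for the example, and the "medium" policy encodes only the one combination described above (Fiverr plus pharmacy); how Medium would treat pharmacy spam on its own is not stated and is left out.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet

# Hypothetical detection rules; real patterns would come from the blacklists.
RULES: Dict[str, re.Pattern] = {
    "fiverr_solicitation": re.compile(r"\bfiverr\b", re.IGNORECASE),
    "pharmacy_spam": re.compile(r"\b(cheap pills|online pharmacy)\b", re.IGNORECASE),
}


@dataclass(frozen=True)
class Report:
    is_spam: bool
    reasons: FrozenSet[str]  # every rule that matched, regardless of policy


# A policy turns the set of matched rule names into a spam/not spam verdict.
Policy = Callable[[FrozenSet[str]], bool]


def any_of(*names: str) -> Policy:
    return lambda hits: any(name in hits for name in names)


def all_of(*names: str) -> Policy:
    return lambda hits: all(name in hits for name in names)


# Hypothetical per-integration configuration.
POLICIES: Dict[str, Policy] = {
    "forum_a": any_of("pharmacy_spam"),                         # tolerates Fiverr posts
    "forum_b": any_of("fiverr_solicitation", "pharmacy_spam"),  # wants neither
    "medium": all_of("fiverr_solicitation", "pharmacy_spam"),   # only the combination
}


def analyze(integration: str, text: str) -> Report:
    hits = frozenset(name for name, pattern in RULES.items() if pattern.search(text))
    return Report(is_spam=POLICIES[integration](hits), reasons=hits)
```

Because the matched reasons are returned alongside the verdict, the same report can also feed the additional stats mentioned above.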
Infrastructure

The infrastructure behind this would need to have some power. The flexibility that is needed, just in the Analysis Engine, ensures that this can't run on a set of Raspberry Pis any longer. The multiple input sources and methods (they push to our API or we go to their API/firehose) across larger systems mean we need the bandwidth to handle it and reliability between Input and the Analysis Engine. The same reliability is needed between the Analysis Engine and the outgoing "Do Something". If a "Do Something" requires that we perform a set of steps (e.g. flag a post on Stack Exchange, but only flag up to three times with three different users, versus send a text blob back to a webhook so their plugin can determine the appropriate action to take), we need the hardware to perform that at scale.
If we are going to build this, we need to correct some flaws that exist with the current system as well. The biggest is the lack of detailed logs that are accessible to users that need them. With two major integration points into and out of the Analysis Engine, we'd need logging on both ends so that we can track a request from point of origin, through the analysis and back out. We'd need to see what it looked like coming in, what it was transformed into for the analysis and what it looked like going back out.
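A minimal sketch of that kind of end-to-end trace, using Python's standard `logging` and `uuid` modules: one request ID is generated at the point of origin and attached to a log line at each of the three points named above. The stage names, the `handle` function, and the stand-in "analysis" (a plain substring check) are assumptions for illustration, not the real engine.

```python
import logging
import uuid
from typing import Any, Dict

log = logging.getLogger("spam_engine.trace")


def trace(stage: str, request_id: str, payload: Dict[str, Any]) -> None:
    """Emit one log line tying a payload snapshot to a request and a stage."""
    log.info("request=%s stage=%s payload=%r", request_id, stage, payload)


def handle(raw: Dict[str, Any]) -> Dict[str, Any]:
    request_id = str(uuid.uuid4())

    # 1. What it looked like coming in.
    trace("input.received", request_id, raw)

    # 2. What it was transformed into for the analysis.
    normalized = {"text": str(raw.get("body", "")).strip()}
    trace("analysis.input", request_id, normalized)

    # Stand-in for the real analysis step.
    is_spam = "fiverr" in normalized["text"].lower()

    # 3. What it looked like going back out.
    result = {"request_id": request_id, "is_spam": is_spam}
    trace("output.sent", request_id, result)
    return result


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    handle({"body": "  Great Fiverr deal  "})
```

Returning the request ID in the response lets an integrator quote it when asking why a post was classified the way it was.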
The only input into the Analysis Engine would come via the input API. That ensures that we always receive a consistent input. There is only one output from the Analysis Engine. This ensures that the Do Something portion of the system will also have consistent input.
The system needs to handle input from multiple sources at a high frequency. Additionally, the Analysis Engine needs to be able to perform its tasks against multiple inputs at the same time. Output needs to be able to handle multiple tasks at the same time. In short, no single request should block any other request at any point in the Input->Analyze->Output flow. To ensure that we can transition between each of the subprocesses, it might be a good idea to look at setting up guaranteed delivery queues. Ideally, we'd not have a pile of items sitting in the Input Queue, but with a queue between the transitions, we'd have the ability to fail without losing data.
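The shape of that flow can be sketched with in-process queues and one worker thread per stage. This is only a toy: `queue.Queue` lives in memory and does not provide guaranteed delivery, so a real deployment would need a durable message broker in its place, and the proposal does not name one. The `start_stage` and `run_pipeline` helpers and the stand-in analysis are hypothetical, and error handling is omitted.

```python
import queue
import threading
from typing import Any, Callable, List, Optional, Tuple

_STOP = object()  # sentinel telling a stage to shut down


def start_stage(work: Callable[[Any], Any], inbox: queue.Queue,
                outbox: Optional[queue.Queue]) -> threading.Thread:
    """Run `work` on every item from `inbox`, forwarding results to `outbox`."""
    def run() -> None:
        while True:
            item = inbox.get()
            try:
                if item is _STOP:
                    if outbox is not None:
                        outbox.put(_STOP)  # pass the shutdown downstream
                    return
                result = work(item)
                if outbox is not None:
                    outbox.put(result)
            finally:
                inbox.task_done()  # acknowledge only after handling the item

    thread = threading.Thread(target=run, daemon=True)
    thread.start()
    return thread


def run_pipeline(texts: List[str]) -> List[Tuple[str, bool]]:
    input_q: queue.Queue = queue.Queue()   # Input -> Analyze
    output_q: queue.Queue = queue.Queue()  # Analyze -> Do Something
    delivered: List[Tuple[str, bool]] = []

    # Stand-in analysis: flag any text mentioning "fiverr".
    analyzer = start_stage(lambda t: (t, "fiverr" in t.lower()), input_q, output_q)
    # Stand-in "Do Something": just record what would be sent out.
    sender = start_stage(delivered.append, output_q, None)

    for text in texts:
        input_q.put(text)  # producers never wait on the analysis itself
    input_q.put(_STOP)

    analyzer.join()
    sender.join()
    return delivered
```

Each stage only ever talks to a queue, so a slow "Do Something" step backs up its own queue rather than stalling input, and more workers per stage could be added without changing the producers.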
This is a very high level diagram of how everything would be laid out. The items on INPUT and DO SOMETHING could continue indefinitely and will depend on developer time and support for new integrations.

Nice to haves


Expanding our ways of detecting spam would be important as we grow. We've already touched on some of these tests with just Stack Exchange, but continue to fall back on the tried and true Regular Expressions. I don't think this will work if we bring on other larger platforms (or even a decent number of smaller individual forums). We'll need some kind of automated heuristics to alert us to patterns that are emerging.


Pre-existing integrations for a few major content systems. The way this proposal is written means that we can have two different levels of "Do Something" right now. We can have a system level - like we have for Stack Exchange and are looking at for Medium - and an individual level within a platform. Think of something like a Discourse plugin, a GitHub bot, a WordPress blog comment plugin, etc. These are installed and maintained by individual users, not the owner of the platform (yes, I know Smokey isn't owned by SE). If we select a few targets to start with - WordPress, Discourse, PHPBB, GitHub - big names, lots of content, it gives us an immediate appeal and the potential for "Well, might as well try it." Integrations would not all be in the same language. Discourse is written in Ruby, PHPBB is not. The integrations would have to be written for each platform. The system level integrations (i.e. what we do for Stack Exchange) should be written in a common language so that developers can move between them easily.


Usage tiers. Right now time, talent and infrastructure for the entire project are donated by generous users and community overlords. If this grows as quickly as we hope, those donations would need to become more substantial. It might be worth considering our realistic expected usage volume and determining if we can implement usage tiers that would cover at least infrastructure costs. The major downside of this is that we'd have to do some business-related paperwork to set up billing, payments, etc. It'd also drastically change the dynamic of how we operate because we'd be getting payment to cover some aspects of our operation. I'm not proposing charging everyone for using our system. I'm in favor of a very generous "Free Tier" (or "Community Tier", sounds better) for smaller personal targets. Think things that get less than 100/500/1000/X posts a day. That's going to be most things. The costs would come from larger, system level, integrations. (Probably not Stack Exchange...they get the nice "Strategic Partner" association and free goodness). We'd maintain the open community and receive input from more sources of content, but payments would allow us to cover real costs that are currently being hidden. Yes, I just proposed building a (very) small business out of Charcoal.