
@ArtOfCode-
Last active March 5, 2018 04:24
Autoflagging meta post

Things Left To Do

  • RFC from mods

Post follows.


Title: We'd like to cast more flags on spam automatically. What do you think?

TL;DR: Charcoal is the organisation behind SmokeDetector. Since January 2017, we've been casting up to 3 flags automatically on posts that our systems are confident are spam. We'd like to increase that to 5 automatic flags to reduce the time spam spends alive on the network.

Who are you?

Charcoal is a user-run organisation that is primarily responsible for the spam-detecting bot, SmokeDetector. Over the past four years, with the aid of SmokeDetector, we've looked for spam on the Stack Exchange network to flag and destroy manually. In January 2017, with the blessing of Stack Exchange, we started running an "autoflagging" project, wherein our systems automatically cast up to three flags on a post if they're confident that it's spam. If you missed that happening entirely, we wrote a meta post on Meta Stack Exchange - or there's a slightly more concise explanation on our website.

How's that been going for you?

Good. We currently have 215 users who have opted into the autoflagging system (you can sign up too, if you're interested). We've flagged around 30 000 (29 592) posts, of which the vast majority (29 526) were confirmed spam - that's 99.7% accurate.

What are you proposing?

We'd like to expand our autoflagging system. At present we cast up to 3 flags on posts we're confident are spam; we'd like to increase that number to 4 or 5 flags.

Why?

Just so we're up-front about this: this is an experiment. Ultimately, we're trying to do these things:

  • Reduce the time that spam spends on the sites before being deleted;
  • Lower the number of humans who involuntarily have to see or interact with spam.

Increasing the number of flags we cast automatically on spam should accomplish both of these things:

  • Automatic flags are near-instant, while manual flags take minutes to cast; increasing the ratio of automatic to manual flags therefore shortens the time before 6 flags accumulate and the spam is deleted.
  • Automatic flags are not cast by a human. Fewer humans, therefore, are forced to see/interact with the spam.
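
For the curious, the first point can be sketched numerically. The per-flag timing below is a placeholder we made up for illustration, not measured data; real manual flags arrive at widely varying intervals.

```python
# Back-of-envelope sketch: more autoflags means fewer manual flags are
# needed, so the post reaches the 6-flag deletion threshold sooner.
# The 5-minutes-per-manual-flag figure is a hypothetical placeholder.

def minutes_until_deletion(autoflags, minutes_per_manual_flag=5, flags_needed=6):
    """Autoflags land the moment the post is caught; every remaining
    flag has to come from a human."""
    manual_needed = max(0, flags_needed - autoflags)
    return manual_needed * minutes_per_manual_flag

for k in (3, 4, 5):
    print(k, "autoflags ->", minutes_until_deletion(k), "minutes")
```

Under these toy numbers, going from 3 to 5 autoflags cuts the wait for human flags by two thirds; the real-world gain depends on actual flagger response times.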

The data we have backs this up. In terms of time to deletion, we saw a significant drop in the time it took to delete spam when we started our autoflagging project. Take a look at this graph from the meta post on the subject for an excellent visual representation of that. Before we started autoflagging, spam spent an average of 400 hours per day alive across the network; with autoflagging in place, the average is a tenth of that, at around 40 hours per day.

What would this change mean for sites?

If this change goes ahead, these things are likely to happen:

  • It will only take 1 or 2 manual flags from users to spam-nuke an autoflagged post, instead of the current 3. Posts that are not autoflagged will, of course, still require 6 flags to nuke.
  • There may be an increase in posts spam-nuked entirely by Charcoal members, who may or may not be active on the site.
  • You will see a reduction in the time spam spends on the site before being deleted.
  • Fewer humans will have to involuntarily see each spam post.

The last two of those are indisputably good things. The first two, however, are more controversial, and are the reason we want to have a discussion here on meta before we make this happen. What follows are the major concerns we've seen, and what we can do about them or why we don't think they're an issue - we'd like to hear your thoughts.

The major thing we're looking for out of this is a reduction in time to deletion. The following graph shows how long spam currently spends alive on the top few sites; we're hoping to see a moderate reduction in the average times, and a significant reduction in the top outliers.

The following graph is from an experiment we've been running over the past week, casting between 1 and 5 flags randomly on each post matching the settings we're considering.

In raw numbers, that's this:

PostCount  FlagCount  ATTD      StdDev  CommonMax
55         1          190.5091  197.59  585.69
55         2           85.0182  109.86  304.74
62         3           48.9355   83.57  216.07
68         4           26.9559   51.98  130.92
56         5           10.2143    5.60   21.41

PostCount is the sample size; FlagCount is the number of flags cast on each post in the sample; ATTD is the average time to deletion; and CommonMax is the upper bound of a 97% confidence interval. The major takeaway from these stats is that we're likely to see a ~5x drop in the average time to deletion, and a ~10x drop in the outliers.
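
For reference, the CommonMax column appears to be ATTD plus two standard deviations (each row checks out: e.g. 10.2143 + 2 × 5.60 = 21.41), which under a rough normality assumption covers roughly 97% of posts. A minimal sketch of how the table's columns could be reproduced from a raw sample, assuming that construction:

```python
import statistics

def summarize(deletion_times):
    """Collapse a sample of times-to-deletion into the table's columns.

    CommonMax is taken as mean + 2 standard deviations: assuming the
    times are roughly normal, ~97% of posts are deleted faster than
    this. (Whether the original table used the sample or population
    standard deviation is our guess; we use the sample one here.)
    """
    attd = statistics.mean(deletion_times)
    sd = statistics.stdev(deletion_times)
    return len(deletion_times), attd, sd, attd + 2 * sd
```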

Accuracy & false positives

Spam flags are a powerful feature that needs some care to apply correctly. This concern came up when we originally built the autoflagging system, so we already have safeguards built in.

  • We only flag a post if we're more than 99.5% sure it's spam. (Technically, the precise certainty varies by conditions set by the users whose accounts we use, but it's always above 99.5% - more detail on that on our website).
  • If the system breaks down or bugs out and starts flagging things it shouldn't, all Charcoal members and all network moderators have access to a command that immediately halts all flagging activity and requires intervention from a system administrator to re-enable. Outside of testing, that kill-switch has never had to be used.
  • We never unilaterally nuke a post. There are currently 3 manual flags required in addition to the automatic flags to nuke a post; this increase proposal still retains at least one manual flag.

We also make sure that everything has human oversight at all times. While only 3 humans currently have to manually flag the post, there are always more users than that reviewing the system's decisions and classifications; if a post is flagged that shouldn't have been, we are alerted and can alert the relevant moderators to resolve the issue. Again, this is very rare: over the past year, of the 29 592 posts we've flagged, only 66 shouldn't have been - that's 99.7% accuracy overall. We allow users to set their own flagging conditions, provided they don't go below our baseline 99.50% certainty. We recommend, however, higher-certainty settings that have so far proven 100.00% accurate - those who set their conditions below that are likely to see more false positives flagged using their account.

This proposal decreases the required manual involvement to nuke a post; to compensate for that lower human-involvement barrier, we will correspondingly increase the required accuracy before casting the extra automatic flags. For example, we currently require 99.5% accuracy before casting autoflags; we could require 99.9% accuracy for 4 autoflags, and 99.99% accuracy for 5 autoflags. (For reference, humans are accurate 95.4% of the time, or 87.3% on Stack Overflow - those are stats that jmac (a former Community Manager) looked up for us last year when we started autoflagging).
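
To put those thresholds in perspective, here's the back-of-envelope arithmetic over a corpus the size of our current one (~30,000 autoflagged posts). This is illustrative only: realised accuracy won't exactly equal the threshold.

```python
# Expected false positives among 30,000 autoflagged posts, assuming
# (purely for illustration) that realised accuracy exactly equals
# the certainty threshold.
def expected_false_positives(accuracy, posts=30_000):
    return posts * (1 - accuracy)

for acc in (0.995, 0.999, 0.9999):
    print(f"{acc:.2%} accurate -> ~{expected_false_positives(acc):.0f} false positives")
```

In other words, tightening the threshold from 99.5% to 99.99% shrinks the expected false-positive count from the order of hundreds to single digits over the same volume of posts.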

In the rare event of a legitimate post getting autoflagged, we also have systems in place to ensure it isn't accidentally deleted and forgotten about. Multiple people review each post we catch, whether it's autoflagged or not, and classify it as spam or not; if an autoflagged post is classified as not-spam, the system posts an alert to chat to let us know. That lets us ping the necessary people to retract their flags, and keep an eye on the post to make sure it doesn't get deleted.

To make it starkly clear how accurate this could be, here's a visualisation:

300x100 grid of green squares, with one red square near the top

That's a chronological representation (left-right, top-bottom) of every post that would have been flagged under the settings we're considering for 5 flags, and whether they were spam (green squares) or legitimate (red squares).
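
For the curious, the grid's layout can be sketched in a few lines. The function name and the '#'/'!' cell markers below are ours, purely for illustration; the real visualisation uses coloured squares.

```python
# Hypothetical sketch of the grid layout: one cell per autoflagged post,
# in chronological order, left-to-right then top-to-bottom, 300 per row.
def grid_rows(is_spam_flags, width=300):
    """is_spam_flags: booleans in chronological order (True = spam).
    Returns text rows using '#' for spam and '!' for legitimate posts."""
    cells = ['#' if spam else '!' for spam in is_spam_flags]
    return [''.join(cells[i:i + width]) for i in range(0, len(cells), width)]
```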

Community agency & involvement

As I said earlier, this proposal reduces the required manual involvement to nuke a post. Since Charcoal members also cast manual flags on top of the automatic flags cast by the system, that's also likely to increase the number of posts that are nuked entirely by Charcoal members, without involvement from users who are active on this site. Some posts already have 6 flags cast on them by Charcoal (including autoflagging and manual flags), but the proportion of posts that applies to is likely to increase.

We don't think this is an issue in terms of subject matter expertise: the spam we see on the Stack Exchange network is broadly the same wherever you go - you don't need any subject matter expertise or activity on a particular site to be able to tell what's spam and what's not. We do, however, recognise that a site's community may want to handle its own spam; if that's the case, we're happy to turn the autoflagging system off for that site, or to retain it at its current level.

What now?

We want to increase the number of automatic flags from 3 to up to 5 to reduce the time spam spends alive on the network. We'd like to hear your thoughts. We appreciate that quite a lot of what we do at Charcoal is fairly invisible to the sites, so we want to be as open as possible. If you'd like data or specific reports, let us know and we'll try to add them in - we already have a lot of reporting around autoflagging, so it may already exist. If there's anything else we can do to explain, or to help you make an informed decision about whether you want this, drop an answer or a comment on this post. Charcoal members will be hanging around this post to respond to your concerns, or you can visit us in chat.


1 There are some flags from Charcoal members that don't appear in these statistics, for various reasons, but the majority are there.

@AWegnerGitHub

Minor things:

I work with an organisation called Charcoal...

Change "work" to "volunteer" or "help". "Work" implies getting paid.


we've flagged 65 posts that shouldn't have been

Can we expand on this? Why were they flagged? Bug, crap, spam and then edited? What was the fate of these 65 posts? If we flagged them incorrectly, how did they fare when the 3 required humans showed up?


This proposal decreases the required manual involvement to nuke a post; to compensate for that lower human-involvement barrier, we will correspondingly increase the required accuracy before casting the extra automatic flags. For example, we currently require 99.5% accuracy before casting autoflags; while we haven't decided on concrete numbers yet, it would be possible to require 99.9% accuracy for 4 autoflags, and 99.99% accuracy for 5 autoflags. (For reference, humans are accurate 95.4% of the time, or 87.3% on Stack Overflow).

Let's expand this a bit too. With 99.5% we flagged 29K+ posts. How would that have changed with 99.9% or 99.99%? Can you link to the humans are accurate stats? I remember those being posted, but can't find it. I think that claim needs the citation to back it up.


Final thoughts: This feels very cookie-cutter. It needs a section that is personalised to the site we are posting on. That should include information such as how many of the 29K posts we've detected on this site, how long the average spam post lives (can this be broken down into autoflagged and non-autoflagged ones?), and other involvement we have on the site (do we have a chat room Smokey posts to, do we have a number of autoflaggers over 101 rep, do we have involved moderators, etc.).

@angussidney

Additionally, I think we should mention that all but 2 of the FPs were due to people choosing to run the system at a lower accuracy than recommended by us.

Also, just to check: were they 65 posts or 65 flags?

@magisch

magisch commented Feb 17, 2018

I think going so "official" as to call it "I work with an organization called charcoal" doesn't work in our favor. Maybe something individual and more lighthearted would serve better for the purpose we're trying for.

@DavidPostill

And to expand on the comment by @angussidney, perhaps we should not allow users to set an accuracy lower than the recommended. Make the recommended limit a hard lower limit.

@Undo1

Undo1 commented Feb 17, 2018

Now that we're looking at mSE only, due to a Shog directive, this probably becomes simpler - no site-specific data needed, but it'd be good to link to relevant searches.

We should be able to reword this for the whole network pretty simply.

@SulphurDioxide

site's community may want to handle its own spam; if that's the case, we're happy to turn the autoflagging system off on this site.

To me, this seems a little 'all-or-nothing'. Either let us increase the flag count or we'll turn autoflagging off completely. We're not talking about whether we should autoflag or not on the site because we're already doing that. The pushbacks we've got so far have been along the lines of 'We like it how it is, why change?' not 'we want it disabled completely'.

Is it possible to set a maximum number of flags on a site? Then we could rephrase to say something like:

If that's the case, we can leave the maximum number of flags for your site as it currently is (3).


I agree with @DavidPostill, if a user choosing lower accuracy makes it look like autoflagging is flawed, we should probably limit the lowest accuracy to a figure we deem reasonable.

@AWegnerGitHub

Pretty graph ideas:

  • TTD - This one is effective at showing how changes we make impact spam life across the network
  • TTD for heavy spam sites. Something like this chart could quickly compare multiple sites (from here)
  • Do we have interesting insights about waves of spam? Can we visualize a wave before autoflagging and another wave with autoflagging?
  • Spam volume over time of the project

@ArtOfCode-
Author

Some edits made addressing various points made here and feedback Shog gave us as well.

On the idea of restricting the lowest accuracy to a higher value - for 3-flag autoflagging, 99.5% is an acceptably high value. If you want to have settings that flag everything above that, it's fine - we'll pick more accurate conditions for more autoflags, somehow. The recommended settings are higher because we wanted those to be a balance between accuracy and post count, rather than going for one end of the scale.

@makyen

makyen commented Feb 19, 2018

"spends alive on the site" would be better as "spends alive on each site" or "spends alive on Stack Exchange sites"

"looked for spam on the network to flag" should be "looked for spam on the Stack Exchange network to flag"

"In late 2016-early 2017" state one. You don't "start" over a range. You might ramp up over a range, but starting almost always has a definite time, even if you started, stopped, started, stopped, etc.

"(you can sign up too, if you're still reading)" should be "(you can sign up too, if you're interested)". "Still reading" isn't an actual requirement; as such it comes off as a stale attempt at humor.

"Between them, we've flagged" should be "With them, we've flagged". Using "between" implies either divisiveness, or that you're going to explain how they are separated. "With" is inclusive.

"around 30,000 (29592) posts, of which the vast majority (29526) were" should consistently use, or not, a thousands separator. Try: "almost 30,000 posts (29,592 to be exact), of which the vast majority, 29,526, were"

"We'd like to expand our autoflagging system. " use "increase", not "expand". You're not wanting to expand the system, you're wanting to increase the number of flags cast automatically. Using "expand" implies changing how posts are classified to be auto-flagged.

"are spam; we'd like " use a period. "are spam. We'd like" While these are closely coupled for you, they aren't for your audience.

"spends on the sites before getting deleted" would be better as "spends on the sites before being deleted"

"Lower the number of humans who have spam involuntarily shoved in their faces." is a significant change in tone, and is much less professional than the other portions of the text. You might want to go with something like: "Lower the number of humans who are forced to see, or interact with, spam."

"should accomplish both of these things:" should have a period, not a colon. You're not introducing the next unordered list, you're stating a conclusion. You could add another sentence which introduces the next unordered list. It could either be longer, like: "Increasing the number of flags we cast automatically on spam should accomplish both of these things. Additional auto-flags will accomplish these goals because:", or you could make it all one sentence, like: "Increasing the number of flags we automatically cast on spam should accomplish both of these things, because:"

"Automatic flags are near-instant; manual flags take multiple minutes to be cast. Increasing the ratio of automatic to manual flags logically results in a shorter time before 6 flags accumulate and the spam is deleted." You went into a bulleted list, but then start explaining. Each bullet should state the premise, then justify that assertion. Also, avoid stating an argument as "logically". Stating it that way implies that you feel people won't look at it logically and that they are lesser than you for not doing so. You could do something like:

  • Increasing the ratio of near-instant automatic flags to flags manually placed by users, which can take several minutes, results in the spam being deleted faster, due to the reduction in time required for 6 flags to accumulate.
    (I'm not really happy with the wording, but the assertion should be in the first sentence.)

"Automatic flags are not cast by a human. Not as many humans, therefore, have the spam shoved in their faces." Again, both: A) the assertion should be in the first sentence, then justification (if needed); and B) "shoved in their faces" again, is both a jarring change in tone and implies that you're not being that professional about this. You've stated that the goal is to convince people. Being unprofessional is counter to that goal. If such a tone/verbiage is used, which it can be for emphasis, or to drive a point home, it shouldn't be a recurring motif. How about something like: "Fewer humans are forced to see and interact with the spam due to requiring fewer manual flags prior to deletion."

After getting to this point, I've realized that the current ":" after "both of these things:" may have led me to misunderstand what you're trying to say. Looking back on it, you might be attempting to list the advantages of automatic flags. If so, the bullet points (other than the "shoved in their faces" parts) are OK. They just need a sentence which introduces them as the advantages of auto-flags. Something like:

Auto-flags have the following advantages:

  • near instant ...
  • not cast by human ...

"Take a look at this graph from the meta post on the subject" While the graph is nice, it doesn't really do a good job of highlighting the change from 0 autoflags to 1 then 3. The problem with it is that the lead-in is too long and the part of interest is the little bit over on the left. Perhaps put a break in the graph to indicate that you have data for a long period of no autoflagging, but that it's all basically the same.

"(If you prefer numbers, I can instead tell you that before we started autoflagging, an average of 400 person-hours per day were spent on getting spam deleted across the network; with autoflagging in place, the average is 10x less, at around 40 person-hours per day.)" This is really too long for a parenthetical. It's not actually parenthetical to the subject of the paragraph, which is "The data we have backs this up." Just use: "Before we started autoflagging, an average of 400 person-hours per day were spent getting spam deleted across the network. With autoflagging in place, the average is 10x less, at around 40 person-hours per day." (some grammar changed too).

However, is what you're saying there really what you intend? What you're saying is that you have measured the number of person-hours spent working on removing spam (not the time it's visible/existent). I wasn't aware you had this data, and I'm not sure how you would have reliably obtained it. What I think you're trying to say is something like: "Before we started autoflagging, the accumulated time which spam existed was an average of 400 hours per day across the network. With autoflagging in place, the average is 10x less, at around 40 hours per day."
(The wording here still needs a bit of work.)

"What does that mean for sites?" This doesn't match what you're talking about in the section, and/or what "that" refers to is unclear. Perhaps: "What does increasing the per-post number of autoflags mean for sites?"

"If this change goes ahead, these things are likely to happen:" is unclear / misses an opportunity to reinforce the proposal. Use: "If the number of autoflags per post is increased, the likely results are:" Alternately, "If this proposal moves forward, the likely results are:" Which to use depends on what you've actually titled the question, and/or whether you want to take the opportunity to reinforce the major direction of your proposal.

"Posts that are not autoflagged still require 6 flags to nuke." I'd add "of course" Like: "Posts that are not autoflagged will, of course, still require 6 flags to nuke." Not doing something along these lines implicitly confirms to people who've gotten confused that you might be affecting the underlying system that they might have been correct to be thinking that way. Having the "of course" allows people to realize that they were conceptually wrong while keeping the nudge on your part more of a gentle reminder, i.e. implying that they really didn't need it.

"There is likely to be an increase in posts spam-nuked entirely by Charcoal members, who may or may not be active on this site." You've already stated that you're listing the "likely" things (at a minimum, use a synonym), in addition it singles out Charcoal members without needing to and without taking the opportunity to be inclusive. You could go with "There will probably be an increase in the number of posts that are spam-nuked entirely by people monitoring SmokeDetector reports, who may or may not be active on the site where the spam was posted. Currently, SmokeDetector reports into the following chat rooms: Charcoal HQ, ... . If a site requests it, SmokeDetector can report just the spam for that site into a room of the site moderators' choosing, as is being done for SOCVR and Stack Overflow."

More later.

@ArtOfCode-
Author

@makyen Thanks, bunch of changes made

@Undo1

Undo1 commented Feb 20, 2018

I actually printed this out and read it. Here goes:

T: We'd like to cast more flags on spam...

I don't understand the T: here. Something related to the TL;DR? Distracted me for a few seconds

Since early 2017

Any reason not to say January 2017? We started live runs on 1/2/2017.

reduce the time that spam spends alive on each site

Maybe change "on each site" to "across the network"? Might help move people into a network mindset instead of a per-site 'agency' mindset.

Over the past three years, with the aid of SmokeDetector...

Manish's first commit was January of 2014, so four years would be valid to say. I could see an argument for increased chat presence starting around three years ago. Either way.

If you missed that happening entirely and would like more detail about it before reading any further, we wrote a meta post on Meta Stack Exchange, or there's a slightly more concise explanation on our website.

Any reason we can't drop the "and would like more detail about it before reading any further" part and concise-ify the website part? "If you missed that happening entirely, we wrote a meta post on Meta Stack Exchange and a more concise explanation on our website."

Automatic flags are near-instant; manual flags take multiple minutes to be cast

Drop "multiple"?

Before we started autoflagging, an average of 400 person-hours per day were spent on getting spam deleted across the network

Where's this data from? If it's just average TTD * post count, that's kinda misleading since everyone didn't stare at it for TTD, then flag. Might be misunderstanding this.

You will hopefully see a reduction in the time spam spends on the site before being deleted.

Can we drop "hopefully" here? There isn't a scenario where this increases TTD, and it only stays equal if a site already has three flaggers that can outrace metasmoke... which would be impressive.

The last two of those things are good things.

How about "undisputed good things"? Show that we believe all 4 are good things, but only the first two are controversial. Current wording sounds to me like we think they might be bad amongst ourselves. Not going to fight anyone on this one, though.

We appreciate that quite a lot of the stuff we do at Charcoal is fairly invisible to the sites, so we want to be as open as possible.

Worth putting in here that invisibility is almost a goal? Maybe provocative, but people not knowing we exist is a rather impressive achievement for a project with our level of impact.

We do, however, recognise that it's possible that this site's community may want

s/this/a/g for posting this on mSE.

let us know and we'll be able to add them in

"May be able to" or "we'll try to add them in"? Don't want to sign us up for that one guy who wants hourly reports of Charcoal users' meal consumption to correlate with flagging efficiency.


Not-individual things:

  1. Do we care about British v. American spelling? It's a little distracting for me, but probably nitpicky to care about either way.
  2. "(29 526)" - same as above. Comma-ify for the Americans, or is it worth it? This distracted me less than the point above.

That's all I can come up with for now. Will probably re-read in the morning.

Also, still hold off on mod RFC until we (1) have data, and (2) decide whether to even do it.

@zalmanlew

zalmanlew commented Feb 20, 2018

At present we cast up to three flags on posts we're confident are spam; we'd like to increase that number to 4 or 5 flags.

Any reason to spell out the number the first time (three) but use numerals the 2nd and 3rd times? It's kind of confusing to me.

(I did it on purpose)

@angussidney

Take a look at this graph from the meta post on the subject for an excellent visual representation of that.

It'd be awesome if we could get an updated version of this graph to show

Before we started autoflagging, an average of 400 person-hours per day were spent on getting spam deleted across the network; with autoflagging in place, the average is 10x less, at around 40 person-hours per day.

I'd like to know how you calculated these stats, because they don't sound right. I think we should probably leave this part as 'From a subjective perspective, this has caused a very large noticeable drop in the time that spam exists on-site', or some variant of that.


Overall, this sounds pretty good.

However, I'd like to see some more inline pictures - yes the fancy headings will break it up, but this post still looks like an intimidating wall of text.

@rjrudman

@angussidney I agree here, 400 person-hours per day seems a lot. Even 40 person-hours per day seems high.

With 6 flags, and assuming each person takes an entire minute (very likely a high estimate here) to make a decision, we're talking about 4,000 spam flags to be reviewed a day. Considering it took years to detect 100,000 posts, this doesn't seem right.

@gparyani

I made some grammar corrections and clarifications in my fork; please take a look at those. (cc @angussidney)

@angussidney

The diff of the above fork is here for @Art to review. A lot of it seems to be American spelling changes; we might want to decide whether we want to standardise this or not.

@ArtOfCode-
Author

@gparyani Thanks, bunch of edits made.

@gparyani

gparyani commented Feb 24, 2018

@ArtOfCode- While many of my fork's changes were just pedantic spelling and wording changes (which I can understand why you left out), I'm wondering why you chose to leave out my text in the initial paragraph, "on behalf of willing users". I think that text is needed because someone just skimming the post may think we're using a bunch of bot accounts, so it's better to mention twice that it's done on behalf of willing users. Also, I think changing "30 000" to "30,000" is a good idea, as many world locales (not just the USA, but also India and some others) use commas.

@Undo1

Undo1 commented Feb 24, 2018

I'd be tempted to just leave out stats on the 'community agency' thing - we're going to get into debates about methodology on collecting that data, and that methodology is iffy at best.

@tripleee

Good stuff. Let me repeat a point from Andy's first comment -- it would be nice to know what happened to those 66 misflags ultimately. Is there an easy way to get a list of them so we can review them if this hasn't already been done?

@ArtOfCode-
Author

@gparyani I left "on behalf" out because it's kinda unnecessary in the TL;DR. If we were doing it on behalf of unwilling users, we wouldn't exactly be shouting about it on meta. It's explained further down. I'm leaving out commas as thousands separators because while the US, India, the UK and a bunch of other places all use them, much of Europe reverses the convention: 30,000.295 in one becomes 30.000,295 in the other.

@angussidney

angussidney commented Mar 3, 2018

while we haven't decided on concrete numbers yet, it would be possible to require 99.9% accuracy for 4 autoflags

To make it starkly clear how accurate our recommended settings are, here's a visualisation

Those two sentences are conflicting. The first one says that we haven't decided on the numbers for >3 autoflags, whereas the graph is of our proposed setting (>300 weight). One of these will need to be changed.

Multiple people review each post we catch, whether it's autoflagged or not, and classify it as spam or not;

This sentence/paragraph sounds a bit clunky. Maybe it should be:

No matter whether it's been autoflagged or not, multiple humans review every post that we catch and manually classify it as spam or not spam. If an autoflagged post is classified as not-spam, the system posts an alert to chat to let us know, so that we can ping the necessary people to retract their flags and keep an eye on the post to make sure it doesn't get deleted.

@terdon

terdon commented Mar 3, 2018

Could you please also add some data about how long spam survives currently? You are proposing a change to improve X (spam survival time) but give no data on X at all. As already discussed (at length) elsewhere, going from a survival time of, say, 1 hour to 1 minute is a very different thing from going from 10 seconds to 5 seconds, and I feel the single most important point this post needs to make is to explain what improvement the change would offer. It can't do that unless you discuss how long spam survives on the network at the moment.

Here's one you can use:

[boxplot: time to deletion per site, produced by the query and R code below]

The white dot is the mean time to deletion. I used this SQL query to get the data:

SELECT
    a.created_at, a.deleted_at,
    TIME_TO_SEC(TIMEDIFF(a.deleted_at, a.created_at)) AS ttd,
    b.site_name
FROM metasmoke_dump.posts a
INNER JOIN metasmoke_dump.sites b ON a.site_id = b.id
WHERE a.deleted_at IS NOT NULL
  AND a.is_tp = 1
  AND a.is_naa = 0
  AND a.created_at IS NOT NULL
  AND a.created_at > '2017-2-1'
  AND b.site_name IN (
      'Stack Overflow', 'Ask Ubuntu', 'Ask Different',
      'Graphic Design', 'Drupal Answers', 'Super User', 'The Workplace',
      'Meta Stack Exchange', 'Astronomy', 'Information Security'
  )
ORDER BY ttd DESC

Saved the results as topsites.tp.tsv and then, in R:

# Read the exported TSV and draw a per-site boxplot of time to deletion
df <- read.table("smokey/topsites.tp.tsv", header = TRUE, sep = "\t")
library(ggplot2)
ggplot(df, aes(x = site_name, y = ttd, fill = site_name)) +
  geom_boxplot() +
  scale_y_continuous(limits = quantile(df$ttd, c(0.1, 0.9))) +  # trim the extreme tails
  stat_summary(fun.y = mean, geom = "point", shape = 21, size = 3,
               col = "black", fill = "white") +  # white dot marks the mean
  scale_fill_brewer(palette = "Paired") +
  ylab("Time to Deletion (sec)") +
  xlab("")
ggsave("smokey/foo.png")

@Undo1

Undo1 commented Mar 3, 2018

Maybe change this:

That's a chronological representation of every post that could have been flagged under the recommended settings, and whether they were spam (green squares) or legitimate (red squares).

to specify that it's looking at 300 weight and clarify the order:

That's a chronological representation (left to right, top to bottom) of every post that could have been flagged under the settings we're considering for 5 flags, and whether they were spam (green squares) or legitimate (red squares).

That's an absolutely insane graph. I love it. Just need to clarify that it's for 300 weight, not our recommended 280 for normal flags.

@AWegnerGitHub

In the "what would this change mean" section, the "the last two..." paragraph should be directly after the bullet points and before the "The major thing we're looking for" paragraph and graph.

In the green graph, I am a bit confused on its meaning. In the preceding explanation, it says there are 66 misflagged posts. I see a single red dot. I'd expect 65 more dots scattered in the 29.5K (presumably green) dots.
