@zelaznik
Last active July 18, 2024 03:01
Just Enough Probability and Stats For Software Engineers

Comparing Statistics To Test Driven Development:

I’m assuming a lot of people in the audience haven’t studied statistics, but because this is Rubyconf, plenty of you know the principles of test-driven development (TDD). If you haven’t studied statistics before, don’t worry: hypothesis testing follows the same principle as TDD.

In TDD, you demonstrate that your code is correct in two steps. First, assume your code is wrong. Second, try to disprove that assumption. The first step is when you write the test so that it fails. The second step is to change your application code so that the test passes.

In statistics we do the same thing. We first assume the opposite of what we want to prove. If we want to show that a drug treats a disease, we first assume that the drug has no effect. That’s what the placebo group is for. The placebo group is the “red” portion of “red-green refactoring.” The group that’s treated with the drug is (hopefully) the “green” portion of “red-green refactoring.”

A statistical test will never PROVE that the drug works, just like a passing test doesn’t PROVE that your code works. Both are tools to give you more confidence.

Overfitting in stats and in TDD:

Let's say I want to write a function that returns the absolute value of a number:

def abs(v)
  if v == 2
    2
  elsif v == 1
    1
  elsif v == 0
    0
  elsif v == -1
    1
  elsif v == -2
    2
  end
end

And then I write my tests in RSpec:

describe "abs" do
  it("returns 2 when given 2") { expect(abs(2)).to eq(2) }
  it("returns 1 when given 1") { expect(abs(1)).to eq(1) }
  it("returns 0 when given 0") { expect(abs(0)).to eq(0) }
  it("returns 1 when given -1") { expect(abs(-1)).to eq(1) }
  it("returns 2 when given -2") { expect(abs(-2)).to eq(2) }
end

So what's wrong with this? You wrote your specs, and you implemented the function to make them pass. This is the equivalent of overfitting in statistics: the function matches the test data perfectly while learning nothing about the underlying rule.
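To see the overfitting concretely, here is a minimal sketch (the hash is just the if/elsif chain above written as a lookup table): every spec passes, but the function has only memorized the five inputs the specs mention.

```ruby
# Equivalent to the overfit if/elsif chain: a lookup table
# disguised as a function.
def abs(v)
  { 2 => 2, 1 => 1, 0 => 0, -1 => 1, -2 => 2 }[v]
end

abs(-2)  # => 2, the spec passes
abs(3)   # => nil -- the function memorized the test data,
         # not the rule behind it
```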

Talk Outline: Probability and Stats for Software Engineers

Introduction

  • Title: Probability and Stats for Software Engineers
  • Objective: Help software engineers understand and apply basic statistical concepts to improve their diagnostics and decision-making.
  • Relevance: Examples of everyday software engineering problems where probability and statistics can help (e.g., flaky tests, error rates).

Preface:

Some jargon:

  • Null hypothesis
  • Confidence interval (95% by arbitrary convention)
  • Rejecting the null
  • Failing to reject the null
  • Type 1 error vs Type 2 error

Examples of said jargon:

  • Try to fix the flaky test
  • Null hypothesis "I haven't fixed my flaky test"
  • Type 1 error: "Thinking I fixed the test when I haven't"
  • Type 2 error: "Actually fixing the test but not confident I did"
  • Confidence interval:
  • Run enough tests so that if I haven't fixed the flaky test, there's a 95% chance that at least one of those builds will be red
  • How do we know how many times to run the test suite? (That's what this talk is for)
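As a preview of that calculation, assuming each run fails independently with the same probability p: we want the smallest n such that 1 - (1 - p)^n >= 0.95, which rearranges to n >= log(0.05) / log(1 - p). A sketch:

```ruby
# Smallest number of runs n such that a still-flaky test (failure
# probability p per run) shows at least one red build with
# probability >= confidence.
def runs_needed(p, confidence = 0.95)
  (Math.log(1 - confidence) / Math.log(1 - p)).ceil
end

runs_needed(0.10)  # => 29: a 10%-flaky test needs 29 green runs
runs_needed(0.01)  # => 299: a 1%-flaky test needs ~300
```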

Part 0: Comparing Stats and Test Driven Development (No, really!)

Part 1: Binary Questions in Software Engineering

Scenario 1: Flaky Tests

  • Problem: Is the test genuinely flaky or not?
  • Concept: Bernoulli distribution
  • Explanation: Simple probability - it either fails (p) or passes (1-p).
  • Calculation: Probability of no failures (or one or more failures) over multiple runs.
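A minimal sketch of that calculation: with per-run failure probability p, the chance of seeing no failures in n independent runs is (1 - p)^n, and the chance of at least one failure is its complement.

```ruby
# Probability of zero failures across n independent runs,
# given per-run failure probability p.
def p_no_failures(p, n)
  (1 - p)**n
end

def p_at_least_one_failure(p, n)
  1 - p_no_failures(p, n)
end

p_no_failures(0.05, 20)           # ~0.36: even a 5%-flaky test
p_at_least_one_failure(0.05, 20)  # ~0.64  goes fully green 1 run in 3
```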

Scenario 2: Error Rate Monitoring

  • Problem: Is the spike in errors a sign of a new bug or just random chance?
  • Concept: Poisson distribution for rare events. Show visual demonstration that Poisson is a good approximation of binomial for rare events.
  • Explanation: Model the expected number of errors over time.
  • Calculation: Probability of zero errors (or one or more errors) given the historical average.
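A sketch of the Poisson calculation, assuming a historical average of lambda errors per time window: P(k) = lambda^k * e^(-lambda) / k!, so summing the probabilities below a spike tells you how surprising the spike is.

```ruby
# Poisson probability of exactly k events, given historical mean lambda.
def poisson_pmf(lambda_, k)
  lambda_**k * Math.exp(-lambda_) / (1..k).reduce(1, :*)
end

# Probability of k or more events -- a spike this big or bigger.
def p_at_least(lambda_, k)
  1 - (0...k).sum { |i| poisson_pmf(lambda_, i) }
end

p_at_least(2.0, 6)  # with a historical mean of 2 errors/hour,
                    # 6+ errors is a ~1.7% event -- worth a look
```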

Part 2: Central Limit Theorem (CLT)

Introduction to CLT

  • Concept: Regardless of the original distribution, the distribution of the sample means will tend to be normal (bell curve) if the sample size is large enough.
  • Importance: Simplifies many real-world problems since the normal distribution is easy to work with (mean and standard deviation).

Interactive Demonstration:

  • Website Tool: Allow users to select different probability distributions and run multiple trials.
  • Visualization: Show the cumulative results forming a bell curve.
  • Explanation: Demonstrate how the sample mean approximates a normal distribution as the number of trials increases.
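The same demonstration can be sketched in plain Ruby, no website needed: draw from a heavily skewed distribution (nothing like a bell curve), average batches of draws, and the batch means still cluster tightly around the true mean.

```ruby
# CLT sketch: individual draws are very non-normal (0 most of the
# time, occasionally 10, true mean 1.0), but means of batches of 100
# concentrate around 1.0.
srand(42)  # fixed seed so the simulation is repeatable

def skewed_draw
  rand < 0.9 ? 0 : 10
end

sample_means = Array.new(1_000) do
  Array.new(100) { skewed_draw }.sum / 100.0
end

overall = sample_means.sum / sample_means.size
# overall lands close to the true mean of 1.0; the spread of
# sample_means shrinks as the batch size grows
```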

Application of CLT:

Example: Error rates before and after a fix.

  • Calculation: Using the mean and standard deviation to estimate the likelihood of reduced error rates. Use Poisson distribution from previous section to give the mean and standard deviation.
  • Interpretation: Quick sanity checks and ballpark estimates using normal distribution properties.
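A sketch of that sanity check (the counts below are hypothetical): for a Poisson distribution the variance equals the mean, so an observed count x against a historical mean lambda gives a z-score of (x - lambda) / sqrt(lambda), which you can read off a normal table.

```ruby
# Normal-approximation sanity check for an error-count change.
# Poisson(lambda) has standard deviation sqrt(lambda), so the z-score
# says how many "sigmas" the observed count is from the old normal.
def error_z_score(observed, historical_mean)
  (observed - historical_mean) / Math.sqrt(historical_mean)
end

error_z_score(4, 25)   # z = -4.2: a real reduction after the fix
error_z_score(21, 25)  # z = -0.8: could easily be noise
```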

Example: Load Capacity Of Websites:

  • Use average use time to get a Poisson mean and standard deviation
  • Use CLT to make a bell curve. Then calculate how much capacity is required for 99% uptime, 99.9% uptime, etc.
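A sketch of the capacity calculation, assuming concurrent load is roughly Poisson with mean lambda: the normal approximation gives capacity of about lambda + z * sqrt(lambda), where z is the standard normal quantile (2.326 for 99%, 3.090 for 99.9%).

```ruby
# Capacity needed so demand exceeds supply at most (1 - quantile) of
# the time, under Poisson(lambda) load and the normal approximation.
Z = { 0.99 => 2.326, 0.999 => 3.090 }  # standard normal quantiles

def capacity_for(lambda_, quantile)
  (lambda_ + Z.fetch(quantile) * Math.sqrt(lambda_)).ceil
end

capacity_for(400, 0.99)   # => 447 slots for mean 400 concurrent users
capacity_for(400, 0.999)  # => 462: three nines costs only ~15 more
```

Note how cheap the extra nine is: the buffer grows with sqrt(lambda), not with lambda.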

Example (if time permits): queueing theory

  • Useful if the server is keeping a websocket open for each connection
  • There's also a normal approximation for this, details to be given later

Example: A/B Testing:

  • Click-through rates follow a Bernoulli distribution.
  • From this we can get a normal approximation for group A, and a normal approximation for group B
  • The hypothesis test is whether A - B > 0, and (A - B) has a normal distribution with mean mu_A - mu_B and standard deviation sqrt(sigma_A**2 + sigma_B**2)
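A sketch of that test with hypothetical click data: each group's click-through rate p has a normal approximation with variance p(1 - p)/n, and the difference of two independent normals is normal with the variances added.

```ruby
# Two-proportion z-test sketch for an A/B test of click-through rates.
# The click counts below are made up for illustration.
def ab_z_score(clicks_a, n_a, clicks_b, n_b)
  p_a = clicks_a.fdiv(n_a)
  p_b = clicks_b.fdiv(n_b)
  # standard deviation of (A - B): variances add, never subtract
  sigma = Math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
  (p_a - p_b) / sigma
end

# 120 clicks out of 1000 vs 90 out of 1000:
ab_z_score(120, 1000, 90, 1000)  # z ~= 2.19, past the usual 1.96 cutoff
```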

Part 3: Practical Applications and Caveats

Quick Calculations:

  • Mean and Standard Deviation: How to calculate and use them for back-of-the-envelope estimates.
  • Binary Questions: Apply the bell curve to make decisions (e.g., Is the error rate significantly reduced?).

Caveats:

  • Limitations: These methods are useful for ballpark estimates and sanity checks but not for rigorous statistical analysis.

  • Real-world Use: Emphasize that software engineers are not statisticians, and these tools are meant for practical, everyday use.

  • Knowing when events are and aren't independent:

    • Users clicking the send button multiple times when they should click once (independent)
    • A power outage at your data center takes down your database for lots of users (NOT independent)
    • Two separate runs of a flaky test (independent)

Conclusion

  • Summary: Recap the key concepts (Bernoulli distribution for binary questions, Poisson distribution for rare events, and CLT for simplifying complex problems).
  • Q&A: Open the floor for questions, prepared to delve deeper into any of the topics discussed.
  • Resources: Provide links to further reading and tools (e.g., interactive website, basic stats textbooks, online courses).
@taylorkearns

This is great! I think it stands a very good chance of getting accepted. The trick will be making it digestible to non-statisticians (which I am sure you can do) and making it short enough, IMO.
