Velocity

Life after human error

  • Dr Steven Shorrock
  • Can we move on from this idea?
  • A psychologist - why give a talk on human error?
  • Your systems are highly business critical, and increasingly safety critical
  • Studied human error for 6 years for PhD and after
  • Came to the conclusion that he doesn't know what human error is anymore
  • Someone did something they weren't supposed to do according to someone else.
  • 80% of plane crashes, 90% of road accidents, 70% of data breaches
  • People can hold 7 ± 2 items
  • In the media, the human error figure is always 80% ± 20% (but always plus)
  • Human error is a dubious explanation of failure.
  • "The incident was due to human error and has now been closed" - Common view of how this works.
  • HAL9000 explanation, "this has happened before and it's always been human error"
  • Words create worlds, they create how we think about safety.
  • Error psychology, error taxonomy - "lapse, inattention, mistake, distraction, deviation, negligence, slip"
  • Not all errors are created equal. We'll admit to distraction, but not to recklessness.
  • Player Piano by Kurt Vonnegut - "If it weren't for the people, earth would be an engineer's paradise"
  • What we call human error is a shapeshifting persona. It tends to be a post hoc social judgement.
  • It's hard to define in advance what an error is.
  • Requires an inerrant (error-free) specification, which tends not to exist
  • Can stigmatise or scapegoat.
  • bit.ly/ST4SAFETY - White paper into human error
  • How to deal with error
  • Be mindful of your mindset, how you think about errors
  • Do you think about who did it, or what happened?
  • Studying the system in the context of normal work. People aren't trying to do something different.
  • You can study at any time, not just as a post-mortem.
  • But must be done with field experts, not as an analyst going in.
  • Understand how the system reacts to demand and how people deal with pressure.
  • Systems are supposed to deal with demand, within their resources and constraints
  • A consistent feature of accidents is the presence of pressure on the system and people at the time of the accident.
  • Adjusting is what we do. Performance variability is something you need in the system; it will always happen
  • To deal with varying performance we need to make trade-offs. We trade off efficiency vs thoroughness.
  • Prior to an accident we say you must be efficient; afterwards we say you should have been more thorough

Continuous Security

  • Security - we're the guys who delay projects.
  • No, you can't have a security test, especially not at the rate you are deploying (hundreds of times per day)
  • Still stuck in the waterfall model
  • One control gate before deploy. Focus all testing in the one place
  • Suffers from the same problems as other testing: feedback comes far too late
  • Pen Tests are performed by experts
    • But they are "security" experts, not experts in your domain
  • To fix this, we can look at other properties of the system
  • You can't just buy a security box and make it work.
  • Let's look at how quality testing works in agile.
    • We shifted responsibility
    • Everybody is responsible for quality.
    • We shifted quality testing earlier in the cycle and made developers responsible
    • Automate all the things
  • This is where security testing should move to
    • Move it closer to developers
    • Make it earlier so we get feedback earlier and can fix it faster
  • Quality testing and security testing aren't fundamentally different
  • Security tests are slightly different.
  • Central part of the process, we need a threat model
    • Who is going to attack
    • Where they will attack
    • Do I care?
  • We feed in business context, online games and banking are very different
  • We can arrive at a list of potential threats for the organisation
    • Lightweight risk analysis: impact and likelihood values, and do we care? (a scoring sketch follows below)
    • If we care, implement the appropriate control
    • If we don't, write it down as an accepted risk
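
A minimal Python sketch of that lightweight risk analysis. The Threat class, the 1-5 scales and the threshold are assumptions for illustration, not anything prescribed in the talk:

```python
# Score impact x likelihood, then either demand a control or record
# the risk as accepted - all names and scales here are assumed.
from dataclasses import dataclass

@dataclass
class Threat:
    name: str
    impact: int      # 1 (negligible) .. 5 (severe)
    likelihood: int  # 1 (rare) .. 5 (frequent)

CARE_THRESHOLD = 6   # assumed cut-off below which we accept the risk

def triage(threats):
    for t in threats:
        score = t.impact * t.likelihood
        if score >= CARE_THRESHOLD:
            print(f"{t.name}: score {score} -> implement a control")
        else:
            print(f"{t.name}: score {score} -> accept and write it down")

triage([Threat("user enumeration", impact=2, likelihood=2),
        Threat("SQL injection", impact=5, likelihood=3)])
```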
  • We can divide this into Why, What and How.
    • Why - Threat model
    • What - Security requirements (functional and non-functional)
    • How - Security Tests
  • This needs to be inclusive. The entire team should be involved in the process of building the threat model.
  • Data can leak via functional requirements. In threat modelling we can decide how relevant that is.
    • Example: can you find out whether someone is registered on the site?
    • If you are an online shop, the impact to you or the consumer is pretty low
    • If you are Ashley Madison (dating for cheating spouses) then you do care
  • Security Requirements need to have two features
    • Visible, actionable
    • Useful, testable
  • Example:
    • Bad: User data should be encrypted in transit
    • Better: Data X should use Y encryption when going from Service A to B (see the test sketch below)
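
A minimal sketch of making the "better" requirement testable. The host name is a hypothetical placeholder, and TLS 1.2+ stands in for "Y encryption" between Service A and Service B:

```python
# Assert the channel to Service B negotiates TLSv1.2 or newer.
import socket
import ssl

SERVICE_B_HOST = "service-b.internal.example"  # hypothetical endpoint
SERVICE_B_PORT = 443

def test_data_x_encrypted_in_transit():
    ctx = ssl.create_default_context()
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2  # refuse older protocols
    with socket.create_connection((SERVICE_B_HOST, SERVICE_B_PORT), timeout=5) as sock:
        with ctx.wrap_socket(sock, server_hostname=SERVICE_B_HOST) as tls:
            assert tls.version() in ("TLSv1.2", "TLSv1.3")
```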
  • BDD Specs (Given/When/Then) for testing
  • http://github.com/continuumsecurity/bdd-security
  • Uses JBehave, Selenium, ZAP, Nessus
  • Plus some internal security tools and some pre-written baseline security specifications
  • Aiming at middle-of-the-road apps: webapps that have user login, change-password features and the like.
  • Examples
    • Narrative - why do we want the spec
    • Scenario: Only the required ports should be open
      Given the target host from the base URL
      When TCP ports from 1 to 65535 are scanned using 100 threads and a timeout of 300 milliseconds
      And the open ports are selected
      Then only the following ports should be open
      | port |
      | 80 |
      | 443 |
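
A rough Python sketch of what that scenario checks - not the JBehave step implementation from bdd-security, just the same idea with a thread pool and plain sockets (the target host is a placeholder):

```python
# Scan all TCP ports and assert only the expected ones are open.
import socket
from concurrent.futures import ThreadPoolExecutor

TARGET = "target.example"   # hypothetical host from the base URL
EXPECTED_OPEN = {80, 443}
TIMEOUT_S = 0.3             # 300 milliseconds

def is_open(port: int) -> bool:
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(TIMEOUT_S)
        return s.connect_ex((TARGET, port)) == 0

def test_only_required_ports_open():
    with ThreadPoolExecutor(max_workers=100) as pool:
        results = pool.map(is_open, range(1, 65536))
    open_ports = {p for p, ok in zip(range(1, 65536), results) if ok}
    assert open_ports == EXPECTED_OPEN
```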
  • Test failures can fail the build, or fail the deploy
  • Use the same processes that we use for integration tests in QA
  • Automated security scanning tools: we can fail the build if Nessus returns anything higher than severity 2, for example
    • Also lets you avoid the false positives
  • This mirrors what your security guy probably already does, but now it's repeatable and visible
  • Manual testing via browser and Zap becomes Selenium and Zap-API
  • Nice demo of the features here
  • The value of keeping the false-positives list is that the file is in source control, so you know who added an exception and why (a gating sketch follows below)
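
A minimal sketch of that build gate, assuming scan findings are exported as JSON; the file names and finding format are invented for illustration, not a real Nessus integration:

```python
# Fail the build on any finding above severity 2 unless its ID is in
# the accepted false-positives file kept in source control.
import json
import sys

MAX_SEVERITY = 2

def gate(findings_path: str, accepted_path: str) -> int:
    with open(findings_path) as f:
        findings = json.load(f)       # e.g. [{"id": "...", "severity": 3}, ...]
    with open(accepted_path) as f:
        accepted = set(json.load(f))  # IDs reviewed and accepted, in git

    blockers = [x for x in findings
                if x["severity"] > MAX_SEVERITY and x["id"] not in accepted]
    for x in blockers:
        print(f"BLOCKER {x['id']} severity {x['severity']}")
    return 1 if blockers else 0

if __name__ == "__main__":
    sys.exit(gate("scan-findings.json", "accepted-false-positives.json"))
```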
  • Automating testing access control
    • Requires more domain-specific Selenium work
    • The framework creates an inverse access-control matrix: if you say only Bob should be able to see page X, it will try as all other users and guarantee that nobody else can see Bob's private data (see the sketch below)
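
A minimal sketch of the inverse access-control idea; the fetch_as helper is a hypothetical stub standing in for the domain-specific Selenium work:

```python
# For each page rule, try the page as every user NOT in the allowed
# set and assert they are denied.
ACCESS_RULES = {"/bob/private": {"bob"}}       # page -> users allowed to see it
ALL_USERS = {"alice", "bob", "carol", "guest"}

def fetch_as(user: str, page: str) -> int:
    # Stub: in practice, log in as `user` via Selenium, request `page`,
    # and return the real HTTP status code.
    return 403

def check_inverse_matrix():
    for page, allowed in ACCESS_RULES.items():
        for user in ALL_USERS - allowed:       # the inverse of the rule
            status = fetch_as(user, page)
            assert status in (401, 403), f"{user} could reach {page}"

check_inverse_matrix()
```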
  • The BDD-Security repository can be used to jumpstart your automated testing.
  • Other alternatives
    • ZAP-JUnit github.com/continuumsecurity/zap-webdriver
    • Gauntlet - gauntlt.org
    • Mittn - github.com/F-Secure/mittn

Math

  • CEO of Circonus, data analysis etc.
  • High-level talk - not lots of maths; it's more statistical, and you can get tools to do that
  • We gets lots of numbers, streams of data.
    • Some come slowly, like once a minute
    • Others quickly, like 20,000,000 per second
    • People often think minute-by-minute; Theo wants to either inspire despair or give you some ideas
  • StatsD type solutions, often have weird metrics
    • Counting cups of coffee is different to room temperature measures.
    • So signals have types.
  • We then get two problems. We want to be able to trend the data as well
  • When data is flowing in, you don't have the computational power to use offline mathematical systems
  • But most monitoring or alerting tools want to tell you "am I screwed right now".
  • Some people use different approaches
    • Statistical models
    • Machine learning
    • Ad-hoc (for example 75% disk full)
  • Or you can apply mathematicians instead.
  • Maths people want to be given a clean problem: they don't know what widget_5_count_rate means, but they're told it's important
  • We need to understand the characteristics of the graph
    • Rule 1 - Never monitor the rate of things.
    • Don't monitor the cups of coffee consumed per minute, as you have to specify a time window
    • So "how many cups consumed in the last minute" has to be asked every minute or you lose data
    • Instead count the number of coffee cups, and poll. You can calculate the rate over any time window
    • Derivative of the count = change in value / change in time
  • The term for this is a counter; it generally only increments (see the sketch below)
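
A minimal Python sketch of the counter approach: poll the monotonically increasing count and derive the rate from successive samples, so a late or missed poll loses no information (the coffee-cup numbers are invented):

```python
# Rate = change in counter value / change in time, per interval.
def rates(samples):
    """samples: list of (timestamp_seconds, counter_value) in order."""
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        yield t1, (v1 - v0) / (t1 - t0)   # e.g. cups per second

cups = [(0, 100), (60, 103), (180, 109)]  # the poll at t=120 was missed
print(list(rates(cups)))                  # [(60, 0.05), (180, 0.05)]
```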
  • How do we know to use the derivative?
  • Other metrics might look like two different types
    • For example disk space
    • Gauge: how full is it right now
    • Also a fill rate: how fast is it filling up
  • Apply humans to the problem and they generally do the right thing
  • We can run algorithms against the data, deciding things like: does it have periodicity, does it grow constantly, etc.
  • Then see who graphs similar data and what graph do they use
  • Apply a Bayesian model of features to categories.
  • How do we identify brokenness?
  • Look at a graph, "that would have been interesting to know yesterday"
  • Because of the rate of sampling, the graph appears to have features, but actually there are 30,000 data points
  • We probably average; any time you turn thousands of data points into one, you are losing information.
  • If we zoom in, the data shows it's more complicated.
  • p-values should apply to statistical models.
  • Does a new data value fit the current model? What confidence do you have that the model is right?
    • We have two options:
    • Increase the frequency of collection
    • Widen the collection framework (i.e. track 500 machines' CPUs, not just 1)
  • Increasing the frequency is the only option often (say if you only have 1 database server)
  • Not just an increase in time resolution.
  • We could poll current disk usage every minute
  • Or we could have every IOP report to us when a new block is allocated
  • A diversion: can we take out what we know about the signal?
    • For example, for a signal with periodicity, can we remove the periodicity first?
  • Models need to change; we can't set a baseline of an average of, say, 7, since that will change over time.
  • Control systems use EWM - the Exponentially Weighted Mean (see the sketch below)
    • Uses little storage space: e.g. take 80% of the previous value and add 20% of the new value.
    • Constantly accumulating the change
    • Hard to do offline, because it's hard to find the starting point
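
A minimal EWM sketch in Python; alpha=0.2 gives the "80% of the previous value plus 20% of the new value" accumulation from the note, in constant memory:

```python
# Constant-memory running mean: blend each new value into the average.
def ewm(values, alpha=0.2):
    avg = None
    for x in values:
        avg = x if avg is None else (1 - alpha) * avg + alpha * x
        yield avg

print([round(v, 2) for v in ewm([10, 10, 16, 10])])  # [10, 10.0, 11.2, 10.96]
```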
  • Sliding model
    • Take new value, add to model and remove oldest value from the model
    • To remove the old data, you must keep the entire dataset in memory (for the size of the window)
    • Really easy to repeat offline (see the sketch below).
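
A minimal sliding-window sketch: the whole window has to stay in memory, but the computation replays identically offline:

```python
# Add the new value, drop the oldest, report the window mean.
from collections import deque

def sliding_mean(values, window=3):
    buf = deque(maxlen=window)  # deque evicts the oldest value for us
    for x in values:
        buf.append(x)
        yield sum(buf) / len(buf)

print(list(sliding_mean([1, 2, 3, 4, 5])))  # [1.0, 1.5, 2.0, 3.0, 4.0]
```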
  • Lurching windows
    • A sliding window of 3 one-day buckets: 2 days ago, yesterday and today
    • EWM throughout the day; keep each day's average and put it into the sliding window. Only have to store 6 numbers in memory
  • Ask the question: how well does my data fit the model, versus how well does it fit the null hypothesis?
    • This tests the hypothesis as well; a poor model may fit, but so would the null model
  • Can apply CUSUM, which applies the hypothesis test to the moving average (see the sketch below).
  • Gives you an increasing confidence that the system does not work.
  • Relatively slow on this data, because we don't have enough data.
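
A minimal one-sided CUSUM sketch (a simple textbook formulation, not necessarily the speaker's exact method): deviations above the expected level accumulate, and confidence that something changed grows until the sum crosses a threshold:

```python
# Accumulate deviations above target+slack; alarm once past threshold.
def cusum(values, target, slack=0.5, threshold=5.0):
    s = 0.0
    for i, x in enumerate(values):
        s = max(0.0, s + (x - target - slack))
        if s > threshold:
            yield i   # index where confidence of a change is high

data = [10, 10.2, 9.9, 12, 13, 12.5, 13.2]
print(list(cusum(data, target=10)))  # [5, 6] - alarms at indices 5 and 6
```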
  • Looking at the alternative Tukey test.
  • Your data is not evenly distributed.
  • Many statistical models are designed to work on normal distributions (and standard deviations assume a normal distribution)
  • If you can show the distribution, you can start to do statistics properly.
  • High volume data is different again: doing 10 billion to 1 trillion measurements a second
  • Therefore need a different model, so we need to cheat
  • How do we cheat?
    • Compress the data more
    • Drop the second-level granularity, so values go into a minute bucket, which gives me a count
    • Drop values to 2 significant digits in base 10, so 1.1, 1.18 and 1.19 go into the 1.1-1.2 bucket
    • These are now counted, so we know that in minute 7 there were 25 numbers in the 1.1 bucket - that's only 1 number instead of 25 (see the bucketing sketch below)
    • Extract useful lower-dimensional characteristics, because our data isn't normal
    • e.g. Tukey doesn't require a normal distribution
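
A minimal sketch of that bucketing cheat: floor each value to two significant digits in base 10 and keep one count per (minute, bucket) pair (the sample values are invented):

```python
# Many raw measurements collapse into one counter per bucket.
import math
from collections import Counter

def two_sig_bucket(x: float) -> float:
    """Floor x to 2 significant digits in base 10, e.g. 1.18 -> 1.1."""
    exp = math.floor(math.log10(abs(x))) - 1   # power of the 2nd digit
    b = math.floor(x / 10**exp) * 10**exp
    return round(b, max(0, -exp))              # tidy float noise

hist = Counter()
samples = [(7, 1.1), (7, 1.18), (7, 1.19), (8, 2.34)]  # (minute, value)
for minute, value in samples:
    hist[(minute, two_sig_bucket(value))] += 1

print(hist)  # Counter({(7, 1.1): 3, (8, 2.3): 1})
```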
  • Brendan Gregg - N-value
  • Counts the number of up and down angles in the graph. Gives a good representation of workload.
  • Can do Quantiles, MetaMarkets website
  • Inverse quantiles are also interesting: what percentage of requests are in the 99th percentile (see the sketch below).
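
A minimal sketch of a quantile and its inverse, with invented example values: the quantile maps a percentage to a value, and the inverse quantile maps a value back to the percentage of samples at or below it:

```python
# Quantile: percentage -> value; inverse quantile: value -> percentage.
def quantile(sorted_xs, q):
    i = min(len(sorted_xs) - 1, int(q * len(sorted_xs)))
    return sorted_xs[i]

def inverse_quantile(sorted_xs, value):
    below = sum(1 for x in sorted_xs if x <= value)
    return below / len(sorted_xs)

latencies = sorted([12, 15, 14, 230, 16, 13, 18, 17, 15, 14])  # ms
print(quantile(latencies, 0.99))          # 230
print(inverse_quantile(latencies, 20))    # 0.9
```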