Velocity

Life after human error

  • Dr Steven Shorrock
  • Can we move on from this idea?
  • A psychologist - why give a talk on human error?
  • Your systems are highly business critical, and increasingly safety critical
  • Studied human error for 6 years for PhD and after
  • Came to the conclusion that he doesn't know what human error is anymore
  • Someone did something they weren't supposed to do according to someone else.
  • 80% of plane crashes, 90% of road accidents, 70% of data breaches
  • People can hold 7 ± 2 items
  • In the media, the human error figure is always 80% ± 20% (but always plus)
  • Human error is a dubious explanation of failure.
  • "The incident was due to human error and has now been closed" - Common view of how this works.
  • HAL9000 explanation, "this has happened before and it's always been human error"
  • Words create worlds, they create how we think about safety.
  • Error psychology, error taxonomy - "lapse, inattention, mistake, distraction, deviation, negligence, slip"
  • Not all errors are created equal. We'll admit to distraction, but not to recklessness.
  • Player Piano by Kurt Vonnegut - "If it weren't for the people, earth would be an engineer's paradise"
  • What we call human error is a shapeshifting persona. It tends to be a post hoc social judgement.
  • It's hard to define in advance what an error is.
  • Requires an inerrant (error-free) specification, which tends not to exist
  • Can stigmatise or scapegoat.
  • bit.ly/ST4SAFETY - White paper into human error
  • How to deal with error
  • Be mindful of your mindset, how you think about errors
  • Do you think about who did it, or what happened?
  • Studying the system in the context of normal work. People aren't trying to do something different.
  • You can study at any time, not just as a post-mortem.
  • But must be done with field experts, not as an analyst going in.
  • Understand how the system reacts to demand and how people deal with pressure.
  • Systems are supposed to deal with demand, within their resources and constraints
  • A consistent feature of accidents is the presence of pressure on the system and people at the time of the accident.
  • Adjusting is what we do. Performance variability is something you need in the system; it will always happen
  • To deal with varying performance we need to make trade-offs. We trade off efficiency vs thoroughness.
  • Prior to an accident we say you must be efficient; afterwards we say you should have been more thorough

Continuous Security

  • Security - we're the guys who delay projects.
  • No, you can't have a security test, especially not at the rate you are deploying (hundreds of times per day)
  • Still stuck in the waterfall model
  • One control gate before deploy. Focus all testing in the one place
  • Suffers from the same problems as other testing: feedback comes far too late
  • Pen Tests are performed by experts
    • But they are "security" experts, not experts in your domain
  • To fix this, we can look at other properties of the system
  • You can't just buy a security box and make it work.
  • Let's look at how quality testing works in agile.
    • We shifted responsibility
    • Everybody is responsible for quality.
    • We shifted quality testing earlier in the cycle and made developers responsible
    • Automate all the things
  • This is where security testing should move to
    • Move it closer to developers
    • Make it earlier so we get feedback earlier and can fix it faster
  • Quality testing and security testing aren't fundamentally different
  • Security tests are slightly different.
  • Central part of the process, we need a threat model
    • Who is going to attack
    • Where they will attack
    • Do I care?
  • We feed in business context, online games and banking are very different
  • We can arrive at a list of potential threats for the organisation
    • Lightweight risk analysis: impact and likelihood values, and do we care? (a scoring sketch follows below)
    • If we care, implement the appropriate control
    • If we don't, write it down as an accepted risk
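
A minimal Python sketch of that lightweight risk analysis. The Threat class, the 1-5 scales and the threshold are assumptions for illustration, not anything prescribed in the talk:

```python
# Score impact x likelihood, then either demand a control or record
# the risk as accepted - all names and scales here are assumed.
from dataclasses import dataclass

@dataclass
class Threat:
    name: str
    impact: int      # 1 (negligible) .. 5 (severe)
    likelihood: int  # 1 (rare) .. 5 (frequent)

CARE_THRESHOLD = 6   # assumed cut-off below which we accept the risk

def triage(threats):
    for t in threats:
        score = t.impact * t.likelihood
        if score >= CARE_THRESHOLD:
            print(f"{t.name}: score {score} -> implement a control")
        else:
            print(f"{t.name}: score {score} -> accept and write it down")

triage([Threat("user enumeration", impact=2, likelihood=2),
        Threat("SQL injection", impact=5, likelihood=3)])
```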
  • We can divide this into Why, What and How.
    • Why - Threat model
    • What - Security requirements (functional and non-functional)
    • How - Security Tests
  • This needs to be inclusive. The entire team should be involved in the process of building the threat model.
  • Data can leak via functional requirements. In threat modelling we can decide how relevant that is.
    • Example: can you find out whether someone is registered on the site?
    • If you are an online shop, the impact to you or the consumer is pretty low
    • If you are Ashley Madison (dating for cheating spouses) then you do care
  • Security Requirements need to have two features
    • Visible, actionable
    • Useful, testable
  • Example:
    • Bad: User data should be encrypted in transit
    • Better: Data X should use Y encryption when going from Service A to B (see the test sketch below)
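
A minimal sketch of making the "better" requirement testable. The host name is a hypothetical placeholder, and TLS 1.2+ stands in for "Y encryption" between Service A and Service B:

```python
# Assert the channel to Service B negotiates TLSv1.2 or newer.
import socket
import ssl

SERVICE_B_HOST = "service-b.internal.example"  # hypothetical endpoint
SERVICE_B_PORT = 443

def test_data_x_encrypted_in_transit():
    ctx = ssl.create_default_context()
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2  # refuse older protocols
    with socket.create_connection((SERVICE_B_HOST, SERVICE_B_PORT), timeout=5) as sock:
        with ctx.wrap_socket(sock, server_hostname=SERVICE_B_HOST) as tls:
            assert tls.version() in ("TLSv1.2", "TLSv1.3")
```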
  • BDD Specs (Given/When/Then) for testing
  • http://github.com/continuumsecurity/bdd-security
  • Uses JBehave, Selenium, ZAP, Nessus
  • Plus some internal security tools and some pre-written baseline security specifications
  • Aiming at middle-of-the-road apps: webapps that have user login, change-password features and the like.
  • Examples
    • Narrative - why do we want the spec
    • Scenario: Only the required ports should be open
      Given the target host from the base URL
      When TCP ports from 1 to 65535 are scanned using 100 threads and a timeout of 300 milliseconds
      And the open ports are selected
      Then only the following ports should be open
      | port |
      | 80 |
      | 443 |
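
A rough Python sketch of what that scenario checks - not the JBehave step implementation from bdd-security, just the same idea with a thread pool and plain sockets (the target host is a placeholder):

```python
# Scan all TCP ports and assert only the expected ones are open.
import socket
from concurrent.futures import ThreadPoolExecutor

TARGET = "target.example"   # hypothetical host from the base URL
EXPECTED_OPEN = {80, 443}
TIMEOUT_S = 0.3             # 300 milliseconds

def is_open(port: int) -> bool:
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(TIMEOUT_S)
        return s.connect_ex((TARGET, port)) == 0

def test_only_required_ports_open():
    with ThreadPoolExecutor(max_workers=100) as pool:
        results = pool.map(is_open, range(1, 65536))
    open_ports = {p for p, ok in zip(range(1, 65536), results) if ok}
    assert open_ports == EXPECTED_OPEN
```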
  • Test failures can fail the build, or fail the deploy
  • Use the same processes that we use for integration tests in QA
  • Automated security scanning tools: we can fail the build if Nessus returns anything higher than severity 2, for example
    • Also lets you avoid the false positives
  • This mirrors what your security guy probably already does, but now it's repeatable and visible
  • Manual testing via browser and Zap becomes Selenium and Zap-API
  • Nice demo of the features here
  • The value of keeping the false-positives list is that the file is in source control, so you know who added an exception and why (a gating sketch follows below)
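
A minimal sketch of that build gate, assuming scan findings are exported as JSON; the file names and finding format are invented for illustration, not a real Nessus integration:

```python
# Fail the build on any finding above severity 2 unless its ID is in
# the accepted false-positives file kept in source control.
import json
import sys

MAX_SEVERITY = 2

def gate(findings_path: str, accepted_path: str) -> int:
    with open(findings_path) as f:
        findings = json.load(f)       # e.g. [{"id": "...", "severity": 3}, ...]
    with open(accepted_path) as f:
        accepted = set(json.load(f))  # IDs reviewed and accepted, in git

    blockers = [x for x in findings
                if x["severity"] > MAX_SEVERITY and x["id"] not in accepted]
    for x in blockers:
        print(f"BLOCKER {x['id']} severity {x['severity']}")
    return 1 if blockers else 0

if __name__ == "__main__":
    sys.exit(gate("scan-findings.json", "accepted-false-positives.json"))
```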
  • Automating testing access control
    • Requires more domain-specific Selenium work
    • The framework creates an inverse access-control matrix: if you say only Bob should be able to see page X, it will try as all other users and guarantee that nobody else can see Bob's private data (see the sketch below)
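
A minimal sketch of the inverse access-control idea; the fetch_as helper is a hypothetical stub standing in for the domain-specific Selenium work:

```python
# For each page rule, try the page as every user NOT in the allowed
# set and assert they are denied.
ACCESS_RULES = {"/bob/private": {"bob"}}       # page -> users allowed to see it
ALL_USERS = {"alice", "bob", "carol", "guest"}

def fetch_as(user: str, page: str) -> int:
    # Stub: in practice, log in as `user` via Selenium, request `page`,
    # and return the real HTTP status code.
    return 403

def check_inverse_matrix():
    for page, allowed in ACCESS_RULES.items():
        for user in ALL_USERS - allowed:       # the inverse of the rule
            status = fetch_as(user, page)
            assert status in (401, 403), f"{user} could reach {page}"

check_inverse_matrix()
```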
  • The BDD-Security repository can be used to jumpstart your automated testing.
  • Other alternatives
    • ZAP-JUnit github.com/continuumsecurity/zap-webdriver
    • Gauntlet - gauntlt.org
    • Mittn - github.com/F-Secure/mittn

Math

  • CEO of Circonus, data analysis etc.
  • High-level talk - not lots of maths; it's more statistical, and you can get tools to do that
  • We gets lots of numbers, streams of data.
    • Some come slowly, like once a minute
    • Others quickly, like 20,000,000 per second
    • People often think minute-by-minute; Theo wants to either inspire despair or give you some ideas
  • StatsD type solutions, often have weird metrics
    • Counting cups of coffee is different to room temperature measures.
    • So signals have types.
  • We then get two problems. We want to be able to trend the data as well
  • When data is flowing in, you don't have the computational power to use offline mathematical systems
  • But most monitoring or alerting tools want to tell you "am I screwed right now".
  • Some people use different approaches
    • Statistical models
    • Machine learning
    • Ad-hoc (for example 75% disk full)
  • Or you can apply mathematicians instead.
  • Maths people want to be given a clean problem: they don't know what widget_5_count_rate means, but they're told it's important
  • We need to understand the characteristics of the graph
    • Rule 1 - Never monitor the rate of things.
    • Don't monitor the cups of coffee consumed per minute, as you have to specify a time window
    • So "how many cups consumed in the last minute" has to be asked every minute or you lose data
    • Instead count the number of coffee cups, and poll. You can calculate the rate over any time window
    • Derivative of the count = change in value / change in time
  • The term for this is a counter; it generally only increments (see the sketch below)
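
A minimal Python sketch of the counter approach: poll the monotonically increasing count and derive the rate from successive samples, so a late or missed poll loses no information (the coffee-cup numbers are invented):

```python
# Rate = change in counter value / change in time, per interval.
def rates(samples):
    """samples: list of (timestamp_seconds, counter_value) in order."""
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        yield t1, (v1 - v0) / (t1 - t0)   # e.g. cups per second

cups = [(0, 100), (60, 103), (180, 109)]  # the poll at t=120 was missed
print(list(rates(cups)))                  # [(60, 0.05), (180, 0.05)]
```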
  • How do we know to use the derivative?
  • Other metrics might look like two different types
    • For example disk space
    • Gauge: how full is it right now
    • Also a fill rate: how fast is it filling up
  • Apply humans to the problem and they generally do the right thing
  • We can run algorithms against the data, deciding things like: does it have periodicity, does it grow constantly, etc.
  • Then see who graphs similar data and what graph do they use
  • Apply a Bayesian model of features to categories.
  • How do we identify brokenness?
  • Look at a graph, "that would have been interesting to know yesterday"
  • Because of the rate of sampling, the graph appears to have features, but actually there are 30,000 data points
  • We probably average; any time you turn thousands of data points into one, you are losing information.
  • If we zoom in, the data shows it's more complicated.
  • p-values should apply to statistical models.
  • Does a new data value fit the current model? What confidence do you have that the model is right?
    • We have two options:
    • Increase the frequency of collection
    • Widen the collection framework (i.e. track 500 machines' CPUs, not just 1)
  • Increasing the frequency is the only option often (say if you only have 1 database server)
  • Not just an increase in time resolution.
  • We could poll current disk usage every minute
  • Or we could have every IOP report to us when a new block is allocated
  • A diversion: can we take out what we know about the signal?
    • For example, for a signal with periodicity, can we remove the periodicity first?
  • Models need to change; we can't set a baseline of an average of, say, 7, since that will change over time.
  • Control systems use EWM - the Exponentially Weighted Mean (see the sketch below)
    • Uses little storage space: e.g. take 80% of the previous value and add 20% of the new value.
    • Constantly accumulating the change
    • Hard to do offline, because it's hard to find the starting point
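
A minimal EWM sketch in Python; alpha=0.2 gives the "80% of the previous value plus 20% of the new value" accumulation from the note, in constant memory:

```python
# Constant-memory running mean: blend each new value into the average.
def ewm(values, alpha=0.2):
    avg = None
    for x in values:
        avg = x if avg is None else (1 - alpha) * avg + alpha * x
        yield avg

print([round(v, 2) for v in ewm([10, 10, 16, 10])])  # [10, 10.0, 11.2, 10.96]
```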
  • Sliding model
    • Take new value, add to model and remove oldest value from the model
    • To remove the old data, you must keep the entire dataset in memory (for the size of the window)
    • Really easy to repeat offline (see the sketch below).
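
A minimal sliding-window sketch: the whole window has to stay in memory, but the computation replays identically offline:

```python
# Add the new value, drop the oldest, report the window mean.
from collections import deque

def sliding_mean(values, window=3):
    buf = deque(maxlen=window)  # deque evicts the oldest value for us
    for x in values:
        buf.append(x)
        yield sum(buf) / len(buf)

print(list(sliding_mean([1, 2, 3, 4, 5])))  # [1.0, 1.5, 2.0, 3.0, 4.0]
```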
  • Lurching windows
    • A sliding window of 3 one-day buckets: 2 days ago, yesterday and today
    • EWM throughout the day; keep each day's average and put it into the sliding window. Only have to store 6 numbers in memory
  • Ask the question: how well does my data fit the model, versus how well does it fit the null hypothesis?
    • This tests the hypothesis as well; a poor model may fit, but so would the null model
  • Can apply CUSUM, which applies the hypothesis test to the moving average (see the sketch below).
  • Gives you an increasing confidence that the system does not work.
  • Relatively slow on this data, because we don't have enough data.
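
A minimal one-sided CUSUM sketch (a simple textbook formulation, not necessarily the speaker's exact method): deviations above the expected level accumulate, and confidence that something changed grows until the sum crosses a threshold:

```python
# Accumulate deviations above target+slack; alarm once past threshold.
def cusum(values, target, slack=0.5, threshold=5.0):
    s = 0.0
    for i, x in enumerate(values):
        s = max(0.0, s + (x - target - slack))
        if s > threshold:
            yield i   # index where confidence of a change is high

data = [10, 10.2, 9.9, 12, 13, 12.5, 13.2]
print(list(cusum(data, target=10)))  # [5, 6] - alarms at indices 5 and 6
```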
  • Looking at the alternative Tukey test.
  • Your data is not evenly distributed.
  • Many statistical models are designed to work on normal distributions (and standard deviations assume a normal distribution)
  • If you can show the distribution, you can start to do statistics properly.
  • High volume data is different again: doing 10 billion to 1 trillion measurements a second
  • Therefore need a different model, so we need to cheat
  • How do we cheat?
    • Compress the data more
    • Drop the second-level granularity, so values go into a minute bucket, which gives me a count
    • Drop values to 2 significant digits in base 10, so 1.1, 1.18 and 1.19 go into the 1.1-1.2 bucket
    • These are now counted, so we know that in minute 7 there were 25 numbers in the 1.1 bucket - that's only 1 number instead of 25 (see the bucketing sketch below)
    • Extract useful lower-dimensional characteristics, because our data isn't normal
    • e.g. Tukey doesn't require a normal distribution
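
A minimal sketch of that bucketing cheat: floor each value to two significant digits in base 10 and keep one count per (minute, bucket) pair (the sample values are invented):

```python
# Many raw measurements collapse into one counter per bucket.
import math
from collections import Counter

def two_sig_bucket(x: float) -> float:
    """Floor x to 2 significant digits in base 10, e.g. 1.18 -> 1.1."""
    exp = math.floor(math.log10(abs(x))) - 1   # power of the 2nd digit
    b = math.floor(x / 10**exp) * 10**exp
    return round(b, max(0, -exp))              # tidy float noise

hist = Counter()
samples = [(7, 1.1), (7, 1.18), (7, 1.19), (8, 2.34)]  # (minute, value)
for minute, value in samples:
    hist[(minute, two_sig_bucket(value))] += 1

print(hist)  # Counter({(7, 1.1): 3, (8, 2.3): 1})
```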
  • Brendan Gregg - N-value
  • Counts the number of up and down angles in the graph. Gives a good representation of workload.
  • Can do Quantiles, MetaMarkets website
  • Inverse quantiles are also interesting: what percentage of requests are in the 99th percentile (see the sketch below).
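
A minimal sketch of a quantile and its inverse, with invented example values: the quantile maps a percentage to a value, and the inverse quantile maps a value back to the percentage of samples at or below it:

```python
# Quantile: percentage -> value; inverse quantile: value -> percentage.
def quantile(sorted_xs, q):
    i = min(len(sorted_xs) - 1, int(q * len(sorted_xs)))
    return sorted_xs[i]

def inverse_quantile(sorted_xs, value):
    below = sum(1 for x in sorted_xs if x <= value)
    return below / len(sorted_xs)

latencies = sorted([12, 15, 14, 230, 16, 13, 18, 17, 15, 14])  # ms
print(quantile(latencies, 0.99))          # 230
print(inverse_quantile(latencies, 20))    # 0.9
```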