@gtrak
Created September 28, 2015 17:32

Gil Tene - understanding latency workshop

  • Latency, oh shit
  • previous talks
  • He's more interested in retrofitting correctness over what you already have
  • Latency
    • The time it took one operation to happen
    • We care about how latency 'behaves'
    • Measures don't tell people the whole story
    • Never spend time thinking about latency percentile charts
      • that's the chart of the good stuff
      • the max isn't plotted because its range is wayyy out there
    • Median measures the 'good half'
      • how good was the good half?
    • It's easy to go overboard with statistics
      • it's not a normal distribution
      • it's overabstraction
      • the outliers are important
    • Turns out that it usually ramps up sharply at the end
    • Look at ALL of the data
      • if you don't, you tend to project the missing parts
    • Aggregate times waiting
      • useful Amdahl's law effects
      • percentage of time being serial
    • Bumps are usually smooth
    • People tend to build to their requirements, and then they stop
      • latency requirements don't have to be harsh, but you shouldn't ignore them
    • Bends happen everywhere in new systems
    • Bends happen at 9's boundaries (99%, 99.9%, …) in production, because the people looking at them design the systems to meet them.
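The Amdahl's-law effect mentioned above (aggregate time spent waiting serially) can be sketched in a few lines; the function name and numbers are mine, not from the talk. The serial fraction caps total speedup no matter how far the parallel part scales.

```python
# Sketch: the fraction of time spent serially (e.g. aggregated pause time)
# bounds overall speedup, per Amdahl's law.
def amdahl_speedup(serial_fraction, n_workers):
    """Overall speedup when only the non-serial part scales with workers."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_workers)

# 5% serial waiting caps speedup below 20x, however many workers you add.
for n in (2, 10, 100, 10_000):
    print(n, round(amdahl_speedup(0.05, n), 2))
```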
  • Nonsense words - when you find yourself saying it, kick yourself
    • 'outliers'
      • they're the name you give something when you're about to ignore it
    • 'average'
      • meaningless
      • it doesn't 'give you a feel for the common case'
        • that would be the median, which actually happened
      • encourages bad thinking
    • stdev
      • it creates misperception
      • in his example, the 99.999% is 184 standard deviations away from the mean
    • no latency dataset is a normal distribution
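A tiny sketch of why stdev creates misperception on latency data; the dataset here is invented for illustration (not the one from the talk). A single stall in a hundred thousand requests puts the max hundreds of standard deviations from the mean, which a normal-distribution mindset says should never happen.

```python
# Heavy-tailed latency data: stdev-based reasoning breaks down.
import statistics

latencies = [1.0] * 99_999 + [1000.0]   # one 1-second stall among 1 ms requests
mean = statistics.fmean(latencies)
std = statistics.pstdev(latencies)
worst = max(latencies)
print(f"mean={mean:.2f} ms, stdev={std:.2f} ms")
print(f"max is {(worst - mean) / std:.0f} stdevs from the mean")
```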
  • can't care about 2 9's because whatever you do gets aggregated 100s of times
    • CDNs
    • union of probabilities, all operating at 2 9's gives >50% chance of hitting past it for popular websites
    • for his common website example
      • .003% of people will not experience something worse than the 95th percentile
      • 18% will experience something worse than the 99.9th percentile
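The union-of-probabilities point above can be sketched directly (function name and the exact figures are mine): a page load touching N independent resources, each meeting its target at the p-th percentile, misses that target somewhere with probability 1 - p^N.

```python
# Sketch: why per-resource percentiles compound across a fanout.
def chance_of_seeing_tail(p, n_resources):
    """Probability at least one of n_resources exceeds its p-th percentile."""
    return 1.0 - p ** n_resources

# 100 resources each at two nines: most page loads hit the tail somewhere.
print(round(chance_of_seeing_tail(0.99, 100), 3))
# At three nines per resource it drops below 10%.
print(round(chance_of_seeing_tail(0.999, 100), 3))
```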
  • Response over time
    • spikes tend to be accumulated work that needs to get done - hiccups
      • GC
      • reindexing of a DB
      • log flush
    • hiccups are strongly multimodal
    • people say things like '99.9% is projected 2 stdev's from the average'
      • but it's not a normal dist
    • get the actual shape
  • Hiccups by percentile distribution - much more useful
    • you can plot requirements on it directly
  • Profilers
    • they only will help you with the hot code
      • that's the best case, not what you want when fixing latency
    • often the reason you have a bad latency bump on the tail is because you did too much optimization
      • branch predictors/pipelining example
      • most optimizations make common case better at the expense of the worst
  • JVMs pet peeve
    • monitor inflation
    • biased locking
      • on the fast locking path, you do an atomic CAS to lock/unlock, which is slower than not doing it
      • says threads own the lock, only works if no contention
      • if you ever need to take away the lock
        • stall the thread at a safepoint
        • swizzle the stack when it's stopped to look like it's doing a regular lock
      • to lock a biased lock, you have to bring a thread to a stop
        • to do that you bring the jvm to a stop…
          • shit.
      • works poorly with producer/consumer
      • an example of 80% fast-path, high cost in outlier case
      • automatically disabled for 30s at jvm startup to cheat the benchmarks
  • Q: how do you compose SLA's from things like file-systems, jvms, kernels?
    • it's kind of a black art.
    • what do i know that looks like that?
    • just basic human pattern recognition
    • don't ignore data, try to visualize things well and comprehensively
  • If you're going to plot a line, the max is the one you never want to ignore.
    • conversely, plotting the max makes it easier to ignore stuff at smaller ranges, which is also bad
  • Stating Requirements
    • they have to be pass/fail, not "I want it to be fast"
    • don't just use somebody else's
    • measurements should provide data to evaluate requirements
      • you don't know what to measure without having requirements
        • how fast can it go?
          • doesn't tell you when you're going to lose money
    • Process for establishing requirements
      • Q: What are your latency requirements? (what do you think they are?)
      • A: We need an avg response time of 20ms
      • Q: Ok. Typical/average of 20ms.. so what is the worst case requirement?
        • before stuff like retries, the system as a whole
      • A: We don't have one
        • wrong.
      • Q: Is it ok for some things to take more than five hours?
      • A: No way!
      • Q: So I'll write down '5 hours worst case'
      • A: No, that's not what I said, make that 'nothing worse than 100ms'.
        • figure out requirements backwards
      • Q: Are you sure? even if it's only 2 times a day?
      • A: make that 2s
      • Q: Ok, how many times am I allowed to have a 1 second response?
      • A: (annoyed) I thought you said only a few times a day.
      • then eventually they'll figure this out and self-regulate
      • better to drop the good requirements than the worst cases
      • need to go up to 3 9's ish
      • 2 9's means 'are you ok with 1% being at the max?'
      • need to be precise about 'over a given time period'
      • it's ok to be more precise at different times of day
    • You need to write this down, then you can create experiments to measure them.
    • Simple charts are good, because others tend to know what they mean, including CEOs
  • Sysadmins and capacity planners want to plan with a nice error margin
  • Experimentally, it's nice to measure right up to the point where the system's about to fail
    • 'Sustainable Throughput' is what performance testing should be measuring
      • how fast can I run while meeting the requirements?
  • Coordinated omission problem
    • do the ctrl-z test
      • a tool that doesn't correct for this will show no data where the app is suspended,
        • that means it's not weighting pauses appropriately, so you can't see those scenarios
    • common example A - load testing
      • going as fast as you can is not that interesting
        • just tells you that you crashed
        • tells you nothing about the latency when the system is not saturated, which should be most of the time
      • measure/log response time for each request
      • works only if all responses fit within interval
        • implicit 'automatic back off' coordination
        • response-time logging isn't thread-safe, so you need to serialize your requests to get accurate metrics
          • apparently this is a problem with cassandra
      • you HAVE to measure the same number of slow requests as your fast ones, otherwise you don't have good data.
        • not hypothetical, can lead to chasing phantom problems
      • leads to microoptimization
        • they were trying to improve a number they were already better at simply because they measured poorly
      • Correction: cheating twice
      • basically, a lot of bad stuff can happen because of poor measurement
        • jmeter with constant throughput does this
        • jHiccup can correct for coordinated omission
        • YCSB is a nosql test thing
          • has the same problem
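HdrHistogram's correction for coordinated omission (its `recordValueWithExpectedInterval` idea) can be sketched like this; a simplified model, not the real implementation. When a measured latency exceeds the expected interval between requests, the requests that were silently skipped during the stall are back-filled with the latencies they would have seen.

```python
# Sketch: back-fill the samples a coordinated load generator omitted.
def corrected_samples(measured_ms, expected_interval_ms):
    samples = []
    for latency in measured_ms:
        samples.append(latency)
        # requests that should have fired during the stall, each waiting less
        missing = latency - expected_interval_ms
        while missing >= expected_interval_ms:
            samples.append(missing)
            missing -= expected_interval_ms
    return samples

raw = [1, 1, 100, 1]   # one 100 ms stall at 10 ms/request pacing
print(corrected_samples(raw, 10))
```

Without the correction, the single 100 ms sample would be outvoted by the fast ones; with it, the stall is weighted by every request it actually delayed.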
  • HdrHistogram - http://hdrhistogram.github.io/HdrHistogram/
    • a high dynamic range histogram
      • covers a configurable dynamic value range with configurable precision
    • builtin compensation for coordinated omission
    • a fast data structure, fixed time and space
    • the trick to this was about asking the right question
    • can think about it like a floating point number
      • a combination of linear and logarithmic buckets
    • ends up being just some bitswizzling and indexing into an array, like a hash algo
      • ported to lots of langs in the same github org
      • needs a JS port!
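A toy version of the floating-point-like bucketing (my simplification, not the real HdrHistogram layout): the high bits of a value pick a power-of-two (logarithmic) bucket and the next bits pick a linear sub-bucket, so finding a value's counter is just bit-shifting and an array index.

```python
# Sketch: linear sub-buckets inside logarithmic buckets, fixed space.
SUB_BUCKET_BITS = 3                 # 8 linear sub-buckets per log bucket
SUB_BUCKETS = 1 << SUB_BUCKET_BITS  # => bounded relative error of ~1/8

def bucket_index(value):
    bucket = max(value.bit_length() - SUB_BUCKET_BITS, 0)  # logarithmic part
    sub = value >> bucket                                  # linear part
    return bucket * SUB_BUCKETS + sub

counts = [0] * (64 * SUB_BUCKETS)   # fixed space covers the whole value range
for latency in (3, 7, 90, 95, 4000):
    counts[bucket_index(latency)] += 1
```

Nearby values (90 and 95 here) share a counter; that bounded relative error is the trade-off that buys fixed time and space.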
  • Charts
    • the average within a percentile is meaningless
    • what does it mean to average the max in the last 2 hours?
      • it's somewhere between the best and the worst
    • there's no way to reconstruct 5 9's from data about 10 seconds (which is what statsd gives you)
      • this is why you need hdrhistogram
    • Interesting project: https://cronos.codeplex.com/
    • statsd
      • every chart is mislabeled
        • 'upper95' is really the upper 5%, dammit
    • Stalling latency looks like a sharp corner in the latency graph and a line (curve on a log graph)
    • With a queue, you'd expect
    • Step functions are partly explained by coordinated omission
      • implicit queuing via cpu scheduling
      • multimodals can be explained by queueing
    • Usually one problem will dominate the graph
    • If spikes are precisely the same height
      • look for a systemic effect
    • Service time vs Response time
      • response time takes into account a growing queue
      • service time is how fast the barista makes a cup of coffee
      • when you go past a system's limits
        • response time will grow every second
      • To sanity test your performance-like tool, expect a linear graph at load
    • Useful to plot a bunch of lines on top of each other at various loads
  • jHiccup
    • precursor to hdrHistogram
    • discontinuities in java platform execution
    • shows the maximum hiccup size over time
    • since the range of the max is so large, it would hide anything else inside the spike, so it's the only good measure
    • runs as a java agent
    • stupid simple
      • adds a background thread that wakes every millisecond, that's it.
      • if a thread that does nothing couldn't run for 20ms, then nothing else could either
      • 600 lines of code
    • suspending with ctrl-z will show up on the chart
    • cpu contention, power saving, anything will show up on the chart
    • use it to show that your other tools are right
      • if jmeter percentiles are much better, then it's wrong
    • jHiccup shows the best case for response times for anything that can happen in the jvm,
      • creates an upper bound on how good anything that does actual work can be
    • In order to isolate source of hiccups
      • run a separate jhiccup process that does nothing
      • run jhiccup on your app
      • compare the two
      • if they match, it's not your code, it's a whole-system glitch
        • like your vmware instance got moved somewhere
      • if only app jvm shows hiccups
        • look at jvm
      • if neither shows hiccups
        • fix your code
    • Q: what's the most common source of hiccups?
      • a cron job starts
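The jHiccup idea above fits in a few lines; this is my sketch of the mechanism, not jHiccup itself (which is a Java agent). A thread tries to sleep for 1 ms and records how much longer the sleep actually took: if a thread that does nothing couldn't run for 20 ms, nothing else on the runtime could either.

```python
# Sketch: record how late a do-nothing 1 ms sleeper wakes up.
import time

def measure_hiccups(duration_s=1.0, interval_s=0.001):
    hiccups_ms = []
    end = time.monotonic() + duration_s
    while time.monotonic() < end:
        start = time.monotonic()
        time.sleep(interval_s)
        overshoot = (time.monotonic() - start) - interval_s
        hiccups_ms.append(max(overshoot, 0.0) * 1000.0)
    return hiccups_ms

hiccups = measure_hiccups(0.5)
print(f"samples={len(hiccups)}, max hiccup={max(hiccups):.2f} ms")
```

Suspending the process (ctrl-z), CPU contention, or power saving all show up as large samples, just as the talk describes for jHiccup.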
  • latency-utils.org
    • not as widely used
    • common in-process recording
    • sensitive to coordinated omission scenarios
  • You need to be able to drain faster than your max push into the queue, because peaks can be orders of magnitude slower than average
  • being able to plot arrival counts per ms would be really useful
    • you'd be able to model your load tests to recreate production scenarios
  • financial systems have harmonic bursts at intervals because of shared clocks
  • when the server pauses, you don't want to stop banging it

Gil Tene - Faster Objects and Arrays objectlayout.org

  • closing the [last?] inherent C vs java speed gap
    • only one inherent to the structure of the language and what you can express in it
  • working code and proof, targeted for java10ish
  • Gil Tene
    • 'Solved' garbage collection at Azul
    • he's built virtual and physical machines, OS's, cpus, enterprise apps, etc.
    • depresses people by telling them how bad their latency measurements are
  • org.ObjectLayout
    • match the raw speed benefits of C-based languages with common forms of memory layout
    • expose those benefits to normal idiomatic POJOs
    • Speed. For regular Java Objects. On the heap.
  • Not about
    • improved footprint
    • off-heap solutions
    • immutability
    • no relation to value types
    • no relation to packed objects or JNR/FFI
  • Value Types
    • immutable
    • convenient return values without creating classes all the time, tuples
    • on-stack
    • changes happen atomically
    • on-heap
      • atomic changes gets more complicated when they're on heap
    • For new code only
  • Packed Objects
    • precise layout control
      • for eg. TCP headers
    • off-heap or on-heap
    • sharing data
    • For new code only
  • objectLayout (uniquely) works for old code
  • Origin
    • a long argument between Gil and Martin Thompson (mechanical-sympathy)
      • 'we really need structs in java'
        • we build a lot of complicated stuff to do what's easy in C
      • 'we already have structs, they're called objects'
        • what we need is speedy access for the data collection semantics that are faster in C
        • it's about capturing 'enabling semantic limitations'
    • Objects have too many features that make optimizing hard, so objectlayout requires a subset of those to work
    • Speed comes from
      • dead reckoning
        • lookup by data address without data-dependent-load
      • Streaming (array of structs)
        • sequential access through multiple members
        • predictable striding access in memory
        • prefetch logic compensates for miss latency
      • This comes naturally out of flatter memory structures which are easy to express in C
    • Object[] is slower than struct foo[] because the first has pointers
      • since it's OO, it can be subtypes in the array, so we don't know their size at compile-time
    • C has an immutable array of exact-same-type (and same size) structures
    • mutability of the array and non-uniform member size each force de-referencing and break streaming
    • StructuredArray is an immutable array of potentially mutable same-type T objects
      • supports instantiation, get, but not put
  • Common c-style constructs we seek to match
    • array of structs
    • struct with struct inside
    • struct with array at the end
      • struct packet { int length; char body[]; }
  • Using it
    • Can be done with 0 language changes
      • use the lib, jvm will know what it is, can make it fast
    • Can be written to starting with java7
    • Same semantics
  • Modeled after java.util.concurrent
    • captured semantics for fast concurrent operations
    • implement a prototype with basic java: eg. atomiclong CAS can be done with synchronized
    • convince jdks to make this fast, one x86 instruction
  • StructuredArray
    • instantiated via factory method
      • StructuredArray.newInstance(SomeClass.class, 100)
    • all elts are constructed at instantiation time
      • supports arbitrary constructors and args, API looks kind of like reflection
    • Context-based construction
      • can pass a lambda that takes a 'context' object for eg context.getIndex()
    • liveness
      • different from other runtimes, more intuitive
        • other runtimes, the whole collection can be kept alive if an inner elt is alive
    • elts are regular objects
      • can be locked
      • can have identity hashcode
      • can be passed along to any existing java code
    • it's 'natural'
    • indexes are longs (it's 2015)
    • can be nested
    • not constructable directly, to allow the semantics they need
      • because constructors know the size of the object ahead of time
    • a subclassable class that's not constructable! a contradiction? no super()
      • the impl does have a constructor (because java requires it), but only the factory can call it :-)
  • Optimized JDK impl
    • Requires new heap concepts
      • contained and container objects
      • GC moves the whole array and its elts at the same time if it needs to
      • turns out to be easy to implement on all the jvm's they've tried
    • Streaming benefits come without compiler work, b/c of the layout
    • Calculating offsets for dead-reckoning requires a little compiler work
  • Struct in struct
    • intrinsic objects @Intrinsic annotation
    • Intrinsic objects are constructed with a factory
  • Struct with array at the end
    • you are given subclassable objects where you put the 'other stuff'
  • Everything is composable and can be made to be completely flat in memory
  • Status (Sep 2015)
    • vanilla java impls on github
    • fairly mature semantically
    • Intrinsified impls are coming for both openjdk and zing
    • early numbers look good
      • faster than hashmap.get
    • zing will support 'go fast' mode from java7+

When worst is best in distributed systems design - Peter Bailis, Stanford

  • what if we designed systems for worst case scenarios?
    • cluster provisioning: 7.3B simultaneous users
      • many idle resources!
    • hardware: chips for the next mars rover
      • hugely expensive packaging! if it were used as a normal laptop
    • security: all our developers are malicious
      • expensive code deployment!
  • designing for the worst case penalizes the average case
  • when can it improve the average case?
    • has implications for optimization and HCI
  • Almost every non-trivial application today is becoming distributed
  • Networks make it hard
    • packets may be delayed or dropped
    • asynchronous network model
  • Availability addresses delays and drops; any replica can respond to any request
    • if system is available, then even when network is fine, nodes don't have to talk to each other: 'coordination-free'
      • leads to infinite scale-out
  • Distributed transactions
    • suffer from amdahl's law effects
      • coordinated case creates a queue for grabbing a lock
    • coordination-free is much faster
  • What about the CAP?
    • Eric Brewer, Berkeley/Inktomi
      • paper about harvest/yield describes tradeoff between availability and right answer
    • properties like serializability require unavailability or require coordination
      • paper from nancy lynch and gilbert proves the CAP
        • we need to communicate if we need to share state
    • There are many things we can do without giving things up
  • What if we built systems that aren't allowed to coordinate?
    • in many cases, we can build designs that avoid it unless necessary.
  • Read committed isolation
    • never read from uncommitted transactions
      • every DB already implements this guarantee
      • dates back to Gray's initial transaction work (uniprocessor, single-node)
      • most DB's grab a lock on update
        • correct, but slow
        • doesn't scale
  • Multi-versioning
    • much faster than lock for same semantics
  • coordination free is worst-case, accounting for it improves average case
  • Popular guarantees from today's RDBMSs can benefit from this idea.
  • research on Coordination-avoiding systems highlights potential for huge speedups
    • CRDTs, I-confluence, RAMP, HAT, BloomL
  • Other situations where worst-case thinking helps average
    • Replication helps capacity, but it's done for fault-tolerance
    • Failover helps (Dev)Ops
      • can kill processes to perform upgrades
      • to manage stragglers
      • to revoke resources (Mesos)
    • Microservices
      • Tail latency on microservices
      • small reduction in avg latency for a single service for work to reduce the max
      • with 100x fanout that gets magnified
      • you service's corner case may be its consumer's average case
    • Universal design
      • curb-cuts
      • subtitles
      • there's a strong business case for accessibility
  • Lessons to apply in our own designs
    • it doesn't always apply
      • it might if corner cases are common
      • environmental conditions are variable
      • when 'normal' isn't as normal as we think
    • It's a good design tool to let you consider other points of a design space
    • What's our scale out strategy?
    • What happens during bit flips? do we need ECC?
    • security
    • It helps you examine your biases, improve performance and robustness

"All In" with Determinism for Performance and Testing in Distributed Systems - John Hugg, VoltDB

  • Active-Active in Theory
    • replicas perform in parallel a series of deterministic operations
    • don't have to worry about repair detection
    • the order of the operations to be done is a log
      • can replicate it and end up with the same state
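The active-active idea above can be sketched as a toy replicated state machine (my model, not VoltDB's implementation): replicas that independently apply the same log of deterministic operations end up in identical states, so no repair detection is needed.

```python
# Sketch: deterministic operations + same log order => identical replicas.
def apply_log(log, state=None):
    state = dict(state or {})
    for op, key, value in log:
        if op == "set":
            state[key] = value
        elif op == "add":
            state[key] = state.get(key, 0) + value
    return state

log = [("set", "x", 1), ("add", "x", 41), ("set", "y", 7)]
replica_a = apply_log(log)
replica_b = apply_log(log)
assert replica_a == replica_b == {"x": 42, "y": 7}
```

Any nondeterministic op (rand(), wall-clock time) in the log would let the replicas silently diverge, which is exactly what the next section is about.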
  • Sources of nondeterminism are bad
    • rand()
    • wall clock time
    • procedural code containing those things
    • external systems
    • ambiguous queries
  • Non-user sources of nondeterminism
    • bad memory
    • libraries that randomize for security's sake
  • How do we avoid it?
    • VoltDB's SQL planner understands determinism
    • 100% of DML is deterministic
    • If a query is in a RW transaction, a row scan might be swapped for a tree-index scan to achieve determinism
    • Seeded random number generator
    • Can hash results of an operator by all the replicas and compare them from the coordinator process
    • Serializability is the supported isolation level
  • Watch 'Deterministic Simulation Testing' from last year
    • foundationDB
  • How Do We Test ACID?
    • fuzz messages between state machines
    • internal checking within volt itself
    • SQL and parameters for sql are hashed
      • to verify that you've read something correctly, write it out again so hashing works out
    • workload must be nasty

Caitie McCaffrey - Building Stateful Services - Tech Lead 'observability' at twitter

  • Stateless services actually do have state, but we pretend we don't
  • depend too much on DB's
  • one DB doesn't cut it anymore
    • NOSQL
    • sharding
  • the DB state is a leaky abstraction, leaking into our services
  • Chatty clients exacerbate it (games)
  • Load balancing can be wasteful
  • Benefits
    • data locality (the data is where it's being operated on)
      • for low latency and data intensive services
      • Function shipping paradigm
        • the service has its own cache
    • More available consistency models
  • Building
    • Sticky connections
      • the client talks to the same server for the duration of their connection or session
      • Persistent Connections
        • no stickiness once connection breaks
          • get round-robin'd to another server
        • They can pile up on single servers unless you implement backpressure
          • forces client to reconnect
      • Cluster Membership
        • a client is sticky to the cluster
        • who is in my cluster?
      • Work Distribution
        • how am I going to distribute work across my cluster?
  • Cluster Membership
    • Static
      • simple
      • but you have to take down the cluster to make changes, add nodes
      • operationally painful
      • not a good choice for HA services
    • Dynamic
      • add and remove nodes on the fly
      • handles failure/capacity
      • Two ways to do it, availability vs consistency
        • Gossip Protocols
        • Consensus Systems
          • if it's not available, routing can't work
          • last resort if you need HA
  • Work Distribution
    • Random Placement
      • super simple
      • write anywhere
      • read from everywhere
      • not a sticky connection, but can be useful when
        • you have a lot of data
        • queries operate on data that's distributed across the cluster
    • Consistent Hashing
      • deterministic placement of your requests
      • hashing based on the session id or IP or somesuch
      • manhattan db, dynamo, lots of db's use it
      • does not allow you to move work
        • you end up building in extra capacity to compensate
    • Distributed Hash Tables
      • non-deterministic placement
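The consistent-hashing placement described above can be sketched as a hash ring (my simplification; real systems like Dynamo add virtual nodes for balance): nodes are hashed onto a ring, and each session id routes to the first node clockwise from its own hash.

```python
# Sketch: deterministic placement on a hash ring.
import bisect
import hashlib

def ring_hash(key):
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes):
        self.points = sorted((ring_hash(n), n) for n in nodes)

    def node_for(self, session_id):
        h = ring_hash(session_id)
        # first node at or after this hash, wrapping around the ring
        i = bisect.bisect(self.points, (h,)) % len(self.points)
        return self.points[i][1]

ring = Ring(["node-a", "node-b", "node-c"])
# the same session id always routes to the same node
assert ring.node_for("session-42") == ring.node_for("session-42")
```

The determinism is what makes connections sticky, and also what prevents moving work off a hot node, hence the extra capacity mentioned above.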
  • Stateful Services in the Real World
    • Scuba, facebook
      • distributed, in memory db
      • always available
      • does code-regression analysis and bug report, revenue, and performance debugging
      • there's a paper on it
    • Ringpop, Uber
      • open-source node.js lib that brings application-layer sharding to many of their dispatching platform services
      • routing logic
      • Swim Gossip Protocol + consistent Hashing
    • Orleans, extreme computing group at Microsoft Research
      • Actor Model
        • inherently stateful
      • Gossip Protocol
        • because they chose to be highly available (Caitie's preference)
      • Consistent Hashing + DHT
        • consistent hash to the actor ID
        • consistent hash basically routes through a DHT, not a lot of work
      • they were able to consistently use 95% cpu because of work-moving via DHT
  • Caution
    • Implicit assumptions
    • Unbounded data structures
    • Garbage collectors
    • Reloading State
      • first connection
      • recovering from crashes
      • deploying new code
  • 'Fast Restarts at Facebook' paper
  • Should I read papers?
    • yes.
    • it's mostly in database literature
    • don't reinvent your own protocols

Transactions, myths, surprises and opportunities - Martin Kleppmann

  • or 'joys, challenges and misery'
  • material is also in ch. 7 of 'designing data intensive applications'
  • History
    • 1975 IBM System R
    • 2008 NoSQL
    • 2012 "NewSQL"
    • Transactional guarantees are what most differentiates these things
  • ACID
    • 'more mnemonic than precise' - Brewer, 2012
    • Durability
      • originally meant 'able to be written to an archive tape'
        • transaction log
      • now it means 'fsync to disk'
      • now being replaced with 'replication'
    • Consistency
      • C in ACID is not the same as the C in CAP
        • run away if someone says we can't have ACID because of CAP
      • 'tossed in to make the acronym work' - Joe Hellerstein
      • now generally means that the transactions move db state from one consistent state to another
        • some invariants are maintained
      • Really a property of how the application uses the database, not the db
    • Atomicity
      • not about concurrency! confusingly
      • about handling faults (crashes)
        • either all or none of the writes in a tx are committed
      • incomplete changes will be rolled back
      • transactions are about manipulating multiple writes as one unit
        • multi-object atomicity
      • could be called 'Abortability' instead
        • the semantics of what happens when things abort
        • includes
          • deadlock
          • network fault
          • constraint violation
          • crash/power-failure
    • Isolation (concurrency)
      • Serializable?
        • the strongest definition of isolation
        • each transaction happens one after another according to some abstract 'God' clock
        • tends to be slow
      • Weaker schemes came about with system R (let's fiddle with our locks)
        • They're unapproachable because named after implementation details of a particular db
        • Repeatable Read
        • Snapshot Isolation
        • Read Committed
          • not allowed
            • dirty reads
              • another tx can't read data from an uncommitted tx
            • dirty writes
              • a race condition
          • read skew can occur under read-committed
            • two ways to prevent it
              • locks 'Repeatable Read'
              • MVCC - Snapshot isolation
                • oracle 'serializable'
                  • actually not serializable :-)
          • write skew
            • by the time the write is committed the premise of the decision is not true
            • invariants can be violated
            • requires serializability to fix
            • reads before write, but it should've backed out if something else commits first
        • Read Uncomitted
          • Read Uncommitted
      • In practice, we don't understand the differences between isolation levels
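The write-skew anomaly above can be sketched with the classic "doctors on call" example (the example and names are mine, not necessarily the talk's): both transactions read the same consistent snapshot, each sees that the invariant "at least one doctor on call" would survive its own write, and because the writes touch different rows there is no write-write conflict, so snapshot isolation lets both commit.

```python
# Sketch: write skew -- the premise of each decision is checked against a
# snapshot, and is no longer true by the time both writes commit.
on_call = {"alice": True, "bob": True}

def request_leave(doctor, snapshot):
    # decision made against the snapshot, not the current state
    if sum(snapshot.values()) >= 2:
        on_call[doctor] = False   # disjoint writes: no write-write conflict

snapshot = dict(on_call)          # both transactions start concurrently
request_leave("alice", snapshot)
request_leave("bob", snapshot)
print(on_call)                    # invariant violated: nobody is on call
```

Serializable execution would have run the two transactions one after another, making the second one's premise check fail.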
  • How do we actually implement serializability?
    • we can do lots of locking
      • 2 phase locking
      • hold the lock until commit/abort
      • if you're reading the entire DB, you stop all writes
      • this is why the weaker levels came into existence
        • people didn't want to lock everything
    • H-Store and SSI 'Literally serial orders' ~2006
      • gotta make the transactions fast
        • this is why it took so long to happen
      • you literally execute everything in a serial order
      • locality of data and computation
      • maybe everything needs to be in memory
    • you can detect conflicts and abort
      • serializable snapshot isolation
        • snapshot isolation with additional checks
        • used in postgres past 9.1
        • locks like 2PL
          • locks don't block, only gather information
          • at the end of a tx, analysis happens over the conflicts that occurred
          • the tx is aborted if the conflicts are bad, and has to retry
          • an optimistic scheme, works well if contention is not high
          • 2PL is a pessimistic approach
  • Elephant in the room
    • all of these are built for a single db
    • microservices?
      • 2way request/response dataflow via REST/RPC
    • stream processing?
      • 1-way dataflow via queues
    • generally people draw a box around a microservice to create a 'transaction boundary'
    • if you wanted serializable transactions across multiple db's
      • you'd have to run an atomic commit protocol, 2PhaseCommit, 3PC
        • equivalent to consensus protocols
          • paxos, raft
        • total ordering (atomic broadcast)
    • This coordination makes systems brittle, undermines microservices
    • consensus over a high latency WAN sucks
  • What do people do instead?
    • compensating transactions, abort at app level after the fact
    • detect and fix constraint violations after the fact
    • apologize to your clients
  • Every sufficiently large deployment of microservices contains an informally specified… of half of transactions
  • microservice coordination
    • Serializability
    • Causal
    • Eventual Consistency
  • 'consistent snapshot' means it obeys causality
    • can be implemented without coordination
    • but there might be a lot of metadata overhead
  • Read committed can be implemented with coordination-free schemes
    • as can causality
    • so maybe we can do transactions without coordination