- 1. Gil Tene - understanding latency workshop
- 2. Gil Tene - Faster Objects and Arrays objectlayout.org
- 3. When worst is best in distributed systems design - Peter Bailis, Stanford
- 4. "All In" with Determinism for Performance and Testing in Distributed Systems - John Hugg, VoltDB
- 5. Caitie McCaffrey - Building Stateful Services - Tech Lead 'observability' at twitter
- 6. Transactions, myths, surprises and opportunities - Martin Kleppmann
- Latency, oh shit
- previous talks
- He's more interested in retrofitting correctness over what you already have
- Latency
- The time it took one operation to happen
- We care about how latency 'behaves'
- Measures don't tell people the whole story
- Never spend time thinking about latency percentile charts
- that's the chart of the good stuff
- the max isn't plotted because its range is wayyy out there
- Median measures the 'good half'
- how good was the good half?
- It's easy to go overboard with statistics
- it's not a normal distribution
- it's overabstraction
- the outliers are important
- Turns out that it usually ramps up sharply at the end
- Look at ALL of the data
- if you don't, you tend to project the missing parts
- Aggregate times waiting
- useful Amdahl's law effects
- percentage of time being serial
- Bumps are usually smooth
- People tend to build to their requirements, and then they stop
- latency requirements don't have to be harsh, but you shouldn't ignore them
- Bends happen everywhere in new systems
- Bends happen at .9's% boundaries in production, because the people looking at them design the systems to meet them.
- Nonsense words - when you find yourself saying it, kick yourself
- 'outliers'
- they're the name you give something when you're about to ignore it
- 'average'
- meaningless
- it doesn't 'give you a feel for the common case'
- that would be the median, which actually happened
- encourages bad thinking
- stdev
- it creates misperception
- in his example, the 99.999% is 184 standard deviations away from the mean
- no latency dataset is a normal distribution
- can't care about 2 9's because whatever you do gets aggregated 100s of times
- CDNs
- union of probabilities, all operating at 2 9's gives >50% chance of hitting past it for popular websites
- for his common website example
- only .003% of people will experience nothing worse than the 95th percentile
- 18% will experience something worse than the 99.9th percentile
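The fan-out arithmetic here is just a union of independent probabilities. A quick sketch (the ~200 resources per page is an assumed figure, inferred from the quoted percentages):

```python
def p_any_worse(percentile, fanout):
    # chance that at least one of `fanout` independent requests
    # lands past the given per-request percentile
    return 1 - percentile ** fanout

FANOUT = 200  # assumed number of resources fetched per page load

# only ~0.003% of page loads stay entirely under the 95th percentile
p_all_under_95 = 1 - p_any_worse(0.95, FANOUT)
# ~18% of page loads see something worse than the 99.9th percentile
p_worse_999 = p_any_worse(0.999, FANOUT)
```

So a service "operating at 2 9's" with a 100x fan-out pushes most page loads (about 63%) past its own 99th percentile.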
- Response over time
- spikes tend to be accumulated work that needs to get done - hiccups
- GC
- reindexing of a DB
- log flush
- hiccups are strongly multimodal
- people say things like '99.9% is projected 2 stdev's from the average'
- but it's not a normal dist
- get the actual shape
- Hiccups by percentile distribution - much more useful
- you can plot requirements on it directly
- Profilers
- they only will help you with the hot code
- that's the best case, not what you want when fixing latency
- often the reason you have a bad latency bump on the tail is because you did too much optimization
- branch predictors/pipelining example
- most optimizations make common case better at the expense of the worst
- his JVM pet peeve
- monitor inflation
- biased locking
- on the fast locking path, you do an atomic CAS to lock/unlock, which is slower than not doing it
- says threads own the lock, only works if no contention
- if you ever need to take away the lock
- stall the thread at a safepoint
- swizzle the stack when it's stopped to look like it's doing a regular lock
- to lock a biased lock, you have to bring a thread to a stop
- to do that you bring the jvm to a stop…
- shit.
- works poorly with producer/consumer
- an example of 80% fast-path, high cost in outlier case
- automatically disabled for 30s at jvm startup to cheat the benchmarks
- Q: how do you compose SLA's from things like file-systems, jvms, kernels?
- it's kind of a black art.
- what do i know that looks like that?
- just basic human pattern recognition
- don't ignore data, try to visualize things well and comprehensively
- If you're going to plot a line, the max is the one you never want to ignore.
- conversely, plotting the max makes it easier to ignore stuff at smaller ranges, which is also bad
- Stating Requirements
- they have to be pass/fail, not "I want it to be fast"
- don't just use somebody else's
- measurements should provide data to evaluate requirements
- you don't know what to measure without having requirements
- how fast can it go?
- doesn't tell you when you're going to lose money
- Process for establishing requirements
- Q: What are your latency requirements? (what do you think they are?)
- A: We need an avg response time of 20ms
- Q: Ok. Typical/average of 20ms.. so what is the worst case requirement?
- before stuff like retries, the system as a whole
- A: We don't have one
- wrong.
- Q: Is it ok for some things to take more than five hours?
- A: No way!
- Q: So I'll write down '5 hours worst case'
- A: No, that's not what I said, make that 'nothing worse than 100ms'.
- figure out requirements backwards
- Q: Are you sure? even if it's only 2 times a day?
- A: make that 2s
- Q: Ok, how many times am I allowed to have a 1 second response?
- A: (annoyed) I thought you said only a few times a day.
- then eventually they'll figure this out and self-regulate
- better to drop the good requirements than the worst cases
- need to go up to 3 9's ish
- 2 9's means 'are you ok with 1% being at the max?'
- need to be precise about 'over a given time period'
- it's ok to be more precise at different times of day
- You need to write this down, then you can create experiments to measure them.
- Simple charts are good, because others tend to know what they mean, including CEOs
- Sysadmins and capacity planners want to plan with a nice error margin
- Experimentally, it's nice to measure right up to the point where the system's about to fail
- 'Sustainable Throughput' is what performance testing should be measuring
- how fast can I run while meeting the requirements?
- Coordinated omission problem
- do the ctrl-z test
- a tool that doesn't correct for this will show no data where the app is suspended,
- that means it's not weighting pauses appropriately, so you can't see those scenarios
- common example A - load testing
- going as fast as you can is not that interesting
- just tells you that you crashed
- tells you nothing about the latency when the system is not saturated, which should be most of the time
- measure/log response time for each request
- works only if all responses fit within interval
- implicit 'automatic back off' coordination
- response-time logging isn't thread-safe, so you need to serialize your requests to get accurate metrics
- apparently this is a problem with cassandra
- you HAVE to measure the same number of slow requests as your fast ones, otherwise you don't have good data
- not hypothetical, can lead to chasing phantom problems
- leads to microoptimization
- they were trying to improve a number they were already better at simply because they measured poorly
- Correction: cheating twice
- basically, a lot of bad stuff can happen because of poor measurement
- jmeter with constant throughput does this
- jHiccup can correct for coordinated omission
- YCSB is a nosql test thing
- has the same problem
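The correction can be sketched like this (a simplified model of HdrHistogram's expected-interval correction, not the real implementation): whenever a sample exceeds the intended send interval, also record the latencies the requests you silently failed to send would have seen.

```python
def correct_for_coordinated_omission(measured_ms, expected_interval_ms):
    """Back-fill the requests that should have been issued during a stall;
    each one would have waited a little less than the one before it."""
    corrected = []
    for sample in measured_ms:
        corrected.append(sample)
        missed = sample - expected_interval_ms
        while missed >= expected_interval_ms:
            corrected.append(missed)
            missed -= expected_interval_ms
    return corrected

# 99 fast responses plus one 1-second stall, at a 10ms send interval:
naive = [10] * 99 + [1000]
corrected = correct_for_coordinated_omission(naive, 10)
# the naive series hides the 99 requests never sent during the stall,
# so its 90th percentile looks fine while the corrected one does not
```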
- HdrHistogram - http://hdrhistogram.github.io/HdrHistogram/
- a high dynamic range histogram
- covers a configurable dynamic value range with configurable precision
- builtin compensation for coordinated omission
- a fast data structure, fixed time and space
- the trick to this was about asking the right question
- can think about it like a floating point number
- a combination of linear and logarithmic buckets
- ends up being just some bitswizzling and indexing into an array, like a hash algo
- ported to lots of langs in the same github org
- needs a JS port!
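The linear+log bucketing can be sketched as follows (a simplification of HdrHistogram's actual indexing, with assumed constants): the value's magnitude picks a power-of-two bucket and its top bits pick a linear sub-bucket, so indexing is a couple of bit operations, like splitting a float into exponent and mantissa.

```python
SUB_BUCKET_BITS = 11  # 2048 linear sub-buckets per power-of-two range,
                      # giving a worst-case relative error under 0.1%

def index_for(value):
    """Map a positive int to (bucket, sub_bucket) with bit operations."""
    bucket = max(0, value.bit_length() - SUB_BUCKET_BITS)
    return bucket, value >> bucket  # linear position within the bucket

def lowest_equivalent(bucket, sub_bucket):
    """Smallest value that maps to the same slot."""
    return sub_bucket << bucket
```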
- Charts
- the average within a percentile is meaningless
- what does it mean to average the max in the last 2 hours?
- it's somewhere between the best and the worst
- there's no way to reconstruct 5 9's from data about 10 seconds (which is what statsd gives you)
- this is why you need hdrhistogram
- Interesting project: https://cronos.codeplex.com/
- statsd
- every chart is mislabeled
- 'upper 95%' is really the upper 5%, dammit
- Stalling latency looks like a sharp corner in the latency graph and a line (curve on a log graph)
- With a queue, you'd expect
- Step functions are partly explained by coordinated omission
- implicit queuing via cpu scheduling
- multimodals can be explained by queueing
- Usually one problem will dominate the graph
- If spikes are precisely the same height
- look for a systemic effect
- Service time vs Response time
- response time takes into account a growing queue
- service time is how fast the barista makes a cup of coffee
- when you go past a system's limits
- response time will grow every second
- To sanity test your performance-like tool, expect a linear graph at load
- Useful to plot a bunch of lines on top of each other at various loads
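The service-time vs response-time split can be sketched with a single-server queue (a toy model, not any particular tool): hold service time constant, push arrivals past capacity, and response time climbs steadily while the service time never changes.

```python
def simulate(arrival_interval, service_time, n):
    """One barista: response time = queue wait + service time."""
    finish = 0.0
    responses = []
    for i in range(n):
        arrival = i * arrival_interval
        start = max(arrival, finish)  # wait for the server to free up
        finish = start + service_time
        responses.append(finish - arrival)
    return responses

# past the limit (1 req/s in, 0.5 req/s capacity), every second of
# overload adds another second of queueing delay:
over = simulate(arrival_interval=1.0, service_time=2.0, n=100)
```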
- jHiccup
- precursor to hdrHistogram
- discontinuities in java platform execution
- shows the maximum hiccup size over time
- since the range of the max is so large, it would hide anything else inside the spike, so it's the only good measure
- runs as a java agent
- stupid simple
- adds a background thread that wakes every millisecond, that's it.
- if a thread that does nothing couldn't run for 20ms, then nothing else could either
- 600 lines of code
- suspending with ctrl-z will show up on the chart
- cpu contention, power saving, anything will show up on the chart
- use it to show that your other tools are right
- if jmeter percentiles are much better, then it's wrong
- jHiccup shows the best case for response times for anything that can happen in the jvm,
- creates an upper bound on how good anything that does actual work can be
- In order to isolate source of hiccups
- run a separate jhiccup process that does nothing
- run jhiccup on your app
- compare the two
- if they match, it's not your code, it's a whole-system glitch
- like your vmware instance got moved somewhere
- if only app jvm shows hiccups
- look at jvm
- if neither shows hiccups
- fix your code
- Q: what's the most common source of hiccups?
- a cron job starts
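The jHiccup trick is small enough to sketch (a Python toy, not the Java agent): a do-nothing loop sleeps a fixed interval and records the overshoot, since any time stolen from this thread was stolen from the whole process.

```python
import time

def measure_hiccups(duration_s=0.25, interval_s=0.001):
    """Wake every ~1ms; overshoot beyond the requested sleep is a hiccup
    (GC, scheduling, power saving - whatever stalled the platform)."""
    hiccups = []
    deadline = time.perf_counter() + duration_s
    while time.perf_counter() < deadline:
        t0 = time.perf_counter()
        time.sleep(interval_s)
        overshoot = time.perf_counter() - t0 - interval_s
        hiccups.append(max(0.0, overshoot))
    return hiccups

worst = max(measure_hiccups())  # the max is the line worth plotting
```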
- latency-utils.org
- not as widely used
- common in-process recording
- sensitive to coordinated omission scenarios
- You need to be able to drain faster than your max push into the queue, because peaks can be orders of magnitude slower than average
- being able to plot arrival counts per ms would be really useful
- you'd be able to model your load tests to recreate production scenarios
- financial systems have harmonic bursts at intervals because of shared clocks
- when the server pauses, you don't want to stop banging it
- closing the [last?] inherent C vs java speed gap
- only one inherent to the structure of the language and what you can express in it
- working code and proof, targeted for java10ish
- Gil Tene
- 'Solved' garbage collection at Azul
- he's built virtual and physical machines, OS's, cpus, enterprise apps, etc.
- depresses people by telling them how bad their latency measurements are
- org.ObjectLayout
- match the raw speed benefits of C-based languages with common forms of memory layout
- expose those benefits to normal idiomatic POJOs
- Speed. For regular Java Objects. On the heap.
- Not about
- improved footprint
- off-heap solutions
- immutability
- no relation to value types
- no relation to packed objects or JNR/FFI
- Value Types
- immutable
- convenient return values without creating classes all the time, tuples
- on-stack
- changes happen atomically
- on-heap
- atomic changes gets more complicated when they're on heap
- For new code only
- Packed Objects
- precise layout control
- for eg. TCP headers
- off-heap or on-heap
- sharing data
- For new code only
- objectLayout (uniquely) works for old code
- Origin
- a long argument between Gil and Martin Thompson (mechanical-sympathy)
- 'we really need structs in java'
- we build a lot of complicated stuff to do what's easy in C
- 'we already have structs, they're called objects'
- what we need is speedy access for the data collection semantics that are faster in C
- it's about capturing 'enabling semantic limitations'
- Objects have too many features that makes optimizing hard, so objectlayout requires a subset of those to work
- Speed comes from
- dead reckoning
- lookup by data address without data-dependent-load
- Streaming (array of structs)
- sequential access through multiple members
- predictable striding access in memory
- prefetch logic compensates for miss latency
- This comes naturally out of flatter memory structures which are easy to express in C
- Object[] is slower than struct foo[] because the first has pointers
- since it's OO, it can be subtypes in the array, so we don't know their size at compile-time
- C has an immutable array of exact-same-type (and same size) structures
- mutability of the array and non-uniform member size each force de-referencing and break streaming
- StructuredArray is an immutable array of potentially mutable same-type T objects
- supports instantiation, get, but not put
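A toy model of what the flat layout buys, using Python's struct module rather than org.ObjectLayout's actual API: same-size records packed into one buffer, so element i is found by arithmetic (base + i * size) instead of a pointer chase.

```python
import struct

RECORD = struct.Struct("<qd")  # a fixed-size 'struct': one int64, one float64

class FlatArray:
    """Array slots never rebind and every element has the same concrete
    size, so an element's location is dead-reckoned, never dereferenced."""
    def __init__(self, length):
        self.buf = bytearray(RECORD.size * length)  # one contiguous block
        self.length = length

    def get(self, i):
        return RECORD.unpack_from(self.buf, i * RECORD.size)

    def set_fields(self, i, *fields):  # mutate in place; no 'put' of new objects
        RECORD.pack_into(self.buf, i * RECORD.size, *fields)
```

Sequential gets stride predictably through one block of memory, which is what lets prefetchers hide miss latency.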
- Common c-style constructs we seek to match
- array of structs
- struct with struct inside
- struct with array at the end
- struct packet { int length; char body[]; }
- Using it
- Can be done with 0 language changes
- use the lib, jvm will know what it is, can make it fast
- Can be written to starting with java7
- Same semantics
- Modeled after java.util.concurrent
- captured semantics for fast concurrent operations
- implement a prototype with basic java: eg. atomiclong CAS can be done with synchronized
- convince jdks to make this fast, one x86 instruction
- StructuredArray
- instantiated via factory method
- StructuredArray.newInstance(SomeClass.class, 100)
- all elts are constructed at instantiation time
- supports arbitrary constructors and args, API looks kind of like reflection
- Context-based construction
- can pass a lambda that takes a 'context' object for eg context.getIndex()
- liveness
- different from other runtimes, more intuitive
- other runtimes, the whole collection can be kept alive if an inner elt is alive
- elts are regular objects
- can be locked
- can have identity hashcode
- can be passed along to any existing java code
- it's 'natural'
- indexes are longs (it's 2015)
- can be nested
- not constructable directly, to allow the semantics they need
- because constructors know the size of the object ahead of time
- a subclassable class that's not constructable! a contradiction? no super()
- the impl does have a constructor (because java requires it), but only the factory can call it :-)
- Optimized JDK impl
- Requires new heap concepts
- contained and container objects
- GC moves the whole array and its elts at the same time if it needs to
- turns out to be easy to implement on all the jvm's they've tried
- Streaming benefits come without compiler work, b/c of the layout
- Calculating offsets for dead-reckoning requires a little compiler work
- Struct in struct
- intrinsic objects @Intrinsic annotation
- Intrinsic objects are constructed with a factory
- Struct with array at the end
- you are given subclassable objects where you put the 'other stuff'
- Everything is composable and can be made to be completely flat in memory
- Status (Sep 2015)
- vanilla java impls on github
- fairly mature semantically
- Intrinsified impls are coming for both openjdk and zing
- early numbers look good
- faster than hashmap.get
- zing will support 'go fast' mode from java7+
- what if we designed systems for worst case scenarios?
- cluster provisioning: 7.3B simultaneous users
- many idle resources!
- hardware: chips for the next mars rover
- hugely expensive packaging! if it were used as a normal laptop
- security: all our developers are malicious
- expensive code deployment!
- designing for the worst case penalizes the average case
- when can it improve the average case?
- has implications for optimization and HCI
- Almost every non-trivial application today is becoming distributed
- Networks make it hard
- packets may be delayed or dropped
- asynchronous network model
- Availability addresses delays and drops; any replica can respond to any request
- if system is available, then even when network is fine, nodes don't have to talk to each other: 'coordination-free'
- leads to infinite scale-out
- Distributed transactions
- suffer from amdahl's law effects
- coordinated case creates a queue for grabbing a lock
- coordination-free is much faster
- What about the CAP?
- Eric Brewer, Berkeley/Inktomi
- paper about harvest/yield describes tradeoff between availability and right answer
- properties like serializability require unavailability or require coordination
- paper from nancy lynch and gilbert proves the CAP
- we need to communicate if we need to share state
- There are many things we can do without giving things up
- What if we built systems that aren't allowed to coordinate?
- in many cases, we can build designs that avoid it unless necessary.
- Read committed isolation
- never read from uncommitted transactions
- every DB already implements this guarantee
- dates back to Gray's initial transaction work (uniprocessor, single-node)
- most DB's grab a lock on update
- correct, but slow
- doesn't scale
- Multi-versioning
- much faster than lock for same semantics
- coordination free is worst-case, accounting for it improves average case
- Popular guarantees from today's RDBMSs can benefit from this idea.
- research on Coordination-avoiding systems highlights potential for huge speedups
- CRDTs, I-confluence, RAMP, HAT, BloomL
- Other situations where worst-case thinking helps average
- Replication helps capacity, but it's done for fault-tolerance
- Failover helps (Dev)Ops
- can kill processes to perform upgrades
- to manage stragglers
- to revoke resources (Mesos)
- Microservices
- Tail latency on microservices
- give up a little avg latency on a single service for work that reduces its max
- with 100x fanout that gets magnified
- your service's corner case may be its consumer's average case
- Universal design
- curb-cuts
- subtitles
- there's a strong business case for accessibility
- Lessons to apply in our own designs
- it doesn't always apply
- it might if corner cases are common
- environmental conditions are variable
- when 'normal' isn't as normal as we think
- It's a good design tool to let you consider other points of a design space
- What's our scale out strategy?
- What happens during bit flips? do we need ECC?
- security
- It helps you examine your biases, improve performance and robustness
- Active-Active in Theory
- replicas perform in parallel a series of deterministic operations
- don't have to worry about repair detection
- the order of the operations to be done is a log
- can replicate it and end up with the same state
- Sources of nondeterminism are bad
- rand()
- wall clock time
- procedural code containing those things
- external systems
- ambiguous queries
- Non-user sources of nondeterminism
- bad memory
- libraries that randomize for security's sake
- How do we avoid it?
- VoltDB's SQL planner understands determinism
- 100% of DML is deterministic
- If a query is in a RW transaction, a row scan might be swapped for a tree-index scan to achieve determinism
- Seeded random number generator
- Can hash results of an operator by all the replicas and compare them from the coordinator process
- Serializability is the supported isolation level
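The active-active recipe can be sketched as replaying one ordered op log on every replica, with all randomness drawn from a seeded generator that is itself part of the replicated state (a toy illustration, not VoltDB's implementation):

```python
import random

def apply_log(log, seed):
    """Replay the same ordered log with the same seed -> same final state,
    because the one nondeterministic input (rand) is now deterministic."""
    rng = random.Random(seed)
    state = {}
    for op, key in log:
        if op == "put_random":
            state[key] = rng.randint(0, 10**6)
        elif op == "delete":
            state.pop(key, None)
    return state

log = [("put_random", "a"), ("put_random", "b"), ("delete", "a")]
replica1 = apply_log(log, seed=42)
replica2 = apply_log(log, seed=42)  # an independent replica, same log
```

Hashing each replica's state after replay is then enough to detect divergence.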
- Watch 'Deterministic Simulation Testing' from last year
- foundationDB
- How Do We Test ACID?
- fuzz messages between state machines
- internal checking within volt itself
- SQL and parameters for sql are hashed
- to verify that you've read something correctly, write it out again so hashing works out
- workload must be nasty
- Stateless services actually do have state, but we pretend they don't
- depend too much on DB's
- one DB doesn't cut it anymore
- NOSQL
- sharding
- the DB state is a leaky abstraction, leaking into our services
- Chatty clients exacerbate it (games)
- Load balancing can be wasteful
- Benefits
- data locality (the data is where it's being operated on)
- for low latency and data intensive services
- Function shipping paradigm
- the service has its own cache
- More available consistency models
- Building
- Sticky connections
- the client talks to the same server for the duration of their connection or session
- Persistent Connections
- no stickiness once connection breaks
- get round-robin'd to another server
- They can pile up on single servers unless you implement backpressure
- forces client to reconnect
- Cluster Membership
- a client is sticky to the cluster
- who is in my cluster?
- Work Distribution
- how am I going to distribute work across my cluster?
- Static
- simple
- but you have to take down the cluster to make changes, add nodes
- operationally painful
- not a good choice for HA services
- Dynamic
- add and remove nodes on the fly
- handles failure/capacity
- Two ways to do it, availability vs consistency
- Gossip Protocols
- Consensus Systems
- if it's not available, routing can't work
- last resort if you need HA
- Work Distribution
- Random Placement
- super simple
- write anywhere
- read from everywhere
- not a sticky connection, but can be useful when
- you have a lot of data
- queries operate on data that's distributed across the cluster
- Consistent Hashing
- deterministic placement of your requests
- hashing based on the session id or IP or somesuch
- manhattan db, dynamo, lots of db's use it
- does not allow you to move work
- you end up building in extra capacity to compensate
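A minimal consistent-hash ring sketch (virtual nodes and md5 are arbitrary choices here, not any particular DB's scheme): placement is deterministic, and removing a node only moves the keys that node owned.

```python
import bisect
import hashlib

class HashRing:
    """Each node owns many points on a ring; a key belongs to the first
    node point clockwise from the key's own hash."""
    def __init__(self, nodes, vnodes=64):
        self.ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes for i in range(vnodes)
        )
        self.points = [h for h, _ in self.ring]

    @staticmethod
    def _hash(s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def node_for(self, key):
        i = bisect.bisect(self.points, self._hash(key)) % len(self.ring)
        return self.ring[i][1]
```

Deterministic placement is the point: hashing a session id always answers the same node, so routing needs no shared table - but since work can't be moved, hot spots are absorbed with spare capacity.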
- Distributed Hash Tables
- non-deterministic placement
- Stateful Services in the Real World
- Scuba, facebook
- distributed, in memory db
- always available
- used for code-regression analysis and for bug, revenue, and performance debugging
- there's a paper on it
- Ringpop, Uber
- open-source node.js lib that brings application-layer sharding to many of their dispatching platform services
- routing logic
- SWIM gossip protocol + consistent hashing
- Orleans, extreme computing group at Microsoft Research
- Actor Model
- inherently stateful
- Gossip Protocol
- because they chose to be highly available (Caitie's preference)
- Consistent Hashing + DHT
- consistent hash to the actor ID
- consistent hash basically routes through a DHT, not a lot of work
- they were able to consistently use 95% cpu because of work-moving via DHT
- Caution
- Implicit assumptions
- Unbounded data structures
- Garbage collectors
- Reloading State
- first connection
- recovering from crashes
- deploying new code
- 'Fast Restarts at Facebook' paper
- Should I read papers?
- yes.
- it's mostly in database literature
- don't reinvent your own protocols
- or 'joys, challenges and misery'
- material is also in ch. 7 of 'designing data intensive applications'
- History
- 1975 IBM System R
- 2008 NoSQL
- 2012 "NewSQL"
- Transactional guarantees are what most differentiates these things
- ACID
- 'more mnemonic than precise' - Brewer, 2012
- Durability
- originally meant 'able to be written to an archive tape'
- transaction log
- now it means 'fsync to disk'
- now being replaced with 'replication'
- Consistency
- C in ACID is not the same as the C in CAP
- run away if someone says we can't have ACID because of CAP
- 'tossed in to make the acronym work' - Joe Hellerstein
- now generally means that transactions move the db state from one consistent state to another
- some invariants are maintained
- Really a property of how the application uses the database, not the db
- Atomicity
- not about concurrency! confusingly
- about handling faults (crashes)
- either all or none of the writes in a tx are committed
- incomplete changes will be rolled back
- transactions are about manipulating multiple writes as one unit
- multi-object atomicity
- could be called 'Abortability' instead
- the semantics of what happens when things abort
- includes
- deadlock
- network fault
- constraint violation
- crash/power-failure
- Isolation (concurrency)
- Serializable?
- the strongest definition of isolation
- each transaction happens one after another according to some abstract 'God' clock
- tends to be slow
- Weaker schemes came about with system R (let's fiddle with our locks)
- they're unapproachable because they're named after implementation details of a particular db
- Repeatable Read
- Snapshot Isolation
- Read Committed
- not allowed
- dirty reads
- another tx can't read data from an uncommitted tx
- dirty writes
- a race condition
- read skew can occur under read-committed
- two ways to prevent it
- locks 'Repeatable Read'
- MVCC - Snapshot isolation
- oracle 'serializable'
- actually not serializable :-)
- write skew
- by the time the write is committed the premise of the decision is not true
- invariants can be violated
- requires serializability to fix
- reads before write, but it should've backed out if something else commits first
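Write skew in miniature, using the on-call doctors example from the database literature (an illustration, not code from the talk): under snapshot isolation both transactions validate the invariant against the same snapshot, both commit, and "at least one doctor on call" is violated.

```python
snapshot = {"alice": True, "bob": True}  # both doctors on call

def go_off_call(snap, who):
    # the premise is checked against the snapshot, not the latest committed state
    if sum(snap.values()) >= 2:  # "someone else is still on call"
        return {who: False}
    return {}

# two concurrent transactions, each reading the same snapshot:
write1 = go_off_call(snapshot, "alice")
write2 = go_off_call(snapshot, "bob")
committed = {**snapshot, **write1, **write2}
# each write was valid against its own snapshot, but the combined result
# leaves nobody on call - the premise was stale by commit time
```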
- Read Uncommitted
- In practice, we don't understand the differences between isolation levels
- How do we actually implement serializability?
- we can do lots of locking
- 2 phase locking
- hold the lock until commit/abort
- if you're reading the entire DB, you stop all writes
- this is why the weaker levels came into existence
- people didn't want to lock everything
- H-Store and SSI 'Literally serial orders' ~2006
- gotta make the transactions fast
- this is why it took so long to happen
- you literally execute everything in a serial order
- locality of data and computation
- maybe everything needs to be in memory
- you can detect conflicts and abort
- serializable snapshot isolation
- snapshot isolation with additional checks
- used in postgres past 9.1
- locks like 2PL
- locks don't block, only gather information
- at the end of a tx, analysis happens over the conflicts that occurred
- the tx is aborted if the conflicts are bad, and has to retry
- an optimistic scheme, works well if contention is not high
- 2PL is a pessimistic approach
- Elephant in the room
- all of these are built for a single db
- microservices?
- 2way request/response dataflow via REST/RPC
- stream processing?
- 1-way dataflow via queues
- generally people draw a box around a microservice to create a 'transaction boundary'
- if you wanted serializable transactions across multiple db's
- you'd have to run an atomic commit protocol, 2PhaseCommit, 3PC
- equivalent to consensus protocols
- paxos, raft
- total ordering (atomic broadcast)
- This coordination makes systems brittle, undermines microservices
- consensus over a high latency WAN sucks
- What do people do instead?
- compensating transactions, abort at app level after the fact
- detect and fix constraint violations after the fact
- apologize to your clients
- Every sufficiently large deployment of microservices contains an informally specified… of half of transactions
- microservice coordination
- Serializability
- Causal
- Eventual Consistency
- 'consistent snapshot' means it obeys causality
- can be implemented without coordination
- but there might be a lot of metadata overhead
- Read committed can be implemented with coordination-free schemes
- as can causality
- so maybe we can do transactions without coordination
- 1230pm oreilly booth @peabody free books