- 1. Gil Tene - understanding latency workshop
- 2. Gil Tene - Faster Objects and Arrays objectlayout.org
- 3. When worst is best in distributed systems design - Peter Bailis, Stanford
- 4. "All In" with Determinism for Performance and Testing in Distributed Systems - John Hugg, VoltDB
- 5. Caitie McCaffrey - Building Stateful Services - Tech Lead 'observability' at twitter
- 6. Transactions, myths, surprises and opportunities - Martin Kleppmann
- Latency, oh shit
- previous talks
- He's more interested in retrofitting correctness over what you already have
- Latency
- The time it took one operation to happen
- We care about how latency 'behaves'
- Measures don't tell people the whole story
- Never spend time thinking about latency percentile charts
- that's the chart of the good stuff
- the max isn't plotted because its range is wayyy out there
- Median measures the 'good half'
- how good was the good half?
- It's easy to go overboard with statistics
- it's not a normal distribution
- it's overabstraction
- the outliers are important
- Turns out that it usually ramps up sharply at the end
- Look at ALL of the data
- if you don't, you tend to project the missing parts
- Aggregate times waiting
- useful Amdahl's law effects
- percentage of time being serial
- Bumps are usually smooth
- People tend to build to their requirements, and then they stop
- latency requirements don't have to be harsh, but you shouldn't ignore them
- Bends happen everywhere in new systems
- Bends happen at .9's% boundaries in production, because the people looking at them design the systems to meet them.
- Nonsense words - when you find yourself saying it, kick yourself
- 'outliers'
- they're the name you give something when you're about to ignore it
- 'average'
- meaningless
- it doesn't 'give you a feel for the common case'
- that would be the median, which actually happened
- encourages bad thinking
- stdev
- it creates misperception
- in his example, the 99.999% is 184 standard deviations away from the mean
- no latency dataset is a normal distribution
- can't care about 2 9's because whatever you do gets aggregated 100s of times
- CDNs
- union of probabilities, all operating at 2 9's gives >50% chance of hitting past it for popular websites
- for his common website example
- only .003% of people will experience nothing worse than the 95th percentile
- 18% will experience something worse than the 99.9th percentile
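The fan-out arithmetic here is just a union of independent probabilities. A quick sketch (the ~200 resources per page is an assumed figure, inferred from the quoted percentages):

```python
def p_any_worse(percentile, fanout):
    # chance that at least one of `fanout` independent requests
    # lands past the given per-request percentile
    return 1 - percentile ** fanout

FANOUT = 200  # assumed number of resources fetched per page load

# only ~0.003% of page loads stay entirely under the 95th percentile
p_all_under_95 = 1 - p_any_worse(0.95, FANOUT)
# ~18% of page loads see something worse than the 99.9th percentile
p_worse_999 = p_any_worse(0.999, FANOUT)
```

So a service "operating at 2 9's" with a 100x fan-out pushes most page loads (about 63%) past its own 99th percentile.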
- Response over time
- spikes tend to be accumulated work that needs to get done - hiccups
- GC
- reindexing of a DB
- log flush
- hiccups are strongly multimodal
- people say things like '99.9% is projected 2 stdev's from the average'
- but it's not a normal dist
- get the actual shape
- Hiccups by percentile distribution - much more useful
- you can plot requirements on it directly
- Profilers
- they only will help you with the hot code
- that's the best case, not what you want when fixing latency
- often the reason you have a bad latency bump on the tail is because you did too much optimization
- branch predictors/pipelining example
- most optimizations make common case better at the expense of the worst
- his JVM pet peeve
- monitor inflation
- biased locking
- on the fast locking path, you do an atomic CAS to lock/unlock, which is slower than not doing it
- says threads own the lock, only works if no contention
- if you ever need to take away the lock
- stall the thread at a safepoint
- swizzle the stack when it's stopped to look like it's doing a regular lock
- to lock a biased lock, you have to bring a thread to a stop
- to do that you bring the jvm to a stop…
- shit.
- works poorly with producer/consumer
- an example of 80% fast-path, high cost in outlier case
- automatically disabled for 30s at jvm startup to cheat the benchmarks
- Q: how do you compose SLA's from things like file-systems, jvms, kernels?
- it's kind of a black art.
- what do i know that looks like that?
- just basic human pattern recognition
- don't ignore data, try to visualize things well and comprehensively
- If you're going to plot a line, the max is the one you never want to ignore.
- conversely, plotting the max makes it easier to ignore stuff at smaller ranges, which is also bad
- Stating Requirements
- they have to be pass/fail, not "I want it to be fast"
- don't just use somebody else's
- measurements should provide data to evaluate requirements
- you don't know what to measure without having requirements
- how fast can it go?
- doesn't tell you when you're going to lose money
- Process for establishing requirements
- Q: What are your latency requirements? (what do you think they are?)
- A: We need an avg response time of 20ms
- Q: Ok. Typical/average of 20ms.. so what is the worst case requirement?
- before stuff like retries, the system as a whole
- A: We don't have one
- wrong.
- Q: Is it ok for some things to take more than five hours?
- A: No way!
- Q: So I'll write down '5 hours worst case'
- A: No, that's not what I said, make that 'nothing worse than 100ms'.
- figure out requirements backwards
- Q: Are you sure? even if it's only 2 times a day?
- A: make that 2s
- Q: Ok, how many times am I allowed to have a 1 second response?
- A: (annoyed) I thought you said only a few times a day.
- then eventually they'll figure this out and self-regulate
- better to drop the good requirements than the worst cases
- need to go up to 3 9's ish
- 2 9's means 'are you ok with 1% being at the max?'
- need to be precise about 'over a given time period'
- it's ok to be more precise at different times of day
- You need to write this down, then you can create experiments to measure them.
- Simple charts are good, because others tend to know what they mean, including CEOs
- Sysadmins and capacity planners want to plan with a nice error margin
- Experimentally, it's nice to measure right up to the point where the system's about to fail
- 'Sustainable Throughput' is what performance testing should be measuring
- how fast can I run while meeting the requirements?
- Coordinated omission problem
- do the ctrl-z test
- a tool that doesn't correct for this will show no data where the app is suspended,
- that means it's not weighting pauses appropriately, so you can't see those scenarios
- common example A - load testing
- going as fast as you can is not that interesting
- just tells you that you crashed
- tells you nothing about the latency when the system is not saturated, which should be most of the time
- measure/log response time for each request
- works only if all responses fit within interval
- implicit 'automatic back off' coordination
- response-time logging isn't thread-safe, so you need to serialize your requests to get accurate metrics
- apparently this is a problem with cassandra
- you HAVE to measure the same number of slow requests as your fast ones, otherwise you don't have good data
- not hypothetical, can lead to chasing phantom problems
- leads to microoptimization
- they were trying to improve a number they were already better at simply because they measured poorly
- Correction: cheating twice
- basically, a lot of bad stuff can happen because of poor measurement
- jmeter with constant throughput does this
- jHiccup can correct for coordinated omission
- YCSB is a nosql test thing
- has the same problem
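The correction can be sketched like this (a simplified model of HdrHistogram's expected-interval correction, not the real implementation): whenever a sample exceeds the intended send interval, also record the latencies the requests you silently failed to send would have seen.

```python
def correct_for_coordinated_omission(measured_ms, expected_interval_ms):
    """Back-fill the requests that should have been issued during a stall;
    each one would have waited a little less than the one before it."""
    corrected = []
    for sample in measured_ms:
        corrected.append(sample)
        missed = sample - expected_interval_ms
        while missed >= expected_interval_ms:
            corrected.append(missed)
            missed -= expected_interval_ms
    return corrected

# 99 fast responses plus one 1-second stall, at a 10ms send interval:
naive = [10] * 99 + [1000]
corrected = correct_for_coordinated_omission(naive, 10)
# the naive series hides the 99 requests never sent during the stall,
# so its 90th percentile looks fine while the corrected one does not
```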
- HdrHistogram - http://hdrhistogram.github.io/HdrHistogram/
- a high dynamic range histogram
- covers a configurable dynamic value range with configurable precision
- builtin compensation for coordinated omission
- a fast data structure, fixed time and space
- the trick to this was about asking the right question
- can think about it like a floating point number
- a combination of linear and logarithmic buckets
- ends up being just some bitswizzling and indexing into an array, like a hash algo
- ported to lots of langs in the same github org
- needs a JS port!
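The linear+log bucketing can be sketched as follows (a simplification of HdrHistogram's actual indexing, with assumed constants): the value's magnitude picks a power-of-two bucket and its top bits pick a linear sub-bucket, so indexing is a couple of bit operations, like splitting a float into exponent and mantissa.

```python
SUB_BUCKET_BITS = 11  # 2048 linear sub-buckets per power-of-two range,
                      # giving a worst-case relative error under 0.1%

def index_for(value):
    """Map a positive int to (bucket, sub_bucket) with bit operations."""
    bucket = max(0, value.bit_length() - SUB_BUCKET_BITS)
    return bucket, value >> bucket  # linear position within the bucket

def lowest_equivalent(bucket, sub_bucket):
    """Smallest value that maps to the same slot."""
    return sub_bucket << bucket
```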
- Charts
- the average within a percentile is meaningless
- what does it mean to average the max in the last 2 hours?
- it's somewhere between the best and the worst
- there's no way to reconstruct 5 9's from data about 10 seconds (which is what statsd gives you)
- this is why you need hdrhistogram
- Interesting project: https://cronos.codeplex.com/
- statsd
- every chart is mislabeled
- 'upper 95%' is really the upper 5%, dammit
- Stalling latency looks like a sharp corner in the latency graph and a line (curve on a log graph)
- With a queue, you'd expect
- Step functions are partly explained by coordinated omission
- implicit queuing via cpu scheduling
- multimodals can be explained by queueing
- Usually one problem will dominate the graph
- If spikes are precisely the same height
- look for a systemic effect
- Service time vs Response time
- response time takes into account a growing queue
- service time is how fast the barista makes a cup of coffee
- when you go past a system's limits
- response time will grow every second
- To sanity test your performance-like tool, expect a linear graph at load
- Useful to plot a bunch of lines on top of each other at various loads
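The service-time vs response-time split can be sketched with a single-server queue (a toy model, not any particular tool): hold service time constant, push arrivals past capacity, and response time climbs steadily while the service time never changes.

```python
def simulate(arrival_interval, service_time, n):
    """One barista: response time = queue wait + service time."""
    finish = 0.0
    responses = []
    for i in range(n):
        arrival = i * arrival_interval
        start = max(arrival, finish)  # wait for the server to free up
        finish = start + service_time
        responses.append(finish - arrival)
    return responses

# past the limit (1 req/s in, 0.5 req/s capacity), every second of
# overload adds another second of queueing delay:
over = simulate(arrival_interval=1.0, service_time=2.0, n=100)
```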
- jHiccup
- precursor to hdrHistogram
- discontinuities in java platform execution
- shows the maximum hiccup size over time
- since the range of the max is so large, it would hide anything else inside the spike, so it's the only good measure
- runs as a java agent
- stupid simple
- adds a background thread that wakes every millisecond, that's it.
- if a thread that does nothing couldn't run for 20ms, then nothing else could either
- 600 lines of code
- suspending with ctrl-z will show up on the chart
- cpu contention, power saving, anything will show up on the chart
- use it to show that your other tools are right
- if jmeter percentiles are much better, then it's wrong
- jHiccup shows the best case for response times for anything that can happen in the jvm,
- creates an upper bound on how good anything that does actual work can be
- In order to isolate source of hiccups
- run a separate jhiccup process that does nothing
- run jhiccup on your app
- compare the two
- if they match, it's not your code, it's a whole-system glitch
- like your vmware instance got moved somewhere
- if only app jvm shows hiccups
- look at jvm
- if neither shows hiccups
- fix your code
- Q: what's the most common source of hiccups?
- a cron job starts
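The jHiccup trick is small enough to sketch (a Python toy, not the Java agent): a do-nothing loop sleeps a fixed interval and records the overshoot, since any time stolen from this thread was stolen from the whole process.

```python
import time

def measure_hiccups(duration_s=0.25, interval_s=0.001):
    """Wake every ~1ms; overshoot beyond the requested sleep is a hiccup
    (GC, scheduling, power saving - whatever stalled the platform)."""
    hiccups = []
    deadline = time.perf_counter() + duration_s
    while time.perf_counter() < deadline:
        t0 = time.perf_counter()
        time.sleep(interval_s)
        overshoot = time.perf_counter() - t0 - interval_s
        hiccups.append(max(0.0, overshoot))
    return hiccups

worst = max(measure_hiccups())  # the max is the line worth plotting
```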
- latency-utils.org
- not as widely used
- common in-process recording
- sensitive to coordinated omission scenarios
- You need to be able to drain faster than your max push into the queue, because peaks can be orders of magnitude slower than average
- being able to plot arrival counts per ms would be really useful
- you'd be able to model your load tests to recreate production scenarios
- financial systems have harmonic bursts at intervals because of shared clocks
- when the server pauses, you don't want to stop banging it
- closing the [last?] inherent C vs java speed gap
- only one inherent to the structure of the language and what you can express in it
- working code and proof, targeted for java10ish
- Gil Tene
- 'Solved' garbage collection at Azul
- he's built virtual and physical machines, OS's, cpus, enterprise apps, etc.
- depresses people by telling them how bad their latency measurements are
- org.ObjectLayout
- match the raw speed benefits of C-based languages with common forms of memory layout
- expose those benefits to normal idiomatic POJOs
- Speed. For regular Java Objects. On the heap.
- Not about
- improved footprint
- off-heap solutions
- immutability
- no relation to value types
- no relation to packed objects or JNR/FFI
- Value Types
- immutable
- convenient return values without creating classes all the time, tuples
- on-stack
- changes happen atomically
- on-heap
- atomic changes gets more complicated when they're on heap
- For new code only
- Packed Objects
- precise layout control
- for eg. TCP headers
- off-heap or on-heap
- sharing data
- For new code only
- objectLayout (uniquely) works for old code
- Origin
- a long argument between Gil and Martin Thompson (mechanical-sympathy)
- 'we really need structs in java'
- we build a lot of complicated stuff to do what's easy in C
- 'we already have structs, they're called objects'
- what we need is speedy access for the data collection semantics that are faster in C
- it's about capturing 'enabling semantic limitations'
- Objects have too many features that makes optimizing hard, so objectlayout requires a subset of those to work
- Speed comes from
- dead reckoning
- lookup by data address without data-dependent-load
- Streaming (array of structs)
- sequential access through multiple members
- predictable striding access in memory
- prefetch logic compensates for miss latency
- This comes naturally out of flatter memory structures which are easy to express in C
- Object[] is slower than struct foo[] because the first has pointers
- since it's OO, it can be subtypes in the array, so we don't know their size at compile-time
- C has an immutable array of exact-same-type (and same size) structures
- mutability of the array and non-uniform member size each force de-referencing and break streaming
- StructuredArray is an immutable array of potentially mutable same-type T objects
- supports instantiation, get, but not put
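A toy model of what the flat layout buys, using Python's struct module rather than org.ObjectLayout's actual API: same-size records packed into one buffer, so element i is found by arithmetic (base + i * size) instead of a pointer chase.

```python
import struct

RECORD = struct.Struct("<qd")  # a fixed-size 'struct': one int64, one float64

class FlatArray:
    """Array slots never rebind and every element has the same concrete
    size, so an element's location is dead-reckoned, never dereferenced."""
    def __init__(self, length):
        self.buf = bytearray(RECORD.size * length)  # one contiguous block
        self.length = length

    def get(self, i):
        return RECORD.unpack_from(self.buf, i * RECORD.size)

    def set_fields(self, i, *fields):  # mutate in place; no 'put' of new objects
        RECORD.pack_into(self.buf, i * RECORD.size, *fields)
```

Sequential gets stride predictably through one block of memory, which is what lets prefetchers hide miss latency.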
- Common c-style constructs we seek to match
- array of structs
- struct with struct inside
- struct with array at the end
- struct packet { int length; char body[]; }
- Using it
- Can be done with 0 language changes
- use the lib, jvm will know what it is, can make it fast
- Can be written to starting with java7
- Same semantics
- Modeled after java.util.concurrent
- captured semantics for fast concurrent operations
- implement a prototype with basic java: eg. atomiclong CAS can be done with synchronized
- convince jdks to make this fast, one x86 instruction
- StructuredArray
- instantiated via factory method
- StructuredArray.newInstance(SomeClass.class, 100)
- all elts are constructed at instantiation time
- supports arbitrary constructors and args, API looks kind of like reflection
- Context-based construction
- can pass a lambda that takes a 'context' object for eg context.getIndex()
- liveness
- different from other runtimes, more intuitive
- other runtimes, the whole collection can be kept alive if an inner elt is alive
- elts are regular objects
- can be locked
- can have identity hashcode
- can be passed along to any existing java code
- it's 'natural'
- indexes are longs (it's 2015)
- can be nested
- not constructable directly, to allow the semantics they need
- because constructors know the size of the object ahead of time
- a subclassable class that's not constructable! a contradiction? no super()
- the impl does have a constructor (because java requires it), but only the factory can call it :-)
- Optimized JDK impl
- Requires new heap concepts
- contained and container objects
- GC moves the whole array and its elts at the same time if it needs to
- turns out to be easy to implement on all the jvm's they've tried
- Streaming benefits come without compiler work, b/c of the layout
- Calculating offsets for dead-reckoning requires a little compiler work
- Struct in struct
- intrinsic objects @Intrinsic annotation
- Intrinsic objects are constructed with a factory
- Struct with array at the end
- you are given subclassable objects where you put the 'other stuff'
- Everything is composable and can be made to be completely flat in memory
- Status (Sep 2015)
- vanilla java impls on github
- fairly mature semantically
- Intrinsified impls are coming for both openjdk and zing
- early numbers look good
- faster than hashmap.get
- zing will support 'go fast' mode from java7+
- what if we designed systems for worst case scenarios?
- cluster provisioning: 7.3B simultaneous users
- many idle resources!
- hardware: chips for the next mars rover
- hugely expensive packaging! if it were used as a normal laptop
- security: all our developers are malicious
- expensive code deployment!
- designing for the worst case penalizes the average case
- when can it improve the average case?
- has implications for optimization and HCI
- Almost every non-trivial application today is becoming distributed
- Networks make it hard
- packets may be delayed or dropped
- asynchronous network model
- Availability addresses delays and drops; any replica can respond to any request
- if system is available, then even when network is fine, nodes don't have to talk to each other: 'coordination-free'
- leads to infinite scale-out
- Distributed transactions
- suffer from amdahl's law effects
- coordinated case creates a queue for grabbing a lock
- coordination-free is much faster
- What about the CAP?
- Eric Brewer, Berkeley/Inktomi
- paper about harvest/yield describes tradeoff between availability and right answer
- properties like serializability require unavailability or require coordination
- paper from nancy lynch and gilbert proves the CAP
- we need to communicate if we need to share state
- There are many things we can do without giving things up
- What if we built systems that aren't allowed to coordinate?
- in many cases, we can build designs that avoid it unless necessary.
- Read committed isolation
- never read from uncommitted transactions
- every DB already implements this guarantee
- dates back to Gray's initial transaction work (uniprocessor, single-node)
- most DB's grab a lock on update
- correct, but slow
- doesn't scale
- Multi-versioning
- much faster than lock for same semantics
- coordination free is worst-case, accounting for it improves average case
- Popular guarantees from today's RDBMSs can benefit from this idea.
- research on Coordination-avoiding systems highlights potential for huge speedups
- CRDTs, I-confluence, RAMP, HAT, BloomL
- Other situations where worst-case thinking helps average
- Replication helps capacity, but it's done for fault-tolerance
- Failover helps (Dev)Ops
- can kill processes to perform upgrades
- to manage stragglers
- to revoke resources (Mesos)
- Microservices
- Tail latency on microservices
- give up a little avg latency on a single service for work that reduces its max
- with 100x fanout that gets magnified
- your service's corner case may be its consumer's average case
- Universal design
- curb-cuts
- subtitles
- there's a strong business case for accessibility
- Lessons to apply in our own designs
- it doesn't always apply
- it might if corner cases are common
- environmental conditions are variable
- when 'normal' isn't as normal as we think
- It's a good design tool to let you consider other points of a design space
- What's our scale out strategy?
- What happens during bit flips? do we need ECC?
- security
- It helps you examine your biases, improve performance and robustness
- Active-Active in Theory
- replicas perform in parallel a series of deterministic operations
- don't have to worry about repair detection
- the order of the operations to be done is a log
- can replicate it and end up with the same state
- Sources of nondeterminism are bad
- rand()
- wall clock time
- procedural code containing those things
- external systems
- ambiguous queries
- Non-user sources of nondeterminism
- bad memory
- libraries that randomize for security's sake
- How do we avoid it?
- VoltDB's SQL planner understands determinism
- 100% of DML is deterministic
- If a query is in a RW transaction, a row scan might be swapped for a tree-index scan to achieve determinism
- Seeded random number generator
- Can hash results of an operator by all the replicas and compare them from the coordinator process
- Serializability is the supported isolation level
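The active-active recipe can be sketched as replaying one ordered op log on every replica, with all randomness drawn from a seeded generator that is itself part of the replicated state (a toy illustration, not VoltDB's implementation):

```python
import random

def apply_log(log, seed):
    """Replay the same ordered log with the same seed -> same final state,
    because the one nondeterministic input (rand) is now deterministic."""
    rng = random.Random(seed)
    state = {}
    for op, key in log:
        if op == "put_random":
            state[key] = rng.randint(0, 10**6)
        elif op == "delete":
            state.pop(key, None)
    return state

log = [("put_random", "a"), ("put_random", "b"), ("delete", "a")]
replica1 = apply_log(log, seed=42)
replica2 = apply_log(log, seed=42)  # an independent replica, same log
```

Hashing each replica's state after replay is then enough to detect divergence.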
- Watch 'Deterministic Simulation Testing' from last year
- foundationDB
- How Do We Test ACID?
- fuzz messages between state machines
- internal checking within volt itself
- SQL and parameters for sql are hashed
- to verify that you've read something correctly, write it out again so hashing works out
- workload must be nasty
- Stateless services actually do have state, but we pretend they don't
- depend too much on DB's
- one DB doesn't cut it anymore
- NOSQL
- sharding
- the DB state is a leaky abstraction, leaking into our services
- Chatty clients exacerbate it (games)
- Load balancing can be wasteful
- Benefits
- data locality (the data is where it's being operated on)
- for low latency and data intensive services
- Function shipping paradigm
- the service has its own cache
- More available consistency models
- Building
- Sticky connections
- the client talks to the same server for the duration of their connection or session
- Persistent Connections
- no stickiness once connection breaks
- get round-robin'd to another server
- They can pile up on single servers unless you implement backpressure
- forces client to reconnect
- Cluster Membership
- a client is sticky to the cluster
- who is in my cluster?
- Work Distribution
- how am I going to distribute work across my cluster?
- Static
- simple
- but you have to take down the cluster to make changes, add nodes
- operationally painful
- not a good choice for HA services
- Dynamic
- add and remove nodes on the fly
- handles failure/capacity
- Two ways to do it, availability vs consistency
- Gossip Protocols
- Consensus Systems
- if it's not available, routing can't work
- last resort if you need HA
- Work Distribution
- Random Placement
- super simple
- write anywhere
- read from everywhere
- not a sticky connection, but can be useful when
- you have a lot of data
- queries operate on data that's distributed across the cluster
- Consistent Hashing
- deterministic placement of your requests
- hashing based on the session id or IP or somesuch
- manhattan db, dynamo, lots of db's use it
- does not allow you to move work
- you end up building in extra capacity to compensate
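A minimal consistent-hash ring sketch (virtual nodes and md5 are arbitrary choices here, not any particular DB's scheme): placement is deterministic, and removing a node only moves the keys that node owned.

```python
import bisect
import hashlib

class HashRing:
    """Each node owns many points on a ring; a key belongs to the first
    node point clockwise from the key's own hash."""
    def __init__(self, nodes, vnodes=64):
        self.ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes for i in range(vnodes)
        )
        self.points = [h for h, _ in self.ring]

    @staticmethod
    def _hash(s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def node_for(self, key):
        i = bisect.bisect(self.points, self._hash(key)) % len(self.ring)
        return self.ring[i][1]
```

Deterministic placement is the point: hashing a session id always answers the same node, so routing needs no shared table - but since work can't be moved, hot spots are absorbed with spare capacity.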
- Distributed Hash Tables
- non-deterministic placement
- Stateful Services in the Real World
- Scuba, facebook
- distributed, in memory db
- always available
- used for code-regression analysis and for bug, revenue, and performance debugging
- there's a paper on it
- Ringpop, Uber
- open-source node.js lib that brings application-layer sharding to many of their dispatching platform services
- routing logic
- SWIM gossip protocol + consistent hashing
- Orleans, extreme computing group at Microsoft Research
- Actor Model
- inherently stateful
- Gossip Protocol
- because they chose to be highly available (Caitie's preference)
- Consistent Hashing + DHT
- consistent hash to the actor ID
- consistent hash basically routes through a DHT, not a lot of work
- they were able to consistently use 95% cpu because of work-moving via DHT
- Caution
- Implicit assumptions
- Unbounded data structures
- Garbage collectors
- Reloading State
- first connection
- recovering from crashes
- deploying new code
- 'Fast Restarts at Facebook' paper
- Should I read papers?
- yes.
- it's mostly in database literature
- don't reinvent your own protocols
- or 'joys, challenges and misery'
- material is also in ch. 7 of 'designing data intensive applications'
- History
- 1975 IBM System R
- 2008 NoSQL
- 2012 "NewSQL"
- Transactional guarantees are what most differentiates these things
- ACID
- 'more mnemonic than precise' - Brewer, 2012
- Durability
- originally meant 'able to be written to an archive tape'
- transaction log
- now it means 'fsync to disk'
- now being replaced with 'replication'
- Consistency
- C in ACID is not the same as the C in CAP
- run away if someone says we can't have ACID because of CAP
- 'tossed in to make the acronym work' - Joe Hellerstein
- now generally means that transactions move the db state from one consistent state to another
- some invariants are maintained
- Really a property of how the application uses the database, not the db
- Atomicity
- not about concurrency! confusingly
- about handling faults (crashes)
- either all or none of the writes in a tx are committed
- incomplete changes will be rolled back
- transactions are about manipulating multiple writes as one unit
- multi-object atomicity
- could be called 'Abortability' instead
- the semantics of what happens when things abort
- includes
- deadlock
- network fault
- constraint violation
- crash/power-failure
- Isolation (concurrency)
- Serializable?
- the strongest definition of isolation
- each transaction happens one after another according to some abstract 'God' clock
- tends to be slow
- Weaker schemes came about with system R (let's fiddle with our locks)
- they're unapproachable because they're named after implementation details of a particular db
- Repeatable Read
- Snapshot Isolation
- Read Committed
- not allowed
- dirty reads
- another tx can't read data from an uncommitted tx
- dirty writes
- a race condition
- read skew can occur under read-committed
- two ways to prevent it
- locks 'Repeatable Read'
- MVCC - Snapshot isolation
- oracle 'serializable'
- actually not serializable :-)
- write skew
- by the time the write is committed the premise of the decision is not true
- invariants can be violated
- requires serializability to fix
- reads before write, but it should've backed out if something else commits first
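Write skew in miniature, using the on-call doctors example from the database literature (an illustration, not code from the talk): under snapshot isolation both transactions validate the invariant against the same snapshot, both commit, and "at least one doctor on call" is violated.

```python
snapshot = {"alice": True, "bob": True}  # both doctors on call

def go_off_call(snap, who):
    # the premise is checked against the snapshot, not the latest committed state
    if sum(snap.values()) >= 2:  # "someone else is still on call"
        return {who: False}
    return {}

# two concurrent transactions, each reading the same snapshot:
write1 = go_off_call(snapshot, "alice")
write2 = go_off_call(snapshot, "bob")
committed = {**snapshot, **write1, **write2}
# each write was valid against its own snapshot, but the combined result
# leaves nobody on call - the premise was stale by commit time
```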
- Read Uncommitted
- In practice, we don't understand the differences between isolation levels
- How do we actually implement serializability?
- we can do lots of locking
- 2 phase locking
- hold the lock until commit/abort
- if you're reading the entire DB, you stop all writes
- this is why the weaker levels came into existence
- people didn't want to lock everything
- H-Store and SSI 'Literally serial orders' ~2006
- gotta make the transactions fast
- this is why it took so long to happen
- you literally execute everything in a serial order
- locality of data and computation
- maybe everything needs to be in memory
- you can detect conflicts and abort
- serializable snapshot isolation
- snapshot isolation with additional checks
- used in postgres past 9.1
- locks like 2PL
- locks don't block, only gather information
- at the end of a tx, analysis happens over the conflicts that occurred
- the tx is aborted if the conflicts are bad, and has to retry
- an optimistic scheme, works well if contention is not high
- 2PL is a pessimistic approach
- Elephant in the room
- all of these are built for a single db
- microservices?
- 2way request/response dataflow via REST/RPC
- stream processing?
- 1-way dataflow via queues
- generally people draw a box around a microservice to create a 'transaction boundary'
- if you wanted serializable transactions across multiple db's
- you'd have to run an atomic commit protocol, 2PhaseCommit, 3PC
- equivalent to consensus protocols
- paxos, raft
- total ordering (atomic broadcast)
- This coordination makes systems brittle, undermines microservices
- consensus over a high latency WAN sucks
- What do people do instead?
- compensating transactions, abort at app level after the fact
- detect and fix constraint violations after the fact
- apologize to your clients
- Every sufficiently large deployment of microservices contains an informally specified… of half of transactions
- microservice coordination
- Serializability
- Causal
- Eventual Consistency
- 'consistent snapshot' means it obeys causality
- can be implemented without coordination
- but there might be a lot of metadata overhead
- Read committed can be implemented with coordination-free schemes
- as can causality
- so maybe we can do transactions without coordination
- 1230pm oreilly booth @peabody free books