Code mesh 2015 notes

Kush, an introduction to schedulers

about me

  • I work for GDS
  • Cabinet Office
  • we started by building GOV.UK
    • replaced older sites like direct gov, business link
  • we’re not just fixing websites
    • we also build and run digital services
    • working with depts across the country
    • eg: register to vote
      • saw about 200k people register at the same time
  • visibility on govt services
  • I’m not here to talk about GDS
    • go to Phil’s talk

this talk

  • it’s been hard to go to a conf in the last two years without hearing about docker (docker docker docker)
  • I want to demystify schedulers
    • some people feel this is too much too soon
    • mesos, kubernetes, etc
  • when you should use schedulers
    • and when you shouldn’t
  • I see schedulers everywhere

what is it?

  • the job of a scheduler is to fairly allocate work
    • fairness is a term of art
    • schedulers act fairly based on their defined objectives
  • example: schedule the expression E = AB + CD
    • E = AB + CD
    • E = (1*2) + (3*4) – explicit ordering
    • tree of expressions
      • addition is commutative, so order doesn’t matter

common scheduling problems

  1. unconstrained
  2. time-constrained
    • all ops completed in a time-bound
      • at least two steps
  3. resource-constrained
    • if we only have one multiplication unit

types of scheduling

  • ASAP (as soon as possible)
  • ALAP (as late as possible)
  • list scheduling
  • force-directed scheduling
    • weighting (commonly time)

why?

  • we don’t make the best use of our infrastructure resources

example: linux kernel cpu scheduler

  • decides which tasks get CPU time right now
  • challenge: make sure applications get a fair share of CPU time
    • (called a “time slice”)
  • let’s look at linux 2.4 cpu scheduler
    • task list containing all processes, and work to be completed
      • guarded by a r/w lock
      • each process has a priority
    • it had a number of shortcomings
      • wasn’t efficient
        • had to iterate through whole task list every time
      • bad in SMP environments
        • could easily pin a single CPU and leave others unused
  • was rewritten
    • removed the global task list lock
    • each process gets its own queue and its own lock
  • rewritten again in 2.6
    • called the Completely Fair Scheduler
    • uses a red/black tree instead of run queues (see the toy sketch after this list)
  • scheduling algorithms:
    • FF SCHED_FIFO
    • RR SCHED_RR
    • TS SCHED_OTHER (_NORMAL)
  • nice
    • runs a program with modified priority
    • eg: nice -n 19 echo “hello”
  • chrt or “change real time”
    • chrt -p 16
    • can change priority but also scheduling algorithm
  • ionice - set IO scheduling priority
    • ionice -c3 -p89
      • sets process with pid 89 as an idle io process
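
The CFS red/black tree idea above can be sketched in a few lines of Java, since java.util.TreeMap is a red/black tree. A toy illustration, not the kernel’s actual code:

import java.util.Map;
import java.util.TreeMap;

// Toy "completely fair" scheduler: runnable tasks sit in a red/black tree
// keyed by accumulated virtual runtime; the next task to run is always the
// leftmost entry, i.e. the one that has had the least CPU so far.
// Illustration only: ties on vruntime are ignored for brevity.
class ToyCfs {
    private final TreeMap<Long, String> runqueue = new TreeMap<>();

    void enqueue(String task, long vruntime) {
        runqueue.put(vruntime, task);
    }

    String pickNext(long sliceNanos) {
        Map.Entry<Long, String> least = runqueue.pollFirstEntry(); // leftmost node
        if (least == null) return null;
        // Charge the task for its slice and requeue it further to the right.
        enqueue(least.getValue(), least.getKey() + sliceNanos);
        return least.getValue();
    }
}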

i see schedulers

  • many queues, locks and a scheduling algorithm over the top
  • not just CPUs!
  • eg in networking (see www.lartc.org)
  • tc - traffic control
    • eg: tc qdisc add dev lo root netem delay 500ms
      • sets queuing discipline
      • network delay of 500ms (netem == network emulation)
  • eg in postgres: has something called autovacuum
    • automates execution of VACUUM and ANALYZE

container schedulers

  • Kubernetes:
type Scheduler interface {
        Schedule(
                *api.Pod,
                ...
        )
}
  • Docker swarm
  • primitives
    • CPU
    • instances
    • add your own: eg response time
  • ephemeral node
    • with a service on it
    • different nodes have different properties in the cloud
      • bad io
      • noisy neighbours
  • see @kelseyhightower’s talks too on container schedulers
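
A toy illustration of the placement decision in Java (not Kubernetes’ or Swarm’s actual logic): score the nodes on one primitive, running-task count, and pick the least loaded.

import java.util.Comparator;
import java.util.List;

// Toy container scheduler. Real schedulers score nodes on many primitives
// (CPU, instances, custom metrics such as response time).
record Node(String name, int runningTasks) {}

class LeastLoadedScheduler {
    Node schedule(List<Node> nodes) {
        return nodes.stream()
                .min(Comparator.comparingInt(Node::runningTasks)) // least loaded wins
                .orElseThrow(() -> new IllegalStateException("no node available"));
    }
}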

others

  • baggage reclaim
    • a bunch of queues
    • a lock
    • you are the scheduling algorithm - you know what you care about, which is your bag
    • baggage reclaim is controlled by a software scheduler
  • premier league football matches
    • domain-specific rules
      • man city and man u can’t play at home on the same day
        • (infrastructure can’t cope)
      • Nottingham, around Trent Bridge
        • three stadiums next to each other
        • can’t all operate at the same time
  • people
    • pomodoro
    • work planning
  • you can choose to hand over knowledge work to schedulers (but they’re not a silver bullet)
    • can probably do better capacity planning

Amir Chaudhry: Unikernels and hyper-elastic clouds

  • who’s heard of unikernels?
  • who’s actually used them? which ones?
    • halvm
    • mirage
    • rump kernel
  • talk resources

motivation

  • software today
    • is built locally, but deployed somewhere else
    • is complex! (even though most apps are single-purpose)
  • complexity is the enemy

MirageOS

  • OCaml
    • it has a good module system

links

Matthew Podwysocki, Putting You Back in Charge of Your Data

Zach Tellman, Everything Will Flow

Queues!

  • they’re simple, right?
    • producer -> queue -> consumer
  • they’re everywhere
  • java.util.concurrent.Executor
    • thread pool
    • BlockingQueue on requests
  • java.util.concurrent.BlockingQueue
    • producer -> [notFull] [buffer] [notEmpty] -> consumer
  • java.util.concurrent.locks.Condition
    • software threads
    • hardware threads
    • more queues to schedule the threads
  • queues are everywhere!
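
The Executor point above is easy to make concrete: a ThreadPoolExecutor really is just worker threads consuming from a BlockingQueue. A minimal sketch:

import java.util.concurrent.*;

public class PoolSketch {
    public static void main(String[] args) throws InterruptedException {
        // producer -> queue -> consumer: the pool's worker threads are the
        // consumers, and the bounded ArrayBlockingQueue is the buffer.
        BlockingQueue<Runnable> work = new ArrayBlockingQueue<>(16);
        ExecutorService pool = new ThreadPoolExecutor(
                2, 2, 0L, TimeUnit.MILLISECONDS, work);
        for (int i = 0; i < 8; i++) {
            final int n = i;
            pool.execute(() -> System.out.println(
                    "task " + n + " on " + Thread.currentThread().getName()));
        }
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);
    }
}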

why?

  • queues separate the “what” from the “when”

queueing theory

closed systems

  • one-at-a-time
  • shell prompt
    • command
    • output
    • command
    • output
  • single person browsing a website
    • click link
    • wait for page to load
    • click link
    • wait for page to load

open systems

  • requests come in
    • while we’re serving one, another can come in at any point

simulating open systems

  • producers: exponential distribution
    • it’s memoryless – the amount of time passed doesn’t change likelihood of producers arriving
  • consumers: pareto distribution
  • simulation examples
    • failure mode: load increases until queueing time dominates
    • the more consumers you add, the longer it takes to fall over
      • …but the more surprising it is when it does
      • echoes of “why complex systems fail” here
  • lesson: unbounded queues are fundamentally broken
  • we’ve made assumptions about how users behave, but haven’t told them, or held them to it

dealing with too much data

  • options:
    1. drop the data
    2. reject the data
    3. pause the data
  • Drop
    • only valid if newer data obsoletes older data, or if we just don’t care
      • eg statsd uses UDP by default
        • problem: when we’re most in need of high fidelity data, we’re least likely to have it, because UDP is best-effort and will degrade under load
  • Reject
    • eg twitter fail whale
    • often, this is the only choice available
  • Pause (aka backpressure)
    • often the only choice for a closed system, library, or subsystem
    • backpressure swims upstream
      • back to java BlockingQueue:
      • producer -> [notFull] [buffer] [notEmpty] -> consumer
        • the extra pieces are for backpressure reasons
        • you don’t even need the buffer here!
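
In java.util.concurrent terms the three options map onto different queue operations; a sketch (class and method names here are illustrative):

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

class OverloadStrategies {
    private final BlockingQueue<String> queue = new ArrayBlockingQueue<>(1024);

    // Drop: discard silently when full. Only valid if newer data obsoletes
    // older data, or we genuinely don't care.
    void drop(String msg) {
        queue.offer(msg); // returns false when full; we ignore it
    }

    // Reject: surface the overload to the caller (the fail-whale option).
    void reject(String msg) {
        if (!queue.offer(msg)) {
            throw new IllegalStateException("overloaded, try again later");
        }
    }

    // Pause: block the producer until there is room, i.e. backpressure
    // swimming upstream to whoever called us.
    void pause(String msg) throws InterruptedException {
        queue.put(msg);
    }
}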

unbuffered queues

  • producer -> [puts] [takes] -> consumer
  • unbuffered queues are closed systems
  • why have buffers if we don’t need them?
    • because backpressure isn’t free
      • restarting threads
      • restarting TCP traffic
      • reading from disk
      • reading from dbs
    • a buffer allows higher, more consistent throughput
      • at a cost of potentially higher latency, because things sit on the buffer waiting to be dealt with
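
Java’s SynchronousQueue is exactly the zero-buffer [puts] [takes] handoff above: every put must rendezvous with a take, so the producer runs at the consumer’s pace. A small sketch:

import java.util.concurrent.SynchronousQueue;

public class Handoff {
    public static void main(String[] args) throws InterruptedException {
        SynchronousQueue<Integer> q = new SynchronousQueue<>(); // no buffer at all
        Thread consumer = new Thread(() -> {
            try {
                while (true) {
                    System.out.println("got " + q.take()); // blocks for a producer
                }
            } catch (InterruptedException e) {
                // interrupted: exit the consumer loop
            }
        });
        consumer.start();
        for (int i = 0; i < 3; i++) {
            q.put(i); // blocks until the consumer takes: a closed system
        }
        consumer.interrupt();
    }
}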

lessons

  • we should always plan for too much data
  • backpressure
  • TCP bufferbloat - example of problem of too many composed buffers

concrete example

  • aim: a data system that takes data and gets it to a safe at-rest place
  • arbitrary volume of data -> ELB -> [thing] -> S3
    • what goes in the middle?
      • must horizontally scale
      • must be low-maintenance
      • not real-time
      • loss of data ∝ loss of utility
    • used:
      • aleph -> hadoop s3 client -> s3
    • but then:
      • S3 starts rejecting writes
      • hadoop starts accumulating writes waiting for S3 to come back
  • replaced hadoop s3 client with durable-queue (a disk-backed queue)
  • still had problems:
    • still accumulated data in durable-queue without bound
    • no visibility when this happened; no metrics
    • had to start writing metrics for this
  • netty’s threading model
    • our typical metrics measure from when we start actually processing the request
      • but this ignores the queueing delay!
      • threadpools are not well instrumented
  • dirigiste for better understanding of behaviour of thread pools
    • distinguish task latency from queue latency
    • better metrics
      • not complete – they can’t be
      • whole system includes client

lessons

  • be clear what “completion” is
  • be picky about your tools
    • watch for sneaky unbounded queues!
  • prefer metrics to raw performance
  • you’re never measuring the complete system

publishing

  • eg chat server
  • one-to-many queue
  • timeouts

we talked about queues

  • unbounded queues aren’t. they’re bounded by something we’ve failed to define.

Q&A

  • how do you deal with pain from low defaults (eg in haproxy?)
    • you need some documentation on what the knobs are and why you should care
    • a one-size-fits-all parameter is bound to disappoint
  • with the log-latency graphs you showed with the buffered queue example, is that a useful visibility tool for day-to-day systems we use?
    • that’s my general intuition
    • most people use graphite
    • doing distributions in graphite is hard
  • if we follow your example of using buffered queues, were there any metrics you feel we should definitely be tracking?
    • utilization & contention
      • once you get above 60-70% utilization, exhaustion gets quite likely
  • queueing theory often assumes a fixed number of consumers; can it work in an autoscaling world?
    • yes
    • maybe cloud services resemble infinite consumers?
    • can you do formal analysis of a dynamic system like this and predict exactly how it will behave? probably not
      • better to test, benchmark, see what it does under load

Evan Czaplicki, Accidentally Concurrent

  • works at Prezi on the Elm language
  • I care a lot about how proglang research from last 30 yrs can be useful day-to-day on the web
    • eg: why is immutability good?

message passing concurrency

  • I take this from concurrent ML
  • lightweight threads
  • basic primitives:
    • channel: () -> Channel a
    • send: Channel a -> a -> ()
    • recv: Channel a -> a
    • spawn: (a -> ()) -> a -> ThreadId
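
A rough Java rendering of these primitives (an analogy, not Concurrent ML itself: a SynchronousQueue behaves like an unbuffered channel, and spawn gets a heavyweight OS thread rather than a lightweight one):

import java.util.concurrent.SynchronousQueue;
import java.util.function.Consumer;

class Cml {
    // channel: () -> Channel a
    static <A> SynchronousQueue<A> channel() {
        return new SynchronousQueue<>(); // unbuffered: send and recv rendezvous
    }

    // send: Channel a -> a -> ()
    static <A> void send(SynchronousQueue<A> ch, A value) throws InterruptedException {
        ch.put(value);
    }

    // recv: Channel a -> a
    static <A> A recv(SynchronousQueue<A> ch) throws InterruptedException {
        return ch.take();
    }

    // spawn: (a -> ()) -> a -> ThreadId
    static <A> Thread spawn(Consumer<A> body, A arg) {
        Thread t = new Thread(() -> body.accept(arg));
        t.start();
        return t;
    }
}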

mutation

  • what does mutation look like when you model it as a concurrent system with these primitives?
    • type Msg a = Get | Set a
    • “two spies in the park” approach to communication
    • a new thread per variable
      • becomes unmanageable

encapsulation

function BadGuy() {
    var x = 0;
    var y = 0;
    var health = 100;
    this.follow = function(hero) { ... };
    this.damage = function(n) { ... };
}
  • hides multiple mutable cells behind a single interface
  • introduces hierarchy
  • but: hierarchy ≠ modularity
    • reference can be passed from one code unit to another
      • introduces extra comm lines between subtrees of hierarchy
    • getters and setters on objects
    • setters mean readers have ability to bash on objects

schedulers

  • different kinds
    • cooperative
      • runs a thread until it’s done or it yields
      • used by golang, …
    • preemptive
      • you can bail on something half-way through
      • used by erlang, concurrent ML, …
  • that old thing concurrency ≠ parallelism
  • event loop
    • “a poor man’s concurrency”
    • javascript setTimeout(callback, 0); is the same as spawn callback()
  • “Reactive” Programming
    • “everything is a stream!”
      • type alias Stream a = Channel a
      • build things up with combinators
      • map: (a->b) -> Stream a -> Stream b
    • glitches
      • diamond communication shape
diamond : Stream Float -> Stream Bool
diamond stream =
  merge
    (==)
    (map sqrt stream)
    (map sqrt stream)
  • glitches cont’d
    • looked at as a concurrent system, it’s clear that this will cause glitches
      • there’s no reason to expect synchronization
    • if you thought of your streams as a concurrent system, you probably wouldn’t write it this way in the first place
  • FlatMap is weird!
    • flatMap : (a -> Stream b) -> Stream a -> Stream b
    • it is kind of like a threadpool, but really bizarrely done
    • is it even a reasonable use of flatmap?
  • dependencies between streams become a rat’s nest
    • hero depends on keyboard
      • hero’s health depends on location, bad guys, etc
    • bad guys depend on hero
      • want to chase him
    • pause screen depends on
      • hero’s health
      • keyboard, mouse
    • hero depends on pause screen
      • maybe I can apply a potion to increase health
    • this is crazy

big points

  • writing high-quality concurrent programs is hard
  • everyone is writing accidentally concurrent programs

better stuff

  • immutability
    • means no accidental communication channels
    • revealing information doesn’t change your architecture

Q&A

  • what about Rust’s approach to mutability where it just emphasizes ownership?
    • ownership and encapsulation are separate concerns

Leah Hanson, How Julia Goes Fast

what problem is julia solving?

  • julia is for scientists
    • engineers, mathematicians, finance people
    • non-professional programmers who use programming as a tool
      • least time programming for maximum benefit
  • what do they need?
    • easy to learn
    • easy to use
    • good for scripts
    • good for libraries
    • fast enough for medium to large data sets
    • fast, extensible math
      • linear algebra
      • new numeric types
  • easy and fast
  • how is it better than what they already use? (eg numpy)
    • need to learn python
    • need to add your own function that doesn’t yet exist
      • implement in python
        • too slow
        • (numpy’s things are written in C with python wrappers)
        • learning C is not a good use of our users’ time
  • fast julia code is written in julia

the big decisions

  • dynamic, interpreted, easy

background for implementation

  • JIT compilation
    • at run time, you alternate between compiling and running code
  • compiler needs to be fast
    • static compilers can be slow because they don’t contribute to runtime
    • this is no longer true for JITs
  • compiler has access to runtime information
  • julia is designed up front to be JIT compiled
    • type system
      • abstract or concrete types
        • concrete types:
          • can be instantiated
          • determine mem layout
          • types can’t be modified after creation
          • can’t have subtypes
          • single supertype
        • abstract types:
          • can’t instantiate
          • new subtypes can be added at any time
          • single supertype
          • zero or more subtypes
  • multiple dispatch
    • convenience for extensible math
    • single dispatch is method dispatch we’re familiar with
      • x.foo(y) – code impl depends on type of x
    • multiple dispatch can depend on any param:
      • foo(x,y) – code impl depends on types of x and y
    • all named fns are generic
    • each fn has one or more methods
x = ModInt(3,5)
x + 5 # ModInt + Int
5 + x # Int + ModInt

function Base.+(m::ModInt, i::Int64)
    return m + ModInt(..)
end
  • this just can’t be done nicely with single dispatch

the details of how things go fast

dispatch

  • dispatch happens a lot, so needs to be really fast
    • foo(5) #dispatch, compile, run
    • foo(6) #dispatch, run
    • foo(7) #dispatch, run
  • possible signatures:
    • foo(Int64)
    • foo(Number)
    • foo(Float64)
    • foo(Int64,Int64)
  • do we need to keep redispatching? aren’t we always going to pick foo(Int64)?
    • add a cache
    • foo(5) #cache, dispatch, compile, run
    • foo(6) #cache, run
    • foo(7) #cache, run
    • call into hashmap rather than going through a list
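
A toy version of that cache (illustrative Java, not Julia’s actual implementation): key on the concrete argument type, “compile” on the first miss, and every later call with that type is just a hashmap hit.

import java.util.HashMap;
import java.util.Map;
import java.util.function.UnaryOperator;

class DispatchCache {
    private final Map<Class<?>, UnaryOperator<Object>> cache = new HashMap<>();

    Object call(Object arg) {
        // First call with this argument type: dispatch + compile, then cache.
        // Subsequent calls with the same type skip straight to the cached method.
        UnaryOperator<Object> method =
                cache.computeIfAbsent(arg.getClass(), this::dispatchAndCompile);
        return method.apply(arg);
    }

    private UnaryOperator<Object> dispatchAndCompile(Class<?> type) {
        System.out.println("compiling a specialization for " + type.getSimpleName());
        return x -> x; // stand-in for the compiled method body
    }
}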

generic functions

  • define Base.* in terms of +(Number,Number) and -(Number,Number)
  • C++: templates vs dynamic dispatch
    • dynamic dispatch is generic
      • single Base.* impl to cover all Number instances
    • templates do aggressive specialization
      • individual Base.* impl for each Number instance
    • tradeoff: code size vs speed
  • inlining
    • makes more optimizations available
    • not reusable
  • devirtualization
    • write down the IP to avoid DNS
    • ie hardcode the method implementation rather than doing dispatch
  • why do we normally dispatch?
    • we’re avoiding dispatch as much as possible for performance
    • worth knowing why we have it in the first place
      • repl work
        • redefine a function to fix a bug, want callers to reflect changes
        • issue 265

type stability

  • non-concrete types have to be boxed
  • immutable types
    • kind of concrete type
    • once created, value can’t change
  • boxed immutable values are bad for perf
    • immutable types can’t appear to change
    • we have to copy them each time we want to change them
    • new heap allocation

macros for speed?

  • julia has lisp-style macros
  • macros are evaluated at compile time
  • macros should be used sparingly
  • Horner’s rule
    • a*x*x + b*x + c
    • skip a multiply with (a*x + b)*x + c
    • for a cubic: 6 multiplies down to 3
    • for a quartic: 10 multiplies down to 4
    • etc (O(n^2) down to O(n))
  • Horner’s rule can be implemented as a macro (see the Java sketch after this list)
  • 4x faster than matlab
  • 3x faster than scipy
    • both of which call C/Fortran libs
  • Julia is faster than C!
    • compiled julia methods will have inlined constants, which are very optimizable
    • C would involve a run-time loop over the array of coeffs
    • LLVM can optimize the julia better than the C
    • an abstraction which makes things faster
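
The Horner rewrite itself, sketched in plain Java (the macro’s trick is that the coefficients are baked in at compile time; the array version below is the run-time loop a C library would execute):

class Horner {
    // Naive cubic a*x^3 + b*x^2 + c*x + d: six multiplies.
    static double naive(double a, double b, double c, double d, double x) {
        return a * x * x * x + b * x * x + c * x + d;
    }

    // Horner form ((a*x + b)*x + c)*x + d: three multiplies.
    static double horner(double a, double b, double c, double d, double x) {
        return ((a * x + b) * x + c) * x + d;
    }

    // General form over a coefficient array (highest degree first).
    static double horner(double[] coeffs, double x) {
        double acc = 0.0;
        for (double c : coeffs) {
            acc = acc * x + c;
        }
        return acc;
    }
}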

conclusion

  • scientists like easy & fast
  • dynamic for easy (but with limits)
  • compiled for fast
  • multiple dispatch for making fast code concise

Q&A

  • what about R?
    • R is slow for things that need loops (like monte carlo)
  • does julia have support for openmp?
    • I think so?
    • you can call libs that use OpenMP
    • there’s experimental support for OpenMP within julia

John Hughes, Mary Sheeran, Why Functional Programming Matters

FP à la 1940s

  • who needs booleans?
  • a boolean just makes a choice! They can just be functions!
    • true x y = x
    • false x y = y
  • we can then define if-then-else
    • ifte bool t e = bool t e
  • who needs integers?
    • a (positive) integer just counts loop iterations
      • two f x = f (f x)
      • one f x = f x
      • zero f x = x
    • to recover a “normal” integer…
      • two (+1) 0
        • gives 2!
    • add m n f x = m f (n f x)
    • mul m n f x = m (n f) x
  • factorial!
fact n =
  ifte (iszero n)
    one
    (mul n (fact (decr n)))

This means: booleans, integers, (and other data structures) can be entirely replaced by functions!

Church encodings
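
These encodings translate to any language with first-class functions; here is a Java sketch of the Church booleans (an illustration, not from the talk; the generic method plays the role of the higher-rank type mentioned below):

// A Church boolean is a two-way choice function: true picks its first
// argument, false its second.
interface ChurchBool {
    <A> A choose(A ifTrue, A ifFalse);
}

class Church {
    static final ChurchBool TRUE = new ChurchBool() {
        public <A> A choose(A t, A e) { return t; }
    };
    static final ChurchBool FALSE = new ChurchBool() {
        public <A> A choose(A t, A e) { return e; }
    };

    // ifte bool t e = bool t e: if-then-else is just application
    static <A> A ifte(ChurchBool b, A thenVal, A elseVal) {
        return b.choose(thenVal, elseVal);
    }

    public static void main(String[] args) {
        System.out.println(ifte(TRUE, "then-branch", "else-branch")); // then-branch
    }
}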

Before you try this at home…

  • “cannot construct the infinite type…”
fact :: (forall a. (a->a) -> a -> a) …
(typing fact over Church numerals needs a higher-rank type like this)

Factorial à la 1960

  • LISP!
  • This got lots of people excited!
    • esp Peter Landin (see The Next 700 Programming Languages and ISWIM)

laws

  • (maplist f (reverse l)) = (reverse (maplist f l))
    • …only if f is side-effect free

issues with conventional languages

  • John Backus, “Can Programming Be Liberated from the von Neumann Style? A Functional Style and Its Algebra of Programs”
    • Turing award 1977; paper 1978
  • “Inherent defects at the most basic level cause them [conventional languages] to be both fat and weak”
  • defects such as:

word-at-a-time operation

  • inherited from the von Neumann machine
    • the von Neumann way of designing machines has a terrible effect on writing programs
    • don’t focus on the word-at-a-time bottleneck; focus on whole values

inability to use “combining forms”

  • what we would today call “higher order functions”
  • he called map f “αf”

lack of useful mathematical properties for reasoning about programs

  • [f1,f2,..,fn] ∘ g ≡ [f1 ∘ g, f2 ∘ g,…, fn ∘ g]
  • g ∘ [f1,f2,..,fn] ≡ [g ∘ f1, g ∘ f2,…, g ∘ fn]

Got a bit APLy:

Def IP = (/ +) ∘ (α ×) ∘ Trans

Peter Henderson

  • functional geometry
  • escher’s square limit fish
  • devised an algorithmic definition of escher’s picture
    • operators on pictures such as rot, over, above
    • higher-order operators cycle, quartet
  • then: define pictures as functions
    • p(a,b,c) is the picture at position a, with height along vector b and width along vector c
    • other combinators become easy to describe here:
      • eg over (p,q) (a,b,c) becomes p(a,b,c) ∪ q(a,b,c)
  • this functional geometry avoids Backus’s defects:
    • combining forms
    • operations on whole values
    • algebra with laws
  • and as a bonus: functions as representations
  • taking it further: Conal Elliott’s Pan

FP à la 1990s

  • Haskell: Paul Hudak got DARPA money to work on Haskell as a prototyping language
  • functions as data
    • type Region = Point -> Bool
      • predicate representation of set of points
  • reactions…
    • “too cute for its own good”
    • “higher-order functions just a trick, probably not useful in other contexts”

Lazy evaluation

  • Henderson and Morris, A lazy evaluator
  • Friedman and Wise, CONS should not evaluate its arguments
  • “The Whole Value” can be infinite!
    • the infinite list of natural numbers [0, 1, 2, 3, …]
    • all the iterations of a function
    • iterate f x = [x, f x, f (f x), ...]
  • some things can be expressed very nicely with lazy eval
    • eg newton-raphson, numerical derivative

my paper

  • John Hughes, Why Functional Programming Matters
  • lazy producer-consumer pattern
    • producer: successive approximations of a numerical algorithm (eg derivative, integral)
      • consumer: limit detection
    • producer: search space
      • consumer: search strategy
  • pretty printing
    • make choices between horizontal and vertical layout
      • vertical: preserve indentation
      • horizontal: discard indentation
    • combinators: a $$ b puts a above b; a <> b puts a beside b
    • combinators obey laws:
      • <> and $$ are both associative
    • once we understand the laws, the implementation can use the laws to make the best layout

my other paper

  • Koen Claessen, John Hughes, “QuickCheck: A Lightweight Tool for Random Testing of Haskell Programs”
  • randomly generate test cases from properties (properties based on laws)
  • then shrink failing test cases to minimal counterexamples
  • This is another lazy producer-consumer!
    • producer: space of all possible tests
    • consumer: QuickCheck search strategy

more applications

  • Introduction to VLSI systems
    • was intended to revolutionize programming
    • did revolutionize hardware design
  • A VLSI circuit is not a million miles away from Escher’s square limit!
    • we used functional geometry ideas to design VLSI circuits
    • we need to be certain we haven’t made a mistake as the feedback cycle from the fab was too slow (9 months)
  • use Backus’s FP to create a language that describes streaming circuits
    • then use algebraic laws of the language to optimize the circuits
    • symbolic simulator to test
    • muFP
  • hardware sorting based on doubly-recursive definition:
  • having a language of combining forms makes it much easier to reason about sorting networks
    • though there are still outstanding unsolved problems
  • Satnam Singh: Lava language http://blog.raintown.org/p/lava.html
  • Pentium float division bug
    • very costly mistake
    • they suddenly got interested in formal verification
    • forte
  • bluespec bsv
    • bluecheck
      • QuickCheck in hardware design!
      • iterative deepening and shrinking on FPGA designs!

four key ideas

  • functions as representations
  • whole values
  • simple laws
  • combining forms

Martin Thompson, A Quest for Predictable Latency

  • subtitle: Adventures in Java Concurrency
  • @mjpt777
  • latency: responding in a timely manner
  • the more things you try to juggle, the harder it becomes to predict how responsive you can be
  • if you don’t respond in a timely manner, you’re effectively unavailable

causes of blocking

  • concurrent algorithms
    • notifying completion
    • mutex
    • synchronization/rendezvous
  • systemic pauses
    • JVM Safepoints (GC etc)
    • Transparent Huge Pages (linux)
    • Hardware

concurrent algorithms

  • two approaches: locks and CAS
    • both are hard to get right!
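
The two approaches in miniature (a sketch, not the talk’s code):

import java.util.concurrent.atomic.AtomicLong;

class Counters {
    private final Object lock = new Object();
    private long lockedCount = 0;
    private final AtomicLong casCount = new AtomicLong();

    // Lock: simple, but a descheduled lock holder blocks every other thread.
    void incrementWithLock() {
        synchronized (lock) {
            lockedCount++;
        }
    }

    // CAS: never blocks, but contended threads spin and retry.
    void incrementWithCas() {
        long current;
        do {
            current = casCount.get();
        } while (!casCount.compareAndSet(current, current + 1));
    }
}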

what is latency?

  • queueing theory
    • response time is sum of:
      • enqueue time
      • latent time (time in queue)
      • dequeue time
      • service time (time being processed)
  • queueing theory focuses on latent time and service time, but sometimes enqueue and dequeue dominate!
  • relationship between utilization and increased response time

adventures with locks and queues

  • evils of blocking
  • condition variables
    • condition variable RTT (echo service)
    • request: ping; response: pong
    • measure histogram of response times
      • don’t use averages! they lie!
    • 90th percentile: 8 µs
    • 8 µs is loads of time to talk between two threads!
      • 2 µs is enough to go between two machines on the same network!
    • but it gets worse at higher percentiles
      • max is 5525 µs!
  • bad news: this is the best case scenario!
  • context for benchmarks:
  • producers: 1, 2 or 3 (can’t have more on a quad-core machine!)
  • measuring: mean and 99th percentile
  • baseline (wait-free) vs ArrayBlockingQueue vs LinkedBlockingQueue vs ConcurrentLinkedQueue
  • burst length: 1 then 100
    • real world traffic is bursty
    • send a burst of 100; wait for 1 ack
  • other considerations:
    • Backpressure (queues should have bounded size)
    • size methods
    • flow rates
    • garbage
    • fan out
  • in the real world, I cannot use java’s built-in queues, partly because they lack features I need
    • if you call .size() on ConcurrentLinkedQueue, it walks the whole list

some alternative FIFOs

the disruptor

  • get over the garbage problem
    • preallocate a ring buffer and reuse it
  • producers claim a slot using CAS
    • once a slot is claimed, can write to the slot
    • once written, can update the cursor
  • consumer updates gating
  • no need for locks!
  • wrinkle: how do I deal with setting the cursor?
    • I can’t update cursor to n before it has reached n-1
    • thread claims slot; gets swapped out before writing to slot
      • blocks whole queue until thread swaps back in (many ms!)

update in Disruptor 3.0

  • cursor is used as CAS loop
  • add an “available” array to mark when things are written
  • made a huge difference

what if I just need a queue?

  • disruptor isn’t best used as a queue (better for coordinating graph of dependencies)
  • ManyToOneConcurrentArrayQueue (influences: Lamport + Fast Flow)
  • link?
  • tail: CAS counter
  • head: counter
  • reference ring buffer

back to the benchmarks

  • Disruptor and ManyToOneConcurrentArrayQueue behave better in low contention
  • gets worse under the bursty case
  • each over 50 µs at this stage
  • spinning CAS loops are the great equalizer

inter-process FIFOs

  • ring buffer
  • spinning CAS on tail counter
  • write message once space claimed
  • write header to indicate completion
    • (zero out memory on consumption to make this work)
  • MPSC Ring Buffer
  • latency is good but throughput takes a hit from zeroing memory

Aeron IPC

  • http://highscalability.com/blog/2014/11/17/aeron-do-we-really-need-another-messaging-system.html
  • we wanted a CRDT
    • replicated to another machine
    • may be reordered or duplicated
    • want to reassemble deterministically
  • big shared-memory file
    • tail pointer (but no head this time)
    • advance tail, write message, write header (just as above)
  • does the file really just go on forever?
    • problems with this: page faults; page cache churn; VM pressure
    • wash one, wear one, dry one
    • active file, dirty file, clean file
  • XADD instruction (see the sketch after this list)
    • available in JVM now
    • ring buffers can’t use this because the tail can overtake the head
    • if you move the tail beyond the end of the buffer, you can detect this
      • your responsibility is to rotate the buffer now
    • messages persist while active & dirty buffers are around
  • append-only data structures can be safe to read without locks
    • they can detect when they’ve reached the end
  • zeroing problem from before: how do we avoid the throughput problem?
    • do it in a background thread!
    • the old “dirty” becomes the new “clean”
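
In Java the fetch-and-add shows up as AtomicLong.getAndAdd, which recent JVMs compile to LOCK XADD on x86 (see the Q&A below). A sketch of claiming space this way; the claim helper and buffer-rotation details are mine, not Aeron’s code:

import java.util.concurrent.atomic.AtomicLong;

class TailClaim {
    private final AtomicLong tail = new AtomicLong();
    private final long capacity;

    TailClaim(long capacity) {
        this.capacity = capacity;
    }

    // Unlike a spinning CAS loop, getAndAdd always succeeds in one step.
    long claim(int messageLength) {
        long offset = tail.getAndAdd(messageLength);
        if (offset + messageLength > capacity) {
            // The tail moved past the end of the buffer: this thread detected
            // the overflow, so it is responsible for rotating to the next buffer.
            return -1;
        }
        return offset;
    }
}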

back to the benchmarks!

  • aeron IPC is slower than ManyToOneConcurrentArrayQueue in the happy case (no burst, one producer, average time)
  • for bursty, 3 producers, 99th percentile:
    • ManyToOneConcurrentArrayQueue breaks below 50µs to 34µs
      • less “False Sharing”
      • inlined data vs Reference Array
        • less “Card Marking”
    • Aeron gets even further down to 13µs (ish?)
      • also mean is 10µs so this is phenomenal predictability
      • avoids the spinning CAS

logging is a messaging problem!

  • the abominations we have in Java for logging are disgusting!
    • the APIs, the design, etc
  • what design should we use?

where can we go next

  • C vs Java
    • Spin loops
    • modern processors do out-of-order speculation
    • with spin loops, they speculate wrong
      • Proposed: Thread.spinYieldHint()
    • data dependent loads
      • heap aggregates: objectlayout
      • stack allocation - value types
    • memory copying
      • baseline against memcpy() for differing alignment
    • queue interface
      • break conflated concerns to reduce blocking actions
      • API design mistakes
        • conflates “is something available?” with “is it empty?”
        • should not have used normal collections API for concurrent stuff

conclusion

  • AWS recently announced x1 instances
    • 2TB RAM + 100+ vcores!
    • lots of spinning wheels

Q&A

  • can you tell us more about XADD in java?
    • AtomicInteger, AtomicLong, etc

Martin Kleppmann, Transactions: Myths, Surprises and Opportunities

history

  • 1975: IBM System R
  • 2008? NoSQL
  • 2012? “NewSQL”
    • eg SQL on top of HBase
  • really a movement of transactions-or-not

ACID!

  • Härder & Reuter, 1983
  • “More mnemonic than precise” - Brewer, 2012

D for Durability

  • archive tape?
  • fsync to disk?
  • replication?

C for Consistency

A for Atomicity

  • not the same as in “atomic compare-and-swap”
  • not about concurrency
  • it’s about fault tolerance
  • rolls back writes on abort
  • better word might have been Abortability
  • aborts are super helpful because they collapse a whole bunch of error classes:
    • crashes, power failures
    • constraint violation
    • network fault
    • deadlock

I for Isolation

  • this is where we start talking about concurrency
  • = SERIALIZABLE?
  • in practice, multiple isolation levels
    • Repeatable Read
    • Snapshot Isolation
    • Read Committed
    • Read Uncommitted
  • can anyone here describe the difference off the top of their head? (one hand goes up)
  • this is a configurable property of many databases
    • though many don’t go as high as Serializable
    • and some call it Serializable but are actually weaker than this
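
From application code the isolation level is literally a per-connection knob; a JDBC sketch (the connection URL is made up):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

class IsolationDemo {
    public static void main(String[] args) throws SQLException {
        try (Connection conn = DriverManager.getConnection("jdbc:postgresql:demo")) {
            conn.setAutoCommit(false);
            // Ask for the strongest level. As noted above, what the database
            // actually gives you under this name varies by vendor.
            conn.setTransactionIsolation(Connection.TRANSACTION_SERIALIZABLE);
            // ... reads and writes here run under the chosen isolation level ...
            conn.commit();
        }
    }
}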

isolation levels

preventing dirty reads and dirty writes

  • read committed prevents these anomalies
  • default in postgres, oracle, …
  • that’s enough for transactions, right?
  • what are the other weird isolation levels?

preventing read skew

  • this can happen under read committed!
  • two transactions run concurrently:
    • writer: add 100 to x; subtract 100 from y; commit
    • reader: read y; read x
  • reader can see old value of y, but new value of x
  • Repeatable Read and Snapshot Isolation were invented to prevent this
  • Traditional RR implementation is based on locking
  • SI is used more often now to avoid locks
    • poorly named in most DBs

write skew

  • invariant: at least one doctor on call at all times
  • doctors can trade shifts
  • transaction:
    • count how many doctors are on call
    • if >=2:
      • update
    • commit
  • but: two txns operate concurrently for two different doctors:
    • both see n >= 2
    • both remove themselves from rota
    • invariant violated!
  • pattern:
    • measure
    • decide
    • write
    • commit
    • by the time the write is committed, the premise of the decision is no longer true!
  • solutions
    • two-phase locking (pessimistic)
      • shared lock on measure step
        • block entire table for writes, whenever you make a large read
      • poor for perf
      • motivation for weaker isolation levels
      • System R team just didn’t hold locks for so long
        • this is what created their crazy isolation levels
    • H-Store (Stonebraker et al)
      • VoltDB, datomic
      • literally execute your txns in a serial order
        • trivially correct
        • as long as you make your txns fast enough!
          • otherwise your throughput will be abysmal
    • SSI (Cahill, Röhm & Fekete)
      • Postgres
      • detect conflicts & abort
        • optimistic counterpart to pessimistic 2PL
      • Idea: locks like in 2PL
        • but locks don’t block, just gather information

what about other distributed systems?

  • microservices?
  • stream processing?
  • serializability across services?
    • atomic commitment
    • total ordering
    • consensus
      • paxos, raft, zab
  • strong connection between all of these things
    • coordination
    • failure amplification
  • geographic distribution makes this worse
  • distributed txns don’t work particularly well

without cross-service transactions

  • so where are we now?
  • compensating transactions
    • ≅ abort/rollback at app level
    • looks a bit like ACID A
  • apologies
    • detect & fix constraint violations after the fact, rather than preventing
      • overselling beyond stock level; issue voucher instead
    • looks a bit like ACID C

every sufficiently large deployment of microservices contains an ad-hoc, informally-specified, bug-ridden, slow implementation of half of transactions

  • ordering of events
    • unfriend
    • post nastygram to remaining friends
  • friends service and posts service are separate services
  • notifications services pushing out emails etc, reading from posts and friends services
  • reordering due to delays
  • notification goes to old set of friends, not new set of friends. oops!
  • implicit causal relationships leading to violated expectations

current research

  • things between strict serializability and eventual consistency
  • causality:
    • partial order of reads/writes
    • can be maintained without global coordination
    • “consistent snapshot” in SI
      • consistent with causality
    • but also efficient?
  • isolation levels are divided into those that require coordination and those that don’t

references

  • four slides’ worth!

Q&A

  • how did we end up with this mess? naming etc
    • on the academic side, the database community and distributed systems community have historically been completely separate
      • evolved separate language for similar concepts
    • wouldn’t have many of current problems if they had talked to each other in 1980s
    • SQL standard had problems with isolation levels
      • failed to describe them in an implementation-neutral way
      • this is why Postgres uses SI but calls it REPEATABLE READ
  • also see hermitage, a tool to test what kind of anomalies different isolation levels of dbs allow