- Why? Well, we couldn’t scale up anymore. (talking about MySQL)
- Shift in trust (read “>” as “more reliable than”):
- Old: Hardware > Software
- New: Software > Hardware
- Easily get out of sync, and deletes are impossible to track (sometimes due to batch processes, etc)
- Required waiting for newer postgres…
- The trigger puts it into pgq, which holds for a small time.
- pgq has great features:
- ACID
- Fast
- Persistent
- But pgq can’t keep events forever (only an hour or so), which doesn’t give you much buffer.
(Database) ----pgq----> (Kafka) ------> (Redshift)
                           \__________> (Elasticsearch)
                           \__________> Wherever else
- “Don’t use default configs” – they’re almost never what you want
- “Avoid race conditions”
- https://github.com/cmeiklejohn/derflow
- a Datalog-like language
- Part of the syncfree project: https://syncfree.lip6.fr which aims to build large scale computation without synchronization
- “Time, Clocks, and the Ordering of Events in a Distributed System”
- Logical clocks, not physical
- http://zoo.cs.yale.edu/classes/cs422/2013/bib/parker83detection.pdf
- Defined or hinted at a lot of later ideas, like the possibility of CRDTs, the CAP theorem, etc.
- version vectors
- origin point
- Descends A >= B
- Dominates A > B
- Concurrent A | B ( [2 3 2] | [2 3 1 1] )
- A merge B = C (pairwise maximum)
- C >= A and C >= B
- A | B ====> C > A and C > B
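The comparison and merge rules above, as a minimal Python sketch (my own illustration, not from the talk; vectors are plain tuples of counters, with missing entries treated as 0):

```python
from itertools import zip_longest

def descends(a, b):
    """A >= B: every counter in A is at least the matching counter in B."""
    return all(x >= y for x, y in zip_longest(a, b, fillvalue=0))

def dominates(a, b):
    """A > B: A descends B and they are not equal."""
    return descends(a, b) and not descends(b, a)

def concurrent(a, b):
    """A | B: neither descends the other."""
    return not descends(a, b) and not descends(b, a)

def merge(a, b):
    """Pairwise maximum; the result descends both inputs."""
    return tuple(max(x, y) for x, y in zip_longest(a, b, fillvalue=0))

# The example from the notes: [2 3 2] | [2 3 1 1]
a, b = (2, 3, 2), (2, 3, 1, 1)
assert concurrent(a, b)
c = merge(a, b)   # (2, 3, 2, 1)
assert dominates(c, a) and dominates(c, b)
```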
- Always pick the best
- Update your beliefs
- How do you know which is the best?
- Answer: Throw darts at (sample from) the different distributions you know and pick the biggest; when new information arrives, update the distribution, which changes the probability of picking each option, and it eventually converges on the best one.
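That answer describes Thompson sampling; here is a minimal sketch for Bernoulli rewards with Beta posteriors (the success rates, function name, and all parameters are illustrative assumptions, not from the talk):

```python
import random

def thompson(true_rates, rounds=3000, seed=42):
    rng = random.Random(seed)
    # Beta(1, 1) prior over each option's unknown success rate.
    wins = [1] * len(true_rates)
    losses = [1] * len(true_rates)
    picks = [0] * len(true_rates)
    for _ in range(rounds):
        # Throw a dart at each distribution: one sample per posterior...
        darts = [rng.betavariate(w, l) for w, l in zip(wins, losses)]
        arm = darts.index(max(darts))   # ...and pick the biggest.
        picks[arm] += 1
        # New information updates that option's distribution.
        if rng.random() < true_rates[arm]:
            wins[arm] += 1
        else:
            losses[arm] += 1
    return picks

picks = thompson([0.2, 0.5, 0.8])
assert picks.index(max(picks)) == 2   # converges on the best option
```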
- Count number of unique things without a lot of storage
- Hash unique identifier % m, set bit
- Unique Count = -m · ln((m - w) / m), where m = size of the bit vector and w = weight (number of 1s in the bit vector)
- Pitfall: when w = m, useless.
- Pitfall: when w ~= m, you’ve got an upper bound (e.g. w = 15, m = 16, 44 max estimate)
- Good for roughly the 100,000 to 1 billion range; use a linear counter for less
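The counter sketched above, in Python (my own sketch; using md5 as the hash is an assumption made for determinism, any well-mixing hash works):

```python
import hashlib
import math

def linear_count_estimate(m, w):
    """The estimator from the notes: -m * ln((m - w) / m)."""
    if w >= m:
        raise ValueError("all bits set (w == m): the estimator is useless")
    return -m * math.log((m - w) / m)

def count_uniques(items, m=1024):
    bits = [0] * m
    for item in items:
        # Hash the unique identifier mod m and set that bit.
        h = int(hashlib.md5(str(item).encode()).hexdigest(), 16)
        bits[h % m] = 1
    return linear_count_estimate(m, sum(bits))

# The pitfall from the notes: w = 15, m = 16 caps the estimate near 44.
assert round(linear_count_estimate(16, 15)) == 44

est = count_uniques(f"user-{i}" for i in range(500))
assert 400 < est < 600   # rough estimate of the 500 true uniques
```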
- Should we be doing this?
- “Have you considered ZooKeeper?”
- They had considered it. They continued anyway.
Lesson: Shoot for “just the right” engineering at the right time (i.e. the minimum effort that doesn’t have to change)
- The WAL was initially just “|” delimited text.
- Made debugging easier, eventually became a bottleneck, and then they replaced it.
- Things failed in ways they could have tested (and now do).
- A single NIC took down their entire cluster when it degraded from 10 Gb to ~5 Mb as it started to fail: data couldn’t finish sending before the timeout, so failed acknowledgements triggered resends, compounding the bandwidth loss.
- Be flexible operationally and ensure things don’t happen again.
Neha Narula (PhD Candidate CSAIL, MIT): Multicore and Distributed Systems: Sharing Ideas to Build a Scalable Database
Phase Reconciliation, current research (http://pdos.csail.mit.edu/papers/doppel:osdi14.pdf)
(sketch: transaction throughput curving upward as cores increase)
Doppel: http://pdos.csail.mit.edu/doppel
- switches to a split phase. updates to contended items modify per-core state, proceeding in parallel on different cores
- rewrites transactions such that they can be executed in phases
- works with operations that commute; right now just INCR, MULT, etc.
- Reconcile in a more serial manner.
- Works best for situations where there are few popular records (i.e. Retweet, Favorite counts on extremely popular tweets)
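A toy illustration of the split/reconcile idea (not Doppel’s actual implementation; `SplitCounter` and its API are made up for the sketch):

```python
# During a split phase, commutative updates (here, INCR) to a hot
# record go to per-core state with no cross-core coordination; a
# serial reconciliation phase folds the deltas back into the record.
class SplitCounter:
    def __init__(self, ncores):
        self.global_value = 0
        self.per_core = [0] * ncores   # contended updates land here

    def incr(self, core):
        # Split phase: increments commute, so each core updates
        # its own slot in parallel.
        self.per_core[core] += 1

    def reconcile(self):
        # Joined phase: merge per-core deltas back, serially.
        self.global_value += sum(self.per_core)
        self.per_core = [0] * len(self.per_core)
        return self.global_value

retweets = SplitCounter(ncores=4)
for core in (0, 1, 2, 2, 3):
    retweets.incr(core)
assert retweets.reconcile() == 5
```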
Jim Waldo, et al. “A Note on Distributed Computing” (http://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=CD160AD58C5532B19BCFF16C9F07DF23?doi=10.1.1.41.7628&rep=rep1&type=pdf)
- How do you compose guarantees?
- Asynchrony and partial failure
- Think about both at the same time, or the system sucks
- FUNDAMENTAL UNCERTAINTY
- Associative, Commutative, Idempotent
- They compose!
- Monotonicity -> confluent
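A quick check that an element-wise-max merge (as in version vectors or a G-counter; my own example) has all three properties, which is why such operations survive reordering, duplication, and regrouping of messages:

```python
def merge(a, b):
    """Element-wise max over per-replica counters."""
    keys = set(a) | set(b)
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in keys}

x = {"r1": 3}
y = {"r2": 1}
z = {"r1": 2, "r3": 5}

assert merge(x, y) == merge(y, x)                       # commutative
assert merge(merge(x, y), z) == merge(x, merge(y, z))   # associative
assert merge(x, x) == x                                 # idempotent
```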
- Molly (to appear) is a proof-like tool that uses lineage and a SAT solver to test a crap ton of possible fault interactions in a system, and either rejects it (with an explanation of why it doesn’t work) or proves that it does.
- Would have found the epic Kafka data loss bug
- Proves 2PC has problems, among others
- Seems to actually work. No idea how hard it is to actually use, or build models.
- reasons backwards from outcomes using lineage.
Aysylu Greenberg (Google): Benchmarking: You’re Doing it Wrong! (http://www.slideshare.net/aysylu/benchmarking-youre-doing-it-wrong-ricon-2014)
- Are you sure you’re not testing the cache? L1, L2, L3 might hold all your data.
- Do that, and test for that.
- Perhaps don’t measure immediately if you don’t want warmup
- You might be interested in this though, if you’re benchmarking things related to restart, etc.
- Avoid by measuring throughput, not individual latencies
- Is CPU frequency scaling changing your clock speed under you (e.g. from 1 GHz to 3 GHz)?
- Convergence of median on samples
- Know your time scales
- Look at histograms (Mode detection algorithms!)
- Possible coordinated omission (requests issued serially 1 ms apart: an 80 ms outlier stalls everything queued behind it, and naive per-request timing under-reports that)
- Outliers may indicate larger problems, even if they are outliers. Why are they so different. Figure it out.
- X > Y for workload Z, with trade offs A, B, and C
- Narrow questions, misleading results
- Not representative of entire program
- “Microbenchmarks should be Microtrusted”
- Choose your N wisely — “Caching all the way down”
- Measure side effects
- Clock resolution – combine and divide
- Do constant work per iteration
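Two of the tips above (drop warmup iterations; combine and divide a whole batch instead of timing individual calls) as a minimal sketch. `bench` and its parameters are illustrative, not from the talk:

```python
import time

def bench(fn, iterations=100_000, warmup=10_000):
    for _ in range(warmup):        # let caches/JIT/branch predictors settle
        fn()
    start = time.perf_counter()    # one timer read around the whole batch:
    for _ in range(iterations):    # combine-and-divide sidesteps clock
        fn()                       # resolution and per-call timer overhead
    elapsed = time.perf_counter() - start
    return iterations / elapsed    # report throughput (ops/sec)

ops_per_sec = bench(lambda: sum(range(10)))
assert ops_per_sec > 0
```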
- “How NOT to Measure Latency” by Gil Tene http://www.infoq.com/presentations/latency-pitfalls
- “Taming the Long Latency Tail” on highscalability.com http://highscalability.com/blog/2012/3/12/google-taming-the-long-latency-tail-when-more-machines-equal.html
- “Performance Analysis Methodology” by Brendan Gregg http://www.brendangregg.com/methodology.html
- “Silverman’s Mode Detection Method” by Matt Adereth http://adereth.github.io/blog/2014/10/12/silvermans-mode-detection-method-explained