- Why? Well, we couldn’t scale up anymore. (talking about MySQL)
- Shift in trust (read “>” as “more reliable than”):
- Old: Hardware > Software
- New: Software > Hardware
- Easily get out of sync, and deletes are impossible to track (sometimes due to batch processes, etc)
- Required waiting for newer postgres…
- The trigger puts it into pgq, which holds for a small time.
- pgq has great features:
- ACID
- Fast
- Persistent
- But pgq can’t keep events forever (only an hour or so), which doesn’t give you much buffer.
(Database) ----pgq----> (Kafka) ------> (Redshift)
                           \__________> (Elasticsearch)
                           \__________> Wherever else
- “Don’t use default configs” – they’re almost never what you want
- “Avoid race conditions”
- https://github.com/cmeiklejohn/derflow
- a Datalog-like language
- Part of the syncfree project: https://syncfree.lip6.fr which aims to build large scale computation without synchronization
- “Time, Clocks, and the Ordering of Events in a Distributed System”
- Logical clocks, not physical
- http://zoo.cs.yale.edu/classes/cs422/2013/bib/parker83detection.pdf
- Defined or hinted at a lot of later ideas, like the possibility of CRDTs, the CAP theorem, etc.
- version vectors
- origin point
- Descends A >= B
- Dominates A > B
- Concurrent A | B ( [2 3 2] | [2 3 1 1] )
- A merge B = C (pairwise maximum)
- C >= A and C >= B
- A | B ====> C > A and C > B
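The comparison and merge rules above, as a minimal Python sketch (my own illustration, not from the talk; vectors are plain tuples of counters, with missing entries treated as 0):

```python
from itertools import zip_longest

def descends(a, b):
    """A >= B: every counter in A is at least the matching counter in B."""
    return all(x >= y for x, y in zip_longest(a, b, fillvalue=0))

def dominates(a, b):
    """A > B: A descends B and they are not equal."""
    return descends(a, b) and not descends(b, a)

def concurrent(a, b):
    """A | B: neither descends the other."""
    return not descends(a, b) and not descends(b, a)

def merge(a, b):
    """Pairwise maximum; the result descends both inputs."""
    return tuple(max(x, y) for x, y in zip_longest(a, b, fillvalue=0))

# The example from the notes: [2 3 2] | [2 3 1 1]
a, b = (2, 3, 2), (2, 3, 1, 1)
assert concurrent(a, b)
c = merge(a, b)   # (2, 3, 2, 1)
assert dominates(c, a) and dominates(c, b)
```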
- Always pick the best
- Update your beliefs
- How do you know which is the best?
- Answer: Throw darts at (sample from) the different distributions you know and pick the biggest; when new information arrives, update the distribution, which changes the probability of picking each option, and it eventually converges on the best one.
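That answer describes Thompson sampling; here is a minimal sketch for Bernoulli rewards with Beta posteriors (the success rates, function name, and all parameters are illustrative assumptions, not from the talk):

```python
import random

def thompson(true_rates, rounds=3000, seed=42):
    rng = random.Random(seed)
    # Beta(1, 1) prior over each option's unknown success rate.
    wins = [1] * len(true_rates)
    losses = [1] * len(true_rates)
    picks = [0] * len(true_rates)
    for _ in range(rounds):
        # Throw a dart at each distribution: one sample per posterior...
        darts = [rng.betavariate(w, l) for w, l in zip(wins, losses)]
        arm = darts.index(max(darts))   # ...and pick the biggest.
        picks[arm] += 1
        # New information updates that option's distribution.
        if rng.random() < true_rates[arm]:
            wins[arm] += 1
        else:
            losses[arm] += 1
    return picks

picks = thompson([0.2, 0.5, 0.8])
assert picks.index(max(picks)) == 2   # converges on the best option
```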
- Count number of unique things without a lot of storage
- Hash unique identifier % m, set bit
- Unique Count = -m · ln((m - w) / m), where m = size of the bit vector and w = weight (number of 1s in the bit vector)
- Pitfall: when w = m, useless.
- Pitfall: when w ~= m, you’ve got an upper bound (e.g. w = 15, m = 16, 44 max estimate)
- Good for roughly the 100,000 to 1 billion range; use a linear counter for less
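The counter sketched above, in Python (my own sketch; using md5 as the hash is an assumption made for determinism, any well-mixing hash works):

```python
import hashlib
import math

def linear_count_estimate(m, w):
    """The estimator from the notes: -m * ln((m - w) / m)."""
    if w >= m:
        raise ValueError("all bits set (w == m): the estimator is useless")
    return -m * math.log((m - w) / m)

def count_uniques(items, m=1024):
    bits = [0] * m
    for item in items:
        # Hash the unique identifier mod m and set that bit.
        h = int(hashlib.md5(str(item).encode()).hexdigest(), 16)
        bits[h % m] = 1
    return linear_count_estimate(m, sum(bits))

# The pitfall from the notes: w = 15, m = 16 caps the estimate near 44.
assert round(linear_count_estimate(16, 15)) == 44

est = count_uniques(f"user-{i}" for i in range(500))
assert 400 < est < 600   # rough estimate of the 500 true uniques
```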
- Should we be doing this?
- “Have you considered ZooKeeper?”
- They had considered it. They continued anyway.
Lesson: Shoot for “just the right” engineering at the right time (i.e. the minimum effort that doesn’t have to change)
- The WAL was initially just “|” delimited text.
- Made debugging easier, eventually became a bottleneck, and then they replaced it.
- Things failed in ways they could have tested (and now do).
- A single NIC took down their entire cluster when it degraded from 10 Gb to ~5 Mb as it started to fail: data couldn’t finish sending before the timeout, so failed acknowledgements triggered resends, compounding the bandwidth loss.
- Be flexible operationally and ensure things don’t happen again.
Neha Narula (PhD Candidate CSAIL, MIT): Multicore and Distributed Systems: Sharing Ideas to Build a Scalable Database
Phase Reconciliation, current research (http://pdos.csail.mit.edu/papers/doppel:osdi14.pdf)
(sketch: transaction throughput curving upward as cores increase)
Doppel: http://pdos.csail.mit.edu/doppel
- switches to a split phase. updates to contended items modify per-core state, proceeding in parallel on different cores
- rewrites transactions such that they can be executed in phases
- works with operations that commute; right now just INCR, MULT, etc.
- Reconcile in a more serial manner.
- Works best for situations where there are few popular records (i.e. Retweet, Favorite counts on extremely popular tweets)
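A toy illustration of the split/reconcile idea (not Doppel’s actual implementation; `SplitCounter` and its API are made up for the sketch):

```python
# During a split phase, commutative updates (here, INCR) to a hot
# record go to per-core state with no cross-core coordination; a
# serial reconciliation phase folds the deltas back into the record.
class SplitCounter:
    def __init__(self, ncores):
        self.global_value = 0
        self.per_core = [0] * ncores   # contended updates land here

    def incr(self, core):
        # Split phase: increments commute, so each core updates
        # its own slot in parallel.
        self.per_core[core] += 1

    def reconcile(self):
        # Joined phase: merge per-core deltas back, serially.
        self.global_value += sum(self.per_core)
        self.per_core = [0] * len(self.per_core)
        return self.global_value

retweets = SplitCounter(ncores=4)
for core in (0, 1, 2, 2, 3):
    retweets.incr(core)
assert retweets.reconcile() == 5
```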
Jim Waldo, et al. “A Note on Distributed Computing” (http://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=CD160AD58C5532B19BCFF16C9F07DF23?doi=10.1.1.41.7628&rep=rep1&type=pdf)
- How do you compose guarantees?
- Asynchrony and partial failure
- Think about both at the same time, or the system sucks
- FUNDAMENTAL UNCERTAINTY
- Associative, Commutative, Idempotent
- They compose!
- Monotonicity -> confluent
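A quick check that an element-wise-max merge (as in version vectors or a G-counter; my own example) has all three properties, which is why such operations survive reordering, duplication, and regrouping of messages:

```python
def merge(a, b):
    """Element-wise max over per-replica counters."""
    keys = set(a) | set(b)
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in keys}

x = {"r1": 3}
y = {"r2": 1}
z = {"r1": 2, "r3": 5}

assert merge(x, y) == merge(y, x)                       # commutative
assert merge(merge(x, y), z) == merge(x, merge(y, z))   # associative
assert merge(x, x) == x                                 # idempotent
```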
- Molly (to appear) is a proof-like tool that uses lineage and a SAT solver to test a crap ton of possible fault interactions in a system, and either rejects it (with an explanation of why it doesn’t work) or proves that it does.
- Would have found the epic Kafka data loss bug
- Proves 2PC has problems, among others
- Seems to actually work. No idea how hard it is to actually use, or build models.
- reasons backwards from outcomes using lineage.
Aysylu Greenberg (Google): Benchmarking: You’re Doing it Wrong! (http://www.slideshare.net/aysylu/benchmarking-youre-doing-it-wrong-ricon-2014)
- Are you sure you’re not testing the cache? L1, L2, L3 might hold all your data.
- Do that, and test for that.
- Perhaps don’t measure immediately if you don’t want warmup
- You might be interested in this though, if you’re benchmarking things related to restart, etc.
- Avoid by measuring throughput, not individual latencies
- Is CPU frequency scaling changing your clock speed under you (e.g. from 1 GHz to 3 GHz)?
- Convergence of median on samples
- Know your time scales
- Look at histograms (Mode detection algorithms!)
- Possible coordinated omission (requests issued serially 1 ms apart: an 80 ms outlier stalls everything queued behind it, and naive per-request timing under-reports that)
- Outliers may indicate larger problems, even if they are outliers. Why are they so different. Figure it out.
- X > Y for workload Z, with trade offs A, B, and C
- Narrow questions, misleading results
- Not representative of entire program
- “Microbenchmarks should be Microtrusted”
- Choose your N wisely — “Caching all the way down”
- Measure side effects
- Clock resolution – combine and divide
- Do constant work per iteration
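Two of the tips above (drop warmup iterations; combine and divide a whole batch instead of timing individual calls) as a minimal sketch. `bench` and its parameters are illustrative, not from the talk:

```python
import time

def bench(fn, iterations=100_000, warmup=10_000):
    for _ in range(warmup):        # let caches/JIT/branch predictors settle
        fn()
    start = time.perf_counter()    # one timer read around the whole batch:
    for _ in range(iterations):    # combine-and-divide sidesteps clock
        fn()                       # resolution and per-call timer overhead
    elapsed = time.perf_counter() - start
    return iterations / elapsed    # report throughput (ops/sec)

ops_per_sec = bench(lambda: sum(range(10)))
assert ops_per_sec > 0
```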
- “How NOT to Measure Latency” by Gil Tene http://www.infoq.com/presentations/latency-pitfalls
- “Taming the Long Latency Tail” on highscalability.com http://highscalability.com/blog/2012/3/12/google-taming-the-long-latency-tail-when-more-machines-equal.html
- “Performance Analysis Methodology” by Brendan Gregg http://www.brendangregg.com/methodology.html
- “Silverman’s Mode Detection Method” by Matt Adereth http://adereth.github.io/blog/2014/10/12/silvermans-mode-detection-method-explained