@phrawzty
Last active September 26, 2018 17:33
Skillsmatter CloudNative 2018 (day 1)

Intro

  • Fire exits
  • CoC - read it, learn it, embrace it

"Private Cloud and the Sunk Cost Fallacy" - Sam Newman

  • Started with AWS in 2009
  • S3 is 12 years old (wow)
  • Amazon: "We sell electricity"; people were electrocuting themselves then blaming the provider.
  • ~ 2009: "Why would I pay for AWS? I already have so many servers."
    • Dev & test was a good reason / excuse
    • failover datacentres
  • PCI certification was a huge boon for AWS adoption (even today)
    • Effectively offloading PCI compliance
  • Lately there has been a reduction in cloud providers: today it's the Big Three
  • OMG Kubernetes is the buzzword du jour
  • "We don't seem to stop running our own computers."
    • Not convinced this is healthy. k8s is not helping.
  • The Concorde was an incredibly impressive bit of technology and engineering
    • It was basically a disaster from day one.
    • Too much political capital invested in the project. Everybody knew it was a disaster, and it was, and they did it anyway.
  • "The Upside of Quitting" - Stephen Dubner
  • In MLB, getting drafted is statistically a bad thing for one's career.
    • 40% less salary for draftees than those who enter the MLB via other means
    • Don't really understand this - need to learn more about MLB I guess?
  • "It can be hard to quit when you identify yourself with a job" - Sudhir Venkatesh
  • Mammals in general, and human children in particular, don't suffer from the sunk cost fallacy.
  • Fundamentally, sunk cost fallacy denies us the chance to take advantage of opportunities.
    • Continuing to invest in private cloud is another form of sunk cost fallacy.
  • According to IDC studies, there is a trend away from on-prem spend, but it's slow. As of 2018, 70% of IT spend is on-prem.
    • Wow.
  • "Undifferentiated Heavy Lifting"
    • Hide that ("giant mountains of crap") with a nice API
  • Accelerate: State of DevOps report; essential reading
    • "Elite performers" are 23 times more likely to use a cloud platform.
      • Correlation sure, but causation? Hmm.
      • "Cloud" in general - no distinction between private and public.
  • OpenStack was, at its peak, the world's largest open source project.
    • Turns out that making a generic version of AWS is really hard.
    • Next: k8s?
  • NIST has defined 5 characteristics of cloud computing: on-demand self-service, broad network access, resource pooling, rapid elasticity, measured service
    • also: quality of service, and feature set
  • Ok, Google has a good private cloud. You are not Google.
    • Most private clouds fail at multiple points in the NIST list
  • People who invest in private cloud rarely seem to know how to measure success
  • "Can't go to public cloud because we'll get locked in!"
    • "Costs might go up." - 10 years of evidence show the opposite
    • "They might shut the service down." - SQS and S3 (12 yo) still exist, still have API compatibility…
      • As long as a provider makes money, they'll keep doing it.
      • AWS is 55% of Amazon operating profit in Q2 2018 despite being only 12% of net sales.
  • Real Options Theory
    • Use awesome stuff now vs. Use less awesome stuff to defer cost of migrating later
    • Not adopting now means maybe avoiding costs later, but you actually pay for it now in terms of opportunity cost.
  • Increased investment in Hybrid solutions. Orgs are hedging their bets.
    • Effectively a "magical brokerage layer"
    • Lowest common denominator problem; can only surface what is common to both (all) platforms in the system
  • Instead of hybrid on-prem / public, why not multiple public clouds?
  • k8s is not a panacea. It's still a magical brokerage layer.
    • Big holes in k8s, notably data and storage.
  • The US PATRIOT act means that some organisations can't use the major public cloud providers.
  • k8s is a better abstraction that can be portable both on and off-prem
    • So k8s on public cloud, then? Only if you really need to.
    • Make sure that your on-prem or public cloud k8s setup doesn't just become openstack 2.0
      • Don't indulge the sunk cost fallacy

"Thanos - Prometheus at Scale" - Bartek Płotka

  • A single Prometheus server is actually very powerful. Massive amounts of datapoints can be processed by a single reasonably-sized node.
  • Scaling Prom is almost never due to performance reasons. More often: high-availability, life-cycle mgmt, global distribution.
  • Main problem is called "Global View": a single view of the data, amalgamated from multiple Prom instances.
    • /federate is the current solution. "Global" Prom instance that scrapes the regional Proms.
      • This violates some Prom design decisions, notably around namespacing and isolation.
      • Adds another stateful stack.
  • Metric retention is an issue.
    • Potential solution: local storage. This has downsides, including no downsampling.
    • Potential solution: writing to an external endpoint. This just offloads the issue elsewhere.
  • These are the problems (Global View, HA, and Retention) that Thanos is meant to solve.
  • Thanos is deployed as a sidecar within k8s.
  • Exposes a gRPC (Store API).
    • Basically a proxy for metrics.
  • Also: "Querier" which is a queryable API
    • Because Querier is independent from Prom, it can point at multiple Proms, which basically solves the Global View problem.
    • Deduplication logic is possible (based on labels, for example)
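(My illustration, not Thanos code.) A rough sketch of how label-based deduplication could work: treat two series as the same if their label sets match once a designated replica label is ignored.

```python
# Sketch: deduplicate series scraped from two HA Prometheus replicas
# by ignoring a designated "replica" label. Label/field names are
# hypothetical, chosen for illustration only.
def dedupe(series_list, replica_label="replica"):
    """Keep one series per label set, ignoring the replica label."""
    seen = {}
    for series in series_list:
        # Identity of a series = its labels minus the replica label.
        key = tuple(sorted((k, v) for k, v in series["labels"].items()
                           if k != replica_label))
        # First replica wins; a real implementation would merge samples.
        seen.setdefault(key, series)
    return list(seen.values())

replicas = [
    {"labels": {"job": "api", "replica": "a"}, "samples": [1, 2]},
    {"labels": {"job": "api", "replica": "b"}, "samples": [1, 2]},
    {"labels": {"job": "web", "replica": "a"}, "samples": [9]},
]
print(len(dedupe(replicas)))  # 2 distinct series after dedup
```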
  • Store API and Querier together apparently solve HA as well. Not sure why.
  • Store Gateway sits between Querier and the backend object storage, and efficiently responds to requests.
  • Gotta be honest, this was well outside of my wheelhouse. I didn't understand too much when Bartek dove into the deep internals of Prom & Thanos.
  • "Never start with Thanos. Start with Prometheus. When you need to scale, Thanos helps you do that."

"Balancing Observability and Agility in a Startup" - Alex Tasioulis

  • "Observability" is a term that is still being shaped and (re-)defined today.
  • Observability has a good definition from Control Theory.
    • "In control theory, observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs. The observability and controllability of a system are mathematical duals."
  • Twitter wrote a blog post some years ago, "4 pillars of observability": Monitoring, Alerting/Visualisation, Distributed systems tracing infra, Log aggregation/analytics
  • "The goal of an observability team is not to collect logs, metrics, or traces. It is to build a culture of engineering based on facts and feedback, and then spread that culture within the broader organisation." - @taotetek
  • Stage 0: Observability in a non-distributed system
    • Easy to reason about things because things are simple. Linear workflows, etc. Don't need much to understand what's going on.
  • Stage 1: Aggregating logs
    • Choose a Logging service, send your logs there. Not so cheap but they work - and when you don't have a lot of logs to begin with, it's fine.
    • Stream logs. Don't store and forward - too much can go wrong.
    • Logging pipelines. Different types of logs have different pipelines (reliable delivery, etc.)
  • Stage 1: Metrics & Tracing
    • "We went with Datadog because it was the most popular metrics platform. It still is!"
    • Write wrappers around the code in order to generate useful metrics.
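(My sketch, not their code.) The "wrappers around the code" idea might look like a timing decorator; the in-memory `TIMINGS` store stands in for a real metrics client like a statsd/Datadog wrapper.

```python
import time
from functools import wraps

# Hypothetical stand-in for a metrics client; records timings in
# memory purely for illustration.
TIMINGS = {}

def timed(metric_name):
    """Decorator that wraps a function and records its wall-clock time."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                TIMINGS.setdefault(metric_name, []).append(
                    time.perf_counter() - start)
        return wrapper
    return decorator

@timed("checkout.duration")
def checkout(order):
    return f"processed {order}"

checkout("order-1")
print(len(TIMINGS["checkout.duration"]))  # 1
```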
  • Stage 2: Structuring Logs
    • This is more of a cultural challenge. Have to convince everybody to structure logs in a specific way.
      • JSON is a good choice. Prescribed fields including timestamps, tags, and other stuff that can be added by the log pipeline.
    • Don't just cram raw log lines into a field. Parse everything and make good k/v pairs.
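(My sketch of the idea, with made-up field names.) One JSON log line per event, with prescribed fields plus proper k/v pairs instead of a raw-text blob:

```python
import json
import datetime

def structured_log(message, level="info", **fields):
    """Emit one JSON log line with prescribed fields; extra context
    goes in as proper k/v pairs rather than a raw log line crammed
    into a single field."""
    record = {
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "level": level,
        "message": message,
        **fields,
    }
    print(json.dumps(record))
    return record

rec = structured_log("payment accepted", user_id=42, amount_eur=9.99)
```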
  • Stage 2: Metrics
    • We wrote some scripts that built Datadog dashboards out of the metric names. Within minutes we had a standardised dashboard for each deploy.
      • We lacked context around the metrics which meant that this was useful for legacy monoliths, but not new microservices.
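(Purely my guess at the shape of such a script; this is not the Datadog API.) Deriving a standard dashboard definition from metric names, one timeseries widget per metric:

```python
# Hypothetical sketch: build a dashboard definition (a plain dict,
# not a real Datadog payload) from a service's metric names.
def build_dashboard(service, metric_names):
    return {
        "title": f"{service} (auto-generated)",
        "widgets": [
            {"type": "timeseries",
             "query": f"avg:{name}{{service:{service}}}"}
            for name in sorted(metric_names)
        ],
    }

dash = build_dashboard("billing", {"billing.requests", "billing.latency"})
print(len(dash["widgets"]))  # 2
```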
  • Stage 2: Tracing
    • Honestly not sure what he was getting at here.
  • Stage 3: "THINGS ARE GETTING EXPENSIVE"
  • Stage 3: Logs
    • ELK stack seemed OK, and AWS had a service offering, but it was expensive and had a lot of operational overhead in any case.
    • Started to think about sampling. If 95% of the logs are successes, why keep them?
      • Felt we were too early to start sampling though.
    • Humio? Cool, but still an external service so €. Fluentd? k8s, which we already used, so why not?
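(My illustration of the sampling idea they considered.) Keep every error and warning, but only a fraction of successes:

```python
import random

def should_keep(log, success_sample_rate=0.05):
    """Keep every error/warning; keep only a sampled fraction of the
    successful (info-level) lines."""
    if log.get("level") in ("error", "warning"):
        return True
    return random.random() < success_sample_rate

logs = [{"level": "info"}] * 1000 + [{"level": "error"}] * 5
kept = [entry for entry in logs if should_keep(entry)]
# All 5 errors survive; roughly 5% of the info lines do.
```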
  • Stage 3: Metrics
    • "What is cardinality in monitoring?" - Baron Schwartz
    • Most TSDB are not designed to deal with high cardinality.
    • Many services are expensive re: high-cardinality situations.
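Why cardinality blows up (my arithmetic, with made-up label counts): the series count is the product of each label's distinct values, so one high-cardinality label multiplies everything.

```python
from math import prod

# Hypothetical label cardinalities for a single metric.
label_cardinalities = {"host": 50, "endpoint": 20, "status": 5}
base = prod(label_cardinalities.values())  # 50 * 20 * 5 = 5000 series

# Adding one high-cardinality label (e.g. user_id with 100k values)
# multiplies the series count.
with_user_id = base * 100_000
print(base, with_user_id)  # 5000 500000000
```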
  • Stage 3: Tracing (AWS X-Ray)
    • We are AWS native and our languages are supported, so this was a natural match.
    • Helped us to identify things (via service graph) that we didn't know.
    • Hard to get all code owners to instrument their code consistently.
    • OSS tracing projects (opentracing) are a mixed bag.
  • Alex's 3 pillars of observability:
    • Rich context
    • Tracing info
    • Structured events
  • 1 trace = 1 log (per service)
    • Don't bother sending logs at every step - just emit one log for the entire event at the "exit point".
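(My sketch of the "one wide event per request" pattern; the class and field names are invented.) Accumulate context throughout the request and emit a single structured event at the exit point:

```python
import json
import time

class RequestEvent:
    """Accumulate context during a request; emit one wide event at exit."""
    def __init__(self, service, trace_id):
        self.fields = {"service": service, "trace_id": trace_id}
        self.start = time.perf_counter()

    def add(self, **fields):
        # Enrich the event instead of logging at every step.
        self.fields.update(fields)

    def emit(self):
        self.fields["duration_ms"] = round(
            (time.perf_counter() - self.start) * 1000, 2)
        print(json.dumps(self.fields))
        return self.fields

event = RequestEvent("checkout", trace_id="abc123")
event.add(user_id=42, cart_items=3)
event.add(payment_status="ok")
out = event.emit()
```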
  • The Future
    • The 3 industry pillars are a good start, but it's not self-evident that everybody needs those 3 things specifically.
    • Instrumenting should not be a pain, nor an exercise in re-inventing the wheel.
    • "Observability Pipeline" should become an industry standard.

Coordinate Cloud-Native Components Using Distributed State Machines - Bernd Rücker

  • @berndruecker
  • Within a closed system everything is fine. Add networking? oh no.
  • "Fallacies of distributed computing."
    • "The network is reliable." LOL
  • Some communication challenges require state handling.
                blocking
                   |
      sync --------+-------- async
                   |
               non-blocking
  • Cascading failures often occur in distributed systems where errors jump back up the stack.
  • Must accept that failure will occur in any complex system. If that problem jumps back all the way to the end-user, that's bad.
  • "Stateful retry"
    • Should occur before the user experiences a blocking error.
    • Failure should stay in scope.
    • Network hiccups should be handled this way, generally.
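(My sketch of the retry idea, not Bernd's code.) Retry a flaky call with exponential backoff so the failure stays in scope instead of surfacing to the end-user immediately:

```python
import time

def stateful_retry(operation, max_attempts=4, base_delay=0.01):
    """Retry a flaky operation with exponential backoff; only give up
    (and let the failure escape scope) after max_attempts."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts:
                raise  # now it becomes someone else's problem
            time.sleep(base_delay * 2 ** (attempt - 1))

calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("network hiccup")
    return "boarding pass sent"

print(stateful_retry(flaky))  # succeeds on the third attempt
```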
  • Stateful retry is a business decision
    • "We send boarding passes asynch. We promise to send it at least 4 hours before the flight."
  • Requirement: idempotency of services
    • "If you take away one thing today, it's this."
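(My minimal sketch of the requirement, with invented names.) Idempotency by keying on a client-supplied request ID, so a retried message cannot execute the side effect twice:

```python
# In-memory store of processed requests; a real service would persist
# this (database, etc.) so replays survive restarts.
PROCESSED = {}

def charge(request_id, amount):
    """Process each request ID at most once; a replay returns the
    original result instead of re-executing the side effect."""
    if request_id in PROCESSED:
        return PROCESSED[request_id]
    result = {"request_id": request_id, "charged": amount}
    PROCESSED[request_id] = result
    return result

first = charge("req-1", 100)
replay = charge("req-1", 100)  # retried message: no double charge
print(first is replay)  # True
```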
  • It is impossible to differentiate certain failure scenarios (independent of communication style).
  • Strategy: "Clean up". Managing state also means cleaning old cruft.
  • "Synchronous communication is the crystal meth of distributed programming." - Todd Montgomery & Martin Thompson
  • Async comms doesn't change much from an architecture perspective, but everything from a monitoring and error-handling perspective.
  • Zeebe messaging demo
    • Walk through stateful handler for network error situation.
    • If not 200, 202 is better than 500.
  • "BPMN": an ISO standard
    • Great for modelling.
    • Supports enterprise integration use-cases
  • "Life beyond Distributed Transactions" - Pat Helland, "Distributed Systems Guru"
    • "Grown-ups don't use distributed transactions."
  • Eventual consistency.
    • By definition there's a period of temporary inconsistency. That's ok. That's the new normal.
  • Compensation (and apologies)
    • In case of failure, trigger compensations. "undo" or "rollback" the stages semantically.
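(My sketch of the compensation/saga idea.) Run the steps; if one fails, trigger the compensations for the steps that already committed, in reverse order, as a semantic "undo":

```python
def run_saga(steps):
    """steps: list of (action, compensation) callables. On failure,
    run compensations for completed steps in reverse order."""
    done = []
    try:
        for action, compensation in steps:
            action()
            done.append(compensation)
    except Exception:
        for compensation in reversed(done):
            compensation()  # semantic "undo", e.g. refund a payment
        raise

log = []
def book_flight():  log.append("book flight")
def cancel_flight(): log.append("cancel flight")
def charge_card():  log.append("charge card")
def refund_card():  log.append("refund card")
def book_hotel():   raise RuntimeError("hotel full")
def cancel_hotel(): log.append("unreachable")

steps = [(book_flight, cancel_flight),
         (charge_card, refund_card),
         (book_hotel, cancel_hotel)]
try:
    run_saga(steps)
except RuntimeError:
    pass
print(log)  # ['book flight', 'charge card', 'refund card', 'cancel flight']
```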
  • 2 alternative approaches: choreography and ?
    • Event-driven choreography
      • No central controller, just one step into the next. Event-driven architecture.
      • De-coupled systems.
      • Martin Fowler
      • "If your transaction involves [only] 2 to 4 steps, choreography is a good fit" - Denis Rosa
      • No isolation which can be problematic
    • "Modular services with distributed sagas" - Caitie McCaffrey
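(My toy illustration of event-driven choreography.) No central controller: each service subscribes to the previous service's event and publishes its own.

```python
from collections import defaultdict

# Hypothetical in-process event bus; event names are invented.
handlers = defaultdict(list)

def on(event_type):
    """Register a handler for an event type."""
    def register(fn):
        handlers[event_type].append(fn)
        return fn
    return register

def publish(event_type, payload, trail):
    trail.append(event_type)
    for fn in handlers[event_type]:
        fn(payload, trail)

@on("order.placed")
def reserve_stock(payload, trail):
    publish("stock.reserved", payload, trail)

@on("stock.reserved")
def charge_payment(payload, trail):
    publish("payment.charged", payload, trail)

trail = []
publish("order.placed", {"order": 1}, trail)
print(trail)  # ['order.placed', 'stock.reserved', 'payment.charged']
```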
  • Stream processing & Single writer
  • Really moving quick here because he's over time. 😕
    • Slides look good. Dunno.

Keynote: Observability-Driven Development - Charity Majors

  • "I'm very polarising."
  • "I got locked in an elevator. I had a lovely nap."
  • In the beginning, there were people writing software, and people using software.
    • Immediate feedback loop. Works? Yay! Doesn't work! Boo!
  • We started by owning our software, but today, programmers don't often own the product.
    • The feedback loop is huge - or broken.
  • "The entire DevOps movement is an attempt to return to grace - that virtuous feedback loop."
    • Write, Deploy, and Debug. All three.
    • These feedback loops make software better.
  • "Despite coming from ops, I've always hated monitoring. Honeycomb is not a monitoring company."
  • "Observability comes from control theory. I did not make it up."
  • "Can you understand what's happening inside your systems, just by asking questions from the outside? Can you debug your code and its behaviour using its output? Can you answer new questions without shipping new code?"
  • "It's important to build tools that people understand."
  • Nagios is over 20 years old and still in production.
  • "Monitoring is one piece of software checking out another. And this is an outdated model for complex systems."
  • Complexity is exploding everywhere, but our tools are designed for a predictable world.
  • "If you can solve your problem with a LAMP stack, please do so. Never inflict a distributed system on yourself if you don't have to."
  • "You have to gather the detail that will let you find the problem. So. Many. Outliers."
  • Servers are an increasingly useless abstraction - useful to infra engs, but not to app developers.
  • First lesson of distributed systems: your system is never really up.
    • "If you have a dashboard of green, all you know is that your dashboard is lying to you."
  • "Staging is a waste of engineering resources."
    • You can only test so much. Complex systems have infinite failure possibilities - impossible to test them all.
  • What's the missing link for testing in prod? Observability.
  • Monitoring is the functional equivalent of unit tests.
  • So many unknown-unknowns. Lots of one-offs. Impossible to monitor for ahead of time.
  • "The health of the system is irrelevant. The health of each individual request is of supreme consequence."
  • Known-unknowns are predictable, unknown-unknowns require exploration.
  • Wild recounting of the origin story of Honeycomb. Wow.
  • AAAH GOTTA GO TO CATCH MY TRAIN :(