@zachm
Last active June 6, 2018 23:53
Monitorama 2018 day 3

Achieving Google-levels of Observability into your Application with OpenCensus

Morgan McLean - Google

This talk is all about opencensus.io. Morgan is the PM for it. He's arguing that the usual N pillars of observability are not sufficient to actually have observability.

Instead he's saying context/topology+status+root cause analysis == observability.

Opencensus does:

  • distributed traces
  • tags
  • metrics

They're providing libraries in a bunch of languages. This seems like a plug-and-play replacement for a prometheus+zipkin stack? https://opencensus.io/faq/index.html

n.b.: The Python client libs aren't complete yet. They do tracing but not metrics exporting.

Very focused on app-level telemetry. Don't really support system metrics etc.

Overall: out of the box observability. Writing your own exporter is super easy etc.

The present and future of Serverless observability

Yan Cui - DAZN

Sports streaming site. Not launched in the US yet, but soon.

Not being able to install daemons (because serverless) makes life hard. You don't want monitoring in your critical path either.

Great Charity Majors quote: With distributed systems you don't care about the health of the system - you care about the health of the event or the slice.

Putting billions of timeseries to work at Uber with autonomous monitoring

Prateek Rungta - Uber

Golden Signals: Something high in signal/noise ratio. Lots of them come from the SRE book.

Auto-create dashboards for your services.

Group alerts (we do this) and set some alerts as dependent on others (we don't quite do this, but we probably could).
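
A rough sketch of what grouping and dependent alerts could look like, not Uber's implementation. The alert names, the `service` field, and the `depends_on` mapping are made up for illustration, and only direct dependencies are checked.

```python
from collections import defaultdict


def group_alerts(alerts, key="service"):
    """Bucket firing alerts by a shared field so each group pages once."""
    groups = defaultdict(list)
    for alert in alerts:
        groups[alert[key]].append(alert["name"])
    return dict(groups)


def suppress_dependents(firing, depends_on):
    """Drop any alert that has a direct dependency which is also firing."""
    firing = set(firing)
    return sorted(
        name for name in firing
        if not any(parent in firing for parent in depends_on.get(name, ()))
    )


# Hypothetical alerts and dependency graph.
alerts = [
    {"name": "db_down", "service": "storage"},
    {"name": "disk_full", "service": "storage"},
    {"name": "api_5xx", "service": "api"},
    {"name": "checkout_latency", "service": "api"},
]
depends_on = {
    "api_5xx": ["db_down"],
    "checkout_latency": ["api_5xx"],
}

grouped = group_alerts(alerts)
to_notify = suppress_dependents([a["name"] for a in alerts], depends_on)
```

Here `to_notify` keeps only `db_down` and `disk_full`: the two API alerts are silenced because something they depend on is already firing.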

400-600M raw metrics per second. 20M stored metrics per second. Seeing about 20% growth quarter over quarter.
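
For a sense of scale, a quick back-of-the-envelope calculation of what 20% per quarter compounds to. It assumes the rate stays constant, which the talk did not claim.

```python
import math

quarterly_growth = 0.20          # from the talk
stored_per_second_now = 20e6     # 20M stored metrics per second, from the talk

# Four quarters of compounding: 1.2 ** 4, roughly 2.07x per year.
annual_factor = (1 + quarterly_growth) ** 4

# Projected stored rate one year out, assuming the growth rate holds.
stored_per_second_next_year = stored_per_second_now * annual_factor

# Number of quarters needed to double: ln(2) / ln(1.2), a bit under four.
quarters_to_double = math.log(2) / math.log(1 + quarterly_growth)
```

So at this pace the stored volume roughly doubles every year.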

2014-2015: Graphite. 2015-2016: Cassandra, with 16x YoY growth. Expensive, more than 1500 Cassandra hosts! Mostly due to compactions and slow repairs, and they ended up turning down replication factor to cope. Sound familiar?!

They looked at OSS products; none scaled that far. They looked at vendors; none were that cost effective. So they wrote their own: M3DB. It's open source.

This is now going pretty deep into the M3DB architecture. It seems like a pretty sweet design, but I think if you want to know all the details it's best just to look at Prateek's talk. More info: github.com/m3db

Slides: bit.ly/m3db-monitorama2018

Building Open Source Monitoring Tools

Mercedes Coyle - Sensu

Autoscaling Containers... with Math

Allan Espinosa - Bloomberg

Assisted Remediation: By trying to build an autoremediation system, we realized we never actually wanted one

Kale Stedman - Demonware

Security through Observability

Dave Cadwallader - DNAnexus

  • Relationship between Ops and Security matters!
  • Security vs Compliance
  • How to automate compliance checking
  • Creating compliance SLOs/SLAs

So DNAnexus's whole thing is being a platform for DNA research and storage. This is hardcore HIPAA information so they have to do a ton of work around compliance and reporting.

Compliance just means you meet a certain set of requirements at a certain moment in time. You still have to take action at all the other times to be secure.

They use Prometheus to ping things and say "how you doin?" To use this with Cloudwatch, for example, you need an intermediate exporter process.

He shows an example of using linear extrapolation within Prometheus to do "disk will fill up in X hours". It looks pretty simple to do. But he's also using INSPEC (github.com/inspec) which is an auditability framework.
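
Prometheus ships a `predict_linear` function for this kind of query; these notes don't record the exact expression he showed. The underlying idea is a least-squares line through recent samples, sketched here in plain Python. The function name and the sample data are made up for illustration.

```python
def hours_until_full(samples):
    """Estimate hours from the last sample until free space hits zero.

    samples: list of (hours_elapsed, free_bytes) pairs.
    Returns None when free space is not shrinking.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must span more than one point in time")

    # Ordinary least-squares fit: free_bytes ~ intercept + slope * t
    slope = sum((t - mean_t) * (y - mean_y) for t, y in samples) / var_t
    intercept = mean_y - slope * mean_t
    if slope >= 0:
        return None  # disk is not filling up

    t_zero = -intercept / slope  # time at which the fitted line crosses zero
    return max(0.0, t_zero - samples[-1][0])


# Hypothetical readings: free space in GB, sampled hourly.
readings = [(0, 100), (1, 90), (2, 80), (3, 70)]
remaining = hours_until_full(readings)
```

With these readings the disk loses 10 GB an hour, so `remaining` comes out at about 7 hours.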

INSPEC wants to SSH to each of your prod boxes and run audits on them. "Eeew!" said the security team. So you can just schedule it on each machine locally, dump to JSON, and then have a bit of code that writes to Prometheus!

I really like his summarization here. He's just doing "passed, failed, skipped" counters for each host. If you need to investigate, go to the darn host and read the logs! Then, you can put the ultimate compliance SLO on your boxes! If you have any failed tests, your detector rule flags it. Perfect.
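
A sketch of that summarization step, not the code from his exporter. It assumes the JSON report nests results as profiles → controls → results with a `status` field, and the metric name `inspec_checks` is hypothetical.

```python
import json
from collections import Counter

STATUSES = ("passed", "failed", "skipped")


def count_results(report):
    """Tally result statuses across every control in every profile."""
    counts = Counter({status: 0 for status in STATUSES})
    for profile in report.get("profiles", []):
        for control in profile.get("controls", []):
            for result in control.get("results", []):
                status = result.get("status")
                if status in counts:
                    counts[status] += 1
    return counts


def to_prometheus_text(counts, metric="inspec_checks"):
    """Render the tallies in the Prometheus text exposition format."""
    lines = [f"# TYPE {metric} gauge"]
    for status in STATUSES:
        lines.append(f'{metric}{{status="{status}"}} {counts[status]}')
    return "\n".join(lines) + "\n"


def violates_slo(counts):
    """The compliance SLO from the talk: any failed test is a violation."""
    return counts["failed"] > 0


# Hypothetical report, in the assumed shape described above.
raw = """
{"profiles": [{"controls": [
  {"id": "ssh-01", "results": [{"status": "passed"}, {"status": "passed"}]},
  {"id": "fs-02",  "results": [{"status": "failed"}]},
  {"id": "win-03", "results": [{"status": "skipped"}]}
]}]}
"""
counts = count_results(json.loads(raw))
exposition = to_prometheus_text(counts)
flagged = violates_slo(counts)
```

Keeping only three numbers per host is the point: the metrics tell you which box to look at, and the detail stays in the logs on that box.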

He's looking for collaborators on his "security through observability" project: https://github.com/geekdave/prometheus_inspec_exporter

I always enjoy Dave's talks. This one was really cool.

How to include Whistler, Kate Libby, and appreciate that our differences make our teams better.

Beth Cornils - Hashicorp

This talk is not about monitoring. It's about D&I.

How she says you should hire people. A lot to unpack here, because it appears - per my own interpretation - essentially unmeritocratic. But I may be wrong on this and missing a lot of nuance.

  • Cand. meets min qualifications for the job?
  • Cand. has capacity to learn/grow into the job?
  • Will cand. contribute to grow a culture of inclusion?

Then we do the privilege walk, except with hand raising.

Then we talk about volunteering and mentoring. Title I schools mentioned.
