Some wonderful notes, from an amazing developer I work with, on his recent attendance at Velocity Conf.

DEVOPS WEEKLY ISSUE #182 - 29th June 2014

I’m writing this quickly from San Jose airport before flying back to the UK, which is why most people will be receiving this issue 8 hours or so later than usual. Lots of content from Velocity and Devopsdays Silicon Valley this week (and probably next when I get more time to find some of the excellent presentations). It’s been great catching up with lots of folks, but a big shoutout to the organisers who put on two great events.

Sponsor

Devops Weekly is sponsored by Brightbox Cloud - serious UK-based cloud infrastructure from only 1.5p per hour (£10.95/month)

Start your £20 free trial now: http://brightbox.com/devopsweekly

Velocity and Devopsdays

Definitely one of the highlights of Velocity for me, this talk aimed to cover everything you need to know to be good at operations. Ambitious, entertaining and hugely useful.

http://adamhjk.github.io/good-at-ops/#/

Probably a tie for my favourite presentation, this next deck covers what the presenter called minimum viable bureaucracy. Lots of personal stories mixed with some wider observations. Lots to learn about organisation design in here.

https://speakerdeck.com/lauraxt/minimum-viable-bureaucracy-june-2014-edition

Opsweekly is a tool from Etsy for On call alert classification and reporting. The README could do with a screenshot but it’s a very interesting idea which brings together all the data from an on-call rota into one place for both personal tracking and bigger picture planning.

https://github.com/etsy/opsweekly

Another of the talks from Velocity I found interesting: what happens to the infrastructure when a large company buys a smaller one? In this case, what did Instagram migrate over to Facebook, and how?

http://instagram-engineering.tumblr.com/post/89992572022/migrating-aws-fb

One of the Velocity ignite talks providing a quickfire argument that you really really should be caring about security in your development and operations work.

https://speakerdeck.com/barnbarn/velocity-conference-santa-clara-2014-ignite

A topic close to my heart cropped up a few times at Devopsdays Silicon Valley, that of government. This post summarises one of the open spaces and makes a few suggestions for the US federal government.

http://www.mikemcgarr.com/blog/devops-in-the-federal-government.html

The traditional Devopsdays State of the Union was presented at both Amsterdam and at Silicon Valley this week. Riffing on the recent devops survey results, composability of systems and the move to software defined everything.

http://www.slideshare.net/botchagalupe/devopsdays-state-of-the-union-amsterdam-2014

News

This is a nice post from one of the organisers of Devopsdays Brisbane, explaining to people who haven’t come across the event what it is. Doing outreach to a local community like this is a great idea. Also, they have Sidney Dekker speaking!

http://mattcallanan.blogspot.com/2014/06/what-is-devops-days-brisbane-2014.html

Anomaly detection, and other applications of machine learning to monitoring, is a hot topic at the moment. This post is a good high level introduction, focusing on some of the tools you can try out right now.

http://blog.bigpanda.io/a-practical-guide-to-anomaly-detection/

A nice reminder that devops isn’t about killing off existing positions but about specialists working together. This post brings up the spectre of devops killing off the developer as we know it, and then debunks the idea.

http://cfengine.com/company/blog-detail/devops-killing/

Lots of large companies are getting interested in Devops and this next post should be useful to anyone working in such an enterprise. It collects common objections together along with a counter argument.

http://dev2ops.org/2014/06/adopting-devops-in-enterprise-operations/

Ever considered debugging database queries by dropping down to inspecting TCP packets? This next post makes this sound not so crazy, with some great examples.

https://vividcortex.com/blog/2014/06/23/discovering-query-bugs-by-tcp-inspection/

Jobs

Having won a number of key customer accounts, Bashton are recruiting at both senior and junior levels to join our team of Linux operations experts. Based in the North West of the UK, we design, build and manage infrastructure primarily on Amazon Web Services, providing ultra reliable solutions to customers in a range of sectors. We can offer the ability to work on large-scale web facing infrastructure without the internal politics of working for a large organisation.

http://www.bashton.com/jobs/

Tools

Given our daily use of version control systems they contain an awful lot of data past just the source code. This tool allows for exporting a git repository into the solr search engine for data mining.

https://github.com/arafalov/git-to-solr

I’ve mentioned OSv previously as an interesting take on the operating system, but trying it out locally had required a lot of effort. Enter capstan, which provides a very nice command line interface to launch OSv instances locally on your machine.

https://github.com/cloudius-systems/capstan

Cayley is an open source Graph database. It supports multiple storage backends, an HTTP based API as well as a REPL and a built-in query editor and visualiser.

https://github.com/google/cayley

If you received this email directly then you're already signed up, thanks! If however someone forwarded this email to you and you'd like to get it each week then you can subscribe at http://devopsweekly.com

http://adamhjk.github.io/good-at-ops/

INCIDENT COMMAND

The First Responder is the default Incident Commander

  • Decides what to do next
  • Coordinates resources
  • Can hand off command
  • Communicates status
  • Not about rank

There is only ONE Incident Commander.

HOW TO RUN A POST MORTEM

  1. Invoke the space: we are here to learn, not to blame
  2. Describe the incident
  3. Establish the timeline
  4. Identify contributing factors
  5. Describe customer impact
  6. Describe remediation tasks for the root cause
  7. Describe improvement tasks for response process

AVAILABILITY ROUNDUP

  • Understand your Availability Targets
  • Track and understand your M*'s
  • Reduce time to detect and repair
  • Use capacity planning to avoid obvious incidents
  • Have an incident response and command process
  • Perform and publish post-mortems for every incident
  • Prioritize the outcomes
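
To make "understand your Availability Targets" concrete, here is a minimal sketch that turns a target percentage into a downtime budget. The function name and the 30-day period are assumptions for illustration, not something from the talk.

```python
def downtime_budget_minutes(availability_percent: float, period_days: float = 30.0) -> float:
    """Minutes of downtime allowed in the period for a given availability target."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability_percent must be between 0 and 100")
    total_minutes = period_days * 24 * 60
    return total_minutes * (1.0 - availability_percent / 100.0)


# Compare a few common targets over a 30-day period (period length is an assumption).
for target in (99.0, 99.9, 99.99):
    budget = downtime_budget_minutes(target)
    print(f"{target}% over 30 days allows {budget:.1f} minutes of downtime")
```

Each extra "nine" cuts the budget by a factor of ten, which is why time to detect and time to repair matter so much at higher targets.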

People, Process, Technology

http://www.amazon.com/The-Asshole-Rule-Civilized-Workplace-ebook/dp/B000OT8GV2

ASSHOLES ARE INEFFICIENT

  • Positive interactions must outnumber negative ones 5:1
  • Bad interactions have stronger, more pervasive, and longer lasting effects

WHAT YOU CAN DO

  • Don't be an Asshole, and fire or shun those who are
  • Set clear expectations for others
  • Praise people
  • Make friends with, and care about your co-workers
  • Listen to each other
  • Take pride in your work

KAIZEN

SMALL IMPROVEMENTS

Evaluate a process, make it better. Try using the scientific method:

  1. Ask a question
  2. Do research
  3. Construct a hypothesis
  4. Test your hypothesis
  5. Analyze data and draw a conclusion
  6. Communicate your results

EFFICIENCY ROUNDUP

  • Greatest gains are in improving People
  • Continually improve process, be willing to redesign in the face of new challenges
  • Use Scalable Systems Design to improve your technology and automation

From: Matt (https://github.com/mreishus)
Date: Mon, Jun 30, 2014
Subject: Velocity Conference CA 2014 Trip Report

Summary Evaluation of Velocity 2014:

  • The mobile share of internet traffic is on pace to eclipse desktop traffic within 2014. As a whole, developers are doing a poor job of optimizing for mobile and users are frustrated. Mobile sites are actually trending slower year over year, even with faster devices accounted for.
  • Even desktop performance affects business metrics (like conversion rate, bounce rate, page views, etc.). This can usually be measured without taking the time to optimize performance first; most sites are already serving a mixture of fast and slow experiences to users. Just correlate the metric against performance while controlling for some variables (like location).
  • From Puppet's State of DevOps Report in 2014 - IT performance was quantified in a statistically valid way and highly correlated with these three independent metrics: MTTR (mean time to recover), lead time for changes, and deploy frequency. Companies with high performing IT departments were significantly more likely to meet their profitability, market share and productivity goals.
  • Surprising Information: According to many speakers, using the mean as a statistical measure of performance is worthless. Much better to split the data into quantiles (such as quartiles or percentiles). Also, many disparaged auto-scaling, including one speaker who called it "the biggest lie in IT".
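
A minimal sketch of the mean-versus-quantiles point, using only the Python standard library. The sample data and the function name are made up for illustration.

```python
import statistics


def summarize_latencies(samples_ms):
    """Return the mean alongside the 50th, 95th and 99th percentiles."""
    # With n=100, quantiles() returns 99 cut points; index k-1 is the k-th percentile.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {
        "mean": statistics.fmean(samples_ms),
        "p50": cuts[49],
        "p95": cuts[94],
        "p99": cuts[98],
    }


# Hypothetical page-load times in ms: 90 fast page views and 10 very slow ones.
samples = [100] * 90 + [5000] * 10
summary = summarize_latencies(samples)
for name, value in summary.items():
    print(f"{name}: {value:.0f} ms")
```

In this made-up sample the mean is 590 ms, a load time that no user actually experienced, while the percentiles show that half the users were fast and the slowest tenth waited five seconds.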

Knowledge gained at Velocity 2014:

  • Mobile debugging techniques
  • Real User Monitoring techniques
  • General page performance optimization techniques
  • Browser animation optimization techniques and how to prevent "layout thrashing"
  • Concept of "autonomous actors keeping promises" to make operations safer (promise theory)
  • Postmortem
  • Capacity planning
  • "Money Graph" - great metric to have, often a lagging indicator of other invisible problems
  • Etsy's method of continuous experimentation - rolling out features to an increasing % of users and tracking success
  • Google's techniques for reducing latency in a service oriented architecture
  • How to use math to detect anomalies in non-Gaussian data
  • How to include security tests in continuous integration pipelines
  • Tombstone technique to find dead or unused code
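
On the anomaly detection item: one common approach for data that is not bell-shaped is to score points against the median and the median absolute deviation (MAD) instead of the mean and standard deviation. The sketch below illustrates that general idea and is not necessarily the method presented in the talk or used by Skyline; the function name, threshold and sample data are assumptions.

```python
import statistics


def mad_outliers(values, threshold=3.5):
    """Return (index, value) pairs whose modified z-score exceeds the threshold."""
    median = statistics.median(values)
    deviations = [abs(v - median) for v in values]
    mad = statistics.median(deviations)
    if mad == 0:
        # At least half the points equal the median, so the score is undefined.
        return []
    # 0.6745 scales the MAD so the score is comparable to a z-score on normal data.
    return [
        (i, v)
        for i, v in enumerate(values)
        if 0.6745 * abs(v - median) / mad > threshold
    ]


# Hypothetical requests-per-minute series with one obvious spike.
requests_per_minute = [120, 118, 125, 122, 119, 121, 480, 123, 117, 120]
print(mad_outliers(requests_per_minute))
```

Because the median and MAD barely move when a single spike appears, the spike stands out clearly, whereas a mean and standard deviation would be dragged towards it.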

Information that may benefit my co-workers:

People, Companies and Projects of Note:

Free Tools

  • DevOps Weekly email newsletter
  • github.com/secure-pipeline (security tests in CI pipeline)
  • Weinre - web based mobile debugger (no USB cable required)
  • Android/iOS native debuggers (require USB cable)
  • Fiddler - the proxy for web developers. Also has a "Bandwidth Simulator" plugin.
  • WebPageTest - see webpage performance data on a variety of real devices loading your site (iphone, desktop, android etc...). One company made a nodejs wrapper of WPT and put it in their CI pipeline!
  • SpeedCurve - GUI tool on top of WebPageTest
  • sitespeed.io - CLI tool on top of WebPageTest
  • Google's "PageSpeed Insights" and "PageSpeed Optimization" tools
  • ModPageSpeed - automatically implement performance optimizations at the nginx level
  • Appium - selenium for android/ios
  • skyline and oculus make up etsy's kale stack - metric measuring and anomaly detection
  • R - stats package
  • PhantomJS - used by Ebay for UI testing
  • zopfli - Google gzip algorithm, backwards compatible w/ browsers but ~5% byte improvement. jQuery 18% improvement.

Paid Tools

  • NewRelic Insights - measure business metrics w/ GA like calls and correlate that with performance and availability data
  • Verisign - Global load balancer (if we decide to add redundancy to our rackspace datacenter)
  • ThousandEyes - Finds specific source of network problems between you and customer
  • Logentries - "make sense of logs", comprehensive log solution
  • Lognormal - RUM
  • Pagerduty - middleware between alerting systems (nagios, newrelic, etc.) and people's cell phones
  • Keynote - puts performance data in context, tells you how you stack up against others in your industry
  • Ghostfish - replays prod traffic in test environment
  • EdgeCast - CDN
  • CopperEgg - monitoring / metrics
  • Neustar - everything

Action items:

  • Implementing SPDY / HTTP 2.0 is the single biggest performance gain we can make for the least amount of effort. Impact is high and it's easy to do. (Also, with SPDY we can stop spriting altogether!)
  • Start tracking and understanding MTTR/MTTD (mean time to recover and mean time to detect)
  • Then start reducing MTTR/MTTD
  • Add at least one security test to our CI pipeline
  • Asset pre-fetching with link rel="prefetch" - another item with a largish impact (future potential is high, currently only supported by FF) and is easy to do.
  • Consider/discuss New Relic Insights
  • Can we include the Kale stack (skyline and oculus) with our Logstash shipped logs in ElasticSearch? Need to research.
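
As a starting point for the MTTR/MTTD action item, here is a minimal sketch of computing both from incident records. The record fields, timestamps and function names are hypothetical, and measuring time to recover from the start of the incident (rather than from detection) is an assumption we would need to agree on.

```python
from datetime import datetime
from statistics import fmean

# Hypothetical incident records: when the problem started, was detected, and was resolved.
incidents = [
    {"started": "2014-06-01T10:00", "detected": "2014-06-01T10:12", "resolved": "2014-06-01T11:00"},
    {"started": "2014-06-10T02:30", "detected": "2014-06-10T02:34", "resolved": "2014-06-10T02:50"},
    {"started": "2014-06-21T16:05", "detected": "2014-06-21T16:35", "resolved": "2014-06-21T18:05"},
]


def minutes_between(start: str, end: str) -> float:
    """Elapsed minutes between two 'YYYY-MM-DDTHH:MM' timestamps."""
    fmt = "%Y-%m-%dT%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return delta.total_seconds() / 60.0


def mean_time_to_detect(records) -> float:
    return fmean([minutes_between(r["started"], r["detected"]) for r in records])


def mean_time_to_recover(records) -> float:
    # Assumption: recovery time is measured from the start of the incident.
    return fmean([minutes_between(r["started"], r["resolved"]) for r in records])


print(f"MTTD: {mean_time_to_detect(incidents):.1f} min")
print(f"MTTR: {mean_time_to_recover(incidents):.1f} min")
```

Once the raw start, detect and resolve times are being recorded, the same data can also feed medians or percentiles if the mean turns out to hide too much.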

Talks attended:

  • Battle-tested Code Without The Battle - Security Testing and Continuous Integration
  • Debugging and Tuning Mobile Web Sites with Modern Web Browsers
  • RUM: Getting Beyond Page Level Metrics
  • Browser Performance Tools
  • Achieving Rapid Response Times In Large Online Services
  • Performance In Context - Is "Good" Good Enough
  • Exponential Load Testing: Multiply the Power, Multiply the Results
  • Lowering the Barrier to Programming
  • Building on a Bedrock of Failure
  • Responsive Web Performance In the Wild
  • How to Adapt and Innovate for 2018
  • Understanding Slowness
  • Upgrading the Web - Driving Support for New Standards
  • A Look at Looking in the Mirror: PostMortems
  • What Makes Mobile Websites Tick? How Do We Make Them Faster? Insights From WebPagetest and HTTP Archive
  • How to be Great at Operations
  • Some Simple Math to get Some Signal out of Your Ops Data Noise
  • Virtual Machines, Javascript and Assembler
  • Test Driven Mobile Development with Appium, Just Like Selenium
  • Top 10 Lessons Learned Building PageSpeed and trying to Make The Web Fast
  • Web Performance, Why It Really Matters
  • Mobile Web Is Not (Just) a Technical Challenge
  • Lightning Demos
  • Responsive & Fast Pseudo Book Reading: A Tale of Mobile Waiting
  • Software Analytics for Performance Nerds
  • Building Self-Adaptive Autonomous Infrastructure with an Advanced Monitoring Architecture
  • A 5 Minute Checklist for Application Monitoring
  • Performance and Maintainability with Continuous Experimentation
  • DevOps Means Business
  • Case Study: How Shifting to a DevOps Culture Enabled Performance and Capacity Improvements
  • Self-Repairing Deployment Pipelines: What We Ought to Mean by Distributed Orchestration
  • 5 Things You Didn't Know NGINX Could Do
  • Human Confirmation Bias In Monitoring of Systems