Mark Watts' notes on All Day DevOps.
====================================

The talks I viewed and the notes I took reflect my personal interests and
issues which I thought would be worthwhile to my employer, but there were many
other useful talks.

One of the business-side advantages of DevOps is that it's supposed to *save
time and money* by reducing overhead due, principally, to dysfunctional
patterns of behavior within the organization. Many talks in ADDO describe tools
that are intended to solve common problems in software development and
deployment. In addition, many presenters describe processes of organizational
transformation which give insight into possible difficulties and ways to get
through or avoid them.

As I see it, continuous integration (CI) and continuous deployment (CD) are
*consequences* of an organization that has built up the habits of communication
between groups, which habits we label DevOps. Without communication that
happens early and often, we end up in situations where software,
infrastructure, or IT groups use 'DevOps tools' but don't realize the full
potential of those tools to add value and deliver more rapidly. There are notes
for several tool- and technique-focused presentations below, but what matters
isn't the tools themselves but the models of product development, deployment,
and maintenance they enable.

---

Thanks to the ADDO organizers, presenters, and sponsors!

NOTE: ADDO has clipped the talks so you don't have to scrub through looking for
a talk like I had to, but the links for the originally recorded blocks are used
below since I didn't have the clips at the time. A patch with the updated links
is welcome, but I likely will not make the effort since you can find each talk
by its title or presenter name pretty easily.

Keynote, Jaya Baloo - 3:00 AM, EST
Derek Weeks, Mark Miller, Co-Founders, All Day DevOps
Jaya Baloo, Chief Information Security Officer, KPN Telecom

    - This one was mostly about quantum computing
    - The DevSecOps tie-in was mostly at the end and had to do with
      anticipating upcoming changes in computing and putting in place defensive
      security infrastructure and thinking about secrecy of cipher texts that
      have already been collected. That's a cross-functional concern, so it
      includes Dev, Ops, and Sec

https://www.youtube.com/watch?v=MnevoY_ACD4
https://www.youtube.com/watch?v=-JosVWcYUsI
DevSecOps and the DevOps Superpattern - Helen Beal

    - Beal @ Ranger4 - a DevOps consulting company
    - History of DevOps
        - Patrick Debois @ Google
        - Paul Hammond & John Allspaw @ Flickr
        - Getting Ops / Management more Agile
    - Newer definition: value stream development
    - CAMS: an acronym to describe what DevOps is

      Culture
      Automation
      Measurement
      Sharing
    - "Is DevSecOps a Good Thing?" talk (answer: "yes")
    - slide: "Is security an afterthought?" (answer: "also yes")
    - Parts Unlimited Team - simulated software development org

      Based on "The Phoenix Project" book by Gene Kim, Kevin Behr, and George
      Spafford
      (https://www.amazon.com/Phoenix-Project-DevOps-Helping-Business/dp/0988262509?SubscriptionId=AKIAILSHYYTFIVPWUY6Q&tag=duckduckgo-ffsb-20&linkCode=xm2&camp=2025&creative=165953&creativeASIN=0988262509)
    - Presents some anti-patterns in communications with security team
      ME: seems pretty heavy on things that sec needs to do to 'catch up' vs
      things other groups could do to work better with sec (e.g., designing
      pipelines in such a way that sec is empowered to not just perform scans,
      but also make and deploy patches)

      ME: Should point out that the slide is in the same language as the Agile
      Manifesto values statement (e.g., "Working software over comprehensive
      documentation").

    - Everyone has responsibility for security
    - The DevOps "Super-pattern" (slide 12): https://youtu.be/MnevoY_ACD4?t=701
        - DevOps comes from the concepts
            - Holacracy (portmanteau of "holistic" and "democracy")
            - ITSM
            - Agile
            - Lean
            - Three-legged stool (theory of constraints)
            - Learning organization
            - Safety culture
            - Harmonious Polygamist Marriage

    - Table looking at some of the concepts above through CAMS "lenses"
        - Agile Daily collab *including with sec*
        - Holacracy - A flat organizational structure
            - Don't shoot, but reward the messenger
            - Don't isolate sec
        - Agile Service Management (ASM)
            - ITIL (IT infrastructure library) processes and procedures
            - Delivering IT through Agile methods
            - Just enough governance to deliver value
            - Sharing vocabularies and intelligence
        - Lean
            - Automation can help resolve the security skills gap
        ME: This slide could be greatly condensed, at least to the points which
        are highlighted as Beal moves through it. Splitting out into CAMS
        aspects doesn't help me to understand how security can work with other
        groups. A general description of the disciplines for orientation,
        coupled with the non-obvious CAMS alignments, would suffice for laying
        out the 'superpattern'.


https://www.youtube.com/watch?v=2812-6gdbyA
Understand Immutable Infrastructure: What? Why? How? - Quentin Adam

    - This one was hard to follow at times because of the presentation style,
      but it seems that Adam's company looked at the problems that
      typically cause failures in real architectures and determined that the
      biggest ones come from human interaction with the system, and in
      particular, reconfiguration at runtime.
    - At Clever Cloud, apparently, an SSH login to a server is a 'red alert'
    - Their methodology involves deploying new instances to change configuration
      rather than changing the configuration on a running server. According to
      Adam, this means that the state space for a server is reduced from N
      variables (e.g., the space of versions of all pieces of software on a
      system) to 2 (working / not working). I suppose that using this
      methodology would reduce the incidence of 'fat-fingering' a command so
      that you accidentally take a whole cluster offline.
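
A minimal Python sketch of the "replace, don't reconfigure" idea described above. Nothing here comes from the talk or from Clever Cloud's tooling: `Instance`, `replace_instances`, and the health-check callback are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Instance:
    """A server whose configuration is fixed when it is created (hypothetical)."""
    name: str
    config_version: str


def replace_instances(
    running: List[Instance],
    new_version: str,
    is_healthy: Callable[[Instance], bool],
) -> List[Instance]:
    """Change configuration by building new instances, never by editing old ones."""
    result = []
    for old in running:
        candidate = Instance(name=old.name, config_version=new_version)
        # The only question asked of a new instance: working or not working?
        # If it is not working, the untouched old instance stays in service.
        result.append(candidate if is_healthy(candidate) else old)
    return result
```

Because the dataclass is frozen, an attempt to change `config_version` on an existing instance raises an error, which is a small-scale analogue of treating an SSH login as a 'red alert'.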


https://www.youtube.com/watch?v=6Rf5ChCSoBs
But We Can't Do That Here! - Liz Keogh

    - I liked this talk a lot.
    - Main things I took from it:
        - Cynefin : a framework for thinking about domains of problem-solving /
          decision making
            - Chaotic (act, sense, respond)
            - Complex (probe, sense, respond)
            - Complicated (sense, analyze, respond)
            - Obvious (sense, categorize, respond)
          Knowing which domain you're in is key to not making expensive mistakes.
        - Utility of value-stream mapping, even for teams whose primary
          'customer' is within the organization
            - Related one story about how infrastructure team identified 50
              steps to acquiring a server, then found the points where the
              process was disrupted, then focused on those steps.


https://www.youtube.com/watch?v=IAcUalc5_d0
Increasing the Dependability of DevOps Processes - Ingo Weber

    - Describes a framework for alerting on anomalous occurrences in a cloud
      deployment environment
    - Not gov't-focused, but Weber is part of an org that is "technically"
      governmental
    - Like Quentin Adam's presentation, indicates that most failures are the
      result of human error, but actually cites a study to back that up.
    - Framework involves creating models for various actions performed on the
      system. Main example is a rolling deployment of a new release.
    - Based on log events and polling of cloud APIs
    - Overall approach is called POD or Process-Oriented Dependability
    - Alerts based on unexpected state transitions indicated by cloud API
      results and observed log messages (e.g., restarting some group of servers
      and not all servers in the group have a 'shutdown' and a 'startup'
      message before a 'complete' message for the restart command)
    - Offline training for the models. ME: Why not online? In general, the
      behavior observed in production can be different in terms of response
      times and nominal log volumes. Although the log events described are
      probably conserved between environments, Weber leaves out a whole class
      of other events with only offline training.
    - Describes some timing-dependent alerts.  ME: The example shows what look
      like 4 different modes, which should be broken up into distinct models,
      but still maybe a good approach.
    - Near the end, also discusses how corrective actions could be triggered on
      alerts.
    - ME: It doesn't seem like this approach would serve well for "unknown
      unknowns" or undesirable emergent behavior in a system that, although
      it's undesired, still fits the trained models. Not to mention, the burden
      of creating log events for a bunch of things before you even realize that
      they're predictive of unexpected failures.
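
The restart example above (every server needs a 'shutdown' and a 'startup' before the 'complete' message) can be sketched as a small state-transition check. This is only an illustration in Python, not Weber's implementation: the models in the talk are trained offline, whereas the transition table here is written by hand, and `check_restart` and the event format are assumptions.

```python
def check_restart(events, servers):
    """Return alerts for a group restart whose log events do not fit the model.

    events:  (server, message) tuples in the order observed; server is None
             for the group-level 'complete' message.
    servers: names of all servers in the group being restarted.
    """
    # Hand-written model of the expected per-server transitions (assumption).
    expected = {
        ("running", "shutdown"): "stopped",
        ("stopped", "startup"): "restarted",
    }
    state = {s: "running" for s in servers}
    alerts = []
    for server, message in events:
        if message == "complete":
            # 'complete' is only valid once every server has been restarted.
            for s in servers:
                if state[s] != "restarted":
                    alerts.append(f"{s}: 'complete' seen while server is {state[s]}")
            return alerts
        next_state = expected.get((state.get(server), message))
        if next_state is None:
            alerts.append(f"{server}: unexpected '{message}' in state {state.get(server)}")
        else:
            state[server] = next_state
    alerts.append("restart never reported 'complete'")
    return alerts
```

A check like this only flags deviations from transitions someone thought to model, which is the "unknown unknowns" limitation noted above.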


Building Technical and Organizational Confidence Through Automated Deployments - Mieke Deenen

    - A very cool "lessons-learned" talk about getting an organization on board
      with a DevOps cultural shift
    - About the social security collection and benefits site for the
      Netherlands, werk.nl
    - Started relatively small with one success, then with gained confidence
      moved onto customer service division
      ME: Not explicitly mentioned, but custsvc seems like an excellent place
      to start from a value-stream perspective considering this is a gov't org
    - Automated deployments were used for flexibility rather than just speed
      ME: This is something I've often thought about: not just "we can deploy
      faster", but now that we don't have to wait long for deployments "what
      freedom does a quick deployment get us to experiment with things"

https://www.youtube.com/watch?v=Ulp91L2zXPE
How We Went From 40 Days to 3 Building Crystal Clear Test Cases While Improving Test Coverage! - Stephen Tyler

    - A model-based testing workflow / toolchain talk
    - Characterized as experimental
    - Primary concern is reducing defects that 'escape' to prod
    - A couple of anecdotes and a few case studies, but no comprehensive data
      showing reduction in defect escape across a variety of programs

Lessons in Leading a Fortune 100 Team to a DevOps Philosophy - Uldis Karlovs-Karlovskis

    - From "Nordis DevOps Lead" at Accenture. Accenture Latvia.
    - About managerial / communication structures
    - Mostly, didn't seem very reproducible, but suggested take-aways:
        1. Assume people are trying to do the right thing
        2. Seek intrinsic rewards rather than extrinsic rewards (e.g., cash rewards)
        3. Engagement is an employee responsibility
        4. Let people lead (in their own way)
    - ME: Not that interesting for a software engineer...


https://www.youtube.com/watch?v=SRXohzWQkp0
Escrow: How To Share Secrets - Kyle Rickman

    - Under Armour: Connected Fitness
    - Rickman is a software development infrastructure engineer for internal
      teams
    - Backstory for talk is Under Armour acquired 3 different companies with
      different products and toolchains that needed to share key-value data
      between the groups of developers from those companies.
    - Wants to permit sharing and collaboration without revealing data between
      the groups that shouldn't be shared and without requiring interaction
      with the infrastructure engineer in order to share the data.
    - Developed a tool called Escrow for that purpose
    - Escrow has hierarchical key-value lists called "Chains" and each level in
      the hierarchy is called a link
    - Emphasizes "API-first" design, or having web API that devs can use to
      build integrations
    - A group or a user owns a link and can make the link Private or Public
    - "Escrow" addresses integrity through "Rendering" chains, which holds the
      values in the chain constant at whatever point in time the rendering is
      made. The rendering is called an "Artifact"
    - ME: The Artifact concept doesn't fully address integrity since a group
      can, apparently, write arbitrary keys and values, potentially overwriting
      higher-integrity values farther up in the chain. Perhaps the Chain
      construction process is expected to identify such cases, but it seems
      like there's a real limitation there.
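
To make the concern about overwriting concrete, here is a rough Python sketch of rendering a chain into a read-only artifact. It is not Escrow's API or data model: the `render` function, the link representation as `(owner, values)` pairs, and the rule that later links override earlier ones are all assumptions made for illustration.

```python
from types import MappingProxyType


def render(chain):
    """Flatten a chain into a read-only artifact and report overridden keys.

    chain: list of (owner, values) links, top of the hierarchy first.
    Returns (artifact, overridden), where overridden lists
    (key, earlier_owner, later_owner) for each value a lower link replaced.
    """
    merged = {}
    overridden = []
    for owner, values in chain:
        for key, value in values.items():
            if key in merged and merged[key][1] != value:
                overridden.append((key, merged[key][0], owner))
            merged[key] = (owner, value)
    # Copy the values into a fresh dict and expose it read-only, so later edits
    # to the links do not change an artifact that has already been rendered.
    artifact = MappingProxyType({key: value for key, (_, value) in merged.items()})
    return artifact, overridden
```

The snapshot keeps values constant after rendering, but it records whatever the last writer put in the chain; surfacing the `overridden` list is one way a chain construction step could identify such cases.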

A DevOps State of Mind: Continuous Security with DevSecOps + Containers - Chris Van Tuin

    - Last name pronounced "Van-Tie"
    - RedHat strategist
    - Disruptors
      Empowered organization
      Rapid Innovation
      Data-Driven Intelligence
      Culture of Experimentation (enabled by IT automation)
    - Suggests that containers + cloud enable easier dev/sec integration
    - Describes a pipeline for building out to containers and moving to
      production through promotion steps
    - For security fixes, the model is to fix in the container and deploy the
      container...obviously doesn't address security issues in the container
      runtime (e.g., docker) itself or in the container orchestration platform,
      but it's a more reliable approach than having humans patch servers or
      even than having a script run against live containers. It's actually
      the same idea as the Clever Cloud CEO Quentin Adam's 'immutable
      infrastructure'.

    - ME: The mention of a company-wide docker registry really appeals to me.
      We currently have a docker registry just for our program that houses
      public images as well as our program-specific images. There are security
      issues that come up (as suggested in the talk) when vulns are discovered
      which are still sitting in publicly accessible containers. We should be
      able to make a registry within the co. for, at least, public images.
    - ME: Tesla manufacturing example...probably not the best example
      considering their current 'production hell'


https://www.youtube.com/watch?v=ApVI7-g_wpk
https://www.youtube.com/watch?v=OaojdXYSkpI
Secrets of a High Performance Security Focussed Agile Team - Kim Carter

    - Sensible security model
        - Bruce Schneier. 5 steps
        - Talks about his book chapters, which is a confusing way to start...and
          somewhat annoying.
        - Talks about 'code monkey' vs 'professional dev'
        - Describes ways to introduce security-centric activities into a sprint
        ME: The presenter sounded ill. Coughs, sniffs and swallows were distracting.