juniorz/platform-engineering.md

## platform-engineering.md

      
"Those that know, do. Those that understand, teach."
Aristotle (supposedly)

Scope: Building and running web-scale distributed cloud systems with container technologies :P

Learning goals


Establish the challenges for engineering cloud-native systems (the problem space).
Establish the approaches to address those challenges (the solution space).
Establish a taxonomy for this segment of the software development industry (the domain).
Define evaluation criteria for tools/products that address the challenges (the solution space).
Assess the landscape of the solution space and how its offerings fit the market.

Don't build a space pen (or use it) unless it's an essential part of the problem.

Taxonomy


A cloud native application is a collection of interrelated, but discrete components (services, tasks, workers) that, when coupled with configuration and instantiated in a suitable runtime, together accomplish a unified functional purpose.


Components: runnable units / executable units: virtual machines, containers, Functions-as-a-Service (FaaS).
Workload type: the component's runtime profile according to distinguishing points (replicable, daemonized, service addressable).
Supporting services (managed cloud services): load balancers, object storage, databases, (DNS?).
Traits: operational capabilities and, as such, operational concerns as opposed to developer concerns. For instance: manual scaler, autoscaler, ingress, volume mounter.

Roles and responsibilities:

Application Developers: deliver business value in the form of application code via application components.

Understand operational characteristics of the application (writes to a /persistent volume, needs 2 vCPUs, listens on port 8088/tcp) but remain unconcerned with how operational requirements are fulfilled.
Focus on the business domain.


Application Operators: deliver business value by configuring, installing and managing components via application configurations.

Focus on strategies for operating the application, rather than infrastructure details.


Infrastructure Operators: deliver value by managing low-level infrastructural components and supporting services.

Focus on how the overall infrastructure is managed.


The Open Application Model (OAM) encourages:

Application management following team structure: app developers (DEV), app operators (SRE), infra operators (INFRA).
An opinionated workflow: app developers throw components over a wall, app operators throw application configurations over a wall, and infrastructure operators satisfy those needs in the cloud infra.
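
The split of responsibilities above can be sketched as a small data model. This is only an illustration in Python, not the OAM schema: every class name, field name, and workload label below is hypothetical and chosen to mirror the taxonomy (components, traits, application configurations).

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Component:
    """Owned by application developers: operational characteristics only."""
    name: str
    workload_type: str                 # hypothetical label, e.g. "server" or "worker"
    port: Optional[int] = None         # e.g. listens on 8088/tcp
    vcpus: int = 1
    volumes: Tuple[str, ...] = ()      # e.g. ("/persistent",)


@dataclass
class Trait:
    """Owned by application operators: an operational capability."""
    kind: str                          # e.g. "manual-scaler", "ingress"
    properties: Dict[str, object] = field(default_factory=dict)


@dataclass
class ApplicationConfiguration:
    """Owned by application operators: components plus the traits applied to them."""
    name: str
    components: List[Component] = field(default_factory=list)
    traits: Dict[str, List[Trait]] = field(default_factory=dict)

    def add(self, component: Component, *traits: Trait) -> None:
        self.components.append(component)
        self.traits[component.name] = list(traits)


def infrastructure_requests(app: ApplicationConfiguration) -> List[str]:
    """What infrastructure operators (or the runtime) are asked to satisfy."""
    requests = []
    for c in app.components:
        requests.append(f"{c.name}: run {c.workload_type} with {c.vcpus} vCPU(s)")
        if c.port is not None:
            requests.append(f"{c.name}: expose {c.port}/tcp")
        for v in c.volumes:
            requests.append(f"{c.name}: mount volume at {v}")
        for t in app.traits.get(c.name, []):
            requests.append(f"{c.name}: apply trait {t.kind} {t.properties}")
    return requests


# The developer describes the component; the operator attaches a trait.
api = Component("api", "server", port=8088, vcpus=2, volumes=("/persistent",))
app = ApplicationConfiguration("shop")
app.add(api, Trait("manual-scaler", {"replicas": 3}))
for line in infrastructure_requests(app):
    print(line)
```

The point of the sketch is that the developer never states how the volume or the port is provided; that decision stays with whoever consumes `infrastructure_requests`.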

Observability: https://www.honeycomb.io/blog/observability-101-terminology-and-concepts/

Problem Space


Developers should not be burdened with infrastructural concerns.
Operators and runtimes should be free to meet a component's infrastructural needs as they see fit.
A platform should be free to choose a runtime that is capable of running a specific workload type.
Bundling components in higher-level systems (abstraction and reuse) as well as reusable blueprints (standardization).
Managing components and supporting services uniformly.
Operators should manage discrete resources (components) as a single logical unit (artifact) that comprises an app.

Solution Space


Application runtimes (PaaS): CF Application Runtime, Pivotal Application Service, Flynn, Rio (Rancher), Heroku (Salesforce), Platform.sh, Tsuru, Juju (Canonical), Banzai Cloud Pipeline


Container Runtime: Kubernetes, Mesos, Nomad, Docker Swarm, Amazon ECS, Azure Service Fabric, CF Container Runtime.

Kubernetes Distribution: CF Container Runtime, Pivotal Container Service, Charmed Kubernetes (Canonical), MicroK8s (Canonical), Rancher Kubernetes, K3s (Rancher), OpenShift, Triton (Joyent), PKE (Banzai Cloud)


Helm: manages the lifecycle of "Kubernetes applications" via charts - artifacts that bundle templates for Kubernetes manifests.
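
To make "bundle templates for Kubernetes manifests" concrete, here is a rough sketch of the idea of rendering a templated manifest with default and overridden values. Helm itself uses Go templates and a richer chart layout; this stand-in uses Python's `string.Template`, and the names `SERVICE_TEMPLATE`, `DEFAULT_VALUES`, and `render` are made up for illustration.

```python
from string import Template
from typing import Dict, Optional

# A one-template "chart": a Service manifest with placeholders.
SERVICE_TEMPLATE = Template("""\
apiVersion: v1
kind: Service
metadata:
  name: ${release}-${name}
spec:
  selector:
    app: ${name}
  ports:
    - port: ${port}
""")

# Defaults shipped with the chart; an installation may override any of them.
DEFAULT_VALUES = {"name": "web", "port": 80}


def render(release: str, overrides: Optional[Dict[str, object]] = None) -> str:
    """Merge defaults with overrides and fill in the template."""
    values = {**DEFAULT_VALUES, **(overrides or {}), "release": release}
    return SERVICE_TEMPLATE.substitute(values)


print(render("demo", {"port": 8080}))
```

The same template yields a different manifest per release, which is the reuse a chart provides; it is also plain string substitution, which is the root of the "configuration as data" criticism linked later in these notes.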


Cloud Native Application Bundle (CNAB): "a standard packaging format for multi-component distributed applications". Packages an application's components AND an installer (invocation image) that is able to manage its lifecycle via well-known verbs ("install", "upgrade", "uninstall").
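
The "well-known verbs" idea can be illustrated with a toy installer entry point. This is a sketch only and does not follow the CNAB specification: the `run_action` function, the `ActionError` type, and the in-memory `state` record of installations are assumptions made for the example.

```python
from typing import Dict


class ActionError(Exception):
    """Raised when a verb cannot be applied to the current state."""


def run_action(action: str, installation: str, version: str,
               state: Dict[str, str]) -> str:
    """Dispatch one lifecycle verb against a record of installations."""
    if action == "install":
        if installation in state:
            raise ActionError(f"{installation} is already installed")
        state[installation] = version
        return f"installed {installation} {version}"
    if action == "upgrade":
        if installation not in state:
            raise ActionError(f"{installation} is not installed")
        previous = state[installation]
        state[installation] = version
        return f"upgraded {installation} {previous} -> {version}"
    if action == "uninstall":
        if installation not in state:
            raise ActionError(f"{installation} is not installed")
        del state[installation]
        return f"uninstalled {installation}"
    raise ActionError(f"unsupported action: {action}")


state: Dict[str, str] = {}
print(run_action("install", "shop", "1.0", state))
print(run_action("upgrade", "shop", "1.1", state))
print(run_action("uninstall", "shop", "1.1", state))
```

Because the verbs are fixed, any tool that knows them can drive any bundle without knowing what the installer does internally.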


Cloud Native Buildpacks: "a higher-level abstraction for building apps compared to Dockerfiles"


Kubernetes Operators: "software extensions to Kubernetes that make use of custom resources to manage applications and their components". Operators automate Day-1 and Day-n activities by putting operational knowledge into software and abstract applications into declarative resources in order to create, configure, and manage instances of complex stateful applications.

AWS Service Operator
Kubernetes Operator for Java
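
What "putting operational knowledge into software" looks like can be sketched as a reconcile loop: compare the declared (desired) state with the observed state and derive the actions that close the gap. The following is a simplified illustration, not code from any operator framework; `plan`, `reconcile`, and the `version`/`replicas` fields are hypothetical.

```python
from typing import Dict, List


def plan(desired: Dict[str, object], observed: Dict[str, object]) -> List[str]:
    """Derive the actions needed to move the observed state to the desired state."""
    actions = []
    if observed.get("version") != desired["version"]:
        actions.append(f"upgrade to {desired['version']}")
    diff = desired["replicas"] - observed.get("replicas", 0)
    if diff > 0:
        actions.append(f"add {diff} replica(s)")
    elif diff < 0:
        actions.append(f"remove {-diff} replica(s)")
    return actions


def reconcile(desired: Dict[str, object], observed: Dict[str, object]) -> List[str]:
    """One pass: compute the actions, then apply them to the observed state."""
    actions = plan(desired, observed)
    observed.update(desired)  # stand-in for the real side effects
    return actions


desired = {"version": "1.2", "replicas": 3}   # what the custom resource declares
observed = {"version": "1.1", "replicas": 1}  # what is actually running

print(reconcile(desired, observed))  # first pass does the work
print(reconcile(desired, observed))  # second pass finds nothing to do
```

Running the same pass repeatedly is safe once the states match, which is what lets an operator also correct drift that happens after installation.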


Kubernetes Service Catalog: "an extension API that enables applications running in Kubernetes clusters to easily use external managed software offerings, such as a datastore service offered by a cloud provider."


Service broker: an implementation of the Open Service Broker API that enables platforms to provision, get access to and manage the services offered by the broker.

AWS Service Broker
Open Service Broker for Google Cloud Platform
Open Service Broker for Azure
Oracle Cloud Infrastructure Service Broker
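
The broker contract (list what is offered, provision an instance, bind to get credentials, then unbind and deprovision) can be sketched with an in-memory stand-in. The method names mirror the operations of the Open Service Broker API, but this is not its HTTP interface; the `InMemoryBroker` class, the credential fields, and the `example.internal` hostname are invented for illustration.

```python
from typing import Dict, List


class InMemoryBroker:
    """Toy broker: tracks instances and bindings in dictionaries."""

    def __init__(self, services: Dict[str, List[str]]):
        self._services = services                       # service name -> plan names
        self._instances: Dict[str, Dict[str, str]] = {}
        self._bindings: Dict[str, str] = {}             # binding id -> instance id

    def catalog(self) -> Dict[str, List[str]]:
        return {name: list(plans) for name, plans in self._services.items()}

    def provision(self, instance_id: str, service: str, plan: str) -> None:
        if plan not in self._services.get(service, []):
            raise ValueError(f"unknown service/plan: {service}/{plan}")
        if instance_id in self._instances:
            raise ValueError(f"instance {instance_id} already exists")
        self._instances[instance_id] = {"service": service, "plan": plan}

    def bind(self, instance_id: str, binding_id: str) -> Dict[str, str]:
        instance = self._instances[instance_id]         # KeyError if not provisioned
        self._bindings[binding_id] = instance_id
        # Made-up credentials; a real broker would create them in the service.
        return {
            "uri": f"{instance['service']}://{instance_id}.example.internal",
            "username": binding_id,
        }

    def unbind(self, binding_id: str) -> None:
        del self._bindings[binding_id]

    def deprovision(self, instance_id: str) -> None:
        if instance_id in self._bindings.values():
            raise ValueError(f"instance {instance_id} still has bindings")
        del self._instances[instance_id]


broker = InMemoryBroker({"postgres": ["small", "large"]})
broker.provision("db-1", "postgres", "small")
print(broker.bind("db-1", "shop-api"))
broker.unbind("shop-api")
broker.deprovision("db-1")
```

The platform supplies the instance and binding identifiers, so it can manage any broker's services the same way without knowing how they are implemented.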


Operator Lifecycle Manager: A Kubernetes Operator for Kubernetes Operators. Provides "a declarative way to install, manage, and upgrade Operators and their dependencies in a cluster".


Deployment platform spectrum

From least mature to most mature:

Bespoke runtime (scheduling, elasticity, etc)
COTS runtime (e.g., Kubernetes, Mesos, Nomad, Amazon ECS)
Platform on top of a (container) runtime
Application Runtimes (PaaS)

Management Automation spectrum

See: "Types of Operators"

(Source: https://operatorframework.io/operator-capabilities/)

Reflections

A platform team exposes an interface (API, control plane) to the organization's infrastructure based on its policies.
One common approach is defining (and enforcing, and creating) a set of tools, but it is not ideal: it exposes an API at the wrong layer.
Kubernetes is more than a container runtime. It gives you an interface to the organization's infrastructure (via its API server)! Doing everything via Kubernetes (applications and managed services) achieves the goal of exposing a uniform API to the org, and the teams can choose whatever tool talks to that API (Terraform, kubectl, Chef, Ansible, you name it).
On the other hand, using tools to define an interface is a practical choice for the platform team. For example, the Terraform Kubernetes provider did not allow defining arbitrary manifests until recently, and freedom of choice in tools can also bring additional support requests to the platform team.

Glossary

See also: Kubernetes standardized glossary


Day-1 activities: installation, configuration, etc


Day-2 (or Day-N) activities: re-configuration, update, backup, failover, restore, etc.


Kubernetes-native application: an application managed by a Kubernetes Operator (as per the Operator Lifecycle Manager docs).


Application runtime:


Container runtime:


Bespoke software: software that a company creates, maintains and runs/operates itself.


Commercial off-the-shelf (COTS) software: software created (and maintained) by a third party, but run/operated by the company itself.


Managed service: run/operated and maintained by a third-party (which can potentially be another team/department in the same company).


Futurology


Architecture optimization for cost? https://twitter.com/mohapatrahemant/status/1102401615263223809

Specifics


Patterns to route traffic to a private cluster:

https://www.getambassador.io/docs/latest/topics/concepts/kubernetes-network-architecture/#routing-traffic-to-your-kubernetes-cluster
Edge proxy: an L7 (HTTP) proxy that accepts incoming traffic from the external load balancer and routes it to in-cluster services.
Ingress controller: an edge proxy that can process Kubernetes Ingress resources.
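
The routing decision an edge proxy makes can be sketched as a lookup from (host, path) to an in-cluster backend. This is a simplified illustration, not how any particular ingress controller is implemented; the `Rule` type, the `route` function, and the hostnames are hypothetical. Prefixes are matched on whole path segments and the longest matching prefix wins.

```python
from typing import List, NamedTuple, Optional


class Rule(NamedTuple):
    host: str
    path_prefix: str
    backend: str  # "service:port" inside the cluster


def _prefix_matches(prefix: str, path: str) -> bool:
    """Match on whole path segments: /api matches /api/v1 but not /apiary."""
    if prefix == "/":
        return True
    trimmed = prefix.rstrip("/")
    return path == trimmed or path.startswith(trimmed + "/")


def route(rules: List[Rule], host: str, path: str) -> Optional[str]:
    """Return the backend for a request, or None if no rule applies."""
    candidates = [
        r for r in rules
        if r.host == host and _prefix_matches(r.path_prefix, path)
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda r: len(r.path_prefix)).backend


rules = [
    Rule("shop.example.com", "/", "web:80"),
    Rule("shop.example.com", "/api", "api:8088"),
]
print(route(rules, "shop.example.com", "/api/v1/orders"))
print(route(rules, "shop.example.com", "/cart"))
print(route(rules, "other.example.com", "/"))
```

An ingress controller, in these terms, is an edge proxy that builds its rule table from Kubernetes Ingress resources instead of from static configuration.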


Observability blueprints for Kubernetes (what to monitor/alert):

https://docs.google.com/document/d/1A9xvzwqnFVSOZ5fD3blKODXfsat5fg6ZhnKu9LK3lB4/edit#heading=h.gt9r2h2gklj3
https://github.com/kubernetes-monitoring/kubernetes-mixin/blob/master/alerts/resource_alerts.libsonnet


Common Problems:

Throttling: https://www.youtube.com/watch?v=UE7QX98-kO0&t=137s


Helm (2?) internals

http://technosophos.com/2017/03/23/how-helm-uses-configmaps-to-store-data.html


What to read


https://about.gitlab.com/devops-tools/
"The Structure of Design Problem Space" https://onlinelibrary.wiley.com/doi/pdf/10.1207/s15516709cog1603_3
"Bringing Buildpacks to Kubernetes" https://www.youtube.com/watch?v=kIJ0xBldhYY&t=7s
Explores the separation of concerns and responsibilities and how buildpacks fit into the problem space. Good for discussing the problem space.
https://blog.overops.com/pivotal-cloud-foundry-vs-kubernetes-choosing-the-right-cloud-native-application-deployment-platform/
https://blog.colinbreck.com/using-quality-views-to-communicate-software-quality-and-evolution/

Configuration management


https://twitter.com/bgrant0607/status/1121054924979064832
https://github.com/kubernetes/community/blob/master/contributors/design-proposals/architecture/resource-management.md
https://twitter.com/bgrant0607/status/1123620689930358786
https://github.com/kubernetes/community/blob/master/contributors/design-proposals/architecture/declarative-application-management.md
https://docs.google.com/document/d/1cLPGweVEYrVqQvBLJg6sxV-TrE5Rm2MNOBA_cxZP2WU/edit
Configuration as Data (or why I hate Helm, even more since playing with CUE)

https://twitter.com/bgrant0607/status/1245452184575045632
https://twitter.com/bgrant0607/status/1263165797699969024


Infrastructure management


https://cloud.google.com/config-connector/docs/overview
https://github.com/aws/aws-service-operator-k8s
https://github.com/hashicorp/terraform-k8s
https://crossplane.io/
https://github.com/keptn/keptn and https://github.com/keptn/lifecycle-toolkit

Sample (demo) applications


https://github.com/podtato-head/podtato-head
https://github.com/istio/istio/tree/master/samples/bookinfo