juniorz/not-yet.md

## not-yet.md

      
          
Technology and Society


http://appropriatingtechnology.org/?q=node/296

Kubernetes autoscaling


kubernetes/kubernetes#1629
https://github.com/kubernetes/community/blob/master/contributors/design-proposals/autoscaling/horizontal-pod-autoscaler.md
https://github.com/kubernetes/community/blob/master/contributors/design-proposals/autoscaling/vertical-pod-autoscaler.md
Scale to/from zero

kubernetes/kubernetes#69687
KEDA (https://github.com/kedacore/keda#getting-started)
https://github.com/DirectXMan12/k8s-prometheus-adapter/blob/master/docs/walkthrough.md


Kubernetes management


https://medium.com/condenastengineering/k8s-federation-v2-a-guide-on-how-to-get-started-ec9cc26b1fa7
https://medium.com/condenastengineering/clusterapi-a-guide-on-how-to-get-started-ff9a81262945
https://www.nickaws.net/aws/elixir/2019/09/02/Federation-and-EKS.html
https://www.infoq.com/podcasts/kubernetes-self-service-cluster-api/

Linux process scheduling


https://engineering.squarespace.com/blog/2017/understanding-linux-container-scheduling
Completely Fair Scheduler (CFS):

https://www.youtube.com/watch?v=MkJfuI5_hjc&t=0s
https://www.kernel.org/doc/html/latest/scheduler/sched-design-CFS.html
https://www.kernel.org/doc/html/latest/scheduler/sched-bwc.html
https://opensource.com/article/19/2/fair-scheduling-linux


https://docs.docker.com/engine/reference/run/#cpu-share-constraint
https://engineering.indeedblog.com/blog/2019/12/unthrottled-fixing-cpu-limits-in-the-cloud/
https://engineering.indeedblog.com/blog/2019/12/cpu-throttling-regression-fix/

cgroups


http://man7.org/linux/man-pages/man7/namespaces.7.html
https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v1/cgroups.html
http://man7.org/linux/man-pages/man7/cgroups.7.html
https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/6/html/resource_management_guide/ch01
https://research.google/pubs/pub36669/

X in cgroups/containers


Java: https://engineering.linkedin.com/blog/2016/11/application-pauses-when-running-jvm-inside-linux-control-groups


Network?


https://gist.github.com/CMCDragonkai/6bfade6431e9ffb7fe88
HTTP keep-alive
HTTP/2
Head of Line blocking

distributed


https://474.cmpt.sfu.ca/design-space.html
https://474.cmpt.sfu.ca/resources.html
https://474.cmpt.sfu.ca/schedule.html
https://accelazh.github.io/storage/Tail-Latency-Study
https://jepsen.io/consistency
https://accelazh.github.io/cloud/A-Summary-of-Cloud-Scheduling
https://474.cmpt.sfu.ca/Week3-Mon.html
https://web.archive.org/web/20180227095215/http://474.cmpt.sfu.ca/public/Week4-Fri.html
https://accelazh.github.io/storage/Build-My-Academic-Paper-Feedback-Network
http://home.cse.ust.hk/~weiwa/teaching/Fall15-COMP6611B/reading_list/TheTailAtScale.pdf

performance tools


https://accelazh.github.io/linux/Understand-System-Performance-Commands

tools


https://www.aosabook.org/en/nginx.html
HAProxy architecture

monitoring


https://www.infoq.com/articles/monitoring-SRE-golden-signals/
https://web.archive.org/web/20171023173225/https://www.vividcortex.com/blog/monitoring-and-observability-with-use-and-red
https://medium.com/faun/how-to-monitor-the-sre-golden-signals-1391cadc7524
https://accelazh.github.io/failure/Summarizing-Production-Server-Failure-Modes
https://accelazh.github.io/storage/Storage-Reliability-Calculations

envoy (in practice)


https://www.envoyproxy.io/docs/envoy/latest/
https://medium.com/@copyconstruct/envoy-953c340c2dca
https://blog.christianposta.com/microservices/01-microservices-patterns-with-envoy-proxy-part-i-circuit-breaking/
https://dzone.com/articles/istio-circuit-breaker-with-outlier-detection
https://www.envoyproxy.io/docs/envoy/latest/api-v2/api/v2/cluster/outlier_detection.proto
https://blog.turbinelabs.io/a-guide-to-envoys-backpressure-22eec025ef04
https://www.javacodegeeks.com/2018/01/comparing-envoy-istio-circuit-breaking-netflix-oss-hystrix.html
https://unofficialism.info/posts/envoy-proxy-demos/
https://developers.redhat.com/blog/2017/05/31/microservices-patterns-with-envoy-sidecar-proxy-part-i-circuit-breaking/
https://developers.redhat.com/blog/2017/06/01/microservices-patterns-with-envoy-proxy-part-ii-timeouts-and-retries/
https://developers.redhat.com/blog/2017/06/08/microservices-patterns-with-envoy-proxy-part-iii-distributed-tracing/
https://blog.christianposta.com/microservices/advanced-traffic-shadowing-patterns-for-microservices-with-istio-service-mesh/
https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/upstream/load_balancing/zone_aware
https://www.envoyproxy.io/docs/envoy/latest/faq/configuration/zone_aware_routing
https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/upstream/load_balancing/locality_weight
https://www.envoyproxy.io/docs/envoy/latest/faq/configuration/timeouts.html?highlight=timeout
https://thenewstack.io/lyfts-envoy-provides-move-monolith-soa/
https://learning.oreilly.com/library/view/introducing-istio-service/9781491988770/ch04.html

Cloud Native


https://www.eightypercent.net/post/layers-in-the-stack.html
https://www.eightypercent.net/post/new-container-image-format.html
https://github.com/brendandburns/designing-distributed-systems-labs
https://www.infoq.com/articles/oam-alibaba/
https://azure.microsoft.com/en-us/resources/designing-distributed-systems/
https://docs.google.com/presentation/d/1P7lg13Rw21NQ59ts22PTzI1q81nnYVK-cOVhYTzg9tg/edit#slide=id.g2876e98c14_1_3
Azure/AKS#1373

Concurrency control


envoyproxy/envoy#7789
https://github.com/Netflix/concurrency-limits
https://github.com/envoyproxy/nighthawk
https://github.com/tonya11en/bufferbloater

Multi-zone

K8s supports running a single cluster in multiple failure zones (zones in GCP, availability zones in AWS).
A single k8s cluster is limited to a single region (and cloud provider). Multi-cloud and multi-region setups require multiple clusters.

Pods in a replication controller or service are automatically spread across zones.

What is SelectorSpreadPriority?

https://kubernetes.io/docs/concepts/scheduling/kube-scheduler/#scoring


What is the effect of podAntiAffinity with topologyKey: "failure-domain.beta.kubernetes.io/zone" then?

Example: https://blog.verygoodsecurity.com/posts/kubernetes-multi-az-deployments-using-pod-anti-affinity/


How does it interact with topologySpreadConstraints (1.16)?

https://kubernetes.io/docs/concepts/workloads/pods/pod-topology-spread-constraints/
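One way to reason about the two questions above is a toy model of which zones remain eligible for the next pod. The sketch below is an assumption-laden simplification, not the scheduler's implementation: it only looks at per-zone counts of matching pods, it models `topologySpreadConstraints` with `whenUnsatisfiable: DoNotSchedule`, and it models `podAntiAffinity` as a `requiredDuringSchedulingIgnoredDuringExecution` rule whose selector matches the pod's own labels. The function names and zone names are hypothetical.

```python
def zones_allowed_by_spread(pods_per_zone, max_skew=1):
    """Zones where one more pod keeps the skew within max_skew.

    Skew is modelled as (matching pods in the zone after placement)
    minus (the current minimum across all zones).
    """
    global_min = min(pods_per_zone.values())
    return sorted(
        zone
        for zone, count in pods_per_zone.items()
        if (count + 1) - global_min <= max_skew
    )


def zones_allowed_by_anti_affinity(pods_per_zone):
    """Zones with no matching pod yet (required anti-affinity per zone)."""
    return sorted(zone for zone, count in pods_per_zone.items() if count == 0)


if __name__ == "__main__":
    cluster = {"zone-a": 2, "zone-b": 1, "zone-c": 1}
    # Spread constraint: zone-a would reach skew 2, so only b and c qualify.
    print(zones_allowed_by_spread(cluster))
    # Required anti-affinity: every zone already has a pod, nothing qualifies.
    print(zones_allowed_by_anti_affinity(cluster))
```

Under these assumptions, required anti-affinity on the zone key caps a workload at one pod per zone, while a spread constraint lets replicas keep growing as long as the zones stay balanced.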


As of 1.16, there is no zone-aware (Service) routing:

Traffic that goes via Services might cross zones.
This assumes that different zones are located close to each other in the network.


topology-aware service routing is planned for 1.17 (kubernetes/kubernetes#72046)

https://github.com/kubernetes/enhancements/blob/master/keps/sig-network/20181024-service-topology.md


ingress-nginx also does not have zone-aware routing (https://github.com/kubernetes/ingress-nginx/blob/master/docs/enhancements/20190815-zone-aware-routing.md)

However, Istio enables locality load balancing by default.

Region / zone / sub-zone are automatically configured from the k8s well-known node labels.
A Service must be associated with the caller for Istio to determine locality.
Outlier detection must be configured in a DestinationRule for each service to determine health.

istio/istio#4702 (comment)
https://istio.io/docs/ops/configuration/traffic-management/locality-load-balancing/


Constraints of having an EBS-backed PV in a multi-zone cluster:

kubernetes/kops#6267 (comment)

Docs:

https://kubernetes.io/docs/setup/best-practices/multiple-zones/
https://istio.io/docs/ops/traffic-management/locality-load-balancing/
https://istio.io/docs/reference/config/istio.mesh.v1alpha1/#LocalityLoadBalancerSetting

Kubernetes Security


https://github.com/trailofbits/audit-kubernetes
Security audit. Look at the threat model and issues raised (https://github.com/trailofbits/audit-kubernetes/issues?q=is%3Aissue+is%3Aclosed).
https://kubernetes.io/docs/concepts/security/overview/

QoS and oversubscription


https://twitter.com/bgrant0607/status/1153342318277083137
hashicorp/nomad#606 (comment)
https://threadreaderapp.com/user/bgrant0607 (in general)

CPU Limit (and throttling)


Good video explaining the problem: https://www.youtube.com/watch?v=UE7QX98-kO0
k8s issues:

kubernetes/kubernetes#67577
kubernetes/kubernetes#51135


EKS issue: aws/containers-roadmap#175
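To make the throttling problem concrete, here is a rough Python model of CFS bandwidth control. It assumes the default 100 ms period (`cpu.cfs_period_us`) and a quota (`cpu.cfs_quota_us`) equal to the CPU limit times the period, so a limit of 400m corresponds to 40 ms of CPU time per period. The function `wall_time_ms` is hypothetical, and the model ignores per-CPU quota slices, scheduling noise, and the kernel bugs discussed in the links above.

```python
def wall_time_ms(cpu_work_ms, quota_ms, period_ms=100.0, threads=1):
    """Estimate wall-clock time to finish cpu_work_ms of CPU work.

    Each period the cgroup may consume at most quota_ms of CPU time.
    With several runnable threads the quota is burned faster, and the
    cgroup is then throttled until the next period starts.
    """
    if quota_ms <= 0 or period_ms <= 0 or threads < 1:
        raise ValueError("quota, period and threads must be positive")
    remaining = cpu_work_ms
    elapsed = 0.0
    while remaining > 0:
        # CPU time usable in this period: limited by the remaining work,
        # the quota, and how much the threads can run in one period.
        burst = min(remaining, quota_ms, threads * period_ms)
        remaining -= burst
        if remaining > 0:
            elapsed += period_ms        # wait for the next quota refill
        else:
            elapsed += burst / threads  # work finishes inside this period
    return elapsed


if __name__ == "__main__":
    # 200 ms of CPU work, single thread, limit 400m (40 ms quota per period).
    print(wall_time_ms(200, 40))
    # Same work, limit of 1 CPU, spread over 4 threads.
    print(wall_time_ms(200, 100, threads=4))
```

In this model, a single-threaded request needing 200 ms of CPU under a 400m limit takes 440 ms of wall time, and four threads under a 1-CPU limit exhaust the quota in 25 ms and sit throttled for the remaining 75 ms of each period.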

Prometheus

This is a very good intro to the four types of metrics:

Counter: https://www.robustperception.io/how-does-a-prometheus-counter-work
Gauge: https://www.robustperception.io/how-does-a-prometheus-gauge-work
Summary: https://www.robustperception.io/how-does-a-prometheus-summary-work
Histogram: https://www.robustperception.io/how-does-a-prometheus-histogram-work
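The histogram type is the least obvious of the four: it exposes cumulative bucket counters keyed by an upper bound (`le`), and quantiles are estimated from them at query time. The sketch below is a simplified Python approximation of that estimation using linear interpolation inside the matching bucket. It is not Prometheus code; the function name mirrors the PromQL `histogram_quantile` function for readability only, and the sample bucket values are made up.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative buckets.

    buckets: list of (upper_bound, cumulative_count), sorted by bound,
    with float("inf") as the last upper bound.
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0  # lowest bucket assumed to start at 0
    for bound, count in buckets:
        if count >= rank and count > prev_count:
            if math.isinf(bound):
                # No upper limit to interpolate towards: fall back to the
                # highest finite bound.
                return buckets[-2][0]
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count


if __name__ == "__main__":
    # Hypothetical request durations in seconds.
    duration = [(0.1, 50), (0.5, 90), (1.0, 99), (float("inf"), 100)]
    print(histogram_quantile(0.5, duration))
    print(histogram_quantile(0.95, duration))
```

The estimate can only be as precise as the bucket layout: every observation between 0.5 s and 1.0 s is treated as evenly spread across that bucket.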

How do other metric collection systems integrate with Prometheus metrics?

https://docs.datadoghq.com/integrations/prometheus/#metrics
<histogram>.count with upper_bound tag.

There is also this free course: https://training.robustperception.io/p/introduction-to-prometheus

alertmanager
thanos
cortex

InfluxDB


https://github.com/influxdata/influxdb
https://github.com/influxdata/chronograf
https://github.com/influxdata/kapacitor

Ingress-nginx

ingress-nginx 0.26.0+ can take up to 300s (5 minutes) to terminate, because it waits for incoming connections to close. See the release notes: https://github.com/kubernetes/ingress-nginx/releases/tag/nginx-0.26.0

How long does it take for a pod scheduled for deletion to be removed from the list of backends across all ingress controller instances? Is it configurable?

How long does it take to propagate the removal of a pod to its Endpoints?
How long does it take to propagate an Endpoints change to the "Lua handler" (https://kubernetes.github.io/ingress-nginx/how-it-works/#avoiding-reloads-on-endpoints-changes)?
Can these things be measured? Is there any metric for this?


The "Cloud Native security" landscape


TUF (https://github.com/theupdateframework/specification/blob/master/tuf-spec.md)
in-toto (https://in-toto.io/)

https://www.youtube.com/watch?v=05zN-YQxEAM
https://www.youtube.com/watch?v=hbHa4OFv7Qo
https://www.infoq.com/presentations/supply-grafeas-kritis/


Grafeas (https://grafeas.io/)
Kritis (https://github.com/grafeas/kritis)

https://engineering.shopify.com/blogs/engineering/how-shopify-governs-containers-at-scale-with-grafeas-and-kritis


Podman (https://podman.io/)

https://www.youtube.com/watch?v=Qcys7fKSzB0&t=84


https://github.com/GoogleContainerTools/kaniko

Microsoft Kubernetes landscape

https://azure.microsoft.com/en-us/topic/what-is-kubernetes/

CNAB (https://github.com/deislabs/cnab-spec)
Duffle (https://github.com/cnabio/duffle)
Porter (https://porter.sh/)
Helm (https://v3.helm.sh/)
Keda (https://github.com/kedacore/keda) - and https://cloudevents.io/
OPA (https://www.openpolicyagent.org/)
Brigade (https://brigade.sh/) - and https://github.com/brigadecore/kashti
Draft (https://draft.sh/)

https://daemonza.github.io/2017/02/20/using-helm-to-deploy-to-kubernetes/
https://medium.com/@gajus/the-missing-ci-cd-kubernetes-component-helm-package-manager-1fe002aac680
https://cloudblogs.microsoft.com/opensource/2019/05/06/announcing-keda-kubernetes-event-driven-autoscaling-containers/

The CNCF landscape

https://github.com/helm/helm/releases/tag/v3.0.0-rc.3
Spinnaker and Kayenta

https://medium.com/netflix-techblog/automated-canary-analysis-at-netflix-with-kayenta-3260bc7acc69

Flagger

https://docs.flagger.app/usage/progressive-delivery

http://port.us.org/ vs https://github.com/goharbor/harbor/blob/master/README.md
https://brigade.sh/
https://gravitational.com/teleport/docs/kubernetes_ssh/ and https://gravitational.com/teleport/docs/architecture/teleport_architecture_overview/
https://github.com/aquasecurity/kube-hunter
https://github.com/GoogleContainerTools/skaffold vs https://www.deployhub.com/ vs https://tilt.dev/ vs https://squash.solo.io/ vs https://www.telepresence.io/ vs https://okteto.com/ vs https://draft.sh/
https://github.com/vmware-tanzu/octant

Data visualization


"A Tour Through the Visualization Zoo": https://homes.cs.washington.edu/~jheer//files/zoo/
"Metric graphs 101: Timeseries graphs": https://www.datadoghq.com/blog/timeseries-metric-graphs-101/
https://accelazh.github.io/datamining/Time-Series-Learning-Algorithms-Candidates

(Watch|Read)list


https://srcco.de/posts/how-zalando-manages-140-kubernetes-clusters.html
https://www.youtube.com/watch?v=1xHmCrd8Qn8&t=197s
https://blizzard.cs.uwaterloo.ca/keshav/home/Papers/data/07/paper-reading.pdf
https://accelazh.github.io/technology/Roadmap-to-Technical-Leadership
https://people.eecs.berkeley.edu/~rcs/research/interactive_latency.html
https://colin-scott.github.io/blog/2016/03/04/technologies-for-testing-and-debugging-distributed-systems/
https://accelazh.github.io/transaction/Distributed-Transaction-ACID-Study

Delivery


https://blog.spinnaker.io/managed-delivery-evolving-continuous-delivery-at-netflix-eb74877fb33c
https://github.com/spinnaker/keel
https://docs.google.com/document/d/1cgKBdT5xVFvMwut7Wji_-bC_12GoQtyZ2MQ958LDcOY/edit#heading=h.v59gzsv79kfc
https://techbeacon.com/app-dev-testing/how-airbnb-scaled-its-migration-continuous-delivery-spinnaker
https://blog.spinnaker.io/how-netflix-has-extended-spinnaker-baf1a9d6b6e3
https://blog.spinnaker.io/introducing-rollout-strategies-in-the-kubernetes-v2-provider-8bbffea109a
https://glasnostic.com/blog/how-canary-deployments-work-2-developer-vs-operator-concerns
https://github.com/weaveworks/flagger
https://github.com/spinnaker/kayenta
https://medium.com/@NetflixTechBlog/tips-for-high-availability-be0472f2599c
https://medium.com/@copyconstruct/testing-in-production-the-safe-way-18ca102d0ef1
https://accelazh.github.io/storage/Engineering-Reliability-Practices

web server


http://dyszkiewicz.me/programming/http/server/kotlin/2018/07/31/http-part1.html

Problems


https://segment.com/blog/goodbye-microservices/
https://8thlight.com/blog/colin-jones/2018/09/18/microservices-arent-magic-handling-timeouts.html
https://medium.com/@marcus.cavalcanti/lessons-learned-about-run-microservices-b360347c8a77

Kubernetes monitoring architecture (in depth)


https://github.com/kubernetes/community/blob/master/contributors/design-proposals/instrumentation/monitoring_architecture.md
(what's Infrastore): kubernetes/kubernetes#44095 (comment)
https://github.com/kubernetes/metrics
kubernetes-sigs/metrics-server#7 (comment)
https://web.archive.org/web/20180530051700/https://kubernetes.io/docs/tasks/debug-application-cluster/core-metrics-pipeline/
https://github.com/kubernetes/community/blob/master/contributors/design-proposals/instrumentation/resource-metrics-api.md
https://github.com/kubernetes/community/blob/master/contributors/design-proposals/instrumentation/metrics-server.md

Extending Kubernetes


https://kubernetes.io/docs/concepts/extend-kubernetes/extend-cluster/

Multi-region


https://www.atlassian.com/blog/technology/aws-scaling-multi-region-low-latency-service
https://read.acloud.guru/why-and-how-do-we-build-a-multi-region-active-active-architecture-6d81acb7d208