@liggitt
Created June 25, 2021 15:56

API Changes

What APIs?

  • REST APIs

    • built-in go-based APIs
    • custom resources
      • x-k8s.io - experimental, fast prototyping
      • k8s.io - "official", get API reviewed
    • most difficult to change over time
      • all (non-alpha) versions have to round-trip to each other losslessly
      • all additions (to non-alpha version) have to preserve existing semantics for previous definitions
    • most visible to users
  • Command line flags

    • distinction between admin-facing (kube-apiserver) and user-facing (kubectl)
  • Config files

    • can provide defaults
    • can be versioned
    • give a way to improve defaults over time at a config file version boundary
    • indicate stability level of a feature/config format
  • "Backend" APIs

    • gRPC
      • container runtime interface (CRI)
      • container storage interface (CSI)
      • kube-apiserver storage transformers
      • kube-apiserver network proxy
    • exec+json
      • container networking interface (CNI)
      • client-go exec credential plugins
      • kubelet exec credential plugins
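
The lossless round-trip requirement for non-alpha REST API versions can be sketched as below. `WidgetV1beta1`, `WidgetV1`, and their fields are hypothetical types invented for illustration, not real Kubernetes APIs, and the conversion functions are hand-written stand-ins for whatever conversion machinery a real API uses. The sketch assumes the only change between versions is a field rename.

```go
package main

import "fmt"

// WidgetV1beta1 is a hypothetical older API version.
type WidgetV1beta1 struct {
	Name        string
	ServiceName string
}

// WidgetV1 is a hypothetical newer version that renames ServiceName.
type WidgetV1 struct {
	Name        string
	BackendName string
}

// toV1 converts the older version to the newer one.
func toV1(in WidgetV1beta1) WidgetV1 {
	return WidgetV1{Name: in.Name, BackendName: in.ServiceName}
}

// toV1beta1 converts the newer version back to the older one.
func toV1beta1(in WidgetV1) WidgetV1beta1 {
	return WidgetV1beta1{Name: in.Name, ServiceName: in.BackendName}
}

// roundTripsFromBeta reports whether v1beta1 -> v1 -> v1beta1 loses nothing.
func roundTripsFromBeta(in WidgetV1beta1) bool {
	return toV1beta1(toV1(in)) == in
}

// roundTripsFromV1 reports whether v1 -> v1beta1 -> v1 loses nothing.
func roundTripsFromV1(in WidgetV1) bool {
	return toV1(toV1beta1(in)) == in
}

func main() {
	old := WidgetV1beta1{Name: "web", ServiceName: "backend"}
	fmt.Println("v1beta1 round-trips:", roundTripsFromBeta(old))
	fmt.Println("v1 round-trips:", roundTripsFromV1(toV1(old)))
}
```

A rename round-trips trivially. The requirement gets harder when a newer version can express something the older one cannot, because that information still has to survive a trip through the older version.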

As leads:

  • Know what APIs are in your area
  • Ensure people working on APIs in your area are familiar with API conventions

(Good) APIs are stable

  • If we do our job, people build things to integrate with the APIs we make
  • Clients call REST APIs
  • Integrations build support for backend APIs
  • Deployers script and configure command lines and config files

Be super clear about stability levels

  • Alpha

    • not enabled by default
    • can increment and drop previous alpha versions without migration
    • lessons learned
      • always have a clear picture of how you will transition to beta if the alpha goes well
        • alpha annotations were a disaster to transition to API fields; ended up supporting both in parallel, poorly
      • be explicit that something is alpha... make someone work to enable it
        • features that worked, were enabled by default, were never clearly marked as alpha, and were left unchanged for years end up being treated as GA (--node-labels)
      • alpha is the time for fast iteration without the compatibility tax
  • Beta

    • typically enabled by default
    • must be forward convertible to next beta version or to GA version
    • lessons learned
      • keep focus on improving and moving towards GA
        • CRDs took 2.5 years (10 releases) to go from v1beta1 to v1, accumulated enormous use on flawed beta versions in the meantime, and another 2 years (6 releases) to deprecate and stop serving the v1beta1 version
        • perma-beta effectively gets treated as GA (people run businesses on these things)
      • limited lifetime (3 releases until deprecation, 3 releases until removal)
      • be confident you've resolved usability/scale/expressiveness issues before going from alpha to beta
        • backwards compatibility and round-tripping with flawed beta versions is hard
  • GA
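
The "alpha is off unless someone works to enable it" rule can be sketched as a small gate lookup. This is a minimal illustration loosely modeled on the idea of feature gates; the `stage` type, the `enabled` function, and the feature names are hypothetical and are not the actual Kubernetes feature-gate API.

```go
package main

import "fmt"

type stage string

const (
	alpha stage = "ALPHA"
	beta  stage = "BETA"
	ga    stage = "GA"
)

// defaultEnabled mirrors the stability levels above: alpha is off unless
// someone opts in, beta is typically on, GA is on.
func defaultEnabled(s stage) bool {
	return s != alpha
}

// enabled resolves a feature from its stage and any explicit overrides
// (for example, values parsed from a command line flag).
func enabled(name string, stages map[string]stage, overrides map[string]bool) (bool, error) {
	s, ok := stages[name]
	if !ok {
		return false, fmt.Errorf("unknown feature %q", name)
	}
	if v, set := overrides[name]; set {
		return v, nil
	}
	return defaultEnabled(s), nil
}

func main() {
	// Hypothetical feature names, for illustration only.
	stages := map[string]stage{"NewThing": alpha, "StableThing": ga}

	on, _ := enabled("NewThing", stages, nil)
	fmt.Println("NewThing by default:", on)

	on, _ = enabled("NewThing", stages, map[string]bool{"NewThing": true})
	fmt.Println("NewThing with explicit opt-in:", on)
}
```

Rejecting unknown names keeps a typo in an override from silently doing nothing.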

As leads:

  • Know the stability level of the APIs in your area
  • In general, prioritize stabilizing/graduating those APIs (or deprecating/dropping non-GA APIs) over introducing new features. Parallel work is fine, but adding lots of new alpha APIs without progressing existing ones accumulates poorly supported features.

Good APIs are as small as possible

The bigger the surface area:

  • the harder it is to test thoroughly
  • the harder it is for users to learn/use
  • the more unanticipated combinations/interactions there can be
  • the harder it is to support and evolve while staying compatible

Lessons learned

  • PodSecurityPolicy
    • tried to provide super expressive, fine-grained policy and defaulting control over a big chunk of the Pod API (which itself is very big)
    • ran into trouble staying both backwards compatible and usable while adding support for new Pod capabilities
      • some fields defaulted permissive for compatibility (controlling new Pod fields that allowed lowering permissions)
      • some fields defaulted restrictive for compatibility (controlling new Pod fields that allowed raising permissions)
    • replacement (in progress) has a much smaller surface area (level=privileged|baseline|restricted, optional version)
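
To give a feel for how small the replacement's surface is, the sketch below validates a level and an optional version. The `policy` type and `parsePolicy` function are hypothetical, and treating an empty version as "latest" is an assumption made for this example rather than something stated above.

```go
package main

import "fmt"

// policy is a hypothetical representation of the small-surface API:
// one of three levels plus an optional version.
type policy struct {
	Level   string
	Version string
}

// parsePolicy accepts only the three levels named above.
func parsePolicy(level, version string) (policy, error) {
	switch level {
	case "privileged", "baseline", "restricted":
		// valid level
	default:
		return policy{}, fmt.Errorf("invalid level %q", level)
	}
	if version == "" {
		version = "latest" // assumed default, for illustration only
	}
	return policy{Level: level, Version: version}, nil
}

func main() {
	p, err := parsePolicy("baseline", "")
	fmt.Println(p, err)

	_, err = parsePolicy("custom", "")
	fmt.Println(err)
}
```

The entire input space is three values and a version string, which can be tested exhaustively in a way that a fine-grained policy over most of the Pod API cannot.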

As leads:

  • push back on introducing complexity (sometimes unavoidable, but always worth questioning)
  • push towards layers instead of options (a simple boolean option can ~double the test matrix for a component)
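
The claim that a boolean option can roughly double the test matrix is simple arithmetic: n independent booleans give 2^n combinations. A minimal sketch, with hypothetical option names:

```go
package main

import "fmt"

// combinations returns every on/off assignment for the given boolean options.
func combinations(options []string) []map[string]bool {
	n := len(options)
	out := make([]map[string]bool, 0, 1<<n)
	for mask := 0; mask < 1<<n; mask++ {
		c := make(map[string]bool, n)
		for i, o := range options {
			// Bit i of mask decides whether option i is on.
			c[o] = mask&(1<<i) != 0
		}
		out = append(out, c)
	}
	return out
}

func main() {
	// Hypothetical option names, for illustration only.
	opts := []string{"enableFoo", "enableBar", "enableBaz"}
	fmt.Println(len(combinations(opts)), "combinations for", len(opts), "options")

	opts = append(opts, "enableQux")
	fmt.Println(len(combinations(opts)), "combinations for", len(opts), "options")
}
```

In practice not every combination needs its own test, but each one is a configuration a user can end up running.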

Good APIs take time

Especially true for REST APIs, but most of these are true to some degree for most types of versioned APIs (REST, config, backend)

  • Time to design
  • Time to change
  • Time to implement
  • Time to review
    • Target completing API implementations in the first few weeks of a development cycle
    • Actively coordinate with API reviewer to set up time for review
    • Allow O(week) for initial API review
    • Not uncommon to have several review cycles
  • Time to {unit,integration,e2e,scale} test
  • Time to get feedback from users on alpha versions
    • at least a release
  • Time to promote to beta
    • at least a release
  • Time to document
  • Time to promote to GA
    • at least a release
  • Time to conformance test (REST API specific)
    • general expectation is that new built-in REST APIs will be included in conformance
    • if they are not generally safe or feasible for all clusters to enable, or are not broadly applicable enough to be in conformance, it might be a sign that they should not be built in

As leads, when planning a feature that involves an API:

  • coordinate timing and bandwidth on the implementation and reviewer side
    • ideally in the KEP phase ahead of a development cycle
  • ensure there's a plausible roadmap from alpha to GA
    • baked into the KEP and PRR processes, but actually think about the questions as an author or reviewer
    • understand which steps require release boundaries
    • for those steps, prioritize them early in a release cycle; the slowed cadence of 3 releases a year makes missing a planned release more significant
    • have a plan for who will be doing the work across releases (new API is a ~year process)
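
The "~year process" estimate follows from the "at least a release" steps above and the cadence of 3 releases a year. The sketch below works through the best case and the cost of one slip. The `roadmap` type and the minor version numbers are hypothetical, and it assumes exactly one release per stage, which is a lower bound rather than a prediction.

```go
package main

import "fmt"

const releasesPerYear = 3 // cadence stated above

// roadmap holds the minor release in which each stage lands.
type roadmap struct {
	Alpha, Beta, GA int
}

// earliest returns the best-case roadmap: one release per stage.
func earliest(alphaMinor int) roadmap {
	return roadmap{Alpha: alphaMinor, Beta: alphaMinor + 1, GA: alphaMinor + 2}
}

// cycles counts the development cycles from starting alpha work to shipping GA.
func cycles(r roadmap) int {
	return r.GA - r.Alpha + 1
}

// months converts development cycles to calendar months at the stated cadence.
func months(c int) int {
	return c * 12 / releasesPerYear
}

func main() {
	best := earliest(30) // hypothetical minor version
	fmt.Printf("best case: %+v, %d cycles, about %d months\n", best, cycles(best), months(cycles(best)))

	slipped := best
	slipped.GA++ // one missed release boundary
	fmt.Printf("one slip:  %+v, %d cycles, about %d months\n", slipped, cycles(slipped), months(cycles(slipped)))
}
```

At this cadence, each missed release boundary adds about four months.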

Tests and test infrastructure

Kill flakes: https://gist.github.com/liggitt/6a3a2217fa5f846b52519acfc0ffece0

Conformance

Overview/requirements

GA bar is "does it work well, is it supportable, scalable, bug free, well tested, etc"

Conformance bar is higher: "GA + users expect this to be enabled in 100% of clusters + cluster providers can reasonably enable this"

No alpha or beta features can be in conformance
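
The two bars can be written as a simple predicate. The `apiInfo` type and its fields are hypothetical. Real conformance decisions involve human review, so treat this only as a restatement of the criteria above.

```go
package main

import "fmt"

// apiInfo is a hypothetical summary of an API for planning purposes.
type apiInfo struct {
	Stage              string // "alpha", "beta", or "GA"
	ExpectedEverywhere bool   // users expect it enabled in 100% of clusters
	ProvidersCanEnable bool   // cluster providers can reasonably enable it
}

// conformanceEligible applies the conformance bar: GA plus both expectations.
func conformanceEligible(a apiInfo) bool {
	return a.Stage == "GA" && a.ExpectedEverywhere && a.ProvidersCanEnable
}

func main() {
	fmt.Println(conformanceEligible(apiInfo{Stage: "GA", ExpectedEverywhere: true, ProvidersCanEnable: true}))
	fmt.Println(conformanceEligible(apiInfo{Stage: "beta", ExpectedEverywhere: true, ProvidersCanEnable: true}))
}
```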

Tips:

As leads:

  • When planning features, understand if they will be included in conformance testing
    • baked into the KEP process, but actually think about conformance implications from a cluster provider and user perspective
  • Ensure test plans get conformance-eligible tests in place early
  • Structure tests so it is easy to switch from beta to GA endpoint (e.g. import myapi "k8s.io/api/myapi/v1beta1") without rewriting entire test
  • Pay attention to test flakiness (always good, but required for conformance tests)
  • Pay attention to test coverage during beta
    • Aim for 100% non-flaky coverage during beta
    • Makes switching test to v1 trivial
  • https://apisnoop.cncf.io/
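
One way to structure a test so that the version switch touches a single line is to route every reference through an alias. In a real test that alias would be the import alias from the bullet above. To stay self-contained, this sketch uses a Go type alias over hypothetical local types `widgetV1beta1` and `widgetV1`, which stand in for the generated API types.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical stand-ins for the same kind in two API versions.
type widgetV1beta1 struct {
	Name     string
	Replicas int
}

type widgetV1 struct {
	Name     string
	Replicas int
}

// The only line that changes on graduation: point the alias at widgetV1.
type widget = widgetV1beta1

// newTestWidget and checkWidget refer only to the alias, so they need no
// edits when the alias is switched to the GA type.
func newTestWidget(name string) widget {
	return widget{Name: name, Replicas: 1}
}

func checkWidget(w widget) error {
	if w.Name == "" {
		return errors.New("name must be set")
	}
	if w.Replicas < 1 {
		return errors.New("replicas must be at least 1")
	}
	return nil
}

func main() {
	w := newTestWidget("example")
	fmt.Println("check result:", checkWidget(w))
}
```

This only works while the beta and GA shapes stay close, which is one more reason to resolve design issues before beta.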

Code Organization

"internal"

  • https://github.com/kubernetes/kubernetes/
  • holds core Kubernetes binaries (kube-apiserver, kube-controller-manager, kubelet, kube-scheduler, kube-proxy, etc)
  • not intended for use as a library by applications outside kubernetes/kubernetes

"staging"

As leads:

  • pay attention to where code is going
  • things in "staging" should be expected to be consumed outside kubernetes/kubernetes

"vendor"

As leads:

  • be aware of key dependencies your area has
    • node: depends on cadvisor, runc
    • api-machinery: depends on json/yaml libraries
    • etc
  • work on processes for picking up security/bugfix issues in those dependencies
  • be aware of dependencies with problematic characteristics, and plan to isolate and drop them
    • cloud provider: extract to standalone binaries --> drop cloud provider dependencies
    • storage: volume plugin extraction to CSI --> drop volume plugin dependencies
    • node: dockershim deprecation to CRI --> drop docker dependencies