Kube Dev Summit Notes

Steering Committee Update

Showed list of Steering Committee backlog.

We had a meeting, and the two big items we'd been pushing on hard were:

  • voting in the proposals in the bootstrap committee
  • how we're going to handle incubator and contrib etc.

Incubator/contrib: one of our big concerns is what the consequences will be for projects and the ecosystem. We're still discussing it, please be patient. In the process of sorting out the incubator process, we have to answer what Kubernetes is, which is probably SIGs, but then what's a SIG, and who decides, and ... we end up having to examine everything. In terms of deciding what is and isn't Kubernetes, we want to have that discussion in the open. We also recognize that the project has big technical debt and is scary for new contributors.

We're also trying to figure out how to bring in new people in a way that has them adding value instead of contributing to chaos. So we need contributors to do mentorship. We also need some tools for project management, if anyone wants to work on those.

We're also going to be working on having a code of conduct for the project, if you want to be part of that discussion you can join the meetings. For private concerns: steering-private@kubernetes.io. One of the challenges is deciding how enforcement will work. We've had a few incidents over the last 2 years, which were handled very quietly. But we're having more people show up on Slack who don't know norms, so feel free to educate people on what the community standards are. As our project expands across various media, we need to track the behavior of individuals.

Q: "Can SIGs propose a decision for some of the stuff on the backlog list and bring it to the SC?" A: "Yes, please!"

The other big issue is who owns a SIG and how SIG leaders are chosen. The project needs to delegate more things to the SIGs but first we need to have transparency around the SIG governance process.

Q: "As for What Kubernetes Is: is that a living document we can reference?" A: "There's an old one. I have a talk on Friday about updating it."

There's also the question of whether we're involved in "ecosystem projects" like those hosted by the CNCF. Things will change but the important thing is to have a good governance structure so that we can make transparent decisions as things change.

Roadmap for 2018 (30min summary)

Speakers (check spelling): Apprena Singha, Igor, Jaice DuMars, Caleb Miles, someone I didn't get. SIG-PM

We have the roadmap, and we have this thing called the features process, which some of you may (not) love. And then we write a blog post, because the release notes are long and most of the world doesn't understand them.

Went over the SIG-PM mission. There were several changes in how the community behaves over 2017. We are moving to a model where SIGs decide what they're going to do, instead of making overall product decisions centrally.

2017 Major Features listed (workloads API, scalability, NetworkPolicy, CRI, GPU support, etc.). See slides. The question is, how did we do following the 2017 roadmap?

Last year, we got together and each SIG put together a roadmap. In your SIG, you can put together an evaluation of how close we came to what was planned.

Q: Last year we kept hearing about stability releases. But I'm not sure that either 1.8 or 1.9 was a "stability release". Will 2018 be the "year of the stability release"?

Q: Somehow the idea of stability needs to be captured as a feature or roadmap item.

Q: More clearly defining what is in/out of Kubernetes will help stability.

Q: What do we mean by stability? Crashing, API churn, too many new features to track, community chaos?

Q: Maybe the idea for 2018 is to just measure stability. Maybe we should gamify it a bit.

Q: The idea is to make existing interfaces and features easy to use for our users and stable. In SIG-Apps we decided to limit new features to focus everything on the workloads API.

Proposals are now KEPs (Kubernetes Enhancement Proposals), a way to catalog major initiatives. KEPs are big-picture items that get implemented in stages. This idea is based partly on how the Rust project organizes changes. Every SIG needs to set its own roadmap, so the KEP is just a template that lets SIGs plan ahead to the completion of a feature and coordinate with SIG-PM and with other SIGs.

Q: How do you submit a KEP? A: It should live in source control. Each KEP will relate to dozens or hundreds of issues; we need to preserve that as history.

If you look at the community repo, there's a draft KEP template in process. We need to make it a discoverable doc.
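
For orientation, the sketch below shows the kind of front matter a KEP might carry. It is only an illustration, assuming the draft template uses YAML metadata; the field names here are made up for the example, and the real ones live in the community repo.

    # Illustrative KEP front matter -- field names are assumptions, not the actual draft template.
    kep-number: draft
    title: Example enhancement
    owning-sig: sig-example
    participating-sigs:
      - sig-architecture
    reviewers:
      - "@a-reviewer"     # a designated reviewer field, as discussed below
    approvers:
      - "@an-approver"
    status: provisional   # e.g. provisional -> implementable -> implemented
    creation-date: 2017-12-05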

Feature Workflow

TSC: Getting something done in Kubernetes is byzantine. You need to know someone, who to ask, where to go. If you aren't already involved in the Kubernetes community, it's really hard to get involved. Vendors don't know where to go.

Jeremy: we had to watch the bug tracker to figure out what sig owned the thing we wanted to change.

TSC: so you create a proposal. But then what? Who needs to buy in for the feature to get approved?

Dhawal: maybe if it's in the right form, SIGs should be required to look at it.

Robert B: are we talking about users or developers? Are we talking about people who will build features or people who want to request features?

???: Routing people to the correct SIG is the first hurdle. You have to get the attention of a SIG to do anything. Anybody can speak in a SIG meeting, but ideas do get shot down.

Caleb: we've had some success in the release process onboarding people to the right SIG. Maybe this is a model. The roles on the release team are documented.

Anthony: as a release team, we get scope from the SIGs. The SIGs could come up with ideas for feature requests/improvement.

Tim: there's a priority sort; different projects have different priorities for developers. You need a buddy in the SIG.

Clayton: review bandwidth is a problem. Review buddies hasn't really worked. If you have buy-in but no bandwidth, do you really have buy-in?

TSC: The KEP has owners, you could have a reviewer field and designate a reviewer. But there's still a bandwidth problem.

Dhawal: many SIG meetings aren't really traceable because they're video meetings. Stuff in issues/PRs is much more referenceable for new contributors. If the feature is not searchable, then it's not available for anyone to check. If it is going to a SIG, then you need to update the issue and summarize the discussions in the SIG.

TSC: Just because a feature is assigned to a SIG doesn't mean they'll actually look at it. SIGs have their own priorities. There are so many issues in the backlog that nobody can deal with them. My search for sig/scheduling is 10 different searches to find all of the sig/scheduling issues. SIG labels aren't always applied. And then you have to prioritize the list.

???: Test plans also seem to come late in the game. This could be part of the KEP process. And user-facing documentation.

Robert B: but then there's a thousand comments; the KEP proposal is better.

???: The KEP process could be way too heavyweight for new contributors.

???: new contributors should not be starting on major features. The mentoring process should take them through minor contributions. We have approximately 200 full-time contributors. We need to make those people more effective.

TSC: even if you're a full-timer, it's hard to get things in and get a reviewer. Every release, just about everything that isn't p0 or p1 gets cut, because the person working on it can't get the reviewer and everything else lined up.

Caleb: you need to spend some time in the project before you can make things work.

Dhawal: is there a way to measure contributor hours? Are people not getting to things because people are overcommitting?

Jago: The problem is that the same people who are on the hook for the complicated features are the people you need to review your complicated feature. Googlers who work on this are trying to spread out their own projects so that they have more time at the end of the review cycle.

Jaice: If you're talking about a feature, and you can't get anyone to talk about it, either the right people aren't in the room, or there just aren't enough people to make it happen. If we do "just enough" planning to decide what we can do and not do, then we'll waste a lot less effort. We need to know what a SIG's "velocity" is.

Connor: the act of acquiring a shepherd is itself subject to nepotism. You have to know the right people. We need a "hopper" for shepherding.

Tim: not every contributor is equal, some contributors require a lot more effort than others.

Robert: A "hopper" would circumvent the priority process.

Josh: there will always be more submitters than reviewers. We've had this issue in Postgres forever. The important thing is to have a written, transparent process so that when things get rejected it's clear why. Even if it's "sorry, the SIG is super-busy and we just can't pay attention right now."

Dhawal: there needs to be a ladder. The contributor ladder.

TSC: a lot of folks who work on Kube are a "volunteer army." A lot of folks aren't full-time.

Caleb: there is a ladder. People need to work hard on replacing themselves, so that they're not stuck doing the same thing all the time. How do you scale trust?

???: Kubernetes is a complicated system, and not enough is written down, and a lot of what's there we'd like to change. It's a lot easier for a Googler to help another Googler, because they're in the same office and their priorities align. That's much harder to do across organizations, because maybe my company doesn't care about the VMware provider.

Jaice: for the ladder, is there any notion that in order to ascend the ladder you have to have shepherded in a certain number of people? There should be.

TSC: frankly, mentoring people is more important than writing code. We need to bring more people into Kubernetes in order to scale the community.

Josh: we need the process to be documented, for minor features and major ones. Maybe the minor feature process belongs to each SIG.

Jaice: the KEP is not feature documentation, it's process documentation for any major change. It breaks down into multiple features and issues.

???: The KEP needs to include who the shepherds should be.

Clayton: reviewer time is the critical resource. The prioritization process needs to allocate that earlier to waste less.

Jeremy: the people we sell to are having problems we can't satisfy in Kubernetes. We have a document for a new feature, but we need every SIG to look at it (multi-network). This definitely needs a KEP, but is a KEP enough? We've probably done too much talking.

Clayton: the conceptual load on this is so high that people are afraid of it. This may be beyond what we can do in the feature horizon. It's almost like breaking up the monolith.

Robert: even small changes you need buy-in across SIGs. Big changes are worse.

Connor: working groups are one way to tackle some of these big features.

What's Up with SIG-Cluster-Lifecycle

Luxas talking: I can go over what we did last year, but I'd like to see your ideas about what we should be doing for the future, especially around hosting etc.

How can we make Kubeadm beta for next year? Opinions:

  • HA
    • etcd-multihost
    • some solution for apiserver, controller
  • Self-hosted Kubeadm

Q: Can someone write a statement on purpose & scope of Kubeadm?

To install a minimum viable, best-practice cluster for Kubernetes. You have to install your own CNI provider. Kubeadm isn't meant to endorse any providers at any level of the stack.
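
As a rough sketch of that scope, a minimal kubeadm configuration of that era might look something like the following. The apiVersion and field names are from the v1alpha1 config as best I recall; treat them as illustrative rather than authoritative.

    # Rough sketch only -- v1alpha1-era kubeadm config, field names may differ.
    apiVersion: kubeadm.k8s.io/v1alpha1
    kind: MasterConfiguration
    kubernetesVersion: v1.9.0
    networking:
      podSubnet: 10.244.0.0/16   # must match the CNI manifest you apply yourself
    # kubeadm stops at the control plane; the CNI provider is installed separately
    # (e.g. kubectl apply -f <your CNI manifest>) and is not endorsed by kubeadm.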

Joe: a sub-goal (not required for GA) would be to break out components so that you can install just specific things. Would also like documentation of what kubeadm does under the covers.

Josh: requested documentation on "how do I modify my kubeadm install". Feels this is needed for GA. Another attendee felt the same thing.

One of the other goals is to be a building block for higher-level installers. Talking to Kubespray people, etc. Enabling webhooks used as example.

There was some additional discussion of various things people might want. One user wanted UI integration with Dashboard. The team says they want to keep the scope really narrow in order to be successful. UI would be a different project. GKE team may be working on some combination of Kubeadm + Cluster API. Amazon is not using Kubeadm, but Docker is. Docker for Mac and Windows will ship with embedded kubernetes in beta later this week.

Kubeadm dilemma: we want humans to be able to run kubeadm and have a good experience, and we want automation to be able to run it. I don't think that can be the same tool. They've been targeting kubeadm at people; we might want to make a slightly different UI for machines. Josh says that automating it works pretty well; it's just that error output is annoying to capture.

HA definition:

  • etcd should be in a quorum HA standard (3+ nodes)
  • more than one master
  • all core components: apiserver, scheduler, kcm, need to be on each master
  • have to be able to add a master
  • upgrades

Or: we want to be able to survive the loss of one host/node, including a master node. This is different: if we want to survive the loss of any one master, we only need two. Argument ensued. Also, what about the recovery or replacement case? A new master needs to be able to join (manual command).

What about HA upgrades? Are we going to support going from one master to three? Yes, we have to support that.

Revised 4 requirements (a config sketch follows the list):

  • 3+ etcd replicas
  • all master components running in each master
  • all TLS secured
  • Upgrades for HA clusters
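
To make the first three requirements concrete, one way a kubeadm HA setup could express them at the time was to point every master at an external, TLS-secured, three-member etcd cluster. This is a sketch under the same v1alpha1-era assumptions as above; the field names are illustrative.

    # Illustrative only: external 3-member etcd quorum, TLS-secured, shared by all masters.
    apiVersion: kubeadm.k8s.io/v1alpha1
    kind: MasterConfiguration
    etcd:
      endpoints:
        - https://10.0.0.10:2379
        - https://10.0.0.11:2379
        - https://10.0.0.12:2379
      caFile: /etc/kubernetes/pki/etcd/ca.crt
      certFile: /etc/kubernetes/pki/etcd/client.crt
      keyFile: /etc/kubernetes/pki/etcd/client.key
    # Run the apiserver, scheduler, and controller-manager on each master with the same config;
    # upgrades of such HA clusters (the fourth requirement) are the part kubeadm still needs to cover.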

Everyone says that we want a production environment, but it's hard to define what "production grade" means. We need to stop saying that. Over time, what matters is, "is it maintained". If it's still being worked on, over time it'll get better and better.

CoreOS guy: trying to do self-hosted etcd. There are a lot of unexpectedly fragile moments. Even plain HA etcd isn't well tested upstream; there aren't enough E2E tests. Self-hosting makes this worse. The etcd operator needs work. There needs to be a lot of work by various teams. Self-hosted control planes work really well; they host all of their customers that way. It's etcd that's special.

There are some problems with how Kubernetes uses HA etcd in general. Even if the etcd operator were perfect and it worked, we couldn't necessarily convince people to use it.

Should Kubeadm focus on stable installs, or should it focus on the most cutting-edge features? To date, it's been focused on the edge, but going to GA means slowing down. Does this mean that someone else will need to be forward-looking? Or do we do feature flags?

SIG-Cluster-Lifecycle should also document recommendations on things like "how much memory do I need." But these requirements change all the time. We need more data, and testing by sig-scalability.

For self-hosting, the single-master situation is different from multi-master. We can't require HA. Do we need to support non-self-hosted? We can't test all the paths, there's a cost of maintaining it. One case for non-self-hosted is security, in order to prevent subversion of nodes.

Also, we need to support CA-signed Kubelet certs, but that's mostly done.

So, is HA a necessity for GA? There are a bunch of automation things that already work really well. Maybe we should leave that to external controllers (like Kops etc) to use Kubeadm as a primitive. Now we're providing documentation for how you can provide HA by setting things up. But how would kubeadm upgrade work, then?
