-name-
Regarding tonight's site downtime, I have:
- Moved all of the problem commits to a separate branch, and notified anyone who may have pulled during that hour and a half of what to do to make sure that their copy of the repo is in good shape despite the history rewrite.
- Hard reset master to the last good state, and manually re-run the deployment script against that code.
- Restored the DB for the live site from the most recent backup; luckily that was only about three hours before the site went down.
- Made a suitably vague apology to the support list in response to questions about the site suddenly going down and losing a couple of hours' worth of user data. I'm really not sure how to be more transparent with our users without both outing the person who ignored best practices in the first place and exposing our extremely fragile internal workflow.
That last point deserves further discussion. This was not an isolated incident; it was yet another symptom of a problem endemic to this project's existing access policies and workflows. It was not the first such symptom, merely the most publicly visible one, and it will not be the last unless we correct the root problem.
I realize that we've gone rounds on the subject of workflow management and access control before, but I'm writing with the hope that your eyes have been opened by this incident. You'll recall that I predicted this very thing in my email to you on 2/08:
"The end result of the 'trust everybody with everything' approach is, in our specific case, a bunch of well-meaning but inexperienced people letting their excitement run ahead of their skillsets, deploying untested code, making undocumented configuration changes, and so on at alarming rates. Even in a best-case scenario where every single volunteer we have is both extremely competent and extremely well-meaning, at over two dozen contributors there's the inevitable problem of too many cooks in the kitchen. When projects scale, they need to divide and conquer in order to maintain the quality of thei[sic] work. Those that fail to do so topple over."
I don't feel that the problem is -othername- personally. He did what junior coders do -- act impetuously and without enough regard for the things that can and do go wrong from even minor changes to a live system -- and that is why we have experienced veterans on this project. Every single one of us was a junior coder at some point, and every single one of us was lucky enough to have some grouchy curmudgeon grab our hand at the last minute before we broke something people depended on in our excitement to get a new feature out. -othername- is a good coder; he just isn't a senior coder yet.
Every large, successful open source project has a hierarchy. It's not undemocratic; it's practical. It is the nature of open source that if our leadership sucks, anybody can fork and do better. That protects our contributors and our users. We can also protect our contributors by not making them work with fuzzy objectives and no idea who's been making changes to whatever cog in the works they are trying to manage, and by providing those new to this project in particular or to coding in general with a safety net to prevent these kinds of disasters.
I've been working on and organizing open source projects for a long time. I have had the benefit of learning from people far more experienced and competent than myself. I know exactly how much work your position entails, and I have no interest in unseating you. I have enough other commitments to keep me busy. What I do want to do, under your blessing as project lead, is to institute the technical and social structure that will make this project viable as it grows. That means:
- Limiting every contributor's access to the things he or she needs to do the job he or she does.
- Establishing clear "ownership" of components, so we always know who the responsible expert is, and that person knows everything going on with the components he or she is responsible for.
- Actually enforcing our bug tracker workflow and deployment best practices, especially those with regard to testing.
- Creating an atmosphere of mentorship, so that those of us with expertise can more easily help out those with the potential to learn, and those still learning (which includes the most experienced of us!) feel comfortable approaching others with questions and problems before breakage and data loss happen, or at least before the problem escalates.
I realize that you feel personally uncomfortable exercising BDFL authority over our hard-working volunteers, but 31 people (yes, I checked, there are that many of us now... with admin access on everything) without anyone actually being in charge is not a project, it's a mob. There will be some social chafing at first as we transition into a new structure, but in the end the majority of contributors will find they are happier and accomplishing more when they aren't subject to the current chaos. You are the project lead; this has fallen at your feet whether you like it or not. We will lose our best people if the status quo persists, but you have all the tools you need -- including my help if you'll accept it -- to change it before that happens.
Let me know what you decide.
Sincerely,
Susan Stewart