The following is a series of excerpts from a book by Marianne Bellotti on how to revive and maintain a complex system of software that has been in operational use over a period of time, and has accrued a significant amount of technical and operational debt. Each header before the excerpt is my own or a section title from the book.
Preface
"We build our computer systems the way we build our cities: over time, without a plan, on top of ruins." - Ellen Ullman
Restoring legacy systems to operational excellence is ultimately about resuscitating an iterative development process so that the systems are being maintained and evolving as time goes on...there is little downside to maintaining all systems as if they are legacy systems. It is easy to build things, but it is difficult to rethink them once they are in place. Legacy modernizations are not hard because they are technically hard - the problems and solutions are usually well understood - it's the people side of the modernization that is hard. Getting the time and resources to actually implement the change, building an appetite for change to happen and keeping that momentum, managing the intra-organizational communication necessary to move a system that any number of other systems connect to or rely upon - those things are hard.
Sometimes it is difficult to compare your use case to the use case of other seemingly similar organizations. The biggest offender on this front is the commercial cloud, precisely because it adds value to such a broad set of use cases...whether Big Data as a Service saves you any money depends on how long it takes to get that big in the first place. Having petabytes of data collected over a five-year period is a different situation from having petabytes generated over the course of a few hours.
Overall, interfaces and ideas spread through networks of people, not based on merits or success. Exposure to a given configuration creates the perception that it's easier and more intuitive, causing it to be passed down to more generations of technology. The lesson to learn here is the systems that feel familiar to people always provide more value than the systems that have structural elegances but run contrary to expectations.
Large problems are always tackled by breaking them down into smaller problems. Solve enough small problems, and eventually the large problem collapses and can be resolved.
(Emphasis mine) On the other hand, some legacy systems perform their core functions within the parameters the organization needs to be successful, but they are unstable. They are not too slow; they produce the correct result and within the resources the organization has available for the task, but there are frequent "surprises," such as outages with bizarre black-swan style root causes or routine upgrades that sometimes go very poorly. Ongoing development work is stopped because unforeseen technical conflicts pop up and need to be resolved. In 1983, Charles Perrow coined the term normal accidents to describe systems that were so prone to failure, no amount of safety procedures could eliminate accidents entirely. According to Perrow, normal accidents are not the product of bad technology or incompetent staff. Systems that experience normal accidents display two important characteristics...They are tightly coupled...[and] They are complex.
Expectation management is really important. Typically organizations...misjudge how long modernization projects take, and they misjudge how much time they can save and how to save it. Modernization projects have better outcomes...with the following guidelines:
- Keep it simple.
- Spend some time trying to recover context [in the legacy system].
- Tools and automation should supplement human effort, not replace it.
(Emphasis mine) Legacy modernization projects go better when the individuals contributing to them feel comfortable being autonomous and when they can adapt to challenges and surprises as they present themselves because they understand what the priorities are. The more decisions need to go up to a senior group - be that VPs, enterprise architects, or a CEO - the more delays and bottlenecks appear. The more momentum is lost, and people stop believing success is possible. When people stop believing success is possible, they stop bringing their best work. Measurable problems empower team members to make decisions. Everyone has agreed that metric X needs to be better; any actions taken to improve metric X need not be run up the chain of command.
Bellotti discusses how small teams inevitably build monoliths and why monoliths work. She also discusses considerations to keep in mind when a team is debating moving away from a monolith.
- Design is problem setting. Incorporating it into your process will help your teams become more resilient.
- By themselves, technical conversations tend to incentivize people to maintain status by criticizing ideas. Design can help mitigate those effects by giving conversations the structure of a game and a path to winning.
- Legacy modernizations are ultimately transitions and require leaders with high tolerance for ambiguity.
- Conway's law doesn't mean you should design your organization to look like the technology you want. It means you should pay attention to how the organization structure incentivizes people to behave. These forces will determine what the technology looks like.
- Don't design the organization; let the organization design itself by choosing a structure that facilitates the communication teams will need to get the job done.
A discussion of groups formed to enable work to get done versus groups formed to prevent work from getting done.
An emphasis on building simple first and avoiding building for scale before you actually need that scale. Bellotti also has some rules of thumb on the resources required to implement, maintain, and monitor a service.