Shipping fast, and why you should

The short answer

Releasing new features is driven by the business need to meet customer demand and offer a better product than your competitors. When releasing a new product you'll be primarily concerned with:

  • the time taken from finishing development of a feature to getting feedback from customers
  • ensuring that your code doesn’t break when exposed to real, production traffic (whether or not it worked in testing and QA)

If a product takes too long to launch, you run the risk of releasing something that no longer satisfies customer demand, or of being beaten to market by a competitor and appearing to play catch-up.

Conversely, if you release broken software and don’t handle the situation appropriately then you risk damaging your brand and losing market share.

Both of these motivations clearly affect your bottom line, and the current state of your business will determine the weighting you give to each. A modern startup for finding your perfect pet sitter [1] is likely more tolerant of releasing broken software than, say, a cloud provider's hosted database that risks losing all of your data [2].

We've talked to engineers at two market-leading startups for their perspective on what has allowed them to quickly deliver value to their users.

Stripe are an international payments processor known for their well-designed and documented API, which has made them very popular with web developers. Engineering teams are free to choose the technologies best suited to their workload, and over time a common set of provisioning and deployment tools has been developed.

Quizlet are an educational technology company who build innovative study tools and games for any type of content, serving both students and teachers. They are currently the 50th most heavily trafficked website in the US and handle 200,000 transactions per minute on an average day.

Stripe and Quizlet are both companies that capitalise on their ability to alter their offering based on customer demands. This enables them to be responsive and extremely competitive. But what practices help them do this, and what can be learned from it?

A brief aside

Not all software is released equally

The development of software varies as much as its usage. Not all developers have the luxury of changing software once it’s released. Sometimes, you simply have to invest a huge amount of work upfront and make sure the software is as correct as you can prove it to be.

Boxed vs Service

We also spoke to Cosmin Nicolaescu, an ex-Microsoft employee of five years. He highlighted the difference between “boxed products”, those which have a one-to-two year release cycle, and “services”, which can change as quickly as they can be deployed. Boxed products such as Windows and the Office suite of productivity programs have different priorities from services.

The biggest risk is regressions. Boxed products go through a lot more testing since pushing fixes is much harder.

Many use cases for boxed products occur in contexts where the software cannot easily be updated, e.g. where updates are blocked by a corporate firewall for security reasons.

In this context, Microsoft do not seem to strictly adopt the often-recommended agile methodologies, as the risk of a purely agile approach is too high.

For the longer cycles, there are several internal milestones to make sure progress gets made towards a release.

Microsoft have a very strong financial incentive to release the highest-quality product possible. Failure to do so results in poor sales and a seemingly unending stream of criticism [3]. For Microsoft it has been a very successful strategy to follow the waterfall methodology while building the product and to supplement it with an agile maintenance program that releases features to customers who are able to download updates.


As this article is focused on shipping fast, we'll be analysing the service offerings of Quizlet and Stripe, and their approaches to delivering products to end users.


So how do they do it?

Continuous Delivery

Our journey starts at the beginning of the end. Both companies aim to make the process of releasing software "as boring as possible" and automate all commonly repeated steps. This helps prevent avoidable mistakes that could damage brand reputation.

Stripe and Quizlet make use of a Continuous Delivery pipeline that begins once code has been pushed to their centralised Git repositories. The pipelines consist of steps such as testing and artifact generation (for example, CSS generated from Sass sources).
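
To make the shape of such a pipeline concrete, here is a minimal sketch in Python. The stage names and commands are hypothetical, not either company's actual tooling, and, as noted below, a green pipeline stops at a deployable artifact rather than deploying automatically.

```python
# Minimal sketch of a delivery pipeline runner (hypothetical stages;
# not Stripe's or Quizlet's actual tooling). Each stage must succeed
# before the next runs.
import subprocess
import sys

PIPELINE = [
    ("unit tests", ["pytest", "tests/"]),
    ("asset build", ["sass", "styles/main.scss", "build/main.css"]),
]

def run_pipeline() -> bool:
    for name, cmd in PIPELINE:
        print(f"--> {name}: {' '.join(cmd)}")
        if subprocess.run(cmd).returncode != 0:
            print(f"Pipeline failed at stage: {name}")
            return False
    return True

if __name__ == "__main__":
    # A green pipeline only produces a deployable artifact; a human
    # still makes the conscious decision to ship it.
    sys.exit(0 if run_pipeline() else 1)
```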

Neither company uses automated deploys as the final step of their pipeline. The reasoning is that pushing code to customers should be a conscious decision: sometimes mistakes are made, and broken code can still pass every test in the pipeline.

Tavish Armstrong, an engineer on the Financial Operations team at Stripe, talked to us about their internal testing processes. Stripe release code quickly, "multiple times per hour", and at that pace of release they can't depend on manual checking, so they rely on "automated tests" that run without the need for human intervention.

We don't release code that is hard to test in an automated fashion. Code is often refactored before releasing so that it is possible to add tests to more critical sections of the code.

Larger features are often "cut into many smaller releases" to aid testability and present a smaller surface area for potential mistakes.

Stripe have also invested significant engineering effort into making the testing stage run as quickly as possible [4], because it dominates the pipeline's run time. Reducing the delay between committing code and releasing it allows smaller changes and features to be delivered to customers more quickly.


Move fast and don't break things

Deploying

Once the tests pass, you should have high confidence that your code is ready to be released to the world. Some companies choose to push code to all production servers at the same time once it's ready. This isn't the case at Stripe.

The new code will be deployed to a single physical machine and the error rate for that machine will be monitored to detect problems in the new code while limiting the possible impact of a problem to a smaller set of customers.

Stripe have adopted such canary deploys [5], where code is rolled out to an increasingly large proportion of servers in order to gain confidence that the new code performs correctly.
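
The general shape of such a rollout might look like the sketch below, where `deploy_to` and `error_rate` are assumed helpers standing in for real deployment and monitoring infrastructure, not Stripe tooling.

```python
# Sketch of a canary rollout loop (hypothetical; `deploy_to` and
# `error_rate` are assumed helpers, not real Stripe tooling).
import time

ROLLOUT_STAGES = [0.01, 0.10, 0.50, 1.00]   # fraction of the fleet
ERROR_THRESHOLD = 0.02                      # maximum acceptable error rate
SOAK_SECONDS = 300                          # observation window per stage

def canary_rollout(version, servers, deploy_to, error_rate):
    for fraction in ROLLOUT_STAGES:
        batch = servers[: max(1, int(len(servers) * fraction))]
        deploy_to(batch, version)           # push new code to this slice
        time.sleep(SOAK_SECONDS)            # let real traffic hit it
        if error_rate(batch) > ERROR_THRESHOLD:
            raise RuntimeError(f"Canary failed at {fraction:.0%}; halting rollout")
    print(f"{version} rolled out to all {len(servers)} servers")
```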

Monitoring

If you write perfect code, then there is no danger.

Just kidding.

No sane engineer has 100% faith that a codebase's test suites will catch every bug. So with the inevitability of things going wrong, how do you build recovery into your software development cycle?

Both Quizlet and Stripe have eyes on the health of their systems by making heavy use of monitoring tools.

Once code is deployed, we have a variety of checks across the entire stack to make sure that the code and the underlying infrastructure is healthy. These range from application-level monitoring like New Relic to system health monitoring like Nagios. The monitoring is hooked up to PagerDuty, which brings issues to the attention of the engineers.
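
As a rough illustration of the kind of check such a system performs, here is a sketch with generic stand-ins; the companies use tools like New Relic, Nagios, and PagerDuty rather than anything hand-rolled like this.

```python
# Sketch of an application-level health check wired to an alerting hook.
# `page_oncall` is a stand-in for a real integration such as PagerDuty.
def error_rate_ok(requests: int, errors: int, threshold: float = 0.01) -> bool:
    """Return True if the recent error rate is within the allowed threshold."""
    if requests == 0:
        return True
    return (errors / requests) <= threshold

def page_oncall(message: str) -> None:
    # In production this would call an alerting service; printing keeps
    # the sketch self-contained.
    print(f"PAGE: {message}")

if not error_rate_ok(requests=5000, errors=120):
    page_oncall("Error rate above 1% following latest deploy")
```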

It is common to both Quizlet's and Stripe's engineering cultures that developers own their deployments. There is no handoff to a product manager to deploy, as may be the case at other companies. An engineer is responsible for taking the code they have developed and ensuring it is released into the wild safely. If the monitoring tools catch a post-deploy anomaly or a spike in errors, the first line of defence is the engineer that pushed the deploy, who is also the engineer that wrote the diff, who is also the engineer best equipped to investigate the failure. This clear assignment of responsibility is crucial to being able to release as quickly as these companies do.

At Stripe, it's also a common strategy to phase in the deployment of important features, as with the canary deploys described above.

Once engineers have determined that the single-machine deploy looks good, they can roll the diff out to all instances or, if further caution is required, to just some instances and check again.

With such a large user base, and peak usage occurring during school hours, Quizlet's user feedback centre sees high traffic both in and out of office hours, and it serves as another monitoring signal.

On the product side, we also have heuristics that make sure that we didn't do anything stupid. For example, we monitor the rate of user feedback we're getting. If the rate goes up, it could mean that we introduced a change that significantly impacted the user experience. If the rate goes down, it could mean that the change prevented users from writing in about issues, which is just as bad.
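
A minimal sketch of that heuristic, with assumed numbers rather than Quizlet's real thresholds:

```python
# Flag when the rate of incoming user feedback deviates sharply from
# its baseline in EITHER direction (thresholds are assumed).
def feedback_rate_anomalous(current_per_hour: float, baseline_per_hour: float,
                            tolerance: float = 0.5) -> bool:
    """True if the current rate is >50% above or below the baseline."""
    lower = baseline_per_hour * (1 - tolerance)
    upper = baseline_per_hour * (1 + tolerance)
    return not (lower <= current_per_hour <= upper)

# A spike may mean a change hurt the user experience; a drop may mean
# users can no longer write in at all. Both are worth investigating.
assert feedback_rate_anomalous(250, 100)        # spike
assert feedback_rate_anomalous(20, 100)         # drop
assert not feedback_rate_anomalous(110, 100)    # normal variation
```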

Rollbacks and Feature Flags

No one enjoys having broken code in production. Rollbacks allow engineers to revert a broken change swiftly, so that users are once again on a safe version of the application. With version control, this is quite simple. At Stripe and Quizlet, rolling back is handled by a deploy script that takes a Git commit hash or build number and deploys that version of the application. It also takes care of post-processing tasks, such as safely reverting database schema migrations.
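
A stripped-down sketch of such a rollback entry point follows; the deploy helper is hypothetical, and a real script would also handle the post-processing tasks mentioned above.

```python
# Sketch of a rollback script that redeploys a known-good Git commit.
import subprocess
import sys

def deploy_current_checkout() -> None:
    # Stand-in for the real deployment step.
    print("Deploying currently checked-out version...")

def rollback(commit_hash: str) -> None:
    # Verify the commit exists before attempting to deploy it.
    subprocess.run(["git", "rev-parse", "--verify", commit_hash], check=True)
    subprocess.run(["git", "checkout", commit_hash], check=True)
    deploy_current_checkout()

if __name__ == "__main__":
    rollback(sys.argv[1])   # e.g. python rollback.py 3f2a91c
```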

An approach being considered by Quizlet is tying automated rollbacks to the monitoring system.

We'd like to have automatic rollbacks if a deploy causes the error rate to elevate significantly. Currently, we rely on engineers to monitor this. While not automatic, it's a forcing function to make sure that we're mindful of what's being deployed.

This errs on the side of caution, but even so, as Jeff says, it is a useful practice for engineers to be aware of how their changes affect the system. For the same reason that Quizlet and Stripe choose to keep a final human step in the deploy process, an engineer who has more contextual understanding of the system's health should decide whether a rollback is necessary.

Even faster disaster recovery can be implemented with feature flags [6]. These are interfaces to a centralised data store and are often used in conditionals to change the behaviour of code at runtime. The feature-flag server provides a dashboard of flags and their values, plus the ability to set a value based on different parameters: for example, return true for 10% of users, or return true only for users with specific IDs. This flexibility supports a variety of common workloads, such as A/B testing or rolling out a feature by increasing the percentage of users who reach a new code path. If the code must be quickly rolled back, the percentage of users for whom the flag returns true can be set to 0%, instantly avoiding the broken code path.
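
A minimal sketch of a flag check along these lines, using an in-memory dictionary as a stand-in for the centralised data store (the flag name and user IDs are invented):

```python
# Feature-flag check supporting a percentage rollout plus an explicit
# whitelist. In practice the flag values live in a centralised store
# editable from a dashboard; a dict stands in for it here.
import hashlib

FLAGS = {
    "new_checkout": {"percentage": 10, "allowed_ids": {"u_42", "u_99"}},
}

def flag_enabled(flag_name: str, user_id: str) -> bool:
    flag = FLAGS.get(flag_name)
    if flag is None:
        return False
    if user_id in flag["allowed_ids"]:      # explicitly whitelisted users
        return True
    # Stable hash so each user consistently falls in or out of the bucket.
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return bucket < flag["percentage"]

# Rolling back is instant: set the percentage to 0 and clear the
# whitelist, and the broken code path is no longer reachable.
if flag_enabled("new_checkout", "u_1234"):
    print("new code path")
else:
    print("existing code path")
```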

Validation

Of course, one of the main motivators for delivering changes continuously is to validate the deployed feature quickly, which helps the company move faster and more nimbly.

At both companies, this is done both qualitatively and quantitatively.

Feedback and Responsiveness

Keeping close to users on a human level is core to the success of the Quizlet product, and shipping constantly allows the team to be extremely responsive to real feedback.

The high-volume user feedback centre usually contains plenty of commentary on new and old features. The dedicated user-support team tags and responds to these as they arrive, and works closely with product managers and engineers alike to ensure the people involved in a feature or product understand user sentiment towards it. This has earned the respect of much of Quizlet's user base, in particular teachers, with whom Quizlet maintain a strong relationship.

It's a similar story at Stripe. When asked the reason for Stripe's process of frequent releases, Tavish responds, simply:

Because we can more easily react to customer feedback and therefore more quickly launch new products.

Data-driven product decisions

If I had asked people what they wanted, they would have said faster horses. -- (supposedly, but very unlikely) Henry Ford

Whether Ford said it or not, the idea is an interesting one: do users know what's best for them? A sensible approach seems to be a balanced one: take into account what users say qualitatively and weigh it against data on how they actually behave.

Stripe and Quizlet utilise analytics software, both home-rolled and third-party integrations, to collect key quantitative insights into user behaviour. How does Quizlet use this?

In practice, this validation is done with Quizlet's in-house A/B testing software. The product team and engineers identify key performance indicators (KPIs) associated with a change and show the change to a select group of users: the experimental group. The remaining users form the control group, and measuring the KPIs for both allows the team to statistically compare how much of a positive or negative impact the change has made to the product.
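
A small sketch of the statistics involved, using a two-proportion z-test on a conversion-style KPI; the numbers are illustrative, and Quizlet's in-house system is not described in this level of detail.

```python
# Compare a conversion-style KPI between control and experimental groups.
import math

def two_proportion_z(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """z-statistic for the difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

# Control: 480 of 10,000 users converted; experiment: 560 of 10,000.
z = two_proportion_z(480, 10_000, 560, 10_000)
print(f"z = {z:.2f}")   # |z| > 1.96 is significant at the 5% level
```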

This, of course, relies on nuanced decision-making about which metrics are most relevant to the business goals and vision, as well as to customer satisfaction and product quality. Which of these to prioritise is often hotly contested and has no immediately correct answer.

It's important to look at the data, but it's also important to step back and think about creative solutions to the problems we're challenging. Data doesn't tell the whole story.

Over-reliance on data to shape a product can be damaging. Anecdotally, from conversations with engineers at Facebook, it is clear that the decision-making strategy has been to ruthlessly prioritise engagement. It seems to the authors that this often comes at the cost of user satisfaction and product quality. Though the metrics may have improved (e.g. increases in videos watched) by introducing auto-play or bait-and-switch UI interactions [7], we believe such choices will prove detrimental in the long run. By deprioritising the holistic happiness of the user in order to hit short-term business metrics, companies risk customers increasingly falling out of love with the product.


Software design in the frequent release mentality

After having invested so much time into building effective deployment and monitoring tools, how has this changed Stripe and Quizlet’s approach to the design of new features?

Speaking again to Tavish Armstrong of Stripe, we learned how the tooling that's been built feeds back into the design process. The key tool he discusses is feature flags.

An example of their use at Stripe is private betas for new countries. Selected users are invited to process payments in a new country, and if they accept, the feature flag is whitelisted for them. The understanding is that the experience won't be as polished as in existing countries, but that they will be able to process payments and collect revenue. Stripe can then test the full flow of a design in production with real traffic and mitigate the damage caused by mistakes made during integration development. This provides a very rapid feedback cycle and allows newer countries to reach feature parity with existing integrations very quickly.

At Quizlet, the same iterative philosophy is applied to the design process.

We need to make the tradeoff between agility and long-term stability. In practice, this means making a clear hypothesis on a new feature or product that we're releasing, and validating that hypothesis once shipped. If validated, then we can work on the next iteration which will both improve the user experience as well as improve the long-term stability. After a few iterations, we should get to a good point. The learnings from this iterative process inform future product decisions as well.

Recently, the infrastructure engineering team at Quizlet spoke about their move to a new cloud platform provider [8]. An initial series of experiments was performed to validate high-level concerns such as performance, support, and billing. Once these were established, the team quickly began to test more involved features of the cloud platforms and built a rapid understanding of the technology's tradeoffs. Architectural decisions such as these are always considered in terms of how they help the company deliver new features to customers: moving to a cheaper provider and reducing expenditure would allow the company to hire new developers and increase its development bandwidth.

The use of primitives such as feature flags and experimental hypotheses allows an incremental approach in which a new design is divided into many smaller changes, narrowing the scope for failure and making it possible to converge rapidly on meeting customer demands.


In the final analysis...

In the lifecycle of both of these companies - old enough to have found product-market fit and become market leaders in their respective verticals - staying responsive and competitive is crucial to success. This comes from being able to validate and build on successful ideas faster than your competitors. To the engineers at Stripe and Quizlet, the idea of attempting this with a weekly release cycle is unthinkable.

Both companies have invested significantly in automation, continuous delivery, and monitoring. Both adopt the disciplined engineering habits of owning your code, designing iteratively, and validating decisions with a balance of human feedback and data. Combined, this is a powerful set of practices for delivering well-validated products rapidly to the end user.


References

[1] https://dogvacay.com/
[2] https://aws.amazon.com/rds/
[3] https://en.wikipedia.org/wiki/Criticism_of_Windows_Vista
[4] https://stripe.com/blog/distributed-ruby-testing
[5] http://martinfowler.com/bliki/CanaryRelease.html
[6] http://martinfowler.com/bliki/FeatureToggle.html
[7] https://medium.com/let-s-talk-design/dark-ux-patterns-are-bad-for-your-business-6f774ec6001b#.uxsjo0wdr
[8] https://quizlet.com/blog/whats-the-best-cloud-probably-gcp
