This is a quick primer on feature toggles. We'll go into:
- What is a feature toggle?
- Why use them?
- How does it work?
- How do I manage them?
- What are the risks?
- Feature toggle - Put simply: A boolean that you can turn on or off.
- Feature flag - Another word for "feature toggle"
- Feature switch - Another word for "feature toggle"
- A/B testing - This is a "specialized" type of feature toggle. A basic feature toggle is on/off, while A/B testing could be on/off/vA/vB/vC
- Old toggle - The toggle has been alive for a long time (e.g. 6 months)
- Stale toggle - The toggle has not been accessed for a long time, or perhaps was never accessed
Simply put, a feature toggle is a boolean value stored somewhere (database, HTTP API, etc.) that determines whether a feature is used or not.
The big advantage of feature toggles is that you can constantly push your code into production safely. With feature toggles, you don't need to worry (as much) about breaking production because your code is safely behind the toggle.
- OFF by default
- Can be turned on for single users to verify in prod
- Can be rolled back instantly if an issue is found
- Can be scheduled if it is maintenance-related
- Enables code to be deployed at any time when safely behind a toggle
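The properties above can be sketched in a few lines of Ruby. This is a minimal in-memory illustration rather than a real toggle service: the `ToggleStore` class and its method names are hypothetical, and a production store would sit in a database or behind an HTTP API.

```ruby
require "set"

# Hypothetical in-memory toggle store, for illustration only.
class ToggleStore
  def initialize
    @enabled_for_all = Set.new # toggles switched ON for everyone
    @enabled_users = Hash.new { |hash, name| hash[name] = Set.new }
  end

  # Turn a toggle ON for everyone.
  def enable(name)
    @enabled_for_all.add(name)
  end

  # Turn a toggle ON for a single user, e.g. to verify in prod.
  def enable_for_user(name, user_id)
    @enabled_users[name].add(user_id)
  end

  # Instant rollback: the toggle is OFF again for everyone.
  def disable(name)
    @enabled_for_all.delete(name)
    @enabled_users.delete(name)
  end

  # Unknown toggles are OFF by default.
  def enabled?(name, user_id = nil)
    return true if @enabled_for_all.include?(name)

    @enabled_users.key?(name) && @enabled_users[name].include?(user_id)
  end
end

store = ToggleStore.new
store.enabled?("users_v2", 42)        # => false (OFF by default)
store.enable_for_user("users_v2", 42)
store.enabled?("users_v2", 42)        # => true
store.enabled?("users_v2", 7)         # => false
store.disable("users_v2")             # instant rollback
store.enabled?("users_v2", 42)        # => false
```

Because the check is a lookup at request time, flipping the toggle changes behavior without a deploy.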
To illustrate why this approach is so convenient, we'll compare it to some other strategies.
Let's look at how you might have done "versioning" the old way in an API vs. how you might do it with toggles
# This style is dependent on re-deploying with updated configs,
# or using some tooling to dynamically modify the configs
# GET /users
def get_users
  if config.use_v1
    old_client.get_users()
  else
    new_client.get_users()
  end
end

# This style keeps your old version "safe",
# while adding effort for clients to migrate
# GET /users?v=1
# GET /v1/users
def get_users_v1
  old_client.get_users()
end

# GET /users?v=2
# GET /v2/users
def get_users_v2
  new_client.get_users()
end
For the sake of argument, let's say we want to be "safe" and expose a new version, but we don't want it available to the public until we think it is stable.
# GET /users?v=1
# GET /v1/users
def get_users_v1
  old_client.get_users()
end

# GET /users?v=2
# GET /v2/users
def get_users_v2
  if check_toggle.is_v2_enabled?
    new_client.get_users()
  else
    return_404
  end
end
In real life, it's usually even more complicated than this. You may end up implementing a new API version while you also upgrade authorization, so maybe you have something like this:
# GET /users?v=1
# GET /v1/users
def get_users_v1
  # Only enforce the new auth check while its toggle is ON
  if check_toggle.is_auth_enabled?
    do_auth_check
  end
  old_client.get_users()
end

# GET /users?v=2
# GET /v2/users
def get_users_v2
  if check_toggle.is_auth_enabled?
    do_auth_check
  end
  # v2 stays hidden (404) until its toggle is turned ON
  if check_toggle.is_v2_enabled?
    new_client.get_users()
  else
    return_404
  end
end
With the front end, it's much the same. Your options were historically to either hit an API to check a flag or to look at a config. One thing to keep in mind with the front end is that there is a little more risk of something getting accidentally cached.
With the front end, there are also a number of special considerations depending on which framework or libraries you are using. For the sake of simplicity, I'll illustrate this with vanilla JavaScript.
// Using config-style
if (global_config.homeV1) {
  renderV1();
} else {
  renderV2();
}

// Using an API
if (toggleClient.isEnabled("homeV1")) {
  renderV1();
} else {
  renderV2();
}
With toggles, the concept is exactly the same for hitting an API.
// Using an API
if (toggleClient.isEnabled("homeV1")) {
  renderV1();
} else {
  renderV2();
}
For those concerned with particulars, toggleClient would do something like this under the hood:
fetch('http://toggles.com/homeV1')
  .then(response => response.json())
  .then(data => data.enabled);
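For the Ruby back-end examples earlier, a `check_toggle`-style client could do the same lookup. The sketch below is an assumption-laden illustration: the `HttpToggleClient` class is hypothetical, and the URL layout and `{"enabled": true}` response shape are simply carried over from the fetch call above. The HTTP call is injectable so the example runs without a network.

```ruby
require "json"
require "net/http"
require "uri"

# Hypothetical Ruby counterpart of toggleClient. The endpoint layout and
# JSON shape are assumptions, not a real service's API.
class HttpToggleClient
  # http_get takes a URI and returns the response body as a String.
  # By default it performs a real GET with Net::HTTP.
  def initialize(base_url, http_get: ->(uri) { Net::HTTP.get(uri) })
    @base_url = base_url
    @http_get = http_get
  end

  def enabled?(name)
    body = @http_get.call(URI("#{@base_url}/#{name}"))
    JSON.parse(body)["enabled"] == true
  end
end

# Usage with a stubbed HTTP call, so no request leaves the machine.
stub = ->(_uri) { '{"enabled": true}' }
client = HttpToggleClient.new("http://toggles.com", http_get: stub)
client.enabled?("homeV1") # => true
```

Note that this sketch raises if the request fails or the body is not valid JSON, so callers need to decide what the fallback should be.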
Arguably, managing toggles is the hardest part of using them.
There are a few things to keep in mind:
- Toggles should have a short lifetime if possible
- Your dashboard for toggles should flag old toggles and stale toggles
- You should have monitoring to detect if a toggle is being accessed
- You should prevent deletion of active toggles
- You should "soft delete" toggles (optionally doing permanent deletion after a period of time)
- Version control the management of the toggles (creation & deletion)
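Several of these points can be sketched together: recording when a toggle is read, classifying it as old or stale, and refusing to delete it while it is still in use. Everything here is hypothetical, including the `ToggleRegistry` class, the 180-day "old" threshold, and the 7-day "stale" threshold; pick values that suit your own release cadence.

```ruby
# Hypothetical toggle record and registry; names and thresholds are assumptions.
Toggle = Struct.new(:name, :enabled, :created_at, :last_accessed_at, :deleted_at)

class ToggleRegistry
  DAY = 24 * 60 * 60
  OLD_AFTER = 180 * DAY  # "old": alive for roughly 6 months
  STALE_AFTER = 7 * DAY  # "stale": not read for a week

  class ActiveToggleError < StandardError; end

  def initialize
    @toggles = {}
  end

  def create(name, now: Time.now)
    @toggles[name] = Toggle.new(name, false, now, nil, nil)
  end

  # Every read records the access time so a dashboard can spot stale toggles.
  def enabled?(name, now: Time.now)
    toggle = @toggles[name]
    return false if toggle.nil? || toggle.deleted_at

    toggle.last_accessed_at = now
    toggle.enabled
  end

  def old?(name, now: Time.now)
    now - @toggles.fetch(name).created_at > OLD_AFTER
  end

  # Stale means never accessed, or not accessed within STALE_AFTER.
  def stale?(name, now: Time.now)
    last = @toggles.fetch(name).last_accessed_at
    last.nil? || now - last > STALE_AFTER
  end

  # Soft delete: refuse while the toggle is still being read, and keep the
  # record so it can be restored later.
  def soft_delete(name, now: Time.now)
    raise ActiveToggleError, "#{name} is still in use" unless stale?(name, now: now)

    @toggles.fetch(name).deleted_at = now
  end

  def restore(name)
    @toggles.fetch(name).deleted_at = nil
  end
end
```

A permanent delete could then be a separate job that removes records whose `deleted_at` is older than some grace period.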
In my experience, here are some of the common risks with toggles:
- Deleting live toggles: This is the biggest one. You must implement a way to prevent deleting live toggles, or it will eventually happen. In my experience, if a toggle hasn't been accessed for a week, it is safe to delete.
- Code/Toggle mismatch: There are a couple of ways this can happen:
  - You rolled some code back to a former version which happens to depend on a toggle that has been deleted. This is bad, because it's likely the toggle should be ON but it will instead be OFF.
  - You deployed some code but didn't create the toggle. This is usually OK because you will just make the toggle and turn it on.
- Bad default behavior: Similar to the previous problem, if you don't define your "default" behavior for when a toggle is not found or the API is unavailable, you will have a bad time.
- Old toggles: If you keep a toggle in the codebase after it has been launched and you have no intent of rolling it back, you're just creating technical debt.
- Stale toggles: Stale toggles will clutter up your dashboard if you don't get rid of them, but otherwise pose minimal "real" risk
- No monitoring: If you don't monitor your apps in some way, you might have subtle issues if toggles suddenly stop working. Old features might suddenly come back, or new features suddenly disappear. It's important to monitor the toggle API and clients using it so you know that everything is behaving normally. Monitoring is a good idea no matter what.
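The "bad default behavior" and "no monitoring" risks can be reduced with a thin wrapper that defines a fallback per toggle and logs every failed lookup. This is a sketch under assumptions: `SafeToggleClient` and `UnreachableClient` are hypothetical names, and the wrapped client is assumed to respond to `enabled?(name)` and to raise when the toggle is missing or the toggle API is unavailable.

```ruby
# Hypothetical wrapper that gives any toggle client a defined fallback.
class SafeToggleClient
  # client:   assumed to respond to enabled?(name) and raise on failure
  # defaults: per-toggle fallback values; anything unlisted falls back to OFF
  # logger:   optional object responding to warn(message), e.g. a Logger
  def initialize(client, defaults: {}, logger: nil)
    @client = client
    @defaults = defaults
    @logger = logger
  end

  def enabled?(name)
    @client.enabled?(name)
  rescue StandardError => e
    # Surface the failure so monitoring can alert on it.
    @logger&.warn("toggle #{name} lookup failed: #{e.message}")
    @defaults.fetch(name, false)
  end
end

# Stand-in for a client whose toggle API is down.
class UnreachableClient
  def enabled?(_name)
    raise IOError, "toggle API unavailable"
  end
end

safe = SafeToggleClient.new(UnreachableClient.new, defaults: { "users_v2" => true })
safe.enabled?("users_v2")   # => true  (launched feature stays ON during an outage)
safe.enabled?("new_search") # => false (unlisted toggles fall back to OFF)
```

Listing an explicit default for toggles that are already launched keeps an outage from silently bringing old behavior back.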
While toggles come with risks, it is important to keep in mind the risks of a system without toggles:
- Slower delivery: Without toggles, you can't push code into production until you are confident it won't break anything, so changes ship less often.
- Maintaining and enhancing code is harder: You will be missing a powerful mechanism to evolve your code, so you might find yourself doing a lot more versioning, or doing "big cutovers" that carry a lot of code and risk.
- Hard-coding: You may end up with "hard-coded" behavior that requires re-deploys or other maintenance. It's generally easier to go to your feature toggle dashboard to see the state of things than to SSH onto machines to check their configuration.
Various paid vendors and open-source options:
- https://www.optimizely.com (free tier)
- https://www.split.io/ (free tier)
- https://rollout.io/
- https://launchdarkly.com/
- https://github.com/Unleash/unleash
- https://cloud.spring.io/spring-cloud-config/reference/html/ (Spring Cloud config isn't specifically targeted as a feature flag framework, but it can be leveraged as one quite easily, especially if you already use it for configuration)
- http://featureflags.io/resources/
- etc.
- Optimizely toggle primer - This is another good little primer on feature toggles with some optimizely specifics
- Build or buy? - Explanation of why to use the API vs. rolling your own