Building a feature flag system
There are more blog posts on the subject of AB testing than anyone could possibly read -- why it is important, why you are doing it wrong, and this one crazy trick to double your conversions in a matter of minutes. At this point you are probably convinced that AB testing is something you should do; the real question is how best to go about building such a system. (This ignores the larger and more pertinent question of whether you should be building something yourself at all.)
So what's actually involved in building an AB testing system? Well, there are several different components that you will want:
1. an easy way to configure tests.
2. an API for determining whether a test is on/off for a given request/user/etc.
3. a logging framework for measuring the impact of a given test
4. an ETL to turn those logs into something usable (like sessions or users)
5. an analysis toolchain (which can be as simple as an R script or as complex as you can possibly imagine)
Given that this is my first blog post, we're going to focus on steps 1 and 2.
Flagging experiments between various states of on and off is essentially a feature flagging system, which is a common practice in software, especially in a continuous deployment environment. If you haven't heard of it, then you have probably been living under a rock, but for the troglodytes in the crowd I recommend reading [this] or [this]. In short, feature flags give you a canonical source that maps feature names to a status, which can be on, off, or somewhere in between. You can then wrap various parts of your code in these feature flags, allowing you to easily flag different parts of your application on or off. As you can imagine, this is useful for rolling out a new feature as well as for flagging off parts of your application in the event of a partial degradation.
Although this post deals with the details of a feature flag system specifically designed for running ab tests, much of the following will apply to all feature flag frameworks.
These features are best defined in a configuration of some sort. This config allows us to easily switch a feature from off to on and vice versa, as well as define our variants and how much traffic should be routed to each.
// our feature is enabled
{
  "name": "flaky_external_service",
  "enabled": true
}
// our feature is disabled
{
  "name": "flaky_external_service",
  "enabled": false
}
// our feature is enabled at 50%
{
  "name": "flaky_external_service",
  "enabled": true,
  "percentage": 0.50
}
And in your code, the calling API looks something like this...
if isEnabled("flaky_external_service") {
// do something flaky
} else {
// do something stable
}
You might be tempted to put something like this in a database, so you can turn features on and off and back again without suffering through the misery of a deploy. While there are some advantages to this approach, I would strongly advise against such foolishness.
The two main arguments for storing features in a db are expediency in changing features and easy access for non-engineers. First off, changing a feature is generally far scarier than a deploy -- when developing in a feature flag environment, the basic pattern is to wrap new code in what essentially evaluates to an "if false", so the core of that code is never executed until the flag is flipped. Flipping the proverbial switch has more impact than actually deploying your code, and you should optimize for that.
But more importantly, if done properly, the feature flagging system is going to sit dead center in your critical path, which means you want it to rely on as few external processes as possible. Every dependency here is a major liability, and a database is not something you want to worry about. The vast majority of the time there will be no problems with this coupling, but we are not optimizing for the vast majority; we are optimizing for overall stability, especially in the event of a disaster. It's not hard to imagine a runaway feature or experiment that accidentally pegs your db, and when this happens -- and it will happen -- how are you going to flag off your troubles if you can't access your feature system? Instead, keep your configuration in something simple like a JSON file, or put it in actual code.
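As a rough sketch of what that can look like -- the file name, types, and helper names here are mine, not from any particular framework -- the entire flag store can be a file that is read once at startup:
// A minimal sketch of a file-backed flag store (illustrative only). The
// config ships with the code, so flipping a flag is just another deploy and
// there is no database in the critical path.
import * as fs from "fs";

interface FeatureConfig {
  name: string;
  enabled: boolean;
  percentage?: number;                 // optional partial rollout
  variants?: Record<string, number>;   // variant name -> share of traffic
}

// Load once at startup; the process never touches an external store again.
const features: Map<string, FeatureConfig> = new Map(
  (JSON.parse(fs.readFileSync("features.json", "utf8")) as FeatureConfig[])
    .map(f => [f.name, f] as [string, FeatureConfig])
);

function getFeature(name: string): FeatureConfig | undefined {
  return features.get(name);
}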
In the basic feature flag design there are exactly two states, on and off, but experiments are not limited to that. Multi-variant tests can have a variety of different states, each representing a different branch of an experiment. Because of this, our configuration needs to support multiple states, with a special default state representing off, or in experiment parlance, the control group. Generally, I like the control group to be implicitly defined as the difference between 1 and the sum of the variants. Not only does this ensure that our percentages always sum to 1, it also means we don't have to repeat ourselves every time we define a control group.
// our control group is at 40%
{
  "name": "button_color",
  "enabled": true,
  "variants": {"blue": 0.3, "red": 0.3}
}
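To make the implicit control group concrete, here is a small sketch (the function name is mine) of how the control share falls out of the variant percentages:
// Sketch: the control share is whatever the variants leave over. A config
// whose variants sum to more than 1 is rejected outright.
function controlPercentage(variants: Record<string, number>): number {
  const allocated = Object.values(variants).reduce((sum, p) => sum + p, 0);
  if (allocated > 1) {
    throw new Error(`variants sum to ${allocated}, which exceeds 1`);
  }
  return 1 - allocated;
}

// For the button_color config above: 1 - (0.3 + 0.3) = 0.4
controlPercentage({ blue: 0.3, red: 0.3 }); // => 0.4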
When there are multiple variants defined, the calling API needs to distinguish whether the experiment is enabled and, if it is, which variant was selected.
if isEnabled("button_color") {
  // getVariant returns the name of the selected variant, e.g. "blue" or "red"
  $color = getVariant("button_color")
}
If we run our experiment and quickly realize that there are errors, we can easily turn it off by simply swapping enabled to false, without changing any of our code.
The premise is simple, but there are many things that need to be considered when designing such a system.
First off, we need to consider the mechanism for determining which variant a user sees when a feature is somewhere between completely on and completely off. Math.random isn't going to cut it here, because we want our bucketing to be deterministic in order to ensure a consistent user experience. Imagine if you went to a website and the color of a button changed on every page refresh. Generally, the strategy is to use a pseudo-random number generator, where the pseudo-randomness comes from hashing a composite key of the feature name plus a unique identifier that is consistent. Depending on your use cases, this may be a user-id, a cookie-based browser-id, or even some sort of merchant-id, but consistency is important! Switching between a user-id and a browser-id across the logged-in boundary can result in a confusing experience, and it also muddies your logging, since the same person shows up under two different identities. Generally, I suggest incorporating this decision into your config.
{
  "name": "listings_per_search_page",
  "enabled": true,
  "variants": {"40": 0.10, "60": 0.30, "80": 0.60},
  "bucketingId": "user"
}
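To make the bucketing concrete, here is a sketch of one way to do the deterministic assignment; the hash choice and function names are illustrative, not prescriptive:
import { createHash } from "crypto";

// Sketch of deterministic bucketing: hash (feature name + stable id) to a
// number in [0, 1), then walk the variant ranges. The same user always lands
// in the same bucket for a given feature, but buckets are uncorrelated
// across features because the feature name is part of the hash key.
function bucket(featureName: string, bucketingId: string): number {
  const digest = createHash("sha1")
    .update(`${featureName}:${bucketingId}`)
    .digest();
  // Use the first 4 bytes as an unsigned int and scale to [0, 1).
  return digest.readUInt32BE(0) / 0x100000000;
}

// Returns the chosen variant name, or null for the implicit control group.
function chooseVariant(
  featureName: string,
  bucketingId: string,
  variants: Record<string, number>
): string | null {
  const roll = bucket(featureName, bucketingId);
  let cumulative = 0;
  for (const [variant, share] of Object.entries(variants)) {
    cumulative += share;
    if (roll < cumulative) return variant;
  }
  return null; // control
}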
But simply choosing variants based on pseudo-random numbers isn't going to take us very far, because it doesn't give us a good way to guarantee access to specific code-paths -- not only for testing and development, but also for granting specific users access to an experiment. We need some way of forcing an experiment to a specific variant. However, forcing variants introduces new complexity in our experimental design. The most basic assumption we make in our experiments is that each variant is composed of independent and identically distributed random variables, commonly shortened to iid. This means that the samples for each variant have to be drawn from the same demographics, yet when we force a particular group of traffic into a variant, we potentially violate that principle.
For example, it may be worthwhile to force all the employees at your company into a particular variant, yet depending on the size of your company, this could introduce a major bias to your experiment, because your variants are no longer representative of the true underlying population. This is often a problem with a prototype group, where power-users are allowed access to experimental features and are automatically included in an experiment. Although subtle, this selective sampling is quite harmful because the results you discover in your experiment will not reflect the reality you experience when you launch. To combat this problem, we need to log the reason a variant was chosen and make sure we remove all non-pseudo-random selections from our final analysis.
Some common selectors may include URL parameters, cookies, user membership in a specific group such as a prototype group, whitelists or blacklists, environment (test/dev/etc), and even a local development config file. For the most part, these selectors should be managed in the config, allowing a small set of external inputs to override the default result when certain circumstances are met and the experiment is not completely disabled. You can imagine the selector chain as an ordered list of partial functions, with the last selector defaulting to the pseudo-random one, as sketched below.
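Here is one way that chain might look; the specific selectors, names, and the hard-coded variant table are purely illustrative:
// Sketch of a selector chain. Each selector either returns a variant plus
// the reason it fired -- which we will want for logging -- or null to pass
// the decision along. The hash-based selector at the end always answers, so
// it acts as the default.
interface RequestContext {
  params: Record<string, string>;   // query parameters
  userId: string;
}

type Selection = { variant: string | null; selector: string };
type Selector = (feature: string, ctx: RequestContext) => Selection | null;

const EMPLOYEES = new Set(["alice", "bob"]);   // illustrative whitelist

const urlOverride: Selector = (feature, ctx) =>
  ctx.params[`force.${feature}`] !== undefined
    ? { variant: ctx.params[`force.${feature}`], selector: "url_param" }
    : null;

const employeeWhitelist: Selector = (feature, ctx) =>
  EMPLOYEES.has(ctx.userId)
    ? { variant: "blue", selector: "whitelist" }
    : null;

// chooseVariant is the hash-based picker from the earlier sketch; the
// variant table would normally come from the feature's config.
const pseudoRandom: Selector = (feature, ctx) => ({
  variant: chooseVariant(feature, ctx.userId, { blue: 0.3, red: 0.3 }),
  selector: "hash",
});

const selectors: Selector[] = [urlOverride, employeeWhitelist, pseudoRandom];

function select(feature: string, ctx: RequestContext): Selection {
  for (const selector of selectors) {
    const result = selector(feature, ctx);
    if (result !== null) return result;
  }
  return { variant: null, selector: "control" };   // last selector always answers
}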
One sign of good software is quality instrumentation, and it's an especially critical component of an AB testing framework. Without adequate logging, we have no way to determine the downstream effects of our experiments.
It seems obvious, but we should only log the tests a user was actually exposed to, not every test and variant the user would have been assigned to. This is an important distinction because it minimizes the unexplained variance present in our final analysis. Practically, when we minimize this statistical noise, we reduce the amount of data needed to run an experiment and thus get results faster. Imagine we were comparing the average roll of two dice, one with ten sides and the other with five, to determine whether they are different. We would expect the average and the variance of their rolls to be noticeably different after a limited number of rolls -- that is roughly what a clean experiment looks like. Logging every variant, regardless of whether a user was exposed, is like rolling an extra twenty-sided die and adding it to each result: we may still be able to separate out the difference, but it is going to take much more data. In layman's terms, it's the difference between having a conversation in an empty room and having a conversation next to a trumpet player -- it's possible to understand what your companions are saying, it's just going to take a lot more work.
Many of the tests you run will not be available to all users; instead, they will only be shown to the subset of users who are eligible.
One approach to ensuring that logging always occurs is to have our configuration API do the logging for us, so that any call to `isEnabled` automatically logs the experiment, the selector, and the variant selected. This is a simple and appealing solution, but it introduces some subtle problems caused by unexpected side-effects. In this world, there is a difference between the following...
// correct behavior. will only log for mobile devices
isAMobileDevice(request) && isEnabled('show_mobile_redesign')
and
// incorrect behavior. will log for all devices.
isEnabled('show_mobile_redesign') && isAMobileDevice(request)
In the first case, we only log the experiment if the given user is eligible and actually has a chance of seeing the new experience, whereas in the second case all of our users are bucketed into one of our variants, even though not everyone can experience it. This means that the variant for 'show_mobile_redesign' does not represent what a user actually experienced. One of the major problems here is that the error is fairly hard to detect during development, since spotting the side-effect requires an understanding of proper experimental design and possibly some introspection of the logging system.
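One way to make that side-effect visible is to put the exposure log inside the flag API itself; the signatures below are illustrative and build on the earlier sketches:
// Sketch: the flag API both answers the question and records the exposure,
// so every true/false it hands out is backed by a log line. As in the mobile
// example above, the caller still has to check eligibility *before* asking,
// otherwise ineligible users get bucketed and logged into a variant they
// will never see. select() and RequestContext come from the earlier sketch.
function isEnabled(feature: string, ctx: RequestContext): boolean {
  const { variant, selector } = select(feature, ctx);
  logExposure({ feature, variant, selector, userId: ctx.userId });
  return variant !== null;   // null means the implicit control group
}

function logExposure(event: {
  feature: string;
  variant: string | null;
  selector: string;
  userId: string;
}): void {
  // In practice this goes to your event pipeline; stdout keeps the sketch
  // self-contained.
  console.log(JSON.stringify({ type: "exposure", ts: Date.now(), ...event }));
}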
Much of this design is heavily influenced by Etsy's feature flagging framework, which is a great piece of software; many of the complexities and use-cases have been thoroughly thought through. However, many of the details of a great feature system depend on how it integrates with your larger infrastructure. Minimizing your dependencies, thinking through how features work in different environments or services (web request, cron job, etc.), and safely and quickly deploying features all have very real implications for how successful the system will be.