Data/API compatibility rules

Reasoning formally about data compatibility

Data compatibility in distributed systems is a problem that is often addressed in ad-hoc ways and suffers from bikeshedding. Developing a formal framework for reasoning about it is crucial to ensure the seamless evolution of complex systems and easy inter-connectivity of components.

Disclaimer

This gist talks specifically about backward compatibility rules. Intuitively, forward compatibility rules are the opposite of backward ones.

Defining the problem

The problem of data compatibility can be stated as follows: given two components A and B that agree to communicate using some data format Data, how can A change the shape of Data without telling B, in a way that lets B continue to function after the change?

What is data?

In order to formalise the problem further, the concept of data needs to be deconstructed. Similarly to the question "what is a number?", the answer is very much context-dependent. Numbers can be rational, real, complex, discrete, etc.; what is true and untrue about numbers depends on the context in which we set ourselves.

In a similar fashion, the set of rules that allow us to define data models can vary depending on the metamodel, which is the definition of the various constructs that can be used in the definition of models (shapes of data).

However, there are some constructs that are part of most metamodels. Formalising a framework against those constructs already solves a large part of the problem in most contexts in which it arises:

  • Primitive types (a set of atomic constructs that cannot be decomposed further)
  • Product types (aka records, akin to the "and" algebraic operator)
  • Sum types (aka coproducts/unions, akin to the "or" algebraic operator)

In order to illustrate the presented notions in a way that is easy to relate to the concrete world of engineering (where JSON is a somewhat hegemonic format), we'll use a simple metamodel limited to the following semantics:

  • Primitives are Int, String, Boolean
  • Labeled products look like this: { val int : Int ; val bool : Boolean }
  • Tagged unions look like this: { val int : Int } | { val bool: Boolean }

These semantics are compliant with the Scala 3 programming language, using its "Structural Types" feature, which allows us to form types that "look" like JSON.
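As a quick sanity check, these declarations compile as-is (a minimal sketch; the reflectiveSelectable import is only needed to select members of a structural value):

import scala.reflect.Selectable.reflectiveSelectable

// a hypothetical record conforming to the labeled product above
val record: { val int : Int ; val bool : Boolean } = new { val int = 1; val bool = true }
val i: Int = record.int // selection goes through reflection, hence the import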

What is communication?

The concept of communication between components, whether synchronous (client-server) or asynchronous (producer-consumer), can be formalised using functions and higher-order functions.

Having access to an A => B function means "if I give a piece of data of type A, I'll obtain a piece of data of type B".

In the context of synchronous communication

In the context of synchronous communication (client-server), we often refer to Application Programming Interfaces (API), which are merely products of functions, where each function can be called independently.

type API = {
  val function1: A => B
  val function2: B => C
}

If a client provides a piece of data of shape A through the function1 "channel", they'll receive a piece of data of shape B.

In the context of asynchronous communication

In the context of event streams, a similar simplification can be applied to reason about what communication is, as consumers have access to a parameter-less next function allowing them to poll for the next piece of data.

type Stream = {
  val next: () => A
}

Of course, this simplification only works "conceptually", but what it means is that a consumer of an event-stream does not have to provide any data in order to get produced data.

If we can express rules that let us evolve synchronous APIs whilst retaining compatibility, we can use the same rules for asynchronous APIs.
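Indeed, since Scala's () => A (Function0) is covariant in A, a compatible evolution of a stream's event type is exactly a compatible evolution of a function's returned type. A quick sketch, using hypothetical Event shapes:

type Event   = { val id : Int }
type EventV2 = { val id : Int ; val flag : Boolean } // added a field

// a stream producing the richer events substitutes for the original one
summon[(() => EventV2) <:< (() => Event)]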

Providers vs Consumers

Communication between components is seldom even-handed: most of the time there is a provider and a consumer. In the context of client-server communication, the server "provides" an interface to the client, which "consumes" the API. This relationship can be modelled as the following method:

def consume(api: API) : Unit

This hierarchy is important to understand the problem of compatibility, as the provider has control over the shape of the communication. After the shape has initially been agreed upon, making changes to that shape can break the consumer. The problem of compatibility has to do with making changes to that shape in a way that does not break the consumers.

Therefore, making a compatible change to API means creating a new type APIv2 that can safely substitute for the original API, so that the following works:

val myAPIv2 : APIv2 = ???
consume(myAPIv2)

In other words, making a change to an API that doesn't break consumers means creating a subtype of API that incorporates the changes.
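For instance, here is a minimal sketch (with hypothetical record shapes) where APIv2 enriches the data returned by a function, and the compiler confirms the substitution is safe:

type API   = { val function1 : Unit => { val b : Int } }
type APIv2 = { val function1 : Unit => { val b : Int ; val b2 : Boolean } }

summon[APIv2 <:< API] // consume(myAPIv2) typechecks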

The subtyping lattice

The Scala language offers a graphic representation of its type hierarchy, which can be used to build an intuition of what safe substitution means.

To those relationships, we can add the "union" and "intersection" types introduced in Scala 3.

Given types A and B,

  • A | B is a supertype of both A and B
  • A & B is a subtype of both A and B

So we have the following relationships (where <:< means "is a subtype of"):

A <:< A | B
B <:< A | B
A & B <:< A
A & B <:< B

Retaining compatibility for an API means exclusively going down this lattice of types. If we cannot prove a subtyping relationship between API and APIv2 using these rules, then, given our metamodel, the change is breaking.

Having set that up, we can start playing with the actual Scala 3 compiler to prove relationships between structural types:

summon[{val a : Int; val b : Boolean} =:= ({val a : Int} & {val b: Boolean})]
summon[{val a : Int; val b : Boolean} <:< {val a : Int}]
summon[{val a : Int; val b : Boolean} <:< {val b : Boolean}]
summon[{val a : Int} <:< ({val a: Int} | { val b : Boolean})]

To reinforce the intuition, we can also relate these structural types to case classes.

case class A_and_B(a: Int, b: Boolean)
summon[A_and_B <:< ({val a : Int} & {val b : Boolean})]

Unfortunately, the compiler doesn't let us prove similar properties for coproducts/enums (by the Scala 3 developers' own admission, this is due to performance concerns, but the following equivalence is very much true in theory):

enum A_or_B {
  case A(a: Int)
  case B(b: Boolean)
}

summon[A_or_B =:= (A_or_B.A | A_or_B.B)] // doesn't compile
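The individual directions, however, are provable, which is enough to reinforce the intuition:

summon[A_or_B.A <:< A_or_B]
summon[A_or_B.B <:< A_or_B]
summon[(A_or_B.A | A_or_B.B) <:< A_or_B]
// only the reverse direction, A_or_B <:< (A_or_B.A | A_or_B.B), is missing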

Variance and subtyping

Because we're representing an API as a product of functions, it is important to understand how the variance of functions impacts our ability to go down the lattice:

// the function type in Scala.
type =>[-A, +B]

Here, the type B (data produced) is covariant, whereas the type A (data consumed) is contravariant.

So in order to evolve a function of type A => B whilst retaining compatibility, we need to either have B go down the lattice, or have A go up the lattice.

An easy way to remember this is "give more, take less": when it comes to product types, we can easily go down the lattice by adding fields. Since the produced data needs to go down, this means we can add data to it without breaking the consumer.

type B = { val b : Int }
type Bv2 = { val b : Int; val b2 : Int }

summon[Bv2 <:< B]

This makes sense: if the API consumer can work when only b is present, it will work when both b and b2 are present.

type A = { val a : Int; val a2 : Int }
type Av2 = { val a : Int }

summon[A <:< Av2] // A, having more fields, sits lower in the lattice

Conversely, the data consumed by the function needs to go up the lattice, meaning that if the API suddenly doesn't need a2 to be sent anymore, it does not matter if the API consumer keeps sending it.
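The compiler confirms that the function taking less substitutes for the original (a sketch, reusing B from above as the returned shape):

summon[(Av2 => B) <:< (A => B)] // taking less is a compatible change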

Variance and unions

Looking at the type lattice representation, we can infer that adding alternatives to a union means traveling upward in the lattice, whereas removing alternatives from a union means traveling downward in the lattice.

Because contravariance inverts the direction in which a modification makes us travel in the lattice, we can safely give more options for the data that the consumer of the API has to send to get results.

type A = { val a : Int }
type A2 = { val a2 : Boolean }

type B = { val b : Int }
type B2 = { val b2 : Boolean}

type API = A => B
def consume(api : API) : Unit = ???


type APIv2 = (A | A2) => B
val apiV2: APIv2 = ???
consume(apiV2) // compiles because ((A | A2) => B) is subtype of (A => B)

This makes sense because if the user was sending A before, and we add the option to send A2 without them knowing, we're still accepting A, and therefore they continue to function as expected.

However, it means that we cannot add alternatives to the data we send back to them, as that would mean their logic handling this data would no longer be exhaustive; therefore we'd risk breaking them.

type APIv2 = A => (B | B2)
val apiV2 : APIv2 = ???
// below doesn't compile because (A => (B | B2)) is not subtype of (A => B)
consume(apiV2)

That being said, it does mean that we can remove alternatives from the data we send back without breaking the consumer. Indeed, if the consumer is prepared to handle both B and B2, not sending B2 anymore will not affect their function.

type API = A => (B | B2)
def consume(api : API) : Unit = ???
type APIv2 = A => B
val apiv2 : APIv2 = ???

consume(apiv2) // compiles because (A => B) is a subtype of (A => (B | B2))

The specific case of optionality

In Scala, the Option[A] type is isomorphic to A | Null, which allows us to treat options as a mere special case of unions.
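A minimal sketch of that correspondence, using a hypothetical Payload shape (this assumes the -Yexplicit-nulls compiler flag, under which flow typing narrows np to Payload in the else branch):

type Payload = { val b : Int }

def toNullable(opt: Option[Payload]): Payload | Null = opt.getOrElse(null)
def fromNullable(np: Payload | Null): Option[Payload] =
  if np == null then None else Some(np)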

The special case of adding optional data to parameters

WARNING: the reasoning in this subsection is brittle.

We have established that we cannot add fields to a function parameter whilst retaining compatibility, as adding fields violates the downward progression of APIs in the lattice (since contravariance inverts the direction of the progression).

However, even though the Scala compiler does not allow for it, we can conceptualise "imaginary" equivalent types for each record, that have added Null fields:

type A = {
  val a : Int
}

type ImaginaryA = {
  val a : Int
  val b : Null
}

A and ImaginaryA are virtually equivalent, because we can go from one to the other in a completely decidable manner, by adding a b = null value when we have an A and an ImaginaryA is required.

So in this imaginary world, A =:= ImaginaryA (we can think of it as similar to complex numbers, in which an imaginary i value exists, the square of which is -1).

Because it is possible to add alternatives to a type in contravariant (parameter) position without breaking compatibility, we can have:

type AB = {
  val a : Int
  val b : Boolean | Null
}

summon[ImaginaryA <:< AB]

By transitivity of the subtyping relationship (A <: B and B <: C implies A <: C), and since the equivalence A =:= ImaginaryA gives us A <: ImaginaryA, we can infer that A can be used "somewhat safely" where AB is required.

Concretely, this means that it is "kind of okay" to add optional fields to an API parameter type, as the API consumer, unaware of the existence of this new field, will not provide a value for it, meaning the provider will see it as null.

On the covariant (returned type) side, because we can always add fields to a record without breaking compatibility, we can always add an optional field, and decide to make it required later (since going from A | Null to A means removing an alternative from a union, which takes us downward in the type lattice).
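The covariant side of this reasoning is fully checkable by the compiler. A sketch with hypothetical shapes:

type B    = { val b : Int }
type BOpt = { val b : Int ; val b2 : Boolean | Null } // add an optional field...
type BReq = { val b : Int ; val b2 : Boolean }        // ...then make it required

summon[BOpt <:< B]    // adding the optional field goes down the lattice
summon[BReq <:< BOpt] // so does making it required: Boolean <:< (Boolean | Null)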

NB: once again, the reasoning in this subsection is brittle. In reality, engineers tend to build false expectations about the requiredness of fields by looking at existing data, and do not necessarily think about optionality. Because they see a field "x" in a payload, they assume all payloads will contain "x" and build pipelines with that expectation in mind.

Be very careful when using optionality.

Changing nested data

It is possible to change nested fields as long as the change respects the downward progression in the lattice:

type A = { val int : Int }
type A2 = { val int : Int; val bool : Boolean}
summon[A2 <:< A] // A2 is subtype of A

type B = { val a : A }
type B2 = { val a : A2}
summon[B2 <:< B] // we've changed the type of a in a compatible way

Summary

We can now summarise the rules of compatible evolutions.

See https://scastie.scala-lang.org/Baccata/O9dLokkDQjy6zeCb0gQzXw/5 to play with the actual proofs.

import scala.util.NotGiven as Not
def prove[A](using A) = summon[A]
def disprove[A](using Not[A]) = summon[Not[A]]

object proof {

  type A = { val a : Int }
  type B = { val b : Boolean }

  type A_and_B = { val a : Int ; val b : Boolean }
  type A_or_B = A | B

  prove[A_and_B =:= (A & B)]
  prove[A_and_B <:< A]
  prove[A_and_B <:< B]
  prove[A <:< A_or_B]

  // Functions :

  type A2 = { val a2 : Int }
  type B2 = { val b2 : Boolean }

  // Can add fields in returned types
  prove[(A => (B & B2)) <:< (A => B)]

  // Can remove fields from parameter types
  prove[(A => B) <:< ((A & A2) => B)]

  // Can add alternatives to parameter types
  prove[((A | A2) => B) <:< (A => B)]

  // Can remove alternatives from returned types
  prove[(A => B) <:< (A => (B | B2))]

  // Must not add fields to parameter types
  disprove[((A & A2) => B) <:< (A => B)]

  // Must not remove fields from returned types
  disprove[(A => B) <:< (A => (B & B2))]

  // Must not add alternatives in returned types
  disprove[(A => (B | B2)) <:< (A => B)]

  // Must not remove alternatives from parameter types
  disprove[(A => B) <:< ((A | A2) => B)]

}

What about breaking changes?

The need to perform changes that break APIs (and consumers of event streams) does happen.

In these examples, we have established rules with regard to functions. As a matter of fact, an API is closer to a product of functions:

type API = {
  val functionV1 : A => B
}

Because it is a product, we can add function fields without breaking the consumers:

type APIv2 = {
  val functionV1 : A => B
  val functionV2 : A2 => B2
}

Therefore, we can evolve APIs by adding functions, without breaking API consumers, since APIv2 <: API.
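This is again something the compiler can prove. A self-contained sketch, reusing hypothetical shapes for A, B, A2 and B2:

type A  = { val a : Int }
type B  = { val b : Boolean }
type A2 = { val a2 : Int }
type B2 = { val b2 : Boolean }

type API   = { val functionV1 : A => B }
type APIv2 = { val functionV1 : A => B ; val functionV2 : A2 => B2 }

summon[APIv2 <:< API]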

Breaking changes in the case of event streams

In the case of event streams, the conceptual representation is :

type EventStreamV1 = {
  val urn1Next : () => Event1
}

In other words, a lineage of events is associated with an urn, here urn1. A breaking change in the modelling of Event1 would imply the creation of a new urn: urn2.

type EventStreamV2 = {
  val urn1Next: () => Event1
  val urn2Next: () => Event2
}

A side note regarding this representation: in a real platform, there are different ways to handle such a change. Either the addition of a new Event2 type would lead to the creation of a whole new stream at the infrastructure level (think Kafka, Kinesis), or both Event1 and Event2 would be collocated/multiplexed in a single stream, and it would be the responsibility of the consumer to "unweave" the data to obtain the specific version they want, discarding the rest.

The actual ergonomics might greatly differ between these options and how they end up being implemented, but I want to re-emphasise that the notions presented in this document are valuable conceptually, with the goal of establishing a formal framework for making decisions on how to evolve data.

If it helps, you can keep in mind that push-based streams (KCL fan-out) can be turned into pull-based ones via a queue, giving you the () => Event semantics.
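A minimal sketch of such an adapter (the PushToPull name is hypothetical; a blocking queue bridges the push side to the pull side):

import java.util.concurrent.LinkedBlockingQueue

class PushToPull[Event] {
  private val queue = new LinkedBlockingQueue[Event]()
  // the push-based source calls this callback for every incoming event
  val push: Event => Unit = event => queue.put(event)
  // consumers get the conceptual parameter-less next, blocking until data arrives
  val next: () => Event = () => queue.take()
}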

Regarding sanitisation / schema deprecation and removal

We've seen that the way to introduce breaking changes to APIs (synchronous, asynchronous) is to add more functions that carry the new versions of the shapes.

In reality, engineers do not like this. They are used to modifying code as opposed to duplicating it. However, a version of a data model needs to live for as long as it is consumed by a component of the platform. We can only remove schemas when we know for sure that nobody needs them anymore.

A good way to ensure that we can remove obsolete versions of schemas / API functions is to have consumers register themselves with producers, and explicitly state what they consume.

type ProducedModel = { val a : Int; val b : Boolean; val c : String}

type ConsumedModel1 = { val a : Int }
type ConsumedModel2 = { val c : String}

summon[ProducedModel <:< ConsumedModel1]
summon[ProducedModel <:< ConsumedModel2]

In the example above, which involves a data producer and two data consumers, we can infer that b is not used, and therefore could safely be removed from ProducedModel.
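For instance (a hypothetical sketch), the producer could turn those registrations into compile-time checks, which would fail as soon as a removal affects a registered consumer:

// hypothetical helper: proves the produced model still covers a consumer's needs
def stillCovers[Produced, Consumed](using Produced <:< Consumed): Unit = ()

stillCovers[ProducedModel, ConsumedModel1]
stillCovers[ProducedModel, ConsumedModel2]
// removing b from ProducedModel keeps both checks passing;
// removing c would break the check for ConsumedModel2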

Whether asking consumers to register themselves is acceptable or not is another question, but it would open the door to solving a lot of issues technically.

Conclusion

This document attempts to establish formal rules for schema evolution, and hints at how breaking changes can be turned into non-breaking ones. Having the rules backed by an actual proof system (in our case, the Scala compiler) means the rules are objective, easier to justify to engineers, and amenable to automation.
