Why do I have to write JSON decoders in Elm?

A vision for data interchange in Elm

How do you send information between clients and servers? What format should that information be in? What happens when the server changes the format, but the client has not been updated yet? What happens when the server changes the format, but the database cannot be updated?

These are difficult questions. It is not just about picking a format, but rather picking a format that can evolve as your application evolves.

Literature Review

By now there are many approaches to communicating between client and server. These approaches tend to be known within specific companies and language communities, but the techniques do not cross borders. I will outline JSON, ProtoBuf, and GraphQL here so we can learn from them all.

JSON

Within the JavaScript community, it is very common to use the JSON format for data interchange.

Pros:

  • It integrates perfectly with JavaScript.
  • It integrates decently with dynamically typed languages like Ruby and Python.
  • It is often human readable, though you may need to prettify it first.

Cons:

  • It does not fit very naturally with typed languages. (See the decoder sketch after this list.)
  • It is not very dense. Field names like "price" can easily be repeated hundreds or thousands of times in a single message.
  • No guidance on how to evolve as your client and server change. You just change the JSON and write some tests; everyone seems to make it up as they go.
  • No guidance on how to efficiently store information in your database. The database person will deal with that.
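
To see that friction concretely, here is roughly what a hand-written decoder looks like in Elm for a small record. The Product type and its fields are invented for illustration.

import Json.Decode as Decode exposing (Decoder)

-- An illustrative record.
type alias Product =
    { name : String
    , price : Float
    }

-- The decoder restates the record's structure field by field,
-- even though the type alias already describes that structure.
productDecoder : Decoder Product
productDecoder =
    Decode.map2 Product
        (Decode.field "name" Decode.string)
        (Decode.field "price" Decode.float)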

ProtoBuf

When I was at Google, we used protobuf for everything. You can think of it as an effort to do better than XML, which shares many of JSON's weaknesses.

Pros:

  • Protobuf is designed to be as dense as possible. By specifying the format in a .proto file beforehand, you can send the bits without any field annotations like "price". Just the data.

  • Protobuf has good support for union types. The latest version (proto3) supports oneof, making union types pretty easy to represent.

  • Protobuf is designed to evolve with your application. There were two major rules about protobufs when I was at Google: (1) you never remove an existing field, and (2) every field you add is optional. Together these rules guarantee that the data format is always fully backwards compatible, because you can only ever add optional fields. A client can check if fields X and Y are there. If so, great! If not, maybe the server is not updated yet. This decouples the code for data exchange from the code for backwards compatibility. (The sketch after this list shows these rules applied in client code.)

  • Protobuf facilitates use of different programming languages. You compile .proto files to C++, Java, JavaScript, etc. that can unpack protobufs into nice data structures within the language. If you decide to revamp your client or server, you do not need to revamp the data interchange format to make things feel nice.

  • Within Google, databases were set up to use protobufs. So you actually had the same format between client, server, and database. I do not think this pro exists outside of Google right now, but it at least points to a theory of how server and database can work together better.
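
To make those two rules concrete, here is a minimal Elm sketch of what they imply on the client when the payload happens to be JSON. The Price type and its fields are invented for illustration.

import Json.Decode as Decode exposing (Decoder)

-- "amount" has been in the format from the start and is never removed.
-- "discount" was added later, so it is optional: old servers that omit
-- it still produce a valid value on the client.
type alias Price =
    { amount : Float
    , discount : Maybe Float
    }

priceDecoder : Decoder Price
priceDecoder =
    Decode.map2 Price
        (Decode.field "amount" Decode.float)
        (Decode.maybe (Decode.field "discount" Decode.float))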

Cons:

  • You must specify the format explicitly up front, rather than defining it implicitly every time you write data interchange code on the client or server. I include this here mainly because I can see why people would feel this way. I personally feel that planning ahead pays for itself within a matter of hours, but other people see the delayed payoff differently.

GraphQL

Facebook introduced GraphQL in the last few years. You can think of this as an effort to address the question of how the data interchange format can evolve as your client and server change.

Pros:

  • It lets you delay database optimization. Client-side developers can just say what data they need, rather than asking database engineers to add specific queries for them. Database folks can then observe traffic patterns and optimize based on that, focusing their work on problem areas more effectively.

  • The data extracted is JSON. That means it works well when you use JavaScript, Flow, TypeScript, and other things like that.

  • The format permits “query optimization” so you can send fewer requests. Basically, if you have two GraphQL queries, you can combine them into one. This means less time doing handshakes with a server and less data sent across the wire if the two queries needed any overlapping data. (A sketch follows this list.)

  • You can optimize request size. Because the requests are all specified in your .graphql file, you can choose a denser representation for certain requests. As long as the client and server agree on which request “1” refers to, that is all you need to send.

  • You can support multiple languages. Because you have .graphql files, you can generate code in JavaScript, Python, Ruby, Elixir, or whatever else. Changing to a new language does not require redesigning your data interchange format.
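
As a sketch of that query merging, assuming a hypothetical schema with a user field:

# Two queries that each look up the same user...
query Name { user(id: 4) { name } }
query Avatar { user(id: 4) { avatarUrl } }

# ...can be combined into a single round trip:
query NameAndAvatar {
  user(id: 4) {
    name
    avatarUrl
  }
}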

Cons:

  • I do not personally know if it pushes you to design for an evolving application. It may be that there is a culture of “no required fields” and “never remove fields” like with protobuf, but I do not know the details for sure. This should be a pro if the community of users embraces that approach in practice!

Lessons

  1. Big projects use .format files. This decouples the data interchange format from the particulars of the client or server.

  2. Big projects care about asset size. When you are serving lots of data to lots of people, a couple bits here and there really add up.

  3. Decoupling the data interchange format from what people do with it is useful. For Google that was about backwards compatibility, and for Facebook it is about making sure client-side work is not blocked by server folks.

Lessons for Elm

For some reason we treat JSON as fine. It is a poor choice on basically every metric that matters for building reliable, efficient, and flexible applications. It seems unreasonable to think that folks will be using JSON on purpose in 10 or 20 years.

At this moment in the Elm ecosystem, folks can be quite upset that you have to write JSON decoders by hand. Rather than questioning the idea of using JSON at all, they wonder if Elm itself should be generating JSON decoders. But the history of building in support for a particular interchange format is pretty poor. For example, Java assumed that XML was the way to go. Scala followed in that tradition and actually has XML syntax as part of its parser, so XML is valid Scala. I do not think it is easy to argue that this direction was wise, and I know there are efforts to get this out of Scala.

In contrast, the success stories of protobuf and GraphQL are all around having a specific .format file and generating code in whatever language you need. This does not require any specific language support, so languages do not get tied to interchange formats and vice-versa.

I hope that rather than lamenting that Elm lacks language features that have historically turned out very poorly, folks can start getting into the world of data interchange formats. There is a rich history here, and many options. Going this route will mean better interop with other languages, smaller assets sent across the wire, no need to deal with JSON by hand, and no need to clutter the language.

jhrcek commented Oct 19, 2017

Insightful read Evan. Thanks for that! I'm 100% with you about not baking JSON support into the Elm language.
Two typos (just ctrl+f these): "It integrates decently with dynamicly languages" and "(2) every field you is optional."

antew commented Oct 19, 2017

I'm new to the Elm community, but I've been a developer for many years, and I really appreciate that you put these discussions out in the open and foster a great community around the language. Thank you!

I wholeheartedly agree that JSON is an imperfect format, and better data interchange formats exist, but the decision to use JSON may not lie with the front-end Elm developer. REST web services delivering JSON are the most popular form of web service, and for good or ill they will probably remain popular because they are popular.

Twenty years ago, nobody thought people would be using JavaScript in 20 years. It's terribly inconsistent, and it was written in around 10 days in, I assume, a coffee-induced fugue state. JavaScript won because when developers showed up to the party on the World Wide Web, it was waiting at the door with a rose and a crooked smile. It should not be as popular as it is, by any measure, except that it embraced Worse is Better and it was first and worst.

Coming from other stacks I can take a WSDL and generate a client, or I can document my API with Swagger and generate libraries for a bunch of languages. I think Swagger is a great solution for packaging up REST APIs for different languages/environments, and could be a good project for the community.

Whether the future is protobufs or something else, there is going to be a transition period where people have to interop with JSON APIs, and making that easy reduces friction for new developers and lets you develop things more quickly.

I keep wondering why you are spending so much time and effort explaining things like this. We understand why you made the choices you did; that is not the issue and never has been.

Regarding JSON, simply put, I think the community wants a way to generate decoders; it does not matter whether that comes via a language feature (like macros), as part of the language, or anything else. It has to be integrated into the language somehow. Using a service like json-to-elm is helpful but hard to find, and copy-pasting every time something changes is tedious.

I hope that rather than lamenting that Elm lacks language features that have historically turned out very poorly, folks can start getting into the world of data interchange formats.

People who are starting to use Elm are using JSON because you chose it from the possibilities you mentioned. There is support for it in the core package, decoders are generated automatically for ports, and it is the basic way to do JS interop (in events). It's not feasible to use anything else right now.

About generating decoders: it is already implemented, just not exposed to the public, so you can see why people are lamenting.

I'm again using the Crystal language as an example. It's a typed language that has a really nice way of handling JSON and generating decoders and encoders for data: https://crystal-lang.org/api/0.23.1/JSON.html#mapping%28properties%2Cstrict%3Dfalse%29-macro.

To me these documents are justification for yourself rather than the community.

Another example of how to handle de-/encoding in a nice way without marrying the language to a given format comes from Rust. In Rust, the [serde](https://serde.rs/#data-formats) library is the de facto solution for this problem and only requires that you annotate your type definition to tell the compiler to run the Serialize or Deserialize macro. And if the type's field names don't match the JSON, that can be customized as well.

prozacchiwawa commented Oct 19, 2017

For a different perspective on this, I find Elm's JSON decoders more concise and easier to reason about than competing methods in strictly typed languages. I've run into enough situations involving symmetric encoders and decoders that I wrote http://package.elm-lang.org/packages/prozacchiwawa/elm-json-codec/1.0.0/JsonCodec. While not perfect, its most common uses save me both code and cognitive load, so I think there's a case to be made for better symmetry in the core library.
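
For example, a plain decoder/encoder pair states the same record shape twice, once per direction; that duplication is what the package above tries to remove. A minimal sketch with an invented User type:

import Json.Decode as Decode exposing (Decoder)
import Json.Encode as Encode

type alias User =
    { name : String }

-- The shape is written out once to decode...
userDecoder : Decoder User
userDecoder =
    Decode.map User (Decode.field "name" Decode.string)

-- ...and again to encode.
encodeUser : User -> Encode.Value
encodeUser user =
    Encode.object [ ( "name", Encode.string user.name ) ]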

I've also written a bit about my approach to JSON decoding in Elm (https://medium.com/@prozacchiwawa/the-im-stupid-elm-language-nugget-16-295f201eb458). Whatever the JSON decoder turns into, what I might request is the ability to construct the error value myself, with access to the literal value that failed.

import Json.Decode.Err as JDE

type DecodeError = NameField Value | PhoneField Value -- Error type for decoder reports
nameDec = JDE.string (always NameField) -- JDE.Decoder DecodeError String
-- ...
userDec = JD.map2 (,) nameDec phoneDec  -- JD.Decoder (String,String)
type alias UserError = { e : DecodeError, v : Value }
userErrorDec = JDE.map2 (,) UserError nameDec phoneDec -- JDE.Decoder UserError (String,String)

JDE.decodeString """[...,{"name":"foo"},...]""" (JDE.list userErrorDec identity)
-- Err { e = PhoneField JD.undefined, v = {"name":"foo"} }

This would both allow for proper, relevant messaging and make it easier to identify failing values in potentially large collections.

Elm's type system basically requires code generation to make use of external data interchange formats convenient (recently I made a basic wrapper for Ethereum's JSON-RPC, and it's very clunky in its current form). Whenever I get to the point of thinking about generating Elm code, though, I generally back off, since it's usually simple enough to just bang things out by hand even if it takes an hour or two. Elm feels a bit stuck in this area; if it came down to generating JavaScript or C with a small script, I'd gladly do it.

Something that holds me up on generating Elm code, possibly irrationally, is the feeling that doing whitespace formatting on generated Elm isn't worth it... It'd be nice if the Elm compiler accepted a compatible non-whitespace-significant syntax so I could write something fast and crappy to spit out code.

G4BB3R commented Oct 19, 2017

I think Decoders are format agnostic, so we could use an automatic Decoder generator with no problems. If Decoders target another format in the future, the automatic generator would still work.

boxed commented Oct 19, 2017

Agree 100%. But I do think we need to get work done in the meantime. Which is why I keep suggesting code generation for now.

We’ve been developing a production app with GraphQL query files + Elm code generation, and it has been a joy. There is a huge side benefit: our build system can rerun that code generation whenever the backend is deployed, to test whether it breaks compilation of the front end (changed or missing field types), and front-end deploys run the generation against the prod API and halt if the live schema can’t satisfy the client. So we have some form of “type-safe deployment” across front end and backend.

sgwilym commented Oct 19, 2017

I can confirm that GraphQL pushes you to design around an evolving schema: when I have talked to FB employees about how to maintain a growing schema, they have told me that they have never removed fields from their schema, ever. We’ve done the same where I work and it has served us well.

Also, fields can only be required in the sense that you must query something from an object type, but you can't stipulate that you must always query the name field from a User, for instance.

@leebyron, one of the main GraphQL team members, explicitly recommends add-only GraphQL schemas here: facebook/graphql#134 (comment).

There is a bit of info on versioning here as well: facebook/graphql#175 (comment).

OlegIlyenko commented Oct 19, 2017

Great post! A small clarification regarding this point (GraphQL section):

The data extracted is JSON. That means it works well when you use JavaScript, Flow, TypeScript, and other things like that.

GraphQL actually does not require a specific serialization format (though "JSON is the preferred serialization format"):

GraphQL does not require a specific serialization format. However, clients should use a serialization format that supports the major primitives in the GraphQL response. In particular, the serialization format must support representations of the following four primitives: Map, List, String, Null

spacejam commented Oct 19, 2017

JSON is table stakes. While it does not excel in parsing performance, it has the massive benefit of being human readable. Its size is not as big a deal as you hint at because of typical gzip compression ratios (at the cost of some CPU, which is rarely the bottleneck for things talking to browsers). Too few large projects consider debuggability, to the madness of anyone who gets paged at 4am to figure out what's happening in a broken system. As you hint at in the GraphQL section, performance is not the primary constraint for most things, even at scale. Further, while protobufs support schema evolution if you follow a set of guidelines, there's little stopping you from doing the same with JSON (although some language-specific implementations may make this more cumbersome than others when optional fields are not present). As much as I love compact framed binary protocols for getting a high score on some benchmark I'm paying attention to, they very rarely beat the human-interface and compatibility benefits of JSON given the relevant trade-offs of the system at hand.

omouse commented Oct 19, 2017

Nice overview! My personal opinion is that ProtoBuf is the way to go. I've heard talk of using it at a few companies I've been at, because inevitably someone wants to use another language, or maybe we need to provide 3rd parties with a way to use our API and .... etc. etc.

Just for historical purposes: RDF and its companion SPARQL (a graph query language) were also efforts at data interchange.

OlegIlyenko commented Oct 19, 2017

I think it is also important to point out that it is not necessarily a mutually exclusive choice between human-readable string-based format and efficient binary format. I think Amazon Ion is a good example of a format that combines both of these aspects:

Amazon Ion is a richly-typed, self-describing, hierarchical data serialization format offering interchangeable binary and text representations. The text format (a superset of JSON) is easy to read and author, supporting rapid prototyping. The binary representation is efficient to store, transmit, and skip-scan parse. The rich type system provides unambiguous semantics for long-term preservation of business data which can survive multiple generations of software evolution.

andys8 commented Oct 19, 2017

Regarding the XML syntax in Scala: very interesting. It seems it has been decoupled from the language since 2.11.

I would also like to maybe pitch http://jsonapi.org/ here as a thing worth looking at, at least!

hashemi commented Oct 20, 2017

Add Swift Encoders as another possible source of inspiration.

Briefly, they use code generation to add generic encoders/decoders to the data types (opt-in). You then write a serializer for JSON or anything else you want.

kevinSuttle commented Oct 20, 2017

Don't forget Rich Hickey's EDN and the rest.

JSON also comes in several flavors: JSON5, JSON-LD, etc.

@steveklabnik - What do you think the advantages of JSON API over something like GraphQL are?

laczoka commented Oct 20, 2017

Hi @evancz,

interesting writeup.

I am not sure "No guidance on how to efficiently store information in your database. The database person will deal with that." is a cons for any data interchange format. Data interchange happens over the wire crossing language/system boundaries and the objectives and trade-offs are quite different from a (database) serialisation format.

Aside from different performance concerns (size, etc), a generally useful data interchange format needs to be extensible and self-describing. ProtoBuf is really neither.

Great examples of generally useful data interchange formats are Transit (from Cognitect) and Amazon Ion, both of which were designed with performance in mind.

I think the Cognitect folks produced a great analysis of the available formats (JSON, Avro, XML, ProtoBuf) and showed what these formats lack in order to be generally useful as data interchange formats. I can't find the resource right now, but if you do a shout out, I am sure they will be able to provide it to you.

leshow commented Oct 23, 2017

@hashemi That appears to require ad hoc polymorphism, which Elm doesn't have support for. AFAIK there's no way to say in Elm, this is some type that implements the interface/class/trait/protocol/whatever, which is what Encodable is doing there.

I don't really see what JSON in particular has to do with any of this. The problem, I thought, with Elm's JSON enc/dec is that you have to manually type everything out, because there is no way to define a generic encoder/decoder. So you're left with the choice of baking it into the language as magic. It's a consequence of the lack of generic expressiveness.

cmckni3 commented Oct 29, 2017

What do you think about Hypermedia?

danny-andrews commented Nov 3, 2017

Preface: I agree with your assessment of interchange formats and think we should strive for better alternatives to JSON.

folks can be quite upset that you have to write JSON decoders by hand

Yes. Because it's painful and error-prone.

I hope that rather than lamenting that Elm lacks language features that have historically turned out very poorly, folks can start getting into the world of data interchange formats.

You seem to be falling into the X-Y problem here. Elm users are upset by how painful it is to consume JSON in their applications, not that the solution doesn't involve baking JSON into the language. They don't care how the problem is solved; they just want a better path forward.

Also, how does "getting into the world of data interchange formats" help people solve this problem? Seems kind of out of touch to suggest that front-end devs wield the power to change the interchange formats all their backend APIs use.

Here's a possible solution: generate JSON decoders based on json-schema? It's quite expressive, and many shops already use it to document their APIs.

Philosophically good, but in practice this makes it harder to work with Elm. I just use Elm in the business-logic portions of the application and have other JS-based code talk to the APIs.
