File Formats Draft
I've been thinking about file formats lately. Looking at different formats, a few concepts show up in many of them, the main ones being **textual data** and **hierarchy**. These concepts are usually implemented by stacking encoding layers. For example, the SVG format can be seen as the following stack of abstractions: `SVG -> XML -> UTF-8 -> Binary`. Here, XML provides the hierarchy, while the encoding of textual data is split between XML and UTF-8. You might wonder why textual data is handled in both layers. The reason is that one cannot simply paste an arbitrary string into an SVG file and expect it to work: many strings contain characters, such as `<` and `&`, that have a meaning to the surrounding markup. The common solution is escaping. In XML, such characters are written as entity references (`&lt;`, `&amp;`) so that the parser does not treat them as markup. Other than that, textual data is encoded in UTF-8.
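To make the two layers concrete, here is a minimal Python sketch using only the standard library. The sample string and the `<text>` wrapper element are made up for illustration; it is one way to observe the effect, not a statement about how any particular SVG tool works.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Illustrative text containing characters that are meaningful to XML markup.
raw = "if a < b && b > c"


def wrap(text: str) -> bytes:
    """Embed text in a (made-up) <text> element and encode it as UTF-8."""
    return f"<text>{text}</text>".encode("utf-8")


def parses(document: bytes) -> bool:
    """Return True if the bytes form a well-formed XML document."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


# UTF-8 layer only: the raw string is pasted in and collides with the markup.
naive = wrap(raw)

# XML escaping first, then UTF-8: the document is well-formed.
escaped = wrap(escape(raw))

# The parser undoes the XML-level escaping and hands back the original text.
round_tripped = ET.fromstring(escaped).text
```

Here `parses(naive)` is false and `parses(escaped)` is true, and `round_tripped` equals `raw`. The text had to pass through two encoders (`escape`, then `.encode("utf-8")`) before it could be stored.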
Having to handle text encoding at two levels of the stack doesn't seem optimal, does it? As software developers, we usually strive for separation of concerns: text encoding should be handled in exactly one place. Let's try to resolve this.
The issue arises because the markup that provides hierarchy and the textual data are both encoded as UTF-8, so encoded text and encoded hierarchy overlap: the same byte sequences can mean either one. As a result, any layer built on top of UTF-8 cannot contain both markup and arbitrary, unescaped textual data. In the SVG example, UTF-8 is the first encoding layer above the raw bytes, which is what makes SVG a text format.
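The overlap can be seen directly at the byte level. The following is a small sketch with hypothetical fragments, chosen only to show that UTF-8 itself carries no information about whether a byte belongs to markup or to text.

```python
# Hypothetical fragments, chosen only for illustration.
tag = "<b>"      # intended as markup
text = "a<b>c"   # intended as plain text that happens to contain "<b>"

tag_bytes = tag.encode("utf-8")
text_bytes = text.encode("utf-8")

# Once encoded, the markup byte sequence appears verbatim inside the text.
overlap = tag_bytes in text_bytes

# "<" is the single byte 0x3C in UTF-8, regardless of the role it plays.
lt_byte = "<".encode("utf-8")
```

Since `overlap` is true, a reader of the byte stream cannot tell the two roles apart without an extra convention such as escaping.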