paniq/BAG.md

## BAG.md

      
    Raw
  

              BAG.md
            
          
    The .bag File Format

A .bag file (short for "Blobby/Binary/Boxed/Bagged Atom Graph") file describes a directed graph of atoms, which can be either blob (array of bytes) or cell (array of atoms) in topological order of ownership, leaves first, root last. This renders the format particularly append/overlay-friendly, as the root can be rewritten without changing the existing file or removing unused nodes.
.bag is comparable to RIFF in scope but fixes two major problems with it, namely that RIFF can only encode trees and isn't topologically ordered, requiring the use of a stack when reading it.
The format can be read as a binary stream of 64-bit words. It begins with the magic 4-character sequence .bag and a 32-bit version which must be 1 (hence .bag\x01\0\0\0 or 0x000000016761622e), and is then only followed by atoms until the file ends.
An atom begins with a header word, followed by its contents, aligned to the next word. The header word can be read as a signed word-sized integer that encodes both type and length of the data, which is retrieved as follows:
is_cell = header < 0
is_blob = !is_cell
bytesize = header ^ (header >> (bitcount(word) - 1))
word_count = (bytesize + sizeof(word) - 1) / sizeof(word)

For blobs, bytesize indicates their length n in bytes. The content that follows is a n bytes string, padded with zeroes to the next word. The maximal theoretical size of a blob is 2^(bitcount(word) - 1) - 1 bytes.
For cells, word_count indicates their element count n. The content that follows is a n word sequence, of which each word is either a strong reference to an atom parsed so far (that is, the cell takes shared ownership of that atom) or a weak reference to any past or future atom in the file. A reference encodes weakness and index of the referenced atom, and can be decoded as follows:
is_weak = ref < 0
index = ref ^ (ref >> (bitcount(word) - 1))

The first atom of the file has index 0. A cell of index I may only strongly reference atoms whose index K satisfies K < I. This guarantees that we can construct purely acyclic graphs as we stream in the file, and also ensures that ownership remains acyclic. A weak reference may reference any atom in the file, and no restriction on the index exists. If the index points past the last index in the file, the weak reference is to be rewritten as empty. Hence 1 << (bitcount(word) - 1) safely encodes a physically unreachable index and can thus be taken to encode empty weak references.
The last atom of the file read is to be returned as the root of the graph.
Values of the file may also be read lazily by streaming in atoms as they are requested from the root. For this, a index to offset mapping must be created by reading atom headers only and skipping the following content, incrementing the read pointer by the word-aligned bytesize:
word_ptr += word_count
// or
byte_ptr += (bytesize + sizeof(word) - 1) & ~(sizeof(word) - 1)

When writing a .bag file, the header information can simply be encoded as follows:
header = is_cell?~(cell_count * sizeof(word)):blob_size

A reference is encoded as follows:
ref = is_weak?~index:index

It is convention, though not mandatory, to write a zero-length blob as the first atom of the file, so index 0 maps to an empty atom.
Implementation Notes

When writing a .bag file, the graph should first be visited without writing the file, following all strong references and assembling the topological order in a dynamic array (which maps index to atom), as well as building a reverse mapping from atom to array index. Then the atoms listed in the array are written to the file in linear order. Both strong and weak references can be translated to indices, as the mapping is complete. Since we never visited the graph through weak references, it is possible that some weak references can not be translated to indices as the target atom has not been recorded in the graph. These references should translate to the special empty index mentioned above.
When reading a .bag file, the acyclical graph is constructed as the file is traversed, with each seen atom appended to a dynamic array (mapping index to atom). Every strong reference will point to an index within the array and can be resolved directly. For dealing with weak references, there are three possible approaches:
A. If reference i of a cell I pointing to an unknown atom with index K is weak, its edge must be recorded to a todo-list as a tuple { &cell[I].element[i], K }. Once all atoms have been constructed, the todo-list can be worked off, resolving each index K to a known cell (or empty, if the index is invalid), and updating the memory location of the weak reference with it.
B. Every cell with weak references that need to be resolved later is recorded into a todo-list, mapping it to its location in the file. Once all atoms have been constructed, the todo-list can be worked off, completing for each cell its weak references, reading the respective offset from the file and resolving it with the index-to-atom array.
C. Every cell with weak references that need to be resolved later is recorded into a todo-list. The weak reference pointer in the cell is initialized with the index to be resolved. Once all atoms have been constructed, the todo-list can be worked off, completing for each cell its weak references, reading the index from the pointer and resolving it to the actual pointer with the index-to-atom array.