CMCDragonkai/history_data_structures.md

## history_data_structures.md

      
    Raw
  

              history_data_structures.md
            
          
    History Data Structures

For stateful applications, there are 5 different ways of managing the history of state:

No History - Living in the moment. - Examples: Any stateful application that doesn't discards all previous states upon mutation.
Ad Hoc Snapshotting - Allows restoration to manually saved snapshots. - Examples: Memento Pattern.
Singleton - Only remembers the previous snapshot, where undoing the undo is just another undo. - Examples: Xerox PARC Bravo.
1 Stack - Allows linear undo. - Examples: AtariWriter.
2 Stack - Allows linear undo and redo. - Examples: Browser History, Microsoft Word, Adobe Photoshop.
Append-Only Log - Allows moving forward with reversed patches/deltas (a.k.a. compensating transactions). - Examples: Git-Revert, Datomic, Blockchain, Event Sourcing.
Tree - The Multiverse Approach. - Examples: Git, Vim, Emacs with UndoTree, Delorean.

There's variations and hybrid approaches that have to deal with special cases involving multi-user history, which is relevant to operational transformation applications such as Google Docs.
I'm going to attempt to explain some of the characteristics, advantages, and disadvantages of the 5 different ways of managing history.
Firstly having no history is the simplest, in such a case, you don't care about the previous states, only the present matters.
Ad hoc snapshotting is probably the most common and flexible form of history management. This is because it's a completely manual process, that the user of any stateful application can perform. Imagine if you were writing on an ancient word processor. You can perform your own history management, by simply copying the entire state and saving it to the disk as version_1.txt..version_X.txt. Then restoring to any state is just a matter of inspecting the version you want, and deciding what to do with it. You can even start creating version branches via version_1-1.txt..version_1-X.txt. There are some major disadvantages. Firstly you always have to decide what "restoration" means. It could mean a replacement of your current state, but it also could mean a complex merge. Secondly it's a lot of work, as you have to manually curate your version history. This process isn't just for end-user software, but is equally applicable to programming in the small and programming in the large.
Singleton pattern is used a lot where one-step rollback is all that is required. You may see this in optimistic concurrency control or transaction rollbacks.
Now we get stacks, logs and trees. For these solutions, there are 2 different ways of encoding states. The snapshot method and the delta method. The snapshot method is where every "state" is a complete and independent snapshot of what the state was. Whereas the delta method (a.k.a. delta encoding) is the (ideally minimal) difference between the past and present. Both methods have advantages and disadvantages. The snapshot method uses more storage space, while the delta method requires more computation to merge the differences. If you store snapshots you can always construct the corresponding deltas, and if you store deltas you can always construct the corresponding snapshots. However the time to construct the snapshots takes much longer then when using the delta method. This is because you may need to transitively apply all the deltas from origin. Whereas for snapshots you can just calculate the delta between 2 snapshots. Another key realisation to make is that the same delta can be used for different states. That is the delta between state A to B can be called delta p, but it can also be applied to state C to get D. Here's an example:
A: [H, E, L, L, O]         C: [O, K]
                |               |
                |   p: O -> !   |
                v               v
B: [H, E, L, L, !]         D: [!, K]

It is interesting to note that the naive conception of a delta is usually what I call a strong delta. A delta that is unique to its input and output states, and cannot be used for any other states. The above diagram shows a weak delta because its both necessary and sufficient for there to exist an O to change to !, but makes no claim on any other properties on the input state. If it were to be a strong delta, the delta would be usable if and only if the input state is an O (note the difference between "there exists an O" and "is an O"). A weak delta can be made into a stronger delta by adding extra contextual constraints, such as the byte position of O, the surrounding context of O... etc. This issue arises in choosing the lock granularity when using databases with multiple lock granularity. When performing an atomic update transaction on a row in your database, do you need to lock your database thus enforcing contextual constraints on the entire database, or just the table the row is in which constrains input state to the context of the table, or only the specific row itself?
There's also a theory formalising how deltas works called "Patch Theory", which is used by the Darcs and Pijul version control system.
The states in the history data structure have their own level of granularity as well:

Temporal Granularity - Each state point may be recorded in time intervals, where that time could be real time or logical time. For example playing a video game that has auto-saving every 5 minutes, versus auto-saving every second, versus auto-saving on special events.
Spatial Granularity - Each state point may be recording only partial state of the overall application session. For example, when you using a text editor, you may be saving the contents of file, but you may not be saving the scroll position or cursor position of your viewport into the file. Unless you're using Vim, in which case these things can be saved as well.

To make our terminology more clearer, we shall use "states" or "state points" to resemble a point in history, which can be encoded as a "state snapshot" or a "state delta". State deltas can be reversed, that is applied in reverse. Whereas "state snapshots" can only be restored atomically or via manual merging. The term "undo" means an undoing a previous action where undoing could mean a restoration or a reversion, while "redo" means to undo the undo. It is possible for there to be "linear undos/redos" and "non-linear undos/redos", which means, being able to restore or revert from some state in the history that is not immediately preceding the current state. To "restore" a state, means to replace the current state with some state in history (note that with a delta encoded scheme, recovering the historical state requires the transitive application of state snapshots from origin to restoration point). "Reverting" a state only makes sense with state deltas. It means to apply the reverse of the state delta. Doing restorations or reversions can mean forward movement in the arrow of time and correspondingly our appending mutations on our history tracking data structure. But they could also mean a backward movement involving history erasure, that is our history data structure discards states in order to perform our undo action. With regards to reversions, you can only have backward movement history erasure with linear reversions. This is because if you allow non-linear reversions with history erasure, then what do you do with all the intermediate states that are meant to be erased, but were not involved in the undo action? It's impossible to erase those states. This means non-linear reversions cannot have history erasure.
With the snapshot method, a non-linear restoration is either an all-or-nothing operation or requires custom manual merging as directed by the the user. The state delta method makes it possible to apply a reversed delta from some point in history to the current state without affect intermediate states, if the granularity of a state point is small enough for there to be independent state deltas (for more information on this, see: https://gist.github.com/CMCDragonkai/b11dcd8ec0bf07459d197fd671738a5e).
Here is a list of binary oppositions that will be relevant:

Undo vs Redo
State Snapshot vs State Delta
Linear vs Non-Linear
Reversion vs Restoration
Destructive vs Non-Destructive - History actions may or may not destroy existing history.
Mutational vs Non-Mutational - History actions may or may not mutate destructively/non-destructively existing history.
Backward Movement vs Forward movement - Moving the current state to a point back in time or forward in time.
History Erasure vs History Appending - Equivalent to destructive vs non-destructive.

The 1 stack history only allows linear undo. There's no way to perform redo, because every time a state is undone, that state is popped off the stack, and is promptly discarded. Therefore performing undo is permanent mutation.
As actions are performed from A -> B -> C -> D, they are pushed onto the stack.

D C B --+
        |
        v
      |   |
      |   |
      |   |
      | A |
      +---+

  History Stack

After action D, we want to perform a linear undo.

        +--> D
        |
        |
      |   |
      | C |
      | B |
      | A |
      +---+

  History Stack

The state of our system is now at C. D is discarded and cannot be recovered.
We can continue to undo by undoing C, or we can move forward and mutate to state E.
Any undone states are discarded.

The 2 stack history allows linear undo and redo because the second stack holds all of the states we have undone.
As actions are performed from A -> B -> C -> D, they are pushed onto the undo stack.

D C B --+                 
        |                 
        v                 
      |   |          |   |
      |   |          |   |
      |   |          |   |
      | A |          |   |
      +---+          +---+

   Undo Stack     Redo Stack

After action D, we want to perform a linear undo of D then C.

        +--------------+
        |              |
        |              v
      |   |          |   |
      |   |          |   |
      | B |          | C |
      | A |          | D |
      +---+          +---+

   Undo Stack     Redo Stack

We find out that we want to redo C, so we undo the undo.

        +--------------+
        |              |
        v              |
      |   |          |   |
      | C |          |   |
      | B |          |   |
      | A |          | D |
      +---+          +---+

   Undo Stack     Redo Stack

Now we want to move forward and perform E. This will discard the Redo Stack contents.

    E --+
        |
        v
      |   |          |   |
      | C |          |   |
      | B |          |   |
      | A |          |   |
      +---+          +---+
   Undo Stack     Redo Stack

The append-only history log only allows forward movement in history. Which means undos and redos always append new states, states are never discarded. (Practically you may do log rotation which may mean log archival and eventually log deletion.) This of course allows non-linear undo and non-linear redo, and also non-destructive time travel, you'll never get to the point of no return.
As actions are performed from A -> B -> C -> D, they are appended to the history log.

             D C ----+
                     |
+------------------  |
| A -> B           <-+
+------------------

   History Log

After action D, we want to perform a non-linear undo of B. This could be a restoration or a reversion.
From the perspective the data structure, it doesn't matter. It's just another appended state.
We shall notate the undoing of action B as action B'.

              B' ----+
                     |
+------------------  |
| A -> B -> C -> D <-+
+------------------

   History Log

If we want perform a redo of B. This means an undo of B', which means appending a new action of B''.

              B'' ---------+
                           |
+------------------------  |
| A -> B -> C -> D -> B' <-+
+------------------------

   History Log

The history tree allows for linear and non-linear undo and redo. It is the most flexible of all solutions, but also the most complex. It combines the capabilities of the 2 stack and the append-only log. The 2 stack method can be encoded by having a "HEAD" pointer that points to the current state point. Undoing things simply means moving the HEAD pointer to point back to some previous state point, while redoing things means moving the HEAD pointer forward. Thus instead of having stacks to keep track of the states, the tree data structure holds onto the state, and an external pointer navigates the tree. The append-only log can be encoded because every state is related to a prior state through a predecessor pointer or a successive state through a successor pointer, it is therefore possible to perform "forward moving appended" undos and redos.
What we see here is 2 forms of linear and non-linear undo/redo. The first is like 2 stack method. While the second is like append only log. Using the stack form, undoing and redoing is non-mutational to the existing history, however applying new changes after undoing/redoing is mutational and destructive, in that undone history states can be lost. Using the append form, undoing and redoing is mutational to the existing history, however applying new changes after undoing/redoing is mutational and non-destructive, in that undone history states are never lost. It is up to the programmer and/or the user to decide which is more appropriate in their situation. It turns that the append form is more suitable for distributed state machine replication, because destructive history changes is impossible to replicate.
However the tree has also another unique and important feature. It has the ability to diverge its history and optionally converge them. This means you can move the HEAD pointer backwards to some state point, apply new changes from that point onwards, which will then be recorded in its own branch that doesn't affect your existing history. This allows the 2 stack form of undo/redo to be non-destructive. This is very powerful, as it allows experimental state changes that is not destructive to existing state changes. The fact that each branch is independent allows it to be used in distributed version control systems, where having a distributed fork is just another branch of the history tree.
As actions are performed from A -> B -> C -> D, they are appended to the history tree branch.
This branch is called "master".

           D C ----+
                   |
                   |
master:  A -> B  <-+
              ^
              |
              HEAD

  History Tree

After action D, we want to perform a linear undo of D. 
All it does is just move the HEAD pointer back to C. 
D is still available.

master:  A -> B -> C -> D
                   ^
                   |
                   HEAD

  History Tree

At this point we can apply a new change E. 
This creates a new branch called "feature" from C.
HEAD pointer now points to E.

master:  A -> B -> C -> D
                   |
          feature: +--> E
                        ^
                        |
                        HEAD

  History Tree

We may decide that our experiment is not successful, so we want to go back to D.

                        HEAD
                        |
                        v
master:  A -> B -> C -> D
                   |
          feature: +--> E

  History Tree

At this point we can perform a non-linear undo of A in the append style.

                             HEAD
                             |
                             v
master:  A -> B -> C -> D -> A'
                   |
          feature: +--> E

  History Tree

Now we want to linear undo of A' and D, and destructively remove the states.
This destructiveness isn't inherent to the history tree data structure, it's just an option.

                   HEAD
                   |
                   v
master:  A -> B -> C
                   |
          feature: +--> E

  History Tree

The append-only log and the history tree both allow easy distributed state machine replication as long as you disallow destructive history changes to the history tree. However if you don't care about replication, then destructive history changes don't matter.
Previously I mentioned that history trees primarily allow diverging states, but can also optionally allow converging states. To allow convergence would mean that the history tree is no longer a tree, but a directed acyclic graph. Basically branches of the trees can be merged into each other.
Generally this means merging one branch into another branch, not merging 2 branches into 1 completely new branch. Here, undoing D just means going back to B which would mean the merge never happened.
A --> B +-> D
 \     /
  +-> C

To actually merge 2 branches into 1 completely new branch, we would need to define the semantics of having a completely independent branch that has its own independent origin. Basically what should the current state be if you undid E? Since it has no single origin, would be it's own origin, which should mean that E cannot be undone. This is if you consider E to be a starting point for a completely independent branch.
0 -> A -> B
|          \
|           +-> E -> F
|          /
+--> C -> D

Alternatively, and more common, you would merge the 2 branches into third branch that incorporates the history of both branches collapsed into a linear order. How the order is specified is up to the user.
Merging is a very complex topic, it is a lot of nuances and still under research. The big problem is merge conflicts, I have written another article on what Git perceives to be a merge conflict and what isn't a merge conflict.
Metastate, State and Protostate

It turns out history management can be composed. By "composed", I mean in terms of object composition, not function composition. What do I mean by this?
Well consider that managing history means managing different versions of state. However what manages the manager? The data structure tracking history, is itself is a form of state, does this state also get versioned? We could say that this state representing history is "metastate" from the perspective of the "object" state. Here "object" comes from the terminology of https://en.wikipedia.org/wiki/Object_language
For example when using Git, there's a hidden folder called .git that keeps track of internal Git data structures. This metastate is not managed by Git, and this is evidenced by the fact that there is no way to undo the metastate changing operations such as git reset --hard or git rebase -i.
To version the metastate itself, you need a higher level history management system, that considers its object state to include the metastate that we want to track. This could be achieved if you created a filesystem running ZFS and used ZFS snapshots to snapshot the entire filesystem, which could include any Git repositories and specifically their corresponding .git directories.
If we can rise up the abstraction level for history management, can we also drill down? Consider the idea of using Vim to edit a text file located in a Git repository which is stored on a ZFS filesystem. How do we consider the history management that exists in Vim in relation to the history management in Git and subsequently ZFS?
From the perspective of Git, the state that exists in Vim that hasn't yet been committed to Git should be called protostate.
Let's make this clearer. The prefixes "meta-" and "proto-" and "object-" are all relative to the situation. From the perspective of Git, the protostate is just the state that hasn't yet "settled" to be persisted. While its metastate is the state that itself cannot track because that would lead to infinite recursion of history managing history.
This is all just terminology, but I think this sets up a good way of having precise discussion about the nature of persistence and history management in computing technology.
Delta vs Diff

In many places you'll see "deltas" and "diffs" being used interchangeably to refer to the same thing. In my opinion "deltas" are the actual patch files that can be applied to a version 1 to produce version 2, while "diffs" are human friendly interpreted form of the delta patch.
Diffs are not necessarily patches. To give some examples, consider a smarter table differ that uses colour and geograph layout to highlight differences between 2 CSV tables. Or an image differ that produces a visually consumable image difference picture. Even things like pdf2text or doc2text can be used to produce a imperfect diff, but not a delta.
For examples take a look at these tools that produce a diff, but not necessarily a delta.

https://help.github.com/articles/rendering-and-diffing-images/
https://github.com/smootoo/TableDiff

Practical Applications

Here I give an overview on applications and processes that make use of history management and related topics.
Compression

Delta encoding can be used for compression algorithms. It is often combined with Huffman or Run Length encoding or Burrows-Wheeler transform.
The basic idea is that assume you have an ordered set of datums. Instead of saving each datum, we want to create a lossless version of the data but with less information and less redundancy. Then we take the delta between each datum, and produce a new ordered set of deltas. If this ordered set of deltas has longer sequences of similar values, it becomes easier to compress.
Imagine our ordered set of datums is a set of characters, where their ASCII codepoint was: { 2, 4, 6, 8, 10, 13, 11 }. Applying delta encoding to their ASCII codepoint, we acquire { 2, 2, 2, 2, 2, 3, -2 }. Applying Run Length encoding, gives us 52131-2 meaning 5 of 2, 1 of 3, 1 of -2. I wrote a demonstration program that performs this delta encoding here: https://gist.github.com/CMCDragonkai/fa5f7eb9746b6eaf416047c9838fc870
The term "delta compression" means delta encoding then compression.
Git

Git is a distributed version control system that operates on a directory of files. It is a history tree that supports the features of the history log. It also supports diverging and converging branches.
Git is interesting because it uses the snapshot method for encoding states where the snapshots are "lazy", and only after a git gc does it change to using the delta method.
At an abstract level, when you create new commit, an entire snapshot of the directory is saved. Files that haven't changed are represented as just pointers (this is what I mean by a lazy snapshot). But files that are changed are saved completely. The usage of lazy snapshots makes these snapshots seemingly a hybrid of true snapshots and true deltas.
This is all because of performance. It is easier to perform random access to the history of state if the states are encoded as snapshots because you don't need compose the deltas before you get something useful. However it's a trade off, as more space will be used to store each snapshot. This space usage isn't a problem when the files you're working with is tiny programming source files, and saving each changed file completely won't really use much space. This does however become a problem when you starting having big binary files that are being changed per commit.
Delta encoding is used in Git when you run git gc. Git will try to compute the most space-saving deltas among all unpacked commits, and then compress these deltas into a single pack file. Git applies delta compression to both binaries and text files. The main idea is that every snapshot is now converted into a delta, so the space time trade off goes the opposite direction now, your Git repository will use less space, but if you ever want to work with the versions that was recorded in a pack file, it will take much longer for basic Git operations, as Git now needs to uncompress the pack file and compose the deltas.
Even with this delta compression applied at git gc, binary files are still problematic. This isn't because of some fundamental difference between text data and binary data, because after all, text data is binary data. This is mainly due to the way programs create binary blobs. Historically, developers have not expected people to need to version binary blobs. This has resulted in a situation where binary files are just not designed to be incremental in their changes. At the same time, compression is often applied immediately to the binary file which happens to PDF, JPEG, MPEG, DOC, DOCX and more. All of these factors mean that a tiny change to a binary file from the perspective of the user (like for example drawing a line on an image), can be a massive change in the binary bits that represent the file. This basically makes creating space saving deltas at the bit level an almost impossible problem. Now it is possible for vendors to supply protocol specific delta encoding programs that work on the binary files at a higher level. Basically if you understand the file protocol intimately, you can delta it in a much more efficient manner. Consider the JPEG example with a line drawn on top, the delta could simply be co-ordinates of where a line should be drawn superimposed on the original image.
My recommendation for using binary files with Git-like version control systems is this: If the information to be recorded is in fact text, just use a plain text file format. If you need extra fancy mark up, use markdown or latex, or even just HTML (There's a really great tool called pandoc). If you have to use a binary file, try to look for an option to save it as an uncompressed binary file, Git's delta compression will take care of the size issue later.
By the way SAAS companies, please allow an option to send invoices as just plain HTML instead of PDFs!
For more information on how Git performs delta compression and git gc, see: https://gist.github.com/matthewmccullough/2695758

If you have a development history where you constantly change between several particular versions of, say, a large binary blob — say a resource file of some kind — this operation can be very cheap under Git since it can delta against versions that are not adjacent in the development history.
The delta derivations don’t have to obey causality: a commit made chronologically later can be used to derive one made earlier. It’s just a bunch of blobs in a graph, there isn’t even a strictly necessary notion of time attached to each blob at all to begin with! That data is maintained at a higher level. Repack doesn’t have to know or care about when a commit was made. (The only reason it may care is to help implement heuristics. Right now no such heuristic exists[0])
Finding/verifying an optimal (space-minimizing) delta-derivation graph feels NP-hard. I now wave my hands furiously.
https://metalinguist.wordpress.com/2007/12/06/the-woes-of-git-gc-aggressive-and-how-git-deltas-work/


There are several ways of storing files in the pack file as well. For example, consider a file with the contents 'A' and a subsequent version 'AB'. This can be stored as an A and then +B, or it can be stored as AB and -B. The latter is more performant, since the set of changes involves a deletion of an existing range, so only needs to store the start and length of the deletion occurring. As files grow over time, it also means that the largest file is often stored as-is (typically, the most recent) which gives faster access for that than for previous versions.
http://alblue.bandlem.com/2011/09/git-tip-of-week-packfiles-redux.html

Sometimes we just cannot escape having large binary files. Game development often involves large binary assets. This is when choose to forgo automatic version control using version control systems, but instead mannually manage N rolling versions of a large binary. It's kind of like the difference between automatic memory management and manual memory management.
The topic of merging is also of interest here. I wrote a gist before on what Git considers to be a merge conflict and what is a compatible merge: https://gist.github.com/CMCDragonkai/b11dcd8ec0bf07459d197fd671738a5e It turns that the answer to what Git considers a compatible merge, is dependent on the historical decisions that led Git to be biased towards managing syntactically sequential (imperative) programming language source code.
SVN

Unlike Git, SVN performs delta encoding at every commit.
Vim

Vim's history management is one of the more sophisticated ones among text editors. It represents its history as a tree. This is best understood with the undotree plugin.
Vim has a few options that enable an almost orthogonal persistence. That is, it is possible to save the metastate on the filesystem allowing one to reload the file and its associated undo history and other things like cursor position.
These are the common metastate files to be used in Vim where file is your file name:

.file.un~ - Undo history
.file.swp - Swap file
file~ - Backup file

Event Sourcing

There are databases that are designed to store events. If we imagine changes as events, then these can be considered history tracking databases.

http://martinfowler.com/eaaDev/EventSourcing.html
https://geteventstore.com/
http://kafka.apache.org/documentation.html#uses_eventsourcing

Patch Theory in Darcs and Pijul

This is fascinating:

https://byorgey.wordpress.com/2008/02/13/patch-theory-part-ii-some-basics/
http://liamoc.net/posts/2015-11-10-patch-theory.html

Practical Implementation in Haskell using Monads

We shall call this the History Monad.
WIP.
See Also


https://en.wikipedia.org/wiki/Persistent_data_structure
https://en.wikipedia.org/wiki/Retroactive_data_structures

Try playing with them in Haskell or Clojure!