@CMCDragonkai
Last active September 4, 2023 21:47
Version Control Diff, Patch and Merge Analysis (mostly Git)

Inspired by this: https://github.com/mndrix/merge-this

Intrafile-Intraline Change

Branch A and branch B each change a different contiguous run of characters on the same line, and the two changes do not overlap. Either change may alter the line length arbitrarily.

Use Case: merging field changes in a CSV file.

Parent:

aaa,bbb,ccc

A:

@@ -1 +1 @@
-aaa,bbb,ccc
+aa,bbb,ccc

B:

@@ -1 +1 @@
-aaa,bbb,ccc
+aaa,bbb,cc
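
Ideally the two deltas above would combine into aa,bbb,cc. Here is a minimal sketch of a field-aware three-way merge that would do so. The name merge_fields and the MergeConflict exception are made up for illustration, and the sketch assumes a naive split on the separator (no quoted fields) and an unchanged field count. It is not something Git provides.

```python
from typing import List


class MergeConflict(Exception):
    """Raised when both sides change the same field in different ways."""


def merge_fields(parent: str, ours: str, theirs: str, sep: str = ",") -> str:
    """Three-way merge of one delimited line, field by field.

    Assumption: fields never contain the separator (no CSV quoting).
    """
    base: List[str] = parent.split(sep)
    a: List[str] = ours.split(sep)
    b: List[str] = theirs.split(sep)
    if not (len(base) == len(a) == len(b)):
        raise MergeConflict("field count changed; this sketch cannot align fields")
    merged = []
    for p, x, y in zip(base, a, b):
        if x == y:        # both sides agree, or neither touched the field
            merged.append(x)
        elif x == p:      # only theirs changed this field
            merged.append(y)
        elif y == p:      # only ours changed this field
            merged.append(x)
        else:             # both sides changed the field differently
            raise MergeConflict(f"both sides changed field {p!r}")
    return sep.join(merged)


print(merge_fields("aaa,bbb,ccc", "aa,bbb,ccc", "aaa,bbb,cc"))  # aa,bbb,cc
```

The field is the unit of conflict here, instead of the line, which is why the two changes no longer collide.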

Git Demonstration:

> mkdir --parents /tmp/merge-test
> cd /tmp/merge-test
> git init
Initialized empty Git repository in /tmp/merge-test/.git/
> git commit --allow-empty --message='begin merge tests'
[master (root-commit) ad40ad0] begin merge tests
> echo 'aaa,bbb,ccc' >./intrafile-intraline-parent
> git add ./intrafile-intraline-parent
> git commit --message='added intrafile-intraline-parent'
[master 7d0bbec] added intrafile-intraline-parent
 1 file changed, 1 insertion(+)
 create mode 100644 intrafile-intraline-parent
> git branch A
> git branch B
> git checkout A
Switched to branch 'A'
> delta="$(diff --unified ./intrafile-intraline-parent <(echo 'aa,bbb,ccc'))"
> patch --unified ./intrafile-intraline-parent --input=- <<<"$delta"
patching file ./intrafile-intraline-parent
> git commit --all --message='intrafile-intraline-A-change'
[A 9cff2f6] intrafile-intraline-A-change
 1 file changed, 1 insertion(+), 1 deletion(-)
> git checkout B
Switched to branch 'B'
> delta="$(diff --unified ./intrafile-intraline-parent <(echo 'aaa,bbb,cc'))"
> patch --unified ./intrafile-intraline-parent --input=- <<<"$delta"
patching file ./intrafile-intraline-parent
> git commit --all --message='intrafile-intraline-B-change'
[B 250e1ef] intrafile-intraline-B-change
 1 file changed, 1 insertion(+), 1 deletion(-)
> git merge A
Auto-merging intrafile-intraline-parent
CONFLICT (content): Merge conflict in intrafile-intraline-parent
Automatic merge failed; fix conflicts and then commit the result.
> git --no-pager diff
diff --cc intrafile-intraline-parent
index d46c126,14c04f8..0000000
--- a/intrafile-intraline-parent
+++ b/intrafile-intraline-parent
@@@ -1,1 -1,1 +1,5 @@@
++<<<<<<< HEAD
 +aaa,bbb,cc
++=======
+ aa,bbb,ccc
++>>>>>>> A
> echo "aa,bbb,cc" > ./intrafile-intraline-parent
> git commit --all --message='merged A + B'
[B b60fe93] merged A + B
> git --no-pager log --graph --decorate --pretty=oneline --abbrev-commit
*   b60fe93 (HEAD -> B) merged A + B
|\
| * 9cff2f6 (A) intrafile-intraline-A-change
* | 250e1ef intrafile-intraline-B-change
|/
* 7d0bbec (master) added intrafile-intraline-parent
* ad40ad0 begin merge tests
> git checkout master
Switched to branch 'master'
> git branch --delete --force A
Deleted branch A (was 9cff2f6).
> git branch --delete --force B
Deleted branch B (was b60fe93).
> # intrafile-intraline is not auto-compatible according to git

Intrafile-Intraline Change (Byte Replacement)

Branch A replaces byte X with Y at position n, and branch B replaces byte J with K at position m, where X ≠ Y, J ≠ K and n ≠ m.

Use Case: Packet Editing

Parent:

ABC

A:

@@ -1 +1 @@
-ABC
+1BC

B:

@@ -1 +1 @@
-ABC
+AB2
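
Since the two replacements above touch different positions (n /= m), a byte-aware merge only has to check that the sets of changed offsets are disjoint. A minimal sketch, assuming equal-length inputs. The function names are hypothetical and nothing here is part of Git.

```python
from typing import Set


def changed_offsets(parent: bytes, child: bytes) -> Set[int]:
    """Offsets at which child differs from parent (lengths assumed equal)."""
    return {i for i, (p, c) in enumerate(zip(parent, child)) if p != c}


def merge_byte_replacements(parent: bytes, ours: bytes, theirs: bytes) -> bytes:
    """Merge two sets of in-place byte replacements if they touch different offsets."""
    if not (len(parent) == len(ours) == len(theirs)):
        raise ValueError("byte replacement must not change the length")
    n = changed_offsets(parent, ours)
    m = changed_offsets(parent, theirs)
    if n & m:
        raise ValueError(f"both sides replaced offsets {sorted(n & m)}")
    merged = bytearray(parent)
    for i in n:
        merged[i] = ours[i]
    for i in m:
        merged[i] = theirs[i]
    return bytes(merged)


print(merge_byte_replacements(b"ABC", b"1BC", b"AB2"))  # b'1B2'
```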

Git Demonstration:

> echo "ABC" > ./intrafile-intraline-byte-replacement-parent
> git add ./intrafile-intraline-byte-replacement-parent
> git commit --message='added intrafile-intraline-byte-replacement-parent'
[master 3b8cda3] added intrafile-intraline-byte-replacement-parent
 1 file changed, 1 insertion(+)
 create mode 100644 intrafile-intraline-byte-replacement-parent
> git branch B
> git checkout -b A
Switched to a new branch 'A'
> echo "1BC" > ./intrafile-intraline-byte-replacement-parent
> git commit --all --message='intrafile-intraline-byte-replacement-A-change'
[A 80d7a3a] intrafile-intraline-byte-replacement-A-change
 1 file changed, 1 insertion(+), 1 deletion(-)
> git checkout B
Switched to branch 'B'
> echo "AB2" > ./intrafile-intraline-byte-replacement-parent
> git commit --all --message='intrafile-intraline-byte-replacement-B-change'
[B 08e696c] intrafile-intraline-byte-replacement-B-change
 1 file changed, 1 insertion(+), 1 deletion(-)
> git merge A
Auto-merging intrafile-intraline-byte-replacement-parent
CONFLICT (content): Merge conflict in intrafile-intraline-byte-replacement-parent
Automatic merge failed; fix conflicts and then commit the result.
> git --no-pager diff
diff --cc intrafile-intraline-byte-replacement-parent
index bf271fa,f564147..0000000
--- a/intrafile-intraline-byte-replacement-parent
+++ b/intrafile-intraline-byte-replacement-parent
@@@ -1,1 -1,1 +1,5 @@@
++<<<<<<< HEAD
 +AB2
++=======
+ 1BC
++>>>>>>> A
> git merge --abort
> git checkout master
Switched to branch 'master'
> git branch --delete --force A
Deleted branch A (was 80d7a3a).
> git branch --delete --force B
Deleted branch B (was 08e696c).
> # intrafile-intraline (byte replacement) is not auto-compatible according to git

Intrafile-Interline Change

Branch A and branch B change 2 different lines in the same file, and the overall number of lines in the file does not change. There are 2 variants: one where the changed lines are adjacent, and one where they are not. The first variant is demonstrated through an A + B merge, the second through a C + D merge.

Use Case: merging row changes in a TSV file, merging a minor comment change and a minor variable change.

Parent:

This is 0 line.
This is 1 line.
This is 2 line.
This is 3 line.

A:

@@ -1,4 +1,4 @@
 This is 0 line.
-This is 1 line.
+This is one line.
 This is 2 line.
 This is 3 line.

B:

@@ -1,4 +1,4 @@
 This is 0 line.
 This is 1 line.
-This is 2 line.
+This is two line.
 This is 3 line.

C:

@@ -1,4 +1,4 @@
-This is 0 line.
+This is zero line.
 This is 1 line.
 This is 2 line.
 This is 3 line.

D:

@@ -1,4 +1,4 @@
 This is 0 line.
 This is 1 line.
 This is 2 line.
-This is 3 line.
+This is three line.

Git Demonstration:

> echo -e "This is 0 line.\nThis is 1 line.\nThis is 2 line.\nThis is 3 line." >./intrafile-interline-parent   
> git add ./intrafile-interline-parent
> git commit --message='added intrafile-interline-parent'
[master b157117] added intrafile-interline-parent
 1 file changed, 4 insertions(+)
 create mode 100644 intrafile-interline-parent
> git branch D
> git branch C
> git branch B
> git checkout -b A
Switched to a new branch 'A'
> echo -e "This is 0 line.\nThis is one line.\nThis is 2 line.\nThis is 3 line." >./intrafile-interline-parent 
> git commit --all --message='intrafile-interline-A-change'
[A d15045a] intrafile-interline-A-change
 1 file changed, 1 insertion(+), 1 deletion(-)
> git checkout B
Switched to branch 'B'
> echo -e "This is 0 line.\nThis is 1 line.\nThis is two line.\nThis is 3 line." >./intrafile-interline-parent 
> git commit --all --message='intrafile-interline-B-change'
[B ba0458a] intrafile-interline-B-change
 1 file changed, 1 insertion(+), 1 deletion(-)
> git checkout C
Switched to branch 'C'
> echo -e "This is zero line.\nThis is 1 line.\nThis is 2 line.\nThis is 3 line." >./intrafile-interline-parent
> git commit --all --message='intrafile-interline-C-change'
[C 9958332] intrafile-interline-C-change
 1 file changed, 1 insertion(+), 1 deletion(-)
> git checkout D
Switched to branch 'D'
> echo -e "This is 0 line.\nThis is 1 line.\nThis is 2 line.\nThis is three line." >./intrafile-interline-parent
> git commit --all --message='intrafile-interline-D-change'
[D 95a4732] intrafile-interline-D-change
 1 file changed, 1 insertion(+), 1 deletion(-)
> git checkout B
Switched to branch 'B'
> git merge A
Auto-merging intrafile-interline-parent
CONFLICT (content): Merge conflict in intrafile-interline-parent
Automatic merge failed; fix conflicts and then commit the result.
> git --no-pager diff
diff --cc intrafile-interline-parent
index ff3769c,59c42a5..0000000
--- a/intrafile-interline-parent
+++ b/intrafile-interline-parent
@@@ -1,4 -1,4 +1,9 @@@
  This is 0 line.
++<<<<<<< HEAD
 +This is 1 line.
 +This is two line.
++=======
+ This is one line.
+ This is 2 line.
++>>>>>>> A
  This is 3 line.
> git merge --abort
> # intrafile-interline (adjacent) is not auto-compatible according to git
> git checkout D
Switched to branch 'D'
> git merge --no-commit C
Auto-merging intrafile-interline-parent
Automatic merge went well; stopped before committing as requested
> git --no-pager diff --cached
diff --git a/intrafile-interline-parent b/intrafile-interline-parent
index b54c831..fa44955 100644
--- a/intrafile-interline-parent
+++ b/intrafile-interline-parent
@@ -1,4 +1,4 @@
-This is 0 line.
+This is zero line.
 This is 1 line.
 This is 2 line.
 This is three line.
> git commit --all --message='Merge branch 'C' into D'
[D b67f611] Merge branch C into D
> git checkout master
Switched to branch 'master'
> git branch --delete --force A
Deleted branch A (was d15045a).
> git branch --delete --force B
Deleted branch B (was ba0458a).
> git branch --delete --force C
Deleted branch C (was 9958332).
> git branch --delete --force D
Deleted branch D (was b67f611).
> # intrafile-interline (non-adjacent) is auto-compatible according to git

Interfile Changes

All of the major VCS allow automatic interfile merging, where each branch changes a different file.

Collection of File Changes

In the major VCS, there doesn't seem to be a way to group a set of files atomically, so that changes to different parts of the set are treated as incompatible for automatic merging.

Use Case: grouping a set of files that are syntactically or semantically tightly coupled to each other's content. A delta to any file in the group carries context assumptions about not just the file it is applied to, but all the files in the group.

Reversion Merges

How does a reversion work? And are there commits in your history that you cannot revert?

> touch ./reverting
> git add ./reverting
> git commit --message='begin reverting'
[master 7154f6f] begin reverting
 1 file changed, 0 insertions(+), 0 deletions(-)
 create mode 100644 reverting
> echo "aaa" > ./reverting
> git commit --all --message='change the reverting'
[master 8f233e4] change the reverting
 1 file changed, 1 insertion(+)
> rm ./reverting
> git add ./reverting
> git commit --message='delete the reverting'
[master 9839a1a] delete the reverting
 1 file changed, 1 deletion(-)
 delete mode 100644 reverting
> git revert 8f233e4 # try to revert the change 
error: could not revert 8f233e4... change the reverting
hint: after resolving the conflicts, mark the corrected paths
hint: with 'git add <paths>' or 'git rm <paths>'
hint: and commit the result with 'git commit'
> git status
On branch master
You are currently reverting commit 8f233e4.
  (fix conflicts and run "git revert --continue")
  (use "git revert --abort" to cancel the revert operation)

Unmerged paths:
  (use "git reset HEAD <file>..." to unstage)
  (use "git add/rm <file>..." as appropriate to mark resolution)

        deleted by us:   reverting

no changes added to commit (use "git add" and/or "git commit -a")
> ls ./reverting
reverting
> cat ./reverting
> git revert --abort
> git revert 7154f6f # try to revert the creation of an empty file
On branch master
nothing to commit, working directory clean
> git revert 9839a1a
[master 1f29c4e] Revert "delete the reverting"
 1 file changed, 1 insertion(+)
 create mode 100644 reverting
> cat ./reverting
aaa
> git reset --hard 9839a1a
HEAD is now at 9839a1a delete the reverting
> echo "aaa" > ./reverting
> git add ./reverting
> git commit --message='recreate the reverting'
[master ec8c8e8] recreate the reverting
 1 file changed, 1 insertion(+)
 create mode 100644 reverting
> rm ./reverting
> git add ./reverting
> git commit --message='delete the reverting again'
[master 7da7f25] delete the reverting again
 1 file changed, 1 deletion(-)
 delete mode 100644 reverting
> git revert 9839a1a
[master 9efa09d] Revert "delete the reverting"
 1 file changed, 1 insertion(+)
 create mode 100644 reverting
> cat ./reverting
aaa

The key to understanding this is Git's 3-way merging. To revert a commit, Git takes the inverse of the delta you're trying to revert, then essentially tries to merge that inverted delta into the current snapshot.

In our first try, we tried to revert the addition of content to ./reverting, which means subtracting that same content from ./reverting. (It's possible to do this manually without Git using patch --reverse, which applies the inverse of a delta to an input file.)

Since we have deleted the very file that the reversion delta is applying to, it's equivalent to a merge conflict. It's not able to subtract content from a file that no longer exists.

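The conflicted result is given to us to resolve, and it is just an empty ./reverting file. We can resolve the conflict and run git revert --continue, or abort via git revert --abort. I don't exactly understand why this is the conflicted result. In other merge conflicts, we usually see markers for HEAD and the branch you're trying to merge, separating the 2 alternate versions of the content. Here we just get an empty file. What's the logic here? One plausible reading is that this is a modify/delete conflict (note the "deleted by us" in git status) rather than a content conflict: our side has no version of the file to put between markers, so Git leaves only the other side's version in the working tree, and that version is the file as it was before 8f233e4, which is empty.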

In our second try, we revert the creation of the empty ./reverting file, which is just the deletion of an empty file (although I don't think the emptiness assumption is captured in the reversion). Merging the inverted delta into the snapshot where the file is already deleted is automatically compatible. However, no reversion commit is made, because Git considers such a commit redundant.

In our third try, we merge in the reversion of the deletion of ./reverting, which contained "aaa". That means merging in the creation of a file containing "aaa".

The fourth try demonstrates that neither the order nor the commit hash of the patch we're reverting matters, only the content of the patch.

Basically we see that a reversion applies an existing delta in the opposite direction. Thinking of it in terms of a merge, it is like creating a virtual temporary branch containing the inverted commit, and merging that into your current branch. A conflict only occurs if the assumptions of the reversed delta no longer hold in the state of your current branch. Therefore it is possible for there to be commits in your history that you cannot automatically revert.
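
The reasoning above can be sketched as a toy model. This is a deliberate simplification and not how Git is implemented: a delta is treated as a whole-file before/after pair (Git merges per hunk), and the names Delta and revert are made up for illustration.

```python
from typing import Dict, NamedTuple, Optional

Snapshot = Dict[str, str]  # path -> content; a missing key means "no such file"


class Delta(NamedTuple):
    path: str
    before: Optional[str]  # None means the file did not exist
    after: Optional[str]


def revert(snapshot: Snapshot, delta: Delta) -> Snapshot:
    """Apply the inverse of delta to snapshot, or fail if its assumptions are gone."""
    current = snapshot.get(delta.path)
    if current == delta.before:
        return dict(snapshot)  # already in the "before" state: nothing to commit
    if current != delta.after:
        raise RuntimeError(f"conflict: {delta.path} is not in the state the delta produced")
    result = dict(snapshot)
    if delta.before is None:
        result.pop(delta.path, None)       # undo a creation
    else:
        result[delta.path] = delta.before  # undo a change or a deletion
    return result


create = Delta("reverting", None, "")       # 'begin reverting'
change = Delta("reverting", "", "aaa\n")    # 'change the reverting'
delete = Delta("reverting", "aaa\n", None)  # 'delete the reverting'
head: Snapshot = {}                         # state after the deletion

print(revert(head, create))  # {}
print(revert(head, delete))  # {'reverting': 'aaa\n'}
```

Calling revert(head, change) raises, which mirrors the first try: the reversed delta expects a file containing "aaa", and the current state has no such file.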

Other

You get the idea now. There are other merging situations, and their results are listed here: https://github.com/mndrix/merge-this#current-results

Other than Change Content and Change File Path, those situations require syntax awareness. Based on the results, it seems that some VCS stick to their principles, while others implement partial syntax awareness for certain languages. This is evidenced by Git's ability to merge "Change Line in Block and Indent the Block" for C, but not for Python.

Discussion

What exactly is the common principle among the major software VCS (Git, Darcs, Mercurial, and Bazaar)? It is that these VCS have been designed for software developers to manage their source code, so their design is biased towards sequential programming language source code stored as plain text.

This means their content tracking is designed around line changes to files. That is, the atomic unit of content is most often considered to be a single file. Interfile changes are always auto-mergeable, while intrafile changes may or may not be, depending on heuristics. The most common heuristic is that:

  1. overlapping changes are not auto-mergeable
  2. adjacent changes (with the exception of Darcs) are not auto-mergeable
  3. non-adjacent changes are auto-mergeable
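
The heuristic above can be sketched by comparing which parent lines each branch touched. This is only an illustration: it uses difflib from the Python standard library to find the changed ranges, whereas Git uses its own diff engine, and the function names are made up.

```python
import difflib
from typing import List, Tuple


def changed_ranges(parent: List[str], child: List[str]) -> List[Tuple[int, int]]:
    """Half-open ranges of parent line indices that the child modified."""
    matcher = difflib.SequenceMatcher(a=parent, b=child, autojunk=False)
    return [(i1, i2) for tag, i1, i2, _, _ in matcher.get_opcodes() if tag != "equal"]


def classify(parent: List[str], ours: List[str], theirs: List[str]) -> str:
    """Return 'overlapping', 'adjacent' or 'non-adjacent' for two sets of changes."""
    worst = "non-adjacent"
    for a1, a2 in changed_ranges(parent, ours):
        for b1, b2 in changed_ranges(parent, theirs):
            if a1 < b2 and b1 < a2:
                return "overlapping"
            if a2 == b1 or b2 == a1:  # no unchanged line between the two changes
                worst = "adjacent"
    return worst


def edit(lines: List[str], index: int, text: str) -> List[str]:
    """Copy of lines with one line replaced."""
    return lines[:index] + [text] + lines[index + 1:]


parent = ["This is 0 line.", "This is 1 line.", "This is 2 line.", "This is 3 line."]
a = edit(parent, 1, "This is one line.")
b = edit(parent, 2, "This is two line.")
c = edit(parent, 0, "This is zero line.")
d = edit(parent, 3, "This is three line.")

print(classify(parent, a, b))  # adjacent
print(classify(parent, c, d))  # non-adjacent
```

Under the common heuristic only the "non-adjacent" result is auto-mergeable, while Darcs would also accept "adjacent".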

Darcs considers both intrafile-interline (adjacent) and intrafile-interline (non-adjacent) changes to be compatible for auto-merging. This is because its unit of content (called a "hunk") is defined to be "a contiguous block of changes to a file". Hence changes to different lines within the same file can be merged without a problem. However, there are situations where such merges can be problematic, especially if the files contain source code for a sequential programming language. Combining changes to 2 different lines or sets of non-overlapping lines doesn't always make sense syntactically. On the other hand, for things like line-oriented datasets, merging these kinds of changes should not cause any problems. Auto-merging of non-adjacent intrafile-interline changes is relied on often in the NixPkgs project, where people contribute new expressions to different parts of the top-level configuration script.

So why are interfile changes always allowed? You may conceivably need the ability to group a collection of files as an atomic unit of content, so that a delta to any file in the group carries context assumptions about both the file it is applied to and all other files in the group. I would say that such a situation is rare in programming source code. It is most likely to arise when one branch changes a file that determines how a module consumes an interface, while another branch changes the file backing the interface in a non-compatible way. If both branches are auto-merged, there may not be any syntactical problems, but there would be a semantic problem, showing up as either a compile-time error or a runtime error. The solution would be either language awareness at the module level, or allowing the user to group interfaces together with the implementations that use them. However, interface changes are rare and often have far-reaching effects, so users tend to manage them more carefully, and the VCS haven't bothered to make this a feature yet.

Lots of things could be auto-mergeable if the VCS had more language or application awareness. Consider function name and parameter changes, comment addition and removal, brace changes between 2 distinct functions, reordering functions or reordering attributes in an unordered data structure, field changes within datasets, etc. Even high level things like images, PDFs, Word docs and PSDs could be auto-mergeable if the VCS were able to parse out the application-specific semantic difference between versions.

Git actually does let you specify custom diff software (but only for visualisation purposes, not for the actual packing operation). So proprietary file formats should either make their format easily "diffable" with existing tools, or supply an executable that both computes semantic differences and patches them. It's important to have both the diff and the patch tool. Having just the diff tool gives us a fancy visualisation of the difference, but without the patch tool it's impossible to use such differences as a form of delta-encoding.

Let's talk about diffing and patching that is not oriented around lines of programming language source code. It isn't just odd proprietary binary formats that suffer in VCS. Common files like CSVs and TSVs require tabular diffing, while things like JSON and XML require tree-based diffing.
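
To make the tree-based idea concrete, here is a minimal sketch of a three-way merge over parsed JSON objects. It recurses into nested objects key by key and treats everything else (including lists) as an atomic value. The name merge_tree is hypothetical and this is not a feature of any of the VCS discussed.

```python
from typing import Any

_MISSING = object()  # marks a key that is absent on one side


def merge_tree(parent: Any, ours: Any, theirs: Any) -> Any:
    """Three-way merge of JSON-like values; dicts are merged per key."""
    if ours == theirs:
        return ours    # both sides agree
    if ours == parent:
        return theirs  # only theirs changed
    if theirs == parent:
        return ours    # only ours changed
    if all(isinstance(v, dict) for v in (parent, ours, theirs)):
        merged = {}
        for key in dict.fromkeys([*ours, *theirs, *parent]):
            value = merge_tree(
                parent.get(key, _MISSING),
                ours.get(key, _MISSING),
                theirs.get(key, _MISSING),
            )
            if value is not _MISSING:  # the key survived the merge
                merged[key] = value
        return merged
    raise ValueError("conflict: both sides changed the same value differently")


parent = {"name": "pkg", "deps": {"a": "1.0", "b": "2.0"}}
ours = {"name": "pkg", "deps": {"a": "1.1", "b": "2.0"}}
theirs = {"name": "pkg", "deps": {"a": "1.0", "b": "2.1"}, "license": "MIT"}

print(merge_tree(parent, ours, theirs))
# {'name': 'pkg', 'deps': {'a': '1.1', 'b': '2.1'}, 'license': 'MIT'}
```

A line-based merge of the same data serialised on one line, or with "a" and "b" on adjacent lines, would conflict.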

Darcs has a concept called patch theory, which is still under development, but gets around the edge case of Git inconsistency. An important part of making patches smarter and more capable is patch commutation. It's explained here: https://en.wikipedia.org/wiki/Merge_%28version_control%29#Patch_commutation I believe that in the future, more probabilistic and fuzzy methods will be used to find the most minimal patch while also understanding how best to commute patches.
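
As a rough illustration of what commuting two patches means, here is a toy model where a patch is a single line insertion. It is an assumption-laden sketch, not Darcs's actual algorithm (which handles more patch types and can refuse to commute). Insert, apply_patch and commute are made-up names.

```python
from typing import List, NamedTuple, Tuple


class Insert(NamedTuple):
    index: int  # position at which the line is inserted
    line: str


def apply_patch(lines: List[str], patch: Insert) -> List[str]:
    return lines[:patch.index] + [patch.line] + lines[patch.index:]


def commute(p: Insert, q: Insert) -> Tuple[Insert, Insert]:
    """Given p followed by q, return (q2, p2) such that q2 followed by p2
    produces the same file. Only the line indices need adjusting."""
    if q.index <= p.index:
        # q lands at or before p's line, so p's line is pushed down by one
        return Insert(q.index, q.line), Insert(p.index + 1, p.line)
    # q lands after p's line, so without p it sits one position earlier
    return Insert(q.index - 1, q.line), Insert(p.index, p.line)


base = ["a", "b", "c"]
p, q = Insert(1, "x"), Insert(3, "y")
q2, p2 = commute(p, q)

print(apply_patch(apply_patch(base, p), q))    # ['a', 'x', 'b', 'y', 'c']
print(apply_patch(apply_patch(base, q2), p2))  # ['a', 'x', 'b', 'y', 'c']
```

Being able to reorder patches like this is what lets a patch-based system pull one change out of a history without replaying everything after it.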

One of the things that could make merges and patches far more powerful for programming languages in the future is if programming languages evolve to be amenable to structure editing, or become homoiconic structural languages like Lisp.

If a language can support a structure editor, a structural diff/patch for the language isn't far off either. Such diffs can make more sense, support more complex changes, and be much more minimal. But this is just the beginning. The next step is semantic awareness, such as understanding the type system of the language. A type-aware diff/patch would be able to tell whether merging 2 things would create compilable code or just result in a type check error. It could also create semantic diffs, which don't reproduce exactly the same target, but instead produce a target that performs the same behaviour.
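
A very small taste of structural comparison, using Python's own ast module as the "structure": two versions that differ only in whitespace, comments or redundant parentheses compare as equal, while a real change does not. This only answers "did the structure change?" and is far from a full structural diff. The helper name same_structure is made up.

```python
import ast


def same_structure(source_a: str, source_b: str) -> bool:
    """True if both sources parse to the same syntax tree (layout and comments ignored)."""
    return ast.dump(ast.parse(source_a)) == ast.dump(ast.parse(source_b))


old = "def area(w, h):\n    return w * h  # width times height\n"
reformatted = "def area(w,h):\n    return (w*h)\n"
changed = "def area(w, h):\n    return w + h\n"

print(same_structure(old, reformatted))  # True
print(same_structure(old, changed))      # False
```

A line-based diff reports both pairs as changed, so a merge tool working on trees could auto-accept the first pair where a textual one would flag it.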

These ideas are important to API design. Consider the HTTP PATCH method. This very method was designed to allow partial delta based updates to existing state. If the resource under control is some sort of line-oriented text based structure, then accepting a unified diff format in your PATCH endpoint makes the most sense.
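
As a sketch of what the server side of such a PATCH endpoint might do with the request body, here is a minimal unified diff applier. It is not tied to any HTTP framework, the function name is hypothetical, and it assumes a single-file diff over newline-terminated text. It rejects the delta when the context no longer matches the stored resource, which is the same kind of conflict seen in the merges above.

```python
import difflib
import re
from typing import List

HUNK_HEADER = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")


def apply_unified_diff(original: str, diff_text: str) -> str:
    """Apply a unified diff to original, refusing it if the context does not match."""
    source: List[str] = original.splitlines(keepends=True)
    patch: List[str] = diff_text.splitlines(keepends=True)
    result: List[str] = []
    pos = 0  # next unread line of source
    i = 0
    while i < len(patch):
        header = HUNK_HEADER.match(patch[i])
        i += 1
        if header is None:
            continue  # skip the ---/+++ file headers before the first hunk
        start = int(header.group(1))
        count = 1 if header.group(2) is None else int(header.group(2))
        hunk_start = start - 1 if count else start  # a zero-length range names the line before
        if hunk_start < pos:
            raise ValueError("hunks overlap or are out of order")
        result.extend(source[pos:hunk_start])  # copy untouched lines before the hunk
        pos = hunk_start
        while i < len(patch) and not HUNK_HEADER.match(patch[i]):
            tag, text = patch[i][0], patch[i][1:]
            if tag in (" ", "-"):  # context and removed lines must match the resource
                if pos >= len(source) or source[pos] != text:
                    raise ValueError("delta does not apply to the current state")
                pos += 1
            if tag in (" ", "+"):  # context and added lines go to the output
                result.append(text)
            i += 1
    result.extend(source[pos:])
    return "".join(result)


current = "This is 0 line.\nThis is 1 line.\nThis is 2 line.\nThis is 3 line.\n"
wanted = current.replace("1 line", "one line")
request_body = "".join(difflib.unified_diff(
    current.splitlines(keepends=True), wanted.splitlines(keepends=True), "a", "b"))

print(apply_unified_diff(current, request_body) == wanted)  # True
```

If the resource changed in the meantime so that the hunk context no longer matches, the function raises, and an endpoint could turn that into an error response instead of silently applying a stale delta.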

There's some crossover with Operational Transformation.
