@brannerchinese
Last active August 29, 2015 14:02

Make a Differ-Patcher

General information about the Iron Forger project at Hacker School.

Assignment: Build a tool to compare two text files and generate a diff — a file that concisely specifies the differences between the two other files.

What a diff file looks like

A "diff" file has the following structure:

  1. A header of two lines, giving the names of the two original text files and indicating which is considered the "original" file from which the other is being derived. The original file is called the "from-file" and is marked with minus signs; the "to-file" is marked with plus-signs:

    --- from-file
    +++ to-file
    
  2. A series of hunks of differences between the files. Each hunk has two parts:

     1. A header line that begins and ends with a pair of at-signs (@@). Between them are listed, for each file, the line number at which the hunk starts and the number of consecutive lines the hunk covers. Lines are numbered from 1, not 0.

     2. A series of whole lines, each prefaced with a minus or a plus to indicate whether it is being removed from the from-file or added to the to-file. Here is an example of a hunk:

@@ -3 +3 @@
-some text
+other words

This hunk says: "The single line at line 3 of the from-file is being removed and a single new line is being added to the to-file." The line being removed is "some text" and the line being added is "other words".
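The prefixed lines of a hunk can be computed in many ways; the assignment does not prescribe one. The following is a minimal sketch of one common approach, based on the longest common subsequence (LCS) of the two line lists. The function name `line_diff` is hypothetical, and the sketch emits only the prefixed lines (space, minus, plus), not the file header or the @@ hunk headers.

```python
def line_diff(from_lines, to_lines):
    """Return lines prefixed with ' ', '-' or '+' (hypothetical helper).

    Assumption: inputs are lists of strings without trailing newlines.
    """
    n, m = len(from_lines), len(to_lines)
    # lcs[i][j] = length of the LCS of from_lines[i:] and to_lines[j:]
    lcs = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if from_lines[i] == to_lines[j]:
                lcs[i][j] = lcs[i + 1][j + 1] + 1
            else:
                lcs[i][j] = max(lcs[i + 1][j], lcs[i][j + 1])

    out = []
    i = j = 0
    while i < n and j < m:
        if from_lines[i] == to_lines[j]:
            out.append(" " + from_lines[i])   # unchanged line
            i += 1
            j += 1
        elif lcs[i + 1][j] >= lcs[i][j + 1]:
            out.append("-" + from_lines[i])   # removed from the from-file
            i += 1
        else:
            out.append("+" + to_lines[j])     # added to the to-file
            j += 1
    # Whatever remains on either side is a pure removal or addition.
    out.extend("-" + line for line in from_lines[i:])
    out.extend("+" + line for line in to_lines[j:])
    return out


if __name__ == "__main__":
    for line in line_diff(["line one", "line two", "some text"],
                          ["line one", "line two", "other words"]):
        print(line)
```

The table costs time and memory proportional to the product of the two file lengths, which is fine for small files but is the first thing a real tool would optimize.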

More detail about format

In this hunk

@@ -2,0 +3 @@
+some words

the expression -2,0 +3 means that line 2 in the from-file corresponds to line 3 in the to-file, and nothing is removed from the from-file at the position specified, but one line is added in the to-file at the position specified. The default change is one line — to be removed, if specified for the from-file; added, if specified for the to-file.

If the number of lines differs from the default of one, it is given after the line number, separated by a comma. So -2,0 above means "We are at line 2 of the from-file, but zero lines are removed there".

But a hunk like this:

@@ -24,3 +25 @@
-Extra line to be removed.
-Extra line to be removed.
-Extra line to be removed.
+The flavor of limited-release Japanese soda Pepsi Baobab was described as "liberating" by PepsiCo. (https://en.wikipedia.org/wiki/Adansonia#Food_uses, accessed 20140608)

means "At line 24 of the from-file, remove three lines so that they do not appear at line 25 of the to-file; instead, at line 25 of the to-file, add the default of one line." And then the three lines to be removed are specified, as is the one line to be added.
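The rule about default counts can be made concrete with a small parser for the @@ line. This is only a sketch: the name `parse_hunk_header` and the regular expression are illustrative assumptions, not part of any standard tool.

```python
import re

# Matches "@@ -START[,COUNT] +START[,COUNT] @@"; a missing COUNT means 1.
HUNK_HEADER = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")


def parse_hunk_header(line):
    """Return (from_start, from_count, to_start, to_count) as integers."""
    match = HUNK_HEADER.match(line)
    if match is None:
        raise ValueError("not a hunk header: %r" % line)
    from_start, from_count, to_start, to_count = match.groups()
    return (
        int(from_start),
        1 if from_count is None else int(from_count),  # default is one line
        int(to_start),
        1 if to_count is None else int(to_count),      # default is one line
    )


if __name__ == "__main__":
    print(parse_hunk_header("@@ -2,0 +3 @@"))    # (2, 0, 3, 1)
    print(parse_hunk_header("@@ -24,3 +25 @@"))  # (24, 3, 25, 1)
```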

More detail: context lines

Hunks can also include lines of context, for the benefit of human readers. Context lines always begin with a space. They are found in both the from-file and the to-file, and strictly speaking they may be redundant to analysing the diff. Here is a hunk containing two context lines — one before and one after the changed lines:

@@ -1,4 +1,3 @@
 line one
-line two
-line three
+第二、三行合併
 line four

This hunk says: "Beginning at line 1 of from-file we are displaying four affected lines, and beginning at line 1 of to-file we are displaying three affected lines." But note that not all the lines displayed actually have changes.

And note, too, that if we were not displaying context-lines, our "hunk-header" (@@-bounded line) would read differently, since it would be describing a different list of explicit lines:

@@ -2,2 +2,1 @@
-line two
-line three
+第二、三行合併

Here -2,2 +2,1 refers to line 2 in each of the files. Why line 2? Because line 1, which (as we happen to know from the preceding example) says line one, is unchanged and is no longer displayed.
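The effect of context lines on the hunk header can be checked with Python's difflib, whose `unified_diff` function takes the number of context lines as its `n` parameter. The wrapper name `hunk_headers` below is a hypothetical helper for illustration.

```python
import difflib


def hunk_headers(from_lines, to_lines, context):
    """Return only the @@ lines of a unified diff (hypothetical helper)."""
    diff = difflib.unified_diff(
        from_lines,
        to_lines,
        fromfile="from-file",
        tofile="to-file",
        n=context,      # number of context lines around each change
        lineterm="",    # inputs carry no newlines, so emit none either
    )
    return [line for line in diff if line.startswith("@@")]


if __name__ == "__main__":
    old = ["line one", "line two", "line three", "line four"]
    new = ["line one", "第二、三行合併", "line four"]
    print(hunk_headers(old, new, context=1))  # ['@@ -1,4 +1,3 @@']
    print(hunk_headers(old, new, context=0))  # ['@@ -2,2 +2 @@']
```

With no context, difflib prints +2 rather than +2,1, because it leaves out a count equal to the default of one; the two spellings mean the same thing.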

Other ideas

  1. If diff functionality is already available in your programming language of choice (look for "unified diff" or words to that effect), build another tool to "patch" a text file using a diff, that is, to generate the other of the two files that were used in producing the original diff.
     1. First try generating only the to-file, given a from-file and a diff.
     2. If that goes smoothly, next try generating the from-file, given a to-file and a diff.
     3. And if that goes smoothly, too, try writing a single patch tool that generates either the from-file or the to-file, given the other of the two and the diff.

     Note that only one of the sets of explicitly listed lines is actually needed in the patching process, depending on which direction you are going.
  2. Add diff functionality to your version control project from elsewhere in the Iron Forger challenge.
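As a starting point for the first patching step (from-file plus diff gives to-file), here is a minimal sketch. It assumes a diff of a single pair of files in the format described above, with lines held as strings without trailing newlines; the name `apply_patch` and the error handling are illustrative choices, not a reference implementation.

```python
import re

HUNK_HEADER = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")


def apply_patch(from_lines, diff_lines):
    """Rebuild the to-file from the from-file and a unified diff (sketch)."""
    result = []
    pos = 0            # index of the next from-file line not yet consumed
    in_hunk = False    # False while still reading the ---/+++ file header

    def expect(text):
        # Context and removed lines must match the from-file exactly.
        if pos >= len(from_lines) or from_lines[pos] != text:
            raise ValueError("diff does not match from-file at line %d" % (pos + 1))

    for line in diff_lines:
        header = HUNK_HEADER.match(line)
        if header:
            start = int(header.group(1))
            count = 1 if header.group(2) is None else int(header.group(2))
            # With a count of 0 the start names the line *before* the
            # insertion point; otherwise it names the first hunk line.
            begin = start if count == 0 else start - 1
            result.extend(from_lines[pos:begin])  # copy untouched lines
            pos = begin
            in_hunk = True
        elif not in_hunk:
            continue                      # skip "--- from-file" / "+++ to-file"
        elif line.startswith(" "):
            expect(line[1:])
            result.append(line[1:])       # context line: keep it
            pos += 1
        elif line.startswith("-"):
            expect(line[1:])
            pos += 1                      # removed line: skip it
        elif line.startswith("+"):
            result.append(line[1:])       # added line: take it from the diff
    result.extend(from_lines[pos:])       # copy everything after the last hunk
    return result


if __name__ == "__main__":
    diff = ["--- from-file", "+++ to-file", "@@ -2,2 +2,1 @@",
            "-line two", "-line three", "+第二、三行合併"]
    print(apply_patch(["line one", "line two", "line three", "line four"], diff))
```

Going forward, the plus lines supply the new text and the minus lines are only checked against the from-file. A reverse patcher would swap the roles of plus and minus and use the to-file numbers in each hunk header.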

Resources

  1. The GNU man page for diffutils.
  2. The git-diff man page.
  3. The Python documentation for difflib.
  4. Google diff-match-patch, available in a number of languages.
  5. The C++ diff templating library dtl-cpp; tutorial also available.

[end]
