The diff output is more specific:
[I]f a whole block of text is moved, then all of it, rather than just the beginning and end, is detected as changed.
The algorithm described here avoids these difficulties. It detects differences that correspond very closely to our intuitive notion of difference.
Compare to Python's difflib
. We get an in-line diff instead of a block-level diff. Instead of this:
-I did not have sexual relations with that woman.
+I may have had sexual relations with that woman.
... you get this:
I did not have may have had sexual relations with that woman.
Linear time and space for all cases.
To compare two files, it is usually convenient to take a line as the basic unit, although other units are possible, such as word, sentence, paragraph, or character. We approach the problem by finding similarities until only differences remain. We make two observations:
-
If a line occurs only once in each file, then it must be the same line, although it may have been moved.
We use this observation to locate unaltered lines that we subsequently exclude from further treatment.
-
If a line has been found to be unaltered, and the lines immediately adjacent to it in both files are identical, then these lines must be the same line. This information can be used to find blocks of unchanged lines.
- Files:
O
, the old fileN
, the new file
- Symbol table:
table
- Arrays:
OA
NA
Each line
works as the key in the table
look-up, i.e. as table[line]
.
The symbol table stores three entries for each line:
- Counters:
OC
NC
- Line reference:
OLNO
In Python:
table = {
line: {
"OC": ocurrences_of_line_in_o
"NC": ocurrences_of_line_in_n
"OLNO": line_number_in_o
}
}
The value entry for each line in table
has two counters. They specify the line's number of occurrences in O
and N
: OC
and NC
.
OC
and NC
can assume three values: 1, 2, and many.
Aside from the two counters, the line's entry also includes a reference to the line's line number in O
: OLNO
.
OLNO
is only interesting, if OC == 1
. Alternatively, OLNO
would have to assume multiple values or none at all.
The arrays OA
and NA
have one entry for each line in their respective files, O
and N
. The arrays contain either:
- a pointer to the line's symbol table entry,
table[line]
- the line's number in the other file (
N
forOA
,O
forNA
)
... and some code to determine which. (See Third Pass part c.)
In other words, you would have for each array:
-
OA
- one entry for each line of file
O
containing either- a pointer to
table[line]
- the line's number in file
N
- a pointer to
- one entry for each line of file
-
NA
- one entry for each line of file
N
containing either- a pointer to
table[line]
- the line's number in file
O
- a pointer to
- one entry for each line of file
In Python:
OA = []
NA = []
The algorithm has six passes:
- a. Each line
i
of fileN
is read in sequence - b. An entry for each line
i
is created in the table, if it doesn't already exist - c.
NC
for the line'stable
entry is incremented - d.
NA[i]
is set to point to thetable
entry of linei
a. each line
i
of fileN
is read in sequence
i = 0
for line in N:
b. An entry for each line
i
is created in the table, if it doesn't already exist
if not line in table:
table[line] = {}
# (...)
# else:
# (...)
c.
NC
for the line'stable
entry is incremented
if not "NC" in table[line]:
table[line]["NC"] = 1
else:
if table[line]["NC"] == 1:
table[line]["NC"] = 2 # Why not just have 1 and "many"?
else:
table[line]["NC"] = "many"
d.
NA[i]
is set to point to thetable
entry of linei
NA[i] = table[line]
i += 1
All steps combined:
i = 0
for line in N:
if not line in table:
table[line] = {}
table[line]["NC"] = 1
else:
if table[line]["NC"] == 1:
table[line]["NC"] = 2
else:
table[line]["NC"] = "many"
NA[i] = table[line]
i += 1
Similar to first pass, except it acts on files
O
OA
OC
OLNO
which is set to the line's number
O
j = 0
for line in O:
OA
OA[j] = table[line]
j += 1
OC
if not "OC" in table[line]:
table[line]["OC"] = 1
else:
if table[line]["OC"] == 1:
table[line]["OC"] = 2
else:
table[line]["OC"] = "many"
OLNO
# for line in O:
table[line]["OLNO"] = j
All steps combined:
j = 0
for line in O:
if not line in table:
table[line] = {}
table[line]["OC"] = 1
else:
if not "OC" in table[line]:
table[line]["OC"] = 1
elif table[line]["OC"] == 1:
table[line]["OC"] = 2
else:
table[line]["OC"] = "many"
table[line]["OLNO"] = j # Gets overwritten with multiple occurrences.
# Check to see if this is the intended implementation.
# Maybe only relevant for "OC" == "NC" == 1
OA[j] = table[line]
j += 1
-
a. We use Observation 1:
If a line occurs only once in each file, then it must be the same line, although it may have been moved.
We use this observation to locate unaltered lines that we subsequently exclude from further treatment.
-
b. Using this, we only process the lines where
OC == NC == 1
. -
c. As the lines between
O
andN
"must be the same line, although it may have been moved", we alter thetable
pointers inOA
andNA
to the number of the line in the other file. -
d. We also locate unique virtual lines
- immediately before the first and
- immediately after the last
lines of the files (???)
b. Using this, we only process the lines where
OC == NC == 1
.
if "OC" in NA[i] and "NC" in NA[i]:
if NA[i]["OC"] == NA[i]["NC"] == 1:
c. As the lines between
O
andN
"must be the same line, although it may have been moved", we alter thetable
pointers inOA
andNA
to the number of the line in the other file.
"""cf. the paper's ex. description of Pass 3."""
olno = NA[i]["OLNO"]
NA[i] = olno
OA[olno] = i
d. "Find" unique virtual lines immediately before the first and immediately after the last lines of the files (???)
All steps combined:
i = 0
for i in NA:
if "OC" in NA[i] and "NC" in NA[i]:
if NA[i]["OC"] == NA[i]["NC"] == 1:
olno = NA[i]["OLNO"]
NA[i], OA[olno] = olno, i
i += 1
-
a. We use Observation 2:
If a line has been found to be unaltered, and the lines immediately adjacent to it in both files are identical, then these lines must be the same line. This information can be used to find blocks of unchanged lines.
-
b. Using this, we process each entry in ascending order.
-
c. If
NA[i]
points toOA[j]
, andNA[i+1]
andOA[j+1]
contain identicaltable
entry pointers
then
OA[j+1]
is set to linei+1
, andNA[i+1]
is set to linej+1
b. Using this, we process each entry in ascending order.
# ascending
for i, j in NA, OA:
c. If (...) then (...)
if NA[i] == OA[j] and NA[i+1] == OA[j+1]:
OA[j+1] = table[O[i+1]]
NA[i+1] = table[N[j+1]]
Similar to fourth pass, except:
- It processes each entry in descending order
- It uses
j-1
andi-1
instead ofj+1
andi+1
It processes each entry in descending order
# descending
for i, j in NA, OA:
It uses
j-1
andi-1
instead ofj+1
andi+1
if NA[i] == OA[j] and NA[i-1] == OA[j-1]:
OA[j-1] = table[O[i-1]]
NA[i-1] = table[N[j-1]]
At this point following our five passes, we have the necessary information contained in NA
to tell the differences between O
and N
.
This pass uses NA
and OA
to tell when a line has changed between O
and N
, and how far the change extends.
Recall our initial description of NA
in which we said that the array has either:
one entry for each line of file
N
containing either
- a pointer to
table[line]
- the line's number in file
O
Using these two cases, we know that if NA[i]
refers to an entry in table
(case 1), then line i
must be new.
We know this, because otherwise, NA[i]
would have contained the line's number in O
(case 2), if it existed in O
and N
.
We now know that we are dealing with a new line, but we have yet to figure where the change ends.
Recall Observation 2:
If a line has been found to be unaltered, and the lines immediately adjacent to it in both files are identical, then these lines must be the same line. This information can be used to find blocks of unchanged lines.
If NA[i]
points to OA[j]
, but NA[i+1]
does not point to OA[j+1]
, then line i
is the boundary for the alteration.
You can look at it this way:
i : The quick brown fox | j : The quick brown fox
i+1: jumps over the lazy dog | j+1: jumps over the loafing cat
Here, NA[i] == OA[j]
, but NA[i+1] != OA[j+1]
. This means our boundary is between the two lines.
At this point, we know all we need to know and have all the relevant logic in place to output the diff(erence) between O
and N
.
Hi,
Thanks for putting this gist up.
I was wondering what your interpretation of Pass 3 of Heckel's diff algorithm is wrt. "virtual lines".
In his paper he talks about locating "virtual lines" before the first and after the last. Maybe he means putting the begin and end lines in there?
It looks like that can be ignored, but I was just curious if I was missing something?