ndarville/diff.mdown

## diff.mdown

      
    Raw
  

              diff.mdown
            
          
    Isolating Differences Between Files

Advantage over Other Algorithms

The diff output is more specific:

[I]f a whole block of text is moved, then all of it, rather than just the beginning and end, is detected as changed.


The algorithm described here avoids these difficulties. It detects differences that correspond very closely to our intuitive notion of difference.

Compare to Python's difflib. We get an in-line diff instead of a block-level diff. Instead of this:
-I did not have sexual relations with that woman.
+I may have had sexual relations with that woman.
... you get this:
I did not have may have had sexual relations with that woman.
Running Time

Linear time and space for all cases.
The Algorithm


To compare two files, it is usually convenient to take a line as the basic unit, although other units are possible, such as word, sentence, paragraph, or character. We approach the problem by finding similarities until only differences remain. We make two observations:

Preliminary Observations


If a line occurs only once in each file, then it must be the same line, although it may have been moved.
We use this observation to locate unaltered lines that we subsequently exclude from further treatment.


If a line has been found to be unaltered, and the lines immediately adjacent to it in both files are identical, then these lines must be the same line. This information can be used to find blocks of unchanged lines.


Input, Output, and Data Structures


Files:

O, the old file
N, the new file


Symbol table:

table


Arrays:

OA
NA


The Symbol Table table

Each line works as the key in the table look-up, i.e. as table[line].
The symbol table stores three entries for each line:

Counters:

OC
NC


Line reference:

OLNO


In Python:
table = {
    line: {
        "OC":   ocurrences_of_line_in_o
        "NC":   ocurrences_of_line_in_n
        "OLNO": line_number_in_o
    }
}
Counters OC and NC

The value entry for each line in table has two counters. They specify the line's number of occurrences in O and N: OC and NC.
OC and NC can assume three values: 1, 2, and many.
The Line Reference OLNO

Aside from the two counters, the line's entry also includes a reference to the line's line number in O: OLNO.
OLNO is only interesting, if OC == 1. Alternatively, OLNO would have to assume multiple values or none at all.
The Arrays OA and NA

The arrays OA and NA have one entry for each line in their respective files, O and N. The arrays contain either:

a pointer to the line's symbol table entry, table[line]
the line's number in the other file (N for OA, O for NA)

... and some code to determine which. (See Third Pass part c.)
In other words, you would have for each array:


OA

one entry for each line of file O containing either

a pointer to table[line]
the line's number in file N


NA

one entry for each line of file N containing either

a pointer to table[line]
the line's number in file O


In Python:
OA = []
NA = []
The Passes

The algorithm has six passes:
First Pass


a. Each line i of file N is read in sequence
b. An entry for each line i is created in the table, if it doesn't already exist
c. NC for the line's table entry is incremented
d. NA[i] is set to point to the table entry of line i


a. each line i of file N is read in sequence

i = 0
for line in N:

b. An entry for each line i is created in the table, if it doesn't already exist

if not line in table:
    table[line] = {}
#   (...)
# else: 
#   (...)

c. NC for the line's table entry is incremented

if not "NC" in table[line]:
    table[line]["NC"] = 1
else:
    if table[line]["NC"] == 1:
        table[line]["NC"] = 2  # Why not just have 1 and "many"?
    else:
        table[line]["NC"] = "many"

d. NA[i] is set to point to the table entry of line i

NA[i] = table[line]
i += 1
All steps combined:
i = 0

for line in N:
    if not line in table:
        table[line] = {}
        table[line]["NC"] = 1
    else:
        if table[line]["NC"] == 1:
            table[line]["NC"] = 2
        else:
            table[line]["NC"] = "many"
    NA[i] = table[line]
    i += 1
Second Pass

Similar to first pass, except it acts on files

O
OA
OC
OLNO which is set to the line's number


O

j = 0
for line in O:

OA

OA[j] = table[line]
j += 1

OC

if not "OC" in table[line]:
    table[line]["OC"] = 1
else:
    if table[line]["OC"] == 1:
        table[line]["OC"] = 2
    else:
        table[line]["OC"] = "many"

OLNO

# for line in O:
table[line]["OLNO"] = j
All steps combined:
j = 0

for line in O:
    if not line in table:
        table[line] = {}
        table[line]["OC"] = 1
    else:
        if not "OC" in table[line]:
            table[line]["OC"] = 1
        elif table[line]["OC"] == 1:
            table[line]["OC"] = 2
        else:
            table[line]["OC"] = "many"
    table[line]["OLNO"] = j  # Gets overwritten with multiple occurrences.
    # Check to see if this is the intended implementation.
    # Maybe only relevant for "OC" == "NC" == 1
    OA[j] = table[line]
    j += 1
Third Pass


a. We use Observation 1:

If a line occurs only once in each file, then it must be the same line, although it may have been moved.
We use this observation to locate unaltered lines that we subsequently exclude from further treatment.


b. Using this, we only process the lines where OC == NC == 1.


c. As the lines between O and N "must be the same line, although it may have been moved", we alter the table pointers in OA and NA to the number of the line in the other file.


d. We also locate unique virtual lines


immediately before the first and
immediately after the last

lines of the files (???)


b. Using this, we only process the lines where OC == NC == 1.

if "OC" in NA[i] and "NC" in NA[i]:
    if NA[i]["OC"] == NA[i]["NC"] == 1:

c. As the lines between O and N "must be the same line, although it may have been moved", we alter the table pointers in OA and NA to the number of the line in the other file.

"""cf. the paper's ex. description of Pass 3."""
olno = NA[i]["OLNO"]

NA[i]    = olno
OA[olno] = i

d. "Find" unique virtual lines immediately before the first and immediately after the last lines of the files (???)

All steps combined:
i = 0

for i in NA:
    if "OC" in NA[i] and "NC" in NA[i]:
        if NA[i]["OC"] == NA[i]["NC"] == 1:
            olno = NA[i]["OLNO"]
            NA[i], OA[olno] = olno, i
    i += 1
Fourth Pass


a. We use Observation 2:

If a line has been found to be unaltered, and the lines immediately adjacent to it in both files are identical, then these lines must be the same line. This information can be used to find blocks of unchanged lines.


b. Using this, we process each entry in ascending order.


c. If

NA[i] points to OA[j], and
NA[i+1] and OA[j+1] contain identical table entry pointers

then

OA[j+1] is set to line i+1, and
NA[i+1] is set to line j+1


b. Using this, we process each entry in ascending order.

# ascending
for i, j in NA, OA:

c. If (...) then (...)

if NA[i] == OA[j] and NA[i+1] == OA[j+1]:
    OA[j+1] = table[O[i+1]]
    NA[i+1] = table[N[j+1]]
Fifth Pass

Similar to fourth pass, except:

It processes each entry in descending order
It uses j-1 and i-1 instead of j+1 and i+1


It processes each entry in descending order

# descending
for i, j in NA, OA:

It uses j-1 and i-1 instead of j+1 and i+1

if NA[i] == OA[j] and NA[i-1] == OA[j-1]:
    OA[j-1] = table[O[i-1]]
    NA[i-1] = table[N[j-1]]
Final Pass

At this point following our five passes, we have the necessary information contained in NA to tell the differences between O and N.
This pass uses NA and OA to tell when a line has changed between O and N, and how far the change extends.
Determining a New Line

Recall our initial description of NA in which we said that the array has either:

one entry for each line of file N containing either

a pointer to table[line]
the line's number in file O


Using these two cases, we know that if NA[i] refers to an entry in table (case 1), then line i must be new.
We know this, because otherwise, NA[i] would have contained the line's number in O (case 2), if it existed in O and N.
Determining the Boundaries of the New Line

We now know that we are dealing with a new line, but we have yet to figure where the change ends.
Recall Observation 2:

If a line has been found to be unaltered, and the lines immediately adjacent to it in both files are identical, then these lines must be the same line. This information can be used to find blocks of unchanged lines.

If NA[i] points to OA[j], but NA[i+1] does not point to OA[j+1], then line i is the boundary for the alteration.
You can look at it this way:
i  : The quick brown fox      | j  : The quick brown fox
i+1: jumps over the lazy dog  | j+1: jumps over the loafing cat

Here, NA[i] == OA[j], but NA[i+1] != OA[j+1]. This means our boundary is between the two lines.
Output Diff

At this point, we know all we need to know and have all the relevant logic in place to output the diff(erence) between O and N.