Instantly share code, notes, and snippets.

# ndarville/diff.mdown Created Jul 23, 2012

Paul Heckel's Diff Algorithm

# Isolating Differences Between Files

The diff output is more specific:

[I]f a whole block of text is moved, then all of it, rather than just the beginning and end, is detected as changed.

The algorithm described here avoids these difficulties. It detects differences that correspond very closely to our intuitive notion of difference.

Compare to Python's `difflib`. We get an in-line diff instead of a block-level diff. Instead of this:

```-I did not have sexual relations with that woman.
+I may have had sexual relations with that woman.```

... you get this:

I did not have may have had sexual relations with that woman.

## Running Time

Linear time and space for all cases.

## The Algorithm

To compare two files, it is usually convenient to take a line as the basic unit, although other units are possible, such as word, sentence, paragraph, or character. We approach the problem by finding similarities until only differences remain. We make two observations:

### Preliminary Observations

1. If a line occurs only once in each file, then it must be the same line, although it may have been moved.

We use this observation to locate unaltered lines that we subsequently exclude from further treatment.

2. If a line has been found to be unaltered, and the lines immediately adjacent to it in both files are identical, then these lines must be the same line. This information can be used to find blocks of unchanged lines.

### Input, Output, and Data Structures

• Files:
1. `O`, the old file
2. `N`, the new file
• Symbol table:
• `table`
• Arrays:
1. `OA`
2. `NA`

#### The Symbol Table `table`

Each `line` works as the key in the `table` look-up, i.e. as `table[line]`.

The symbol table stores three entries for each line:

1. Counters:
1. `OC`
2. `NC`
2. Line reference:
• `OLNO`

In Python:

```table = {
line: {
"OC":   ocurrences_of_line_in_o
"NC":   ocurrences_of_line_in_n
"OLNO": line_number_in_o
}
}```
##### Counters `OC` and `NC`

The value entry for each line in `table` has two counters. They specify the line's number of occurrences in `O` and `N`: `OC` and `NC`.

`OC` and `NC` can assume three values: 1, 2, and many.

##### The Line Reference `OLNO`

Aside from the two counters, the line's entry also includes a reference to the line's line number in `O`: `OLNO`.

`OLNO` is only interesting, if `OC == 1`. Alternatively, `OLNO` would have to assume multiple values or none at all.

#### The Arrays `OA` and `NA`

The arrays `OA` and `NA` have one entry for each line in their respective files, `O` and `N`. The arrays contain either:

• a pointer to the line's symbol table entry, `table[line]`
• the line's number in the other file (`N` for `OA`, `O` for `NA`)

... and some code to determine which. (See Third Pass part c.)

In other words, you would have for each array:

• `OA`

• one entry for each line of file `O` containing either
• a pointer to `table[line]`
• the line's number in file `N`
• `NA`

• one entry for each line of file `N` containing either
• a pointer to `table[line]`
• the line's number in file `O`

In Python:

```OA = []
NA = []```

### The Passes

The algorithm has six passes:

#### First Pass

• a. Each line `i` of file `N` is read in sequence
• b. An entry for each line `i` is created in the table, if it doesn't already exist
• c. `NC` for the line's `table` entry is incremented
• d. `NA[i]` is set to point to the `table` entry of line `i`

a. each line `i` of file `N` is read in sequence

```i = 0
for line in N:```

b. An entry for each line `i` is created in the table, if it doesn't already exist

```if not line in table:
table[line] = {}
#   (...)
# else:
#   (...)```

c. `NC` for the line's `table` entry is incremented

```if not "NC" in table[line]:
table[line]["NC"] = 1
else:
if table[line]["NC"] == 1:
table[line]["NC"] = 2  # Why not just have 1 and "many"?
else:
table[line]["NC"] = "many"```

d. `NA[i]` is set to point to the `table` entry of line `i`

```NA[i] = table[line]
i += 1```

All steps combined:

```i = 0

for line in N:
if not line in table:
table[line] = {}
table[line]["NC"] = 1
else:
if table[line]["NC"] == 1:
table[line]["NC"] = 2
else:
table[line]["NC"] = "many"
NA[i] = table[line]
i += 1```

#### Second Pass

Similar to first pass, except it acts on files

• `O`
• `OA`
• `OC`
• `OLNO` which is set to the line's number

`O`

```j = 0
for line in O:```

`OA`

```OA[j] = table[line]
j += 1```

`OC`

```if not "OC" in table[line]:
table[line]["OC"] = 1
else:
if table[line]["OC"] == 1:
table[line]["OC"] = 2
else:
table[line]["OC"] = "many"```

`OLNO`

```# for line in O:
table[line]["OLNO"] = j```

All steps combined:

```j = 0

for line in O:
if not line in table:
table[line] = {}
table[line]["OC"] = 1
else:
if not "OC" in table[line]:
table[line]["OC"] = 1
elif table[line]["OC"] == 1:
table[line]["OC"] = 2
else:
table[line]["OC"] = "many"
table[line]["OLNO"] = j  # Gets overwritten with multiple occurrences.
# Check to see if this is the intended implementation.
# Maybe only relevant for "OC" == "NC" == 1
OA[j] = table[line]
j += 1```

#### Third Pass

• a. We use Observation 1:

If a line occurs only once in each file, then it must be the same line, although it may have been moved.

We use this observation to locate unaltered lines that we subsequently exclude from further treatment.

• b. Using this, we only process the lines where `OC == NC == 1`.

• c. As the lines between `O` and `N` "must be the same line, although it may have been moved", we alter the `table` pointers in `OA` and `NA` to the number of the line in the other file.

• d. We also locate unique virtual lines

• immediately before the first and
• immediately after the last

lines of the files (???)

b. Using this, we only process the lines where `OC == NC == 1`.

```if "OC" in NA[i] and "NC" in NA[i]:
if NA[i]["OC"] == NA[i]["NC"] == 1:```

c. As the lines between `O` and `N` "must be the same line, although it may have been moved", we alter the `table` pointers in `OA` and `NA` to the number of the line in the other file.

```"""cf. the paper's ex. description of Pass 3."""
olno = NA[i]["OLNO"]

NA[i]    = olno
OA[olno] = i```

d. "Find" unique virtual lines immediately before the first and immediately after the last lines of the files (???)

All steps combined:

```i = 0

for i in NA:
if "OC" in NA[i] and "NC" in NA[i]:
if NA[i]["OC"] == NA[i]["NC"] == 1:
olno = NA[i]["OLNO"]
NA[i], OA[olno] = olno, i
i += 1```

#### Fourth Pass

• a. We use Observation 2:

If a line has been found to be unaltered, and the lines immediately adjacent to it in both files are identical, then these lines must be the same line. This information can be used to find blocks of unchanged lines.

• b. Using this, we process each entry in ascending order.

• c. If

• `NA[i]` points to `OA[j]`, and
• `NA[i+1]` and `OA[j+1]` contain identical `table` entry pointers

then

• `OA[j+1]` is set to line `i+1`, and
• `NA[i+1]` is set to line `j+1`

b. Using this, we process each entry in ascending order.

```# ascending
for i, j in NA, OA:```

c. If (...) then (...)

```if NA[i] == OA[j] and NA[i+1] == OA[j+1]:
OA[j+1] = table[O[i+1]]
NA[i+1] = table[N[j+1]]```

#### Fifth Pass

Similar to fourth pass, except:

• It processes each entry in descending order
• It uses `j-1` and `i-1` instead of `j+1` and `i+1`

It processes each entry in descending order

```# descending
for i, j in NA, OA:```

It uses `j-1` and `i-1` instead of `j+1` and `i+1`

```if NA[i] == OA[j] and NA[i-1] == OA[j-1]:
OA[j-1] = table[O[i-1]]
NA[i-1] = table[N[j-1]]```

#### Final Pass

At this point following our five passes, we have the necessary information contained in `NA` to tell the differences between `O` and `N`.

This pass uses `NA` and `OA` to tell when a line has changed between `O` and `N`, and how far the change extends.

##### Determining a New Line

Recall our initial description of `NA` in which we said that the array has either:

one entry for each line of file `N` containing either

• a pointer to `table[line]`
• the line's number in file `O`

Using these two cases, we know that if `NA[i]` refers to an entry in `table` (case 1), then line `i` must be new.

We know this, because otherwise, `NA[i]` would have contained the line's number in `O` (case 2), if it existed in `O` and `N`.

##### Determining the Boundaries of the New Line

We now know that we are dealing with a new line, but we have yet to figure where the change ends.

Recall Observation 2:

If a line has been found to be unaltered, and the lines immediately adjacent to it in both files are identical, then these lines must be the same line. This information can be used to find blocks of unchanged lines.

If `NA[i]` points to `OA[j]`, but `NA[i+1]` does not point to `OA[j+1]`, then line `i` is the boundary for the alteration.

You can look at it this way:

``````i  : The quick brown fox      | j  : The quick brown fox
i+1: jumps over the lazy dog  | j+1: jumps over the loafing cat
``````

Here, `NA[i] == OA[j]`, but `NA[i+1] != OA[j+1]`. This means our boundary is between the two lines.

##### Output Diff

At this point, we know all we need to know and have all the relevant logic in place to output the diff(erence) between `O` and `N`.

### drudru commented Oct 29, 2018

 Hi, Thanks for putting this gist up. I was wondering what your interpretation of Pass 3 of Heckel's diff algorithm is wrt. "virtual lines". In his paper he talks about locating "virtual lines" before the first and after the last. Maybe he means putting the begin and end lines in there? It looks like that can be ignored, but I was just curious if I was missing something?

### larspensjo commented Dec 13, 2018

 If the first line of the two files are identical but not unique, they will not be matched in pass 4. Inserting a virtual BEGIN and END line in the files will be matched in the third pass as identical. That means the first line of the file will now be matched in pass 4.