Skip to content

Instantly share code, notes, and snippets.

@jcoglan
Last active February 3, 2017 18:04
Show Gist options
  • Star 3 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save jcoglan/e4c02dc7774109f9c63624caff65dd47 to your computer and use it in GitHub Desktop.
Save jcoglan/e4c02dc7774109f9c63624caff65dd47 to your computer and use it in GitHub Desktop.

The Myers diff algorithm

The following based on Myers' original paper.

The edit graph

To use the example from the paper, say you want to calculate the difference between two strings:

  • a = ABCABBA
  • b = CBABAC

By "difference", we mean a sequence of edits that will convert string a into string b. One possible such sequence is to simply delete each character in A, and then insert each character in B, or to use common diff notation:

- A
- B
- C
- A
- B
- B
- A
+ C
+ B
+ A
+ B
+ A
+ C

However, we wouldn't consider this a good-quality diff since it doesn't tell us very much. Changes to source code typically leave much of a file unmodified and we really want to see chunks of code that were inserted or deleted. A diff that shows the entire file being removed and replaced with the new version isn't much use to us.

A better diff of these two strings would be:

- A
- B
  C
+ B
  A
  B
- B
  A
+ C

This makes the smallest possible number of changes to a in order to produce b, so it's a better visualisation of what really changed. It's not the only possible solution, for example these are also valid:

1.  - A       2.  - A       3.  + C
    - B           + C           - A
      C             B             B
    - A           - C           - C
      B             A             A
    + A             B             B
      B           - B           - B
      A             A             A
    + C           + C           + C

However, they are all minimal: they make the smallest number of edits possible, which in this case is five. What's interesting about them is they differ in which sections they consider to be the same between the strings, and which order they perform edits in. From looking at diffs, you probably have an intuitive idea that diffs only show the things that changed, but these examples show that there are many possible answers to the question of what changed.

So, the purpose of diff algorithms is to provide a strategy for generating diffs, where the diffs have certain properties. We usually want diffs to be as small as possible, but there are other considerations. For example, when you change something, you're probably used to seeing deletions followed by insertions, not the other way round. That is, you'd rather see solution 2 than solution 3 above. And, when you change a whole block of code, you'd like to see the whole chunk being deleted followed by the new code being inserted, rather than many deletions and insertions interleaved with each other.

Good:   - one         Bad:    - one
        - two                 + four
        - three               - two
        + four                + five
        + five                + six
        + six                 - three

You also probably want to see deleted or insert code that aligns with your idea of the code's structure, for example if you insert a method, you'd like that method's end to be considered new, rather than the end of the preceding method:

Good      class Foo                   Bad:      class Foo
            def initialize(name)                  def initialize(name)
              @name = name                          @name = name
            end                               +   end
        +                                     +
        +   def inspect                       +   def inspect
        +     @name                           +     @name
        +   end                                   end
          end                                   end

Myers' algorithm is just one such strategy, but it's fast and it produces diffs that tend to be of good quality most of the time. It does this by being greedy, that is trying to consume as many lines that are the same before making a change (therefore avoiding the "wrong end" problem), and also by preferring deletions over insertions when given a choice, so that deletions appear first.

The Myers paper is based on the idea that finding the shortest edit script (SEC) can be modelled as a graph search. Let's take our two strings, a = ABCABBA and b = CBABAC, and build a graph of all the ways you can get from a to b.

The x,y co-ordinates in the grid shown below correspond to steps in the editing process; at 0,0 you have string a, that is, you have not started editing. Moving rightward corresponds to deleting a character from a, for example moving to 1,0 means we've deleted the first A from a. Moving downwards corresponds to inserting a character from b, for example if we now move from 1,0 down to 1,1, we insert the first C from b, and our edited string is thus CBCABBA. At position 4,3, we have converted ABCA into CBA, but we still need to convert BBA into BAC. The bottom-right position 7,6 corresponds to converting string a fully into string b.

As well as moving rightwards and downwards, in some positions we can also move diagonally. This occurs when the two strings have the same character at the position's indexes, for example the third character in a and the first character in b are both C, and so position 2,0 has a diagonal leading to 3,1. This corresponds to consuming an equal character from both strings, neither deleting nor inserting anything.

       A     B     C     A     B     B     A

    o-----o-----o-----o-----o-----o-----o-----o   0
    |     |     | \   |     |     |     |     |
C   |     |     |  \  |     |     |     |     |
    |     |     |   \ |     |     |     |     |
    o-----o-----o-----o-----o-----o-----o-----o   1
    |     | \   |     |     | \   | \   |     |
B   |     |  \  |     |     |  \  |  \  |     |
    |     |   \ |     |     |   \ |   \ |     |
    o-----o-----o-----o-----o-----o-----o-----o   2
    | \   |     |     | \   |     |     | \   |
A   |  \  |     |     |  \  |     |     |  \  |
    |   \ |     |     |   \ |     |     |   \ |
    o-----o-----o-----o-----o-----o-----o-----o   3
    |     | \   |     |     | \   | \   |     |
B   |     |  \  |     |     |  \  |  \  |     |
    |     |   \ |     |     |   \ |   \ |     |
    o-----o-----o-----o-----o-----o-----o-----o   4
    | \   |     |     | \   |     |     | \   |
A   |  \  |     |     |  \  |     |     |  \  |
    |   \ |     |     |   \ |     |     |   \ |
    o-----o-----o-----o-----o-----o-----o-----o   5
    |     |     | \   |     |     |     |     |
C   |     |     |  \  |     |     |     |     |
    |     |     |   \ |     |     |     |     |
    o-----o-----o-----o-----o-----o-----o-----o   6

    0     1     2     3     4     5     6     7

The idea behind the Myers algorithm is quite simple: we want to get from 0,0 to 7,6 (the bottom-right) in as few moves as possible. A "move" is a single step rightwards (a deletion) or downwards (an insertion). The most number of moves we could take to get from a to b is 13: the combined length of the two strings.

However, walking diagonal paths is free since they don't correspond to making changes, thus we want to maximise the number of diagonal steps we take and minimise the number of rightward/downward moves. The examples above show that we can actually get from a to b making only five edits, and Myers provides a strategy for finding that pathway.

Exploring the graph

+-----+
| 0,0 |
+-----+
+-----+     +-----+
| 0,0 |-----| 1,0 |
+-----+     +-----+
   |
   |
+-----+
| 0,1 |
+-----+
+-----+     +-----+
| 0,0 |-----| 1,0 |
+-----+     +-----+
   |
   |
+-----+     +-----+
| 0,1 |-----| 2,2 |
+-----+     +-----+
   |
   |
+-----+
| 2,4 |
+-----+
+-----+     +-----+     +-----+
| 0,0 |-----| 1,0 |-----| 3,1 |
+-----+     +-----+     +-----+
   |           |
   |           |
+-----+     +-----+
| 0,1 |     | 2,2 |
+-----+     +-----+
   |
   |
+-----+
| 2,4 |
+-----+
+-----+     +-----+     +-----+
| 0,0 |-----| 1,0 |-----| 3,1 |
+-----+     +-----+     +-----+
   |           |
   |           |
+-----+     +-----+
| 0,1 |     | 2,2 |
+-----+     +-----+
   |
   |
+-----+     +-----+
| 2,4 |-----| 4,5 |
+-----+     +-----+
   |
   |
+-----+
| 3,6 |
+-----+

Down from 2,2 is 2,3; 4,5 reached right from 2,4 wins.

+-----+     +-----+     +-----+
| 0,0 |-----| 1,0 |-----| 3,1 |
+-----+     +-----+     +-----+
   |           |
   |           |
+-----+     +-----+     +-----+
| 0,1 |     | 2,2 |-----| 5,4 |
+-----+     +-----+     +-----+
   |
   |
+-----+     +-----+
| 2,4 |-----| 4,5 |
+-----+     +-----+
   |
   |
+-----+
| 3,6 |
+-----+
+-----+     +-----+     +-----+     +-----+
| 0,0 |-----| 1,0 |-----| 3,1 |-----| 5,2 |
+-----+     +-----+     +-----+     +-----+
   |           |           |
   |           |           |
+-----+     +-----+     +-----+
| 0,1 |     | 2,2 |     | 5,4 |
+-----+     +-----+     +-----+
   |
   |
+-----+     +-----+
| 2,4 |-----| 4,5 |
+-----+     +-----+
   |
   |
+-----+
| 3,6 |
+-----+

There is no node down from 3,6. Right of 3,6 is 4,6 but this is also reachable down from 4,5.

+-----+     +-----+     +-----+     +-----+
| 0,0 |-----| 1,0 |-----| 3,1 |-----| 5,2 |
+-----+     +-----+     +-----+     +-----+
   |           |           |
   |           |           |
+-----+     +-----+     +-----+
| 0,1 |     | 2,2 |     | 5,4 |
+-----+     +-----+     +-----+
   |
   |
+-----+     +-----+     +-----+
| 2,4 |-----| 4,5 |-----| 5,5 |
+-----+     +-----+     +-----+
   |           |
   |           |
+-----+     +-----+
| 3,6 |     | 4,6 |
+-----+     +-----+
+-----+     +-----+     +-----+     +-----+
| 0,0 |-----| 1,0 |-----| 3,1 |-----| 5,2 |
+-----+     +-----+     +-----+     +-----+
   |           |           |
   |           |           |
+-----+     +-----+     +-----+     +-----+
| 0,1 |     | 2,2 |     | 5,4 |-----| 7,5 |
+-----+     +-----+     +-----+     +-----+
   |                       |
   |                       |
+-----+     +-----+     +-----+
| 2,4 |-----| 4,5 |     | 5,5 |
+-----+     +-----+     +-----+
   |           |
   |           |
+-----+     +-----+
| 3,6 |     | 4,6 |
+-----+     +-----+

Down from 5,2 you also reach 7,5 but this is redundant as another path already leads there

+-----+     +-----+     +-----+     +-----+     +-----+
| 0,0 |-----| 1,0 |-----| 3,1 |-----| 5,2 |-----| 7,3 |
+-----+     +-----+     +-----+     +-----+     +-----+
   |           |           |
   |           |           |
+-----+     +-----+     +-----+     +-----+
| 0,1 |     | 2,2 |     | 5,4 |-----| 7,5 |
+-----+     +-----+     +-----+     +-----+
   |                       |
   |                       |
+-----+     +-----+     +-----+
| 2,4 |-----| 4,5 |     | 5,5 |
+-----+     +-----+     +-----+
   |           |
   |           |
+-----+     +-----+
| 3,6 |     | 4,6 |
+-----+     +-----+

There is no node down from 4,6. Right of 4,6 is 5,6, which is also down from 5,5.

+-----+     +-----+     +-----+     +-----+     +-----+
| 0,0 |-----| 1,0 |-----| 3,1 |-----| 5,2 |-----| 7,3 |
+-----+     +-----+     +-----+     +-----+     +-----+
   |           |           |
   |           |           |
+-----+     +-----+     +-----+     +-----+
| 0,1 |     | 2,2 |     | 5,4 |-----| 7,5 |
+-----+     +-----+     +-----+     +-----+
   |                       |
   |                       |
+-----+     +-----+     +-----+     +-----+
| 2,4 |-----| 4,5 |     | 5,5 |-----| 6,5 |
+-----+     +-----+     +-----+     +-----+
   |           |           |
   |           |           |
+-----+     +-----+     +-----+
| 3,6 |     | 4,6 |     | 5,6 |
+-----+     +-----+     +-----+

Down from 7,5 is 7,6, the end square, which definitely beats the 6,5 that we reached travelling right from 5,5.

+-----+     +-----+     +-----+     +-----+     +-----+
| 0,0 |-----| 1,0 |-----| 3,1 |-----| 5,2 |-----| 7,3 |
+-----+     +-----+     +-----+     +-----+     +-----+
   |           |           |
   |           |           |
+-----+     +-----+     +-----+     +-----+
| 0,1 |     | 2,2 |     | 5,4 |-----| 7,5 |
+-----+     +-----+     +-----+     +-----+
   |                       |           |
   |                       |           |
+-----+     +-----+     +-----+     +-----+
| 2,4 |-----| 4,5 |     | 5,5 |     | 7,6 |
+-----+     +-----+     +-----+     +-----+
   |           |           |
   |           |           |
+-----+     +-----+     +-----+
| 3,6 |     | 4,6 |     | 5,6 |
+-----+     +-----+     +-----+

Changing representation

    |      0       1       2       3       4       5
----+--------------------------------------------------
    |
    |                                   +-----+
 4  |                                   | 7,3 |
    |                                   +-----+
    |                                  /
    |                           +-----+
 3  |                           | 5,2 |
    |                           +-----+
    |                          /
    |                   +-----+         +-----+
 2  |                   | 3,1 |         | 7,5 |
    |                   +-----+         +-----+
    |                  /       \       /       \
    |           +-----+         +-----+         +-----+
 1  |           | 1,0 |         | 5,4 |         | 7,6 |
    |           +-----+         +-----+         +-----+
    |          /       \               \
    |   +-----+         +-----+         +-----+
 0  |   | 0,0 |         | 2,2 |         | 5,5 |
    |   +-----+         +-----+         +-----+
    |          \                               \
    |           +-----+         +-----+         +-----+
-1  |           | 0,1 |         | 4,5 |         | 5,6 |
    |           +-----+         +-----+         +-----+
    |                  \       /       \
    |                   +-----+         +-----+
-2  |                   | 2,4 |         | 4,6 |
    |                   +-----+         +-----+
    |                          \
    |                           +-----+
-3  |                           | 3,6 |
    |                           +-----+
module Diff
TAGS = {:eql => " ", :del => "-", :ins => "+"}
COLS = {
:del => "\e[31m",
:ins => "\e[32m",
:default => "\e[39m"
}
LINE_WIDTH = 4
def self.print(diff)
diff.each do |line|
col = $stdout.tty? ? COLS[line.type] : ""
reset = $stdout.tty? ? COLS[:default] : ""
tag = TAGS[line.type]
a_line = line.a_line.to_s.rjust(LINE_WIDTH, " ")
b_line = line.b_line.to_s.rjust(LINE_WIDTH, " ")
text = line.text.rstrip
puts "#{col}#{tag} #{a_line} #{b_line} #{text}#{reset}"
end
end
end
module Myers
Line = Struct.new(:type, :a_line, :b_line, :text)
class KArray
def initialize(max, val = [])
@max = max
@val = val
end
def [](index)
@val[@max + index]
end
def []=(index, value)
@val[@max + index] = value
end
def dup
KArray.new(@max, @val.dup)
end
end
def self.diff(a, b)
a, b = a.lines, b.lines
trace = shortest_trace(a, b)
diff = []
backtrack(trace, a.size, b.size) do |prev_x, prev_y, x, y, d|
while x > prev_x && y > prev_y
diff.unshift Line.new(:eql, a.size, b.size, [a.pop, b.pop].last)
x, y = x - 1, y - 1
end
if d > 0
if x > prev_x
diff.unshift Line.new(:del, a.size, nil, a.pop)
else
diff.unshift Line.new(:ins, nil, b.size, b.pop)
end
end
end
diff
end
def self.shortest_trace(a, b)
n, m = a.size, b.size
max = n + m
v = KArray.new(max)
v[1] = 0
trace = []
(0 .. max).step do |d|
trace << v.dup
(-d .. d).step(2) do |k|
if k == -d || (k != d && v[k - 1] < v[k + 1])
x = v[k + 1]
else
x = v[k - 1] + 1
end
y = x - k
while x < n && y < m && a[x] == b[y]
x, y = x + 1, y + 1
end
v[k] = x
return trace if x >= n && y >= m
end
end
end
def self.backtrack(trace, x, y)
trace.each.with_index.reverse_each do |v, d|
k = x - y
down = (k == -d) || (k != d && v[k - 1] < v[k + 1])
prev_k = down ? k + 1 : k - 1
prev_x = v[prev_k]
prev_y = prev_x - prev_k
yield prev_x, prev_y, x, y, d
x, y = prev_x, prev_y
end
end
end
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment