Skip to content

Instantly share code, notes, and snippets.

@maxrothman
Last active October 6, 2019 02:01
Show Gist options
  • Save maxrothman/b93de707748d295e1c7f8f5076a270eb to your computer and use it in GitHub Desktop.
Save maxrothman/b93de707748d295e1c7f8f5076a270eb to your computer and use it in GitHub Desktop.
Book diffing

Book diffing

This code was written when a new edition of a book series I like came out and I wanted to find out what was different between the old and new versions. The goal was to create readable diffs of English prose so that I could scan through the books and easily see what was different.

Most of the diffing tools out there are built with the assumption that their target texts are source code, or some other machine-optimized format, not prose, so I had to build some of my own tools.

The format that I found most readable had the following features:

  • Inline: side-by-side diffs are harder to scan
  • Line-oriented: word- or character-oriented diffs make it harder to see the context of a change. I'd rather read the whole old version and compare to the whole new version.
  • Individual changes colored: Line-oriented diffs make it hard to see exactly what changed, so I wanted each individual change within a line to be colored.
  • Insert/delete markers: Once I started reviewing diffs, I found it difficult to tell where insertions and deletions occurred in the other version, so I had the diff formatter insert a special "↓" character at the insertion/deletion site.

At first glance, it would seem like using the standard diff utility and something like git's diff-highlight for coloring would work fine, but when highlighting the changes within a line, diff-highlight highlights from the beginning of the first change to the end of the last change. That means that diff-highlight can't properly color multiple changes within a line.

I tried a number of different approaches, including going as far as to write my own diffing/highlighting algorithm using Python's difflib (turns out it's a bit of a garbage fire), but this is the solution I settled on.

I used two scripts: one to do a character-by-character diff of both texts and produce a human-readable inline diff, and another to break the result of the previous up by lines and produce colorized diffs. Usage looked like this:

$ ./diff.py book-title.old book-title.new | ./format.py > book-title.diff

Then the diffs can be reviewed in your pager of choice.

#!/usr/bin/env python
"""
Use google's diff-match-patch library to do a simple character-based diff of both texts, optimized for human-readability.
Change regions are notated by "magic" character patterns that are recognized by format.py
Requirements: diff-match-patch
"""
import sys
from diff_match_patch import diff_match_patch
def print_sub(seq):
print(f"[-{''.join(seq)}-]", end='')
def print_add(seq):
print(f"{{+{''.join(seq)}+}}", end='')
def main(old, new):
# Ugh, their API is terrible
dmp = diff_match_patch()
dmp.Diff_Timeout = 0 # Otherwise for large texts it won't finish
diff = dmp.diff_main(old, new)
# Ugh, it mutates a list in-place?
dmp.diff_cleanupSemantic(diff)
for typ, text in diff:
if typ == -1:
print_sub(text)
elif typ == 0:
print(text, end='')
elif typ == 1:
print_add(text)
else:
raise Exception("This should never happen!")
if __name__ == '__main__':
if len(sys.argv) < 3:
print("Usage: ./diff.py FILE1 FILE2")
else:
with open(sys.argv[1]) as old, open(sys.argv[2]) as new:
main(old.read(), new.read())
#!/usr/bin/env python
"""
Take a diff created by diff.py and colorize it and add insert/delete markers.
Requirements: click (only used for its terminal color printing feature, could easily be replaced with something else)
Conceptually, this script implements a state machine:
┏━━━━━┓
┃START┃
┗━━┯━━┛
╭───────────────────┲━━━━━━┱────────────────────╮
↓ ╭──────→┃COMMON┃←────────╮ ↓
╭────────╮ │ ╭────→┗┯━━━━┯┛←──────╮ │ ╭────────╮
│START │ │ │ ↓ ↓ │ │ │START │
│ADDITION│ │ │ ╭──────╮ ╭───────╮ │ │ │DELETION│
╰───┬────╯ │ │ │NORMAL│ │NEWLINE│ │ │ ╰───┬────╯
│ │ │ │CHAR │ ╰────┬──╯ │ │ │
↓ │ │ ╰─┬────╯ ↓ │ │ ↓
┏━━━━━━━━┓ │ │ ↓ print,─╯ │ ┏━━━━━━━━┓
┃COLLECT ┃ │ ╰──store clear │ ┃COLLECT ┃
╭────→┃ADDITION┃ │ both stored │ ┃DELETION┃←────╮
│ ┗┯━━━━━━┯┛ │ │ ┗┯━━━━━━┯┛ │
│ ↓ ↓ │ │ ↓ ↓ │
│ ╭──────╮ ╭────────╮ │ │ ╭────────╮ ╭──────╮ │
│ │NORMAL│ │END │ │ │ │END │ │NORMAL│ │
│ │CHAR │ │ADDITION│ │ │ │DELETION│ │CHAR │ │
│ ╰──┬───╯ ╰────┬───╯ │ │ ╰─┬──────╯ ╰──┬───╯ │
│ ↓ ╰─────╯ ╰───╯ ↓ │
╰──store store───╯
addition deletion
KEY
---
┏━━━━━┓
┃STATE┃
┗━━━━━┛
╭─────────╮
│CONDITION│
╰─────────╯
action
We go through the input character-by-character, storing text in two buffers: one for addition lines
and one for deletions lines. When we encounter an addition or deletion region, we store it in the
appropriate buffer, marking it as a diff region. When we encounter a common character, we store it
in both buffers. When we encounter a newline, we first print the deletion buffer with red foreground
and the diff regions with red background, then do the same with the addition buffer except with
green.
The algorithm used here is slightly different than the described above. First, rather than going
through the input character-by-character, we iterate through matches of a regex that matches some
arbitrary text followed by a state transition indicator. This pushes the tight loop into the C
implementation of the regex engine, improving performance. That change necessitates some minor
modifications to the state machine, but the overall idea is similar. Second, there's a bit of a hack
to support insertion and deletion markers. The intended effects is described in the README; to add
them we add the marker character before deletions and additions that don't directly follow
deletions. This works because what it means to be an "edit" is to be an addition directly following
a deletion. There's some slightly hairy edge cases to handle, but it's pretty well-explained in the code.
"""
from sys import stdout, stdin
from enum import Enum
import re
import click
def bail(state, boundary, text):
raise Exception(f"Unexpected boundary in {state} state: {boundary}. After {text}")
class Change:
def __init__(self, s):
self.s = s
def __repr__(self):
return f"{type(self).__name__}({self.s})"
class Addition(Change): pass
class Deletion(Change): pass
def format_sep(buf, color):
result = []
for bit in buf:
if isinstance(bit, str):
result.append(click.style(bit, fg=color))
elif isinstance(bit, Addition):
# Hack hack. Should be factored into Change
if bit.s == '↓':
result.append(click.style(bit.s, fg='white', bg='green'))
else:
result.append(click.style(bit.s, fg='green', reverse=True))
elif isinstance(bit, Deletion):
# As above
if bit.s == '↓':
result.append(click.style(bit.s, fg='white', bg='red'))
else:
result.append(click.style(bit.s, fg='red', reverse=True))
else:
raise Exception(f"Unexpected item in buffer: {bit}")
return ''.join(result)
State = Enum('State', ['common', 'insert', 'delete'])
boundary_re = re.compile(r'(.*?)(\{\+|\+\}|\[-|-]|\n)', flags=re.MULTILINE)
def main(whole_text):
state = State.common
sub_buf, add_buf= [], []
for match in boundary_re.finditer(whole_text):
text, boundary = match.groups()
if state == State.common:
if boundary == '\n':
if sub_buf or add_buf:
# Print separated output
sub_buf.append(text + boundary)
add_buf.append(text + boundary)
stdout.write(format_sep(sub_buf, 'red'))
stdout.write(format_sep(add_buf, 'green'))
sub_buf, add_buf = [], []
else:
# We have just a plain line of common
stdout.write(text + boundary)
elif boundary == '{+':
state = State.insert
if text:
sub_buf.append(text)
add_buf.append(text)
elif boundary == '[-':
state = State.delete
if text:
sub_buf.append(text)
add_buf.append(text)
else:
bail(state, boundary, text)
elif state == State.delete:
if boundary == '-]':
add_buf.append(Deletion('↓')) # Add deletion marker for easier comparison
sub_buf.append(Deletion(text))
state = State.common
elif boundary == '\n':
sub_buf.append(Deletion(text + boundary))
else:
bail(state, boundary, text)
elif state == State.insert:
if boundary == '+}':
# print('sub buff', sub_buf)
# print('add buff', add_buf)
# Remove spurious deletion marker added in previous delete branch
if sub_buf and isinstance(sub_buf[-1], Deletion) and isinstance(add_buf[-1], Deletion):
del add_buf[-1]
# Add addition marker for easier comparison, but not when the change was an edit
# No need to do this in the delete branch above because in edits, additions
# always follow deletions
if sub_buf and not isinstance(sub_buf[-1], Deletion):
sub_buf.append(Addition('↓'))
add_buf.append(Addition(text))
state = State.common
elif boundary == '\n':
add_buf.append(Addition(text + boundary))
else:
bail(state, boundary, text)
else:
raise Exception(f"Unexpected state {state}")
# Print the rest in case there isn't a trailing newline or something
# scope-leak from loop var is unfortunate but necessary
stdout.write(whole_text[match.end():])
if __name__ == '__main__':
main(stdin.read())
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment