Skip to content

Instantly share code, notes, and snippets.

@bebebebebe

bebebebebe/diffs.md

Last active Aug 29, 2015
Embed
What would you like to do?

#notes: simplediff and jsdiff for roundtrip-test.js

background

We're using jsdiff in roundtrip-test.js now, and (because of this bug) are deciding whether to switch to another diffing library such as simplediff.

input/output differences

notes on what we'd have to change if we switched to simplediff:

output

  • jsdiff returns an array of "change objects," objects with keys 'value', 'added', and 'removed'. The value of 'value' is a string, and the values of 'added', 'removed' are either true or undefined. simplediff returns an array of pairs whose first element is "=", "+" or "-" and whose second element is a string (unchanged, added, removed, respectively).
  • it looks like handling the different output formats would involve changing: convertDifftoOffsetPairs function in lib/mediawiki.Util.js

input, input processing

  • jsdiff comes with a "tokenizer" to break the input text into lines. we'd have to add this for simplediff.

speed

  • Here's a jsperf diffing two versions of a wikipedia page in wikitext (I added the same line splitter that jsdiff uses): http://jsperf.com/diff/6 (jsdiff is 80% faster on this page).
  • For the zhwiki page in the bug report above: jsdiff runs out of memory, and simplediff (slowly) diffs the two versions
  • Some example times when run in browser with Chrome:
    • page: "San Francisco". lines: 1850. change objects (roughly 3 times as number of diffs): 133.

      diff time with simplediff (ms): 77. diff time with jsdiff (ms): 17.

    • page: "Parsing". lines: 222. change objects (roughly 3 times as number of diffs): 13.

      diff time with simplediff (ms): 6. diff time with jsdiff (ms): 2.

    • page: zhwiki page from bug report. lines: 3706. change objects (roughly 3 times as number of diffs): 2470

      diff time with simplediff (ms): 1794. diff time with jsdiff (ms): runs out of memory

are they equivalent?

No. Jsdiff and simplediff won't always give exactly the same results, though both will give reasonable results. This is because jsdiff identifies the unchanged text with the longest common subsequence of the before and after strings being diffed. Simplediff instead identifies the unchanged text as follows: by first finding the longest common substring of the two texts, and then working recursively to find the longest common substrings of the remaining pieces. They can give different results eg when

before = 'abxdxbxc', after = 'aydyabyc',

since the longest common substring 'ab' isn't a substring of the longest common subsequence 'adbc.'

Suppose:

var oldString = 'a\nb\nx\nd\nx\nb\nx\nc';
var newString = 'a\ny\nd\ny\na\nb\ny\nc';

Then for JsDiff we have:

> JSON.stringify(JsDiff.diffLines(oldString, newString))
[{"value":"a\n"},{"value":"y\n","added":true},{"value":"b\nx\n","removed":true},{"value":"d\n"},{"value":"y\na\n","added":true},{"value":"x\n","removed":true},{"value":"b\n"},{"value":"y\n","added":true},{"value":"x\n","removed":true},{"value":"c"}]
// converted to offset pairs:
[[{"start":2,"end":4},{"start":2,"end":6}],[{"start":6,"end":10},{"start":8,"end":10}],[{"start":12,"end":14},{"start":12,"end":14}]] 

but for simpleDiff we have:

> JSON.stringify(diffLines(oldString, newString))
[["+",["a\n","y\n","d\n","y\n"]],["=",["a\n","b\n"]],["-",["x\n","d\n","x\n","b\n","x\n"]],["+",["y\n"]],["=",["c"]]] 
// converted to offset pairs:
[[{"start":0,"end":8},{"start":0,"end":0}],[{"start":12,"end":14},{"start":4,"end":14}]] 

(This probably doesn't matter, and is unlikely to come up anyway since we're diffing lines which are not that likely to repeat -- but maybe worth noticing if we use jsdiff in parsoidService, but simplediff in roundtrip-test.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.