@Floppy
Last active December 19, 2015 13:58
Misadventures in word-diff

I'm trying to write a CSV-compatible diff for git, using word-diff, so that I can see things like added columns easily.

In .gitattributes I've added:

*.csv   diff=csv

And in .git/config I've added:

[diff "csv"]
  wordRegex = "."

This first (very stupid) version works fine, giving me output like this:

diff --git a/data.csv b/data.csv
index 3470d93..c795b80 100644
--- a/data.csv
+++ b/data.csv
@@ -1,19 +1,20 @@
planetary_body,a{+phelion,a+}cceleration
"Earth",{+152098232,+}9.80665
"Moon",{+,+}1.625
"Sun",{+0,+}274.1
"{+Mercury",69816900,3.7+}
{+"+}Venus",{+108939000,+}8.872
"Mars",{+249209300,+}3.78
"Jupiter",{+816520800,+}25.93
"Io",{+,+}1.789
"Europa",{+,+}1.314
"Ganymede",{+,+}1.426
"Callisto",{+,+}1.24
"Saturn",1{+513325783,1+}1.19
"Titan",{+,+}1.3455
"Uranus",{+3004419704,+}9.01
"Titania",{+,+}0.379
"Oberon",{+,+}0.347
"Neptune",{+4553946490,+}11.28
"Triton",[- -]{+,+}0.779
"Pluto",[- -]{+7311000000,+}0.61

These changes are correct, but not very useful. Because every character is treated as an individual word, the diff splits inside fields and gets some things a bit wrong, such as the first line, where the added "aphelion," is reported as "phelion,a" inserted after the leading "a". Ideally the diff should split into actual CSV fields, which means splitting on commas. (I know that's simplistic, but this is the first stage of a better solution.)
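As a rough illustration of the difference between a per-character split and a per-field split, here is a small Python sketch. It uses Python's `re` module as a stand-in for the regex engine git uses, which is an assumption; the two agree on simple patterns like these but are not identical.

```python
import re

# Header line from the new version of data.csv.
line = "planetary_body,aphelion,acceleration"

# Every non-overlapping match of the word regex counts as one "word".
per_char = re.findall(r".", line)       # same idea as wordRegex = "."
per_field = re.findall(r"[^,]+", line)  # one word per comma-separated field

print(len(per_char))  # one word per character
print(per_field)      # one word per field
```

With one word per character, the diff is free to line up the "a" of "aphelion" with the "a" of "acceleration", which is what produces the odd-looking first line.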

My understanding is that the wordRegex setting defines what IS a word. So, as far as I can tell, if I tell it a word is any sequence that doesn't include a comma:

[diff "csv"]
  wordRegex = "[^,]*"

it should split my fields correctly. But no, this one gives me nothing in my diff. No changes shown.

diff --git a/data.csv b/data.csv
index 3470d93..c795b80 100644
--- a/data.csv
+++ b/data.csv
@@ -1,19 +1,20 @@
planetary_body,aphelion,acceleration
"Earth",152098232,9.80665
"Moon",,1.625
"Sun",0,274.1
"Mercury",69816900,3.7
"Venus",108939000,8.872
"Mars",249209300,3.78
"Jupiter",816520800,25.93
"Io",,1.789
"Europa",,1.314
"Ganymede",,1.426
"Callisto",,1.24
"Saturn",1513325783,11.19
"Titan",,1.3455
"Uranus",3004419704,9.01
"Titania",,0.379
"Oberon",,0.347
"Neptune",4553946490,11.28
"Triton",,0.779
"Pluto",7311000000,0.61
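One possible explanation, sketched in Python: `[^,]*` can match the empty string, so the search that starts right after the first field succeeds at the comma with a zero-length match. The assumption here, not confirmed by anything above, is that git stops collecting words when it hits an empty match. That would leave only `planetary_body` as a word on both sides, and so no visible change. Python's `re` stands in for git's regex engine.

```python
import re

line = "planetary_body,aphelion,acceleration\n"
rx = re.compile(r"[^,]*")

# First search: matches the whole first field.
first = rx.search(line, 0)

# Second search starts where the first ended, which is at the comma.
# "[^,]*" is allowed to match nothing, so it succeeds with an empty match.
second = rx.search(line, first.end())

print(repr(first.group()))
print(second.span())
```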

If I try saying there should be at least one character that's not a comma, I get something:

[diff "csv"]
  wordRegex = "[^,]+"

diff --git a/data.csv b/data.csv
index 3470d93..c795b80 100644
--- a/data.csv
+++ b/data.csv
@@ -1,19 +1,20 @@
planetary_body,{+aphelion+},acceleration
"Earth",152098232,9.80665
"Moon",,1.625
"Sun",0,274.1
"Mercury",69816900,3.7
"Venus",108939000,8.872
"Mars",249209300,3.78
"Jupiter",816520800,25.93
"Io",,1.789
"Europa",,1.314
"Ganymede",,1.426
"Callisto",,1.24
"Saturn",1513325783,11.19
"Titan",,1.3455
"Uranus",3004419704,9.01
"Titania",,0.379
"Oberon",,0.347
"Neptune",4553946490,11.28
"Triton",,0.779
"Pluto",7311000000,0.61

However, it seems to have given up after the first line. If I say there also shouldn't be spaces in words, I get something more sensible that handles multiple lines, but this rule isn't right; spaces should be allowed in fields:

[diff "csv"]
  wordRegex = "[^,[:space:]]+"

diff --git a/data.csv b/data.csv
index 3470d93..c795b80 100644
--- a/data.csv
+++ b/data.csv
@@ -1,19 +1,20 @@
planetary_body,{+aphelion+},acceleration
"Earth",{+152098232+},9.80665
"Moon",,1.625
"Sun",{+0+},274.1
{+"Mercury",69816900,3.7+}
"Venus",{+108939000+},8.872
"Mars",{+249209300+},3.78
"Jupiter",{+816520800+},25.93
"Io",,1.789
"Europa",,1.314
"Ganymede",,1.426
"Callisto",,1.24
"Saturn",{+1513325783+},11.19
"Titan",,1.3455
"Uranus",{+3004419704+},9.01
"Titania",,0.379
"Oberon",,0.347
"Neptune",{+4553946490+},11.28
"Triton",,0.779
"Pluto",{+7311000000+},0.61

I'm getting very confused and going round in circles a bit. If I try to do things like detect commas at the end of "words", newlines, etc., it all gets a bit unpredictable. I'm also testing these regexes with git grep and getting them working there, but they seem to behave differently when I put them into the word diff.

I get the feeling I'm doing something wrong, but I'm not sure what. Does anyone know?


Floppy commented Jul 10, 2013

moritz on the #git IRC channel suggested excluding newlines as well (which I had tried before, but managed to get confused by), like so:

[diff "csv"]
  wordRegex = "[^,\n]+"

This works a treat:

diff --git a/data.csv b/data.csv
index 3470d93..c795b80 100644
--- a/data.csv
+++ b/data.csv
@@ -1,19 +1,20 @@
planetary_body,{+aphelion+},acceleration
"Earth",{+152098232+},9.80665
"Moon",,1.625
"Sun",{+0+},274.1
{+"Mercury",69816900,3.7+}
"Venus",{+108939000+},8.872
"Mars",{+249209300+},3.78
"Jupiter",{+816520800+},25.93
"Io",,1.789
"Europa",,1.314
"Ganymede",,1.426
"Callisto",,1.24
"Saturn",{+1513325783+},11.19
"Titan",,1.3455
"Uranus",{+3004419704+},9.01
"Titania",,0.379
"Oberon",,0.347
"Neptune",{+4553946490+},11.28
"Triton",,[- 0.779-]{+0.779+}
"Pluto",[- 0.61-]{+7311000000,0.61+}
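Why the newline matters can be seen by looking at what `[^,]+` actually matches. The git documentation for `--word-diff-regex` notes that a match containing a newline is silently truncated at the newline. The sketch below uses Python's `re` as a stand-in for git's regex engine (an assumption). It shows that without `\n` in the character class, the last field of one line and the first field of the next form a single match.

```python
import re

text = 'planetary_body,aphelion,acceleration\n"Earth",152098232,9.80665\n'

# Without \n in the class, a match can run across the line break.
across = [m.group() for m in re.finditer(r"[^,]+", text)]

# With \n excluded, every match stays on one line.
per_line = [m.group() for m in re.finditer(r"[^,\n]+", text)]

print(repr(across[2]))
print(per_line)
```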

I can even include the end character to show the full change:

[diff "csv"]
  wordRegex = "[^,\n]+[,\n]"

diff --git a/data.csv b/data.csv
index 3470d93..c795b80 100644
--- a/data.csv
+++ b/data.csv
@@ -1,19 +1,20 @@
planetary_body,{+aphelion,+}acceleration
"Earth",{+152098232,+}9.80665
"Moon",,1.625
"Sun",{+0,+}274.1
{+"Mercury",69816900,3.7+}
"Venus",{+108939000,+}8.872
"Mars",{+249209300,+}3.78
"Jupiter",{+816520800,+}25.93
"Io",,1.789
"Europa",,1.314
"Ganymede",,1.426
"Callisto",,1.24
"Saturn",{+1513325783,+}11.19
"Titan",,1.3455
"Uranus",{+3004419704,+}9.01
"Titania",,0.379
"Oberon",,0.347
"Neptune",{+4553946490,+}11.28
"Triton",,[- 0.779-]{+0.779+}
"Pluto",[- 0.61-]{+7311000000,0.61+}

Now my only problem is that it doesn't detect the empty fields. Surely I can just change the first character class from one-or-more to zero-or-more... but no. Back to square one:

[diff "csv"]
  wordRegex = "[^,\n]*[,\n]"

diff --git a/data.csv b/data.csv
index 3470d93..c795b80 100644
--- a/data.csv
+++ b/data.csv
@@ -1,19 +1,20 @@
planetary_body,{+aphelion,+}acceleration
"Earth",152098232,9.80665
"Moon",,1.625
"Sun",0,274.1
"Mercury",69816900,3.7
"Venus",108939000,8.872
"Mars",249209300,3.78
"Jupiter",816520800,25.93
"Io",,1.789
"Europa",,1.314
"Ganymede",,1.426
"Callisto",,1.24
"Saturn",1513325783,11.19
"Titan",,1.3455
"Uranus",3004419704,9.01
"Titania",,0.379
"Oberon",,0.347
"Neptune",4553946490,11.28
"Triton",,0.779
"Pluto",7311000000,0.61
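A possible explanation for why `*` fails where `+` works, offered as a sketch only. The Python function below is a hypothetical model (`words` is not a git function) built on two assumptions. The first is that a match containing a newline is cut off at the newline, as the `--word-diff-regex` documentation describes. The second is that the scan stops at the first match that ends up empty, which is a guess that happens to fit the output above and may not hold for other git versions. Python's `re` stands in for git's regex engine.

```python
import re

def words(text, pattern):
    """Hypothetical model of how the word regex splits text into words."""
    rx = re.compile(pattern)
    found, pos = [], 0
    while pos < len(text):
        m = rx.search(text, pos)
        if m is None:
            break
        start, end = m.span()
        # Assumption 1: a match is truncated at the first newline it contains.
        newline = text.find("\n", start, end)
        if newline != -1:
            end = newline
        # Assumption 2: a match that ends up empty stops the scan.
        if start >= end:
            break
        found.append(text[start:end])
        pos = end
    return found

text = 'planetary_body,aphelion,acceleration\n"Earth",152098232,9.80665\n'
plus = words(text, r"[^,\n]+[,\n]")
star = words(text, r"[^,\n]*[,\n]")
print(plus)
print(star)
```

Under this model the `*` version matches a bare newline right after `acceleration`, which truncates to nothing, so only the first line is ever split into words.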

We're nearly there though, this is good progress. A lot of this stuff seemed to work quite differently when I was trying it yesterday, but taking a more structured approach to it (and explaining as I go) seems to be helping.


Floppy commented Jul 11, 2013

The zero-or-more thing is a distraction. I'm still not sure why that approach doesn't work, but we can do something slightly different instead. If we say that a "word" is either the thing we had before OR a lone comma (an empty field followed by its separator), we get exactly what we want:

[diff "csv"]
  wordRegex = "[^,\n]+[,\n]|[,]"

diff --git a/data.csv b/data.csv
index 3470d93..c795b80 100644
--- a/data.csv
+++ b/data.csv
@@ -1,19 +1,20 @@
planetary_body,{+aphelion,+}acceleration
"Earth",{+152098232,+}9.80665
"Moon",{+,+}1.625
"Sun",{+0,+}274.1
{+"Mercury",69816900,3.7+}
"Venus",{+108939000,+}8.872
"Mars",{+249209300,+}3.78
"Jupiter",{+816520800,+}25.93
"Io",{+,+}1.789
"Europa",{+,+}1.314
"Ganymede",{+,+}1.426
"Callisto",{+,+}1.24
"Saturn",{+1513325783,+}11.19
"Titan",{+,+}1.3455
"Uranus",{+3004419704,+}9.01
"Titania",{+,+}0.379
"Oberon",{+,+}0.347
"Neptune",{+4553946490,+}11.28
"Triton",[- 0.779-]{+,0.779+}
"Pluto",[- 0.61-]{+7311000000,0.61+}

This shows empty fields as well, which is exactly what we need. Next step now is to handle different separators, and quoted fields. Time to consult the CSV spec.
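To illustrate why the extra `|[,]` alternative makes empty fields visible, here is a small Python sketch. It uses Python's `re` in place of git's regex engine and `difflib` in place of git's own diff, both assumptions made purely for illustration. `inserted_words` is a hypothetical helper. It compares the "Moon" line before and after the new column was added.

```python
import difflib
import re

old_line = '"Moon",1.625\n'
new_line = '"Moon",,1.625\n'

def inserted_words(pattern):
    """Words that appear in new_line but not old_line for a given word regex."""
    old_words = re.findall(pattern, old_line)
    new_words = re.findall(pattern, new_line)
    ops = difflib.SequenceMatcher(None, old_words, new_words).get_opcodes()
    return [word
            for tag, _, _, j1, j2 in ops if tag == "insert"
            for word in new_words[j1:j2]]

# Without the alternative, the lone comma is skipped between words,
# so both lines split into the same words and nothing is reported.
without_alt = inserted_words(r"[^,\n]+[,\n]")

# With it, the lone comma becomes a word of its own and shows up as added.
with_alt = inserted_words(r"[^,\n]+[,\n]|[,]")

print(without_alt)
print(with_alt)
```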
