Skip to content

Instantly share code, notes, and snippets.

@jennybc
Last active November 17, 2022 13:04
Show Gist options
  • Star 25 You must be signed in to star a gist
  • Fork 1 You must be signed in to fork a gist
  • Save jennybc/0be7717c2b5b30088811 to your computer and use it in GitHub Desktop.
Save jennybc/0be7717c2b5b30088811 to your computer and use it in GitHub Desktop.
Finessing Excel's stupid line endings

Finessing Excel's stupid line endings

I am sheepish to admit a certain type of routine Microsoft Excel use.

Current example: I am marking for STAT 545. I use R to create a comma delimited marking sheet, by joining the official class list and peer reviews. The sheet contains variables, initially set to NA, where the TAs and I enter official marks and optional comments.

This is where Excel comes in. I like its visual organization of this comma delimited file much more than, say, using a plain text editor. I use the ability to hide columns, resize columns, wrap text, and (gasp!) even fill rows with grey to indicate I am done.

I keep saving the file as comma delimited and I put up with Excel's incessant freak out about "losing features". This is not a one time thing. I need to save and commit this file many times before it is considered done.

But Excel for the Mac 2011 does a really stupid thing: In a misplaced homage to the classic Mac OS, line endings are actually carriage returns -- the character \r.

This is the orphan disease of line ending problems. Everyone wants to talk to you about the much more prevalent "Windows vs. Unix" line ending problem, where Windows uses carriage return + line feed ("CRLF") or \r\n and Unix gets it right with line feed only or \n. Git is even prepared to help you navigate "Windows vs Unix" line ending problem. But that solution does absolutely zilch if you are suffering from naked carriage returns as line endings.

Luckily, Rich Fitzjohn has suffered before me and written a great blog post about this. He explains very well why this is worth solving, so your Git diffs remain informative.

Bottom line from Rich with one slight twist:

  • Add lines like this to .git/config

    [filter "lfnotcr"]
        clean = LC_CTYPE=C awk '{printf(\"%s\\n\", $0)}' | LC_CTYPE=C tr '\\r' '\\n'
        smudge = tr '\\n' '\\r'
    
  • Create a file .git/info/attributes like this

    *.csv filter=lfnotcr
    

Now your line endings will be corrected before the csv files enter the repository (that's what the clean filter does) and restored when you work with them locally (that's what the smudge filter does). Do NOT omit the smudge filter, even if you think it's unnecessary because Excel seems perfectly able to read files with line feed as line ending. You need both for diffing to work the way you want.

My only twist on Rich's solution is to use .git/info/attributes instead of .gitattributes for a completely local solution. Neither of the files mentioned above are actually tracked in the repository, so your Excel shame can remain a relatively private thing.

Other notes

Notes to self for future line ending headaches ...

English OS connotation character "vibe"
carriage return classic Mac, i.e. OS 9 and earlier \r archaic
line feed Unix, including Mac OS X \n The Very Best
carriage return + line feed, "CRLF" Windows, going back to DOS \r\n Boo! Windows!

How to check your line endings in the shell. You could use cat -v. Here's a comma delimited file, fresh after saving with Excel for the Mac 2011:

$ cat -v months.csv 
jan,1,january^Mfeb,2,february^Mmar,3,march

You can see the distinctive ^Ms where the carriage returns are. Note that Excel also does not put a line ending at the end the file.

Another option is to use file. Here's a look at the same file:

$ file months.csv 
months.csv: ASCII text, with CR line terminators

How to correct manually? You could use Homebrew to install dos2unix, i.e. brew install dos2unix. Then you will have the mac2unix command:

$ mac2unix months.csv 
mac2unix: converting file months.csv to Unix format...
$ file months.csv 
months.csv: ASCII text

You could also use the tr command to translate carriage returns to line feeds.

$ more months.csv 
jan,1,january^Mfeb,2,february^Mmar,3,march
$ file months.csv 
months.csv: ASCII text, with CR line terminators
$ tr '\r' '\n' < months.csv > months-unix.csv 
$ file months-unix.csv 
months-unix.csv: ASCII text
$ more months-unix.csv 
jan,1,january
feb,2,february
mar,3,march

While I'm on a roll, if you've somehow got Windows-style line endings and you want Unix-style, use tr to simply delete carriage returns:

tr -d '\r' < file-name > file-name-unix

This will turn \r\n into just \n. The dos2unix command would also work here.

Random: line ending conversion is one of the reasons it's so important to specify FTP file transfer in ascii mode for text files.

At first, I considered writing a pre-commit Git hook, but the Git attribute and filter approach was easier to get working. Git hooks are not particularly well documented or exampled. But Git Hooks for Fun and Profit actually gave me some hope.

@jennybc
Copy link
Author

jennybc commented Dec 19, 2015

Rich's group has moved on to a different solution:

dfalster/baad@1620ecb

@vibhay
Copy link

vibhay commented Apr 3, 2018

The BEST Solution...and very well explained !

@irmac
Copy link

irmac commented Sep 13, 2018

Thanks for writing this up! I've just spent my morning trying various incantations in sed, vi, and dos2unix weren't getting rid of ^M. mac2unix to the rescue - wouldn't have thought of using it - thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment