TLDR: File.read(a_file).sub("\xEF\xBB\xBF", "").gsub("\r", "")
I tried using combinations of find/awk/sed, but I ended up corrupting my project files, including the git index, because I don't understand those commands very well. So instead, I turned to my good old friend Ruby. This also has the advantage that I can use it in my program that processes these files. They've been ninja-edited (modified outside of source control) in the past, which makes it likely that someone will come by and ninja-edit them again in the future. But instead of convincing people that it's bad to edit files in Notepad or outside of source control, we can just silently fix it on the fly and smile and say "Happy Friday!"
Here is an example of what I'm cleaning up. Thankfully, git's diff viewer is brutally honest and shows these characters. Unfortunately, most editors (including the good ones) simply hide these characters and put them back in when you save the file. When multiple CSS files were combined into one, the byte order marks caused browsers to skip CSS declarations, and the page would look funny.
+<U+FEFF>^M
+body {^M
+ background: #FFF url(/content/skins/azteka/images/bg_tile.jpg) repeat-x; }^M
+ ^M
+html,body,input,select,textarea,tr,td,table {^M
+ font-family: arial,Helvetica,sans-serif;^M
+ font-size: 9pt;^M
+ color: #666; }^M
+^M
Here's a Ruby one-liner for a single file. For bonus points, I decided to remove trailing whitespace as well. This will overwrite the current file, but hey, we're using version control.
ruby -e 'IO.write(ARGV.first, IO.read(ARGV.first).sub("\xEF\xBB\xBF", "").gsub("\r", "").gsub(/[ \t]+\n/, "\n"))' content/Skins/Azteka/skin.css
"\xEF\xBB\xBF" is the byte order mark (BOM) encoded as UTF-8, and "\r" is the same as ^M: a carriage return. (Windows uses \r\n for line endings instead of simply \n.)
And here it is as a function, which also happens to be more readable:
def clean_up_after_notepad(file_contents)
  file_contents.sub("\xEF\xBB\xBF", "").gsub("\r", "")
end
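Since the whole point is to use this from a program, here is a rough sketch of running the function over a batch of files. The helper name clean_files and the glob pattern in the usage note are my own inventions for illustration. The function is restated with "\uFEFF", which encodes to the same three bytes as "\xEF\xBB\xBF" but is always tagged UTF-8, and the files are read and written in binary mode so that no newline translation sneaks in.

```ruby
# Same behavior as the function above; "\uFEFF" is the BOM code point,
# which Ruby encodes as the bytes EF BB BF in a UTF-8 string.
def clean_up_after_notepad(file_contents)
  file_contents.sub("\uFEFF", "").gsub("\r", "")
end

# Hypothetical helper: clean every file matching the glob pattern and
# return the paths of the files that were actually rewritten.
def clean_files(pattern)
  Dir.glob(pattern).select do |path|
    next false unless File.file?(path)

    # Read the raw bytes, then tag them as UTF-8 (assumes UTF-8 content).
    original = File.binread(path).force_encoding("UTF-8")
    cleaned = clean_up_after_notepad(original)
    next false if cleaned == original

    # Write the bytes back untouched, without newline conversion.
    File.binwrite(path, cleaned)
    true
  end
end
```

Something like clean_files("content/Skins/**/*.css") would then return only the files that needed fixing, which makes for a short and reviewable git diff.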
Now if we run git diff (and scroll down), it looks a lot better! Browsers will also parse all of the styles now.
+
+body {
+ background: #FFF url(/content/skins/azteka/images/bg_tile.jpg) repeat-x; }
+
+html,body,input,select,textarea,tr,td,table {
+ font-family: arial,Helvetica,sans-serif;
+ font-size: 9pt;
+ color: #666; }
+
If you're still skeptical, you can run this one-liner before and after cleaning the file to see that the correct bytes got removed and nothing else.
Before:
$ ruby -e 'puts File.readlines(ARGV.first).first.bytes.to_a.join(",")' content/Skins/Azteka/skin.css
239,187,191,13,10
239,187,191 is the BOM, 13 is the carriage return, and 10 is the newline.
After:
$ ruby -e 'puts File.readlines(ARGV.first).first.bytes.to_a.join(",")' content/Skins/Azteka/skin.css
10
Now we just see the newline character, as expected.
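The one-liner above only prints the bytes of the first line. If you want to check a whole file, a minimal sketch could look like the following. The helper name notepad_leftovers and the shape of its return value are assumptions of mine, not part of any library.

```ruby
# The UTF-8 BOM as raw bytes (binary-encoded string).
BOM_BYTES = "\xEF\xBB\xBF".b

# Hypothetical helper: report whether a file starts with a BOM and how
# many carriage returns it contains anywhere in the file.
def notepad_leftovers(path)
  raw = File.binread(path) # raw bytes, no encoding or newline conversion
  {
    bom: raw.start_with?(BOM_BYTES),
    carriage_returns: raw.count("\r")
  }
end
```

A clean file should report no BOM and zero carriage returns, so this could double as a sanity check in a test suite or a pre-commit hook.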
In real life, you may see this error:
Encoding::CompatibilityError: incompatible encoding regexp match (ASCII-8BIT regexp with UTF-8 string)
In that case, you can simply force the encoding to UTF-8, leaving the bytes unchanged:
file_contents.sub("\xEF\xBB\xBF".force_encoding('UTF-8'), "").gsub("\r", "")
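Another way to sidestep the encoding mismatch is to let Ruby drop the BOM while reading, using the "BOM|UTF-8" encoding in the file mode. This is only a sketch: the helper names read_without_bom and read_cleaned are hypothetical, and it assumes the files really are UTF-8. Note that this only skips a BOM at the very start of the file, which is where Notepad puts it.

```ruby
# Hypothetical helper: "r:BOM|UTF-8" asks Ruby to skip a leading UTF-8
# BOM, if there is one, and to tag the resulting string as UTF-8.
def read_without_bom(path)
  File.read(path, mode: "r:BOM|UTF-8")
end

# Hypothetical helper: read a file with the BOM skipped, then strip the
# carriage returns. No byte-string pattern is needed for the BOM, so
# there is nothing to clash with the string's encoding.
def read_cleaned(path)
  read_without_bom(path).delete("\r")
end
```

The result is a plain UTF-8 string, so it can be handed to the rest of the program or written back out as before.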