@jbgo
Last active March 10, 2023 12:53
Safely remove byte order marks (BOM) and carriage returns (^M) from files edited with Notepad on Windows

TLDR: File.read(a_file).sub("\xEF\xBB\xBF", "").gsub("\r", "")

I tried using combinations of find/awk/sed, but I ended up corrupting my project files, including the git index, because I don't understand those commands very well. So instead, I turned to my good old friend Ruby. This also has the advantage that I can use it in my program that processes these files. They've been ninja-edited (modified outside of source control) in the past, which makes it likely that someone will come by and ninja-edit them again in the future. But instead of convincing people that it's bad to edit files in Notepad or outside of source control, we can just silently fix it on the fly and smile and say "Happy Friday!"

Here is an example of what I'm cleaning up. Thankfully, git's diff viewer is brutally honest and shows these characters. Unfortunately, most editors (including the good ones) simply hide these characters and put them back in when you save the file. When multiple CSS files were combined into one, the byte order marks ended up in the middle of the combined file, which caused browsers to skip CSS declarations, and the page would look funny.

+<U+FEFF>^M
+body {^M
+    background: #FFF url(/content/skins/azteka/images/bg_tile.jpg) repeat-x; }^M
+    ^M
+html,body,input,select,textarea,tr,td,table {^M
+       font-family: arial,Helvetica,sans-serif;^M
+    font-size: 9pt;^M
+    color: #666; }^M
+^M

Here's a Ruby one-liner for a single file. For bonus points, I decided to remove trailing whitespace as well. This will overwrite the current file, but hey, we're using version control.

ruby -e 'IO.write(ARGV.first, IO.read(ARGV.first).sub("\xEF\xBB\xBF", "").gsub("\r", "").gsub(/[ \t]+\n/, "\n"))' content/Skins/Azteka/skin.css

"\xEF\xBB\xBF" is the byte order mark, and "\r" is the same as ^M - a carriage return. (Windows uses \r\n for line endings instead of simply \n.)
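If more than one file needs the same treatment, the one-liner can be stretched into a small script. This is only a sketch under a few assumptions: the files are valid UTF-8, the script runs on a Unix-like system (so Ruby does not translate line endings on write), and the names strip_notepad_artifacts and clean_files! are made up for illustration. Unlike the one-liner, it only strips a BOM at the very start of the file.

```ruby
# Hypothetical helper (not from the original one-liner): strips a leading
# UTF-8 BOM, all carriage returns, and trailing spaces/tabs before newlines.
def strip_notepad_artifacts(text)
  text.sub(/\A\uFEFF/, "").gsub("\r", "").gsub(/[ \t]+\n/, "\n")
end

# Hypothetical helper: rewrites each file in place, but only when cleaning
# actually changes it. Returns the list of paths that were rewritten.
def clean_files!(paths)
  paths.select do |path|
    original = File.read(path, encoding: "UTF-8")
    cleaned = strip_notepad_artifacts(original)
    next false if cleaned == original # nothing to fix, leave the file alone

    File.write(path, cleaned)
    true
  end
end

# Example call, using the path from the one-liner above:
# clean_files!(["content/Skins/Azteka/skin.css"])
```

Skipping unchanged files keeps modification times stable, which seemed like a reasonable choice for files under version control.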

And here it is as a function, which also happens to be more readable:

def clean_up_after_notepad(file_contents)
  file_contents.sub("\xEF\xBB\xBF", "").gsub("\r", "")
end
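To illustrate the "fix it on the fly" idea, here is a rough sketch of how the function could sit inside a program that combines stylesheets. The combine_stylesheets helper and the choice to join files with a newline are assumptions for the example, not part of my actual program; it also assumes Ruby 2.0 or later and UTF-8 files.

```ruby
# The function from above, repeated so this snippet runs on its own.
def clean_up_after_notepad(file_contents)
  file_contents.sub("\xEF\xBB\xBF", "").gsub("\r", "")
end

# Hypothetical helper: reads each stylesheet, cleans it, and joins the
# results, so no BOM ends up in the middle of the combined output.
def combine_stylesheets(paths)
  paths.map { |path| clean_up_after_notepad(File.read(path, encoding: "UTF-8")) }
       .join("\n")
end
```

Because each file is cleaned before joining, the combined CSS never contains a BOM, no matter how many of the inputs were touched by Notepad.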

Now if we run git diff (and scroll down), it looks a lot better! Browsers will also parse all of the styles now.

+
+body {
+    background: #FFF url(/content/skins/azteka/images/bg_tile.jpg) repeat-x; }
+
+html,body,input,select,textarea,tr,td,table {
+       font-family: arial,Helvetica,sans-serif;
+    font-size: 9pt;
+    color: #666; }
+

If you're still skeptical, you can run this one-liner before and after cleaning the file to see that the correct bytes got removed and nothing else.

Before:

$ ruby -e 'puts File.readlines(ARGV.first).first.bytes.to_a.join(",")' content/Skins/Azteka/skin.css
239,187,191,13,10

239,187,191 is the BOM, 13 is the carriage return, and 10 is the newline.

After:

$ ruby -e 'puts File.readlines(ARGV.first).first.bytes.to_a.join(",")' content/Skins/Azteka/skin.css
10

Now we just see the newline character, as expected.
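The same check can be wrapped in a helper that inspects the whole file rather than just the first line. This is a minimal sketch; the names UTF8_BOM_BYTES and notepad_artifacts are invented for the example. It reads the file as raw bytes, so encodings don't get in the way.

```ruby
# The three BOM bytes as a binary (ASCII-8BIT) string.
UTF8_BOM_BYTES = "\xEF\xBB\xBF".b

# Hypothetical helper: reports whether the file starts with a BOM and how
# many carriage returns it contains, working on the raw bytes.
def notepad_artifacts(path)
  bytes = File.binread(path)
  {
    bom: bytes.start_with?(UTF8_BOM_BYTES),
    carriage_returns: bytes.count("\r")
  }
end
```

For a file that has been cleaned, both values should come back as false and 0.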

jbgo commented Feb 1, 2013

In real life, you may see the error Encoding::CompatibilityError: incompatible encoding regexp match (ASCII-8BIT regexp with UTF-8 string)

In that case, you can simply force the encoding to UTF-8, leaving the bytes unchanged: file_contents.sub("\xEF\xBB\xBF".force_encoding('UTF-8'), "").gsub("\r", "")
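A different way to sidestep the encoding mismatch (not the approach above, just an alternative sketch) is to let Ruby drop the BOM while reading, using the "BOM|UTF-8" external encoding in the open mode. The helper name read_without_bom_or_cr is made up for the example.

```ruby
# Hypothetical helper: opens the file as UTF-8 and asks Ruby to skip a
# leading BOM if one is present, then strips carriage returns.
def read_without_bom_or_cr(path)
  File.open(path, "r:BOM|UTF-8") { |file| file.read }.gsub("\r", "")
end
```

The returned string is tagged as UTF-8, so no BOM literal or force_encoding call is needed on this path.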

mattt commented May 11, 2022

A BOM typically occurs at the start of a stream, so delete_prefix can be used instead of sub. This has the advantage of not scanning the entire string if it doesn't start with a BOM.

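The force_encoding method may cause issues if you're assigning to a constant in a file with the # frozen_string_literal: true directive, because force_encoding mutates its receiver and a frozen string literal can't be mutated. An alternative approach would be to construct the string using pack, which returns a new, unfrozen string: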

UTF_8_BOM = [0xEF, 0xBB, 0xBF].pack("C*").force_encoding("UTF-8").freeze
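Putting the two suggestions together might look like the sketch below. The strip_bom name is an assumption for illustration, delete_prefix needs Ruby 2.5 or later, and the input is assumed to be a UTF-8 string.

```ruby
# frozen_string_literal: true

# pack returns a new, unfrozen binary string, so force_encoding is safe
# here even with the frozen_string_literal directive in effect.
UTF_8_BOM = [0xEF, 0xBB, 0xBF].pack("C*").force_encoding("UTF-8").freeze

# Hypothetical helper: removes the BOM only when it is the very first
# character; strings without a leading BOM come back unchanged.
def strip_bom(text)
  text.delete_prefix(UTF_8_BOM)
end
```

delete_prefix returns a new string, so this works on frozen input too.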
