@rkh
Created September 2, 2012 14:57
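# encoding: binary
# Removes any bytes from a string that are not valid UTF-8.
# Caveats: NUL bytes are dropped as well, and continuation bytes are only
# counted, not range-checked, so overlong three- and four-byte forms and
# encoded surrogates still pass through.
class Cleaner
  attr_reader :bytes, :buffer, :outstanding

  def self.clean(str)
    new.tap { |c| c << str }.to_s
  end

  def initialize(str = nil)
    @bytes = []
    clear_buffer
    self << str if str
  end

  # Accepts a String, any collection of bytes that responds to #each,
  # or a single byte given as an Integer.
  def <<(input)
    return self << input.bytes if input.respond_to? :bytes
    return input.each { |b| self << b } if input.respond_to? :each
    case input
    when 1..127   then add(input)             # ASCII
    when 128..191 then fill_buffer(input)     # continuation byte
    when 194..223 then start_buffer(input, 2) # lead byte of a 2-byte sequence
    when 224..239 then start_buffer(input, 3) # lead byte of a 3-byte sequence
    when 240..244 then start_buffer(input, 4) # lead byte of a 4-byte sequence
    else clear_buffer                         # NUL and bytes that never occur in UTF-8
    end
  end

  def to_s
    bytes.pack('C*').force_encoding('utf-8')
  end

  private

  # Discards any partially collected multi-byte sequence.
  def clear_buffer
    start_buffer(nil, 0)
  end

  # Begins a sequence that is expected to be `size` bytes long.
  def start_buffer(byte, size)
    @buffer, @outstanding = Array(byte), size
  end

  # Collects a continuation byte. A complete sequence is emitted; a
  # continuation byte without a pending lead byte is dropped.
  def fill_buffer(byte)
    buffer << byte
    add(buffer) if buffer.size == outstanding
    clear_buffer if buffer.size > outstanding
  end

  # Appends one byte or one complete sequence to the output.
  def add(input)
    clear_buffer
    bytes.concat Array(input)
  end
end

# The lone \xE2 and the trailing \x9F\x8D\x94 are stripped; the complete
# four-byte sequence \xF0\x9F\x8D\x94 is kept.
str = "yummy\xE2 \xF0\x9F\x8D\x94 \x9F\x8D\x94"
puts str
puts Cleaner.clean(str)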
@Burgestrand

Questions on how to remove invalid UTF-8 characters from strings come up from time to time in #ruby and #ruby-lang. Up until now the only solution I’ve been able to give is this:

"xå\xFFö".encode("UTF-16BE", undef: :replace, invalid: :replace, replace: "").encode("UTF-8")

Now I have yet another option. You should gemify it. :)
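
The round trip through UTF-16BE in the one-liner above can be wrapped in a small helper. This is only a minimal sketch, assuming Ruby 1.9 or later; the method name `strip_invalid_utf8` is hypothetical and does not come from this thread.

```ruby
# Hypothetical helper (the name is an assumption, not from the thread).
# Drops invalid bytes by transcoding to UTF-16BE and back to UTF-8,
# replacing anything invalid or undefined with the empty string.
def strip_invalid_utf8(str)
  str.dup.force_encoding("UTF-8")
     .encode("UTF-16BE", undef: :replace, invalid: :replace, replace: "")
     .encode("UTF-8")
end

# "xå\xFFö" written as raw bytes so the example does not depend on the
# encoding of the source file.
dirty = "x\xC3\xA5\xFF\xC3\xB6".dup.force_encoding("UTF-8")
clean = strip_invalid_utf8(dirty)

puts dirty.valid_encoding? # false
puts clean.valid_encoding? # true
puts clean                 # xåö
```

The detour through a second encoding is there because transcoding is the step that applies the `invalid:` and `replace:` options, which is the approach the one-liner relies on.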

@rkh
Author

rkh commented Sep 2, 2012

A third option is to use the iconv bindings on Ruby 1.8, since @Burgestrand's solution works on 1.9 only.

@rkh
Author

rkh commented Sep 2, 2012

No edit button for gist comments?
