@brianmario
Last active December 9, 2020 20:21
Quick little pure-ruby UTF-8 string verification and cleaning utility
require 'strscan'

module UTF8Util
  # Matches any byte with the high bit set, i.e. any non-ASCII byte.
  HIGH_BIT_RANGE = /[\x80-\xff]/n
  ENCODING_SUPPORT = "".respond_to?(:force_encoding)
  REPLACEMENT = "?"

  # Check if this String is valid UTF-8
  #
  # Only the structure (leader byte plus continuation bytes) is checked;
  # overlong encodings and surrogate code points are not rejected.
  #
  # Returns true or false.
  def self.valid?(str)
    # Scan a binary copy so the byte-oriented regexp can be applied to the
    # input whatever its encoding tag.
    str = str.dup.force_encoding('binary') if ENCODING_SUPPORT
    sc = StringScanner.new(str)
    while sc.skip_until(HIGH_BIT_RANGE)
      # skip_until leaves the scanner just past the match; step back onto it.
      sc.pos -= 1
      return false unless sequence_length(sc)
    end
    true
  end

  # Replace invalid UTF-8 character sequences with a replacement character
  #
  # Returns str, modified in place, as valid UTF-8.
  def self.clean!(str)
    str.force_encoding 'binary' if ENCODING_SUPPORT
    sc = StringScanner.new(str)
    while sc.skip_until(HIGH_BIT_RANGE)
      pos = sc.pos = sc.pos - 1
      # Overwrite the offending byte; scanning resumes right after it.
      str[pos] = REPLACEMENT unless sequence_length(sc)
    end
    str.force_encoding 'UTF-8' if ENCODING_SUPPORT
    str
  end

  # Replace invalid UTF-8 character sequences with a replacement character
  #
  # Returns a copy of str as valid UTF-8.
  def self.clean(str)
    clean!(str.dup)
  end

  # Validate the UTF-8 sequence at the current scanner position.
  #
  # str_or_scanner - A String or StringScanner instance used to read bytes
  #                  for checking UTF-8 sequence length
  #
  # On failure the scanner is left just past the first byte it read.
  #
  # Returns the length in bytes of this UTF-8 sequence, or false if invalid.
  def self.sequence_length(str_or_scanner)
    if str_or_scanner.is_a?(String)
      str_or_scanner = StringScanner.new(str_or_scanner)
    end

    byte = str_or_scanner.get_byte
    return false if byte.nil? # nothing left to read
    leader = byte.getbyte(0)
    resume = str_or_scanner.pos

    length =
      if (leader >> 5) == 0x6     # 110xxxxx: 2-byte sequence
        2
      elsif (leader >> 4) == 0x0e # 1110xxxx: 3-byte sequence
        3
      elsif (leader >> 3) == 0x1e # 11110xxx: 4-byte sequence
        4
      end
    return false unless length

    (length - 1).times do
      unless check_next_sequence(str_or_scanner)
        str_or_scanner.pos = resume
        return false
      end
    end
    length
  end

  # Read another byte off the scanner, moving the scan position forward one
  # place, and check that it is a continuation byte (10xxxxxx).
  #
  # Returns true or false; false if there are no bytes left to read.
  def self.check_next_sequence(scanner)
    byte = scanner.get_byte
    return false if byte.nil?
    (byte.getbyte(0) >> 6) == 0x2
  end
  # A bare `private` does not apply to methods defined with `def self.`.
  private_class_method :check_next_sequence
end
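
To illustrate the bit tests that sequence_length and check_next_sequence rely on, here is a minimal standalone sketch. It is not part of the gist: the helper name `classify_byte` is hypothetical, and only the shifts and constants are taken from the code above.

```ruby
# Hypothetical helper (not part of UTF8Util): label a byte by its high bits,
# using the same shifts and constants as the module above.
def classify_byte(byte)
  if byte < 0x80
    :ascii
  elsif (byte >> 6) == 0x2
    :continuation   # 10xxxxxx
  elsif (byte >> 5) == 0x6
    :leader_2       # 110xxxxx
  elsif (byte >> 4) == 0x0e
    :leader_3       # 1110xxxx
  elsif (byte >> 3) == 0x1e
    :leader_4       # 11110xxx
  else
    :invalid        # 0xF8..0xFF never appear in UTF-8
  end
end

# U+6F22 (漢) is encoded in UTF-8 as the three bytes E6 BC A2.
sample = "\u6F22"
labels = sample.bytes.map { |b| classify_byte(b) }
puts labels.inspect
# prints [:leader_3, :continuation, :continuation]
```

Every byte of a multi-byte UTF-8 sequence has its high bit set, so all three bytes here fall inside HIGH_BIT_RANGE.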
@nag-cheedella

Hi -

I've taken some random Chinese/Japanese characters, shown below. They are valid UTF-8, but this program returns "false" for this input, which would mean they are not valid UTF-8.

漢字仮名漢字仮名漢字仮名漢字仮名漢字仮名漢字

Do we have to extend HIGH_BIT_RANGE further to accept these characters?

@satishakumar

It's checking for valid ASCII characters, not UTF-8.
