@ammar
Created November 3, 2010 15:55
UTF-8 aware string chop. (The first gist was posted anonymously.)
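# The rubyist version. Thanks to James Edward Gray II.
# Returns [lead, last]: the string without its last character, and that character.
def chop_utf8(s)
  return unless s
  lead = s.sub(/.\z/mu, '')   # everything except the final character
  last = s[/.\z/mu] || ''     # the final character, or '' for an empty string
  [lead, last]
end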
=begin
# Super-bloated C-minded version. Keeping for posterity.
# UTF-8 aware string chop. Returns a two-element array: the first element is
# the given string without its last character, and the second is that last
# character.
def chop_utf8(s)
  return unless s
  a = s.unpack('C*')
  c, w = 0, 0
  lead, last = '', ''
  while c < a.length
    case a[c]
    when 0x00..0x7F then w = 1 # ASCII
    when 0xC2..0xDF then w = 2 # lead byte of a 2-byte sequence
    when 0xE0..0xEF then w = 3 # lead byte of a 3-byte sequence
    when 0xF0..0xF4 then w = 4 # lead byte of a 4-byte sequence
    else w = 1 # stray continuation or invalid byte
    end
    if (c + w) >= a.length
      last = a[c..c+(w-1)].pack('C*')
    else
      lead << a[c..c+(w-1)].pack('C*')
    end
    c += w
  end
  [lead, last]
end
=end
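
A short usage sketch for the regex version above. The function body is repeated so the snippet runs on its own; the sample strings are illustrative only, and the results assume Ruby 1.9 or later with UTF-8 encoded string literals.

```ruby
# encoding: utf-8

# Regex version from the gist, repeated here so the example is standalone.
def chop_utf8(s)
  return unless s
  lead = s.sub(/.\z/mu, '')
  last = s[/.\z/mu] || ''
  [lead, last]
end

# Illustrative inputs (not from the original gist).
chop_utf8('café')    # => ["caf", "é"]   multi-byte last character stays intact
chop_utf8('日本語')   # => ["日本", "語"]
chop_utf8("a\n")     # => ["a", "\n"]    /m lets . match the trailing newline
chop_utf8('')        # => ["", ""]
chop_utf8(nil)       # => nil
```

The `m` flag matters for strings that end in a newline: without it, `.` would not match `"\n"` and nothing would be chopped.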