@mdchaney
Last active February 6, 2024 22:20
Fix string encoding in Ruby
This function takes a string encoded as Latin-1 (ISO-8859-1), UTF-8, or plain ASCII and returns it properly encoded as UTF-8. A leading UTF-8 byte order mark is stripped, and when the input is not UTF-8, any Microsoft smart quotes (stupid quotes) are replaced with plain ASCII equivalents. This is useful for reading CSV files or other text files from a web upload and getting them into a "known good" state.
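As a minimal illustration of why the bytes have to be inspected first, the sketch below stores the same word in Latin-1 and in UTF-8. The variable names are made up for this example and do not come from the gist.

```ruby
# "é" is one byte (0xE9) in Latin-1 and two bytes (0xC3 0xA9) in UTF-8.
latin1_bytes = "caf\xE9".b
utf8_bytes   = "caf\xC3\xA9".b

# Merely labelling the Latin-1 bytes as UTF-8 yields an invalid string.
latin1_as_utf8_valid = latin1_bytes.dup.force_encoding('UTF-8').valid_encoding?

# Transcoding from ISO-8859-1 produces the expected UTF-8 text.
converted = latin1_bytes.dup.force_encoding('ISO-8859-1').encode('UTF-8')

# Bytes that are already UTF-8 only need to be relabelled, not transcoded.
relabelled = utf8_bytes.dup.force_encoding('UTF-8')
```

Here `latin1_as_utf8_valid` is false, while `converted` and `relabelled` both hold the same UTF-8 string. That is the distinction the function has to make from the raw bytes alone.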
def fix_encoding(str)
  # The "b" method returns a copied string with encoding ASCII-8BIT
  str = str.b
  # Strip UTF-8 BOM if it's at start of file
  if str =~ /\A\xEF\xBB\xBF/n
    str = str.gsub(/\A\xEF\xBB\xBF/n, '')
  end
  if str =~ /([\xc0-\xff][\x80-\xbf]{1,3})+/n
    # String has actual UTF-8 characters
    str.force_encoding('UTF-8')
  elsif str =~ /[\x80-\xff]/n
    # Get rid of Microsoft stupid quotes
    if str =~ /[\x82\x8b\x91\x92\x9b\xb4\x84\x93\x94]/n
      str = str.tr("\x82\x8b\x91\x92\x9b\xb4\x84\x93\x94".b, "''''''\"\"\"")
    end
    # There was no UTF-8, but there are high characters. Assume to
    # be Latin-1, and then convert to UTF-8
    str.force_encoding('ISO-8859-1').encode('UTF-8')
  else
    # No high characters, just mark as UTF-8
    str.force_encoding('UTF-8')
  end
end
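The UTF-8 check above is a pattern-based heuristic: a byte in the range 0xC0-0xFF followed by one in 0x80-0xBF also occurs in legacy text where an accented letter is immediately followed by a smart-quote byte (for example `"\xE9\x94"`). A stricter variant could validate the whole string instead. The sketch below is one possible approach, not the gist author's method. The name `fix_encoding_strict` is hypothetical, and it assumes that any input which is not valid UTF-8 is Windows-1252.

```ruby
# Hypothetical stricter variant; the name is not from the original gist.
def fix_encoding_strict(str)
  # Work on a binary copy and drop a leading UTF-8 BOM, if present.
  bytes = str.b.sub(/\A\xEF\xBB\xBF/n, '')

  # Accept the input as UTF-8 only if every byte sequence is well formed.
  as_utf8 = bytes.dup.force_encoding(Encoding::UTF_8)
  return as_utf8 if as_utf8.valid_encoding?

  # Assumption: anything else is Windows-1252. Bytes that are undefined
  # in that code page (such as 0x81) are replaced with "?".
  text = bytes.force_encoding(Encoding::Windows_1252)
              .encode(Encoding::UTF_8, invalid: :replace, undef: :replace, replace: '?')

  # Map the same quote characters as the original to plain ASCII quotes.
  text.tr("\u201A\u2039\u2018\u2019\u203A\u00B4\u201E\u201C\u201D", "''''''\"\"\"")
end

quoted = fix_encoding_strict("\x93caf\xE9\x94".b)
with_bom = fix_encoding_strict("\xEF\xBB\xBFcaf\xC3\xA9".b)
```

In this sketch, `quoted` holds the word wrapped in plain double quotes and `with_bom` holds the word without its BOM. The cost is a full validation pass over the input, and input that is really Latin-1 with C1 control bytes will be decoded as Windows-1252.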