Last active
February 6, 2024 22:20
Fix string encoding in Ruby
This function takes a string encoded as Latin-1 (ISO-8859-1), UTF-8, or plain ASCII and returns it properly encoded as UTF-8. A leading UTF-8 byte order mark is stripped. In addition, when the input is treated as Latin-1, any Microsoft "smart quotes" (stupid quotes) are replaced with plain ASCII equivalents. This is useful for reading CSV files or other text files from a web upload and getting them to a "known good" state.
def fix_encoding(str)
  # The "b" method returns a copied string with encoding ASCII-8BIT
  str = str.b
  # Strip the UTF-8 BOM if it is at the start of the file
  if str =~ /\A\xEF\xBB\xBF/n
    str = str.gsub(/\A\xEF\xBB\xBF/n, '')
  end
  if str =~ /([\xc0-\xff][\x80-\xbf]{1,3})+/n
    # String has actual UTF-8 characters
    str.force_encoding('UTF-8')
  elsif str =~ /[\x80-\xff]/n
    # Get rid of Microsoft stupid quotes
    if str =~ /[\x82\x8b\x91\x92\x9b\xb4\x84\x93\x94]/n
      str = str.tr("\x82\x8b\x91\x92\x9b\xb4\x84\x93\x94".b, "''''''\"\"\"")
    end
    # There was no UTF-8, but there are high characters. Assume the
    # string is Latin-1 and convert it to UTF-8
    str.force_encoding('ISO-8859-1').encode('UTF-8')
  else
    # No high characters, just mark as UTF-8
    str.force_encoding('UTF-8')
  end
end
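
The UTF-8 check above is a byte-pattern heuristic: it looks for a lead byte followed by one to three continuation bytes. Some Latin-1 input can match that pattern without being valid UTF-8, for example "é" followed by a non-breaking space (bytes E9 A0). The following is a minimal sketch of a stricter variant that asks Ruby to validate the bytes with `String#valid_encoding?` instead. The name `fix_encoding_strict` and the two constants are hypothetical, introduced here for illustration only; this is not the gist's code, and it otherwise follows the same steps.

```ruby
# Hypothetical variant of fix_encoding; the name, the constants, and the
# strict validation step are assumptions made for this sketch.
UTF8_BOM = "\xEF\xBB\xBF".b.freeze

# Same byte-to-ASCII quote table as the function above.
CP1252_QUOTES = "\x82\x8b\x91\x92\x9b\xb4\x84\x93\x94".b.freeze
ASCII_QUOTES  = "''''''\"\"\"".freeze

def fix_encoding_strict(str)
  # Work on a binary (ASCII-8BIT) copy so the caller's string is untouched.
  bytes = str.b.delete_prefix(UTF8_BOM)

  # Plain ASCII: nothing to convert, just label it as UTF-8.
  return bytes.force_encoding('UTF-8') if bytes.ascii_only?

  # Accept the bytes as UTF-8 only if they form valid UTF-8 sequences.
  as_utf8 = bytes.dup.force_encoding('UTF-8')
  return as_utf8 if as_utf8.valid_encoding?

  # Otherwise assume Latin-1: replace the quote bytes, then transcode.
  bytes.tr(CP1252_QUOTES, ASCII_QUOTES)
       .force_encoding('ISO-8859-1')
       .encode('UTF-8')
end

fix_encoding_strict("caf\xE9".b)      # => "café" (Latin-1 input)
fix_encoding_strict("caf\xC3\xA9".b)  # => "café" (already UTF-8)
fix_encoding_strict("\x93hi\x94".b)   # => "\"hi\""
```

The trade-off is that validation is all-or-nothing: a file that is mostly UTF-8 with one stray byte falls through to the Latin-1 branch, whereas the regex version labels it UTF-8 and leaves the invalid byte in place.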