@wycats
Created May 17, 2011 22:18

Ruby Encoding Cheat Sheet

  1. Only call force_encoding on BINARY Strings.
  2. When receiving a BINARY String from the network or file system, make sure to call force_encoding on it with its correct encoding.
    • In general, the encoding information is provided in an out-of-band channel, such as the Content-Type header in HTTP
    • If you don't know the encoding, the String is BINARY forever and should not be concatenated with non-BINARY strings
  3. When calling force_encoding on a BINARY String, immediately call encode! afterwards. This will transcode the String to Encoding.default_internal if it is set; if it is not set, a bare encode! leaves the String unchanged.
  4. When using a regular expression with /u, make sure that only Unicode Strings are possible
  5. When using a regular expression with /n, make sure that only BINARY Strings are possible
  6. If you get an incompatible encoding error (Encoding::CompatibilityError) between BINARY (ASCII-8BIT) and another encoding, the correct debugging approach is to identify where the BINARY String came from. Usually, this means that a library read in BINARY data from the network and didn't give it an encoding.
  7. In app code, never use force_encoding to convert BINARY data into a particular encoding. By the time you've reached app code, you have lost the information about which encoding is being used. Instead, find where the String came into Ruby, and fix it to set up the encoding based on the information it knows.
  8. In library code, only use force_encoding to convert BINARY data into an encoding if you have information about what encoding is being used. This means that you have a header in network protocols or a magic comment in templates (like ERB) or source files.
  9. Only include the magic comment in source files that actually contain characters from that encoding
  10. To combine two Strings with known, but different encodings, use encode to transcode the Strings into the same encoding, then combine them.
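
Rules 2, 3 and 10 can be sketched roughly as follows. This is only an illustration under stated assumptions: the input bytes and the "declared" encoding are hypothetical stand-ins for what a real network or file read and its out-of-band metadata would provide, and UTF-8 is named explicitly as the target instead of relying on a bare encode! and Encoding.default_internal.

```ruby
# Hypothetical input: bytes as they might arrive from a socket or file,
# tagged BINARY (ASCII-8BIT). These bytes spell "café" in ISO-8859-1.
raw = "caf\xE9".b

# Assumption: an out-of-band channel (for example a Content-Type header)
# declared the payload to be ISO-8859-1.
declared = Encoding::ISO_8859_1

# Rules 2 and 3: tag the BINARY String with its real encoding, then
# transcode it right away. UTF-8 is passed explicitly here so the sketch
# does not depend on process-wide settings.
text = raw.force_encoding(declared).encode(Encoding::UTF_8)

# Rule 10: two Strings with known but different encodings are transcoded
# to a common encoding before being combined.
utf16    = "na\u00EFve".encode(Encoding::UTF_16LE)
combined = text + " " + utf16.encode(Encoding::UTF_8)

puts combined
```

Note that force_encoding is applied only to the String that started out as BINARY; the UTF-16LE String already carries a correct tag, so it goes through encode alone.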
@danfarino

lonny: BINARY is an official encoding in Ruby 1.9. It is synonymous with ASCII-8BIT:

ruby-1.9.2-p180 :001 > "abc".encode('binary').encoding
=> #<Encoding:ASCII-8BIT>

@metaskills

Yehuda, your concise notes and musings have been an inspiration to a few projects of mine. The biggest is TinyTDS which was the first backend connection mode for SQL Server that handled encodings correctly. Both it and the adapter will be released in the upcoming Rails Installer project. I'll be bookmarking this for reference anytime I have to think about encodings again. Keep it up and thanks again for sharing!

@luckydev

thanks :)

@brianmario

use force_encoding to convert

It might be worth a slight wording change to denote that force_encoding isn't actually "converting" anything, rather switching the encoding flag on the string. No bytes are actually modified and the String isn't even read at all with force_encoding.

Just to make it absolutely clear what it's doing vs encode and encode! at least?
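
A minimal sketch of the distinction raised in this comment, assuming a single "é" as sample data (the variable names are illustrative only): force_encoding leaves the bytes alone and changes only how they are interpreted, while encode produces different bytes for the same character.

```ruby
# "é" in UTF-8 is the two bytes C3 A9.
s = "\u00E9"
before = s.bytes

# force_encoding only swaps the encoding tag; the bytes are untouched.
# dup is used so the original String keeps its UTF-8 tag.
relabeled = s.dup.force_encoding(Encoding::ISO_8859_1)
bytes_unchanged = (relabeled.bytes == before)

# The same two bytes now read as two ISO-8859-1 characters.
relabeled_length = relabeled.length

# encode transcodes: same character, different byte sequence.
transcoded = s.encode(Encoding::ISO_8859_1)

puts "bytes unchanged by force_encoding: #{bytes_unchanged}"
puts "length after relabeling: #{relabeled_length}"
puts "bytes after encode: #{transcoded.bytes.inspect}"
```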
