If you've landed here it means you've been hit by this message in your program. In this post I'll quickly introduce you to what "UTF-8 byte sequences" are, why they can be invalid and how to solve this problem in Ruby.
UTF-8 is, as explained in Wikipedia, a variable-width encoding for Unicode codepoints (in simple words: numbers representing characters). Every character in UTF-8 is stored as a sequence of 1 up to 4 bytes.
Apart from UTF-8 there are also other encodings like ISO-8859-1 or Windows-1252 - you may have seen these names before in your programming career. These encodings cover a big set of characters too, including accented Latin letters.
Now, even though UTF-8 covers a huge set of characters as well, it is not 100% compatible with the above-mentioned encodings. Take a look at the following picture:
- Both UTF-8 and ISO-8859-1 are ASCII compatible - they use the same bytes for digits and the Latin alphabet
- UTF-8 includes characters not present in ISO-8859-1, like the rocket emoji 🚀
- Both UTF-8 and ISO-8859-1 include the "Å" character, but the letter is stored using different bytes - C3 85 in UTF-8 and C5 in ISO-8859-1
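The three points above can be checked directly in Ruby. This is only a small sketch: the variable names are mine, and the characters are written as \u escapes so the snippet does not depend on how your editor saves the file.

```ruby
a_ring = "\u00C5"    # Å
rocket = "\u{1F680}" # 🚀

# ASCII characters use the same single byte in both encodings.
puts "A".unpack("H*").first                      # 41
puts "A".encode("ISO-8859-1").unpack("H*").first # 41

# Å exists in both encodings, but is stored as different bytes.
puts a_ring.unpack("H*").first                      # c385
puts a_ring.encode("ISO-8859-1").unpack("H*").first # c5

# The rocket takes 4 bytes in UTF-8 and has no ISO-8859-1 equivalent at all.
puts rocket.bytesize # 4
rocket_error =
  begin
    rocket.encode("ISO-8859-1")
    nil
  rescue Encoding::UndefinedConversionError => e
    e.class
  end
puts rocket_error # Encoding::UndefinedConversionError
```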
Ruby's default source encoding since 2.0 is UTF-8, and on most systems data read from files is tagged as UTF-8 too. This means that Ruby will treat any string you input as a UTF-8 encoded string unless you tell it explicitly that it's encoded differently.
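If you want to see what your own Ruby assumes, something like the sketch below should do. The printed defaults depend on your Ruby version and locale, so I'm not listing expected values for them. The second half shows that the very same byte is fine or broken depending only on the encoding you tag it with (the variable names are just for illustration).

```ruby
puts __ENCODING__              # encoding assumed for this source file
puts Encoding.default_external # encoding assumed for data read from files

raw = "\xC5".b # one raw byte, tagged as binary

# Same byte, two different labels. force_encoding never changes the bytes.
as_latin1 = raw.dup.force_encoding("ISO-8859-1")
as_utf8   = raw.dup.force_encoding("UTF-8")

puts as_latin1.valid_encoding? # true
puts as_utf8.valid_encoding?   # false
```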
Let's use the Å character from the introductory diagram to present this problem.
Imagine you have a file file.txt containing the following string: "vandflyver \xC5rhus". As you already know, the C5 byte corresponds to Å in ISO-8859-1, but on its own it is not a valid sequence in UTF-8. Ruby however doesn't know that the original encoding of the file is ISO-8859-1 and will by default interpret it as UTF-8.
So, the following operation will result in the infamous "invalid byte sequence in UTF-8" error: https://gist.github.com/8508e6356336624c57d05fd988b5023b
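In case the gist doesn't load, here is a rough reproduction of the same situation. It is my own sketch, not necessarily the exact operation from the gist: I write the bytes to a temporary file standing in for file.txt, read it back as UTF-8 and run a regexp match on it, which is one of many string operations that refuse to work on broken input.

```ruby
require "tempfile"

# Stand-in for file.txt: ISO-8859-1 bytes, so Å is the single byte C5.
file = Tempfile.new("file.txt")
file.binmode
file.write("vandflyver \xC5rhus".b)
file.close

# Read it as UTF-8, which is what Ruby assumes by default on most systems.
text = File.read(file.path, encoding: "UTF-8")

puts text.valid_encoding? # false

error_message =
  begin
    text =~ /rhus/
    nil
  rescue ArgumentError => e
    e.message
  end

puts error_message # invalid byte sequence in UTF-8
```

Note that reading the file does not fail. Ruby only complains once an operation actually has to interpret the bytes as characters.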
The "invalid UTF-8 byte sequence" here is our "Å" (C5) byte: in UTF-8, C5 announces a two-byte character, but the byte that follows it ("r") is not a valid continuation byte. Fortunately there are a few ways to solve this problem.
If you know the encoding in which the file was originally written then all you have to do is to provide the encoding name when reading the input file. Ruby will automatically handle the character conversion for you: https://gist.github.com/b2bfdae1005d88eab7d49f60dc11b1ee
In the last line I've used the String#unpack method to print the converted character's bytes in hex. As you can see it got correctly converted from C5 to C385 🎉
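For reference, this is roughly what that looks like. It's a sketch under the same assumption as before (a temporary file standing in for file.txt). The "ISO-8859-1:UTF-8" value tells Ruby that the file is ISO-8859-1 and should be transcoded to UTF-8 while reading.

```ruby
require "tempfile"

# Stand-in for file.txt, written as raw ISO-8859-1 bytes.
file = Tempfile.new("file.txt")
file.binmode
file.write("vandflyver \xC5rhus".b)
file.close

# External encoding (what's on disk) : internal encoding (what we want in memory).
text = File.read(file.path, encoding: "ISO-8859-1:UTF-8")

puts text                 # vandflyver Århus
puts text.encoding        # UTF-8
puts text.valid_encoding? # true

# Å is the 12th character (index 11); print its bytes in hex.
puts text[11].unpack("H*").first # c385
```

If the string is already in memory rather than in a file, text.force_encoding("ISO-8859-1").encode("UTF-8") achieves the same result.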
In many cases you won't be lucky enough to know the original encoding of the file. In this case the String#encode method comes in handy. You can use it to skip invalid UTF-8 bytes or replace them with a string of your choice.
Check out the following examples: https://gist.github.com/2da47fb8202fb627c856494e2291f750
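A minimal sketch of both options, assuming a reasonably recent Ruby (very old versions treated a UTF-8 to UTF-8 encode as a no-op). The variable names are mine.

```ruby
# The broken string from before, explicitly tagged as UTF-8.
broken = "vandflyver \xC5rhus".b.force_encoding("UTF-8")

# Skip the invalid bytes entirely.
skipped = broken.encode("UTF-8", invalid: :replace, replace: "")
puts skipped # vandflyver rhus

# Replace them with a string of your choice.
replaced = broken.encode("UTF-8", invalid: :replace, replace: "?")
puts replaced # vandflyver ?rhus

# Without :replace, Ruby falls back to U+FFFD, the Unicode replacement character.
default = broken.encode("UTF-8", invalid: :replace)
puts default.valid_encoding? # true
```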
May not be beautiful, but it's still better than crashing the app, right?
In case you don't know the source encoding and don't want to skip the invalid characters you can use a character encoding detection gem called charlock_holmes. It'll analyze the string and provide you with the most probable source encoding and guess confidence (also a language code as a bonus :P).
Check it out in action: https://gist.github.com/87485314188538b4c428abc660180e98
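Here is a hedged sketch of how this can look, based on the EncodingDetector.detect and Converter.convert calls from the gem's README. The gem needs the ICU libraries installed, so the snippet simply skips detection when it can't be loaded. I'm not showing expected output, because a guess on a string this short can vary.

```ruby
begin
  require "charlock_holmes" # gem install charlock_holmes
rescue LoadError
  warn "charlock_holmes is not installed, skipping detection"
end

content = "vandflyver \xC5rhus".b # raw bytes of unknown origin
converted = nil

if defined?(CharlockHolmes)
  # Returns a Hash with keys such as :encoding, :confidence and :language.
  detection = CharlockHolmes::EncodingDetector.detect(content)
  puts detection.inspect

  if detection && detection[:encoding]
    # Transcode from whatever was detected to UTF-8.
    converted = CharlockHolmes::Converter.convert(content, detection[:encoding], "UTF-8")
    puts converted
  end
end
```

Keep in mind that this is a statistical guess: the longer the input, the more trustworthy the reported confidence.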
First of all I hope that this post helped you to solve the Ruby issue you had. I also hope you've learned something useful along the way. String encodings can sometimes be really f***ed up, so it's really worth knowing what's going on under the hood.