@larskanis
Last active February 3, 2020 14:05
utf8-default-enc

Set default for Encoding.default_external to UTF-8 on Windows

This issue is related to https://bugs.ruby-lang.org/issues/13488, where we already discussed the topic and postponed the change to ruby-3. Patch is here:

Currently Encoding.default_external is initialized to the local console encoding of the Windows installation, unless it is changed with the -E option. This is, for example, cp850 for Western Europe. It should be changed to UTF-8.
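To see which encodings a given installation actually picks up, the following minimal sketch prints the relevant values. The `encodings` hash is only an illustrative name; the printed values depend on the platform, the console code page and any -E option, so no particular output is assumed here.

```ruby
# Collect the encodings Ruby derives from the environment.
# "locale" and "filesystem" are special names understood by Encoding.find.
encodings = {
  "default_external" => Encoding.default_external,
  "default_internal" => Encoding.default_internal, # nil unless set explicitly
  "locale"           => Encoding.find("locale"),
  "filesystem"       => Encoding.find("filesystem"),
}

encodings.each { |name, enc| puts "#{name}: #{enc.inspect}" }
```

Running the same script once plainly and once with `ruby -E UTF-8` shows the effect of the option on default_external.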

RubyInstaller has provided a checkbox for RUBYOPT=-Eutf-8 since version 2.4. This checkbox was disabled by default, but I noticed from bug reports that many people enabled it. With RubyInstaller-2.7.0 this checkbox is enabled by default. So we already have a steady migration towards UTF-8 on Windows.

Changing to UTF-8 fixes various inconsistencies within ruby and with external tools. A very annoying case is that writing a text to a file stores the content in UTF-8, since this is the default ruby source encoding, but reading the content back tags the string with the wrong encoding. This doesn't happen in irb, since it already sets Encoding.default_external = "utf-8" on its own.

s = "äöü"
File.write("x", s)   # => 6 bytes
File.read("x") == s  # => true in irb but false in .rb file
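The mismatch can be reproduced on any platform by passing the encoding explicitly instead of relying on the default. This is a sketch, not part of the patch: CP850 stands in for the Windows console code page, and the temporary directory and the variable names `as_cp850` and `as_utf8` are illustrative.

```ruby
require "tmpdir"

s = "\u00E4\u00F6\u00FC" # "äöü", written with escapes so the literal is always UTF-8
as_cp850 = as_utf8 = nil

Dir.mktmpdir do |dir|
  path = File.join(dir, "x")
  File.write(path, s) # the 6 UTF-8 bytes are written unchanged

  # Same bytes on disk, but tagged with different encodings on read.
  as_cp850 = File.read(path, encoding: "CP850") # what a cp850 default_external does
  as_utf8  = File.read(path, encoding: "UTF-8") # what the proposed default does
end

puts as_cp850.bytes == s.bytes # true: the bytes are identical
puts as_cp850 == s             # false: the encoding tag differs
puts as_utf8 == s              # true
```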

Another issue is that many non-Asian regions have distinct legacy encodings for OEM (aka Encoding.find('locale') ) and ANSI (aka Encoding.find('filesystem') ), so that a file written in the current default external encoding Encoding.find('locale') is not properly interpreted by Windows GUI tools like notepad. It is therefore uncommon to store files in the OEM encoding, and doing so is almost certainly wrong.

RubyInstaller ships the MSYS2 environment, which defaults to UTF-8 as well.

PowerShell made the switch to UTF-8 (without BOM) in PowerShell 6.0 and extended it further in 6.1.

Changing the default of Encoding.default_external to UTF-8 is a trade-off. It doesn't fit every case, but in my experience it is the best overall option.

There are some alternatives to it:

Changing the Windows console to codepage 65001:

  • The Windows implementation of 65001 is buggy in the console. I haven't verified it lately, but chcp 65001 didn't work reliably years ago.
  • It is not the default, and input methods like IME are incompatible with it.

Setting Encoding.default_internal in addition:

  • This triggers transcoding of output strings, which is not enabled on other systems, causing unexpected results and incompatibilities.

Change ruby to use Encoding.find("filesystem") as encoding for file operations:

  • That would fix the compatibility with some built-in Windows tools, but it doesn't fix the encoding issues that arise from the increasing use of UTF-8.

Please note that changing Encoding.default_external doesn't affect file or IO output, unless Encoding.default_internal is set as well (which is not the default). So inspecting ruby's output with the Windows built-in more command will most likely result in garbage (since strings are usually UTF-8 in ruby), regardless of the particular default_external setting. On the other hand, output inspected with MSYS2 less is most likely correct, since it expects UTF-8 input.
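The effect on output can be checked with a small experiment. This is a sketch under assumptions: CP850 is used as an example legacy encoding, the `sizes` hash and the temporary file are illustrative, and the process-wide defaults are changed only temporarily and restored afterwards.

```ruby
require "tmpdir"

s = "\u00E4\u00F6\u00FC" # "äöü" as UTF-8, 6 bytes

# Remember the process-wide settings so they can be restored afterwards.
old_ext = Encoding.default_external
old_int = Encoding.default_internal
sizes = {}

begin
  Dir.mktmpdir do |dir|
    path = File.join(dir, "x")

    # Only default_external set: bytes are written unchanged.
    Encoding.default_external = "CP850"
    Encoding.default_internal = nil
    File.write(path, s)
    sizes[:external_only] = File.binread(path).bytesize

    # default_internal set as well: output is transcoded to default_external.
    Encoding.default_internal = "UTF-8"
    File.write(path, s)
    sizes[:with_internal] = File.binread(path).bytesize
  end
ensure
  Encoding.default_external = old_ext
  Encoding.default_internal = old_int
end

p sizes
```

With only default_external set, the file holds the 6 UTF-8 bytes; once default_internal is set too, the string is transcoded on write and the file holds 3 CP850 bytes.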

The patch is currently about Windows only, because I would like to focus on that question for now. Possibly it's a subsequent question whether Encoding.default_external should default to UTF-8 on all operating systems or at least in case of LANG=C locale (which currently triggers US-ASCII).
