- ASCII is 0x00-0x7F (128 total characters); ASCII is a subset of UTF-8
- UTF-8 - variable width 1 to 4 bytes
Number of bytes | Bits for code point | First code point | Last code point | Byte 1 | Byte 2 | Byte 3 | Byte 4
---|---|---|---|---|---|---|---
1 | 7 | U+0000 | U+007F | 0xxxxxxx | | |
2 | 11 | U+0080 | U+07FF | 110xxxxx | 10xxxxxx | |
3 | 16 | U+0800 | U+FFFF | 1110xxxx | 10xxxxxx | 10xxxxxx |
4 | 21 | U+10000 | U+10FFFF | 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx
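The widths in the table are easy to verify from Ruby by dumping a string's bytes - a quick sketch (the sample characters are arbitrary picks for each width):

```ruby
# Print the UTF-8 byte patterns for characters of each width:
# "A" (1 byte), "ß" (2 bytes), "€" (3 bytes), U+2070E (4 bytes).
["A", "\u00DF", "\u20AC", "\u{2070E}"].each do |c|
  bits = c.bytes.map { |b| format("%08b", b) }.join(" ")
  puts format("U+%05X  %d byte(s)  %s", c.ord, c.bytesize, bits)
end
# Single-byte characters start with a 0 bit; multi-byte sequences start
# with 110/1110/11110 and continue with 10xxxxxx bytes, matching the table.
```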
- UTF-16 - variable width: 2 bytes for BMP code points and 4 bytes (a surrogate pair) for code points above U+FFFF. Valid UTF-8 and UTF-16 round-trip losslessly, though ill-formed UTF-8 sequences (e.g. encoded surrogate code points) cannot be converted to UTF-16
- Many Windows Unicode files use UCS-2 (predecessor to UTF-16) and include a BOM - for instance, PowerShell
- UTF-16 can be preferred in Asia because many CJK characters take 2 bytes instead of the 3 needed in UTF-8
- The byte order mark (BOM) can be used to indicate a file's Unicode encoding - UTF-8, UTF-16, UTF-32 - common on Windows, less so on other platforms. The BOM is 3 bytes in UTF-8, 2 bytes in UTF-16, and 4 bytes in UTF-32
- Joel's primer "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)" is a solid intro
- Unicode characters can be looked up at FileFormat.info - for instance the 4-byte 𠜎 (U+2070E), with the hackable URL http://www.fileformat.info/info/unicode/char/2070E/index.htm
- Unicode normalization - bit of an advanced topic, but realize that the same character doesn't have to be represented by the same byte sequence (precomposed vs. combining forms) - this can make upper / lower case conversion tricky and searching hard
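For example, "é" can be a single precomposed code point or "e" plus a combining accent - a small sketch using stock Ruby's `String#unicode_normalize`:

```ruby
composed   = "\u00E9"   # é as one code point (NFC form)
decomposed = "e\u0301"  # e + combining acute accent (NFD form)

p composed == decomposed                          # => false - different bytes
p composed == decomposed.unicode_normalize(:nfc)  # => true after normalizing
```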
- ANSI encoding uses a codepage which describes what to do with characters beyond the 7-bit ASCII range
- Some platforms use DBCS (double-byte character set) code pages
- Windows has a host of APIs around converting between ANSI, DBCS, MBCS and Unicode, though we can mostly ignore them
- Codepage is a system-level setting (use `chcp` to set in `cmd`)
- Codepage `65001` is UTF-8 - but don't use it - it's not fully implemented / buggy. Here be dragons.
- System language is changeable with the PowerShell cmdlet `Set-WinSystemLocale` - requires a reboot to fully take effect. Puppet does this in AppVeyor
- Windows APIs are "wide character" / encoded as `UTF-16LE`, which is essentially the universal encoding for strings in Windows (there are older equivalent ANSI-style APIs for backward compat). Data stored in the registry, file names, etc. are all UTF-16
- Windows COM / OLE support uses wide strings
- Terminal support is lousy, depending on the version of Windows (many improvements in Windows 10). Things may render incorrectly / segfault Ruby or crash the console (especially if `chcp` is used during a session without launching a new `cmd`)
- Use ConEmu if you care about appropriate rendering / Unicode handling
- `LANG` / `LC_*` variables on *nix control how Ruby starts
- Typically set to `LANG=xxxx.UTF-8`, like `LANG=en_US.UTF-8`, and the same for `LC_ALL`. Unclear which Ruby prefers when, but enabling UTF-8 where it's not the default can be tricky. Cumulus (based on Debian) requires this, for instance:
```shell
sed -i -e 's/# en_US.UTF-8 UTF-8/en_US.UTF-8 UTF-8/' /etc/locale.gen
echo 'LANG="en_US.UTF-8"' > /etc/default/locale
dpkg-reconfigure --frontend=noninteractive locales
update-locale LANG=en_US.UTF-8
export LC_ALL=en_US.UTF-8
```
- POSIX APIs often deal with opaque arrays of bytes with no encoding. This sets the stage for trickier situations where different processes run with different encodings - for instance, the output of `ps` could contain multiple encodings in a single line. Changing from a non-UTF-8 encoding that wrote files / filenames / etc. containing non-ASCII characters to a UTF-8 encoding will create problems reading that data.
- Ruby added encoding support in 1.9 - prior to that a string was just an array of bytes
- String manipulation on mismatched encodings can crash - i.e. `+`, `gsub`, etc. (in certain orders) raise `invalid byte sequence in US-ASCII`
- Depending on the string manipulation, Ruby may change the encoding of the resulting string to `ASCII_8BIT`, making it unusable for subsequent string operations
- Regular expressions can't execute when string encodings are mismatched
- `.valid_encoding?` only checks if the bytes are plausibly in an encoding, not that a string is valid - this API isn't sufficient
- Beware certain APIs that have bugs:
  - Etc - different versions of Ruby behave differently in how strings are returned / encoded. Puppet has a Puppet::Etc helper to address this
  - ENV (Windows) will corrupt environment strings by attempting to convert `UTF-16LE` to `Encoding.default_external` in a lossy way - Puppet has helpers in util.rb to address reading / writing env vars
  - HTTP libs - depending on the Ruby version, mismatched encodings are merged into binary strings
  - URI parsing - `URI.escape` / `URI.unescape` are deprecated. `URI.escape` turns `UTF-8` strings into `ASCII`. Alternatively, `CGI.escape` has other problems. Puppet has helpers in util.rb called `uri_encode` and `uri_query_encode` to address these problems.
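A minimal sketch of two of the failure modes above - mismatched-encoding concatenation raising, and `valid_encoding?` passing for bytes that are merely plausible:

```ruby
utf8   = "snowman \u2603"
binary = "\xC3".dup.force_encoding(Encoding::ASCII_8BIT)  # stray lead byte

begin
  utf8 + binary  # neither operand is pure ASCII, so Ruby refuses
rescue Encoding::CompatibilityError => e
  puts e.message  # incompatible character encodings: UTF-8 and ASCII-8BIT
end

# The same two bytes are "valid" in more than one encoding - valid_encoding?
# can't tell you which encoding the data was actually written in.
p "\xC3\xA9".dup.force_encoding(Encoding::UTF_8).valid_encoding?   # => true
p "\xC3\xA9".dup.force_encoding(Encoding::CP1252).valid_encoding?  # => true
```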
- Ruby has a few `Encoding` values set at startup. Some are not present / usable based on OS:
  - `Encoding.default_internal` - typically unset; will be used when reading IO to transcode (we generally don't use this)
  - `Encoding.default_external` - derived from some combo of `LANG` / `LC_ALL` on *nix, the codepage on Windows
    - Common code pages on Windows are 437 (`Encoding::IBM437`) for the US, 1252 (`Encoding::CP1252`) for Europe, and 932 (`Encoding::CP932`, aka `Encoding::Windows_31J`, for Japan)
    - Typically UTF-8 on non-Windows, but sometimes ISO-8859-1 (`Encoding::ISO_8859_1`)
  - `Encoding.find(:filesystem)` - the encoding to use for filesystem paths / names
  - `Encoding.locale_charmap` - the default encoding of the environment
- Ruby can treat strings as blobs of bytes:

```
[1] pry(main)> Encoding::ASCII_8BIT == Encoding::BINARY
=> true
```
- Some Ruby encodings are only designed as intermediates between other encodings - see `dummy?`
- Ruby can be launched with particular external / internal encodings using `-E`:

```
ruby -E ISO-8859-1:UTF-8 -e "p [Encoding.default_external, Encoding.default_internal]"
[#<Encoding:ISO-8859-1>, #<Encoding:UTF-8>]
```
- Ruby 2.4 has the best Unicode support to date (adds support for Unicode 9.0 - better handling of upcase, downcase, capitalize) BUT still has some bugs (mostly because strings are not normalized) - see the Ruby 2.4 Unicode support test chart and Internationalization in Ruby 2.4
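The 2.4 case-mapping improvements are easy to demonstrate (these examples assume Ruby >= 2.4; earlier versions only case-map ASCII letters):

```ruby
p "ß".upcase             # => "SS"  - full Unicode case mapping (German sharp s)
p "äöü".upcase           # => "ÄÖÜ" - non-ASCII letters now handled
p "I".downcase(:turkic)  # => "ı"   - optional locale-sensitive mappings
```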
- Puppet manifests are always UTF-8
- Puppet data - i.e. YAML and JSON files are always UTF-8
- OS files are typically in the "system" encoding - `Encoding.default_external` in Ruby, but this can vary based on the OS
- Prefer to intern all strings as UTF-8 to prevent unexpected string manipulation failures / convert at the boundaries if different (i.e. writing to logs, reports, etc.)
- `Win32OLE` (COM) support is globally configured to transcode to UTF-8 within Ruby by `WIN32OLE.codepage = WIN32OLE::CP_UTF8`
- Windows API calls going through helpers typically perform conversions between UTF-8 and UTF-16LE
- Be explicit about encodings when performing IO using any IO-derived class. Stock Ruby supports this a few different ways:
```ruby
open("transcoded.txt", "r:ISO-8859-1:UTF-8") do |io|
  puts "transcoded text:"
  p io.read
end

File.open(File.join(manifestsdir, "site.pp"), "w", :encoding => Encoding::UTF_8) do |f|
  f.puts("notify { 'ManifestFromRelativeDefault': }")
end
```
If encodings are omitted, Ruby will use `Encoding.default_external` / `Encoding.default_internal`, which might not be correct (i.e. for loading manifests). Note that YAML parsers automatically specify UTF-8, but JSON does not IIRC (JSON can be UTF-8, UTF-16 or UTF-32 according to spec)
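Since JSON parsing won't necessarily pick the right encoding for you, one approach is to read the raw bytes and declare UTF-8 explicitly before parsing - a sketch using the stdlib `json` and a temp file (the file name and contents are made up for illustration):

```ruby
require "json"
require "tempfile"

Tempfile.create(["sample", ".json"]) do |f|
  f.binmode
  f.write(%({"rune":"\u16A0"}))  # write UTF-8 bytes to disk
  f.rewind
  # Read raw bytes and tag them UTF-8 explicitly, rather than relying
  # on Encoding.default_external being correct for this file.
  raw = f.read.force_encoding(Encoding::UTF_8)
  puts JSON.parse(raw)["rune"]  # => ᚠ
end
```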
- But use the Puppet FileSystem APIs and not Ruby `File` anyhow. Note our APIs have a different signature than Ruby's with respect to octal mode:
```ruby
Puppet::FileSystem.open(pem_path, nil, 'w:UTF-8') do |f|
  # with password protection enabled
  pem = private_key.to_pem(OpenSSL::Cipher::DES.new(:EDE3, :CBC), mixed_utf8)
  f.print(pem)
end
```
- If piping data through without parsing it as a string, leave it as `BINARY` / `ASCII_8BIT`
- StringIO allows an encoding to be specified, but ignores it
- Be distrustful of console output (particularly on Windows) / similarly for browsers - i.e. Chrome OSX bugs sometimes render replacement characters for valid characters
- Use POSIX Unicode character classes where possible: `[[:alnum:]]` instead of `\w`, `[[:space:]]` instead of `\s`
- Avoid special Ruby non-POSIX character classes like `[[:word:]]` and `[[:ascii:]]`, as these often don't translate to other languages / integrations
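The difference is easy to see against a non-ASCII string - in Ruby's regexp engine `\w` stays ASCII-only while the POSIX classes are Unicode-aware:

```ruby
s = "naïve café"
p s.scan(/\w+/)           # => ["na", "ve", "caf"] - \w misses ï and é
p s.scan(/[[:alpha:]]+/)  # => ["naïve", "café"]   - POSIX class matches them
```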
- Windows accounts may be localized - i.e. `Administrator` and `SYSTEM`. Use their well-known SIDs instead of names. Note that Puppet manifests don't yet support SIDs everywhere, but the Puppet code contains SID helpers in sid.rb like `sid_to_name`
- Things fixed in the Unicode Adoption Blockers epic
- TBD
- `#encode` - transcode a string from its current encoding to another encoding (sometimes a no-op)
- `#force_encoding` - Ruby distrust - use sparingly when you know better than Ruby (generally not recommended)
- Puppet has a CharacterEncoding helper with 3 methods:
  - `convert_to_utf8`
  - `override_encoding_to_utf_8`
  - `scrub` - replace invalid bytes with the Unicode replacement character
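Stock Ruby (2.1+) has `String#scrub` for the same replacement behavior - a quick sketch:

```ruby
bad = "abc\xFFdef".dup.force_encoding(Encoding::UTF_8)  # 0xFF is never valid UTF-8
p bad.valid_encoding?  # => false
p bad.scrub            # => "abc\uFFFDdef" - invalid byte becomes U+FFFD
p bad.scrub("?")       # => "abc?def"      - or any replacement you choose
```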
- Ruby files themselves should all be UTF-8, but when in doubt, use escape sequences like `\u` or actual byte arrays for string data
- Where applicable, use Unicode strings when testing - grep for `mixed_utf8` in the Puppet code:

```ruby
# different UTF-8 widths
# 1-byte A
# 2-byte ۿ - http://www.fileformat.info/info/unicode/char/06ff/index.htm - 0xDB 0xBF / 219 191
# 3-byte ᚠ - http://www.fileformat.info/info/unicode/char/16A0/index.htm - 0xE1 0x9A 0xA0 / 225 154 160
# 4-byte 𠜎 - http://www.fileformat.info/info/unicode/char/2070E/index.htm - 0xF0 0xA0 0x9C 0x8E / 240 160 156 142
let (:mixed_utf8) { "A\u06FF\u16A0\u{2070E}" } # Aۿᚠ𠜎
```
- Be careful with how data is transported when asserting in tests
  - Manifests will be SCP'd, and valid UTF-8 bytes written on the coordinator are transferred as UTF-8
  - Shell commands are sent as strings; depending on host config, bytes may be misinterpreted
  - SSH weirdness
    - Windows uses Cygwin
    - Not all platforms inherit the user environment (for security reasons) - working on a resolution within Beaker to unify this
    - There were some bugs in old Beaker / SSH libs (net-ssh in particular didn't handle Unicode properly over a certain string length)
From Wikipedia - might help illustrate the variable-byte encoding aspect of UTF-8: https://en.wikipedia.org/wiki/UTF-8