- ASCII is 0x00-0x7F (128 total characters); ASCII is a subset of UTF-8
- UTF-8 - variable width 1 to 4 bytes
Number of bytes | Bits for code point | First code point | Last code point | Byte 1 | Byte 2 | Byte 3 | Byte 4
---|---|---|---|---|---|---|---
1 | 7 | U+0000 | U+007F | 0xxxxxxx | | |
2 | 11 | U+0080 | U+07FF | 110xxxxx | 10xxxxxx | |
3 | 16 | U+0800 | U+FFFF | 1110xxxx | 10xxxxxx | 10xxxxxx |
4 | 21 | U+10000 | U+10FFFF | 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx
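The widths in the table are easy to verify from Ruby by dumping a string's bytes - a quick sketch (the sample characters are arbitrary picks for each width):

```ruby
# Print the UTF-8 byte patterns for characters of each width:
# "A" (1 byte), "ß" (2 bytes), "€" (3 bytes), U+2070E (4 bytes).
["A", "\u00DF", "\u20AC", "\u{2070E}"].each do |c|
  bits = c.bytes.map { |b| format("%08b", b) }.join(" ")
  puts format("U+%05X  %d byte(s)  %s", c.ord, c.bytesize, bits)
end
# Single-byte characters start with a 0 bit; multi-byte sequences start
# with 110/1110/11110 and continue with 10xxxxxx bytes, matching the table.
```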
- UTF-16 - variable width: 2 bytes for BMP code points and 4 bytes (a surrogate pair) for code points above U+FFFF. Valid UTF-8 and UTF-16 round-trip losslessly, though ill-formed UTF-8 sequences (e.g. encoded surrogate code points) cannot be converted to UTF-16
- Many Windows Unicode files use UCS-2 (predecessor to UTF-16) and include a BOM - for instance, PowerShell
- UTF-16 can be preferred in Asia because many CJK characters take 2 bytes instead of the 3 needed in UTF-8
- The byte order mark (BOM) can be used to indicate a file's Unicode encoding - UTF-8, UTF-16, UTF-32 - common on Windows, less so on other platforms. The BOM is 3 bytes in UTF-8, 2 bytes in UTF-16, and 4 bytes in UTF-32
- Joel's primer "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)" is a solid intro
- Unicode characters can be looked up at FileFormat.info - for instance the 4-byte 𠜎 (U+2070E), with the hackable URL http://www.fileformat.info/info/unicode/char/2070E/index.htm
- Unicode normalization - bit of an advanced topic, but realize that the same character doesn't have to be represented by the same byte sequence (precomposed vs. combining forms) - this can make upper / lower case conversion tricky and searching hard
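For example, "é" can be a single precomposed code point or "e" plus a combining accent - a small sketch using stock Ruby's `String#unicode_normalize`:

```ruby
composed   = "\u00E9"   # é as one code point (NFC form)
decomposed = "e\u0301"  # e + combining acute accent (NFD form)

p composed == decomposed                          # => false - different bytes
p composed == decomposed.unicode_normalize(:nfc)  # => true after normalizing
```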
- ANSI encoding uses a codepage which describes what to do with characters beyond the 7-bit ASCII range
- Some platforms use DBCS (double-byte character set) code pages
- Windows has a host of APIs around converting between ANSI, DBCS, MBCS and Unicode, though we can mostly ignore them
- Codepage is a system-level setting (use `chcp` to set in `cmd`)
- Codepage `65001` is UTF-8 - but don't use it - it's not fully implemented / buggy. Here be dragons.
- System language is changeable with the PowerShell cmdlet `Set-WinSystemLocale` - requires a reboot to fully take effect. Puppet does this in AppVeyor
- Windows APIs are "wide character" / encoded as `UTF-16LE`, which is essentially the universal encoding for strings in Windows (there are older equivalent ANSI-style APIs for backward compat). Data stored in the registry, file names, etc. are all UTF-16
- Windows COM / OLE support uses wide strings
- Terminal support is lousy, depending on the version of Windows (many improvements in Windows 10). Things may render incorrectly / segfault Ruby or crash the console (especially if `chcp` is used during a session without launching a new `cmd`)
- Use ConEmu if you care about appropriate rendering / Unicode handling
- `LANG` / `LC_*` variables on *nix control how Ruby starts
- Typically set to `LANG=xxxx.UTF-8`, like `LANG=en_US.UTF-8`, and the same for `LC_ALL`. Unclear which Ruby prefers when, but enabling UTF-8 where it's not the default can be tricky. Cumulus (based on Debian) requires this, for instance:
```shell
sed -i -e 's/# en_US.UTF-8 UTF-8/en_US.UTF-8 UTF-8/' /etc/locale.gen
echo 'LANG="en_US.UTF-8"' > /etc/default/locale
dpkg-reconfigure --frontend=noninteractive locales
update-locale LANG=en_US.UTF-8
export LC_ALL=en_US.UTF-8
```
- POSIX APIs often deal with opaque arrays of bytes with no encoding. This sets the stage for trickier situations where different processes run with different encodings - for instance, the output of `ps` could contain multiple encodings in a single line. Changing from a non-UTF-8 encoding that wrote files / filenames / etc. containing non-ASCII characters to a UTF-8 encoding will create problems reading that data.
- Ruby added encoding support in 1.9 - prior to that a string was just an array of bytes
- String manipulation on mismatched encodings can crash - i.e. `+`, `gsub`, etc. (in certain orders) raise `invalid byte sequence in US-ASCII`
- Depending on the string manipulation, Ruby may change the encoding of the resulting string to `ASCII_8BIT`, making it unusable for subsequent string operations
- Regular expressions can't execute when string encodings are mismatched
- `.valid_encoding?` only checks if the bytes are plausibly in an encoding, not that a string is valid - this API isn't sufficient
- Beware certain APIs that have bugs:
  - Etc - different versions of Ruby behave differently in how strings are returned / encoded. Puppet has a Puppet::Etc helper to address this
  - ENV (Windows) will corrupt environment strings by attempting to convert `UTF-16LE` to `Encoding.default_external` in a lossy way - Puppet has helpers in util.rb to address reading / writing env vars
  - HTTP libs - depending on the Ruby version, mismatched encodings are merged into binary strings
  - URI parsing - `URI.escape` / `URI.unescape` are deprecated. `URI.escape` turns `UTF-8` strings into `ASCII`. Alternatively, `CGI.escape` has other problems. Puppet has helpers in util.rb called `uri_encode` and `uri_query_encode` to address these problems.
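A minimal sketch of two of the failure modes above - mismatched-encoding concatenation raising, and `valid_encoding?` passing for bytes that are merely plausible:

```ruby
utf8   = "snowman \u2603"
binary = "\xC3".dup.force_encoding(Encoding::ASCII_8BIT)  # stray lead byte

begin
  utf8 + binary  # neither operand is pure ASCII, so Ruby refuses
rescue Encoding::CompatibilityError => e
  puts e.message  # incompatible character encodings: UTF-8 and ASCII-8BIT
end

# The same two bytes are "valid" in more than one encoding - valid_encoding?
# can't tell you which encoding the data was actually written in.
p "\xC3\xA9".dup.force_encoding(Encoding::UTF_8).valid_encoding?   # => true
p "\xC3\xA9".dup.force_encoding(Encoding::CP1252).valid_encoding?  # => true
```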
- Ruby has a few `Encoding` values set at startup. Some are not present / usable based on OS:
  - `Encoding.default_internal` - typically unset; will be used when reading IO to transcode (we generally don't use this)
  - `Encoding.default_external` - derived from some combo of `LANG` / `LC_ALL` on *nix, the codepage on Windows
    - Common code pages on Windows are 437 (`Encoding::IBM437`) for the US, 1252 (`Encoding::CP1252`) for Europe, and 932 (`Encoding::CP932`, aka `Encoding::Windows_31J`, for Japan)
    - Typically UTF-8 on non-Windows, but sometimes ISO-8859-1 (`Encoding::ISO_8859_1`)
  - `Encoding.find(:filesystem)` - the encoding to use for filesystem paths / names
  - `Encoding.locale_charmap` - the default encoding of the environment
- Ruby can treat strings as blobs of bytes:

```
[1] pry(main)> Encoding::ASCII_8BIT == Encoding::BINARY
=> true
```
- Some Ruby encodings are only designed as intermediates between other encodings - see `dummy?`
- Ruby can be launched with particular external / internal encodings using `-E`:

```
ruby -E ISO-8859-1:UTF-8 -e "p [Encoding.default_external, Encoding.default_internal]"
[#<Encoding:ISO-8859-1>, #<Encoding:UTF-8>]
```
- Ruby 2.4 has the best Unicode support to date (adds support for Unicode 9.0 - better handling of upcase, downcase, capitalize) BUT still has some bugs (mostly because strings are not normalized) - see the Ruby 2.4 Unicode support test chart and Internationalization in Ruby 2.4
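The 2.4 case-mapping improvements are easy to demonstrate (these examples assume Ruby >= 2.4; earlier versions only case-map ASCII letters):

```ruby
p "ß".upcase             # => "SS"  - full Unicode case mapping (German sharp s)
p "äöü".upcase           # => "ÄÖÜ" - non-ASCII letters now handled
p "I".downcase(:turkic)  # => "ı"   - optional locale-sensitive mappings
```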
- Puppet manifests are always UTF-8
- Puppet data - i.e. YAML and JSON files are always UTF-8
- OS files are typically in the "system" encoding - `Encoding.default_external` in Ruby, but this can vary based on the OS
- Prefer to intern all strings as UTF-8 to prevent unexpected string manipulation failures / convert at the boundaries if different (i.e. writing to logs, reports, etc.)
- `Win32OLE` (COM) support is globally configured to transcode to UTF-8 within Ruby by `WIN32OLE.codepage = WIN32OLE::CP_UTF8`
- Windows API calls going through helpers typically perform conversions between UTF-8 and UTF-16LE
- Be explicit about encodings when performing IO using any IO-derived class. Stock Ruby supports this a few different ways:
```ruby
open("transcoded.txt", "r:ISO-8859-1:UTF-8") do |io|
  puts "transcoded text:"
  p io.read
end

File.open(File.join(manifestsdir, "site.pp"), "w", :encoding => Encoding::UTF_8) do |f|
  f.puts("notify { 'ManifestFromRelativeDefault': }")
end
```
If encodings are omitted, Ruby will use `Encoding.default_external` / `Encoding.default_internal`, which might not be correct (i.e. for loading manifests). Note that YAML parsers automatically specify UTF-8, but JSON does not IIRC (JSON can be UTF-8, UTF-16 or UTF-32 according to spec)
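Since JSON parsing won't necessarily pick the right encoding for you, one approach is to read the raw bytes and declare UTF-8 explicitly before parsing - a sketch using the stdlib `json` and a temp file (the file name and contents are made up for illustration):

```ruby
require "json"
require "tempfile"

Tempfile.create(["sample", ".json"]) do |f|
  f.binmode
  f.write(%({"rune":"\u16A0"}))  # write UTF-8 bytes to disk
  f.rewind
  # Read raw bytes and tag them UTF-8 explicitly, rather than relying
  # on Encoding.default_external being correct for this file.
  raw = f.read.force_encoding(Encoding::UTF_8)
  puts JSON.parse(raw)["rune"]  # => ᚠ
end
```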
- But use the Puppet FileSystem APIs and not Ruby `File` anyhow. Note our APIs have a different signature than Ruby's with respect to octal mode:
```ruby
Puppet::FileSystem.open(pem_path, nil, 'w:UTF-8') do |f|
  # with password protection enabled
  pem = private_key.to_pem(OpenSSL::Cipher::DES.new(:EDE3, :CBC), mixed_utf8)
  f.print(pem)
end
```
- If piping data through without parsing it as a string, leave it as `BINARY` / `ASCII_8BIT`
- StringIO allows an encoding to be specified, but ignores it
- Be distrustful of console output (particularly on Windows) / similarly for browsers - i.e. Chrome OSX bugs sometimes render replacement characters for valid characters
- Use POSIX Unicode character classes where possible: `[[:alnum:]]` instead of `\w`, `[[:space:]]` instead of `\s`
- Avoid special Ruby non-POSIX character classes like `[[:word:]]` and `[[:ascii:]]`, as these often don't translate to other languages / integrations
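The difference is easy to see against a non-ASCII string - in Ruby's regexp engine `\w` stays ASCII-only while the POSIX classes are Unicode-aware:

```ruby
s = "naïve café"
p s.scan(/\w+/)           # => ["na", "ve", "caf"] - \w misses ï and é
p s.scan(/[[:alpha:]]+/)  # => ["naïve", "café"]   - POSIX class matches them
```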
- Windows accounts may be localized - i.e. `Administrator` and `SYSTEM`. Use their well-known SIDs instead of names. Note that Puppet manifests don't yet support SIDs everywhere, but the Puppet code contains SID helpers in sid.rb like `sid_to_name`
- Things fixed in the Unicode Adoption Blockers epic
- TBD
- `#encode` - transcode a string from its current encoding to another encoding (sometimes a no-op)
- `#force_encoding` - Ruby distrust - use sparingly when you know better than Ruby (generally not recommended)
- Puppet has a CharacterEncoding helper with 3 methods:
  - `convert_to_utf8`
  - `override_encoding_to_utf_8`
  - `scrub` - replace invalid bytes with the Unicode replacement character
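Stock Ruby (2.1+) has `String#scrub` for the same replacement behavior - a quick sketch:

```ruby
bad = "abc\xFFdef".dup.force_encoding(Encoding::UTF_8)  # 0xFF is never valid UTF-8
p bad.valid_encoding?  # => false
p bad.scrub            # => "abc\uFFFDdef" - invalid byte becomes U+FFFD
p bad.scrub("?")       # => "abc?def"      - or any replacement you choose
```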
- Ruby files themselves should all be UTF-8, but when in doubt, use escape sequences like `\u` or actual byte arrays for string data
- Where applicable, use Unicode strings when testing - grep for `mixed_utf8` in the Puppet code:

```ruby
# different UTF-8 widths
# 1-byte A
# 2-byte ۿ - http://www.fileformat.info/info/unicode/char/06ff/index.htm - 0xDB 0xBF / 219 191
# 3-byte ᚠ - http://www.fileformat.info/info/unicode/char/16A0/index.htm - 0xE1 0x9A 0xA0 / 225 154 160
# 4-byte 𠜎 - http://www.fileformat.info/info/unicode/char/2070E/index.htm - 0xF0 0xA0 0x9C 0x8E / 240 160 156 142
let (:mixed_utf8) { "A\u06FF\u16A0\u{2070E}" } # Aۿᚠ𠜎
```
- Be careful with how data is transported when asserting in tests
  - Manifests will be SCP'd, and valid UTF-8 bytes written on the coordinator are transferred as UTF-8
  - Shell commands are sent as strings; depending on host config, bytes may be misinterpreted
  - SSH weirdness
    - Windows uses Cygwin
    - Not all platforms inherit the user environment (for security reasons) - working on a resolution within Beaker to unify this
    - There were some bugs in old Beaker / SSH libs (net-ssh in particular didn't handle Unicode properly over a certain string length)
From Wikipedia - might help illustrate the variable-byte encoding aspect of UTF-8: https://en.wikipedia.org/wiki/UTF-8