@Iristyle
Last active April 1, 2022 14:04

Encoding in Puppet

Basic encoding info

  • ASCII is 0 - 7F (128 total characters) - ASCII is a subset of UTF-8
  • UTF-8 - variable width 1 to 4 bytes
| Number of bytes | Bits for code point | First code point | Last code point | Byte 1 | Byte 2 | Byte 3 | Byte 4 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 7 | U+0000 | U+007F | 0xxxxxxx | | | |
| 2 | 11 | U+0080 | U+07FF | 110xxxxx | 10xxxxxx | | |
| 3 | 16 | U+0800 | U+FFFF | 1110xxxx | 10xxxxxx | 10xxxxxx | |
| 4 | 21 | U+10000 | U+10FFFF | 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx |
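The variable width is easy to see from plain Ruby (a sketch; these are the same sample characters Puppet's mixed_utf8 test string uses):

```ruby
# Each character below occupies a different number of bytes in UTF-8
"A".bytesize         # 1 byte  (U+0041)
"\u06FF".bytesize    # 2 bytes (U+06FF, ۿ)
"\u16A0".bytesize    # 3 bytes (U+16A0, ᚠ)
"\u{2070E}".bytesize # 4 bytes (U+2070E, 𠜎)
```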

Platform Specifics

Windows

  • Uses a codepage which describes what to do with characters over ASCII 7-bit range
  • Some platforms use DBCS (double byte) code pages
  • Windows has a host of APIs around converting between ANSI, DBCS, MBCS and Unicode, though we can mostly ignore them
  • Codepage is a system level setting (use chcp to set in cmd)
  • Codepage 65001 is UTF-8 - but don't use it - it's not fully implemented / buggy. Here be dragons.
  • System language is changeable with the PowerShell cmdlet Set-WinSystemLocale - requires a reboot to fully take effect. Puppet does this in AppVeyor
  • Windows APIs are "wide character" / encoded as UTF-16LE, which is mostly the universal encoding for strings in Windows (there are older equivalent ANSI style APIs for backward compat). Data stored in registry, file names, etc - are all UTF-16.
  • Windows COM / OLE support uses wide strings
  • Terminal support is lousy, depending on which version of Windows (many improvements in Windows 10). Things may render incorrectly / segfault Ruby or crash console (especially if chcp used during a session without launching a new cmd)
  • Use ConEmu if you care about appropriate rendering / Unicode handling
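The "wide character" representation Windows APIs expect can be observed from any platform by transcoding in Ruby (a sketch, not Windows-specific code):

```ruby
s = "Puppet"
wide = s.encode(Encoding::UTF_16LE) # the encoding Windows wide-char APIs use
wide.bytesize                       # 12 - two bytes per character for BMP characters
s.bytesize                          # 6 in UTF-8, since these are all ASCII
```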

Non-Windows

  • LANG / LC_* variables on *nix control how Ruby starts
  • Typically set to LANG=xxxx.UTF-8 like LANG=en_US.UTF-8 and same for LC_ALL

It's unclear which Ruby prefers when, but enabling UTF-8 where it's not the default can be tricky. Cumulus (based on Debian), for instance, requires this:

sed -i -e 's/# en_US.UTF-8 UTF-8/en_US.UTF-8 UTF-8/' /etc/locale.gen
echo 'LANG="en_US.UTF-8"' > /etc/default/locale
dpkg-reconfigure --frontend=noninteractive locales
update-locale LANG=en_US.UTF-8
export LC_ALL=en_US.UTF-8
  • POSIX APIs often deal with opaque arrays of bytes that carry no encoding. This can set the stage for trickier situations where different processes run with different encodings - for instance, the output of ps could contain multiple encodings in a single line. Switching a system from a non-UTF-8 encoding that wrote files / filenames containing non-ASCII characters over to a UTF-8 encoding will create problems reading that existing data.
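A sketch of the resulting pitfall: bytes written by a Latin-1 process can't simply be tagged as UTF-8 - they have to be transcoded:

```ruby
# 0xE9 is é in ISO-8859-1; as a raw byte it is not valid UTF-8
bytes = "caf\xE9".b                               # ASCII-8BIT (binary)
mislabeled = bytes.dup.force_encoding(Encoding::UTF_8)
mislabeled.valid_encoding?                        # false - relabeling doesn't convert
fixed = bytes.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
fixed                                             # "café" - transcoding converts the bytes
```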

Ruby Gotchas

  • Ruby added encoding support in 1.9 - prior to that a string was just an array of bytes
  • String manipulation on mismatched encodings can raise - e.g. +, gsub, etc. (in certain orders)
    • invalid byte sequence in US-ASCII
  • Depending on string manipulation, Ruby may change encoding of resulting string to ASCII_8BIT, making them unusable for subsequent string operations
  • Regular expressions can't execute when string encodings are mismatched
  • .valid_encoding? only checks if the bytes are plausibly in an encoding, not that the string is actually correct - this API isn't sufficient
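Both gotchas can be demonstrated with stock Ruby (a sketch):

```ruby
utf8   = "caf\u00E9"       # é as two UTF-8 bytes
binary = "\xE9".b          # é as a single Latin-1 byte, tagged ASCII-8BIT

begin
  utf8 + binary            # both operands contain non-ASCII bytes
rescue Encoding::CompatibilityError => e
  e.message                # "incompatible character encodings: UTF-8 and ASCII-8BIT"
end

begin
  /caf\u00E9/ =~ binary    # UTF-8 regexp against a BINARY string
rescue Encoding::CompatibilityError
  # regexps can't execute across mismatched encodings either
end

# valid_encoding? only checks byte plausibility, not correctness
"\xE9".dup.force_encoding(Encoding::ISO_8859_1).valid_encoding? # true - any byte is "valid" Latin-1
"\xE9".dup.force_encoding(Encoding::UTF_8).valid_encoding?      # false
```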
  • Beware certain APIs that have bugs:
    • Etc - different versions of Ruby behave differently with how strings are returned / encoded. Puppet has a Puppet::Etc helper to address this
    • ENV (Windows) will corrupt environment strings by attempting to convert UTF-16LE to Encoding.default_external in a lossy way - Puppet has helpers in util.rb to address reading / writing env vars
    • Http libs - depending on Ruby version mismatched encodings are merged into binary strings
    • Uri parsing - Uri.escape / Uri.unescape are deprecated. Uri.escape turns UTF-8 strings into ASCII. Alternatively, CGI.escape has other problems. Puppet has helpers in util.rb called uri_encode and uri_query_encode to address these problems.
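For illustration, the stdlib alternatives each have quirks - Puppet's uri_encode / uri_query_encode exist to paper over these; the snippet below only shows stock-Ruby behavior:

```ruby
require "cgi"
require "erb"

# CGI.escape is form-encoding: space becomes '+', not %20
CGI.escape("caf\u00E9 au lait")            # "caf%C3%A9+au+lait"

# ERB::Util.url_encode percent-encodes spaces instead
ERB::Util.url_encode("caf\u00E9 au lait")  # "caf%C3%A9%20au%20lait"
```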
  • Ruby has a few Encoding values set at startup. Some are not present / usable based on OS.
    • Encoding.default_internal - typically unset, will be used when reading IO to transcode (we generally don't use this)
    • Encoding.default_external - derived from some combo of LANG / LC_ALL on nix, codepage on Windows
      • Common code pages on Windows are 437 (Encoding::IBM437) for the US, 1252 (Encoding::CP1252) for Europe, and 932 (Encoding::CP932, aka Encoding::Windows_31J) for Japan
      • Typically UTF-8 on non-Windows, but sometimes see ISO-8859-1 (Encoding::ISO_8859_1)
    • Encoding.find("filesystem") - the encoding to use for filesystem paths / names
    • Encoding::locale_charmap - default encoding of environment
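The startup values can simply be inspected (a sketch; the actual output varies by OS and locale):

```ruby
# These are fixed at interpreter startup
p Encoding.default_external   # e.g. #<Encoding:UTF-8>, or #<Encoding:IBM437> on a US Windows box
p Encoding.default_internal   # usually nil
p Encoding.find("filesystem") # encoding for filesystem paths / names
p Encoding.locale_charmap     # e.g. "UTF-8"
```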
  • Ruby can treat strings as blobs of bytes
[1] pry(main)> Encoding::ASCII_8BIT == Encoding::BINARY
=> true
  • Some Ruby encodings are only designed as intermediates between other encodings - see Encoding#dummy?
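For example, stateful encodings like UTF-7 and ISO-2022-JP are dummy encodings - strings can be tagged with them, but most operations won't really work:

```ruby
Encoding::UTF_7.dummy?       # true
Encoding::ISO_2022_JP.dummy? # true
Encoding::UTF_8.dummy?       # false
```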
  • Ruby can be launched in a particular internal / external encoding using -E
ruby -E ISO-8859-1:UTF-8 -e "p [Encoding.default_external, \
  Encoding.default_internal]"
[#<Encoding:ISO-8859-1>, #<Encoding:UTF-8>]

Ruby code best practices - what are the rules?

  • Puppet manifests are always UTF-8
  • Puppet data - i.e. YAML and JSON files are always UTF-8
  • OS files are typically in “system” encoding - Encoding.default_external in Ruby, but this can vary based on the OS
  • Prefer to intern all strings as UTF-8 to prevent unexpected string manipulation failures / convert at the boundaries if different (i.e writing to logs, reports, etc)
  • Win32OLE (COM) support is globally configured to transcode to UTF-8 within Ruby by WIN32OLE.codepage = WIN32OLE::CP_UTF8
  • Windows API calls going through helpers typically perform conversions between UTF-8 and UTF-16LE
  • Be explicit about encodings when performing IO through any IO-derived class. Stock Ruby supports this a few different ways:
open("transcoded.txt", "r:ISO-8859-1:UTF-8") do |io|
  puts "transcoded text:"
  p io.read
end
File.open(File.join(manifestsdir, "site.pp"), "w", :encoding => Encoding::UTF_8) do |f|
  f.puts("notify { 'ManifestFromRelativeDefault': }")
end

If encodings are omitted, Ruby will use Encoding.default_external / Encoding.default_internal which might not be correct (i.e. for loading manifests). Note that YAML parsers automatically specify UTF-8, but JSON does not IIRC (JSON can be UTF-8, UTF-16 or UTF-32 according to spec)

  • But use the Puppet::FileSystem APIs rather than Ruby's File anyhow. Note our APIs have a different signature than Ruby's with respect to octal mode:
Puppet::FileSystem.open(pem_path, nil, 'w:UTF-8') do |f|
  # with password protection enabled
  pem = private_key.to_pem(OpenSSL::Cipher::DES.new(:EDE3, :CBC), mixed_utf8)
  f.print(pem)
end
  • If piping data through without parsing as string, leave as BINARY / ASCII_8BIT
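A pass-through copy sketched with stock Ruby - binread / binwrite keep the data tagged ASCII-8BIT so no transcoding ever happens:

```ruby
require "tempfile"

# Pass bytes through untouched: read and write in binary (ASCII-8BIT)
Tempfile.create("src") do |src|
  src.binmode
  src.write("\xFF\x00raw bytes".b)
  src.flush

  data = File.binread(src.path)   # encoding is ASCII-8BIT, bytes unmodified
  Tempfile.create("dst") do |dst|
    File.binwrite(dst.path, data) # no transcoding on the way out either
    File.binread(dst.path) == data
  end
end
```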
  • StringIO allows encoding to be specified, but ignores it
  • Be distrustful of console output (particularly on Windows), and similarly of browsers - e.g. Chrome on OSX sometimes renders replacement characters in place of valid characters
  • Use POSIX Unicode character classes where possible:
    • [[:alnum:]] instead of \w, [[:space:]] instead of \s
    • avoid Ruby-specific non-POSIX character classes like [[:word:]] and [[:ascii:]], as these often don't translate to other languages / integrations
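The difference is visible on any non-ASCII letter - \w is ASCII-only in Ruby, while POSIX bracket classes are Unicode-aware for UTF-8 strings (a sketch):

```ruby
s = "caf\u00E9"
s =~ /\A\w+\z/            # nil - \w is ASCII-only, é doesn't match
s =~ /\A[[:alnum:]]+\z/   # 0   - POSIX classes match Unicode letters too
```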
  • Windows accounts may be localized - i.e. Administrator and SYSTEM. Use their well-known SIDs instead of names. Note that Puppet manifests don't yet support SIDs everywhere but the Puppet code contains SID helpers in sid.rb like sid_to_name.
  • Things fixed in Unicode Adoption Blockers epic

C++ code best practices

  • TBD

Converting things

  • #encode - transcode string from current encoding to another encoding (sometimes a no-op)
  • #force_encoding - relabels the string's encoding without changing its bytes - use sparingly, only when you know better than Ruby (generally not recommended)
  • Puppet has a CharacterEncoding helper with 3 methods:
    • convert_to_utf8
    • override_encoding_to_utf_8
    • scrub - replace invalid bytes with Unicode replacement char
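Stock Ruby's String#scrub behaves analogously to the helper's scrub (a sketch using stdlib only, not the Puppet helper itself):

```ruby
bad = "caf\xE9"      # \xE9 is not valid UTF-8 on its own
bad.valid_encoding?  # false
bad.scrub            # "caf\uFFFD" - invalid byte replaced with U+FFFD
bad.scrub("?")       # "caf?" - or with a custom replacement
```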

Testing

Unit

  • Ruby files themselves should all be UTF-8, but when in doubt, use escape sequences like \u or actual byte arrays for string data
  • Where applicable, use Unicode strings when testing - grep for mixed_utf8 in Puppet code:
# different UTF-8 widths
# 1-byte A
# 2-byte ۿ - http://www.fileformat.info/info/unicode/char/06ff/index.htm - 0xDB 0xBF / 219 191
# 3-byte ᚠ - http://www.fileformat.info/info/unicode/char/16A0/index.htm - 0xE1 0x9A 0xA0 / 225 154 160
# 4-byte 𠜎 - http://www.fileformat.info/info/unicode/char/2070E/index.htm - 0xF0 0xA0 0x9C 0x8E / 240 160 156 142
let (:mixed_utf8) { "A\u06FF\u16A0\u{2070E}" } # Aۿᚠ𠜎

Beaker

  • Be careful with how data is transported when asserting in tests
    • Manifests will be SCP'd, and valid UTF-8 bytes written on coordinator are transferred as UTF-8
    • Shell commands are sent as strings, depending on host config bytes may be misinterpreted
  • SSH weirdness
    • Windows uses Cygwin
    • Not all platforms inherit user environment (for security reasons) - working on resolution within Beaker to unify this
    • There were some bugs in old Beaker / SSH libs (net-ssh in particular didn't handle Unicode properly over a certain string length)
@MosesMendoza

From wikipedia - might help illustrate the variable-byte encoding aspect of UTF-8:

| Number of bytes | Bits for code point | First code point | Last code point | Byte 1 | Byte 2 | Byte 3 | Byte 4 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 7 | U+0000 | U+007F | 0xxxxxxx | | | |
| 2 | 11 | U+0080 | U+07FF | 110xxxxx | 10xxxxxx | | |
| 3 | 16 | U+0800 | U+FFFF | 1110xxxx | 10xxxxxx | 10xxxxxx | |
| 4 | 21 | U+10000 | U+10FFFF | 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx |

https://en.wikipedia.org/wiki/UTF-8

@Sharpie

Sharpie commented Mar 19, 2018

This video is also an awesome 10 minute primer on what character encodings are and how UTF-8 works: https://www.youtube.com/watch?v=MijmeoH9LT4

If you're starting from scratch with encodings, this is a great source of foundational knowledge.

@Sharpie

Sharpie commented Mar 19, 2018

variable width 1 to 4 bytes

It's actually 1 to 6 bytes. The longest "header" byte in UTF-8 is:

1111110x

Where the 6 leading 1s indicate there will be 6 bytes, including the header, in that character. Characters requiring more than 4 bytes to encode are very rare though.

@Sharpie

Sharpie commented Mar 19, 2018

Hah, spoke too soon. RFC 3629 restricted UTF-8 to 4 bytes at most even though the format technically supports up to 6.
