@jborean93
Last active January 7, 2023 05:37
Details about console encoding in PowerShell

Console Encoding in PowerShell

This guide is all about console encoding and how PowerShell deals with the console and sending/receiving data to child processes. There are multiple layers of processing that are involved when it comes to outputting data to the console in PowerShell. Problems can occur at any layer which is why it's important to have an understanding of this problem to solve issues that might occur.

Text Encoding

The first layer deals with how text is represented at the raw byte level. Understanding this is important because processes communicate with each other using bytes, and data sent to or received from the console is also exchanged as bytes. Computers use encoding schemes to map a character to a numeric value, and the raw bytes that represent the text can differ widely between encoding schemes. For example, the string café is encoded as the following bytes in these encoding schemes:

Encoding     | Bytes (as decimal)
ASCII        | 99, 97, 102, 63
UTF-8        | 99, 97, 102, 195, 169
IBM 437      | 99, 97, 102, 130
Windows 1257 | 99, 97, 102, 233

Normal Latin alphabet characters are the same across these encoding schemes (this is not a hard rule but the norm), but when it comes to é they start to differ.

  • ASCII does not have any mapping for this character so it becomes ? (63 in ASCII)
  • UTF-8 represents it using 2 bytes (195, 169)
  • The remaining 2 each map that character to a different single byte

This is why it's important that the encoding scheme that is used to encode a string is the same one that is used to decode bytes. For example café encoded with UTF-8 but decoded with IBM 437 will result in the string caf├⌐ which is incorrect.

[Text.Encoding]::GetEncoding(437).GetString([Text.Encoding]::UTF8.GetBytes('café'))
# caf├⌐

The following commands can be used to play around with encoding and decoding strings to and from bytes in PowerShell:

# Use this to list the various encodings supported by .NET and their names
[Text.Encoding]::GetEncodings() | ForEach-Object -Process {
    $encoding = $_.GetEncoding()
    [PSCustomObject]@{
        Name = $encoding.WebName
        CodePage = $encoding.CodePage
    }
}

# Use either the Name or CodePage property from the above in GetEncoding()
[Text.Encoding]::GetEncoding('utf-8').GetBytes('string value')

# To convert back from bytes you can do
[Text.Encoding]::GetEncoding('utf-8').GetString([byte[]]@(99, 97, 102, 195, 169))
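As a small sketch, the byte table from earlier in this section can be reproduced by encoding the same string with each code page. The code page numbers are used instead of names (20127 is ASCII, 65001 is UTF-8), and the é is built from its code point so the result does not depend on how the script file itself is encoded. The `$table` variable name is just for illustration.

```powershell
# Build 'café' from the code point of é (U+00E9) to avoid source-file encoding issues
$text = "caf$([char]0xE9)"

# 20127 = ASCII, 65001 = UTF-8, 437 = IBM 437, 1257 = Windows 1257
$table = foreach ($codePage in 20127, 65001, 437, 1257) {
    $bytes = [Text.Encoding]::GetEncoding($codePage).GetBytes($text)
    [PSCustomObject]@{
        CodePage = $codePage
        Bytes    = $bytes -join ', '
    }
}
$table
```

Each row's Bytes value should line up with the corresponding row in the table.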

Process stdio

The second layer is understanding how native processes communicate with each other and the boundary between a process and the console. On both Windows and Unix hosts, processes send and receive data using 3 pipes:

  • stdout: Used by the applications to write normal output data
  • stderr: Used by applications to write data that typically contains error details. Note there's no strict rule; this can contain anything the application wants to write to it
  • stdin: Used to read data sent to the application

These pipes transfer data as bytes, so if a process wants to output text it needs to encode that text using an encoding scheme, and the reader needs to decode the bytes back into text. This can either be done manually by the process or by some helper operations, and is highly dependent on the environment the code is running in.

There are 3 main scenarios on Windows that control how a process gets assigned the 3 stdio pipes:

  • When a console process is spawned with a new console, the stdio pipes that are associated with that console are assigned to the process
    • This is like starting powershell.exe from the taskbar or start menu
    • Any data that it outputs is sent to the console to process/display
    • Any data entered by the user in the console is sent to the process through the stdin pipe
  • When a console process spawns a new process without stating a new console is required then the existing handles are inherited
    • This is like cmd.exe starting powershell.exe, the PowerShell process runs in the same console window
    • Anything the new child process outputs is sent to the same stdout/stderr of the parent
  • When a new process is created with explicit pipes for some or all of the stdio pipes, those stdio handles will be redirected to the pipes specified
    • This is common when you deal with redirection to a file or are trying to capture output in PowerShell
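The third scenario can be sketched with the .NET `System.Diagnostics.Process` class, which creates an explicit pipe for stdout when `RedirectStandardOutput` is set. This is only an illustration of explicit pipes, not how PowerShell itself is implemented; `cmd.exe /c echo hello` is an arbitrary stand-in child process, and IBM 437 is an assumed encoding for decoding its output.

```powershell
$psi = [Diagnostics.ProcessStartInfo]::new()
$psi.FileName = 'cmd.exe'
$psi.Arguments = '/c echo hello'
$psi.UseShellExecute = $false          # Required for stdio redirection
$psi.RedirectStandardOutput = $true    # stdout becomes a pipe we own instead of the console

# Assumption: the child writes text in IBM 437; the caller must pick the matching decoder
$psi.StandardOutputEncoding = [Text.Encoding]::GetEncoding(437)

$proc = [Diagnostics.Process]::Start($psi)
$stdout = $proc.StandardOutput.ReadToEnd()   # Bytes from the pipe decoded to a string
$proc.WaitForExit()

$stdout.TrimEnd()
```

Because stdin and stderr are not redirected here, the child still inherits those handles from the parent, as in the second scenario.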

GUI based applications on Windows are not guaranteed to have stdio pipes but they can be assigned explicitly if needed. Typically this shouldn't matter as GUIs use visual elements to convey data rather than the stdio pipes that console applications use. Unix hosts behave similarly to Windows, although the methods used to achieve this are different.

In traditional shells when you pipe data you are connecting the stdout and stdin pipes of each process. This is usually achieved by creating a pipe where the write end is the stdout of the first process and read end is the stdin of the second process. For example if I was to run ipconfig | findstr data I am joining stdout of ipconfig to stdin of findstr allowing findstr to find the string I want in the output of ipconfig.

                   |                      -> <stdout> -> | <stdin> -> findstr -> <stdout> -> | <console stdout>
<console stdin> -> | <stdin> -> ipconfig                 |                                   |
                   |                      -> <stderr> -> | --------------------> <stderr> -> | <console stderr>

Stdio pipes aren't limited to text; binary data can be exchanged too, like wget -O - http://mysite.com/my.tar.gz | tar -xzvf -, without the data ever having to be written to a temporary file on disk.

PowerShell adds a twist to this due to a fundamental difference in how commands work. When running something like Get-ChildItem | Format-Table these commands are exchanging rich .NET objects between code that runs in the same process. The pipeline bridge between the 2 commands is not the stdout to stdin pipe seen in other shells but rather something internal to PowerShell. When running commands/applications in PowerShell like ipconfig | Select-String data or Get-Item | findstr it now needs to convert those rich objects into bytes to send/receive on the stdio pipes of the newly spawned native processes. One unfortunate aspect of PowerShell is that even when piping data between multiple native commands like ipconfig | findstr, the data will still be converted to and from text and exchanged across the PowerShell pipeline. At best this creates unnecessary overhead but at worst it can break pipelines, e.g. the wget .. | tar example won't work because the binary data is corrupted in the bytes <-> text conversion that PowerShell does.

PowerShell's mechanism is to first convert the .NET object to a string and then encode that string to bytes to send to the process. Any output it receives on the stdout pipe is then decoded back to a string and output on the PowerShell pipeline. What can go wrong here is when PowerShell uses a different encoding for the bytes <-> text conversion than what was used to encode/decode on the other end. The encoding used when sending data to a process is controlled by the $OutputEncoding preference variable, and the one used when reading its output by [Console]::OutputEncoding; both are explained below.
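The corruption of binary data can be illustrated without spawning any process, by running bytes through the same kind of bytes -> text -> bytes round trip. This is a minimal sketch that assumes UTF-8 is the encoding in use on both sides; the 4 sample bytes are the typical start of a gzip stream and are only an example.

```powershell
# Sample binary data: 0x8B is not valid on its own in UTF-8
$original = [byte[]]@(0x1F, 0x8B, 0x08, 0x00)
$utf8 = [Text.UTF8Encoding]::new($false)

# What happens when the bytes are treated as text and turned back into bytes
$text = $utf8.GetString($original)
$roundTrip = $utf8.GetBytes($text)

($roundTrip | ForEach-Object { $_.ToString('X2') }) -join ' '
# 1F EF BF BD 08 00
```

The invalid byte 0x8B is decoded to the replacement character U+FFFD, which re-encodes as EF BF BD, so the original value is lost and the data grows from 4 to 6 bytes.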

The Console

The final layer of interaction with a process is the console (or terminal). The console is the ultimate destination of output from a process and a way to send input, through keystrokes, into a process. It's the interface between a command line application and the end user.

This is the part where Windows and Unix differ from each other. On Unix, the console/terminal application is opened specifically and it itself spawns a shell to run. These pipes transfer raw bytes so the encoding scheme is highly dependent on the

TODO: Fill this out

TODO: Talk about fonts and why unicode beyond the BMP will not work in conhost

The Windows Command Line blog has a nice series of articles on the history of the command line in Windows which is a great source of information about this topic.

PowerShell Configuration

There are 3 configuration options in PowerShell that are related to encoding for the console and native process input/output:

Item                      | Windows PowerShell Default                                    | PowerShell Default
$OutputEncoding           | ASCIIEncoding (TODO: see if it's extended ASCII/OEM)          | UTF8Encoding (no BOM)
[Console]::OutputEncoding | Dependent on system settings                                  | Dependent on system settings
[Console]::InputEncoding  | Dependent on system settings                                  | Dependent on system settings

While not strictly the same, the $OutputEncoding and [Console]::InputEncoding are tied together and in most cases should be set to the same value.

TODO: Add info on the default system settings for the Console encoding properties.

The following tests use this PowerShell script. Save it to a file called proc_io.ps1 somewhere on your host.

[CmdletBinding()]
param (
    [Parameter(Mandatory)]
    [string]
    $Path,  # The file the script should write the raw stdin bytes to.

    [Parameter(Mandatory)]
    [string]
    $OutputData,  # Base64 encoded bytes that the script should output to the stdout pipe.
    
    # Raw = raw FileStream read and write with bytes
    # .NET = [Console]::Read and [Console]::Write ($Path and $OutputData are treated as UTF-8)
    [Parameter()]
    [ValidateSet('Raw', '.NET')]
    [string]
    $Method = 'Raw',

    [int]
    $InputCodepage = $null,

    [int]
    $OutputCodepage = $null
)

Add-Type -TypeDefinition @'
using Microsoft.Win32.SafeHandles;
using System;
using System.Runtime.InteropServices;

namespace RawConsole
{
    public class NativeMethods
    {
        [DllImport("Kernel32.dll")]
        public static extern int GetConsoleCP();

        [DllImport("Kernel32.dll")]
        public static extern int GetConsoleOutputCP();

        [DllImport("Kernel32.dll")]
        public static extern SafeFileHandle GetStdHandle(
            int nStdHandle);

        [DllImport("Kernel32.dll")]
        public static extern bool SetConsoleCP(
            int wCodePageID);

        [DllImport("Kernel32.dll")]
        public static extern bool SetConsoleOutputCP(
            int wCodePageID);
    }
}
'@

$origInputCP = [RawConsole.NativeMethods]::GetConsoleCP()
$origOutputCP = [RawConsole.NativeMethods]::GetConsoleOutputCP()

if ($InputCodepage) {
    [void][RawConsole.NativeMethods]::SetConsoleCP($InputCodepage)
}
if ($OutputCodepage) {
    [void][RawConsole.NativeMethods]::SetConsoleOutputCP($OutputCodepage)
}

try {    
    $outputBytes = [Convert]::FromBase64String($OutputData)
    $utf8NoBom = [Text.UTF8Encoding]::new($false)
    
    if ($Method -eq 'Raw') {
        $stdinHandle = [RawConsole.NativeMethods]::GetStdHandle(-10)
        $stdinFS = [IO.FileStream]::new($stdinHandle, 'Read')
    
        $stdoutHandle = [RawConsole.NativeMethods]::GetStdHandle(-11)
        $stdoutFS = [IO.FileStream]::new($stdoutHandle, 'Write')
    
        $inputRaw = [byte[]]::new(1024)
        $inputRead = $stdinFS.Read($inputRaw, 0, $inputRaw.Length)
        $outputFS = [IO.File]::Create($Path)
        $outputFS.Write($inputRaw, 0, $inputRead)
        $outputFS.Dispose()
        
        $stdoutFS.Write($outputBytes, 0, $outputBytes.Length)
    
        $stdinFS.Dispose()
        $stdinHandle.Dispose()
        
        $stdoutFS.Dispose()
        $stdoutHandle.Dispose()
    }
    elseif ($Method -eq '.NET') {
        $inputRaw = [Text.StringBuilder]::new()
        while ($true) {
            $char = [Console]::Read()
            if ($char -eq -1) {
                break
            }
    
            [void]$inputRaw.Append([char]$char)
        }
        [IO.File]::WriteAllText($Path, $inputRaw.ToString(), $utf8NoBom)
    
        $outputString = $utf8NoBom.GetString($outputBytes)
        [Console]::Write($outputString)
    }    
}
finally {
    [void][RawConsole.NativeMethods]::SetConsoleCP($origInputCP)
    [void][RawConsole.NativeMethods]::SetConsoleOutputCP($origOutputCP)
}

This script essentially captures the raw bytes that were sent to it and saves them to the path specified. It will also output the raw bytes the caller specifies as a base64 encoded string, to test PowerShell's behaviour when capturing stdout from a process. There are 2 modes of operation with this script:

  • Raw: Reads and writes the raw bytes on the stdin/stdout pipes of the process
  • .NET: Uses the .NET [Console] class to read and write data; this operates on text based on the console codepage

$OutputEncoding

This variable controls the encoding PowerShell uses when it pipes text to a native application, like "string" | my_application.exe. It does not control the encoding PowerShell uses when it reads output from a process; see [Console]::OutputEncoding further below. In pseudocode, what PowerShell does to run "string" | my_application.exe is essentially:

  • Create a pipe
  • Set the input data to $inputData = $OutputEncoding.GetBytes("string")
  • Create a new process and assign the stdin to the pipe created in the first step
  • Write $inputData to the pipe

This essentially means that $OutputEncoding should be set based on the encoding the native application expects on its stdin. That expectation is highly dependent on the native application itself, so the default in PowerShell may not work properly.
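The pseudocode steps can be sketched with the .NET `Process` class. This is an approximation for illustration only, not PowerShell's actual implementation. `findstr.exe string` is a stand-in for my_application.exe that echoes matching lines back, stdout is redirected only so the result can be inspected, and the trailing CRLF mirrors the `0D 0A` that appears after the piped string in the hex dumps in this section.

```powershell
# Step 2: encode the text with $OutputEncoding (PowerShell also appends a newline)
$inputData = $OutputEncoding.GetBytes("string`r`n")

# Steps 1 and 3: start the process with a pipe assigned to its stdin
$psi = [Diagnostics.ProcessStartInfo]::new('findstr.exe', 'string')
$psi.UseShellExecute = $false
$psi.RedirectStandardInput = $true
$psi.RedirectStandardOutput = $true   # Only so this sketch can show what came back
$proc = [Diagnostics.Process]::Start($psi)

# Step 4: write the raw bytes to the pipe, then close it to signal end of input
$stdin = $proc.StandardInput.BaseStream
$stdin.Write($inputData, 0, $inputData.Length)
$stdin.Close()

$result = $proc.StandardOutput.ReadToEnd()
$proc.WaitForExit()
$result.TrimEnd()
```

Writing to `BaseStream` rather than the `StandardInput` writer keeps control of the exact bytes sent, which is the point where `$OutputEncoding` matters.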

Here is an example of piping a string that is UTF-8 encoded.

[Console]::OutputEncoding = [Text.Encoding]::UTF8  # Used as a baseline for these tests
$OutputEncoding = [Text.UTF8Encoding]::new($false)  # Default in PowerShell 6+

$string = 'café'
$stringBytes = $OutputEncoding.GetBytes($string)
$stringB64 = [Convert]::ToBase64String($stringBytes)

# proc_io.ps1 is the script from above.
$output = $string | powershell.exe -NoLogo -File proc_io.ps1 -Path input -OutputData $stringB64

Format-Hex -Path input
$output

### Outputs the following

   Label: C:\temp\input

          Offset Bytes                                           Ascii
                 00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F
          ------ ----------------------------------------------- -----
0000000000000000 63 61 66 C3 A9 0D 0A                            caf�
café

The input file demonstrates that the piped string was UTF-8 encoded and that PowerShell was able to decode the stdout bytes as the UTF-8 string.

Trying the IBM 437 encoding mechanism, default on many en-US Windows PowerShell setups:

[Console]::OutputEncoding = [Text.Encoding]::UTF8  # Used as a baseline for these tests
$OutputEncoding = [Text.Encoding]::GetEncoding(437)

$string = 'café'
$stringBytes = $OutputEncoding.GetBytes($string)
$stringB64 = [Convert]::ToBase64String($stringBytes)

$output = $string | powershell.exe -NoLogo -File proc_io.ps1 -Path input -OutputData $stringB64

Format-Hex -Path input
$output

### Outputs the following

   Label: C:\temp\input

          Offset Bytes                                           Ascii
                 00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F
          ------ ----------------------------------------------- -----
0000000000000000 63 61 66 82 0D 0A                               caf���
caf�

In this case the input data was café in the IBM 437 encoding, but PowerShell did not use $OutputEncoding to decode the stdout bytes, so the captured string is wrong. This is a prime example of how PowerShell may not be able to correctly decode the data a native application outputs. So if $OutputEncoding does not control how PowerShell decodes the bytes it captures from native processes it starts, then what does?

[Console]::OutputEncoding

The [Console]::OutputEncoding value has a few purposes in PowerShell:

  • Controls the encoding that PowerShell uses when reading stdout from processes it spawns
  • Controls the output codepage of the console
    • This can be used by other processes to decide which encoding to use when writing to stdout
    • Setting it also calls the Win32 function SetConsoleOutputCP

Using the last example we can get PowerShell to capture the proper string:

$OutputEncoding = [Console]::OutputEncoding = [Text.Encoding]::GetEncoding(437)

$string = 'café'
$stringBytes = $OutputEncoding.GetBytes($string)
$stringB64 = [Convert]::ToBase64String($stringBytes)

$output = $string | powershell.exe -NoLogo -File proc_io.ps1 -Path input -OutputData $stringB64

Format-Hex -Path input
$output

### Outputs the following

   Label: C:\temp\input

          Offset Bytes                                           Ascii
                 00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F
          ------ ----------------------------------------------- -----
0000000000000000 63 61 66 82 0D 0A                               caf???
café

Sometimes applications call SetConsoleOutputCP themselves causing all sorts of strife when PowerShell goes to read the data. The only solution to this is to pre-empt that change and set [Console]::OutputEncoding to whatever the native application does so that PowerShell is able to decode the output.

# Start from a baseline, our console is running with IBM 437
$OutputEncoding = [Console]::InputEncoding = [Console]::OutputEncoding = [Text.Encoding]::GetEncoding(437)

# The desired output string must be UTF-8 as -Method .NET expects that
$string = 'café'
$stringBytes = [Text.UTF8Encoding]::new($false).GetBytes($string)
$stringB64 = [Convert]::ToBase64String($stringBytes)

$output = $string | powershell.exe -NoLogo -File proc_io.ps1 -Path input -OutputData $stringB64 -Method .NET -InputCode 65001 -OutputCodePage 65001

Format-Hex -Path input
$output

### Outputs the following

   Label: C:\temp\input

          Offset Bytes                                           Ascii
                 00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F
          ------ ----------------------------------------------- -----
0000000000000000 63 61 66 EF BF BD 0D 0A                         caf�??
caf├

In this case the native process that was invoked is setting the input and output console codepage to 65001 (UTF-8) but PowerShell is sending and receiving the data using IBM 437. The stdin the process received is no longer valid and the output it sent over stdout will be decoded incorrectly by PowerShell. In these cases the only thing that can be done is to set the input/output codepage and encoding PowerShell uses to the one the native application is setting it to:

# The native app is going to set the cp to 65001, make PowerShell do this before it does
$OutputEncoding = [Console]::InputEncoding = [Console]::OutputEncoding =  [Text.UTF8Encoding]::new($false)

$string = 'café'
$stringBytes = [Text.UTF8Encoding]::new($false).GetBytes($string)
$stringB64 = [Convert]::ToBase64String($stringBytes)

$output = $string | powershell.exe -NoLogo -File proc_io.ps1 -Path input -OutputData $stringB64 -Method .NET -InputCode 65001 -OutputCodePage 65001

Format-Hex -Path input
$output

### Outputs the following

   Label: C:\temp\input

          Offset Bytes                                           Ascii
                 00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F
          ------ ----------------------------------------------- -----
0000000000000000 63 61 66 C3 A9 0D 0A                            caf�
café

In this case the native app was able to read the correct text value through stdin and PowerShell is able to capture the stdout as a string properly.

[Console]::InputEncoding

The [Console]::InputEncoding value controls the console input codepage. Setting it corresponds to calling the Win32 function SetConsoleCP. It should match the $OutputEncoding preference variable, as having them set differently can lead to bad behaviour.

[Console]::OutputEncoding = [Text.Encoding]::UTF8  # Used as a baseline for these tests
[Console]::InputEncoding = [Text.Encoding]::GetEncoding(437)
$OutputEncoding = [Text.UTF8Encoding]::new($false)

$string = 'café'
$stringBytes = $OutputEncoding.GetBytes($string)
$stringB64 = [Convert]::ToBase64String($stringBytes)

# proc_io.ps1 is the script from above.
$output = $string | powershell.exe -NoLogo -File proc_io.ps1 -Path input -OutputData $stringB64 -Method .NET

Format-Hex -Path input
$output

### Outputs the following

   Label: C:\temp\input

          Offset Bytes                                           Ascii
                 00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F
          ------ ----------------------------------------------- -----
0000000000000000 63 61 66 E2 94 9C E2 8C 90 0D 0A                caf����
café

The output for stdout is fine but the input that was received from PowerShell over stdin is not correct. When the -Method .NET parameter is used the proc_io.ps1 script will read the stdin using [Console]::Read. This is essentially calling the same method as -Method Raw but it will automatically decode the bytes to a string using the value of the console input codepage ([Console]::InputEncoding). When the console input codepage does not match $OutputEncoding the string a native application reads may not be correct.

In this example café encoded with UTF-8 ($OutputEncoding) becomes the bytes 63 61 66 C3 A9 when it's sent to the native process. The native process reads those bytes and uses the console's input codepage to decode them. Because the input codepage is IBM 437, the bytes now become the string caf├⌐. When that native process goes to use that string it will no longer represent the original text value that was passed in. In the case of proc_io.ps1 it wrote the string to a file using UTF-8 encoding, so the text file contains caf├⌐ UTF-8 encoded.
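The chain of conversions described here can be replayed in-process to see where the bytes in the hex dump come from. This sketch only uses the .NET encoding classes and does not involve the console; the variable names are illustrative.

```powershell
$utf8 = [Text.UTF8Encoding]::new($false)
$cp437 = [Text.Encoding]::GetEncoding(437)

# Build 'café' from the code point of é (U+00E9)
$text = "caf$([char]0xE9)"

# 1. PowerShell encodes the piped string with $OutputEncoding (UTF-8): 63 61 66 C3 A9
$sent = $utf8.GetBytes($text)

# 2. The native process decodes stdin with the console input codepage (IBM 437)
$decoded = $cp437.GetString($sent)

# 3. proc_io.ps1 writes that string to the file as UTF-8
$written = $utf8.GetBytes($decoded)

($written | ForEach-Object { $_.ToString('X2') }) -join ' '
# 63 61 66 E2 94 9C E2 8C 90
```

The result matches the bytes before the trailing 0D 0A in the dump: C3 and A9 were each misread as a separate IBM 437 character (├ and ⌐), and each of those takes 3 bytes in UTF-8.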

This is why it's important to keep these 2 options in sync with each other. When setting either, it should be done like this:

$OutputEncoding = [Console]::InputEncoding = $newEncodingValue

This is why it is perplexing that PowerShell 6+ has set the default value for $OutputEncoding to UTF-8 no BOM but the console input codepage is still the OEM default. This behaviour causes trouble in very simple use cases like piping data to Python 3.

'café' | python.exe -c "import sys; data = sys.stdin.read(); print(data)"

# Outputs
# café

Windows PowerShell is unaffected by this problem as the default value for $OutputEncoding is set to the same value as [Console]::InputEncoding.
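One way to keep all 3 settings aligned for a single call, and put the console back afterwards, is a small wrapper function. This is a sketch under some assumptions: `Invoke-WithConsoleEncoding` is a hypothetical name, the function is assumed to be defined directly in the session (not inside a module) so that the script block sees the function's local `$OutputEncoding` through PowerShell's dynamic scoping, and a console must be attached for the `[Console]` setters to work.

```powershell
function Invoke-WithConsoleEncoding {
    [CmdletBinding()]
    param (
        [Parameter(Mandatory)]
        [Text.Encoding]
        $Encoding,

        [Parameter(Mandatory)]
        [scriptblock]
        $ScriptBlock
    )

    # The console codepages are shared state, so remember them to restore later
    $origInput = [Console]::InputEncoding
    $origOutput = [Console]::OutputEncoding
    try {
        # $OutputEncoding here is local to the function and disappears when it returns
        $OutputEncoding = [Console]::InputEncoding = [Console]::OutputEncoding = $Encoding
        & $ScriptBlock
    }
    finally {
        [Console]::InputEncoding = $origInput
        [Console]::OutputEncoding = $origOutput
    }
}
```

A call would look like `Invoke-WithConsoleEncoding -Encoding ([Text.UTF8Encoding]::new($false)) -ScriptBlock { 'café' | my_application.exe }`, where my_application.exe is a placeholder for a native application that expects UTF-8.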
