This guide covers console encoding: how PowerShell interacts with the console and how it sends data to, and receives data from, child processes. Several layers of processing are involved when outputting data to the console in PowerShell. Problems can occur at any layer, so understanding each one is important when troubleshooting encoding issues.
The first layer is dealing with how text is represented at the raw byte level.
Understanding this is important because processes communicate with each other using bytes and data sent/received from the console are also done through bytes.
Computers use encoding schemes to map a character to a numeric value, and the raw bytes that represent the text can differ widely between encoding schemes.
For example, the string `café` is encoded as the following bytes in these encoding schemes:
Encoding | Bytes (as decimal) |
---|---|
ASCII | 99, 97, 102, 63 |
UTF-8 | 99, 97, 102, 195, 169 |
IBM 437 | 99, 97, 102, 130 |
Windows 1257 | 99, 97, 102, 233 |
Basic Latin alphabet characters are the same across these encoding schemes (this is the norm rather than a hard rule) but the schemes start to differ when it comes to `é`:
- ASCII does not have any mapping for this character so it becomes `?` (63 in ASCII)
- UTF-8 represents it using 2 bytes (195, 169)
- The remaining 2 each map the character to a different single byte
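The table above can be reproduced directly. The following is a minimal sketch, assuming a PowerShell session where these code pages are available; the numeric IDs (20127, 65001, 437, 1257) are the .NET code page numbers for the four encodings listed, and the `$encodings` and `$table` names are only illustrative. The string is built from a code point so the result does not depend on how the script file itself is saved.

```powershell
# 'café' built from a code point so the script file's own encoding is irrelevant
$text = "caf$([char]0xE9)"

# Display name -> .NET code page number (illustrative mapping for the table above)
$encodings = [ordered]@{
    'ASCII'        = 20127
    'UTF-8'        = 65001
    'IBM 437'      = 437
    'Windows 1257' = 1257
}

$table = foreach ($entry in $encodings.GetEnumerator()) {
    $bytes = [Text.Encoding]::GetEncoding($entry.Value).GetBytes($text)
    [PSCustomObject]@{
        Encoding = $entry.Key
        Bytes    = $bytes -join ', '
    }
}
$table
```

Swapping in a different string or code page is a quick way to see where two encodings agree and where they diverge.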
This is why it's important that the encoding scheme that is used to encode a string is the same one that is used to decode bytes.
For example, `café` encoded with UTF-8 but decoded with IBM 437 results in the string `caf├⌐`, which is incorrect.

```powershell
[Text.Encoding]::GetEncoding(437).GetString([Text.Encoding]::UTF8.GetBytes('café'))
# caf├⌐
```
The following commands can be used to play around with encoding and decoding strings to and from bytes in PowerShell:
```powershell
# Use this to list the various encodings supported by .NET and their names
[Text.Encoding]::GetEncodings() | ForEach-Object -Process {
    $encoding = $_.GetEncoding()
    [PSCustomObject]@{
        Name = $encoding.WebName
        CodePage = $encoding.CodePage
    }
}

# Use either the Name or CodePage property from the above in GetEncoding()
[Text.Encoding]::GetEncoding('utf-8').GetBytes('string value')

# To convert back from bytes you can do
[Text.Encoding]::GetEncoding('utf-8').GetString([byte[]]@(99, 97, 102, 195, 169))
```
The second layer is understanding how native processes communicate with each other and the boundary between a process and the console. On both Windows and Unix hosts, processes send and receive data using 3 pipes:
- `stdout`: Used by applications to write normal output data
- `stderr`: Used by applications to write data that typically contains error details; there's no strict rule, so it can contain anything the application wants to write to it
- `stdin`: Used to read data sent to the application
These pipes transfer data as bytes, so a process that wants to output text needs to encode it using an encoding scheme, and whatever reads the other end needs to decode those bytes back into text. This can either be done manually by the process or by some helper operations and is highly dependent on the environment the code is running in.
There are 3 main scenarios on Windows that control how a process gets assigned the 3 stdio pipes:
- When a console process is spawned with a new console, the stdio pipes that are associated with that console are assigned to the process
  - This is like starting `powershell.exe` from the taskbar or start menu
  - Any data that it outputs is sent to the console to process/display
  - Any data entered by the user in the console is sent to the process through the `stdin` pipe
- When a console process spawns a new process without stating a new console is required then the existing handles are inherited
  - This is like `cmd.exe` starting `powershell.exe`, the PowerShell process runs in the same console window
  - Anything the new child process outputs is sent to the same `stdout/stderr` of the parent
- When a new process is created with explicit pipes for some or all of the stdio pipes, those stdio handles will be redirected to the pipes specified
  - This is common when you deal with redirection to a file or are trying to capture output in PowerShell
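The third scenario can be sketched with the .NET `System.Diagnostics.Process` class, which is one way to start a process with explicit pipes. This is only an illustration and not a description of what PowerShell does internally; it assumes the currently running PowerShell executable can be started as a child process, and the variable names are hypothetical.

```powershell
# Use the PowerShell executable running this code as a convenient child process
$psi = New-Object -TypeName System.Diagnostics.ProcessStartInfo
$psi.FileName = (Get-Process -Id $PID).Path
$psi.Arguments = '-NoLogo -NoProfile -NonInteractive -Command "Write-Output hello"'
$psi.UseShellExecute = $false        # required before any stdio handle can be redirected
$psi.RedirectStandardInput = $true   # stdin becomes a pipe owned by this process
$psi.RedirectStandardOutput = $true  # stdout becomes a pipe instead of the console
# stderr is not redirected, so the child inherits the parent's stderr handle

$proc = [System.Diagnostics.Process]::Start($psi)
$proc.StandardInput.BaseStream.Dispose()      # nothing to send, close stdin straight away
$captured = $proc.StandardOutput.ReadToEnd()  # bytes from the pipe decoded to text by .NET
$proc.WaitForExit()

$captured.Trim()
```

The text in `$captured` has already been through a bytes to text conversion, which is where the encoding choices discussed in this guide come into play.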
GUI based applications on Windows are not guaranteed to have stdio pipes but they can be assigned explicitly if needed. Typically this doesn't matter as GUIs use visual elements to convey data rather than the stdio pipes that console applications use. Unix hosts behave similarly to Windows, although the mechanisms for assigning the pipes differ.
In traditional shells, when you pipe data you are connecting the `stdout` and `stdin` pipes of each process.
This is usually achieved by creating a pipe where the write end is the `stdout` of the first process and the read end is the `stdin` of the second process.
For example, if I were to run `ipconfig | findstr data` I am joining the `stdout` of `ipconfig` to the `stdin` of `findstr`, allowing `findstr` to find the string I want in the output of `ipconfig`.
```
                                         | -> <stdout> -> | <stdin> -> findstr -> <stdout> -> | <console stdout>
<console stdin> -> | <stdin> -> ipconfig |                                                    |
                                         | -> <stderr> -> | --------------------> <stderr> -> | <console stderr>
```
Stdio pipes aren't limited to text. Binary data can be exchanged as well, like `wget -O - http://mysite.com/my.tar.gz | tar -xzvf -`, without it ever having to touch the disk as a temporary file.
PowerShell adds a twist to this due to a fundamental difference in how commands work.
When running something like `Get-ChildItem | Format-Table` these commands are exchanging rich .NET objects between code that runs in the same process.
The pipeline bridge between the 2 commands is not the `stdout` to `stdin` pipe seen in other shells but rather something internal to PowerShell.
When running commands/applications in PowerShell like `ipconfig | Select-String data` or `Get-Item | findstr`, PowerShell needs to convert between those rich objects and the bytes sent/received on the `stdio` pipes of the newly spawned native processes.
One unfortunate aspect of PowerShell is that even when piping data between multiple native commands like `ipconfig | findstr`, the data is still converted to and from text and exchanged across the PowerShell pipeline.
At best this creates unnecessary overhead but at worst it can break pipelines, e.g. the `wget .. | tar` example won't work because the binary data is corrupted in the bytes <-> text conversion that PowerShell does.
PowerShell's mechanism is to first convert the .NET object to a string and then encode that string to bytes to send to the process.
Any output it receives on the `stdout` pipe is then decoded back to a string and then output on the PowerShell pipeline.
What can go wrong here is when PowerShell uses a different encoding mechanism for the bytes <-> text conversion than what was used to encode/decode on the other end.
The encoding scheme used for this operation is controlled by the `$OutputEncoding` preference variable which is explained below.
The final layer of interaction with a process is the console (or terminal). The console is the ultimate destination of output from a process and the way to send input, through keystrokes, into a process. It's the interface between a command line application and the end user.
This is the part where Windows and Unix differ from each other. On Unix, the console/terminal application is opened specifically and it itself spawns a shell to run. These pipes transfer raw bytes so the encoding scheme is highly dependent on the
TODO: Fill this out TODO: Talk about fonts and why unicode beyond the BMP will not work in conhost
The Windows Command Line blog has a nice series of articles on the history of the command line in Windows which is a great source of information about this topic.
There are 3 configuration options in PowerShell that are related to encoding for the console and native process input/output:
Item | Windows PowerShell Default | PowerShell Default |
---|---|---|
`$OutputEncoding` | ASCIIEncoding TODO: see if it's extended ASCII/OEM | UTF8Encoding (no BOM) |
`[Console]::OutputEncoding` | Dependent on system settings | Dependent on system settings |
`[Console]::InputEncoding` | Dependent on system settings | Dependent on system settings |
While not strictly the same, `$OutputEncoding` and `[Console]::InputEncoding` are tied together and in most cases should be set to the same value.
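To see what the three settings currently are in a session, something like the following can be used. It is a small sketch; the `$settings` and `$report` names are only illustrative and the values it prints depend entirely on the host and system configuration.

```powershell
# Collect the three encoding-related settings side by side
$settings = [ordered]@{
    '$OutputEncoding'           = $OutputEncoding
    '[Console]::OutputEncoding' = [Console]::OutputEncoding
    '[Console]::InputEncoding'  = [Console]::InputEncoding
}

$report = foreach ($entry in $settings.GetEnumerator()) {
    [PSCustomObject]@{
        Item     = $entry.Key
        Type     = $entry.Value.GetType().Name
        Name     = $entry.Value.WebName
        CodePage = $entry.Value.CodePage
    }
}
$report
```

Comparing the `CodePage` column for `$OutputEncoding` and `[Console]::InputEncoding` is a quick way to spot the mismatch discussed later in this guide.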
TODO: Add info on the default system settings for the Console encoding properties.
The following tests use this PowerShell script. Save it to a file called `proc_io.ps1` somewhere on your host.
```powershell
[CmdletBinding()]
param (
    [Parameter(Mandatory)]
    [string]
    $Path,  # The file the script should write the raw stdin bytes to.

    [Parameter(Mandatory)]
    [string]
    $OutputData,  # Base64 encoded bytes that the script should output to the stdout pipe.

    # Raw = raw FileStream read and write with bytes
    # .NET = [Console]::Read and [Console]::Write ($Path and $OutputData are treated as UTF-8)
    [Parameter()]
    [ValidateSet('Raw', '.NET')]
    [string]
    $Method = 'Raw',

    [int]
    $InputCodepage = $null,

    [int]
    $OutputCodepage = $null
)

Add-Type -TypeDefinition @'
using Microsoft.Win32.SafeHandles;
using System;
using System.Runtime.InteropServices;

namespace RawConsole
{
    public class NativeMethods
    {
        [DllImport("Kernel32.dll")]
        public static extern int GetConsoleCP();

        [DllImport("Kernel32.dll")]
        public static extern int GetConsoleOutputCP();

        [DllImport("Kernel32.dll")]
        public static extern SafeFileHandle GetStdHandle(
            int nStdHandle);

        [DllImport("Kernel32.dll")]
        public static extern bool SetConsoleCP(
            int wCodePageID);

        [DllImport("Kernel32.dll")]
        public static extern bool SetConsoleOutputCP(
            int wCodePageID);
    }
}
'@

# Remember the original console codepages so they can be restored at the end
$origInputCP = [RawConsole.NativeMethods]::GetConsoleCP()
$origOutputCP = [RawConsole.NativeMethods]::GetConsoleOutputCP()

if ($InputCodepage) {
    [void][RawConsole.NativeMethods]::SetConsoleCP($InputCodepage)
}
if ($OutputCodepage) {
    [void][RawConsole.NativeMethods]::SetConsoleOutputCP($OutputCodepage)
}

try {
    $outputBytes = [Convert]::FromBase64String($OutputData)
    $utf8NoBom = [Text.UTF8Encoding]::new($false)

    if ($Method -eq 'Raw') {
        # -10 is STD_INPUT_HANDLE, -11 is STD_OUTPUT_HANDLE
        $stdinHandle = [RawConsole.NativeMethods]::GetStdHandle(-10)
        $stdinFS = [IO.FileStream]::new($stdinHandle, 'Read')

        $stdoutHandle = [RawConsole.NativeMethods]::GetStdHandle(-11)
        $stdoutFS = [IO.FileStream]::new($stdoutHandle, 'Write')

        $inputRaw = [byte[]]::new(1024)
        $inputRead = $stdinFS.Read($inputRaw, 0, $inputRaw.Length)

        $outputFS = [IO.File]::Create($Path)
        $outputFS.Write($inputRaw, 0, $inputRead)
        $outputFS.Dispose()

        $stdoutFS.Write($outputBytes, 0, $outputBytes.Length)

        $stdinFS.Dispose()
        $stdinHandle.Dispose()
        $stdoutFS.Dispose()
        $stdoutHandle.Dispose()
    }
    elseif ($Method -eq '.NET') {
        $inputRaw = [Text.StringBuilder]::new()
        while ($true) {
            $char = [Console]::Read()
            if ($char -eq -1) {
                break
            }
            [void]$inputRaw.Append([char]$char)
        }
        [IO.File]::WriteAllText($Path, $inputRaw.ToString(), $utf8NoBom)

        $outputString = $utf8NoBom.GetString($outputBytes)
        [Console]::Write($outputString)
    }
}
finally {
    [void][RawConsole.NativeMethods]::SetConsoleCP($origInputCP)
    [void][RawConsole.NativeMethods]::SetConsoleOutputCP($origOutputCP)
}
```
This script captures the raw bytes that were sent to it and saves them to the path specified.
It also outputs the raw bytes the caller specifies as a base64 encoded string, to test PowerShell's behaviour when capturing `stdout` from a process.
There are 2 modes of operation with this script:
- `Raw`: Reads and writes the raw bytes on the `stdin/stdout` pipes of the process
- `.NET`: Uses the .NET `[Console]` class to read and write data, this operates on text based on the console codepage
The `$OutputEncoding` preference variable controls the encoding PowerShell uses when it pipes text to a native application like `"string" | my_application.exe`.
It does not control the encoding PowerShell uses when it reads output from a process, see `[Console]::OutputEncoding` further below.
The pseudocode for what PowerShell does when it runs `"string" | my_application.exe` is essentially:
- Create a pipe
- Set the input data to `$inputData = $OutputEncoding.GetBytes("string")`
- Create a new process and assign the stdin to the pipe created in the first step
- Write `$inputData` to the pipe
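The steps above can be approximated with the .NET `System.Diagnostics.Process` class. This is a hedged sketch and not PowerShell's actual implementation: it assumes the currently running PowerShell executable can stand in for `my_application.exe` (echoing its stdin back through `$input`), and the variable names are hypothetical.

```powershell
# Step 2: encode the text with $OutputEncoding (a newline is appended, as seen in the hex dumps)
$inputData = $OutputEncoding.GetBytes('string' + [Environment]::NewLine)

# Steps 1 and 3: .NET creates the pipe and attaches it to the child's stdin
$psi = New-Object -TypeName System.Diagnostics.ProcessStartInfo
$psi.FileName = (Get-Process -Id $PID).Path
$psi.Arguments = '-NoLogo -NoProfile -NonInteractive -Command "$input"'
$psi.UseShellExecute = $false
$psi.RedirectStandardInput = $true
$psi.RedirectStandardOutput = $true
$proc = [System.Diagnostics.Process]::Start($psi)

# Step 4: write the raw bytes, then close the pipe so the child sees end of input
$stdin = $proc.StandardInput.BaseStream
$stdin.Write($inputData, 0, $inputData.Length)
$stdin.Dispose()

$received = $proc.StandardOutput.ReadToEnd()
$proc.WaitForExit()
$received.Trim()
```

Writing to `BaseStream` rather than the `StandardInput` writer keeps the example at the byte level, so the only text to bytes conversion on the sending side is the explicit `$OutputEncoding.GetBytes()` call.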
This essentially means that `$OutputEncoding` should be modified based on what encoding the native application is expecting.
This behaviour is highly dependent on the native application itself so the default in PowerShell may not work properly.
Here is an example of piping a string that is UTF-8 encoded.
```powershell
[Console]::OutputEncoding = [Text.Encoding]::UTF8  # Used as a baseline for these tests
$OutputEncoding = [Text.UTF8Encoding]::new($false)  # Default in PowerShell 6+

$string = 'café'
$stringBytes = $OutputEncoding.GetBytes($string)
$stringB64 = [Convert]::ToBase64String($stringBytes)

# proc_io.ps1 is the script from above.
$output = $string | powershell.exe -NoLogo -File proc_io.ps1 -Path input -OutputData $stringB64
Format-Hex -Path input
$output

### Outputs the following
   Label: C:\temp\input

          Offset Bytes                                           Ascii
                 00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F
          ------ ----------------------------------------------- -----
0000000000000000 63 61 66 C3 A9 0D 0A                            caf�

café
```
The `input` file demonstrates that the piped string was UTF-8 encoded and that PowerShell was able to decode the stdout bytes as the UTF-8 string.
Trying the IBM 437 encoding mechanism, default on many en-US Windows PowerShell setups:
```powershell
[Console]::OutputEncoding = [Text.Encoding]::UTF8  # Used as a baseline for these tests
$OutputEncoding = [Text.Encoding]::GetEncoding(437)

$string = 'café'
$stringBytes = $OutputEncoding.GetBytes($string)
$stringB64 = [Convert]::ToBase64String($stringBytes)

$output = $string | powershell.exe -NoLogo -File proc_io.ps1 -Path input -OutputData $stringB64
Format-Hex -Path input
$output

### Outputs the following
   Label: C:\temp\input

          Offset Bytes                                           Ascii
                 00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F
          ------ ----------------------------------------------- -----
0000000000000000 63 61 66 82 0D 0A                               caf���

caf�
```
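In this case the input data was `café` in the IBM 437 encoding, as the `82` byte in the `input` file shows, but PowerShell did not use `$OutputEncoding` to decode the `stdout` bytes. The same `82` byte came back over `stdout`, was decoded as UTF-8 (the `[Console]::OutputEncoding` baseline set in the example), and was replaced with `�` because it is not a valid UTF-8 sequence on its own.
This is a prime example of how PowerShell may not be able to correctly decode the data a native application outputs.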
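The decoding failure can be reproduced without starting any process. The snippet below is a minimal sketch using only the .NET encoding classes; the variable names are illustrative.

```powershell
# The bytes the child process wrote to stdout in the IBM 437 example: c, a, f, 0x82
$stdoutBytes = [byte[]]@(0x63, 0x61, 0x66, 0x82)

# Decoded as UTF-8, the lone 0x82 byte is invalid and is replaced with U+FFFD
$asUtf8 = [Text.Encoding]::UTF8.GetString($stdoutBytes)

# Decoded with the encoding the bytes were actually written in
$as437 = [Text.Encoding]::GetEncoding(437).GetString($stdoutBytes)

'UTF-8  : {0} (last char U+{1:X4})' -f $asUtf8, [int]$asUtf8[-1]
'IBM 437: {0} (last char U+{1:X4})' -f $as437, [int]$as437[-1]
```

The last character is `U+FFFD` (the replacement character) for the UTF-8 decode and `U+00E9` (`é`) for the IBM 437 decode.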
So if `$OutputEncoding` does not control how PowerShell decodes the bytes it captures from the native processes it starts, then what does?
The `[Console]::OutputEncoding` value has a few purposes in PowerShell:
- Controls the encoding that PowerShell uses when reading `stdout` from processes it spawns
- Controls the output codepage of the console
  - This can be used by other processes to control what encoding they use when writing to `stdout`
  - It also calls the Win32 function `SetConsoleOutputCP`
Using the last example we can get PowerShell to capture the proper string:
```powershell
$OutputEncoding = [Console]::OutputEncoding = [Text.Encoding]::GetEncoding(437)

$string = 'café'
$stringBytes = $OutputEncoding.GetBytes($string)
$stringB64 = [Convert]::ToBase64String($stringBytes)

$output = $string | powershell.exe -NoLogo -File proc_io.ps1 -Path input -OutputData $stringB64
Format-Hex -Path input
$output

### Outputs the following
   Label: C:\temp\input

          Offset Bytes                                           Ascii
                 00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F
          ------ ----------------------------------------------- -----
0000000000000000 63 61 66 82 0D 0A                               caf???

café
```
Sometimes applications call `SetConsoleOutputCP` themselves, causing all sorts of strife when PowerShell goes to read the data.
The only solution to this is to pre-empt that change and set `[Console]::OutputEncoding` to whatever the native application sets it to so that PowerShell is able to decode the output.
```powershell
# Start from a baseline, our console is running with IBM 437
$OutputEncoding = [Console]::InputEncoding = [Console]::OutputEncoding = [Text.Encoding]::GetEncoding(437)

# The desired output string must be UTF-8 as -Method .NET expects that
$string = 'café'
$stringBytes = [Text.UTF8Encoding]::new($false).GetBytes($string)
$stringB64 = [Convert]::ToBase64String($stringBytes)

$output = $string | powershell.exe -NoLogo -File proc_io.ps1 -Path input -OutputData $stringB64 -Method .NET -InputCode 65001 -OutputCodePage 65001
Format-Hex -Path input
$output

### Outputs the following
   Label: C:\temp\input

          Offset Bytes                                           Ascii
                 00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F
          ------ ----------------------------------------------- -----
0000000000000000 63 61 66 EF BF BD 0D 0A                         caf�??

caf├
```
In this case the native process that was invoked sets the input and output console codepage to 65001 (UTF-8) but PowerShell is sending and receiving the data using IBM 437.
The `stdin` the process received is no longer valid and the output it sent over `stdout` will be decoded incorrectly by PowerShell.
In these cases the only thing that can be done is to set the input/output codepage and encoding PowerShell uses to the one the native application sets:
```powershell
# The native app is going to set the cp to 65001, make PowerShell do this before it does
$OutputEncoding = [Console]::InputEncoding = [Console]::OutputEncoding = [Text.UTF8Encoding]::new($false)

$string = 'café'
$stringBytes = [Text.UTF8Encoding]::new($false).GetBytes($string)
$stringB64 = [Convert]::ToBase64String($stringBytes)

$output = $string | powershell.exe -NoLogo -File proc_io.ps1 -Path input -OutputData $stringB64 -Method .NET -InputCode 65001 -OutputCodePage 65001
Format-Hex -Path input
$output

### Outputs the following
   Label: C:\temp\input

          Offset Bytes                                           Ascii
                 00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F
          ------ ----------------------------------------------- -----
0000000000000000 63 61 66 C3 A9 0D 0A                            caf�

café
```
In this case the native app was able to read the correct text value through `stdin` and PowerShell was able to capture the `stdout` as a string properly.
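Because these settings are session wide, it can be convenient to change them only for the duration of a single call and restore them afterwards. The following is a sketch of that idea; `Invoke-WithEncoding` is a hypothetical helper written for this guide, not a built-in command. It assumes a console is attached to the process and that `$OutputEncoding` has not been overridden in a more local scope than the global one.

```powershell
function Invoke-WithEncoding {
    [CmdletBinding()]
    param (
        [Parameter(Mandatory)]
        [System.Text.Encoding]
        $Encoding,

        [Parameter(Mandatory)]
        [scriptblock]
        $ScriptBlock
    )

    # Remember the current values so they can be restored afterwards
    $origPreference = $global:OutputEncoding
    $origConsoleIn = [Console]::InputEncoding
    $origConsoleOut = [Console]::OutputEncoding
    try {
        $global:OutputEncoding = $Encoding
        [Console]::InputEncoding = $Encoding
        [Console]::OutputEncoding = $Encoding
        & $ScriptBlock
    }
    finally {
        $global:OutputEncoding = $origPreference
        [Console]::InputEncoding = $origConsoleIn
        [Console]::OutputEncoding = $origConsoleOut
    }
}

# Example: run a block with all three settings switched to UTF-8 (no BOM)
$utf8NoBom = New-Object -TypeName System.Text.UTF8Encoding -ArgumentList $false
$result = Invoke-WithEncoding -Encoding $utf8NoBom -ScriptBlock {
    $OutputEncoding.WebName
}
$result
```

In practice the script block would contain the native command pipeline, for example the `proc_io.ps1` invocation used above, so that the matching encodings are only in effect while that command runs.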
The `[Console]::InputEncoding` value controls the console input codepage.
Setting it corresponds to calling the Win32 function `SetConsoleCP`.
It should match the `$OutputEncoding` preference variable, as having them set to different values can lead to bad behaviour.
```powershell
[Console]::OutputEncoding = [Text.Encoding]::UTF8  # Used as a baseline for these tests
[Console]::InputEncoding = [Text.Encoding]::GetEncoding(437)
$OutputEncoding = [Text.UTF8Encoding]::new($false)

$string = 'café'
$stringBytes = $OutputEncoding.GetBytes($string)
$stringB64 = [Convert]::ToBase64String($stringBytes)

# proc_io.ps1 is the script from above.
$output = $string | powershell.exe -NoLogo -File proc_io.ps1 -Path input -OutputData $stringB64 -Method .NET
Format-Hex -Path input
$output

### Outputs the following
   Label: C:\temp\input

          Offset Bytes                                           Ascii
                 00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F
          ------ ----------------------------------------------- -----
0000000000000000 63 61 66 E2 94 9C E2 8C 90 0D 0A                caf����

café
```
The output for `stdout` is fine but the input that was received from PowerShell over `stdin` is not correct.
When the `-Method .NET` parameter is used the `proc_io.ps1` script reads `stdin` using `[Console]::Read`.
This reads the same bytes as `-Method Raw` but automatically decodes them to a string using the console input codepage (`[Console]::InputEncoding`).
When the console input codepage does not match `$OutputEncoding` the string a native application reads may not be correct.
In this example `café` encoded with UTF-8 (`$OutputEncoding`) becomes `63 61 66 C3 A9` when it's sent to the native process.
The native process reads those bytes and uses the console's input codepage to decode them.
Because the input codepage is IBM 437 the bytes become the string `caf├⌐`.
When that native process goes to use that string it no longer represents the original text value that was passed in.
In the case of `proc_io.ps1` it wrote the string to a file using UTF-8 encoding so the text file contains `caf├⌐` UTF-8 encoded.
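The chain of conversions described above can be replayed with the .NET encoding classes alone. This is a small worked example with illustrative variable names; it reproduces the bytes seen in the `input` file, minus the trailing newline.

```powershell
$utf8NoBom = New-Object -TypeName System.Text.UTF8Encoding -ArgumentList $false
$ibm437 = [Text.Encoding]::GetEncoding(437)

$text = "caf$([char]0xE9)"                     # café
$sentBytes = $utf8NoBom.GetBytes($text)        # what PowerShell writes to stdin ($OutputEncoding)
$readText = $ibm437.GetString($sentBytes)      # what the child decodes (console input codepage)
$savedBytes = $utf8NoBom.GetBytes($readText)   # what the child writes to the file as UTF-8

$sentHex = [BitConverter]::ToString($sentBytes).Replace('-', ' ')
$savedHex = [BitConverter]::ToString($savedBytes).Replace('-', ' ')
"sent : $sentHex"
"saved: $savedHex"
# sent : 63 61 66 C3 A9
# saved: 63 61 66 E2 94 9C E2 8C 90
```

The two bytes `C3 A9` are each decoded as a separate IBM 437 character, and each of those characters then needs 3 bytes in UTF-8, which is why the saved file is longer than what was sent.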
This is why it's important to keep these 2 options in sync with each other. When setting either, it should be done like:

```powershell
$OutputEncoding = [Console]::InputEncoding = $newEncodingValue
```
This is why it is perplexing that PowerShell 6+ sets the default value for `$OutputEncoding` to UTF-8 without a BOM while the console input codepage is still the OEM default.
This behaviour causes trouble in very simple use cases like piping data to Python 3.
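```powershell
'café' | python.exe -c "import sys; data = sys.stdin.read(); print(data)"
# Typically outputs mojibake instead of café because Python decodes
# the UTF-8 bytes on stdin with a different encoding
```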
Windows PowerShell is unaffected by this problem as the default value for `$OutputEncoding` is set to the same value as `[Console]::InputEncoding`.