@NodeJSmith
Last active January 23, 2024 06:17
UnicodeEncodeError in Windows agent CI pipelines

Background

Running Python scripts in CI pipelines on Windows agents can sometimes cause UnicodeEncodeErrors that don't seem to make any sense. A simple command from a tool built with click, or a command such as dbx build or prefect build, may run fine on your local Windows machine but fail when running in a CI pipeline.

For example, running the below code:

# example.py
import click

@click.command(context_settings={"help_option_names": ["-h", "--help"]})
def main():
    """
    App description with Unicode ‣
    """
    pass

if __name__ == '__main__':
    main()

On a standard US Windows 10 machine, this (run with the -h flag) will output the following:

Usage: example.py [OPTIONS]

  App description with Unicode ‣

Options:
  -h, --help  Show this message and exit.

In an Azure DevOps or GitHub Actions CI pipeline, however, we see this:

Traceback (most recent call last):
  File "example.py", line 11, in <module>
    main()
  File "C:\hostedtoolcache\windows\Python\3.6.8\x64\lib\site-packages\click\core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "C:\hostedtoolcache\windows\Python\3.6.8\x64\lib\site-packages\click\core.py", line 1052, in main
    with self.make_context(prog_name, args, **extra) as ctx:
  File "C:\hostedtoolcache\windows\Python\3.6.8\x64\lib\site-packages\click\core.py", line 914, in make_context
    self.parse_args(ctx, args)
  File "C:\hostedtoolcache\windows\Python\3.6.8\x64\lib\site-packages\click\core.py", line 1370, in parse_args
    value, args = param.handle_parse_result(ctx, opts, args)
  File "C:\hostedtoolcache\windows\Python\3.6.8\x64\lib\site-packages\click\core.py", line 2347, in handle_parse_result
    value = self.process_value(ctx, value)
  File "C:\hostedtoolcache\windows\Python\3.6.8\x64\lib\site-packages\click\core.py", line 2309, in process_value
    value = self.callback(ctx, self, value)
  File "C:\hostedtoolcache\windows\Python\3.6.8\x64\lib\site-packages\click\core.py", line 1270, in show_help
    echo(ctx.get_help(), color=ctx.color)
  File "C:\hostedtoolcache\windows\Python\3.6.8\x64\lib\site-packages\click\utils.py", line 298, in echo
    file.write(out)  # type: ignore
  File "C:\hostedtoolcache\windows\Python\3.6.8\x64\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2023' in position 62: character maps to <undefined>
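
The failure does not depend on click or on the CI agent. As a minimal sketch (the helper name write_with_encoding is hypothetical, not part of any library), writing the same character through a text stream that uses cp1252 raises the same exception, while a UTF-8 stream accepts it:

```python
import io


def write_with_encoding(text, encoding):
    """Write text through a TextIOWrapper using the given encoding.

    Returns the encoded bytes, or raises UnicodeEncodeError if the
    encoding has no mapping for one of the characters.
    """
    buffer = io.BytesIO()
    stream = io.TextIOWrapper(buffer, encoding=encoding)
    stream.write(text)
    stream.flush()
    return buffer.getvalue()


if __name__ == "__main__":
    bullet = "\u2023"  # the TRIANGULAR BULLET from the docstring above
    print(write_with_encoding(bullet, "utf-8"))
    try:
        write_with_encoding(bullet, "cp1252")
    except UnicodeEncodeError as exc:
        print("cp1252 failed:", exc)
```

This mirrors what happens when sys.stdout is a text wrapper whose encoding is cp1252: the write itself is fine, the encoding step is what fails.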

Initial Research

I spent some time a few months ago looking into the cause of this but was unable to find anything. I then encountered it in a different context last week and found myself unable to leave it alone. Why would this be happening in what seems like the same environment - what is it about the Windows agent in these CI pipelines that breaks this simple code?

I tried to find more examples of this issue being reported, but it is not common - it seems that for the most part these issues either aren't seen or aren't reported. Or maybe everyone else understood what was happening and fixed it on their own, and I just had to spend a week doing research because I'm slow.

I started gathering data from the two environments, using a Windows agent CI pipeline and my own Windows 10 laptop. I printed out all of the information from the sys module that I thought would be helpful, and found that the results differed depending on where the command was run (note the sys.stdout.encoding value).

python -c "import sys; print('sys.stdout.encoding', sys.stdout.encoding); print('sys.flags',sys.flags);print('sys.getfilesystemencoding',sys.getfilesystemencoding())"

local Windows 10:

sys.stdout.encoding utf-8

sys.flags sys.flags(debug=0, inspect=0, interactive=0, optimize=0, dont_write_bytecode=0, no_user_site=0, no_site=0, ignore_environment=0, verbose=0, bytes_warning=0, quiet=0, hash_randomization=1, isolated=0, dev_mode=False, utf8_mode=0)

sys.getfilesystemencoding utf-8

CI pipeline:

sys.stdout.encoding cp1252

sys.flags sys.flags(debug=0, inspect=0, interactive=0, optimize=0, dont_write_bytecode=0, no_user_site=0, no_site=0, ignore_environment=0, verbose=0, bytes_warning=0, quiet=0, hash_randomization=1, isolated=0, dev_mode=False, utf8_mode=0)

sys.getfilesystemencoding utf-8

But I couldn't figure out why the output was different. I went through the Python-language parts of the standard library and even tried reading some of the C code, such as the Python initialization functions. I figured maybe the binaries were compiled differently, or there was some flag or environment variable on the Windows agent image that was causing this.

I tried checking the console code page with chcp, which only made things more confusing. The console code page was 65001, which should have meant that it would support Unicode, and it didn't match sys.stdout.encoding, which was cp1252!

I eventually reached out to the Python mailing list and got some useful information that ended up leading me down the correct path.

If stdout isn't a console (e.g. a pipe), it defaults to using the
process code page (i.e. CP_ACP), such as legacy code page 1252
(extended Latin-1). You can override just sys.std* to UTF-8 by setting
the environment variable `PYTHONIOENCODING=UTF-8`. You can override
all I/O to use UTF-8 by setting `PYTHONUTF8=1`, or by passing the
command-line option `-X utf8`.

Background

The locale system in Windows supports a common system locale, plus a
separate locale for each user. By default the process code page is
based on the system locale, and the thread code page (i.e.
CP_THREAD_ACP) is based on the user locale. The default locale of the
Universal C runtime combines the user locale with the process code
page. (This combination may be inconsistent.) ...

and

> any idea why this would be happening in this situation? AFAIK, stdout
> *is* a console when these images are running the python process.

If sys.std* are console files, then in Python 3.6+,
sys.std*.buffer.raw will be _io._WindowsConsoleIO. The latter presents
itself to Python code as a UTF-8 file stream, but internally it uses
UTF-16LE with the wide-character API functions ReadConsoleW() and
WriteConsoleW().

> is there a way I can check the locale and code page values that you
> mentioned? I assume I could call GetACP using ctypes, but maybe
> there is a simpler way?

io.TextIOWrapper uses locale.getpreferredencoding(False) as the
default encoding. Actually, in 3.11+ it uses locale.getencoding()
unless UTF-8 mode is enabled, which is effectively the same as
locale.getpreferredencoding(False). On Windows this calls GetACP() and
formats the result as "cp%u" (e.g. "cp1252").

The comment about sys.std*.buffer.raw turned out to be the key!

Findings

The cause of these issues and all of my confusion turned out to be very simple, which made the issue all the more frustrating.

While it appears that the Windows agent executes commands and outputs the results to the console, the same as you would on a local machine, that's not accurate. The CI agent, for all OSes, actually executes the commands and redirects the output to a file (presumably this works out better for the CI pipeline). This is easy to see when you look at sys.stdout.buffer.raw:

Local Windows machine:

$ python -c "import sys; print(sys.stdout.buffer.raw, '\n', sys.stdout.encoding)"
<_io._WindowsConsoleIO mode='wb' closefd=False>
utf-8

Windows CI pipeline:

$ python -c "import sys; print(sys.stdout.buffer.raw, '\n', sys.stdout.encoding)"
<_io.FileIO name='<stdout>' mode='wb' closefd=False>
cp1252

Local machine with redirect to file:

$ python -c "import sys; print(sys.stdout.buffer.raw, '\n', sys.stdout.encoding)" > test.txt && type test.txt
<_io.FileIO name='<stdout>' mode='wb' closefd=False>
cp1252
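
The same check can be wrapped in a small helper that works for any text stream. This is only a sketch: describe_stream is a hypothetical name, and the getattr fallbacks are there because some streams (for example ones replaced by a test runner) have no buffer or raw attribute.

```python
import sys


def describe_stream(stream):
    """Summarize how a text stream is backed and which encoding it uses."""
    buffer = getattr(stream, "buffer", None)
    # Buffered binary streams expose the underlying raw object as .raw;
    # fall back to the buffer itself when there is no such attribute.
    raw = getattr(buffer, "raw", buffer)
    return {
        "encoding": getattr(stream, "encoding", None),
        "raw_type": type(raw).__name__,
        "is_console": bool(stream.isatty()),
    }


if __name__ == "__main__":
    # On a Windows console raw_type is expected to be _WindowsConsoleIO;
    # when output is redirected it is expected to be FileIO.
    print(describe_stream(sys.stdout))
```

Running the script once directly and once with output redirected (for example with > test.txt) should show the two cases side by side.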

Because the output is redirected to a file, the console encoding is no longer relevant. Python uses locale.getpreferredencoding(False) to get the encoding to use for the file, which turns out to be, you guessed it, cp1252. You can also see this information with a simple PowerShell script:

Get-WinSystemLocale | Select-Object Name, DisplayName,
@{ n='OEMCP'; e={ $_.TextInfo.OemCodePage } },
@{ n='ACP';   e={ $_.TextInfo.AnsiCodePage } }

Name  DisplayName             OEMCP  ACP
----  -----------             -----  ---
en-US English (United States)   437 1252

This is the reason why the Windows agent shows the encoding of sys.stdout as cp1252 - this is the default code page for the Windows agent image.
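
The same values can also be read from Python. The sketch below is an illustration rather than a definitive tool: describe_default_encoding is a hypothetical name, the GetACP and GetOEMCP calls are made through ctypes and only on Windows, and locale.getencoding is used only where it exists (Python 3.11+, per the mailing list reply above).

```python
import locale
import sys


def describe_default_encoding():
    """Collect the encodings Python falls back to for redirected output."""
    info = {"preferred_encoding": locale.getpreferredencoding(False)}
    if hasattr(locale, "getencoding"):
        # Added in Python 3.11; not affected by UTF-8 mode.
        info["locale_encoding"] = locale.getencoding()
    if sys.platform == "win32":
        import ctypes

        kernel32 = ctypes.windll.kernel32
        # Process ANSI and OEM code pages, formatted the way Python names them.
        info["ansi_code_page"] = f"cp{kernel32.GetACP()}"
        info["oem_code_page"] = f"cp{kernel32.GetOEMCP()}"
    return info


if __name__ == "__main__":
    for name, value in describe_default_encoding().items():
        print(name, value)
```

On an image like the one above, the ANSI code page entry would be expected to match the cp1252 reported by sys.stdout.encoding.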

The fact that these exceptions are caused by the command output being redirected to a file also explains why these issues do not often appear in the wild: most people do not run Python commands in the console and redirect the results to a file. Sure, this undoubtedly happens, but probably not as often with the kind of CLI tools that make heavy use of Unicode, like dbx, Prefect, etc. And when redirecting output from the console to a file on macOS or Linux, the default encoding is typically UTF-8 already, so no exceptions are raised.

Solutions

There are a few solutions that can be used to resolve this, but they all have to happen in the pipeline, and cannot be implemented by Python libraries/packages (so far as I can tell, at least).

  • You can add the environment variable 'PYTHONUTF8' to your pipeline with the value of 1
  • You can add the environment variable 'PYTHONIOENCODING' to your pipeline with the value of 'utf8'
  • You can call python/python.exe with the argument -X utf8 before your script (e.g. python -X utf8 ./path/to/python_file.py)

Note: If using an environment variable you will need to set this on the pipeline itself, not in a run command.
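
To check locally that each option has the intended effect, one approach is to start a child interpreter with its stdout captured through a pipe (so it is not a console) and ask it for sys.stdout.encoding. This is a sketch under assumptions: child_stdout_encoding is a hypothetical helper, it needs Python 3.7+, and it strips any existing PYTHONUTF8 and PYTHONIOENCODING values so that only the setting under test applies.

```python
import codecs
import os
import subprocess
import sys


def child_stdout_encoding(extra_env=None, extra_args=()):
    """Return the normalized stdout encoding of a child Python process."""
    env = {
        key: value
        for key, value in os.environ.items()
        if key not in ("PYTHONUTF8", "PYTHONIOENCODING")
    }
    env.update(extra_env or {})
    command = [
        sys.executable,
        *extra_args,
        "-c",
        "import sys; print(sys.stdout.encoding)",
    ]
    result = subprocess.run(
        command, env=env, capture_output=True, text=True, check=True
    )
    # Normalize aliases such as "utf8" and "UTF-8" to one canonical name.
    return codecs.lookup(result.stdout.strip()).name


if __name__ == "__main__":
    print("default:", child_stdout_encoding())
    print("PYTHONUTF8=1:", child_stdout_encoding({"PYTHONUTF8": "1"}))
    print("PYTHONIOENCODING=utf8:", child_stdout_encoding({"PYTHONIOENCODING": "utf8"}))
    print("-X utf8:", child_stdout_encoding(extra_args=("-X", "utf8")))
```

On a Windows machine with a 1252 system locale, the default line would be expected to report cp1252, while the other three should report utf-8.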

Azure DevOps example:

variables:
- name: PYTHONUTF8
  value: 1

GitHub Actions example:

env:
  PYTHONIOENCODING: "utf8"

This could also be resolved if the Windows agents did not redirect the output of commands to a file, or if Windows had a default code page that was Unicode-compatible. I believe the latter may be happening in Windows 11, although it seems to be introducing problems of its own.

@kvnkho

kvnkho commented Mar 16, 2023

I have a small Prefect job I'm doing and I ran into this. I owe you a coffee lol! You have a Venmo or something? My e-mail is in my profile. This saved me hours of debugging.

@NodeJSmith
Author

@kvnkho No need for any of that, I'm honestly just happy that my literal hours of debugging this actually helped someone save time!

I wrote it up here with all of the error codes and everything so that it would hopefully end up at the top of the Google results when someone hit the same errors I had. Sounds like it worked and my frustrations were not in vain!

@basnijholt

Wow, happy I found this! Didn't have a Windows machine where I could reproduce this, so debugging this was very painful.
