
@snnn
Last active May 12, 2023 23:38
encoding

This is a discussion. Below are my personal opinions; they do not represent the team. After reading the text, take only the facts it contains, then use your own judgment to decide what is best for you.

Assumptions

Before starting the discussion, let's make some assumptions.

Here we only consider the Windows and Linux operating systems. For Linux we only consider glibc as the C runtime.

We assume Linux applications only use UTF-8. On POSIX systems that use glibc, different processes can use different locales, which means they can use different encodings. And for decades most filesystems (like ext4) have been encoding neutral, so even within the same directory, different filenames can be encoded in different ways. The Linux kernel doesn't care about file path encodings; only userland applications do. glibc is a userspace library. It has file path manipulation functions such as dirname(3), but these functions only accept multibyte character strings and were not implemented with different code paths for different encodings. Hence, for simplicity, this discussion assumes Linux applications only use UTF-8.

We assume Windows applications do not use UTF-8 when interacting with Windows APIs, because Windows only recently added UTF-8 as a code page.

Source file encodings

Without any doubt, every C/C++ source file in this project should be encoded in UTF-8. The encoding only matters if the file contains non-ASCII characters. However, C/C++ compilers do not assume UTF-8. If a source file doesn't have a BOM at the beginning to indicate its encoding, a C/C++ compiler will choose the encoding based on the user's current locale settings. So on Windows the encoding could be CP936, ISO-8859-1, BIG5, or something else, but normally would not be UTF-8, and in most cases it would be wrong. Therefore, you should either add a BOM to the source files, or explicitly tell the compiler what the encoding is by passing a compiler flag such as MSVC's "/utf-8".

Multibyte Character Strings vs Wide Character strings

A multibyte character is a character composed of a sequence of one or more bytes. Each byte sequence represents a single character in the extended character set (source). For example, UTF-8 strings are multibyte character strings.

A wide character is either 16-bit or 32-bit. If it is 32-bit, it is big enough to store any character of any human language in this world, but it wastes too much memory. So 16-bit wide character strings are more common; code points that don't fit in 16 bits are then represented as pairs of wide characters (surrogate pairs).

Most Windows APIs come in two versions: an ANSI version (suffixed "A", which uses multibyte character strings) and a Unicode version (suffixed "W", which uses wide character strings).

File paths

In kernel mode:

  • Windows treats all file paths as UTF-16.
  • Linux treats all file paths as opaque byte buffers without an explicit encoding.

In user mode:

  • Linux: if you only use UTF-8, there is no problem.
  • Windows: if you only use UTF-16 (wide character strings), there is no problem. The ANSI versions of the APIs have many restrictions.