@ChrisDenton
Last active June 2, 2022 11:05

Windows File Paths

In this article I'm going to attempt to explain most of what I know about Windows file paths and also some of the weird DOSisms that keep things interesting.

I'll start with NT kernel paths. These aren't usually used directly from user space but I promise they're important to fully understanding Win32 paths.

NT kernel paths

In Windows everything is an object. And if the object has a name it can be accessed via the kernel's object manager. The kernel uses paths to query the object manager. These look similar to a UNIX path. For example:

\Device\HarddiskVolume2\directory\file.ext

As you're likely aware, a path is made up of "components", separated by a \. Each component represents a directory name or a file name. In NT, components are arrays of UTF-16 code units. Any character except \ (0x005C) is allowed in component names. Even NULL (0x0000) is allowed.
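As a rough illustration of the component rule (a sketch, not how the kernel itself is written), the splitting can be expressed over raw UTF-16 code units. The names `nt_components` and `to_utf16` are made up for this example.

```rust
/// Split an NT kernel path into its components.
/// A component is a run of UTF-16 code units between `\` (0x005C) separators.
/// (Hypothetical helper for illustration only.)
fn nt_components(path: &[u16]) -> Vec<&[u16]> {
    const SEP: u16 = 0x005C; // `\`
    // Drop the leading `\` that marks the kernel root, if present.
    let rest = if path.first() == Some(&SEP) { &path[1..] } else { path };
    // Every other code unit, including NULL (0x0000), is kept as-is.
    rest.split(|&unit| unit == SEP).collect()
}

/// Encode a Rust string as UTF-16 code units (illustration helper).
fn to_utf16(s: &str) -> Vec<u16> {
    s.encode_utf16().collect()
}
```

Calling `nt_components` on the UTF-16 form of `\Device\HarddiskVolume2\directory\file.ext` yields the four components `Device`, `HarddiskVolume2`, `directory` and `file.ext`.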

Relative paths

If a directory is opened, kernel APIs allow you to open sub paths based on that directory. For example if you open the directory:

\Device\HarddiskVolume2\directory

You can then open a relative path, like so:

subdir\file.ext

So the absolute path of the file will be:

\Device\HarddiskVolume2\directory\subdir\file.ext

This is the only type of relative path understood by the kernel. In the NT kernel . and .. have no special meaning and can be regular files or directories (but almost certainly shouldn't be).
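To make the point about . and .. concrete, here is a minimal sketch of how a directory-relative open composes an absolute kernel path. It uses Rust `&str` for readability (real kernel paths are UTF-16), and `kernel_join` is a hypothetical name.

```rust
/// Form the absolute kernel path for `relative`, opened against the
/// already-open directory `base`. Nothing is normalized: `.` and `..`
/// are treated as ordinary component names, as described above.
/// (Hypothetical helper; `&str` is used instead of UTF-16 for brevity.)
fn kernel_join(base: &str, relative: &str) -> String {
    format!("{}\\{}", base.trim_end_matches('\\'), relative)
}
```

With the base `\Device\HarddiskVolume2\directory`, the relative path `..\file.ext` becomes `\Device\HarddiskVolume2\directory\..\file.ext`; the `..` is not collapsed.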

Symlinks

Device paths such as \Device\HarddiskVolume2 are all very well but often you want a more meaningful or consistent name. To this end NT supports symbolically linking from one path to another. Many of these meaningful names will be collected into a single NT folder: \??.

For example, to access a drive by its GUID you can use:

\??\Volume{a2f2fe4e-fb6b-4442-9244-1342c61c4067}

Or you can use a friendly drive name:

\??\C:

The : here has no special meaning. It's just part of the symlink name.

Filesystems

While the kernel allows almost anything in component names, filesystems may be more restrictive. For example, an NT path can include a component called C: but a filesystem may not allow you to create a directory with that name.

Microsoft's filesystem drivers will not allow the following characters in component names:

Disallowed   Description
\ /          Path separators
:            DOS drive and NTFS file stream separator
* ?          Wildcards
< > "        DOS wildcards
|            Pipe
NUL to US    ASCII control codes, aka Unicode C0 control codes (U+0000 to U+001F inclusive). Note that DEL (U+007F) is allowed.

Each component in a path is currently limited to 255 UTF-16 code units.
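The restrictions above can be sketched as a small validity check. This is only an approximation based on the table and the length limit stated here; a real filesystem driver may apply further rules. The function name is hypothetical, and rejecting an empty name is my assumption.

```rust
/// Check a component name against the restrictions listed above.
/// (Hypothetical helper; rejecting the empty name is an assumption.)
fn is_valid_component(name: &str) -> bool {
    const DISALLOWED: &[char] = &['\\', '/', ':', '*', '?', '<', '>', '"', '|'];
    // The limit is counted in UTF-16 code units, not bytes or chars.
    let units = name.encode_utf16().count();
    (1..=255).contains(&units)
        && !name.chars().any(|c| {
            // C0 control codes U+0000..=U+001F are rejected; DEL (U+007F) is not.
            DISALLOWED.contains(&c) || ('\u{0000}'..='\u{001F}').contains(&c)
        })
}
```

Note that characters outside the Basic Multilingual Plane take two UTF-16 code units each, so they count twice towards the 255 limit.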

Filesystem paths may or may not be case sensitive. In Windows they are typically case insensitive but this cannot always be assumed. In some circumstances case sensitivity can even differ on a per directory basis.

File streams

The above disallowed characters apply to component names but NTFS understands an additional syntax: file streams. Each file (including directories) can have multiple streams of data. You can address them like so:

file.ext:stream_name

Which is also equivalent to:

file.ext:stream_name:$DATA

The stream name cannot contain a NULL (0x0000) or any of the characters \, / and :. Like path components, it's limited to 255 UTF-16 code units.

The $DATA part of the stream identifier is a stream type. Valid types are assigned by Microsoft and always start with a $. If not specified, the type defaults to $DATA.
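As a sketch of the syntax just described, the following splits a single component into file name, stream name and stream type. `StreamRef` and `parse_stream` are hypothetical names, and the code only checks the rules stated in this section; it does not know which stream types Microsoft has actually assigned.

```rust
/// The parts of a `file:stream:$TYPE` component (hypothetical type).
#[derive(Debug, PartialEq)]
struct StreamRef<'a> {
    file: &'a str,
    stream: Option<&'a str>,
    stream_type: &'a str,
}

/// Parse one path component that may carry a stream suffix.
fn parse_stream(component: &str) -> Option<StreamRef<'_>> {
    let mut parts = component.split(':');
    let file = parts.next()?; // `split` always yields at least one piece
    let stream = parts.next();
    // If no type is given, it defaults to `$DATA`.
    let stream_type = parts.next().unwrap_or("$DATA");
    // At most two `:` separators, and the type must start with `$`.
    if parts.next().is_some() || !stream_type.starts_with('$') {
        return None;
    }
    if let Some(name) = stream {
        let bad_char = name.chars().any(|c| matches!(c, '\0' | '\\' | '/'));
        if bad_char || name.encode_utf16().count() > 255 {
            return None;
        }
    }
    Some(StreamRef { file, stream, stream_type })
}
```

Both `file.ext:stream_name` and `file.ext:stream_name:$DATA` parse to the same `StreamRef`, which mirrors the equivalence noted above.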

Win32

The Win32 API is built as a layer on top of the NT kernel. It implements an API that was originally built for those familiar with Win16 and DOS so it doesn't directly deal with NT paths. Instead it converts Win32 paths to NT paths before calling the kernel.

Essentially Win32 paths are a user-space compatibility layer.

Encoding

In Windows, all paths are treated as Unicode. However the Win32 API provides convenience functions to automatically convert the system encoding to UTF-16 (and vice versa). This helps to avoid the Mojibake problem by only having one canonical encoding. The UTF-16 conversion happens before everything else so interpreting paths only needs to operate on UTF-16 strings. The rest of this section assumes such a conversion has been done, if necessary.

For caveats and further information see Appendix A.

Absolute Win32 paths

All absolute paths start with a root. On *nix the root is /. For the NT kernel it's \. In contrast, Win32 has four types of root and they're all longer than one character.

  • C:\, D:\, E:\, etc. The first letter is a (case insensitive) drive letter that can be any ASCII letter from A to Z.
  • \\server\share\ where server is the name of the server and share is the name of the shared directory. It is used to access a shared directory on a server, so you must always specify both a server name and a share name.
  • \\.\. These are typically used to access devices other than drives or server shares (e.g. named pipes). So they are not usually filesystem paths.
  • \\?\. These can be used to access any type of device.

The following table shows each type and an example of how the Win32 root is converted to a kernel path.

Type      Win32 path               Kernel path
DOS       C:\Windows               \??\C:\Windows
UNC       \\server\share\          \??\UNC\server\share\
Device    \\.\PIPE\name            \??\PIPE\name
Verbatim  \\?\C:\Windows           \??\C:\Windows
          \\?\UNC\server\share\    \??\UNC\server\share\
          \\?\PIPE\name            \??\PIPE\name
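The root conversions in the table can be sketched as a prefix rewrite. This is a simplification: it only swaps the root, only recognises backslash-separated roots, and deliberately ignores the namespace transformations covered in the following sections. `root_to_kernel` is a hypothetical name.

```rust
/// Rewrite the Win32 root of `path` into its kernel form, leaving the
/// rest of the path untouched. Returns `None` if no absolute root is
/// recognised. (Hypothetical helper; backslash-only roots are assumed.)
fn root_to_kernel(path: &str) -> Option<String> {
    if let Some(rest) = path.strip_prefix(r"\\?\") {
        Some(format!(r"\??\{}", rest)) // Verbatim
    } else if let Some(rest) = path.strip_prefix(r"\\.\") {
        Some(format!(r"\??\{}", rest)) // Device
    } else if let Some(rest) = path.strip_prefix(r"\\") {
        Some(format!(r"\??\UNC\{}", rest)) // UNC
    } else {
        let b = path.as_bytes();
        let is_drive_root = b.len() >= 3
            && b[0].is_ascii_alphabetic()
            && b[1] == b':'
            && b[2] == b'\\';
        if is_drive_root {
            Some(format!(r"\??\{}", path)) // DOS drive
        } else {
            None // relative path, not handled here
        }
    }
}
```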

From the table above it looks like device paths and verbatim paths work the same way. However, that's only because I left off a column: the namespace. The namespace determines what happens to the part of the path after the root.

Type      Namespace  Example
DOS       Win32      C:\Windows
UNC       Win32      \\server\share\
Device    Win32      \\.\PIPE\name
Verbatim  NT         \\?\C:\Windows
                     \\?\UNC\server\share\
                     \\?\PIPE\name
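A small classifier makes the type-to-namespace mapping explicit. This is a sketch under a few assumptions of mine: the enum and function names are hypothetical, / is accepted as a separator for the non-verbatim roots, and the UNC case does not verify that a server and share name are actually present.

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
enum RootKind { Dos, Unc, Device, Verbatim }

#[derive(Debug, PartialEq, Clone, Copy)]
enum Namespace { Win32, Nt }

/// Classify the root of an absolute Win32 path (hypothetical helper).
fn classify_root(path: &str) -> Option<RootKind> {
    // Verbatim paths are recognised only by the exact prefix `\\?\`.
    if path.starts_with(r"\\?\") {
        return Some(RootKind::Verbatim);
    }
    let bytes = path.as_bytes();
    // Assumption: `/` counts as a separator for the other root types.
    let sep = |i: usize| matches!(bytes.get(i).copied(), Some(b'\\') | Some(b'/'));
    let drive = bytes.first().map_or(false, |b| b.is_ascii_alphabetic());
    if sep(0) && sep(1) && bytes.get(2).copied() == Some(b'.') && sep(3) {
        Some(RootKind::Device)
    } else if sep(0) && sep(1) {
        Some(RootKind::Unc) // server and share names are not checked here
    } else if drive && bytes.get(1).copied() == Some(b':') && sep(2) {
        Some(RootKind::Dos)
    } else {
        None // no absolute root
    }
}

/// Only verbatim paths live in the NT namespace.
fn namespace(kind: RootKind) -> Namespace {
    match kind {
        RootKind::Verbatim => Namespace::Nt,
        _ => Namespace::Win32,
    }
}
```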

The next two sections will explain the effects the namespace has.

NT namespace

Paths in the NT namespace are passed almost directly to the kernel without any transformations or substitutions.

The only Win32 paths in the NT namespace are verbatim paths (i.e. those that start with \\?\). When converting a verbatim path to a kernel path, all that happens is the root \\?\ is changed to the kernel path \??\. The rest of the path is left untouched.

Note that this is the only way to use kernel paths in the Win32 API. If you start a path with \??\ or \Device\ then it can have very different results.

Win32 namespace

This section applies to all Win32 paths except for verbatim paths (those that start with \\?\).

When converting a Win32 path to a kernel path there are additional transformations and restrictions that are applied to DOS drive paths, UNC paths and Device paths. Some of these transformations are useful while others are an unfortunate holdover from DOS or early Windows.

Win32 namespaced paths are restricted to a length less than 260 UTF-16 code units. This restriction can be lifted on newer versions of Windows 10 but it requires both the user and the application to opt in.

When paths are in this namespace, one of two transformations may happen:

  • If the path is a drive or relative path and the file name (the final component without the extension) is a special device name then it will be interpreted as a DOS device path. So C:\Windows\COM1 gets turned into the kernel path \??\COM1. See Appendix B for more details.
  • Otherwise the following transformations are applied:
    • First, all occurrences of / are changed to \.
    • All path components consisting of only a single . are removed.
    • A sequence containing more than one \ is replaced with a single \. E.g. \\\ is collapsed to \.
    • All .. path components will be removed along with their parent component. The Win32 root (e.g. C:\, \\server\share, \\.\) will never be removed.
    • If a component name ends with a . then the final . is removed, unless another . comes before it. So dir. becomes dir but dir.. remains as it is. I'm sure there's a reason for this.
    • For the filename only (aka the last component), all trailing dots and spaces are stripped.

For example, this:

C:/path////../../../to/.////file.. ..

Is changed to:

C:\to\file

Which becomes the kernel path:

\??\C:\to\file

This transformation all happens without touching the filesystem.
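The transformations listed above can be sketched as a pure string function. This is my approximation of the described rules, not the actual Windows implementation: it handles only DOS drive roots such as C:\, skips the special device name check, and the function names are hypothetical.

```rust
/// Apply the Win32 namespace transformations to an absolute DOS drive
/// path. Returns `None` for anything that doesn't start with `X:\` or
/// `X:/`. (Hypothetical helper; device names are not handled.)
fn normalize_dos_path(path: &str) -> Option<String> {
    // All `/` become `\`.
    let path = path.replace('/', "\\");
    let b = path.as_bytes();
    if !(b.len() >= 3 && b[0].is_ascii_alphabetic() && b[1] == b':' && b[2] == b'\\') {
        return None;
    }
    let (root, rest) = path.split_at(3);

    let mut stack: Vec<&str> = Vec::new();
    for component in rest.split('\\') {
        match component {
            // Empty pieces come from repeated `\`; `.` is simply dropped.
            "" | "." => {}
            // `..` removes its parent. The root is not on the stack,
            // so it can never be removed.
            ".." => {
                stack.pop();
            }
            other => stack.push(other),
        }
    }

    let last = stack.len().saturating_sub(1);
    let cleaned: Vec<&str> = stack
        .into_iter()
        .enumerate()
        .map(|(i, c)| {
            if i == last {
                // Final component: strip all trailing dots and spaces.
                c.trim_end_matches(|ch: char| ch == '.' || ch == ' ')
            } else if c.ends_with('.') && !c.ends_with("..") {
                // `dir.` becomes `dir`, but `dir..` is left alone.
                &c[..c.len() - 1]
            } else {
                c
            }
        })
        .collect();

    Some(format!("{}{}", root, cleaned.join("\\")))
}

/// Prefix a normalized DOS drive path with the kernel's `\??\`.
fn dos_to_kernel(path: &str) -> Option<String> {
    normalize_dos_path(path).map(|p| format!(r"\??\{}", p))
}
```

Like the real conversion, this never touches the filesystem: it is string manipulation only.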

Relative Win32 paths

Relative paths are usually resolved relative to the current directory. The current directory is a global mutable value that stores an absolute Win32 path to an existing directory. The current directory only supports DOS drive paths (e.g. C:\) and UNC paths (e.g. \\server\share). Using any other path type when setting the current directory is liable to break relative paths, so verbatim paths (\\?\) should not be used.

There are three categories of relative Win32 paths.

Type            Examples
Path Relative   file.ext
                .\file.ext
                ..\file.ext
Root Relative   \file.ext
Drive Relative  D:file.ext

Although Path Relative forms come in three flavours there are really only two: file.ext is interpreted exactly the same way as .\file.ext (see Win32 namespace). However, the .\ prefix can help to avoid ambiguities introduced by Drive Relative paths.

Drive Relative paths are interpreted as being relative to the specified drive's current directory (note: usually only the command prompt has per drive current directories). Root Relative paths are relative to the root of the current directory.

Drive Relative and Root Relative paths should be avoided whenever possible. Developers and users rarely understand how they're resolved so their results can be surprising. Additionally the Drive Relative paths syntax introduces ambiguity with file streams.
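The three categories can be told apart by looking at the first few characters. The sketch below is my reading of the table above; the names are hypothetical, and treating an empty string as "not a relative path" is an assumption.

```rust
#[derive(Debug, PartialEq)]
enum RelativeKind { PathRelative, RootRelative, DriveRelative }

/// Classify a relative Win32 path. Returns `None` for absolute paths
/// and for the empty string. (Hypothetical helper.)
fn classify_relative(path: &str) -> Option<RelativeKind> {
    let is_sep = |c: u8| c == b'\\' || c == b'/';
    match path.as_bytes() {
        [] => None,
        // Two leading separators: UNC, device or verbatim, so absolute.
        [s, t, ..] if is_sep(*s) && is_sep(*t) => None,
        // One leading separator: relative to the current drive's root.
        [s, ..] if is_sep(*s) => Some(RelativeKind::RootRelative),
        // `X:\...` is an absolute DOS drive path.
        [d, b':', s, ..] if d.is_ascii_alphabetic() && is_sep(*s) => None,
        // `X:...` without a separator is drive relative.
        [d, b':', ..] if d.is_ascii_alphabetic() => Some(RelativeKind::DriveRelative),
        _ => Some(RelativeKind::PathRelative),
    }
}
```

This also shows why the .\ prefix helps: `D:file.ext` is classified as Drive Relative, whereas `.\D:file.ext` is Path Relative and so can name a stream on a file called D.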

Further reading

If you would like more detailed descriptions of Windows paths, see these articles:

Appendix

I've tried to keep this document short(ish) and focused on the most relevant information but in doing so details fell by the wayside. For now I've collected some of them into this appendix.

Appendix A: UTF-16

Internally the Windows NT kernel uses UTF-16 strings. Their definition is conceptually similar to Rust's Vec<u16>:

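struct UnicodeString {
    length: u16,      // size in bytes of the string currently stored
    capacity: u16,    // size in bytes of the allocated buffer
    buffer: *mut u16, // UTF-16 code units; not required to be NULL terminated
}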

In the Win32 API there are generally two types of strings that applications can choose to use. Both are NULL terminated.

  • Multibyte: *mut u8
  • Wide: *mut u16.

Multibyte strings can be in any encoding supported by the OS. Windows will automatically convert to and from a UTF-16 UnicodeString as needed. If a Multibyte string contains bytes that are invalid for that encoding then they may be replaced when converting to UTF-16.

Recent versions of Windows also have the UTF-8 local encoding which, like other local encodings, is lossily converted to and from UTF-16.

Wide strings are UTF-16 and are put into a UnicodeString struct without being checked, except to get the length. This means that, unlike Rust's String, Windows does not check if a wide string is valid UTF-16. So it's possible for malicious applications to create file names with isolated surrogates (i.e. invalid Unicode).
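To illustrate the difference, here is a short sketch using only the Rust standard library. It shows how a NULL-terminated wide string's length is found and how Rust, unlike Windows, notices an isolated surrogate. The helper names are made up for this example.

```rust
use std::char::decode_utf16;

/// Length of a NULL-terminated wide string in code units, which is the
/// only inspection Windows performs. (Hypothetical helper.)
fn wide_len(buffer: &[u16]) -> usize {
    buffer.iter().position(|&unit| unit == 0).unwrap_or(buffer.len())
}

/// True if every surrogate in `wide` is correctly paired.
fn is_valid_utf16(wide: &[u16]) -> bool {
    decode_utf16(wide.iter().copied()).all(|unit| unit.is_ok())
}

/// Lossy conversion: isolated surrogates become U+FFFD.
fn to_string_lossy(wide: &[u16]) -> String {
    String::from_utf16_lossy(wide)
}
```

For the code units `[0x0066, 0xD800, 0x0067]`, `is_valid_utf16` returns `false` and `String::from_utf16` returns an error, yet the same units are an acceptable file name as far as the kernel is concerned.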

Appendix B: Special DOS device names

In the Win32 namespace, if a path is an absolute DOS drive or a relative path and if a filename (aka the final component) matches a special DOS device name then the path is ignored and replaced with that DOS device. For example:

C:\directory\subdir\COM1

Gets translated to:

\\.\COM1

Which becomes the kernel path

\??\COM1

These are the DOS device names that get the path replaced:

  • AUX
  • CON
  • CONIN$
  • CONOUT$
  • COM1, COM2, COM3, COM4, COM5, COM6, COM7, COM8, COM9, COM², COM³, COM¹
  • LPT1, LPT2, LPT3, LPT4, LPT5, LPT6, LPT7, LPT8, LPT9, LPT², LPT³, LPT¹
  • NUL
  • PRN

However the algorithm for matching device names is not as simple as a direct comparison. When comparing file names to special DOS device names, it's as if the following steps were applied to the file name:

  1. ASCII letters are uppercased
  2. anything after a . and the . itself are removed
  3. any trailing spaces ( ) are stripped.

For example, these filenames are all interpreted as \\.\COM1:

  • "COM1.ext"
  • "COM1     "
  • "COM1 . .ext"

One final note: when opening a file path such as C:\Test\COM1, it will only resolve to \\.\COM1 if the parent directory C:\Test exists. Otherwise opening the file will fail with an invalid path error.
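The three matching steps described in this appendix can be sketched as follows. This mirrors the rules as written here only (a commenter below reports that Windows 11 no longer ignores extensions), it does not check whether the parent directory exists, and the names are hypothetical.

```rust
/// The special DOS device names listed above.
const DEVICE_NAMES: &[&str] = &[
    "AUX", "CON", "CONIN$", "CONOUT$",
    "COM1", "COM2", "COM3", "COM4", "COM5", "COM6", "COM7", "COM8", "COM9",
    "COM²", "COM³", "COM¹",
    "LPT1", "LPT2", "LPT3", "LPT4", "LPT5", "LPT6", "LPT7", "LPT8", "LPT9",
    "LPT²", "LPT³", "LPT¹",
    "NUL", "PRN",
];

/// Return the device a file name would be replaced with, if any.
/// (Hypothetical helper following the three steps in the text.)
fn dos_device_name(file_name: &str) -> Option<&'static str> {
    // 1. ASCII letters are uppercased.
    let upper = file_name.to_ascii_uppercase();
    // 2. The first `.` and everything after it are removed.
    let stem = upper.split('.').next().unwrap_or("");
    // 3. Trailing spaces are stripped.
    let stem = stem.trim_end_matches(' ');
    DEVICE_NAMES.iter().copied().find(|name| *name == stem)
}
```

Under these rules `COM1.ext`, `COM1` followed by spaces, and `COM1 . .ext` all map to `COM1`, while a name such as `COM10` matches nothing.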

Appendix C: Volume paths

One form of path I've only briefly mentioned is GUID paths. These aren't used as much and are essentially just Verbatim or Device paths which aren't handled any differently. Still, it can be useful to be aware of paths such as:

\\?\Volume{79D3A0DE-481C-4D52-A70B-F06A16C020C2}\file.ext

This addresses a volume according to its GUID instead of a drive letter. It is useful for partitions that don't have an assigned letter or for when you need to be sure you're addressing a specific volume, regardless of where it is mounted.

If you read the kernel section you've probably guessed that these GUID paths are just symlinks to, for example, \Device\HarddiskVolume2. In this way a Drive path like C: will be exactly equivalent to a Volume path if they are both symlinked to the same volume.

There are other such symlinks but they are used even more rarely and are possibly considered an implementation detail.

@jernejs

jernejs commented May 31, 2022

There's a change in Windows 11 regarding DOS device names – extensions aren't ignored any more, so aux.c is now a valid Win32 file and not a device name any more.

@ChrisDenton
Author

ChrisDenton commented May 31, 2022

Yeah, I haven't yet updated this with the Windows 11 changes. I'd also add that .\aux is now valid as well.

As far as I know it's now only a case insensitive match for the string aux, etc that can get parsed as a device name. Though trailing dots and spaces are still stripped so aux.. .. is a device name. I'd guess this couldn't be changed without breaking compatibility.
