@chungy
Created December 8, 2013 02:29
Technical information about UMSDOS.
UMSDOS uses a fairly simple system to store metadata information in
the --LINUX-.--- files. Each full metadata block is a multiple of 64
bytes, up to 256 bytes, depending on the length of the filename.
UMSDOS uses a deterministic way to convert Linux filenames into
MS-DOS-compatible 8.3 style names, handling situations like
case-sensitivity, uniqueness when the filenames differ after the 8th
character, special filenames not allowed on MS-DOS and FAT, and so on.
It allows a fairly full set of typical POSIX functionality, only
lacking sparse file support (which would be impossible to implement
while allowing non-UMSDOS aware systems to correctly access a file's
content). Hard links are specially treated; the link names have
mirrored metadata and the files that appear on disk contain only the
path name to the actual hidden link file. The hidden link file holds
the contents of the hard-linked file and is recorded in the UMSDOS
metadata, but is not directly accessible from UMSDOS. Additionally, the
--LINUX-.--- control files do not appear under UMSDOS and there is no
way to store a file named as such (or one with a different case) on
the system.
Fields in a --LINUX-.--- file:
unsigned char name_length
unsigned char flags
unsigned short number_of_links
uid_t (unsigned short?) uid
gid_t (unsigned short?) gid
long atime
long mtime
long ctime
unsigned char dev_minor [1]
unsigned char dev_major [1]
unsigned short mode
char spare[12] // reserved bytes, not used
char name[220] [2]
[1] the device major/minor numbers are treated as a single unsigned
short in the original C sources, but effectively it's easier to treat
them as separate. These might be reversed on big-endian.
[2] This char array can be *up to* 220 bytes long, but is usually much
shorter; short enough to be only 28 bytes. It's only as long as it
needs to be for the entire metadata block to be a multiple of 64
bytes. \x00-padded, but not \x00 terminated.
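As a rough illustration, the layout above can be read with Python's
struct module. This is only a sketch under assumptions: a
little-endian file, 16-bit uid/gid (the question marks above), and
32-bit longs as on i386. HEADER, record_size, and parse_entry are
names made up for this example, not anything from the kernel sources.

```python
import struct

# Fixed part of an entry, little-endian, no padding between fields:
# name_length, flags, number_of_links, uid, gid, atime, mtime, ctime,
# dev_minor, dev_major, mode, then 12 spare bytes (skipped with "12x").
# Assumption: uid/gid are 16 bits and longs are 32 bits.
HEADER = struct.Struct("<BBHHHiiiBBH12x")


def record_size(name_length: int) -> int:
    """Size of a whole entry, rounded up to a multiple of 64 bytes."""
    return -(-(HEADER.size + name_length) // 64) * 64


def parse_entry(buf: bytes, offset: int = 0) -> dict:
    """Decode one entry starting at `offset` in `buf`."""
    (name_length, flags, nlink, uid, gid, atime, mtime, ctime,
     dev_minor, dev_major, mode) = HEADER.unpack_from(buf, offset)
    start = offset + HEADER.size
    return {
        "name": bytes(buf[start:start + name_length]),  # not \x00 terminated
        "flags": flags, "nlink": nlink, "uid": uid, "gid": gid,
        "atime": atime, "mtime": mtime, "ctime": ctime,
        "dev_minor": dev_minor, "dev_major": dev_major, "mode": mode,
        "size": record_size(name_length),
    }
```

With these assumptions the fixed part is 36 bytes, which matches the
28-byte name limit for a 64-byte entry and the 220-byte limit for a
256-byte one.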
The flags field is a little bit special and is only used for
supporting hard links. The value of 1 means the file is hidden; that
is, it never shows up in any kind of stat, like how the --linux-.---
file is treated. A value of 2 means that the file represents a hard
link, and like a symlink, the contents of it point to its actual
destination, which is the hidden link file. The hard linked name
itself gets a mode of 100777 set, and the number_of_links field is set
to 1 like any other regular file. The hidden link file, instead,
contains all of the metadata to be displayed for hard links; this
includes a proper number_of_links count, time stamps, permissions, and
so forth.
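A tool reading the flags field might interpret it roughly like this.
The constant and function names are hypothetical, and the sketch
compares by value exactly as described above; whether the two values
can ever be combined is not something these notes establish.

```python
HIDDEN = 1     # entry never shows up in listings or stat
HARD_LINK = 2  # file content is the path of the hidden link file


def describe_flags(flags: int) -> str:
    """Return a short human-readable label for a flags value."""
    if flags == 0:
        return "ordinary entry"
    if flags == HIDDEN:
        return "hidden"
    if flags == HARD_LINK:
        return "hard link"
    return "unknown"
```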
None of these fields is endian-safe! A big-endian system running
Linux 2.4 and creating/using a UMSDOS filesystem will store every
multi-byte field byte-swapped compared to a little-endian system. Most likely, any
UMSDOS filesystems you'll see around will be little-endian thanks to
its rather niche purpose of providing POSIX semantics on top of an
MS-DOS system, but it'd be trivial to support both little- and
big-endian.
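To show what the byte order changes, here is a small sketch that
decodes the 16-bit mode field either way. read_mode is a made-up
helper; the same idea would apply to every multi-byte field.

```python
import struct


def read_mode(raw: bytes, big_endian: bool = False) -> int:
    """Decode a 2-byte mode field in the requested byte order."""
    fmt = ">H" if big_endian else "<H"
    return struct.unpack(fmt, raw)[0]
```

A regular file with mode 100644 (octal) is 0x81a4, so a little-endian
writer stores the bytes a4 81 and a big-endian writer stores 81 a4.
Reading either with the wrong order yields 0xa481, which is not a
sensible mode.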
Metadata entries can be cleared by zeroing out the entire entry. This
should make it simple to support even instances where much of the
beginning of the file is just \x00s; upon reading a \x00 of the first
char, the name_length field, seek forward 63 bytes and try reading the
next one, and so forth. A file renamed from a rather long name to a
shorter one could simply have its extra name bytes zeroed out in
place, but the kernel driver instead zeroes the old entry and writes a
new one at the end of the --linux-.--- file. The same method is easily
applied to renaming a file from a short name to a long one: zero out
the entry and make a new one at the end of --linux-.---. This might
lead to some horribly space-inefficient metadata files over time, but
that might be better handled through an independent fsck or other
clean-up utility.
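The scanning approach described above could look something like the
following sketch. It assumes a 36-byte fixed part before the name
(the sum of the fields listed earlier, with 16-bit uid/gid and 32-bit
longs) and takes the whole file as a bytes object; iter_entries and
the constants are hypothetical names.

```python
HEADER_SIZE = 36  # assumed size of the fixed fields before the name
SLOT = 64         # entries start on 64-byte boundaries


def iter_entries(data: bytes):
    """Yield (offset, name) for every live entry, skipping zeroed slots."""
    offset = 0
    while offset + SLOT <= len(data):
        name_length = data[offset]
        if name_length == 0:
            # Cleared slot: move on to the next 64-byte boundary.
            offset += SLOT
            continue
        start = offset + HEADER_SIZE
        yield offset, data[start:start + name_length]
        # Step over the whole entry, rounded up to a multiple of 64.
        offset += -(-(HEADER_SIZE + name_length) // SLOT) * SLOT
```

The offsets it yields are what a clean-up utility would need in order
to find the holes left behind by renames and deletions.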
UMSDOS has special functionality to allow certain characters and names
that are not allowed on MS-DOS and/or FAT but bear no special meaning
to Linux. POSIX systems typically only disallow two characters, \x00
and / (\x2f); beyond that, any character or string of characters may
be used. Forbidden characters in DOS/FAT names are:
  * Control characters \x01 to \x1f and \x7f
      - UMSDOS still doesn't allow the storage of \x7f as a character
        in a file name. There shouldn't be any technical reason for
        disallowing it, but it's probably an oversight.
  * Space character
      - Technically, not actually forbidden by DOS, but most programs
        and tools make it difficult to store and use such names.
        Linux's msdos filesystem with check=s also forbids its use,
        and chkdsk, scandisk, and dosfsck all report it as an error if
        a name does contain a space. It's best to avoid it at least.
  * " * + , ; < = > ? [ \ ] | : .
      - The period can only appear once in a filename, and its use is
        solely to separate the basename from the extension, which are
        stored separately in FAT. A file cannot lack a basename, or in
        other words start with a period. Multiple periods are not a
        valid DOS or FAT name. A file on DOS may be referenced with a
        trailing period, but this means there is no extension and it
        has the same meaning as leaving the trailing period out.
UMSDOS generates FAT file names using a rather simple method:
  1. Lower-case filenames that fit within the 8+3 limits are stored
     as-is; for example, the file "dir.c" is stored simply as "dir.c".
  2. Upper-case is always mangled. If "Makefile" is the only file ever
     stored in a directory, it will be stored as makefile.{__.
  3. Extra periods are converted to underscores and also mangled.
     linux-2.4.37.11.tar.gz will be stored as linux-2_.{__.
  4. Control characters, spaces, other special characters, and bytes
     above \x7f get converted to #s. C:\DOS\RUN gets stored as
     c##dos#r.{__.
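The rules above can be approximated with the sketch below. It is
reconstructed only from the examples in these notes, not from the
kernel sources, so treat the details as assumptions: every period in
a mangled name becomes an underscore, the result is cut to the first
8 bytes, and a trailing period counts as needing mangling. Reserved
device names and the generated extension are left out. All function
names are hypothetical.

```python
# Bytes listed above as forbidden in DOS/FAT names (period handled apart).
FORBIDDEN = set(b'"*+,;<=>?[\\]|:')


def _bad(byte: int) -> bool:
    """Control characters, space, \\x7f and above, or a forbidden character."""
    return byte <= 0x20 or byte >= 0x7F or byte in FORBIDDEN


def needs_mangling(name: bytes) -> bool:
    """True if `name` cannot be stored as-is under the 8+3 rules."""
    base, dot, ext = name.partition(b".")
    if not base or len(base) > 8 or len(ext) > 3 or (dot and not ext):
        return True
    # Any upper-case letter, extra period, or bad byte forces mangling.
    return any(_bad(b) or b == 0x2E or 0x41 <= b <= 0x5A for b in base + ext)


def mangle_base(name: bytes) -> bytes:
    """Build the 8-byte basename used for a mangled name."""
    out = bytearray()
    for b in name[:8]:
        if b == 0x2E:
            out.append(ord("_"))   # periods become underscores
        elif _bad(b):
            out.append(ord("#"))   # unusable bytes become #
        else:
            out.append(b)
    return bytes(out).lower()      # lower-cases ASCII letters only
```

Under these assumptions mangle_base reproduces the basenames in the
examples: makefile, linux-2_, and c##dos#r.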
There are additional strings of characters that may not make up the
entirety of the basename on DOS. UMSDOS mangles such names so that
they may still be used on Linux. For example, on DOS, the file
'aux.sh' cannot be stored or accessed; however, a name like '-aux.sh'
is OK and can be stored. The following strings are forbidden as the
whole basename: AUX, CLOCK$, COM1, COM2, COM3, COM4, CON, LPT1, LPT2,
LPT3, LPT4, NUL, PRN. Additional ones reserved by certain TSRs but not
blocked by DOS itself are EMMXXXX0, XMSXXXX0, and SETVERXX; these are
also mangled by UMSDOS. The doschk utility additionally lists
MS$MOUSE and SMARTAAR as reserved names, but UMSDOS does not mangle
them.
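A check for these names might be sketched as follows. The set holds
exactly the names listed above as mangled by UMSDOS; the function
name is made up, and treating everything before the first period as
the basename is an assumption.

```python
# Names that UMSDOS mangles when they make up the whole basename.
RESERVED = {
    "AUX", "CLOCK$", "COM1", "COM2", "COM3", "COM4", "CON",
    "LPT1", "LPT2", "LPT3", "LPT4", "NUL", "PRN",
    "EMMXXXX0", "XMSXXXX0", "SETVERXX",
}


def is_reserved(name: str) -> bool:
    """True if the part before the first period is a reserved name."""
    base = name.split(".", 1)[0]
    return base.upper() in RESERVED
```

So 'aux.sh' is caught while '-aux.sh' is not, and MS$MOUSE passes
through because UMSDOS does not mangle it.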
Extensions for mangled names are generated deterministically, using
base 32 with up to 9216 unique (mangled name?) files. The last two
characters are just base-32, with 0 replaced with a _. The first
character is one of, in order: { } ( ) ! ` ^ & @
The extension is based on the location of the file's metadata in
--linux-.---, in multiples of 64. The entry beginning at 0x00 becomes
{__, the one at 0x40 {_1, the one at 0x12a00 (pos 1192) }58. In
base-32, effectively the highest number possible in this scheme is 8vv
(translated as @vv); it may seem odd, but it avoids any clashes with
any extensions common in the DOS (or Windows) world, such as .com,
.doc, .123, and so on, while still allowing a reasonable number of
files in a directory (most will never reach anywhere close to 9216 :).
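The extension scheme can be sketched like this. The digit alphabet
(_ for zero, then 1-9 and a-v) is inferred from the description
rather than copied from the kernel, and the function takes the byte
offset of the entry, assumed to be a multiple of 64.
mangled_extension is a hypothetical name.

```python
FIRST = "{}()!`^&@"                          # nine possible lead characters
DIGITS = "_123456789abcdefghijklmnopqrstuv"  # base 32, with 0 written as _


def mangled_extension(offset: int) -> str:
    """Map the offset of an entry in --linux-.--- to its 3-character extension."""
    index, remainder = divmod(offset, 64)
    if remainder or not 0 <= index < len(FIRST) * 32 * 32:
        raise ValueError("offset must be a multiple of 64 below 9216 * 64")
    return FIRST[index // 1024] + DIGITS[(index // 32) % 32] + DIGITS[index % 32]
```

Entry 1192 splits into 1 * 1024 + 5 * 32 + 8, which gives the lead
character } followed by 5 and 8. The last usable entry, number 9215,
comes out as @vv.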
A lot of these filename restrictions are not present under VFAT;
spaces, +, =, commas, and periods may be freely used. Windows
Explorer and command.com/cmd.exe do not allow creating filenames with
a leading period (normally represents a hidden file on Linux), but it
is not a restriction of the filesystem itself. This isn't
particularly relevant to UMSDOS which predates VFAT and is concerned
about 8+3 semantics of DOS before Windows 95. In some ways, UMSDOS
maintains better compatibility than VFAT does; it doesn't futz around
with filesystem structures liable to be removed or corrupted when
running scandisk or defrag from MS-DOS.
There was an experimental UVFAT in development for a short time that
shared the base filename space with regular VFAT. It allows for more
meaningful down-conversions from Linux names, making accessibility
from Windows much more convenient. It is incompatible with UMSDOS and
I have not explored it, but it may be worth looking into at least as
inspiration for future expansions. There are some ideas I have to
make UMSDOS behave better, especially for some fringe circumstances
where you might want to store a file named --linux-.--- or where you
have a name that conforms to DOS and 8+3 limits but also looks like a
mangled name. Largely, the UMSDOS limitations cannot be repaired
without breaking compatibility, so it'd be better to start off
fresh... and the old DOS restrictions aren't quite so relevant
anymore, but VFAT is; utilizing that would be beneficial. The
posixovl project already is one such attempt at being a modern
filesystem of this kind, but it too suffers from many limitations and
is rather unstable.
Well that's all there is to it. The goal of this project at the
moment is to be compatible with UMSDOS as it is, providing both a FUSE
filesystem for it as well as some tools to manipulate/poke around it
without having to mount it. For my purposes, I'm only really
concerned with operating with a Slackware 11 UMSDOS installation,
which as far as I'm aware, is the last distribution that still
supported UMSDOS; one of the last hold-outs on the Linux 2.4 kernel
for that matter. The hope is to also have a more stable filesystem
than Linux 2.4 had; even with 2.4.37.11, the last Linux 2.4 release
ever, there are a number of ways to break UMSDOS directories entirely
user-side, and not even as the root user!