@sternenseemann
Last active July 8, 2022 08:47

WIP: The wonders of build target configuration

Background

I’m interested in this topic from a specific perspective which is probably relevant to the reader: I view platform configuration mainly through the lens of nixpkgs, a multi-platform package repository with first-class (?) support for cross-compilation. Here the package set itself must have an idea of the involved platform(s) and pass that information on to the configure scripts et cetera of the packaged software.

I started researching this topic more seriously when refactoring code in cabal2nix (rather distribution-nixpkgs) that “parses” Nix’s system strings.

Terminology

Seemingly everyone has slightly different jargon for these weird little strings with the dashes in them that describe a kind of system people are running:

  • LLVM/clang calls them LLVM Triples.
  • autotools calls them system type (or name) or target triplet — quite confusingly as it is not only used to describe the target platform. Occasionally they are also referred to as configuration (or configuration name (of the system)), e.g. in config.sub and config.guess. This is mirrored in nixpkgs where the autotools system type is stored in the config attribute of the platform attribute set.
  • Nix calls (a specific subset of) them system. Note that system is not what nixpkgs would call a platform — in fact the latter contains the former in the system attribute.

We’re going to have to settle for an umbrella term for this document which doesn’t appear in this list. Something with “triple” or “triplet” would be great as people usually understand this term, but actually only LLVM’s triples are triples at all — autotools’ “triplets” can have from 2 to 4 components.

For this document we are going to call everything that is a thing with dashes that describes a system a “platform string” because ultimately that’s what we can say about them for sure.

The Schools of Thought

autotools’ target triplets

The autoconf manual keeps the description of target triplets relatively short and vague. The main documented points are:

  • The form of the triplet is “cpu-vendor-os, where os can be system or kernel-system” (taken verbatim from section 14.1 of the autoconf manual).
  • Configure scripts should look at triplets by using shell globbing (which reinforces the point that they are, first and foremost, strings). For example, i?86-*-* checks for 32-bit x86 and *-*-linux* for something with a Linux kernel.
  • The primary source of truth for triplets is autoconf, mostly in the form of the config.guess script that works out (or tries to) which target triplet is appropriate for the machine it’s running on.
  • Additionally, the set of triplets is subject to change: autotools may start supporting new ones at any time, with very few constraints on what they might look like.
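
The globbing approach from the list above can be sketched in a few lines of POSIX shell. This is only an illustration: describe_triplet is a hypothetical helper name, and the two patterns are the ones quoted above, not a complete classification.

```sh
#!/bin/sh
# Sketch: inspect a platform string the way a configure script might,
# using nothing but shell globbing. describe_triplet is a hypothetical
# helper for this note, not something autoconf provides.
describe_triplet() {
    triplet=$1

    # CPU check: i386, i486, i586, i686, ...
    case $triplet in
        i?86-*-*) cpu="32-bit x86" ;;
        *)        cpu="other CPU" ;;
    esac

    # OS check: anything whose third component starts with "linux".
    case $triplet in
        *-*-linux*) os="Linux kernel" ;;
        *)          os="other OS" ;;
    esac

    printf '%s, %s\n' "$cpu" "$os"
}

describe_triplet i686-pc-linux-gnu
describe_triplet x86_64-unknown-linux-musl
describe_triplet i586-pc-msdosdjgpp
```

Note that both patterns expect the dashes of a fully expanded triplet: a shortened form such as x86_64-linux contains only one dash and matches neither of them.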

So the most canonical version of a target triplet is actually a quadruplet! Even weirder, though, is that they don’t even have to have three components or more. It is perfectly legal to omit the vendor, yielding e.g. x86_64-linux, which is expanded to x86_64-pc-linux-gnu, or even the OS as well, e.g. riscv64, which defaults to riscv64-unknown-none. This creates further ambiguity, as none, for example, is both a valid vendor and a valid OS type (see also the section on riscv-none-elf).

The key consequence of this is the following: You can’t split the target triplet into its components without knowledge about the possible values for each. Assuming you know that something is a valid autotools target triplet (which is already quite the assumption in some cases):

  1. If the triplet has four components, you’re golden: split at the dashes. However, I’m not certain whether autotools may allow additional dashes inside a component in the future, as is already the case for the OS part, maybe to allow x86_64-v2 et cetera as CPU parts.
  2. If the triplet has three components it may either be cpu-kernel-system or cpu-vendor-os.
  3. If the triplet has two, the first component is the CPU part and the second one is “usually, but not always the OS”; it can also be a vendor, so the triplet is either cpu-os or cpu-vendor. Additionally, there is (at the moment) one two-component target triplet which is a special alias for mips-dec-ultrix4.2, namely decstation-3100.
  4. If the triplet has one, it is cpu, implying OS none usually. Alternatively, it can be a “single-component [shorthand] not valid as part of multi-component configurations”.

This means, effectively, that you can’t split it correctly unless you are autoconf yourself, since the most common case, three components, is ambiguous: You need to know what strings are legal for the vendor, OS, kernel and system part, and even worse, some strings like none may be valid for both the OS and the vendor part.
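
To make the ambiguity concrete, here is a small shell sketch that reports only what the number of dash-separated components tells you. The function names count_components and shape_of are made up for this note, and the sketch deliberately ignores special aliases such as decstation-3100.

```sh
#!/bin/sh
# Sketch: how much can be said about a platform string from the number
# of dash-separated components alone. Both function names are
# hypothetical; nothing here consults autoconf's lists of legal values.

# Count components by repeatedly stripping everything up to the first dash.
count_components() {
    rest=$1
    n=1
    while [ "${rest#*-}" != "$rest" ]; do
        rest=${rest#*-}
        n=$((n + 1))
    done
    echo "$n"
}

# Map the component count to the possible interpretations listed above.
shape_of() {
    case $(count_components "$1") in
        1) echo "cpu (or a single-component shorthand)" ;;
        2) echo "cpu-os or cpu-vendor" ;;
        3) echo "cpu-vendor-os or cpu-kernel-system" ;;
        4) echo "cpu-vendor-kernel-system" ;;
        *) echo "not a shape described here" ;;
    esac
}

shape_of x86_64-pc-linux-gnu
shape_of riscv32-none-elf
shape_of x86_64-linux
```

Only the four-component case yields a single answer; everything shorter needs knowledge about the legal values of each part.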

So the options you are left with are: limit yourself to a subset of all triplets and reject everything else, or reimplement the 1860-line config.sub shell script. The former option sort of works for downstream projects (e.g. nixpkgs does it), but it’s still problematic: the vendor part is not restricted to a fixed set of legal values, which may cause you grief if you misinterpret something as the vendor part that isn’t.

Additionally, note that splitting the triplet is not all config.sub / autoconf does: After identifying the parts, it normalizes them and fills in the missing pieces according to non-trivial rules, e.g. it will supply ibm as vendor for s390x-* if it is omitted. Consequently, it would be very beneficial to run the target triplet through config.sub first when you have to parse one. This will also guarantee that the resulting triplet has three or four components with the second always being the vendor. However, this is often not possible, as you would need to ship the script somehow and be able to shell out to it.
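
Where shelling out is possible, the canonicalization step could look roughly like the following sketch. It relies only on config.sub printing the canonical name for its single argument; the wrapper name canonicalize and the idea of falling back to the unmodified input are assumptions made for illustration, as is whatever path you pass in.

```sh
#!/bin/sh
# Sketch: canonicalize a platform string via config.sub before looking
# at its parts. canonicalize is a hypothetical wrapper; the location of
# config.sub has to be supplied by the caller.
canonicalize() {
    config_sub=$1
    triplet=$2
    if [ -x "$config_sub" ]; then
        # config.sub prints the canonical form on stdout and exits
        # non-zero for names it does not accept.
        "$config_sub" "$triplet"
    else
        # Assumed fallback: warn and hand the string back untouched.
        echo "warning: $config_sub not found, leaving '$triplet' as is" >&2
        printf '%s\n' "$triplet"
    fi
}
```

Called as canonicalize ./config.sub x86_64-linux with a real config.sub, this should print the expanded form mentioned earlier (x86_64-pc-linux-gnu); without the script the input comes back unchanged.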

In conclusion, autotools target triplets are strings consisting of multiple string components joined by dashes, but only autoconf knows how exactly. They can be ambiguous and, since autotools is the ultimate, but changing, source of truth, it is hard to untangle the string correctly in all cases.

In general, though, it’ll make sense to humans what a target triplet describes. You also won’t run into trouble if you either obtain the target triplet by running config.guess on the machine you are interested in or are given it by your device’s vendor or cross toolchain distributor.

If you are dealing with a lot of hypothetical triplets (e.g. as a nixpkgs maintainer), the config.sub script will help you get a sense of how autoconf interprets a specific input string. If you’re feeling brave, read its source (both config.sub and config.guess are available as part of the gnu-config nixpkgs package or in the build-aux directory of the autoconf source tree).

The main takeaway to be had here is to treat target triplets as opaque strings as much as possible. The autoconf developers actively discourage something as profane as using globbing on the triplet, recommending probing for the specific property instead (e.g. check for linux headers instead of *-linux*). The target triplets are subject to change, so you will only be on the safe side if you treat them as implementation details of autoconf.

Nix’s systems

Fundamentally, a system is some kind of string which the Nix daemon uses to decide whether it can build a particular derivation. Every derivation, at the core, looks like this:

derivation {
  name = "my-derivation";
  system = "i686-linux";

  /* instructions how to build the derivation */
}

The system string the daemon uses is picked in Nix’s configuration script. If it is equal to the derivation’s, the daemon will happily build it; otherwise it will try to defer the task to a remote builder.

Nix’s configuration script uses autoconf, so the ultimate source of truth is autoconf, mediated by these normalizations:

  • In host_cpu,
    • i*86 is transformed to i686
    • amd64 is transformed to x86_64 (autoconf does this by itself nowadays)
    • armv{6,7} is transformed to armv{6,7}l
  • In host_os,
    • linux-{musl,gnu}* are transformed to plain linux
    • Version numbers attached are dropped (e.g. the 0.3 in gnu0.3)

The final system is obtained by joining the autoconf CPU and OS parts by a dash, so it is a more predictable variant of the two component autotools target triplet, forming either cpu-os or cpu-kernel-system, although the latter is very rare for Nix itself.
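
The normalizations listed above can be sketched as a shell function. This is a reconstruction from the rules in this section rather than a copy of Nix’s actual configure code; the function name nix_system is hypothetical, and its two arguments stand in for autoconf’s host_cpu and host_os.

```sh
#!/bin/sh
# Sketch: derive a Nix system string from autoconf's CPU and OS parts,
# following the normalization rules listed in this section.
# nix_system is a hypothetical name used only for this illustration.
nix_system() {
    host_cpu=$1
    host_os=$2

    case $host_cpu in
        i*86)        cpu=i686 ;;            # all 32-bit x86 variants
        amd64)       cpu=x86_64 ;;
        armv6|armv7) cpu=${host_cpu}l ;;    # append the "l" suffix
        *)           cpu=$host_cpu ;;
    esac

    case $host_os in
        # Drop the libc/ABI part for Linux.
        linux-gnu*|linux-musl*) os=linux ;;
        # Otherwise strip a trailing version number such as the 0.3 in gnu0.3.
        *) os=$(printf '%s\n' "$host_os" | sed -e 's/[0-9.]*$//') ;;
    esac

    printf '%s-%s\n' "$cpu" "$os"
}

nix_system i586 linux-gnu
nix_system armv7 linux-gnueabihf
nix_system aarch64 darwin21.5.0
```

The three calls should print i686-linux, armv7l-linux and aarch64-darwin respectively.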

Nix is not the most portable software in the world, so the number of system strings actually generated by the configure script is limited:

  • *-linux (probably mainly with aarch64, i686, x86_64, armv6l, armv7l and other common CPUs in use today)
  • x86_64-darwin, aarch64-darwin
  • i686-netbsd, x86_64-netbsd (maybe more CPUs?)
  • x86_64-freebsd (maybe other CPUs?)
  • x86_64-openbsd (although the port never made it into the main Nix source tree, so extremely rare)

However, these are not the only system strings in use today: nixpkgs not only receives system strings from Nix via builtins.currentSystem, but also interprets system strings passed in by the user as cross compilation targets. Among these are systems Nix will probably never run on, like avr-none. As a result, a world of Nix system strings exists somewhat disconnected from autoconf.

Additionally, due to the transformation of linux-{musl,gnu}* to just linux, no three-component Nix systems exist in practice and support for them is probably poor.

nixpkgs’ platforms

LLVM Triples

Parsing Platform Strings

Don’t.

Case studies

riscv-none-elf

ghost commented Jul 6, 2022

This is bikeshedding a bit, but maybe {host,build,target}Platform.canonical?

It's a single word, dodges the confusion of "four-element triples", and is relatively unambiguous compared to "system (type,name)", "platform", and "configuration" since there is only one taxonomy (autoconf) which uses the adjective "canonical" and has a canonicalization routine.

The documentation would explain that "this field contains the canonical autoconf name for the platform; it should agree with the name reported by config.sub". I don't think we should (or can?) shell out to config.sub, but we can declare that any disagreement is considered a bug to be fixed.

Obviously such a change would need an RFC, not just a PR.

> it is perfectly legal to omit the vendor

We could also have hostPlatform.triplet, documented as "Equal to hostPlatform.canonical with the vendor component omitted. The result will consist of three hyphen-separated components. Note: the third component may have two subcomponents: the libc and the abi.".

@sternenseemann
Author

> I wouldn't fret too much about the vendor field. It is basically a comment.

Not true! First of all, there are valid autoconf names that omit the OS in favor of the vendor, in which case the vendor name is used to determine the OS (this is not always possible, but I'll need to update my note regardless…), e.g.:

> ./result/config.sub mipsel-mips
mipsel-mips-elf
> ./result/config.sub x86_64-apple
x86_64-apple-macos

Additionally, it is important to know what counts as a vendor to disambiguate it from the OS correctly when parsing. Case in point is this nixpkgs bug:

nix-repl> (lib.systems.elaborate { config = "riscv32-none-elf"; }).parsed.kernel.name
"none"

> ./result/config.sub riscv-none-elf
riscv-none-elf

Here nixpkgs will take none as the kernel because it thinks none is not a valid vendor, whereas in reality it is both a valid kernel and vendor in autoconf (a none-elf kernel-system is not a thing in autoconf though).

> This is bikeshedding a bit, but maybe {host,build,target}Platform.canonical?

I don't think it's worth it to rename config necessarily, it'll always be fuzzy. Or do you want to add a new field?

> And system is <libc>(e?abi<abi>)?. Yeah, it's a mess.

I think we should still be conservative here and operate with known string (formats) instead of trying to outright parse it, since autoconf doesn't devote much passion to this and mostly treats it as opaque with some known values:

# As a final step for OS-related things, validate the OS-kernel combination
# (given a valid OS), if there is a kernel.
case $kernel-$os in
        linux-gnu* | linux-dietlibc* | linux-android* | linux-newlib* | linux-musl* | linux-uclibc* )
                ;;
        uclinux-uclibc* )
                ;;
        -dietlibc* | -newlib* | -musl* | -uclibc* )
                # These are just libc implementations, not actual OSes, and thus
                # require a kernel.
                echo "Invalid configuration \`$1': libc \`$os' needs explicit kernel." 1>&2
                exit 1
                ;;
        kfreebsd*-gnu* | kopensolaris*-gnu*)
                ;;
        vxworks-simlinux | vxworks-simwindows | vxworks-spe)
                ;;
        nto-qnx*)
                ;;
        os2-emx)
                ;;
        *-eabi* | *-gnueabi*)
                ;;
        -*)
                # Blank kernel with real OS is always fine.
                ;;
        *-*)
                echo "Invalid configuration \`$1': Kernel \`$kernel' not known to work with OS \`$os'." 1>&2
                exit 1
                ;;
esac

> I think this conclusion is unsupported.

Note that this conclusion has the specific constraint that the parsing logic should need to know as little as possible about legal values, since I investigated this for cabal2nix where I wanted to avoid the need to continuously update the software in order to keep working with all targets supported by nixpkgs.

Parsing in nixpkgs is mostly fine and necessary: as long as you have a limited number of supported targets, it'll mostly work (although there's always the danger of introducing discrepancies/bugs). If you want to support arbitrary autoconf triplets, anything short of executing config.sub / reimplementing it completely won't work.

> The documentation would explain that "this field contains the canonical autoconf name for the platform; it should agree with the name reported by config.sub". I don't think we should (or can?) shell out to config.sub, but we can declare that any disagreement is considered a bug to be fixed.

Yes, sounds good. We can't shell out to config.sub, but we can write a test suite that verifies this claim for all known targets in nixpkgs fairly easily.

(One possibility would ofc be implementing a limited bash interpreter and coreutils emulation in Nix to run config.sub, but that would most likely be too slow, too impractical and a huge pain to write.)

> Note: the third component may have two subcomponents: the libc and the abi

There are a lot of platforms for which an ABI part just plainly doesn't exist in any triplet in the wild, so sounds like a bad idea.

ghost commented Jul 8, 2022

> I wouldn't fret too much about the vendor field. It is basically a comment.

> Not true!

  1. omit OS in favor of the vendor upon which the vendor name is used to determine
  2. it is important to know what counts as a vendor to disambiguate it

Specifically, in canonicalized autoconf-names, the vendor field is just a comment.

> Here nixpkgs will take none as the kernel because it thinks none is not a valid vendor

I think we will have to stop trying to guess what is a valid vendor.

> This is bikeshedding a bit, but maybe {host,build,target}Platform.canonical?

> I don't think it's worth it to rename config necessarily, it'll always be fuzzy. Or do you want to add a new field?

That is what I was proposing. And a comment in the documentation encouraging the use of canonical rather than config when possible.

> And system is <libc>(e?abi<abi>)?. Yeah, it's a mess.

> since autoconf doesn't devote much passion to this and mostly treats it as opaque with some known values:

Autoconf's config.guess calls this LIBCABI and parses it that way.

> Note: the third component may have two subcomponents: the libc and the abi

> There are a lot of platforms for which an ABI part just plainly doesn't exist in any triplet in the wild, so sounds like a bad idea.

This sounds like you're saying that because some platforms have only one ABI, nixpkgs should not support more than one ABI on any platform....

> One possibility would ofc be implementing a limited bash interpreter and coreutils emulation in Nix

😄 Reminds me of Sutherland's wheel of reincarnation, where their graphics processor grew its own operating system, and eventually needed its own graphics processor to offload the burden of rendering...

ghost commented Jul 8, 2022

Stepping back a bit, it looks like you really want to have nixpkgs' autoconf-name-parser support all the same inputs that config.sub supports. I think it's okay if we only support the fixed points of config.sub as inputs to our autoconf-name-parser.
