Every VM tutorial you ever studied is wrong (and other compiler/interpreter-related knowledge)

Note: this was originally several Reddit posts, chained and linked. But now that Reddit is dying I've finally moved them out. Sorry about the mess.


URL: https://www.reddit.com/r/ProgrammingLanguages/comments/up206c/stack_machines_for_compilers/i8ikupw/ Summary: stack-based vs register-based in general.

There are a wide variety of machines that can be described as "stack-based" or "register-based", but not all of them are practical. And there are a lot of other decisions that affect that practicality (do variables have names or only addresses/indexes? fixed-width or variable-width instructions? are you interpreting the bytecode (and if so, are you using machine stack frames?) or turning it into machine code? how many registers are there, and how many are special? how do you represent multiple types of variable? how many scopes are there (various kinds of global, local, member, ...)? how much effort/complexity can you afford to put into your machine? etc.)

  • a pure stack VM can only access the top element at any time (some instructions pop multiple elements). No practical language uses this; it is only suitable for use in a calculator, and has difficulty even dealing with variables efficiently (likely they will be accessed by name, which is inefficient - though I suppose globals could have indices if you're careful)
  • a stack VM plus "load/store at offset from top of stack". In languages that use this, functions are likely nothing special (i.e. there are no stack frames). Disassembling the stack code by hand can be confusing, since the ever-changing stack size means the offset to a given variable (= stack slot) changes (but programs can deal with it just as easily as the next type). It is possible for the stack to get "misaligned" if incorrect code is generated.
  • a stack VM plus "load/store at offset from bottom of stack" or equivalently "load/store at index within local variables" (which they pretend are not on the stack, but in practice they are). This kind of language has stack frames, making life much saner (both for programmers, and in case of incorrect codegen), though it could be argued that you're paying for an extra hardware register (but if you're writing an interpreter you probably shouldn't care about this, and if you're generating machine code you can do better if you really have to).
    • most well-known "stack-based" language VMs use some form of this.
  • a register-based VM with an infinite number of "registers" (= local variables or temporaries, = stack slots). This turns out to be very similar to the previous, except expressions generate anonymous local variables rather than having push/pop instructions everywhere - the entire stack frame gets created at function start.
    • most well-known "register-based" language VMs use some form of this. Within this category, however, there is a very wide variety of designs.
  • a register-based VM where there are only a finite number of registers, with explicit stack spills, like real machine code would do. This is a lot more complexity, and since you have to hard-code the VM ISA's number of registers, an architecture-independent interpreter is impossible, because real machines have widely varying numbers of registers (sometimes even the same hardware, if you use it differently). I suppose you could say that compilers generating machine code use this as a temporary representation, after all architecture-independent optimizations have been done.
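To make the contrast concrete, here is a minimal sketch - all opcode names invented - of how x = a + b might encode under the two most common designs above (C used just to make the bytes explicit):

#include <stdint.h>

/* stack VM with "load/store at index within local variables";
 * the arithmetic instruction takes no argument */
enum { S_LOAD_LOCAL, S_ADD, S_STORE_LOCAL };
static const uint8_t stack_code[] = {
    S_LOAD_LOCAL, 0,   /* push a            */
    S_LOAD_LOCAL, 1,   /* push b            */
    S_ADD,             /* pop 2, push a + b */
    S_STORE_LOCAL, 2,  /* pop into x        */
};

/* register VM flavor with a single accumulator (see the posts below);
 * every instruction takes exactly one argument */
enum { A_LOAD, A_ADD, A_STORE };
static const uint8_t acc_code[] = {
    A_LOAD, 0,    /* acc = locals[0] (a)  */
    A_ADD, 1,     /* acc += locals[1] (b) */
    A_STORE, 2,   /* locals[2] (x) = acc  */
};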

I'll be back to write more later, but at least the basic post is complete.


URL: https://www.reddit.com/r/ProgrammingLanguages/comments/up206c/stack_machines_for_compilers/i8jtxpz/ Summary: general instruction encoding, N-argument, ancillary tables

1/3 Okay, writing more "later" now (ping /u/bare_et_spoergsmaal ; see also replies to this)

You quoted:

large number instructions doing register-to-register moves to prepare for operations

This only makes sense for finite-register machines (and specifically, only for those that use two-address instructions), which are mostly used for real hardware rather than VMs (though I suppose they could be used if an infinite-register VM also supports a finite number of "special" registers).


Now, let's talk about instruction encodings. In some form or another, an instruction consists of an opcode (e.g. "add") and some number of arguments (usually 0-3). Arguments might specify something to "read", "write", or "modify", or might just specify a special flag. We might also speak of instructions having implicit arguments (fixed registers, hidden state, stack slots, etc.), but when counting, those often aren't considered arguments.

It is usually a good idea to make it easy to figure out how many arguments (and of what type) an instruction has (thus, it is not solely for neatness that arithmetic instructions are usually grouped together, differing only in a few bits). For simplicity's sake, let us mostly assume that the opcode is one byte and each argument is one byte, though there may be reasons to use other encodings in certain places, and opcodes will be carefully grouped.

In a stack-based machine, most instructions do not take any immediate arguments; instead, they affect the stack (both the stack pointer and the value currently at the top of the stack). Sometimes the only instruction that takes an argument is "push immediate integer constant"; I have even seen VMs that do not allow any arguments at all (instead there are several instructions that push specific small constants; larger constants have to be built using arithmetic). Practically, however, you'll want to allow arguments for the "load/store at offset" instructions, and possibly more. Notably, in most stack-based VMs, arithmetic instructions never take any arguments (though some stack-based VMs might have separate "add immediate"-style instructions to simplify "push immediate; add" - which I mostly discuss below in the context of register VMs).

In a register-based machine, most instructions do take arguments. Particularly, arithmetic instructions usually take 2 or 3 arguments ("modify, read" or "write, read, read"). However, encoding that directly would make most instructions take several bytes, so often some arguments are implicit. For a finite-register machine, you might be able to encode each register in a few bits, but practical VMs must have infinite registers, so we need a whole byte per argument. (To get from "256 registers" to "infinite registers", either have an "extra data" instruction before it (which will require normal immediate reading to use |= rather than =), or else set a bit in the opcode to read some bytes following, or else use indirection, or else switch out the data tables (see below). This should be done in opcode-agnostic code. Note that your choices here may affect how easy it is to detect jumping into the middle of an instruction.)

My preferred VM design uses a single special register called the "accumulator"; this is an implicit "modify" argument to all arithmetic instructions, so that no opcode ever needs more than 1 explicit argument (and instructions that take 0 arguments are pretty rare). Thus, the bytecode is completely fixed-width (16-bit), except if the "extra data" stuff comes into play (depending on design: if extra data is needed before an instruction, all jumps must verify that the previous instruction was not "extra-data"; if extra data is needed after an instruction, verify that it decodes as a NOP/TRAP. One possibility is "if the high bit is set, it is a NOP/TRAP" - TRAP is probably a safer default, and besides, unless somebody is patching bytecode, is NOP even useful?). (Note that the design of UTF-8 is interesting here, though I don't recommend following it too closely, since it assumes most data is ASCII but our instruction frequencies are not going to be that nice.) Since there's only one machine-like register, we can declare it clobbered by every control flow instruction, which also keeps life simple/sane (it might have a meaningful value after return, but for gotos we should clobber it explicitly to prevent idiots from doing something weird).
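A minimal sketch of the dispatch loop for this accumulator design, with invented opcode names and without the "extra data" handling (real designs vary):

#include <stdint.h>

enum op { OP_LOAD, OP_LOAD_IMM, OP_ADD, OP_ADD_IMM, OP_STORE, OP_RETURN };

int32_t run(const uint8_t *code, int32_t *locals, const int32_t *int_consts)
{
    int32_t acc = 0;
    for (uint32_t pc = 0;; pc += 2) {              /* fixed-width: 16 bits */
        uint8_t arg = code[pc + 1];                /* the one explicit argument */
        switch (code[pc]) {
        case OP_LOAD:     acc = locals[arg];      break;
        case OP_LOAD_IMM: acc = int_consts[arg];  break;  /* table lookup */
        case OP_ADD:      acc += locals[arg];     break;
        case OP_ADD_IMM:  acc += int_consts[arg]; break;
        case OP_STORE:    locals[arg] = acc;      break;
        case OP_RETURN:   return acc;   /* result left in the accumulator */
        }
    }
}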

Additionally, I strongly recommend having external data tables that the argument is usually looked up in. This severely reduces the need for "extra data" arguments, since e.g. while it's easy to have a function that's more than 256 instructions long, it's much rarer to have one with more than 256 branches. A list of tables you probably want (one possible layout is sketched after this list):

  • a table of integer constants
  • a table of string constants
  • (possibly other tables of constants if you support other types. But those two deserve dedicated tables regardless)
  • a table of local variable names (solely for debugging)
  • a table of global variables that this function accesses
    • for simplicity, these are often names since each function usually doesn't know how many total global variables exist, but you could intern them into indices as you load/generate the code
  • a table of other (user-defined) functions that this function calls
    • side note: I recommend implementing this without touching the machine stack
  • a table of builtin/intrinsic functions that this function uses (or, if there are few enough, this table might be global. That said, it's probably worth merging identical (or small) tables regardless???)
  • a table of jump targets within this function
    • possibly store/generate a name for debugging reasons
    • if you want to eliminate "extra data" opcodes entirely, this table might also include data for "switch out the data tables", though this may make it harder to debug - alternatively, you might have a "jump to function" that acts similarly (I admit this idea is a recent development on my part ... wait, how do local variables work if you do this?)
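For concreteness, here is one possible layout of those per-function tables; all field names are invented, and plenty of variations are reasonable:

#include <stdint.h>

struct vm;                                 /* opaque interpreter state */
typedef void (*intrinsic_fn)(struct vm *);

struct function_data {
    const int64_t      *int_consts;        /* integer constants */
    const char *const  *string_consts;     /* string constants */
    const char *const  *local_names;       /* solely for debugging */
    const uint32_t     *globals;           /* interned global-variable indices */
    const struct function_data *const *callees;  /* user-defined callees */
    const intrinsic_fn *intrinsics;        /* builtins this function uses */
    const uint32_t     *jump_targets;      /* bytecode offsets in this function */
    const uint8_t      *code;              /* the bytecode itself */
    uint32_t            n_locals;          /* frame size, known up front */
};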

It should be noted that this encoding is optimized for execution, not manipulation. If you do optimizations, they should probably be done earlier in 3-address form (though note that it is quite possible to go from this form back to 3-address form if needed), and likely in SSA as well (but be aware that going back from SSA to mutable-register form isn't entirely trivial if you don't want to waste registers - which we don't, since we have the soft limit of 256 locals)


URL: https://www.reddit.com/r/ProgrammingLanguages/comments/up206c/stack_machines_for_compilers/i8ju0cg/ Summary: Opcode details - binary and unary, register and immediate arguments, commutative, mirrored, and reversed.

2/3 Finally, let's talk about what opcodes to provide:

  • Binary operators:
    • Boolean (6 mandatory opcodes, 20 possible opcodes):

      • EQ, NE, LT, LE, GT, GE
        • I've seen VM designs where these are merged into a single CMP opcode, with an argument specifying the kind of comparison. But I think that's pretty dumb for various reasons (note: it does, however, make sense during optimization), and in any case it's not possible since we only get one explicit argument and one implied argument in my design. That said, these are probably encoded adjacent to each other, so if we really wanted we could say "the type of comparison is an argument that is embedded in the opcode byte itself".
        • Note that jump instructions can embed a comparison directly; these instructions will only appear if you assign the result of a comparison into a variable or use it in a further arithmetic expression.
        • You might need to split out signed vs unsigned comparisons (not needed for EQ and NE). But unlike division/modulus below, you could always just xor the top bit.
      • optionally LOGICAL_AND, but note that the frontend normally won't generate it (the && and || operators usually create branches)
      • there's less need for LOGICAL_OR since often any nonzero value is okay. But if storing a bool back in an integer variable, supporting this removes a coercion of the output if you don't know the input is already a strict bool
      • some people might find LOGICAL_XOR and LOGICAL_XNOR interesting, but they are just equivalent to NE and EQ after coercing both inputs to bool (if you don't know they already are)
      • if you're really pushing it, you could add LOGICAL_{NAND,NOR,LT,LE,GT,GE} as well so that all 10 nontrivial truth tables are accessible, but who does that?
      • there is no need for a reversed version of any of these. LT is the reverse of GT, LE is the reverse of GE, and the rest are commutative
    • Arithmetic, commutative (5 mandatory opcodes, 14 possible opcodes):

      • ADD, MUL
        • if statically-typed (please), also FADD and FMUL
      • BITWISE_XOR, BITWISE_OR, BITWISE_AND (I use long names so AND doesn't get visually confused with ADD)
      • optionally BITWISE_XNOR (implement as ~(a ^ b))
      • if you're really pushing it, BITWISE_{NAND,NOR,LT,LE,GT,GE} (technically these last 4 aren't commutative, but as above their inverse is in the set)
    • Arithmetic, noncommutative (2x 7 mandatory opcodes, 2x 11 possible opcodes):

      • SUB
        • there is no point in an x86-style CMP instruction. Since we don't have a flags register, we need to change the value in the accumulator and branch off of that (overflow? what overflow?)
      • BITWISE_SHL, BITWISE_SHR, and optionally ARITHMETIC_SHR (let's be honest, do people ever really use that intentionally?)
      • SDIV, UDIV, SMOD, UMOD
        • Yes, division/modulus are different for signed and unsigned integers. Unsigned div/mod cannot be implemented in terms of signed divmod. Signed div/mod can be implemented in terms of unsigned by adding branches, but ick.
          • Note that this is distinct from the problem of "what's the sign of the result of modulus" (which finally got standardized by C to what the hardware actually does; note that many famous languages do something weird instead)
          • You can only merge the signed/unsigned versions (and skip the second trap check) if your language only supports BigIntegers.
        • Don't forget to handle division by 0 and division of the most negative number by -1
      • FDIV if you support floating-point numbers at all. If your language is dynamically-typed (read: chooses to perform badly) you do not need any other floating-point-specific opcodes; if it is statically-typed you also need FADD, FMUL (these 2 are commutative), FSUB, and FMOD. Note that FDIV and FMOD do not support the divmod identity. Note that all floating-point opcodes should be grouped together separately from the integer instructions, since they take a different type of argument.
      • if statically-typed, you also need separate opcodes for "adding" strings and other important builtin data types (e.g. lists). But you could instead convert those to intrinsic calls - in fact, if you support operator overloading of user-defined types inside your language, those should get converted to function calls first, then those function calls get treated either as user-defined function calls or intrinsic function calls, so strings/lists are not really special at all (assuming they don't have dedicated opcodes, of course)
      • since these are noncommutative, you need a separate "reversed" instruction for all of these: RSUB, RSDIV, etc. The forward opcode is accumulator = accumulator OP argument, the reverse is accumulator = argument OP accumulator
    • For all of the above binary operators (including the reversed ones), there should be 2 actual opcode values (separated by a single bit, so the VM can handle them sanely) - one where the argument is interpreted as an index into the (runtime) local variables, and the other where the argument is an index into the appropriate constant data table (usually integer, sometimes float). See the sketch after this list.

      • I call these "immediate" (suffix _IMM) even though it does a table lookup, since in the assembly language it is an immediate. I don't believe there's a point in separate opcodes that directly use the raw immediate.
  • Unary operators:
    • there is no unary +x, that's a NOP and should not be emitted.
    • there is no unary -x, that's just 0-x (RSUB with immediate argument)
    • there is no unary !x, that's just x == 0
    • there is no unary ~x, that's just x ^ -1
    • Unlike real machines, popcount(x) is probably not important enough to have its own opcode in a VM; it can be an intrinsic. Note that parity(x) is just popcount(x) & 1.
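Here is a sketch of how the forward/reversed and register/immediate variants might decode, using the adjacent-opcode pairing suggested above (the encoding is invented); note how unary minus falls out as RSUB with an immediate 0:

#include <stdint.h>

enum op { OP_SUB = 0, OP_SUB_IMM = 1, OP_RSUB = 2, OP_RSUB_IMM = 3 };

int32_t do_sub(enum op op, int32_t acc, uint8_t arg,
               const int32_t *locals, const int32_t *int_consts)
{
    /* the low bit selects which table the argument indexes */
    int32_t v = (op & 1) ? int_consts[arg] : locals[arg];
    if (op >= OP_RSUB)
        return v - acc;    /* reversed: acc = argument OP acc */
    return acc - v;        /* forward:  acc = acc OP argument */
}

/* so -x compiles to RSUB_IMM k, where int_consts[k] == 0 */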

URL: https://www.reddit.com/r/ProgrammingLanguages/comments/up206c/stack_machines_for_compilers/i8ju1bm/ Summary: special instructions - memory, control flow, function arguments

3/3 opcode details, continued

  • Memory stuff (5 mandatory instructions):
    • LOAD, STORE (taking an index into the local variables)
    • LOAD_GLOBAL, STORE_GLOBAL (taking an index into the current function's data tables, which in turn contains the name (or possibly interned index) of a global variable)
      • alternatively you could demote these to be intrinsics, in hopes that people stop using global variables
    • if you support closures, you might also need special load/store for captured arguments (but please, make everyone's life easier and use eager captures, not lazy ones!)
    • if you are a multiuser server or something, you probably want LOAD_USERDATA and STORE_USERDATA (in fact, I've seen 3 levels of this (account, character, session) - though it used sigil-based distinctions rather than something sane). Note there's nothing stopping you from having user.foo = x generate this automagically.
    • if you support general-purpose objects, you might need some sort of LOAD_FIELD and STORE_FIELD, but keep in mind that the latter requires 3 arguments no matter how you count it (maybe demote it to an intrinsic ... or maybe just store the extra arguments in the same place the intrinsic does. I am, after all, quite skeptical of trying to cram an Object in the accumulator)
    • LOAD_IMM (taking an index into the integer constant table)
    • I don't believe there's any point in having an explicit "zero the accumulator" instruction, since x OP 0 will be encoded as OP_IMM, and 1 OP 2 will be constant-folded. There may, however, be some use in a STORE_ZERO instruction (but it is somewhat less useful if we zero out all local variables while setting up the stack frame, and let's be honest: we should)
      • but see the below discussion about ARG
    • I can imagine a COPY bytecode that treats both its explicit argument and the accumulator as local variable indices. But since you'd have to set the accumulator somehow in the first place, and there's no other context in which the accumulator might already be set to a local variable index, this currently isn't saving any instructions.
      • but again, see the ARG discussion
  • Control flow:
    • JUMP (unconditional; argument is index in the data table of jump targets)
      • under some circumstances (e.g. if you use the "switch out the extra data" idea), you might wish to enforce that if you ever jump to a bytecode position, you always jump to it. This is as simple as checking that the target's previous instruction (be sure to consider whether extra data is possible or valid!) is some kind of jump (or you're at the start of the bytecode - but are you sure you want to be jumping there?). That said, I have seen VMs that have a LABEL instruction even though it does nothing at runtime. (may be useful for debugging too)
    • JUMP_EQ, JUMP_NE, JUMP_LT, JUMP_LE, JUMP_GT, JUMP_GE
      • only the first 2 are strictly necessary
      • note that if your frontend actually generates the other 4 (by rewriting a OP b as (a - b) OP 0 when the argument is not 0 already), you're probably declaring UB: using 8-bit integers for demonstration purposes, 127 > -1 is true, but 127 - (-1) wraps to -128, and -128 > 0 is false. But if it gives us better codegen, EMBRACE UNDEFINED BEHAVIOR
        • this is one of the few times it's tempting to have multiple accumulator registers, but it's not worth it. Just generate the separate comparison operators, sigh
        • actually, what if we instead just made a "saturating subtraction" instruction? Actually 2, because now we care about unsigned. But call them SCMP and UCMP so we don't have to explain what saturation is? Or at that point should we just use a normal cmp? What generates better machine code?
      • jump if accumulator OP 0 using signed comparisons. There is no need for separate unsigned instructions since the argument is hardcoded as 0:
        • x ULT 0 is always false
        • x ULE 0 becomes x EQ 0
        • x UGT 0 becomes x NE 0
        • x UGE 0 is always true
        • (what do you mean, you don't want UB for unsigned integers even if you want it for signed ones)
      • late edit: it is often worth considering these to take two branch targets and unconditionally jump, at least for analysis purposes. Note that if you actually implement the bytecode that way, JUMP_LT and JUMP_GE happen to merge into a single opcode (with the targets swapped).
      • late edit: it might also be useful to consider Fortran-style arithmetic if, with 3 total targets, but think: how would that be encoded?
    • RETURN (unconditional; argument is ignored)
      • note that this counts as a jump instruction (technical term "terminator"), unlike calls
      • if the function returns an (integer) value, it should probably go in the accumulator, where the caller can store or use it as needed.
        • But (especially if it's a complex type) you might instead put it in the same place you put arguments. Or, for that matter, there's nothing stopping you from having an arbitrary number of out or inout arguments!
      • note that, unlike real machines, a VM should not read the return location from the VM data stack. Use a separate stack just for call/return (perhaps arrange the obligatory data structure into an implicit linked list?)
    • CALL_FUNCTION, CALL_INTRINSIC
      • all functions and intrinsics should specify the expected number and type of arguments. See below for my suggestion for how arguments are passed (note that this is the part where you have the most choice)
      • these look identical at the bytecode level, other than which data table the target is looked up in.
      • in the runtime, however, CALL_INTRINSIC necessarily involves a function call (on the real machine stack), whereas CALL_FUNCTION is best implemented simply by tweaking the data of the VM stack and remaining in the same machine stack
        • besides eliminating stack overflows, this also means you can pause/resume the VM at any time, swap between tasks if needed, etc. you might interpret opcodes differently when resuming onto them - or you might require certain opcodes to appear in pairs, or who knows what.
      • note that intrinsics should NEVER call back into the VM. If you're tempted to do so, you should probably be adding a special-purpose Opcode instead (you might have e.g. a "prompt the user with a list of menu options, and jump to that target" instruction)
        • likewise, while I'm thinking about it: if you support operator overloading, don't dispatch to it from an ADD opcode - only use that when you know both arguments are integers (maybe there's a way to avoid this, but it's sure to be complex and thus bug-prone)
    • ARG, ARG_IMM, and possibly some helpers: store arguments in preparation for a function/intrinsic call
      • the most sensible place to store incoming arguments is at the bottom of the current stack frame, below "true" local variables; the most sensible place to store outgoing arguments is at the top of the current stack frame, above all the temporaries (note: in the IR, you might not know how many temporaries there are yet, so you should refer to them by argument index rather than local-variable index). No copying is necessary if you simply use overlapping (data) stack frames - see the sketch after this list.
      • now I hope you're paying enough attention to ask "but wait - how do you specify both the value and the index of the argument?". Well, there are several ways to do it - you could use dedicated opcode bits (see the technicality for what to do if you have more than N (8?) arguments), you could use extra data instructions, you could use a separate arg_data table that contains the pairs, or you could use the accumulator (initialize it to zero (or some other value?), then have each ARG instruction increment (or decrement) it as a side-effect - this does mean you can't evaluate arguments as you go (though if the inner functions have more arguments you can't do that anyway), since all calls clobber it (hm, but do they really have to?)). Consider also whether arguments are passed in forward or backward order - and also, in what order foo(bar(), baz()) and foo(bar(1, 2, 3), baz(4, 5, 6)) are evaluated.
        • depending on how exactly these instructions (and their helpers) are implemented, some of them might be generally useful as well
      • technically these aren't needed at all, but if we don't have them, there would be a lot of register-register copies, which take 2 instructions each
      • there is no need for an "arg from accumulator" instruction, since outgoing arguments are just ordinary (nameless) local variables.
      • it may be tempting to use the accumulator for the first (or last) argument, but that adds a surprising amount of complexity, so I'm not sure it's worth it.
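A minimal sketch of the overlapping-frames idea (layout invented): ARG writes outgoing arguments just past the caller's frame, and the callee's frame base simply aliases those slots, so CALL copies nothing.

#include <stdint.h>

struct frame {
    int32_t *base;     /* this frame's slot 0 in the value stack */
    uint32_t size;     /* incoming args + true locals + temporaries */
};

/* ARG i stores to caller.base[caller.size + i]; CALL then does: */
struct frame enter_call(struct frame caller, uint32_t callee_size)
{
    struct frame callee = { caller.base + caller.size, callee_size };
    return callee;     /* callee.base[0..nargs) already holds the arguments */
}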

If you add all of the above together, including the weird ones, this is around 120 opcodes, which still fits in 7 bits (so we can dedicate a single bit to flagging extra data), though with not much to spare for future expansion. If we only use the sensible ones it's about 64, but remember to preserve space for future expansion rather than cramming it (so give up on the idea of two dedicated bits for flagging extra data - you cannot specify both "this is extra data, so trap instead of executing" and "this instruction needs extra data, so remember to go look for it" if you want to handle 15 bits of extra data per instruction)

Edit: if you've gotten here, you might also read my thoughts on static typing in relation to VMs (which links back to here)


URL: https://www.reddit.com/r/ProgrammingLanguages/comments/v7q9xg/how_to_implement_static_typing_in_a_c_bytecode_vm/ibmexzq/ Summary: static typing, inside and outside the VM, and their tension. Tagging and how to avoid it, even with unwinding.

There is some tension between "statically-typed inside the VM" and "statically-typed outside the VM".

Most VM languages make the decision to YOLO inside the VM, and use some kind of tagged union, tagged pointer, or NAN tagging outside the VM. Using tagging is mandatory since dynamic languages don't know ahead of time which member will be valid.

However, if you have full static type info, you can use untagged unions to represent VM objects. This is a slight penalty to debugging, and makes unwinding tricky, but it is worth it. Besides, you can still have a debug mode where the full type (or partial type) is stored as part of the object itself (which is then a 2-word-wide tagged union rather than a 1-word-wide raw union as it is in release mode) and asserted on reads, but otherwise the type is never relied upon. But if you support any kind of suspending, you probably need to store the partial types somewhere. You could have an explicit unwind table, or you could just use a parallel bit-vector (as in: struct-of-array, not array-of-struct - with the understanding that array-of-bit is optimized).
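A sketch of that release-mode representation (all names invented): the value itself is an untagged one-word union, and the partial types live in a parallel bit-vector that is consulted only when suspending or unwinding, never on ordinary reads.

#include <stdint.h>

typedef union {
    int64_t i;       /* primitive */
    double  f;       /* primitive */
    void   *obj;     /* anything reached through a vtable */
} value;             /* one word wide; no tag in release mode */

struct frame_meta {
    /* struct-of-arrays: one bit per stack slot, 0 = primitive,
     * 1 = needs lifetime tracking during unwinding */
    uint64_t is_object[(256 + 63) / 64];
};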

Chances are that you can get away with only 2 partial types, but it may be convenient to use 3 or 4, depending on what exact decisions you want to make:

  1. A primitive type. This includes integers, floats, maybe decimals or fixed-point numbers if you want. These require no lifetime-tracking and no deallocation.
  2. A simple allocated type. You may want to do refcounting on these, but otherwise there is nothing fancy about them; the only thing you need to do with them is deallocate them when their lifetime ends. Strings belong here, as do arrays of primitives ... but if you think about it, you should be able to handle this as a special vtable for one of the below.
  3. An object whose class is defined in the in-VM language. For these, you fundamentally cannot use the host's vtable mechanism, but you can (and should) create your own vtable system to deallocate the types. You should NOT use a dict-based structure - since you have static type info, you can transform field names into field indices (if you need reflection, make your vtable support that).
  4. A VM-support object, if this really needs to be handled specially. This includes dicts and arrays for non-primitive types.
  5. A host object type, strongly owned. If your host exposes any objects to the VM, this is what they'll use.
  6. A host object type, weakly owned. This may be useful if you're thinking in terms of a multiplayer game, and the player may log out while the script is suspended. Alternatively, you may hard-code setting these in the VM's "context" struct.
  7. A function pointer (host or in-VM, with or without capture). If these are allowed at all, you should probably desugar them in the frontend to use vtables. For once, Java is a useful inspiration.

(yes, I listed more than 4 things above. You should be able to eliminate/merge enough easily so your bitvector uses only 2 bits for the partial types. But life will get MUCH easier if you can reduce it to only 2: "primitives" and "standard VM objects" - you should be able to support all of 2..6 with a single custom vtable design if you think about it.)

Note that if you use a stack VM, you'll need to use range analysis to determine the type of temporaries on the stack. Depending on what exact unwinding strategy you want, this may be obvious anyway. But I am a strong proponent of register-based VMs (with a single special accumulator) anyway. I wrote up a lengthy thing a while back about them - I recommend reading it all! (note that it links back here now)

Note that within the language that gets compiled to run in the VM, you may offer additional types. However, these should desugar into simpler accesses in the frontend.

For example, if you have a magic player object, the code:

player.attack = player.strength

might desugar into player_set_attack(player_get_strength()) or even player_set(PLAYER_ATTACK, player_get(PLAYER_STRENGTH)) before the VM sees it at all. A key observation is that the host should probably never expose enums directly into the VM language (an annoyance I learned the hard way from other people's game VM code).

Important note: in general, I assume you can trust your bytecode not to be malicious (since otherwise, VM escapes are trivial), which mostly means "bytecode is generated at startup, not from a cached file", but could also mean "I trust the cached file" or "I have a way to verify the bytecode statically as well" (which is quite doable, but that's more work).


URL: none

Just a brief later update - in the above, the thing I would again emphasize as by far the least settled is what exactly the function-call sequence looks like. To avoid the need to potentially change the stack map for function calls (which is otherwise not needed for a register-based language - though even here it is only needed if interruption between arguments is a concern, so doing extra copies is also feasible), perhaps allow the CALL opcode to specify a start-of-arguments slot anywhere in the frame (and always do a copy to the new frame?). (note that the number/types of arguments can be read off the function descriptor)

(this one isn't from Reddit, but it's one of my best StackOverflow posts and it's relevant to think about for compilers and certain kinds of interpreters)

URL: https://stackoverflow.com/a/51229603/1405588

Instead of answering "what does it do?", I'm answering "how do I make it do what I want?" There are 5 kinds of inlining, all available in GNU C89, standard C99, and C++. MSVC has some of them (note that I haven't tested the MSVC code)

always inline, unless the address is taken

Add __attribute__((always_inline)) to any declaration, then use one of the below cases to handle the possibility of its address being taken.

You should probably never use this, unless you need its semantics (e.g. to affect the assembly in a certain way, or to use alloca). The compiler usually knows better than you whether it's worth it.

MSVC has __forceinline which appears mostly the same, but apparently it refuses to inline in quite a few common circumstances (e.g. when optimization is off) where other compilers manage just fine.

inline and emit a weak symbol (like C++, aka "just make it work")

__attribute__((weak))
void foo(void);
inline void foo(void) { ... }

Note that this leaves a bunch of copies of the same code lying around, and the linker picks one arbitrarily.

MSVC doesn't appear to have an exact equivalent in C mode, although there are a couple of similar things. __declspec(selectany) appears to be talking about data only, so might not apply to functions? There is also linker support for weak aliases, but does that work here?

inline, but never emit any symbol (leaving external references)

__attribute__((gnu_inline))
extern inline void foo(void) { ... }

MSVC's __declspec(dllimport), combined with an actual definition (otherwise unusual), supposedly does this.

emit always (for one TU, to resolve the preceding)

The hinted version emits a weak symbol in C++, but a strong symbol in either dialect of C:

void foo(void);
inline void foo(void) { ... }

Or you can do it without the hint, which emits a strong symbol in both languages:

void foo(void) { ... }

Generally, you know what language your TU is when you're providing the definitions, and probably don't need much inlining.

MSVC's __declspec(dllexport) supposedly does this.

inline and emit in every TU

static inline void foo(void) { ... }

For all of these except the static one, you can add a void foo(void) declaration above. This helps with the "best practice" of writing clean headers, then #includeing a separate file with the inline definitions. Then, if using C-style inlines, #define some macro differently in one dedicated TU to provide the out-of-line definitions.
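For example, the C-style version of that pattern might look like the following; the file and macro names are invented for illustration:

/* foo_inl.h - the inline definitions, included by every TU */
#ifndef FOO_LINKAGE
#define FOO_LINKAGE inline     /* C99 inline: no out-of-line symbol emitted */
#endif
FOO_LINKAGE void foo(void) { /* ... */ }

/* foo_impl.c - the one dedicated TU that emits the real definition */
#define FOO_LINKAGE extern     /* plain strong definition here */
#include "foo_inl.h"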

Don't forget extern "C" if the header might be used from both C and C++!


There are also a couple of related things:

never inline

Add __attribute__((noinline)) to any declaration of the function.

MSVC has __declspec(noinline) but it is documented to only work for member functions. However, I've seen mention of "security attributes" which might prevent inlining?

force other functions to be inlined into this one if possible.

Add __attribute__((flatten)) to any declaration of the function.

Note that noinline is more powerful than this, as are functions whose definition isn't known at compile-time.

MSVC doesn't appear to have an equivalent. I've seen a single mention of [[msvc::forceinline_calls]] (applied to a statement or block), but it's not recursive.


(in comments, I reminded people that "inline" is more about symbol behavior than performance)

(in response to https://www.foonathan.net/2022/08/malloc-interface/#content being posted)

URL: https://www.reddit.com/r/cpp/comments/x2lbsc/malloc_and_free_are_a_bad_api/imkk60p/

But that's still not everything it needs to do:

  • alignment at/with offset. Currently, Microsoft's allocator is the only major one that provides this. Note that an offset and alignment can be stored in a single word and distinguished by using the usual bit trick to find the highest bit set (see the sketch after this list). Note that some libraries interpret the offset as positive, others as negative (which one makes sense depends on whether you are thinking "where is this object (which might be inside another object)" or "what do I need so I can place an object at an offset within this one").
  • flags: knowing whether or not you need to zero the object yourself can matter sometimes; the compiler should be able to add/remove this flag to calls. But other flags are possible. I have a list somewhere ...
  • The existence of mremap means that the allocator does need to provide a realloc that moves. Note that only C++'s particular interpretation of move constructors prevents mremap from working.
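For the first bullet, here is my reading of that single-word encoding (a sketch, not any particular library's API): the alignment is a power of two and the offset is smaller than it, so the highest set bit recovers the alignment and the remaining bits recover the offset.

#include <stdint.h>

/* requires: align is a power of two, offset < align (so the word is nonzero) */
static inline uint64_t pack_align_skew(uint64_t align, uint64_t offset)
{
    return align | offset;
}

static inline uint64_t unpack_align(uint64_t word)
{
    uint64_t bit = UINT64_C(1) << 63;
    while (!(word & bit))      /* find the highest set bit */
        bit >>= 1;
    return bit;
}

static inline uint64_t unpack_offset(uint64_t word)
{
    return word ^ unpack_align(word);
}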

Okay, I dug out my list of flags. This is not necessarily complete; please post any others that might be useful. Not every allocation library needs to actually support all the flags; only a couple are mandatory.

  • zero the returned memory (safe default for alloc, though it may be expensive. Special support is required for realloc, however - this is one of the most blatant flaws in existing allocators!)
  • maymove (full support mandatory; only for realloc-equivalent. An implementation that lacks support for this flag cannot work: "always fail if we can't resize in place" and "always move if we can't resize in place" both break callers that assume the flag one way or the other)
  • align_size (safe default: if you request "size 17, align 16", forcibly increase the size to 32)
  • small_align (optional: normally, if you request "size 17, align 1", some allocators will treat this as "align 16" and waste space. Unfortunately, compilers then assume the returned value is aligned and perform bogus optimizations. Supporting this flag is really about compiler support rather than library support.)
  • nopointers (optional; useful in the context of GCs. Strings, large I/O buffers, and other arrays of primitives should use this.)
  • secure_zero (mandatory; free/realloc only)
  • free_on_failure (mandatory; realloc only)
  • size_is_a_hint (optional: don't consider it an error if it's more convenient to return a slightly-smaller allocation. For realloc it should probably force the size to grow at least somewhat. Remember that many allocators have a couple words of overhead at the start of the page.)
  • compact (optional: we know the final size exactly; don't bother to prepare for realloc)
  • various flags might be useful based on the expected lifetime and usage of the allocation:
    • assume stack-like lifetime. If you free these in reverse order the allocator will be more efficient than if not. Likely this means "don't attempt to reuse freed space if freed out of order"; note that this is likely to happen to some extent if .
    • assume short (but not stack-like) lifetime
    • assume long lifetime (possibly the life of the process, but not necessarily)
    • assume the allocation will never be freed
    • madvise is also relevant here. Should the pages be eagerly faulted, etc.
    • note that none of these flags actually affect what you are allowed to do. In particular, free is still safe (but may be a nop or delayed in some cases)
  • threadedness flags:
    • (free/realloc only) we know we are freeing this from the same thread that did the allocation
    • (free/realloc only) we know we are freeing this from a thread other than the one that allocated it.
    • used exclusively by the calling thread
    • used mostly by the calling thread
    • used mostly by one thread at a time
    • shared between threads, but mostly a single writer
    • shared between threads aggressively
    • note that kernels, hardware, and memory-debuggers might not have the infrastructure to support these yet. But we need to be able to talk about them. I'm not saying we need to standardize the particular flags, but we need to standardize a way to talk about flags.
  • flags relating to the CPU cache?

It should also be noted that size should not be a simple integer. It should be (at least conceptually) a 3-tuple (head_size, item_size, item_count), since expecting the user to compute the total size themselves invites overflow. Note that even systems that support reallocarray do not support this. That said, by doing saturating arithmetic it is possible to only store a single integer (a sketch follows).
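A sketch of how the 3-tuple collapses to a single stored integer with saturating arithmetic (helper name invented): an overflowing request saturates to SIZE_MAX, which no real allocation can satisfy, so it fails cleanly instead of wrapping.

#include <stddef.h>
#include <stdint.h>

static size_t total_size(size_t head_size, size_t item_size, size_t item_count)
{
    if (item_size != 0 && item_count > (SIZE_MAX - head_size) / item_size)
        return SIZE_MAX;       /* saturate; the allocation will simply fail */
    return head_size + item_size * item_count;
}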

It is tempting (for ease of interception) to specify all of these in terms of a single multiplexed function:

auto utopia_alloc(Allocation, AlignAndSkew, Size, Flags) -> Allocation;

(Precedent of realloc/mremap and aligned_alloc tells us that Allocation and AlignAndSkew should individually precede size but there is no precedent for the order between them. Precedent of mmap and mremap tells us that flags come last; note that they also support "specify a fixed address that must be returned" but with inconsistent ordering and besides I don't find it interesting to support for anonymous memory)

However, to minimize the overhead we actually shouldn't multiplex too aggressively, since there will be a lot of branches if we do. Intelligent use of inlining and code-patching may help get the best of both worlds.

Note that it is mandatory for free/realloc to support specifying Allocation in terms of the size originally requested. However, some flags might further constrain this somehow. Does it suffice to say "on realloc, all alloc-type flags must match exactly?"


(other people's suggestions)

  • randomize: Improve security at the cost of performance by randomizing where in virtual memory the memory is.

  • guard: Improve security by adding padding before and after the allocation, maybe with hardware support.

  • executable: Allow code to be written into the allocation and executed later. (W^X is a concern, though.)

  • ipc_sharable: Allow the memory to be visible in another process.

  • no_page: Don't allow paging to disk. Might need other flags to communicate desired OOM conditions (SIGSEGV on access? zero on access?).

  • compressable/uncompressable: Indicate that the OS should compress or not compress when paging to disk.

  • use large pages where possible to improve TLB efficiency


ipc_sharable: Allow the memory to be visible in another process.

What exactly are you thinking of here?

If you only want to share the memory with your children, passing MAP_SHARED | MAP_ANONYMOUS is sufficient. But if you want to allow sharing with arbitrary processes, you need a filename so others can access it in the first place.

I do think there is a use case for an instantiable allocator (with filename a ctor argument) that deals with sharing, but this does not seem like a flag even for the anonymous case.

(some of the other flags here might also belong to different types of instantiable allocators)


Another way would be a syscall where one process can copy part of the virtual memory table from another process. But I don't think OSs expose this to user space programs currently. (But they could!)

This is fundamentally impossible for private mappings (which are the most common) because of how fork() works. Because private mappings are so overwhelmingly common, it doesn't make sense to provide such an API.

I suppose you could say "private mappings are then subject to CoW again" but that has no advantage over the existing process_vm_readv / /proc/<pid>/mem methods.


I'm not aware of security caring about "this page was released to the OS and somebody somehow got a hold of it before the OS zeroed it", which is the only case that seems like it would need OS support?

It's true that securely zeroing things requires you to worry about whether temporaries still live in registers or got spilled to the stack, but that's actually relatively easy to take care of at a non-inlineable function call boundary.

Some might argue "this doesn't need to be a flag, it can just be a special memset", but allocators do really want to know if they have an already-zeroed chunk of memory.

(TODO pull out more of the parsing ones that have more details and merge them into this)

The single most important thing you want out of your lexer and parser is: if you make a mistake writing their grammars, do they complain? And the second is likewise: when you feed input to them, are they guaranteed to return in a reasonable amount of time?

Most approaches, unfortunately, will silently do the wrong thing. So just reject those approaches entirely, and treat all warnings as errors:

  • For lexing, make sure you do not iterate through rules and try each one in turn. Writing a lexer manually in a language without switch is annoying in this regard, even ignoring the other concerns. In particular, backtracking-based regex engines (usually PCRE-adjacent, and unfortunately often the only one built in to a language) do this and have additional exponential behavior, so should be avoided at all costs. DFA/NFA-based regex engines (usually POSIX-adjacent) are fine.
    • Do not attempt to match keywords using lexer rules. Rather, match them all as identifiers, then do a lookup in a dict/map (see the sketch after this list). This also allows you to enable/disable keywords depending on configuration.
    • Tooling is much easier if comments are actual tokens (which can then be parsed, constraining them to particular positions in the later grammar), not just ignored as whitespace.
    • It is extremely useful if you can start lexing anywhere in a file rather than only at the start. The main thing that people mess up here is multiline strings. Take inspiration from comments: either require a prefix on every line, or use asymmetric start and end delimiters.
    • The main sanity-related concern is: "is any lexer rule unreachable due to some prior rule having priority?", with a secondary "what happens if none of my rules match?".
    • Unfortunately, there is no obvious winning tool for lexing. I suppose you could use re2c and scrape the .dot files? flex unfortunately doesn't have much in the way of usable dumps.
  • Note that often you want some small stage between lexing and parsing, especially if you used a pure machine to automatically lex rather than hand-writing it or using a code generator with semantic actions.
    • synthesizing indent/dedent tokens out of thin air can be done here
    • token-tree-first parsing is an interesting approach
  • For parsing avoid any approach that does backtracking. Check the documentation; common terms to avoid (at all costs) are "parser combinator", "PEG", "packrat", "memoization", "GLR(k)", and "LL(*)" (where k can be replaced with a number, and * is literal). LL(1) is fine but be aware of the dance it requires for expressions. LR(1) is fine with no downsides (and in fact I've never seen anything more than LALR(1) needed, nor use of higher k values). SLL(k), SLR(k), and LR(0) are too trivial to be useful.
    • I find by far the most useful approach is to use bison --xml and build a custom machine to run that. The runtime is really simple once you have tables; generating the tables is the only hard part. I frequently see people saying "don't use parser generators" but I don't think I've ever seen a criticism that applies to this usage of it (and honestly, a lot of the criticisms don't even apply if you bother to change away from the yacc-compatible defaults). And bison in particular supports a lot of useful grammar features that many others don't, the most important of which is operator precedence so you don't have to reduce tons of dummy intermediate nodes. The benefit of historical insight is a major thing you're going to lose if you stray from the battle-tested tools.
    • error recovery is hard
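As promised above, a sketch of the lex-as-identifier-then-look-up approach (token names invented). A hash map works too, but a sorted table or even a linear scan is fine for a few dozen keywords:

#include <stddef.h>
#include <string.h>

enum token_kind { TOK_IDENT, TOK_IF, TOK_ELSE, TOK_WHILE, TOK_RETURN };

static const struct { const char *name; enum token_kind kind; } keywords[] = {
    { "else", TOK_ELSE }, { "if", TOK_IF },
    { "return", TOK_RETURN }, { "while", TOK_WHILE },
};

enum token_kind classify_ident(const char *text)
{
    for (size_t i = 0; i < sizeof keywords / sizeof keywords[0]; i++)
        if (strcmp(text, keywords[i].name) == 0)
            return keywords[i].kind;   /* a keyword (config-gating goes here) */
    return TOK_IDENT;                  /* ordinary identifier */
}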

Most tools default to (or only operate in) pull mode. Changing to push mode is useful for the REPL though. But when you're not at a REPL, it's best to map an entire file into memory at once ahead of time. Note also that the SourceMap trick (which allows you to compress source locations into a single 32-bit integer - or two for source spans) requires careful thought to use in a REPL if you don't know when input is complete. But that's really just an optimization, not something critical.

Some more posts I've found that weren't as tutorial-y but were still up (r/ProgrammingLanguages and r/Compilers and r/C_Programming are up; r/programming and r/cpp and r/asm are down):

This is very much an unorganized mess.


URL: https://www.reddit.com/r/Compilers/comments/wimoej/how_hard_is_it_to_add_jit_to_an_interpreter/ijhj144/

You probably already know some of this, but there are several adjacent necessities for an efficient language. All of them are only fully possible with static typing:

  1. machine code, rather than bytecode (tree-walking interpreters are even slower). AOT vs JIT isn't critical, but see later notes.
  2. sane object representations, avoiding unnecessary memory allocations and pointer chasing. Failing this is the reason that Java never gets native performance except for purely-numerical code.
  3. register allocation, avoiding unnecessary stack spills. Even a pretty dumb regalloc algorithm will work fine for most functions, but functions that are just on the edge can see major improvements from more advanced regalloc.
  4. optimizations (inlining etc.). Only a handful are really important; after that it's just chasing tiny incremental improvements.

3 and 4 get expensive the more you implement. This ties back into the AOT vs JIT problem - or really, vs numerous possible degrees of JIT. Something of the form "hidden AOT" may be worth pursuing for ease-of-use - people like there being a simple way to start a program.

One notable optimization that is not basic is devirtualization (in its various forms), which very much would like to know about the entire class hierarchy. Proper language design can help here.

My well-received posts about static typing and register-based VMs might contain a handful of further insights, even though they are mostly not focused on AOT/JIT.


(in reply to a post by matthieum which mentioned 2 axes: dynamic-vs-static and jit-vs-aot)

URL: https://www.reddit.com/r/ProgrammingLanguages/comments/122ng8f/how_fast_is_jit_compiled_luajavascript_compared/jds9aus/?context=10000

There's a third axis actually:

  • inlined objects vs indirect objects

C tends to use inlined objects - structs are stored directly inside other structs. Java OTOH forces them all to be allocated separately, and this is a major cause of Java being 2x slower for all real-world programs (and also taking up 2x RAM, which is mostly separate from the latency issue proper unless cache misses are a relevant problem).

JIT alone should theoretically never hurt performance after startup. However, in practice JITted languages involve a lot of indirection, so they end up slower overall, even ignoring startup time.


There's also a fourth axis - deterministic memory management vs GC. Well-written programs mostly don't care about this axis (at least where performance is involved), but GC makes it easier to write bad programs and sometimes even makes it difficult to write a good program.


Note that the .NET environment does much better than the JVM environment when it comes to allowing you to choose performance.


(in reply to a post about C++ and Rust vs Java vs C# generics)

URL: https://www.reddit.com/r/ProgrammingLanguages/comments/13lzxgd/how_net_jit_compiles_generic_functions_so_fast/jkv5ot0/?context=10000

It's worth noting that most of those things you talk about in Java etc. are merely implementation limits, not fundamental limits.

If you put e.g. field offsets in your vtable (and pay the cost of using them for accessors - but since this kind of language is usually memory-bound, it's often a win), you can gain a per-T compact representation but still be usable in generic contexts. Notably this implies not doing type-erasure like Java does - instead, pass the vtables at runtime as an extra argument.

Fat pointers can be useful for some contexts (but not e.g. to support new T[]). Also keep in mind that there are both dynamic and static vtables involved.


(this is a whole thread, with only my posts kept)

URL: https://www.reddit.com/r/ProgrammingLanguages/comments/hc5een/how_fast_can_interpreted_code_be/fvgplgp/

10x slowdown is quite doable.

There are only 4 or so optimizations that give you 90% of the benefit.

Without inlining it will be impossible to write code that is both pretty and efficient.

Using static types makes life infinitely easier.


Inlining, constant propagation, dead code elimination (including branch elimination; requires knowledge of purity), rescheduling (both hoisting and deferring)

Note that these have to be run multiple times, and it might not make sense for them to actually be separate passes.

The above will get you the first 90%. For the next 90%, devirtualization and scalar replacement of aggregates are amazing ... just insanely difficult. You'll also want to be at least somewhat aware of allocation lifetimes.


Oh, just hang around a few years.

Inlining comes up in every discussion about optimization ever. Implementing it requires that your IR is halfway sane. It is impossible to get good performance out of a language without inlining (which, note, is impossible for virtual functions unless you devirtualize first).

Constant propagation and DCE are so obvious/trivial that many people hardly even think of them as optimizations. Only the branch-removal case is interesting.

Hoisting a pure computation out of a loop is just one of those things you expect a compiler to do. Moving a computation into a condition is less intuitive, but it's just the other half of it.

Code like the following is quite common, especially after prior optimizations.

void foo(Bar *bar, int qux)
{
    Baz *baz = bar->to_baz();  // pure call, but its result is only
    if (qux)                   // needed on one path: defer (sink) it
        use(baz);              // into the branch
}

Oh, there is a fifth basic optimization: merging recomputation of the same expression, if it's pure but not constant:

void foo()
{
     int x = pure(0);
     int y = pure(0);
     // can replace with `y = x`, which is a NOP in SSA.
}

Ideally you'll be able to do this even in semi-pure cases, such as reading the length of an array.

... as an alternative to TBAA, if we separate const and readonly, can we ask ASAN to change R/W bits for the duration of the latter? Unfortunately, page-level granularity is just not useful here.


Devirt and SROA I know because of all the swearing by GCC and LLVM developers. Note that C++ gets off easy since functions are non-virtual by default.

Allocation lifetimes ... uh, that's just obvious, it has to be handled to avoid spurious loads/stores all the time (note that this applies both to alloca and malloc). But you can afford to defer implementing this optimization because it's not hard for the end-user to write code to work around it (and for the malloc case, nobody has ever implemented a complete solution).


Devirt is simple in both what it needs to do, and why it is hard.

If you're calling a virtual function through an object, and you know which implementation the object uses, call that implementation directly instead of going through the vtable. However, it's tricky to actually know that. Some simple cases:

  • the method is final
  • the class is final
  • you can see the code that created the instance, and you disallow replacing the class (!)

You can also do partial devirtualization, where you only know what implementation is likely, because a pointer-compare followed by a branch is more predictable than an indirect call.
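In C terms, the guarded form looks something like this (names invented): compare the vtable pointer against the expected implementation and branch; the direct call is predictable and inlinable, and the indirect call remains as a rare fallback.

struct shape;
struct shape_vtable { double (*area)(const struct shape *); };
struct shape { const struct shape_vtable *vt; };

extern const struct shape_vtable circle_vt;   /* the likely implementation */
double circle_area(const struct shape *);

double shape_area(const struct shape *s)
{
    if (s->vt == &circle_vt)
        return circle_area(s);   /* devirtualized: direct, inlinable */
    return s->vt->area(s);       /* fallback: indirect call */
}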

Note that these answers can change with WPO or PGO.


I have no clue why SROA is such a nightmare, but it is. Conceptually, it's easy, just take:

void foo()
{
    struct Bar
    {
        int x;
        int y;
    } baz;
}

and turn it into:

void foo()
{
    int baz_x;  // was baz.x
    int baz_y;  // was baz.y
}

(assuming you never take the address or otherwise require the fields to actually be contiguous).

Note that this also applies to arrays.


URL: https://www.reddit.com/r/ProgrammingLanguages/comments/12z66qi/does_the_jvm_clr_even_make_sense_nowadays/jhugnnp/

As a VM, the CLR beats the JVM by a mile since it supports records (objects without silly allocations). It's been at least a decade that the JVM has been promising parity and I've long since lost hope. It is for this reason that a reasonably-written JVM program will always be twice as slow as the equivalent C program, but the CLR can usually match C ... unless the GC is a problem, which is often true for any nontrivial program. The problem with AutoCloseable/IDisposable is that they natively only handle one kind of ownership, whereas real programs need several kinds of ownership, GC not being one of them. You can sort of hack it if you're willing to call .clone_incref() manually but this may involve extra allocations, and you can't assume the JIT will elide them.

The JVM has a better ecosystem since the CLR is still used with a lot of Windows-only libraries (this has improved but not gone away). If you're writing a standalone language that doesn't matter though.

Targeting either of these is still miles ahead of targeting a dynamically-typed VM (javascript, python, lua, etc.) which unfortunately a lot of new languages do because it's "easier" and thus they forever cut themselves off from sanity.

WASM, Javascript, and Lua are major attempts at application-level sandboxing. System-level sandboxing is less prone to sandbox escapes though.

For AOT, the main options are LLVM, GCCJIT, GCC-plugin, or generate-C-code. This means no lazily customizing the binary per platform, and LTO is expensive, whereas JIT amortizes the cost and does profiling automatically but likely generates suboptimal code due to the need for patching. JIT is also highly problematic for oneshot scripts (dynamic languages often do badly here even without a JIT). It's possible to mitigate most of these problems (with either AOT or JIT) but you'll have to write a fair amount of tooling yourself. Hot reloading isn't impossible with AOT languages either, but to avoid leaks you have to use dlclose (not supported by all libcs - glibc supports it but ownership is very tricky).


URL: https://www.reddit.com/r/ProgrammingLanguages/comments/110gitm/are_people_too_obsessed_with_manual_memory/j89fzid/

The problem is that GC does not provide useful semantics for any top-level programming task. It's used solely to avoid making the programmer explain what they're trying to do.

(the shared/weak/unique/borrowed quartet still doesn't provide everything but it is much closer to useful)

Note that fragmentation is much less common when you aren't doing as many allocations (most GC languages gratuitously allocate small objects).

If compaction is really needed it is going to be most effective when coming out of the "nursery" or so, which should be solvable at the level of factories or copy ctors. There's still room for improvement, but "we already beat GCs easily" means there's not a lot of motivation.


URL: https://www.reddit.com/r/ProgrammingLanguages/comments/kph5n4/how_fast_an_interpreter_language_can_be/gi0r67q/

Memory layout matters quite a bit. Chasing pointers is expensive. Exceeding cache size is a huge cliff. Because of this, good test cases are hard.

As a rough rule:

  • C-like languages take 1 unit of time (100% of ideal performance)
  • Java-like languages take 2 units of time (50% of ideal performance) except in purely-numerical code.
  • Bytecode-interpreted languages take 10 units of time (10% of ideal performance). Most "serious" dynamic languages aim for this, as do static languages that lack a JIT.
  • Tree-walking languages take 100 units of time (1% of ideal performance). There are quite a few of these in the wild; they're still suitable for glue code (often calling native libraries, or waiting for user input and then shelling out).

Again, these numbers are very rough.


URL: https://www.reddit.com/r/ProgrammingLanguages/comments/1096map/what_features_make_languages_amenable_to/j3zgcaw/

Static typing is obvious. Without it you can't do anything.

Control over memory layout is critical. In particular, Java fails to reach C++ efficiency for anything other than trivial code due to all the mandatory indirections and wasted allocation space. C# lacks this deficiency, but you do have to opt in.

If you want consistent (as opposed to amortized - and that only if you're lucky!) performance, Garbage Collection is forbidden.

The language should make it reasonably easy for users to implement and use efficient new data structures. For this reason many C programs are slow - they resort to dumb arrays and linked lists that offer horrible performance when the collection grows. Contrast C++ which, for all its faults, has an amazing STL. Thus we can infer that good generics are mandatory (note that C++-style templates don't need to be implemented fully, but there are times when you want to duplicate and optimize the machine code for specific types - and also times when you don't).

Aliasing rules can help a lot. Be aware that there is an unfortunate lack of optimizers that use aliasing rules other than C's shitty ones. Sometimes making them think in Fortran mode can help.

Being able to specialize for the particular runtime platform can help. This is often seen as limited to JITs, but that's not actually true - runtime AOT or even GCC-style ifunc (among others, albeit with the limitation of not supporting platforms newer than the original compiler) can accomplish this.
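
For illustration, a minimal sketch of the ifunc mechanism as GCC implements it on x86 (names invented): the resolver runs once at load time, and every later call to add() goes directly to whichever implementation it picked:

// extern "C" keeps the symbol names unmangled so the attribute can find them
extern "C" int add_generic(int a, int b) { return a + b; }
extern "C" int add_avx2(int a, int b) { return a + b; } // pretend: newer instructions

extern "C" void *resolve_add(void)
{
    __builtin_cpu_init();
    return __builtin_cpu_supports("avx2") ? (void *)add_avx2
                                          : (void *)add_generic;
}

extern "C" int add(int a, int b) __attribute__((ifunc("resolve_add")));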

Make inlining possible across translation-unit boundaries. But for the love of all that is holy, do NOT interpret this as "compile the whole thing at once and give up on incremental compilation" (looking at you, Rust - your shitty hacks are shitty hacks in this regard).

Devirtualization (true or speculative) is also an important optimization.


(in response for a request for elaboration about incremental compilation)

First, the negative side. Rust is an example of a language that:

  • is a reasonably-designed language
  • has extremely-well-supported development
  • tried compiler-magic incremental compilation, which:
    • was still slow
    • was still buggy (as in, produced incorrect results)

Given that, it is the height of arrogance to think you can succeed with that dead-end approach.


For the positive side ... I don't pretend to have a complete solution. C certainly performs well but lacks generics. C++ begins to have problems with generics/templates but the current attempt at Modules has proven that much of that can be eliminated.

Manually maintaining headers of course is quite silly. Generating headers from a source file is relatively trivial. Note that "headers" need not be textual, though there are still good reasons for them to be.

On the build-system side, this looks like: first, generate headers for the current module (this does not require having headers) as well as any modules it depends on (being sure not to duplicate work if compiling modules in parallel). Then use those headers and do the actual compilation. Note that make is kind of bad at dealing with dynamic dependencies like this; what it supports is recording the dependencies detected this time so that it knows what to re-try next time; implementing -MMD may leak internals in weird ways.

An explicit inline could change the generated "header" from declarations only to definitions as well for so-marked functions. You could also do implicit inline for sufficiently-small functions, at least within a "crate"-equivalent. Obviously across ABI boundaries only an explicit inline indicator should allow such violations. Symbol visibility is a whole nother thing to spend time thinking about.
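
A sketch of what that could look like with textual headers (the source syntax below is entirely invented):

// point.src:
//     export fn dist2(p: Point) -> int { return p.x*p.x + p.y*p.y; }
//     export inline fn origin() -> Point { return Point{0, 0}; }

// point.h, as generated: a declaration only for dist2, but a full
// definition for the explicitly-inline origin
struct Point { int x, y; };
int dist2(Point p);
inline Point origin() { return Point{0, 0}; }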

It's possible that "sufficiently-small" can't be defined statically; it may be useful to measure it instead. For release builds you should perform two full builds so the first one can be sure to update the numbers (speaking of which, automated PGO would be a really useful thing to have). The place the numbers are stored might or might not be committed to VCS (take care in either case). Or you could just ignore this and require explicit inline-style annotations for negative cases as well.

It may be worth thinking of this as "LTO, except for each object file rather than only for each executable/shlib".

Sorry if this got a bit rambly and nonspecific; there's a lot that depends on language design. The only hard requirement is "use static types (at least by overwhelming default)".


URL: https://www.reddit.com/r/ProgrammingLanguages/comments/qwereq/why_statically_typed_languages_are_better_than/hl64dxq/

It's simple: because anything you can do in a dynamically-typed language, you can do in a statically-typed language that supports reflection. The reverse is not true.

There's a reason we speak of "unitype" languages.


URL: https://www.reddit.com/r/ProgrammingLanguages/comments/zlp8gq/how_do_you_separate_the_different_stages_of_your/j095hzr/

Some possible phases. Many of these will appear merged and some might not even exist for a given language design.

| Input | Phase | Output | Notes |
| --- | --- | --- | --- |
| Module name | Module path lookup | Source code text | Depending on language design, what a "module name" looks like and when this is called may be different. |
| Source code text | Lexing | Tokens | Usually incremental and seen as merged with the parser. But it is a bad idea to have feedback from the parser; don't repeat the mistakes of C! |
| Tokens | Parsing | CST or AST | CST is required for pretty-printing, but producing an AST directly means less work for normal compiler stuff, so consider having a flag determine which one gets built; the difference is useless parentheses and comments. Constraining your grammar to only allow comments in particular places (e.g. statement context, struct member context) is a good idea. Pull-parsing is usually the default, but push-parsing may be useful for REPLs so you know when input is complete. |
| AST | Early desugaring | AST | Good macro systems belong here rather than during parsing. If you don't have macros this might not exist at all, instead being merged with the following. |
| AST | Typechecking | (typed) AST | Static typing is always desirable. Also, deterministic explicit destructor calls must be inserted before the next phase. |
| AST | AST lowering | flattened "IR" | May be block-based or label-based (the latter makes optimization harder). SSA is popular if you're doing serious optimizations, but leaving variables mutable is feasible too. Note that some runtimes like WASM really don't like this, but the advantages are such that it's worth the horrible hacks to get a tree back. |
| IR | Most Optimizations | IR | This is the only phase where the valid inputs and outputs are exactly the same. Note that optimizations may introduce new blocks or make other blocks unreachable; you need to have a strategy to reclaim them. |
| IR | IR lowering | machine-specific IR | Usually only exists if generating machine-specific code, and perhaps only if you support multiple targets. |
| machine-specific IR | Machine-specific Optimizations | machine-specific IR | Not very many optimizations go here, but they can be important. Despite the name, some optimizations may be shared between some machines, but they might differ in exactly how or how often they are triggered (think about using x86 lea for multiplication by a small constant that can't be implemented using pure shifts). |
| machine-specific IR | codegen | machine code or bytecode | May be textual, a binary stream, or an in-memory representation (this last is common for language VMs). Bytecode is just machine code for a virtual machine that does not correspond to real hardware. |
| machine code | peephole optimizations | machine code | Very few optimizations are possible here since you can no longer reorder or grow code. You might still be able to shrink code. May be integrated into the codegen phase. |
| textual machine code | assembling | binary machine code | For real machine code, actually done by as, but you should probably invoke it via the gcc driver if you're doing this (though this isn't nearly as critical as it is for ld). |
| multiple files with binary machine code | linking or archiving | executable or library | Usually only called "linking" (done by ld, but since there is a lot of nastiness you really should invoke it via gcc) if outputting an executable or shared library; "archiving" (done by ar) if a static library (but since this is trivial it might not be named at all). Remember that some form of this does still apply to bytecode! (though it often lacks the niceties that machine code supports). If possible, try to hook up function calls and global object references so they don't need to be handled during loading; this is obviously not possible if the target is in a different shared object or the symbol is interposable. |
| executable or library | file loading | executable code in memory | Try to hook up all remaining function calls and global references here, even if (potentially) across shared objects. This specific act is called "relocation" (whether in this phase or the previous), and might require patching machine code or just changing some data tables. |
| executable code in memory | execution | (side-effects) | Do as little work as possible in this phase. This obviously is only possible if you have a proper type system so all the previous phases can exist. |

URL: https://www.reddit.com/r/Compilers/comments/wok89g/resources_to_understand_code_generation_from_ast/ikckom3/

Note that there are several formats potentially involved here:

  • the AST, which is a tree

  • the IR, typically in SSA form, which is a graph. Not strictly required, but practically required if you are doing any optimizations at all.

  • the target-specific low-level IR, which supports nitty-gritty details like registers being assigned for every variable (which probably, unlike SSA, are mutable again) and is likely conceptually similar to the output but still using pointers to classes, rather than a flat array of bytes/text. This is used for a few optimizations; particularly, it is not limited to peephole optimizations. This probably doesn't need to exist if the final output is bytecode for an interpreter.

  • the final output code, which is mostly a flat array, either interpreter bytecode or real machine code (which can be either binary object files or textual assembly; it really doesn't matter). Peephole optimizations can be done here if not done in the third format. I strongly recommend implementing a fully-featured bytecode interpreter before trying to emit code for a real machine, since controlling the code completely makes things much simpler. Then you can add complexity comparable to real machine code incrementally.

    For example, if you already have a bytecode interpreter that operates on many small fragments of bytecode, you could modify it so that all code is position-independent, and it is possible to combine all those fragments into a single large array of bytecode. You could implement relocations so that when you load the bytecode from a file, you know which offsets need fixing up for simpler execution (see the sketch below).

    (I have a well-received post about bytecode VMs, though it mostly only mentions real machines as silly things not to be learned from)
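
A minimal sketch of what such bytecode relocation could look like (the table layout is invented):

#include <cstdint>
#include <vector>

struct Reloc {
    uint32_t site; // offset of a CALL operand within the code array
    uint32_t func; // index of the function it refers to
};

// at load time, rewrite each function-index operand into an absolute offset
void apply_relocs(std::vector<uint8_t> &code,
                  const std::vector<Reloc> &relocs,
                  const std::vector<uint32_t> &func_offsets)
{
    for (const Reloc &r : relocs) {
        uint32_t target = func_offsets[r.func];
        // patch a little-endian 32-bit operand in place
        code[r.site + 0] = target & 0xff;
        code[r.site + 1] = (target >> 8) & 0xff;
        code[r.site + 2] = (target >> 16) & 0xff;
        code[r.site + 3] = (target >> 24) & 0xff;
    }
}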

Note also that emitting debuggable machine code is much harder than simply emitting code.


URL: https://www.reddit.com/r/Compilers/comments/10vy3bm/how_exception_handling_in_implemented_in/j7kuile/

Generally there are 4 approaches:

  • SJLJ-based. Use the sigsetjmp and siglongjmp special library functions to restore registers at a low level. This is slow even when the exception doesn't happen; mostly it was used because the next option wasn't available for some platforms yet.
  • Use metadata to unwind the stack. This may involve native code or special bytecode. This requires special runtime support to indicate the frame layout and track all the destructors. This is often believed to be most efficient if exceptions are rare and does not constrain ABI (at least in terms of signature - the metadata arguably counts as ABI), but note that it still inhibits optimization, perhaps significantly. Even if optimizations are possible, this requires implementing them separately from the usual value-based optimizations that the next option supports.
  • Use a sentinel value and check for it. Sometimes this is just NULL or -1 (which requires an awkward external location to store the error data); other times there is an actual Result class. This is best when the "exception" condition is likely, but is not bad even if it's rare, assuming you can afford to change the ABI. This is probably easiest to reason about, both for humans and compilers, but constrains the ABI slightly.
  • Continuation-based. Mostly a historical artifact of the lisp family, which does not correspond sanely to how hardware actually works.

(note: this led to a mostly-unproductive flamewar with adherents of lisp continuations and green threads. The only useful thing was "CF beats ZF, at least on x86")
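
To make the sentinel/Result option concrete, a sketch with invented names:

struct Result {
    bool ok;
    int value;         // meaningful only when ok
    const char *error; // meaningful only when !ok
};

Result parse_int(const char *s); // defined elsewhere

int caller(const char *s)
{
    Result r = parse_int(s);
    if (!r.ok)     // an ordinary value check; the optimizer can
        return -1; // reason about it like any other branch
    return r.value;
}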


URL: https://www.reddit.com/r/Compilers/comments/ynf2eu/what_is_the_best_way_to_handle_multiple_source/ivbtcyu/

This is closely related to the question of parallel builds. There is a large, multi-dimensional matrix of things involved. Listing each dimension separately:

  • module, library, or executable
  • source, interface, or compiled
  • text form or binary form

That would be 18 things, except that not every combination actually exists (or exists distinctly for a given language). For example, source usually only applies to modules and is in text form. Module interface might be the same as library interface. Executables are probably just a special case of libraries.


The below workflow is slightly opinionated, but there are good reasons for the parts that do not exactly correspond to the way existing languages do it.

You begin with a module source in text form (like a .c file).

The first thing you should do is create a module interface (like a .h file). You might do this in binary form if you don't intend to expose these, and/or might also create a binary form of the module AST if you want (this may help for cross-module inlining). The important thing is that, for this step, you do not have access to the interfaces of any other modules.

Next, once all the interfaces of a module's dependencies are ready, take the module + all the interfaces, and output a compiled version of the module (.s or .o file).

Repeat this for all the modules in a library, then link the compiled modules into a compiled library (.a or .so file, though the former has pitfalls of its own). At the same time, also extract the public parts of the library (probably by looking at the module interface files and filtering based on explicitly exported symbols) and produce a library interface. A library interface will be used instead of a module interface when a module imports something that is not part of its own library. Note that technically, building a library does not actually require its library dependencies to have been built already, but most tooling assumes it has.

The build process ends when all needed libraries (including executables) have been built.

Note also that your testsuite forms a special kind of executable (which again, is a library). This implies that you may want to transparently compile/link things in multiple ways (PIC/PIE/PDC, debug vs release modes, etc.). When possible, make sure that these modes are all ABI-compatible by default (if you're used to Windows, note that it has some very bad examples); if there are ABI incompatibilities, you need to handle them correctly in your package manager (which the compiler then must sufficiently know about).

All of the steps above should be skipped if their inputs haven't changed; additionally, if the output happens not to have changed, you should carefully make sure both that you won't redo the work again next time and that you won't redo further work downstream. This requires keeping 2 timestamps for every file, and a simple dance (involving conditionally updating a file other than the one specified by the current rule) if you're using a Makefile. It is also a reminder to never embed "current timestamp" in the build (you could embed "commit timestamp" or "file timestamp" or "explicit source epoch" if you're careful).


URL: https://www.reddit.com/r/ProgrammingLanguages/comments/yzrhir/what_does_oo_do_well/ix29nb0/

Ignoring the exact definition of OO for a moment, there are two things that it tends to do that are strictly and unarguably superior to common alternatives:

  • at least try to avoid global state, and if you must have it, limit it to a single variable pointing to some object or so.
    • note that "dynamic variables" may be a nicer way of thinking of global (or thread-local, if threads exist) variables. In particular, the save/restore logic is important.
  • don't just shove everything in a bunch of dicts and lists or whatever they're called in your language
    • related, if you find yourself distinguishing between "shallow copy" and "deep copy", you have already done something horribly wrong. Every class should specify the only kind of "copy" that makes sense for it (and be in charge of sharing or copying all its children on an individual basis)

But OO makes these things much easier to manage, which is actually the important thing.

Related, note that languages that go heavy on "immutable objects" usually have very poor lifetime/copy management.


URL: https://www.reddit.com/r/ProgrammingLanguages/comments/jh46wu/how_could_someone_implement_memory_management/g9yexg5/

This entire answer is written under the assumption you mean Java-style bytecode.

C# and .NET are much nicer if you need value classes but I'm not familiar with them since they used to be Windows-only.

Native support for value classes makes some of this easier, but you still have to do the work.


If you merely need "something address-like that allows .put() and .get()", that's as simple as using wrapper classes everywhere you might need to take the address (which should only be locals).

Note that you don't want to use the wrapper within other objects; instead generate an ad-hoc wrapper class for "field bar of class Foo" as needed.

You do not want to do this via reflection; it is even more inefficient.


If you need full C-style pointers, there's one simple rule: every object is a java.nio.Buffer (and yes, the interface is weird, mixing absolute and relative operations in a single class. Seriously, did you learn nothing from iterators?). You can still emit member functions (I think you have to use a wrapper class since they aren't easily subclassable), but cannot directly use member fields.

This tends to generate nonsensical Java code (both in terms of interface, and in terms of optimization), so I strongly suggest providing a method of disabling that for certain classes.

Note that C forbids doing pointer comparisons/subtraction outside of a single allocation.


URL: https://www.reddit.com/r/ProgrammingLanguages/comments/zp0waz/how_to_compiling_and_link_my_standard_library/j0s4gz1/

Simply run 2 passes over each file. The first pass does not require its dependencies to be available, and generates a header so that others can use this file once it's compiled. The second pass is the actual compilation, and requires the headers for all its dependencies.

Note that this approach does mean you can't do type inference between modules. But that's a good limitation to have.


URL: https://www.reddit.com/r/C_Programming/comments/nzi7cj/what_would_be_a_good_minimal_subset_of_c/h1qwygu/

One thing to consider is variadic functions. A language that does not support variadic functions will lead to great user frustration.

Note that your high-level language's conception of variadic functions need not be the same as the low-level language's implementation.

  • C-style. Variadic arguments are passed mostly like regular arguments. (but e.g. beware the eax thing in the x86-64 ABI; see the sketch after this list)
    • Pros: compatibility with random assembly tricks from the 1970s
    • Cons: random access is hard; very finicky; hard to implement if your low-level language doesn't already support it
  • C++-iostream-style. Variadic arguments are passed through separate function calls.
    • Pros: easy type-safety without complicating your compiler
    • Cons: hard to deal with more than one argument at once
  • C++11-style. Variadic arguments create multiple, specialized, copies of the code.
    • Pros: templates
    • Cons: templates
  • Java-style. Variadic arguments passed as an array.
    • Pros: easy to wrangle
    • Cons: not always quite what you want; arguments must share a common base class (Java always has Object and autoboxing of primitives; many functions might use an interface); it's also awkward to distinguish forwarding an argument pack from passing it as one element of a different argument pack.
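
To make the C-style option concrete, the classic va_list walk (nothing checks that caller and callee agree on count or types, and access is strictly in order):

#include <cstdarg>
#include <cstdio>

int sum_ints(int count, ...)
{
    va_list ap;
    va_start(ap, count);
    int total = 0;
    for (int i = 0; i < count; i++)
        total += va_arg(ap, int); // no random access: must walk in order
    va_end(ap);
    return total;
}

int main()
{
    std::printf("%d\n", sum_ints(3, 1, 2, 3)); // prints 6
}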

URL: https://www.reddit.com/r/ProgrammingLanguages/comments/mn4ive/can_a_language_ever_be_faster_than_its_parent/gtycfr4/

If you wrote a frontend that performs inlining then generated Python bytecode, it would be faster than the bytecode generated for normal Python.

Python (as a language) is "too dynamic" for a lot of important optimizations to be possible; by putting minor constraints on your input language, you can get significantly better performance on the same runtime.


URL: https://www.reddit.com/r/ProgrammingLanguages/comments/vp8ql4/module_member_vs_field_access_syntax/ieii7e3/

It should be noted that the reason C++ has to use :: is that . would be ambiguous when calling a function from a base class.

namespace foo
{
    struct Foo
    {
        void frob() {}
    };
}

namespace bar
{
    struct Bar : ::foo::Foo
    {
        void frob() {}
    };
}

namespace g
{
    ::bar::Bar obj;
}

int main()
{
    g::obj.::foo::Foo::frob();
}

While namespaces do matter here, the root of the problem is actually inheritance (and specifically, multiple inheritance, since otherwise you could just write obj.super.super.super.super if you really need to).

An alternate solution would be to support UFCS, and call foo::Foo::frob(&g::obj);. But note that Python moved away from this in favor of super (but that decision is complicated by the way they do multiple inheritance).


URL: https://www.reddit.com/r/ProgrammingLanguages/comments/xql23p/how_can_a_programming_language_interact_with_the/iqalgj4/?context=10000

Even on Linux, it's not a good idea to use syscalls directly.

Many of the syscalls have subtle quirks, and libc takes care of hiding them from you.

Take a look at the setuid mess, for example. If you don't want to reimplement all that (and disable your own version if some other library pulls in libc) ... just link to libc and let it take care of it.

(others like getcwd are easy enough if you know about them, but good performance takes work. But don't assume they're all that easy.)


URL: https://www.reddit.com/r/Compilers/comments/r8rkzd/what_would_you_use_to_write_a_parser_in_2021/hn7uu5s/

I would use Bison, but use the XML output mode and write my own runtime (very easy, and I suggest everyone does this at least once, even if they aren't always going to use it); you can also transform your own preferred BNF-like syntax into Bison if you want. There are other tools out there that are reasonable to use (but not all tools - check the paragraphs below!), but this particular approach has the advantage of working regardless of what language you are using.
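
The runtime really is that small; here is a sketch of the core LR loop (the table layout is invented, but it is roughly what you would build from the XML dump):

#include <cstddef>
#include <utility>
#include <vector>

struct Action { enum Kind { ERROR, SHIFT, REDUCE, ACCEPT } kind; int arg; };

// action[state][token] and go[state][nonterminal] come from the generator;
// rules[i] is {lhs nonterminal, rhs length}
bool lr_parse(const std::vector<std::vector<Action>> &action,
              const std::vector<std::vector<int>> &go,
              const std::vector<std::pair<int, int>> &rules,
              const std::vector<int> &tokens) // must end with the EOF token
{
    std::vector<int> states = {0};
    std::size_t pos = 0;
    for (;;) {
        Action a = action[states.back()][tokens[pos]];
        switch (a.kind) {
        case Action::SHIFT:
            states.push_back(a.arg); // consume one token
            ++pos;
            break;
        case Action::REDUCE: {
            int lhs = rules[a.arg].first, len = rules[a.arg].second;
            states.resize(states.size() - len); // pop the rule's RHS
            states.push_back(go[states.back()][lhs]);
            break; // (a real runtime would also build an AST node here)
        }
        case Action::ACCEPT:
            return true;
        default:
            return false; // syntax error, detected at the earliest possible point
        }
    }
}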

LALR(1) is probably sufficient to write your grammar without any horrible hacks. LR(1) is theoretically more powerful but I haven't found a use case; note that IELR(1) is just a particular algorithm that implements LR(1), not a separate thing. LL(1) means you are obligated to use hacks for something as simple as expressions (not completely unreasonable, but why put up with it when you don't have to? Note that most hand-written recursive-descent parsers implement LL(1) - either out of naivety or because their languages are badly designed and cannot be parsed without additional hacks). Values of k greater than 1 provide no additional benefit (but significantly increased cost) for the LR family, and do not provide significant benefit for the LL family.

LR(0), SLR(1), and SLL(1) are too trivial to do anything useful. I would suggest that you might use them if you do have a trivial use case, except that there aren't many tools that support them (though note that we do speak of LR(0) as part of the implementation of higher-level members of the LR family).

At all costs, avoid parsers that do not detect edge conditions, or do not detect them quickly. This includes any tool that uses GLR, LL(*), packrat, PEG, parser combinators, memoization ... If you use them, your tool WILL hide bugs from you. (note that Bison can use GLR, but only if you explicitly request it, and I have never seen it done. Thus, Bison remains a sane tool.)


URL: https://www.reddit.com/r/Compilers/comments/r8rkzd/what_would_you_use_to_write_a_parser_in_2021/hn7xm9s/?context=10000

It's worth noting that all of Lemon's complaints about Yacc only apply to historical Yacc, which Bison emulates by default.

Bison can be configured to avoid every single one of those problems. I use reentrant versions (both with and without push parsing) in my demo project. I admit I didn't bother with named references or cleaning up after errors.

(none of this correction actually makes Lemon bad though)


URL: https://www.reddit.com/r/ProgrammingLanguages/comments/sozmap/has_anyone_done_a_naive_benchmark_of_parsing/hwfb8v4/

If you're doing serious optimization, then parsing isn't the bottleneck.

But if you're doing the equivalent of -O0 or even potentially -O1 - especially if you're using C-like headers where there are a lot of declarations that you turn out not to care about - then parsing very much is a high percentage of the total time.

(note that -O1 is often faster than -O0 on large inputs due to doing less I/O)


URL: https://www.reddit.com/r/ProgrammingLanguages/comments/sozmap/has_anyone_done_a_naive_benchmark_of_parsing/hwe5ohj/

This is really going to depend on the exact details of the language and input.

First, realize that "hand-made recursive descent" is not really a parsing algorithm. Usually it's an implementation of LL(1), LL(k), or LL(*), sometimes with particular behavior on conflict.

Second, look at your lexer. Most reasonable lexing algorithms are O(n), but if you are careless it is very easy to hit O(m*n) by rescanning the input, where n is the length of the input in "characters" and m is the length of the longest token. (if you are using flex for your lexer, there are 2 main cases where this happens: if you use / lookahead, and if your rule can match \0 - note that ignoring the "match as much input as possible" rule-of-thumb can actually help if you hit this; if you are not using flex, there are many more ways to fail (particularly common for string escapes)).

Finally, look at the parser itself. Here, n is the length of the input in tokens, k is a small (but not ignorable) constant, and r is the number of rules.

  • LL(1) and LR(1) always complete in O(n) time and space
  • LL(k) and LR(k) always complete in O(k*n) time (but O(n) space)
  • LL(*) has a worst case of O(n²) time (but O(n) space)
  • PEG packrat parsing has a worst case of O(r*n) time, and also O(r*n) space. They also do not detect errors in the grammar.
  • Parser combinators typically require O(r^n) time I think. Certainly it's exponential, unless you are able to make a packrat-style tradeoff. Like PEG, they do not detect errors in the grammar.
    • theoretically you could write something that looks like a parser combinator that just sits in front of LR(1) and then gets the advantages of both. For some reason, I have never seen anybody advertise this as "parser combinator" - probably to avoid the stigma.
    • I have encountered parser combinators that don't even manage to compile.

Note that the worst case is generally not avoidable, since for most grammars it is possible to construct an input that hits it. Though for the ones that involve r it should actually be defined as "number of rules at the relevant point in the grammar" - but that is hard to define if you aren't using the LL/LR family.


How do you think a parser combinator works?

If every nonterminal is parsed using the logic "first, try alternate1, then alternate2, then alternate3, etc. Stop only when success happens" that's exponential time, where the base is the number of rules tried (which is usually roughly proportional to the number of rules, since each alternate then uses the same logic), and the exponent is the number of symbols this whole thing has to be tried for (which is a significant fraction of the total number of symbols in the input).

A few parser combinators will try memoization (but this is far from ubiquitous), but that's still rectangular (O(r*n)) time, which still sucks. And you still have no detection of grammar errors.


URL: https://www.reddit.com/r/ProgrammingLanguages/comments/sozmap/has_anyone_done_a_naive_benchmark_of_parsing/hwk5pp8/

Bison generates ugly code. But you don't have to use it: the hard part is generating the tables, and bison --xml will dump those so you can use your own runtime.


URL: https://www.reddit.com/r/ProgrammingLanguages/comments/t4c8ms/what_parsing_techniques_do_you_use_to_support_a/hyzmyv3/?context=10000

I agree with this post, and have a few things to add:

  • Make sure your lexer is strictly regular and your parser is strictly LR(1). No exceptions, no conflicts. (if available, you may use precedence rules to reduce the total number of rules)
    • you may be interested in using bison --xml to generate the machine, since it is then very easy to write your own runtime. Most other parsers are significantly lacking in features.
    • there are interesting things you can do with "first parse a token-tree of matching parentheses, then parse that" but there is a lack of tooling to support this.
  • Make sure there are no tokens that can cross a newline (possible exception for backslash-newline, since you can peek backwards). This allows you to start parsing from anywhere within a file.
    • My idea for multiline strings is to repeat the start character every line. Besides making parsing easier, this makes it sensible to use meaningful indentation inside the string.

        `line1:
        `    indented line
        "no trailing newline if you end with a normal string"
      
    • note that if you have Python-style "indent/dedent/newline ignored with parens" you will get errors if you try to start parsing within parens, but you will be able to recover as soon as you hit the last close paren (though you generally won't know that you've hit it).

      • You can make recovery better by requiring that lines inside parentheses still have at least as much indentation as the part of the "line" outside the parentheses (this can be enforced as part of the same between-lexing-and-parsing stage that allows you to handle indentation at all. You need a similar stage to handle keywords, since they should be lexed as mere identifiers for performance)
      • alternatively, you might be able to quickly look for keywords that aren't possible inside expressions
  • Strongly consider making comments part of the grammar.
    • This means they cannot occur everywhere, but only where a statement or top-level item can appear.
    • You could add additional places, but that means you end up with weird rules for where they are allowed vs forbidden.
  • Be aware that it is entirely possible (and reasonable) to generate more than one parser entry point. Write your grammar once, then feed it to your parser-generator multiple times with a different start symbol (you may need to suppress warnings about unreachable rules, or you could tree-walk it yourself to avoid the warning in the first place). Typical start symbols might include "translation-unit", "statement", "expression", and variants thereof.

FYI, triple backticks only work on new-and-broken reddit, not old-but-usable reddit.

What is your motivation for this? I'm not opposed to making comments part of the grammar, but I think I would hate to use a language with this restriction. It's common in many languages to add comments in e.g. initializers

Good catch. I had been thinking "we don't need commas in expressions since those should be limited to a single line", but initializers are an important exception.

You could do something like:

initializer-list: '{' initializers? initializer-expr? '}'
initializers: initializers? initializer
initializer: initializer-expr ',' | COMMENT
initializer-expr: initializer-list | expr

Or even:

initializer-list: '{' comments? initializers? initializer-expr? '}'
initializers: initializers? initializer-expr comma
initializer-expr: initializer-list | expr
comma: ',' comments?

Or perhaps:

initializer-list: '{' initializers? documented-initializer-expr? comments? '}'
initializers: initializers? documented-initializer-expr comma
documented-initializer-expr: comments? initializer-expr
initializer-expr: initializer-list | expr

Notes:

  • which of the above is best depends on what you think comments are doing and how you expect code to consume the initializer-list.
  • the above grammars have not been tested, but I know the ideas are valid
  • the initializer-expr? is to make trailing commas optional
  • I am very much opposed to end-of-line comments; comments should always precede the item they are documenting

"After a comma or leading brace" is indeed a simple enough rule to remember. But some of them do leave an edge case still: if the last expr is not followed by a comma, you cannot put a comment before the closing brace.

The reason I am willing to go through these contortions is that it is a HUGE advantage if comments can be part of your AST. Not just for extracting annotations from the comment, but also for the purpose of refactoring.


URL: https://www.reddit.com/r/Compilers/comments/13k3xxg/good_rust_parser/jkkb9vg/

One possibility would be to switch to a token-tree-first method of parsing, like how Rust macros work. You still get fatal errors if the parens don't match, but otherwise you just replace the particular subtree with an Error AST and continue with the next subtree.
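
A minimal sketch of building such a token tree (representation invented; one character per token for brevity):

#include <cstddef>
#include <stdexcept>
#include <vector>

struct TokenTree {
    bool is_group = false;       // bracketed group vs leaf token
    char tok = 0;                // leaf payload, or the opening bracket for groups
    std::vector<TokenTree> kids; // children, for groups
};

static char closer(char c) { return c == '(' ? ')' : c == '[' ? ']' : '}'; }

TokenTree build(const std::vector<char> &toks, std::size_t &i, char until = 0)
{
    TokenTree group;
    group.is_group = true;
    while (i < toks.size()) {
        char c = toks[i++];
        if (c == until)
            return group; // this group is closed
        if (c == '(' || c == '[' || c == '{') {
            TokenTree sub = build(toks, i, closer(c));
            sub.tok = c;
            group.kids.push_back(sub);
        } else {
            group.kids.push_back(TokenTree{false, c, {}});
        }
    }
    if (until != 0)
        throw std::runtime_error("unmatched bracket"); // the only fatal error
    return group;
}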


URL: https://www.reddit.com/r/ProgrammingLanguages/comments/si0fvd/how_to_parse_indentationbased_syntax_with_parser/hv6reh6/

Indentation gets turned into ordinary tokens (often called INDENT and DEDENT) during the lexing phase, so it doesn't matter what kind of parser you are using. If you want indentation to be disabled inside of parentheses (etc.) this will require the use of "lexer modes" / "start conditions", but this is not difficult either for generated lexers or hand-written ones.
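
A sketch of the lexer side (hand-written style; names invented):

#include <string>
#include <vector>

enum Tok { INDENT, DEDENT };

// emit INDENT/DEDENT for one non-blank line, given the stack of open
// indentation levels; `indents` starts life as {0}
std::vector<Tok> indent_tokens(const std::string &line,
                               std::vector<int> &indents)
{
    std::vector<Tok> out;
    int col = 0;
    while (col < (int)line.size() && line[col] == ' ')
        ++col;
    if (col > indents.back()) {        // deeper: exactly one INDENT
        indents.push_back(col);
        out.push_back(INDENT);
    } else {
        while (col < indents.back()) { // shallower: one DEDENT per closed level
            indents.pop_back();
            out.push_back(DEDENT);
        }
        // a real lexer should reject col != indents.back() here
    }
    return out;
}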

That said, parser combinators are a bad idea for parsing in general: you get neither speed nor correctness. I'm not sure why you think they are "powerful", I would simply call them "sugary".


URL: https://www.reddit.com/r/Python/comments/y5bt6k/understanding_hashing_and_equality_in_python_with/isk56sd/

In a proper language, the following constraints hold:

  • the state used for __hash__ must be a subset of the state used for __eq__. This is useful when part is not hashable or is expensive to compute.
  • the state used for __lt__ and friends must be a subset of the state used for __eq__. This is useful when unordered states exist (e.g. floats, though they are evilly so; it can also come up with case-insensitive collation, but stability often implies a last-resort comparison in that case. More generally it simply means "there is more state").
  • the state used for __repr__ has no constraints with any of the states used for other functions. (mentioned here because the implementation often looks similar)
  • edit: note also that there are also no constraints between the subset of state used for __hash__ and the subset of state used for __lt__.

Unfortunately, Python does not fully support the second; it requires that the state used for __lt__ be identical to the state used for __eq__. This can easily be demonstrated by creating a pair of tuples with identical first elements, and observing that tuple.__lt__ internally calls the element's __lt__ and __eq__, rather than __lt__ and __gt__ (or equivalently, __lt__ reversed).

functools.total_ordering is also broken but since that's a library it's easier to fix/avoid.


Suppose we have a type that's basically an integer with extra data. For simplicity let's represent the extra data as a single letter, so the whole object looks like 1A. We would like to define < (and hash) considering only the integer part, but == including the extra data. (note also that the "extra data" is allowed to be mutable, since it is not used for comparison/hash)

There are 3 conceptually primitive comparison operations, returning booleans: <, >, and ==. There are also (at least) 3 conceptually composite comparison operations returning booleans. != is always the opposite of == (and need not be implemented separately). <= and >= are defined as < or == and > or == respectively, but be aware that there are many cases where you actually want not > and not < respectively instead, and additionally their implementation is faster if they are defined directly.

We can describe the above using 4 (not 3) logical states of comparison between two objects:

  • LESS implies that < returns True, and > and == return False.
  • GREATER implies that > returns True, and < and == return False.
  • EQUAL implies that == returns True, and < and > return False
  • UNORDERED implies that all 3 primitive comparisons return False.
    • note that float('nan') is evil because it is unordered against all floats (including itself, though python's is or == rule mitigates that part somewhat), leading to inconsistency. By contrast, it is perfectly sensible for an object to be unordered against a closed set of similar objects, as long as it is ordered against everything else.
  • note that it is inconsistent for more than one of the 3 primitives to return True at the same time.

Then we would have:

  • 2A cmp 3B is LESS
  • 2A cmp 2A is EQUAL
  • 2A cmp 2B is UNORDERED
  • 2A cmp 1B is GREATER

And we should have:

  • (2A, ANY) cmp (3B, ANY) is LESS
  • (2A, 2X) cmp (2A, 3Y) is LESS
  • (2A, 2X) cmp (2A, 2X) is EQUAL
  • (2A, 2X) cmp (2A, 2Y) is UNORDERED
  • (2A, 2X) cmp (2A, 1Y) is GREATER
  • (2A, 2X) cmp (2B, 3Y) is LESS (but python returns UNORDERED)
  • (2A, 2X) cmp (2B, 2X) is UNORDERED
  • (2A, 2X) cmp (2B, 2Y) is UNORDERED
  • (2A, 2X) cmp (2B, 1Y) is GREATER (but python returns UNORDERED)
  • (2A, ANY) cmp (1B, ANY) is GREATER

As you can see, by using the elements' definition of ==, Python's tuple < and tuple > are broken, because it short-circuits when a leading component is UNORDERED.

Maybe I should've done a case-insensitive-comparison example to make it obvious why it's wrong to do this. That is, the above bugs mean that Python will fail to sort data like:

foo BAZ
FOO bar

import sys


# classify the relationship between a and b using only the primitive comparisons
def get_order(a, b):
    less = a < b
    greater = a > b
    equal = a == b
    if any([not isinstance(t, bool) for t in (less, greater, equal)]):
        raise TypeError('this is not for vector comparisons')
    if less + greater + equal > 1:
        raise TypeError(f'inconsistent ordering for: {a} cmp {b}')
    if less: return 'LESS'
    if greater: return 'GREATER'
    if equal: return 'EQUAL'
    return 'UNORDERED'

# ordered (and hashed) by all_state alone, but equality also considers eq_state
class PartialState:
    def __init__(self, all_state, eq_state):
        self.all_state = all_state
        self.eq_state = eq_state
    def __repr__(self):
        return f'{self.all_state}{self.eq_state}'
    def __eq__(self, other):
        if not isinstance(other, PartialState): return NotImplemented
        return self.all_state == other.all_state and self.eq_state == other.eq_state
    def __lt__(self, other):
        if not isinstance(other, PartialState): return NotImplemented
        return self.all_state < other.all_state
    def __le__(self, other):
        raise TypeError('not using this')
    def __gt__(self, other):
        if not isinstance(other, PartialState): return NotImplemented
        return self.all_state > other.all_state
    def __ge__(self, other):
        raise TypeError('not using this')

# stands in for ANY: fails loudly if any comparison is actually attempted
class FailState:
    def __repr__(self):
        return 'ANY'
    def __eq__(self, other):
        raise TypeError('not using this')
    def __lt__(self, other):
        raise TypeError('not using this')
    def __le__(self, other):
        raise TypeError('not using this')
    def __gt__(self, other):
        raise TypeError('not using this')
    def __ge__(self, other):
        raise TypeError('not using this')

_1B = PartialState(1, 'B')
_1Y = PartialState(1, 'Y')
_2A = PartialState(2, 'A')
_2B = PartialState(2, 'B')
_2X = PartialState(2, 'X')
_2Y = PartialState(2, 'Y')
_3B = PartialState(3, 'B')
_3Y = PartialState(3, 'Y')
ANY = FailState()

okay = True

def check_cmp(a, b, expected):
    global okay
    actual = get_order(a, b)
    if actual != expected:
        okay = False
        print(f'{a} cmp {b} should be {expected} but is actually {actual}')
    else:
        print(f'{a} cmp {b} is {actual}')

check_cmp(_2A, _3B, 'LESS')
check_cmp(_2A, _2A, 'EQUAL')
check_cmp(_2A, _2B, 'UNORDERED')
check_cmp(_2A, _1B, 'GREATER')
check_cmp((_2A, ANY), (_3B, ANY), 'LESS')
check_cmp((_2A, _2X), (_2A, _3Y), 'LESS')
check_cmp((_2A, _2X), (_2A, _2X), 'EQUAL')
check_cmp((_2A, _2X), (_2A, _2Y), 'UNORDERED')
check_cmp((_2A, _2X), (_2A, _1Y), 'GREATER')
check_cmp((_2A, _2X), (_2B, _3Y), 'LESS')
check_cmp((_2A, _2X), (_2B, _2X), 'UNORDERED')
check_cmp((_2A, _2X), (_2B, _2Y), 'UNORDERED')
check_cmp((_2A, _2X), (_2B, _1Y), 'GREATER')
check_cmp((_2A, ANY), (_1B, ANY), 'GREATER')

sys.exit(not okay)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment