
@yupferris
Last active October 6, 2022 07:52
16 years later, I finally have an educated guess as to why Overhead was so slow on some machines.

In 2006, a (then-14-year-old) me wrote Overhead, a 256-byte intro (which was, ehm, "hEavilY InspiREd" by lander, among other intros). It was, for me, an early and highly rewarding experiment in x86 assembly, and huge thanks are owed to Baze/3SC (author of lander), not just for the inspiration, but also for including source with their intros (a common practice at the time, less common now, unfortunately) and for helpful responses to my emails asking how some basic things worked (e.g. VGA setup and the FPU stack).

In the pouët comments, many people reported that it ran quite a bit slower than intended. It was never blazingly fast (I recall 20-30 FPS on my box at the time), but some people were reporting e.g. 1-2 FPS, which was pretty awful.

In a 2007 comment, Pirx mentioned that writing memory at the beginning of the program segment was likely what was causing it to be so slow on many machines, and suggested a fix. At the time, I had no idea what this meant; I only knew that the suggested fix didn't make things slower on my box, and that it seemed like a good idea to include it in future productions (which I appear to have done in Fabricate a few years later, but not DITS, which was released in 2013 but written in 2006, before I read this comment).

Now, years later, I've been looking into some basic DOS programming (for nostalgia) and I finally have enough context to be able to understand what was happening here.

For x86 FPU code (x87), there are no instructions to load from (or store to) the integer registers directly, so moves between these two register sets must happen via memory. From other people's intro sources, I saw that it was common to use SI (which actually refers to DS:SI due to segmented memory, a concept I was aware of at the time but wouldn't understand the details of for some time) to refer to a location set aside for temporaries for this purpose. A simple example of this is in OVERHEAD.ASM, which loads ax onto the top of the FPU stack via a memory location referred to by DS:SI:

mov [si],ax
fild word [si]
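The opposite direction is symmetric. As a hedged sketch (assuming DS:SI points at a safe 2-byte scratch slot; this is not code from Overhead), a full integer-to-FPU-and-back round trip looks like:

```nasm
mov [si],ax     ; spill the integer register to scratch memory
fild word [si]  ; load it onto the x87 stack as st0
; ... x87 math on st0 ...
fistp word [si] ; round st0 to a 16-bit integer, store it, and pop the stack
mov ax,[si]     ; reload the result into the integer register
```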

At the time, I had assumed that on program entry, DS:SI referred to "somewhere safe", just like SS:SP, so Overhead and DITS never actually initialize SI. This worked on my machine ™️, which was good enough for me then, and I didn't think anything of it. But where does DS:SI actually point?

COM files are loaded into a(n arbitrary) free segment in conventional memory. The PSP is set up at offset 0x0000 in that segment, and the COM file contents are loaded at offset 0x0100. CS, DS, SS, and ES are all set to this segment ("tiny model"), and execution starts at the base of the loaded COM file data (again, 0x0100).
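In NASM-style source, this layout is exactly what the familiar origin directive encodes (a hypothetical minimal skeleton, not from Overhead):

```nasm
org 0x100       ; COM contents load at offset 0x0100; the PSP occupies 0x0000-0x00FF
start:          ; execution begins here, at CS:0x0100
    ret         ; DOS pushes a zero word on the stack, so ret jumps to PSP
                ; offset 0, which holds "int 0x20", terminating the program
```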

According to the sizecoding wiki, SI is most commonly set to 0x0100 on startup (a more complete list for different environments is available there as well).

Thus, on entry, DS:SI could potentially refer to any offset in the program segment, but is highly likely to refer to offset 0x0100 (and certainly did wherever I ran it at the time), which is the entry point of the program. This means that several temporary locations read from and written to by Overhead during its FPU calculations are actually in the program data itself, and this is what Pirx meant all those years ago!

OK, so we know DS:SI initially points to program memory, and assuming we don't overwrite anything valuable (looks like I got lucky here!), this isn't problematic in terms of correctness. But why does it affect performance? The key is something a bit more subtle: x86 guarantees instruction/data cache coherency! This means that writes to locations near CS:IP can invalidate instructions in the cache (and/or stall while the memory subsystem figures out whether a cache line actually needs to be invalidated), which is likely the culprit of the performance degradation on some machines.

Pirx's recommendation was to set SI to e.g. 0xfa00, which would have pointed to a location near the end of the program segment, but with plenty of room left for the stack. Indeed, this should have fixed the issue (and would have been much safer in general!).
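In sketch form (hypothetical code, following Pirx's suggestion), the fix is a single instruction at the entry point:

```nasm
start:
    mov si, 0xfa00  ; scratch space far above the COM image, below the stack
    ; ...
    mov [si], ax    ; temporaries now land well away from any executing code
    fild word [si]
```

That mov costs three bytes, which is not nothing in a 256-byte intro; presumably that's part of why it was tempting to skip the initialization in the first place.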

Revisiting lander's source, SI is actually set to something sensible (that is, well beyond the COM data; precisely, it (re)uses the value 0x03c8 as the offset, which is also the address of the VGA DAC write index port). I haven't taken another look at other intros from that period, but I'm sure it was generally done correctly, and I just hadn't noticed at the time.

In fact, thinking back now, I do seem to have vague memories (which are quite foggy, so may not be real) of some test programs I wrote breaking if they were too small, which is really funny and totally understandable in hindsight.

One last thing: it's still not actually clear to me how close a write has to be to currently-executing code to affect performance like this; I expect it's hardware-dependent and probably difficult to dig up. Certainly one must be able to write within the same segment, as otherwise all tiny model (COM) programs would have abysmal performance. Perhaps it's within the same 256 or 16 bytes? If you happen to have more information on this, let me know, but for now I'm comfortable enough with just trying to separate code and data by 256*n bytes for some (small) integer n.
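That rule of thumb can be encoded directly in source (a hypothetical NASM sketch; the label names are mine): define the scratch offset 256 bytes past the end of the code, without emitting any padding bytes into the binary:

```nasm
start:
    mov si, scratch         ; point temporaries at the reserved area
    ; ... rest of the intro ...
code_end:
scratch equ code_end + 256  ; 256 bytes past the last instruction; equ emits nothing
```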
