@ISSOtm
Last active October 31, 2018 19:47
Small writeup about the two most difficult bugs I've ever had to fix.

My Two Bugs

Every programmer has that one bug. The bug that won't go away, no matter what you try; the bug whose cause is so obscure that it takes forever to fix. Even though I am a very young coder, I've encountered two of these bugs while programming for the Game Boy.

Coincidentally, both of them were caused by the same thing, even though they were completely unrelated. Here are their stories.

Overshooting It

I originally started developing Aevilia as a simple RPG for the Game Boy Color, taking inspiration from the Pokémon games' code. That is, use one loop per thing you need to do, and jump between them. This eventually proved to be a disaster, because a lot of code was duplicated, and transitions between loops didn't always go so well.

But before the codebase ended up on the chopping block (in favor of a yet to be published rewrite), I tried to think of a way to do two things at once. See, there was a function that performed a screen fade-out, which was used between maps. A problem arose when we wanted to make the player keep walking during the fade-out: since the fade-out function was its own loop, there was no way to make the player walk.

So I designed a solution: use the Game Boy's interrupt system to provide a sort of "second thread". Basically, I set up interrupts in such a way that at the very beginning of a frame, a function (passed through a function pointer in memory, if you want to know) would be run. That function could, for example, move the player's sprite. That would do the trick! And it did.
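To make the idea concrete, here is a minimal Python sketch of a per-frame hook called through a function pointer. It is not Aevilia's code (which was Game Boy assembly); the names `FrameHook`, `on_frame_start`, and `fade_out` are hypothetical, and the interrupt is modeled as a plain call at the start of each frame.

```python
from typing import Callable, Optional


class Player:
    def __init__(self, x: int) -> None:
        self.x = x


class FrameHook:
    """Holds the routine to run at the start of every frame (hypothetical name)."""

    def __init__(self) -> None:
        self.callback: Optional[Callable[[], None]] = None

    def on_frame_start(self) -> None:
        # Stand-in for the interrupt handler: call through the pointer if one is set.
        if self.callback is not None:
            self.callback()


def fade_out(hook: FrameHook, frames: int) -> int:
    """A blocking fade loop. It knows nothing about the player."""
    brightness = frames
    for _ in range(frames):
        hook.on_frame_start()  # on hardware, the interrupt would trigger this
        brightness -= 1
    return brightness


def move_player_right(player: Player) -> None:
    player.x += 1


player = Player(x=10)
hook = FrameHook()
hook.callback = lambda: move_player_right(player)
final_brightness = fade_out(hook, frames=8)
```

The fade loop stays its own loop, yet the player still advances one pixel per frame because the hook runs "behind its back".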

So, we had the player walking during fade-out; but not during fade-in! So, I wrote similar code that moved the player towards their intended position during fade-in. It appeared to work right, until some time later, when I noticed that the player would overshoot some specific target destinations by one pixel. It only happened with one of the game's loading zones, even though the code was exactly the same for all loading zones. So, what?

Even after a tentative patch which, spoiler, didn't work, I still had no idea what got the bug to trigger, especially since it only triggered every so often. Three to four weeks of debugging didn't yield anything conclusive, until somehow I figured out that it was a sync issue. For a reason I still haven't determined, sometimes the "second thread" would run one time too many, causing the bug; the most solid lead I've had was the music player. Eventually the bug got fixed even though I'm not sure why, and later the codebase was abandoned due to the "multiple loops" pattern becoming unmaintainable.
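The symptom itself is easy to reproduce in a toy model. The Python sketch below assumes the player moves one pixel per run of the "second thread" and that movement is in the positive direction; all names are hypothetical. The clamped variant is only an illustration of a defensive alternative, not the fix that ended up in the game (which, as said above, I never pinned down).

```python
def step_unconditional(x: int, speed: int = 1) -> int:
    # Moves every time it is called, trusting the caller to stop in time.
    return x + speed


def step_clamped(x: int, target: int, speed: int = 1) -> int:
    # Never moves past the target, no matter how often it is called.
    return min(x + speed, target)


def walk(start: int, target: int, runs: int, clamped: bool) -> int:
    x = start
    for _ in range(runs):
        x = step_clamped(x, target) if clamped else step_unconditional(x)
    return x


expected_runs = 16                            # frames needed to walk from 0 to 16
on_time = walk(0, 16, expected_runs, clamped=False)
one_extra = walk(0, 16, expected_runs + 1, clamped=False)
one_extra_clamped = walk(0, 16, expected_runs + 1, clamped=True)
```

A single extra run is enough to produce the one-pixel overshoot when the step is unconditional.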

When We Mean Not Supported...

This bug is a more fun one, and it relates to the system described above. The way the system worked was this: after rendering completed, the VBlank interrupt fired. Its handler took care of important operations, then modified the STAT hardware register to (somewhat) register another interrupt at the beginning of the next rendering. When that interrupt fired, its handler modified the STAT register again to (again, I'm simplifying) de-register the interrupt. This will be important later.
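As a rough Python model of that arm/disarm cycle (the real handlers were assembly): the choice of STAT bit 5 below is an assumption, since the text does not say which interrupt source was used, and `Lcd`, `vblank_handler`, `stat_handler`, and `run_frame` are hypothetical names.

```python
# Assumption: the hook was armed through STAT bit 5 (the mode 2 / OAM source).
STAT_SOURCE_BIT = 1 << 5


class Lcd:
    def __init__(self) -> None:
        self.stat = 0   # models the STAT register's interrupt-enable bits
        self.log = []   # records which handlers ran, in order


def vblank_handler(lcd: Lcd) -> None:
    lcd.log.append("vblank")
    lcd.stat |= STAT_SOURCE_BIT            # "register" the start-of-frame interrupt


def stat_handler(lcd: Lcd) -> None:
    lcd.log.append("frame start hook")
    lcd.stat &= ~STAT_SOURCE_BIT & 0xFF    # "de-register" it again


def run_frame(lcd: Lcd) -> None:
    vblank_handler(lcd)                    # rendering finished
    if lcd.stat & STAT_SOURCE_BIT:         # next rendering begins
        stat_handler(lcd)


lcd = Lcd()
run_frame(lcd)
run_frame(lcd)
```

Both handlers write to STAT once per frame, which is the detail that matters later.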

Aevilia being a game designed exclusively for the Game Boy Color, a small screen was designed to make players trying to run it on a Game Boy aware that it wouldn't work. It was fairly barebones (still with a nice little animation), so it was completed very early in development, and being self-contained, I didn't have to look at it again. Until I got a bug report.

I had told someone about this screen and its small animation; they wanted to try it out, and came back complaining that the screen didn't exist. Trying it myself confirmed the claim, which perplexed me, as I hadn't touched it, and again, it was self-contained. What?

First step was figuring out where it all went wrong; I found that the code softlocked pretty early on, and not even inside a loop. Therefore, the code couldn't be responsible for the bug! Then what? It had to be the interrupts.

Second step was figuring out where in the interrupt code the softlock happened. The answer was, nowhere! The interrupts were running fine. The VBlank handler was properly exiting, the other handler was also properly exiting. However, the other handler (driven by the STAT register, if you remember) was rapid-firing, in a way that didn't make sense. Eventually, the problem was traced back to the interrupt being requested while it was being processed.

The CPU handles interrupts in the following way: each interrupt has a corresponding bit in the IF hardware register; right before the CPU begins executing an instruction, it checks if the IF register is non-zero; if so, it resets the lowest bit set, and processes the corresponding interrupt. During interrupt processing, a flag within the CPU is set so that IF isn't checked until the interrupt processing has completed. In my case, the interrupt was requested while it was being processed, so right after processing finished, the interrupt was processed again, without executing a single instruction of the "main" code. Thus, softlock. Also, since the VBlank interrupt has higher priority than the STAT-driven interrupt, it was still processed.
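The dispatch rule described above can be sketched in a few lines of Python. This follows the simplified description in the text: real hardware also masks IF with the IE register and a master enable flag, which are left out here. The bit positions (VBlank is bit 0, STAT is bit 1) match the hardware; the `Cpu` class and handler names are hypothetical.

```python
VBLANK = 1 << 0   # IF bit 0
STAT = 1 << 1     # IF bit 1


class Cpu:
    def __init__(self) -> None:
        self.if_reg = 0        # pending interrupt requests
        self.main_steps = 0    # "main" instructions executed
        self.handled = []      # interrupts serviced, in order

    def lowest_pending(self) -> int:
        # x & -x isolates the lowest set bit, i.e. the highest-priority request.
        return self.if_reg & -self.if_reg

    def step(self, handlers) -> None:
        pending = self.lowest_pending()
        if pending:
            self.if_reg &= ~pending          # acknowledge the request
            self.handled.append(pending)
            handlers[pending](self)          # run the handler to completion
        else:
            self.main_steps += 1             # only now does main code advance


def well_behaved(cpu: Cpu) -> None:
    pass


def rerequests_itself(cpu: Cpu) -> None:
    cpu.if_reg |= STAT                       # requested again while being processed


healthy = Cpu()
healthy.if_reg = STAT
for _ in range(10):
    healthy.step({STAT: well_behaved})

stuck = Cpu()
stuck.if_reg = STAT
for _ in range(10):
    stuck.step({STAT: rerequests_itself})
```

In the `stuck` case every step services the same interrupt again and `main_steps` never moves, which is the softlock in miniature.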

But then, why did the interrupt end up being requested? I was coding directly in raw assembly, so I could clearly see that none of my code even remotely touched the IF register. The Game Boy being the '90s piece of hardware that it is, I started suspecting that my code was innocent, and that I had found a hardware bug. The documentation I had mentioned a bunch of these bugs, but not this one. So I had to trace where the interrupt request came from, and the blame landed on the instructions that registered and de-registered the interrupt by writing to STAT. Definitely a hardware bug. Apparently, writing to STAT caused the interrupt to be immediately requested, for some reason, but only on the monochrome Game Boy and not on the Game Boy Color. Simply resetting the corresponding bit in IF right after the write did the trick, and the bug disappeared.
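Here is a small Python model of the quirk and the workaround. It is a simplification: the spurious request is modeled as happening on every STAT write on monochrome hardware, and `Hardware`, `set_stat_buggy`, and `set_stat_fixed` are hypothetical names, not routines from the game.

```python
STAT_IRQ = 1 << 1   # the LCD STAT bit in the IF register


class Hardware:
    def __init__(self, monochrome: bool) -> None:
        self.monochrome = monochrome
        self.stat = 0
        self.if_reg = 0

    def write_stat(self, value: int) -> None:
        self.stat = value
        if self.monochrome:
            # Modeled quirk: the write itself requests the STAT interrupt.
            self.if_reg |= STAT_IRQ


def set_stat_buggy(hw: Hardware, value: int) -> None:
    hw.write_stat(value)


def set_stat_fixed(hw: Hardware, value: int) -> None:
    hw.write_stat(value)
    hw.if_reg &= ~STAT_IRQ   # workaround: drop the spurious request right away


dmg_buggy = Hardware(monochrome=True)
set_stat_buggy(dmg_buggy, 0x20)

dmg_fixed = Hardware(monochrome=True)
set_stat_fixed(dmg_fixed, 0x20)

cgb = Hardware(monochrome=False)
set_stat_buggy(cgb, 0x20)
```

With the plain write, the monochrome model is left with a pending STAT request that nobody asked for; with the extra clear, it behaves like the Color model.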

I reached out to other developers, including the developer of the emulator I was using for debugging, since I was kind of perplexed by the emulator actually emulating that behavior. He explained that this was in fact a known bug (hardly documented, too), and that he wasn't sure of the cause (which is still not completely understood; the most likely theory is a bus conflict that causes all bits in the STAT register to be set to 1 temporarily, which is enough to confuse the interrupt system and request the interrupt).

A fun part is that, even though the Game Boy Color is very backwards-compatible, going so far as emulating some monochrome bugs that got ironed out in the Color's CPU, this bug is not present in the Color's backwards-compatibility mode, and this breaks at least two games that (accidentally) rely on this bug, including Legend of Zerd. And no, that's not me mistyping "Zelda". In the end, I didn't discover anything, and this bug had owned another game. At least for a while.
