@SonoSooS
Last active April 18, 2024 03:45
Game Boy CPU internals

This document is intended to document certain inner workings of the CPU.

There have been efforts by emu-russia and gekkio to have the CPU decapped, and so we have the decode ROM accessible to us. This means that we know exactly what each opcode does (besides some nuanced behavior related to HALT, STOP, and some state management related to interrupts and power saving, which is hard to untangle).

Terminology

  • IDU - Increment/Decrement Unit, responsible for incrementing and decrementing 16bit registers by 1, but can also output to PC, SP, and the AddressBus.
  • fetch - read opcode (or opcode operand if specified) while incrementing PC (target = [PC+])
  • generic fetch (usually mentioned as just "fetch") - fetch next opcode (IR = [PC+], except for HALT where it's IR = [PC]) and reset state to s000
  • IR - Instruction Register, contains the current opcode byte (in $CB mode it contains the byte after the $CB prefix)
  • state - a 3bit value (containing 8 possible states from s000 to s111) to determine where in the execution an instruction is (if it takes longer than a single cycle to execute)
  • value.number - take the numberth bit of value (for example [HL].3 means take bit3 of [HL])

Fetch and stuff

First of all, the fetch is always the last part of the instruction.
When the CPU starts, the Instruction Register already holds a NOP, so the first cycle is spent executing that NOP while fetching the first real instruction as the next instruction.
In fact, the Program Counter is always ahead (+1) of the currently executed instruction because of this.
More about this is explained in certain instructions where this matters.

Fetch is done by asserting PC to the IDU, asserting IDU to the AddressBus, setting the IDU to post-increment, and writing PC back from IDU out.
The /RD signal is asserted, and opcode is simply read from the data bus, and it finds its way into the Instruction Register (IR).
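
The fetch step above can be sketched as follows. This is a toy model, not a description of the actual circuitry: the `Cpu` class, its field names, and the flat 64 KiB memory array are all assumptions made for illustration.

```python
class Cpu:
    """Toy model of the fetch path; every name here is illustrative."""

    def __init__(self, memory, pc=0x0000):
        self.memory = memory  # flat 64 KiB address space (assumption)
        self.pc = pc
        self.ir = 0x00        # the CPU starts with a NOP already in IR

    def fetch(self):
        address = self.pc                 # PC -> IDU -> AddressBus
        self.pc = (address + 1) & 0xFFFF  # IDU post-increment, written back to PC
        self.ir = self.memory[address]    # /RD asserted, data bus -> IR


memory = bytearray(0x10000)
memory[0x0100] = 0x0C  # INC C
cpu = Cpu(memory, pc=0x0100)
cpu.fetch()
print(hex(cpu.ir), hex(cpu.pc))
```

After `fetch()`, IR holds the opcode and PC is already one past it, which is the "+1 ahead" property mentioned earlier.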

Regular instructions

Most instructions execute in a single cycle, as most of the parts of the system can run in parallel (so PC can increment, ALU can do input-work-output in a single M-cycle, etc.).
There is nothing special about most single-cycle instructions. If there is, they will be mentioned below.

Opcode holes (not implemented opcodes)

Unimplemented opcodes hang for a simple reason: they never fetch. Their decode ROM entries yield all zeroes, so the CPU stays stuck in state s000, which has neither a fetch nor any jump to a fetch.

Because they never fetch, they effectively wedge the CPU, including interrupts, and even NMI, as ISR and NMI servicing is done at fetch-time!

NOP and STOP

They are the same instruction, except STOP stops the clocks after the fetch if the WAKE line is not asserted. If WAKE is asserted when STOP is executed, STOP just behaves as NOP.

STOP has a quirk, though: if it actually stops, it appears to be a two-byte instruction. This is just a side effect of how it works:

Instruction memory:

...
$5FFF: 00 | NOP
$6000: 10 | STOP
$6001: 0C | INC C
$6002: 14 | INC D
...
| PC | Current instruction | Details |
| --- | --- | --- |
| $6000 | NOP | fetch STOP opcode |
| $6001 | STOP | fetch, stop clocks |
| $6002 | --- | STOP mode |
| $6002 | "NOP" | CPU waking up, fetch |
| $6003 | INC D | increment D, fetch ... |

However, that is not always true...

With IME == 0, there are four distinct possibilities:

| / | (IE & IF & $1F) == 0 | (IE & IF & $1F) != 0 |
| --- | --- | --- |
| WAKE=0 | STOP works, appears as 2B, lowest power use | STOP "works", appears as 1B, high power usage, zombie mode |
| WAKE=1 | STOP acts as HALT, appears as 2B | STOP acts as NOP, appears as 1B |
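
The four IME == 0 cases can be written as a small lookup. This is only a restatement of the table in code, intended for use in something like an emulator; the function name `stop_behaviour` and its return shape are hypothetical.

```python
def stop_behaviour(wake, ie, if_):
    """Classify STOP for IME == 0.

    Returns (description, apparent_length_in_bytes).
    """
    pending = (ie & if_ & 0x1F) != 0
    if not wake:
        if pending:
            return ("stops, zombie mode", 1)
        return ("stops", 2)
    if pending:
        return ("acts as NOP", 1)
    return ("acts as HALT", 2)


for wake in (False, True):
    for ie, if_ in ((0x00, 0x00), (0x01, 0x01)):
        print(wake, ie, if_, stop_behaviour(wake, ie, if_))
```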

However, with IME == 1 it's a lot more "fun" if we trigger an interrupt during wakeup.
There is an oversight, where the JOYPAD interrupt can happen during wakeup, glitching the CPU in the process.

If the CPU doesn't glitch out any further, then it simply just does an ISR to address 0x00 (bugged interrupt).
However more often than not the clock is still very unstable during the execution of the ISR, causing the Stack Pointer to glitch out by failing to decrement, trashing the call stack in the process, causing control flow to jump to some garbage address on return from the ISR.

HALT

HALT is kind of like STOP, except it keeps the main clock going, so interrupts can be serviced instantly, without having to wait for the clock to stabilize, while also saving power if there is no more work to do until next VBlank for example.

In HALT mode, the CPU is dormant, and only a few state tidbits are clocked.

HALT is the only instruction with a specific quirk: while all instructions fetch with IR = [PC+], HALT does IR = [PC], which can cause some funny behavior in certain cases.

Normally this quirk isn't an issue, if we follow its intended behavior:

Instruction memory:

...
$5FFC: AF | XOR A
$5FFD: 47 | LD B, A
$5FFE: 4F | LD C, A
$5FFF: 00 | NOP
$6000: 76 | HALT
$6001: 0C | INC C
$6002: 14 | INC D
...

Note: (IF & $1F) == 0 and (IE & $1F) != 0 and IME==1

| PC | Current instruction | Details |
| --- | --- | --- |
| $5FFF | LD C, A | fetch NOP opcode |
| $6000 | NOP | fetch HALT opcode |
| $6001 | HALT | dummy fetch (no increment), pause execution |
| $6001 | --- | HALT mode |
| $6001 | NOP | wakeup from HALT mode, dummy fetch (with increment) |
| $6002 | ISR | ISR happens |
| $6001 | ISR | ... |
| ... | ... | ... |
| $xxxx | RETI | ... |
| $6001 | RETI | fetch |
| $6002 | INC C | ... |

As we can see, while it's really janky, it works as you'd expect.
However, this falls apart as soon as you try to HALT with IME=0 and (IE & IF & $1F) != 0

Note: IME==0 and (IE & IF & $1F) != 0

| PC | Current instruction | Details |
| --- | --- | --- |
| $5FFF | LD C, A | fetch NOP opcode |
| $6000 | NOP | fetch HALT opcode |
| $6001 | HALT | generic fetch (no increment(!!!)) |
| $6001 | INC C | increment C, generic fetch |
| $6002 | INC C | nice double fetch! |
| $6003 | INC D | ... |

Whether this happens depends on two things: the value of IME, and whether the CPU actually enters HALT mode. The four combinations are:

| / | Enters HALT | Doesn't enter HALT |
| --- | --- | --- |
| IME=1 | Jump to vector after dummy fetch from wakeup | Jump to vector on fetch of HALT opcode |
| IME=0 | Fake NOP is executed on wakeup, causing a dummy fetch, mitigating the problem | Double fetch happens due to no PC increment! |
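
The IME=0 row can be reproduced with a few lines of simulation. This is a sketch under simplifying assumptions: wakeup is treated as immediate, instructions are not actually executed (only recorded), and the function name `run_after_halt` is made up for this example.

```python
INC_C, INC_D, NOP = 0x0C, 0x14, 0x00


def run_after_halt(memory, pc, pending_at_halt, count=3):
    """List the opcodes executed after a HALT, with IME == 0.

    `pc` is the value of PC while HALT executes (one past the HALT opcode).
    """
    ir = memory[pc]          # HALT's own fetch: IR = [PC], no increment
    if not pending_at_halt:
        # HALT mode is entered; on wakeup the fake NOP does a fetch
        # with increment, which hides the missing increment above.
        ir = memory[pc]
        pc = (pc + 1) & 0xFFFF
    executed = []
    for _ in range(count):
        executed.append(ir)  # "execute" the instruction in IR...
        ir = memory[pc]      # ...which ends with a generic fetch, IR = [PC+]
        pc = (pc + 1) & 0xFFFF
    return executed


memory = bytearray(0x10000)
memory[0x6001] = INC_C
memory[0x6002] = INC_D

print([hex(op) for op in run_after_halt(memory, 0x6001, pending_at_halt=True)])
print([hex(op) for op in run_after_halt(memory, 0x6001, pending_at_halt=False)])
```

With an interrupt already pending, INC C shows up twice before INC D; when HALT mode is entered first, each instruction shows up once.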

EI and DI

EI and DI are weird.

The reason EI is delayed by one cycle is that its state (used for checking if interrupt dispatch is allowed) is only latched the next M-cycle, delaying the check by 1 M-cycle.
The reason DI is instant is that there is extra circuitry(!) to check if the currently executed instruction is a DI, and to defer interrupt dispatch by one cycle (which ends up being a whole instruction).
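
One way to picture the difference is a latch that lags IME by one M-cycle, plus a special case for DI. The sketch below is a simplified interpretation of the two sentences above, not a transcription of the hardware: it assumes every opcode in the program is single-cycle and that an interrupt is pending the whole time, and `dispatch_points` is a hypothetical helper.

```python
EI, DI, NOP = 0xFB, 0xF3, 0x00


def dispatch_points(program, ime=False):
    """Indices of instructions whose closing fetch would dispatch an interrupt."""
    latched = ime  # the copy of IME that the dispatch check actually sees
    points = []
    for index, opcode in enumerate(program):
        if opcode == EI:
            ime = True
        elif opcode == DI:
            ime = False
        # The check at fetch time sees the value latched one M-cycle ago,
        # and the extra circuitry suppresses it while DI is executing.
        if latched and opcode != DI:
            points.append(index)
        latched = ime  # latched for the next M-cycle
    return points


print(dispatch_points([NOP, EI, NOP, NOP]))
print(dispatch_points([NOP, DI, NOP], ime=True))
```

In the first program the instruction following EI still runs before dispatch becomes possible; in the second, nothing can be dispatched from DI onwards.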

JP HL

First of all, it's important to remember that SHARP writes this instruction as JP (HL), which is both correct and incorrect at the same time, and it's rather fascinating why.

It takes a single cycle due to the clever way instruction fetching works (see Fetch and stuff).
If normal fetch is IR = [PC+], then JP HL is IR = [HL], PC = HL+ (happening in the same cycle).

This works by simply replacing the IDU input with HL instead of PC, but still writing IDU output back to PC. Very clever if you ask me.

As for why SHARP writes it as (HL) instead of HL can be explained by how instruction fetching is just LD IR, (PC+), if written with SHARP syntax.
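
Side by side, the two fetch variants differ only in which register feeds the IDU. A minimal sketch, with hypothetical function names and a flat memory array assumed:

```python
def generic_fetch(memory, pc):
    """IR = [PC+]: returns (ir, new_pc)."""
    return memory[pc], (pc + 1) & 0xFFFF


def jp_hl_fetch(memory, hl):
    """IR = [HL], PC = HL+: same path, but the IDU input is HL."""
    return memory[hl], (hl + 1) & 0xFFFF


memory = bytearray(0x10000)
memory[0x0150] = 0x3C  # INC A, at the normal fetch address
memory[0xC000] = 0x14  # INC D, at the jump target
print(generic_fetch(memory, 0x0150))
print(jp_hl_fetch(memory, 0xC000))
```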

JR e8

| Cycle | Details |
| --- | --- |
| M0 | WZ.low = [PC+] |
| M1 | do some ALU and IDU magic in a single cycle (e8 is added with WZ.low, then the flags are fed into IDU, along with e8.7) |
| M2 | generic fetch with existing IDU contents |
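
One plausible reading of the M1 "magic" is: add the low bytes in the 8-bit ALU, then let the carry and the sign bit e8.7 decide whether the high byte is incremented, decremented, or passed through. The sketch below implements that reading; it is an interpretation, and `jr_target` is a name invented for this example.

```python
def jr_target(pc, e8):
    """pc is PC after the operand fetch (M0); e8 is the raw operand byte."""
    low_sum = (pc & 0xFF) + e8          # 8-bit ALU add on the low byte
    carry = low_sum >> 8
    sign = (e8 >> 7) & 1                # e8.7
    if carry and not sign:
        adjust = 1                      # positive offset crossed a page upwards
    elif sign and not carry:
        adjust = -1                     # negative offset crossed a page downwards
    else:
        adjust = 0
    high = ((pc >> 8) + adjust) & 0xFF  # increment, decrement, or pass through
    return (high << 8) | (low_sum & 0xFF)


print(hex(jr_target(0x0102, 0xFE)))  # JR -2 from $0102
print(hex(jr_target(0x01FF, 0x01)))  # JR +1 from $01FF
```

The result matches a plain 16-bit add of the sign-extended offset, so the split only matters for understanding where the work is done.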

ADD SP, e8

There is an extra delay cycle compared to ADD HL, SP + e8 for moving from WZ to SP, as SP is a strictly 16bit register, and there are no 8bit paths into it (only from IDU output and WZ output).

| Cycle | Details |
| --- | --- |
| M0 | WZ.low = [PC+] |
| M1 | Z += SPL (?) |
| M2 | W += Cy (?) |
| M3 | SP = WZ, generic fetch |

ADD HL, SP + e8

This instruction (more commonly written LD HL, SP+e8) is shorter than ADD SP, e8, as L and H can be written "directly from the ALU".

| Cycle | Details |
| --- | --- |
| M0 | WZ.low = [PC+] |
| M1 | L = SPL + WZ.low (?) |
| M2 | H = SPH + Cy (?), generic fetch |

PUSH and POP discrepancy

PUSH and POP both use the IDU, and the IDU can only do post-increment and post-decrement (so, no pre-increment or pre-decrement). PUSH needs SP decremented before its first write, so it spends an extra delay cycle just on that.

PUSH

| Cycle | Details |
| --- | --- |
| M0 | IDU SP- |
| M1 | [SP-] = r16.high |
| M2 | [SP] = r16.low |
| M3 | generic fetch |

POP

| Cycle | Details |
| --- | --- |
| M0 | r16.low = [SP+] |
| M1 | r16.high = [SP+] |
| M2 | generic fetch |
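
The two cycle listings translate almost line by line into code. The sketch below assumes a flat memory array and uses hypothetical `push` and `pop` helpers that also return the M-cycle count, so the one-cycle difference is visible.

```python
def push(memory, sp, value):
    """Return (new_sp, m_cycles) after pushing a 16-bit value."""
    sp = (sp - 1) & 0xFFFF             # M0: IDU SP- (no memory access)
    memory[sp] = (value >> 8) & 0xFF   # M1: [SP-] = r16.high
    sp = (sp - 1) & 0xFFFF
    memory[sp] = value & 0xFF          # M2: [SP] = r16.low
    return sp, 4                       # M3: generic fetch


def pop(memory, sp):
    """Return (value, new_sp, m_cycles) after popping a 16-bit value."""
    low = memory[sp]                   # M0: r16.low = [SP+]
    sp = (sp + 1) & 0xFFFF
    high = memory[sp]                  # M1: r16.high = [SP+]
    sp = (sp + 1) & 0xFFFF
    return (high << 8) | low, sp, 3    # M2: generic fetch


memory = bytearray(0x10000)
sp, push_cycles = push(memory, 0xFFFE, 0xBEEF)
value, sp_after, pop_cycles = pop(memory, sp)
print(hex(sp), push_cycles, hex(value), hex(sp_after), pop_cycles)
```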

Calls

CALL a16

| Cycle | Details |
| --- | --- |
| M0 | WZ.low = [PC+] |
| M1 | WZ.high = [PC+] |
| M2 | IDU SP- |
| M3 | [SP-] = PC.high |
| M4 | [SP] = PC.low, PC = WZ |
| M5 | generic fetch |

RST $nn

| Cycle | Details |
| --- | --- |
| M0 | IDU SP- |
| M1 | [SP-] = PC.high |
| M2 | [SP] = PC.low, PC = IR & $38 |
| M3 | generic fetch |

ISR and NMI

Note the PC- in M0. PC already points past the currently executed opcode, and interrupt servicing happens after the next opcode has been fetched, so PC has to be decremented to point back at that fetched but not yet executed instruction.

| Cycle | Details |
| --- | --- |
| M0 | IDU PC- |
| M1 | IDU SP- |
| M2 | [SP-] = PC.high |
| M3 | [SP] = PC.low, PC = address of IRQ |
| M4 | generic fetch |

Eagle-eyed ones may have spotted something interesting about the ISR: PC is only written in M3.
It turns out that this little detail also works on real hardware, meaning that an interrupt could trigger the ISR, but by the time M3 is reached, a higher priority interrupt has also fired, overriding the address written to PC!

The interrupt priority is:

  • bugged interrupt (0x00) - only triggerable by triggering an interrupt during STOP wakeup
  • NMI (0x80)
  • IRQ0 (0x40)
  • IRQ1 (0x48)
  • ...
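
The late vector selection can be sketched like this. Only the regular IRQs are modelled (the bugged interrupt and NMI are left out), the case where nothing is pending any more at M3 is deliberately not modelled, and the names `irq_vector` and `service` are hypothetical.

```python
IRQ_VECTORS = (0x40, 0x48, 0x50, 0x58, 0x60)  # IRQ0 .. IRQ4


def irq_vector(ie, if_):
    """Vector of the highest-priority pending IRQ, or None if there is none."""
    pending = ie & if_ & 0x1F
    for bit, vector in enumerate(IRQ_VECTORS):
        if pending & (1 << bit):
            return vector
    return None  # nothing pending: not modelled in this sketch


def service(ie, if_by_cycle):
    """if_by_cycle[n] is the value of IF during cycle Mn of the ISR."""
    return irq_vector(ie, if_by_cycle[3])  # PC is only written in M3


# IRQ2 starts the ISR, but IRQ0 is raised before M3 is reached.
late_irq0 = [0x04, 0x04, 0x04, 0x05, 0x05]
print(hex(service(0x1F, late_irq0)))
```

Here the ISR was started by IRQ2, yet the address that ends up in PC is the one for IRQ0.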

$CB prefix

It's really simple: it just latches a flag in the decode ROM marking the next instruction as a $CB prefix bank instruction. Otherwise the instruction by itself behaves almost like a NOP.

The next instruction executed will be from the $CB bank portion of the decode ROM.

Because the $CB prefix lacks the generic fetch bit, it's non-interruptible by an ISR or NMI, and thus $CB prefix opcodes behave like two-byte opcodes.

Single-byte opcodes which are not single-cycle

LD with [r16] or [r16+-]

Because the CPU is memory-bound, and can only fetch one byte per M-cycle, fetch can only happen after the operation with [r] has completed.

Regular LD without memory addressing (LD r8, r8) takes a single cycle because fetch can happen in parallel with 8bit register moving logistics.

The post-increment or post-decrement is done for free by the IDU, as the same circuitry is used for fetching ([PC+]), and PUSH/POP (SP- or SP+).

LD with [C]

Similar to LD with [r16], due to memory-bound reasons it takes a cycle to do the memory access, then the next cycle is a fetch.

INC/DEC r16

The ALU is actually not used at all for this one! This is just IDU magic.
16bit register is output to IDU, set to either increment or decrement, and a writeback is issued.
Because fetch also uses the IDU to post-increment PC, the aforementioned use of the IDU causes a cycle penalty, and so the instruction takes two cycles to execute, as only one IDU operation can execute per M-cycle.

ADD HL, r16

Because the ALU is 8bit, it needs two 8bit adds to add the two 16bit numbers together.
First cycle is low 8bit add, 2nd cycle is high 8bit add with fetch happening in parallel.
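
As a worked example, the two 8-bit adds look like this. Flags are not covered in this section and are left out of the sketch; `add_hl_r16` is a name chosen for illustration.

```python
def add_hl_r16(hl, r16):
    """16-bit add done as two 8-bit adds with a carry in between."""
    low_sum = (hl & 0xFF) + (r16 & 0xFF)       # first cycle: low bytes
    carry = low_sum >> 8
    high_sum = (hl >> 8) + (r16 >> 8) + carry  # second cycle: high bytes + carry
    return ((high_sum & 0xFF) << 8) | (low_sum & 0xFF)


print(hex(add_hl_r16(0x12FF, 0x0001)))
```

For $12FF + $0001, the low add gives $00 with carry set, and the high add gives $12 + $00 + 1 = $13, so HL ends up as $1300.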

RET and RETI

| Cycle | Details |
| --- | --- |
| M0 | s000 WZ.low = [SP+] |
| M1 | s010 WZ.high = [SP+] |
| M2 | s011 PC = WZ, plus do EI if executing RETI |
| M3 | s111 generic fetch |

Because EI is done one cycle before the generic fetch, interrupts are already enabled by the time the next instruction is fetched. This means that if multiple interrupts need servicing, the same return address is pushed to the stack for every serviced interrupt until there are no interrupts left to service.

RET cc

RET with a condition code is interesting, as for some reason it has an extra cycle.

Actually, the reason is how condition checking works.
Condition checking is done by a flag in the decode ROM. When this flag is set, in the cycle the flag was encountered, the flags are inspected, and in the next cycle it's determined what to do.
If condition checking is true, the instruction executes as normal.
If condition checking is false, then the next cycle to execute will be s111, which is generic fetch, effectively a NOP.

The funny thing is that, for example, CALL cc, a16 does not incur this penalty cycle, as this flag is set when fetching the high byte of the address, so next cycle it can either behave as a NOP, or continue executing the CALL.
But because RET cc is the only single-byte instruction with condition checking, this penalty happens, as the very first cycle it does nothing, but just check the condition, so RET cc will always execute one cycle longer than a regular RET.

| Cycle | Details |
| --- | --- |
| M0 | condition check, s001 if true, s111 if false |
| M1 | s111 generic fetch |
| M1 | s001 WZ.low = [SP+] |
| M2 | WZ.high = [SP+] |
| M3 | PC = WZ |
| M4 | generic fetch |
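
The two paths through RET cc can be sketched as below, mainly to make the cycle counts explicit. The helper `ret_cc` is hypothetical; it returns the PC that the closing generic fetch will read from, the new SP, and the number of M-cycles.

```python
def ret_cc(memory, pc, sp, condition):
    """Return (pc, sp, m_cycles) for RET cc."""
    cycles = 1                      # M0: condition check only
    if condition:
        low = memory[sp]            # M1 (s001): WZ.low = [SP+]
        sp = (sp + 1) & 0xFFFF
        high = memory[sp]           # M2: WZ.high = [SP+]
        sp = (sp + 1) & 0xFFFF
        pc = (high << 8) | low      # M3: PC = WZ
        cycles += 3
    cycles += 1                     # s111: generic fetch
    return pc, sp, cycles


memory = bytearray(0x10000)
memory[0xFFFC] = 0x34  # return address low byte
memory[0xFFFD] = 0x12  # return address high byte
print(ret_cc(memory, 0x0201, 0xFFFC, True))
print(ret_cc(memory, 0x0201, 0xFFFC, False))
```

A taken RET cc costs five M-cycles, one more than the four of a plain RET, while a not-taken one costs two.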

RST $nn (opcode)

It's just a CALL, with the target address being IR & $38, and the rest of the bits routed to 0.

| Cycle | Details |
| --- | --- |
| M0 | IDU SP- |
| M1 | [SP-] = PC.high |
| M2 | [SP] = PC.low, PC = (IR & $38) |
| M3 | generic fetch |
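
As a quick worked example of the IR & $38 masking, here are the eight RST opcodes and the addresses they yield. `rst_target` is just an illustrative name for the mask.

```python
def rst_target(opcode):
    """Target address of an RST opcode: bits 3..5 of IR, everything else 0."""
    return opcode & 0x38


RST_OPCODES = (0xC7, 0xCF, 0xD7, 0xDF, 0xE7, 0xEF, 0xF7, 0xFF)
for opcode in RST_OPCODES:
    print(f"${opcode:02X} -> ${rst_target(opcode):04X}")
```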