SonoSooS/DMGCPU.MD

## DMGCPU.MD

      
This document describes certain inner workings of the CPU.
Thanks to efforts by emu-russia and gekkio to have the CPU decapped, the decode ROM is accessible to us. This means that we know exactly what each opcode does (apart from some nuanced behavior related to HALT, STOP, and some state management related to interrupts and power saving, which is hard to untangle).
Table of contents


- Terminology
- Fetch and stuff
- Regular instructions
- Opcode holes (not implemented opcodes)
- NOP and STOP
- HALT
- EI and DI
- JP HL
- JR e8
- ADD SP, e8
- ADD HL, SP + e8
- PUSH and POP discrepancy
  - PUSH
  - POP
- Calls
  - CALL a16
  - RST $nn
  - ISR and NMI
- $CB prefix
- Single-byte opcodes which are not single-cycle
  - LD with [r16] or [r16+-]
  - LD with [C]
  - INC/DEC r16
  - ADD HL, r16
  - RET and RETI
  - RET cc
  - RST $nn (opcode)

Terminology


IDU - Increment/Decrement Unit, responsible for incrementing and decrementing 16bit registers by 1, but can also output to PC, SP, and the AddressBus.
fetch - read opcode (or opcode operand if specified) while incrementing PC (target = [PC+])
generic fetch (usually mentioned as just "fetch") - fetch next opcode (IR = [PC+], except for HALT where it's IR = [PC]) and reset state to s000
IR - Instruction Register, contains the current opcode byte (in $CB mode it contains the byte after the $CB prefix)
state - a 3bit value (containing 8 possible states from s000 to s111) to determine where in the execution an instruction is (if it takes longer than a single cycle to execute)
value.number - take the numberth bit of value (for example [HL].3 means take bit3 of [HL])

Fetch and stuff

First of all, the fetch is always the last part of the instruction.

When the CPU starts, it already has a NOP in the Instruction Register, and so it takes a single cycle to start executing the first instruction, due to fetching that first instruction as the next instruction.

In fact, the Program Counter is always ahead (+1) of the currently executed instruction because of this.

More about this is explained in certain instructions where this matters.
Fetch is done by asserting PC to the IDU, asserting IDU to the AddressBus, setting the IDU to post-increment, and writing PC back from IDU out.

The /RD signal is asserted, and opcode is simply read from the data bus, and it finds its way into the Instruction Register (IR).
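
The fetch described above can be sketched as a toy model. This is only an illustration under stated assumptions: the `Cpu` class, its field names, and the flat 64 KiB `bytearray` memory are hypothetical and not taken from the decode ROM.

```python
class Cpu:
    """Toy model holding only PC, IR and a flat memory (all names hypothetical)."""

    def __init__(self, memory, pc=0):
        self.mem = memory
        self.pc = pc
        self.ir = 0x00  # the CPU starts with a NOP already in IR

    def idu_post_increment(self, value):
        # The IDU puts the unmodified value on the address bus
        # and outputs value + 1 for the write-back.
        return value, (value + 1) & 0xFFFF

    def generic_fetch(self):
        # PC -> IDU -> address bus, IDU output written back to PC.
        address, self.pc = self.idu_post_increment(self.pc)
        # /RD asserted: the byte on the data bus ends up in IR.
        self.ir = self.mem[address]


mem = bytearray(0x10000)
mem[0x0100] = 0x3C  # INC A
cpu = Cpu(mem, pc=0x0100)
cpu.generic_fetch()  # this is the fetch belonging to the initial NOP
```

After that single cycle, IR holds the first real opcode and PC already points one byte past it, which is the "PC is always ahead" property mentioned above.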
Regular instructions

Most instructions execute in a single cycle, as most of the parts of the system can run in parallel (so PC can increment, ALU can do input-work-output in a single M-cycle, etc.).

There is nothing special about most single-cycle instructions. If there is, they will be mentioned below.
Opcode holes (not implemented opcodes)

Why non-implemented opcodes hang is simple: they never fetch. Their decode ROM output is all zeroes, so the CPU stays stuck in state s000, which contains neither a fetch nor any jump to a fetch.
Because they never fetch, they effectively wedge the CPU, including interrupts, and even NMI, as ISR and NMI servicing is done at fetch-time!
NOP and STOP

They are the same instruction, except STOP stops the clocks after the fetch if the WAKE line is not asserted. If WAKE is asserted when STOP is executed, STOP just behaves as NOP.
STOP has a quirk, though: if it stops, it appears as if it was a two-byte instruction. This is just a side-effect of how it works:
Instruction memory:
...
$5FFF: 00 | NOP
$6000: 10 | STOP
$6001: 0C | INC C
$6002: 14 | INC D
...


| PC | Current instruction | Details |
|---|---|---|
| $6000 | NOP | fetch STOP opcode |
| $6001 | STOP | fetch, stop clocks |
| $6002 | --- | STOP mode |
| $6002 | "NOP" | CPU waking up, fetch |
| $6003 | INC D | increment D, fetch ... |

However, that is not always true...
With IME == 0, there are four distinct possibilities:


| / | (IE & IF & $1F) == 0 | (IE & IF & $1F) != 0 |
|---|---|---|
| WAKE=0 | STOP works, appears as 2B, lowest power use | STOP "works", appears as 1B, high power usage, zombie mode |
| WAKE=1 | STOP acts as HALT, appears as 2B | STOP acts as NOP, appears as 1B |

However, with IME == 1 it's a lot more "fun" if we trigger an interrupt during wakeup.

There is an oversight, where the JOYPAD interrupt can happen during wakeup, glitching the CPU in the process.
If the CPU doesn't glitch out any further, then it simply does an ISR to address 0x00 (bugged interrupt).

However, more often than not the clock is still very unstable during the execution of the ISR, causing the Stack Pointer to glitch out by failing to decrement. This trashes the call stack, so control flow jumps to some garbage address on return from the ISR.
HALT

HALT is kind of like STOP, except it keeps the main clock going, so interrupts can be serviced instantly, without having to wait for the clock to stabilize, while also saving power if there is no more work to do until next VBlank for example.
In HALT mode, the CPU is dormant, and only a few state tidbits are clocked.
HALT has a quirk no other instruction has: while all other instructions fetch with IR = [PC+], HALT does IR = [PC], which can cause some funny behavior in certain cases.
Normally this quirk isn't an issue, if we follow its intended behavior:
Instruction memory:
...
$5FFC: AF | XOR A
$5FFD: 47 | LD B, A
$5FFE: 4F | LD C, A
$5FFF: 00 | NOP
$6000: 76 | HALT
$6001: 0C | INC C
$6002: 14 | INC D
...


Note: (IF & $1F) == 0 and (IE & $1F) != 0 and IME==1


| PC | Current instruction | Details |
|---|---|---|
| $5FFF | LD C, A | fetch NOP opcode |
| $6000 | NOP | fetch HALT opcode |
| $6001 | HALT | dummy fetch (no increment), pause execution |
| $6001 | --- | HALT mode |
| $6001 | NOP | wakeup from HALT mode, dummy fetch (with increment) |
| $6002 | ISR | ISR happens |
| $6001 | ISR | ... |
| ... | ... | ... |
| $xxxx | RETI | ... |
| $6001 | RETI | fetch |
| $6002 | INC C | ... |

As we can see, while it's really janky, it works as you'd expect.

However, this falls apart as soon as you try to HALT with IME=0 and (IE & IF & $1F) != 0

Note: IME==0 and (IE & IF & $1F) != 0


| PC | Current instruction | Details |
|---|---|---|
| $5FFF | LD C, A | fetch NOP opcode |
| $6000 | NOP | fetch HALT opcode |
| $6001 | HALT | generic fetch (no increment(!!!)) |
| $6001 | INC C | increment C, generic fetch |
| $6002 | INC C | nice double fetch! |
| $6003 | INC D | ... |


Whether this happens depends on IME and on whether the CPU actually enters HALT mode, as this table of combinations shows:


| / | Enters HALT | Doesn't enter HALT |
|---|---|---|
| IME=1 | Jump to vector after dummy fetch from wakeup | Jump to vector on fetch of HALT opcode |
| IME=0 | Fake NOP is executed on wakeup, causing a dummy fetch, mitigating the problem | Double fetch happens due to no PC increment! |

EI and DI

EI and DI are weird.
The reason EI is delayed by one cycle is that its state (used for checking whether interrupt dispatch is allowed) is only latched in the next M-cycle, delaying the check by 1 M-cycle.

The reason DI is instant is that there is extra circuitry(!) to check whether the currently executed instruction is a DI, and to defer interrupt dispatch by one cycle (which ends up being a whole instruction).
JP HL

First of all, it's important to remember that SHARP writes this instruction as JP (HL), which is both correct and incorrect at the same time, and it's rather fascinating why.
It takes a single cycle due to the clever way instruction fetching works (see Fetch and stuff).

If normal fetch is IR = [PC+], then JP HL is IR = [HL], PC = HL+ (happening in the same cycle).
This works by simply replacing the IDU input with HL instead of PC, but still writing IDU output back to PC. Very clever if you ask me.
Why SHARP writes it as (HL) instead of HL can be explained by how instruction fetching is just LD IR, (PC+), if written with SHARP syntax.
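
A minimal sketch of that idea, assuming a hypothetical helper `fetch_via_idu` that takes whichever 16bit register is routed into the IDU:

```python
def fetch_via_idu(mem, idu_input):
    """Return (IR, PC) after one fetch cycle.

    The IDU input drives the address bus, and the IDU output
    (input + 1) is always written back to PC.
    """
    ir = mem[idu_input]
    pc = (idu_input + 1) & 0xFFFF
    return ir, pc


mem = bytearray(0x10000)
mem[0x0150] = 0x00  # NOP at the address following the current instruction
mem[0xC000] = 0x3C  # INC A at the address held in HL

# Normal fetch: IDU input is PC, so IR = [PC+].
normal = fetch_via_idu(mem, 0x0150)

# JP HL: IDU input is HL instead, so IR = [HL], PC = HL+.
jump = fetch_via_idu(mem, 0xC000)
```

The same function serves both cases, which is why JP HL needs no extra cycle: the jump and the fetch of the target opcode are one and the same operation.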
JR e8


| Cycle | Details |
|---|---|
| M0 | WZ.low = [PC+] |
| M1 | do some ALU and IDU magic in a single cycle (PC.low is added to WZ.low, which holds e8; then the flags are fed into IDU, along with e8.7) |
| M2 | generic fetch with existing IDU contents |
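
One way to picture the M1 step is an 8bit add on the low byte, followed by a +1/0/-1 adjustment of the high byte derived from the carry and the sign bit e8.7. The sketch below is an assumption about how those two signals combine, not a transcription of the decode ROM; `jr_target` is a hypothetical name.

```python
def jr_target(pc, e8):
    """Compute the JR destination.

    pc is the value of PC after the operand was fetched,
    e8 is the raw operand byte (0..255).
    """
    low_sum = (pc & 0xFF) + e8        # 8bit ALU add
    carry = low_sum > 0xFF            # carry flag out of the ALU
    negative = bool(e8 & 0x80)        # e8.7, the sign bit

    if carry and not negative:
        adjust = 1                    # crossed a page boundary forwards
    elif negative and not carry:
        adjust = -1                   # crossed a page boundary backwards
    else:
        adjust = 0                    # stayed within the same page

    high = ((pc >> 8) + adjust) & 0xFF
    return (high << 8) | (low_sum & 0xFF)


# JR -2 located at $0100: PC is $0102 after the operand fetch.
loop_target = jr_target(0x0102, 0xFE)
```

The result is the same as adding the sign-extended operand to the full 16bit PC, but it only needs one 8bit add plus an increment or decrement of the high byte.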


ADD SP, e8

There is an extra delay cycle compared to ADD HL, SP + e8, needed to move the result from WZ to SP: SP is a strictly 16bit register with no 8bit paths into it (it can only be written from the IDU output or the WZ output).


| Cycle | Details |
|---|---|
| M0 | WZ.low = [PC+] |
| M1 | Z += SPL (?) |
| M2 | W += Cy (?) |
| M3 | SP = WZ, generic fetch |


ADD HL, SP + e8

This is shorter than ADD SP, e8, as L and H can be written "directly from the ALU". (This instruction is more commonly written as LD HL, SP+e8.)


| Cycle | Details |
|---|---|
| M0 | WZ.low = [PC+] |
| M1 | L = SPL + WZ.low (?) |
| M2 | H = SPH + Cy (?), generic fetch |
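
The two 8bit steps can be sketched as follows. The table above marks the exact operations with "(?)", so this is only an assumption: in particular, the sign extension of e8 in the high-byte step ($FF for a negative operand, $00 otherwise) is filled in here because it is needed for the arithmetic to come out right. The function name `add_sp_e8` is hypothetical.

```python
def add_sp_e8(sp, e8):
    """Return (result, half_carry, carry) for SP + signed e8.

    e8 is the raw operand byte (0..255).
    """
    # First ALU cycle: low byte, this is where the H and C flags come from.
    low_sum = (sp & 0xFF) + e8
    carry = low_sum > 0xFF
    half_carry = (sp & 0x0F) + (e8 & 0x0F) > 0x0F

    # Second ALU cycle: high byte plus carry plus sign extension (assumed).
    sign_extension = 0xFF if e8 & 0x80 else 0x00
    high = ((sp >> 8) + sign_extension + carry) & 0xFF

    return (high << 8) | (low_sum & 0xFF), half_carry, carry


wrap_up = add_sp_e8(0xFFF8, 0x08)    # SP + 8 wraps around to $0000
wrap_down = add_sp_e8(0x0000, 0xFF)  # SP - 1 wraps around to $FFFF
```

For ADD SP, e8 the result would land in WZ first and need the extra cycle to reach SP, while for the HL variant the two bytes can go straight into L and H.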


PUSH and POP discrepancy

PUSH and POP both use the IDU, and the IDU can only do post-increment and post-decrement (so, no pre-increment or pre-decrement). This is why PUSH has an extra delay cycle: SP must be decremented once before the first write.
PUSH


| Cycle | Details |
|---|---|
| M0 | IDU SP- |
| M1 | [SP-] = r16.high |
| M2 | [SP] = r16.low |
| M3 | generic fetch |


POP


| Cycle | Details |
|---|---|
| M0 | r16.low = [SP+] |
| M1 | r16.high = [SP+] |
| M2 | generic fetch |
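
The asymmetry between the two tables can be sketched by following the cycles literally. The functions `push` and `pop` below are hypothetical helpers that return the cycle count alongside the result; memory is a flat `bytearray`.

```python
def push(mem, sp, value):
    """Follow the PUSH table; returns (new SP, M-cycles used)."""
    sp = (sp - 1) & 0xFFFF      # M0: IDU SP- (no memory access yet)
    mem[sp] = value >> 8        # M1: [SP-] = r16.high
    sp = (sp - 1) & 0xFFFF      #     ...post-decrement by the IDU
    mem[sp] = value & 0xFF      # M2: [SP] = r16.low
    return sp, 4                # M3: generic fetch


def pop(mem, sp):
    """Follow the POP table; returns (new SP, value, M-cycles used)."""
    low = mem[sp]               # M0: r16.low = [SP+]
    sp = (sp + 1) & 0xFFFF
    high = mem[sp]              # M1: r16.high = [SP+]
    sp = (sp + 1) & 0xFFFF
    return sp, (high << 8) | low, 3  # M2: generic fetch


mem = bytearray(0x10000)
sp_after_push, push_cycles = push(mem, 0xFFFE, 0x1234)
sp_after_pop, popped, pop_cycles = pop(mem, sp_after_push)
```

POP can start reading immediately because post-increment is exactly what it needs, whereas PUSH has to spend M0 just getting SP to the first address it writes.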


Calls

CALL a16


| Cycle | Details |
|---|---|
| M0 | WZ.low = [PC+] |
| M1 | WZ.high = [PC+] |
| M2 | IDU SP- |
| M3 | [SP-] = PC.high |
| M4 | [SP] = PC.low, PC = WZ |
| M5 | generic fetch |


RST $nn


| Cycle | Details |
|---|---|
| M0 | IDU SP- |
| M1 | [SP-] = PC.high |
| M2 | [SP] = PC.low, PC = IR & $38 |
| M3 | generic fetch |


ISR and NMI

Note the PC-. This is because PC is after the currently executed opcode, and interrupt servicing happens after fetching the next opcode, so PC has to be adjusted to point to the next executed instruction.


| Cycle | Details |
|---|---|
| M0 | IDU PC- |
| M1 | IDU SP- |
| M2 | [SP-] = PC.high |
| M3 | [SP] = PC.low, PC = address of IRQ |
| M4 | generic fetch |


Eagle-eyed ones may have spotted something interesting about the ISR: PC is only written in M3.

It turns out that this little detail is also observable on real hardware: an interrupt can trigger the ISR, but if a higher priority interrupt has also fired by the time M3 is reached, it overrides the address written to PC!
The interrupt priority is:

bugged interrupt (0x00) - only triggerable by triggering an interrupt during STOP wakeup
NMI (0x80)
IRQ0 (0x40)
IRQ1 (0x48)
...
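
The late vector selection can be sketched like this. It is a simplified model under stated assumptions: only the five regular IRQ lines are covered (no NMI, no bugged interrupt), vectors are assumed to continue in steps of 8 from IRQ0 at 0x40, and the function names are hypothetical.

```python
def irq_vector(ie, if_):
    """Return the vector of the highest-priority pending IRQ, or None."""
    pending = ie & if_ & 0x1F
    for bit in range(5):            # IRQ0 has the highest priority
        if pending & (1 << bit):
            return 0x40 + 8 * bit   # IRQ0 -> 0x40, IRQ1 -> 0x48, ...
    return None                     # nothing pending: not modelled here


def dispatch_target(ie, if_per_cycle):
    """if_per_cycle holds the value of IF during M0, M1, M2 and M3.

    PC is only written in M3, so only the IF value seen in M3 matters.
    """
    return irq_vector(ie, if_per_cycle[3])


# IRQ2 triggers the dispatch, but IRQ0 fires while PC is being pushed.
hijacked = dispatch_target(0x1F, [0x04, 0x04, 0x05, 0x05])
```

In this example the dispatch was started by IRQ2, yet execution ends up at the IRQ0 vector because IRQ0 became pending before M3.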

$CB prefix

It's really simple: it just latches a flag in the decode ROM that the next instruction is a $CB prefix bank instruction; otherwise the instruction by itself behaves almost like a NOP.
The next instruction executed will be from the $CB bank portion of the decode ROM.
Because the $CB prefix lacks the generic fetch bit, it's non-interruptible by an ISR or NMI, and thus $CB prefix opcodes behave like two-byte opcodes.
Single-byte opcodes which are not single-cycle

LD with [r16] or [r16+-]

Because the CPU is memory-bound, and can only access one byte per M-cycle, fetch can only happen after the operation with [r16] has completed.
Regular LD without memory addressing (LD r8, r8) takes a single cycle because fetch can happen in parallel with 8bit register moving logistics.
The post-increment or post-decrement is done for free by the IDU, as the same circuitry is used for fetching ([PC+]), and PUSH/POP (SP- or SP+).
LD with [C]

Similar to LD with [r16], due to memory-bound reasons it takes a cycle to do the memory access, then the next cycle is a fetch.
INC/DEC r16

The ALU is actually not used at all for this one! This is just IDU magic.

The 16bit register is output to the IDU, which is set to either increment or decrement, and a writeback is issued.

Because fetch also uses the IDU to post-increment PC, the aforementioned use of the IDU causes a cycle penalty: only one IDU operation can execute per M-cycle, so the instruction takes two cycles to execute.
ADD HL, r16

Because the ALU is 8bit, it needs two 8bit adds to add the two 16bit numbers together.

First cycle is low 8bit add, 2nd cycle is high 8bit add with fetch happening in parallel.
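
A minimal sketch of the two-step add, assuming the usual flag behavior for this instruction (H is the carry out of bit 11, C is the carry out of bit 15). The name `add_hl` is hypothetical.

```python
def add_hl(hl, r16):
    """Return (result, half_carry, carry) using two 8bit adds."""
    # First cycle: low bytes.
    low_sum = (hl & 0xFF) + (r16 & 0xFF)
    carry_low = low_sum >> 8  # 0 or 1, fed into the second add

    # Second cycle: high bytes plus the carry from the first add.
    high_sum = (hl >> 8) + (r16 >> 8) + carry_low
    half_carry = ((hl >> 8) & 0x0F) + ((r16 >> 8) & 0x0F) + carry_low > 0x0F
    carry = high_sum > 0xFF

    return ((high_sum & 0xFF) << 8) | (low_sum & 0xFF), half_carry, carry


bit11_carry = add_hl(0x0FFF, 0x0001)
full_wrap = add_hl(0xFFFF, 0x0001)
```

The flags that survive are the ones from the second add, which is why they describe bits 11 and 15 of the 16bit result rather than bits 3 and 7.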
RET and RETI


| Cycle | Details |
|---|---|
| M0 s000 | WZ.low = [SP+] |
| M1 s010 | WZ.high = [SP+] |
| M2 s011 | PC = WZ, plus do EI if executing RETI |
| M3 s111 | generic fetch |


Because EI is done one cycle before generic fetch, interrupts are already enabled by the time the next instruction is fetched. This means that if multiple interrupts need servicing, the same return address is pushed to the stack for each of them, until there are no interrupts left to service.
RET cc

RET with a condition code is interesting, as for some reason it has an extra cycle.
Actually, the reason is how condition checking works.

Condition checking is done by a flag in the decode ROM. When this flag is set, in the cycle the flag was encountered, the flags are inspected, and in the next cycle it's determined what to do.

If condition checking is true, the instruction executes as normal.

If condition checking is false, then the next cycle to execute will be s111, which is generic fetch, effectively a NOP.
The funny thing is that, for example, CALL cc, a16 does not incur this penalty cycle: its flag is set when fetching the high byte of the address, so in the next cycle it can either behave as a NOP or continue executing the CALL.

But because RET cc is the only single-byte instruction with condition checking, this penalty happens: in the very first cycle it does nothing but check the condition, so a taken RET cc always executes one cycle longer than a regular RET.


| Cycle | Details |
|---|---|
| M0 | condition check, s001 if true, s111 if false |
| M1 s111 | generic fetch |
| M1 s001 | WZ.low = [SP+] |
| M2 | WZ.high = [SP+] |
| M3 | PC = WZ |
| M4 | generic fetch |
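
The two paths through the table can be written out as cycle lists to make the timing explicit. This is just a restatement of the tables in code form; the names `RET_CYCLES` and `ret_cc_cycles` are hypothetical.

```python
# Unconditional RET, as listed in the RET and RETI table.
RET_CYCLES = [
    "WZ.low = [SP+]",
    "WZ.high = [SP+]",
    "PC = WZ",
    "generic fetch",
]


def ret_cc_cycles(condition_true):
    """Return the list of M-cycles RET cc goes through."""
    cycles = ["condition check"]          # M0 does nothing else
    if condition_true:
        cycles += RET_CYCLES              # continue at s001 like a RET
    else:
        cycles += ["generic fetch"]       # jump to s111, effectively a NOP
    return cycles


taken = len(ret_cc_cycles(True))
not_taken = len(ret_cc_cycles(False))
```

So a taken RET cc costs one M-cycle more than a plain RET, while a RET cc that is not taken finishes after the condition check and a fetch.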


RST $nn (opcode)

It's just a CALL, with the target address being IR & $38, and the rest of the bits routed to 0.


| Cycle | Details |
|---|---|
| M0 | IDU SP- |
| M1 | [SP-] = PC.high |
| M2 | [SP] = PC.low, PC = (IR & $38) |
| M3 | generic fetch |
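
As a quick illustration of the IR & $38 masking, the snippet below applies it to the eight RST opcodes ($C7, $CF, ... $FF). The helper name `rst_target` is hypothetical.

```python
def rst_target(ir):
    """Target address of RST: bits 3..5 of the opcode, everything else 0."""
    return ir & 0x38


# The RST opcodes are spaced 8 apart, from $C7 to $FF.
rst_opcodes = list(range(0xC7, 0x100, 8))
targets = [rst_target(opcode) for opcode in rst_opcodes]
```

No lookup table is needed: the target address is already encoded in the opcode byte that sits in IR.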